Compaq 21264 Users Manual


2015-02-03


Page Count: 356

Compaq Computer Corporation
Shrewsbury, Massachusetts
Alpha 21264/EV67
Microprocessor Hardware
Reference Manual
Order Number: DS–0028B–TE
This manual is directly derived from the internal 21264/EV67 Specifications, Revision 1.4.
You can access this hardware reference manual in PDF format from the
following site:
ftp://ftp.compaq.com/pub/products/alphaCPUdocs
Revision/Update Information: This is a revised document. It supersedes
the Alpha 21264A Microprocessor
Hardware Reference Manual
(DS–0028A–TE).
September 2000
The information in this publication is subject to change without notice.
COMPAQ COMPUTER CORPORATION SHALL NOT BE LIABLE FOR TECHNICAL OR EDITORIAL
ERRORS OR OMISSIONS CONTAINED HEREIN, NOR FOR INCIDENTAL OR CONSEQUENTIAL DAMAGES
RESULTING FROM THE FURNISHING, PERFORMANCE, OR USE OF THIS MATERIAL. THIS
INFORMATION IS PROVIDED “AS IS” AND COMPAQ COMPUTER CORPORATION DISCLAIMS ANY
WARRANTIES, EXPRESS, IMPLIED OR STATUTORY AND EXPRESSLY DISCLAIMS THE IMPLIED WARRANTIES
OF MERCHANTABILITY, FITNESS FOR PARTICULAR PURPOSE, GOOD TITLE AND AGAINST
INFRINGEMENT.
This publication contains information protected by copyright. No part of this publication may be photocopied or
reproduced in any form without prior written consent from Compaq Computer Corporation.
© Compaq Computer Corporation 2000.
All rights reserved. Printed in the U.S.A.
Alpha 21264/EV67 Hardware Reference Manual
COMPAQ, the Compaq logo, the Digital logo, and VAX are registered in the United States Patent and Trademark Office.
Pentium is a registered trademark of Intel Corporation.
Other product names mentioned herein may be trademarks and/or registered trademarks of their respective companies.
Table of Contents
Preface
1 Introduction
1.1 The Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1–1
1.1.1 Addressing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1–2
1.1.2 Integer Data Types. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1–2
1.1.3 Floating-Point Data Types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1–2
1.2 21264/EV67 Microprocessor Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1–3
2 Internal Architecture
2.1 21264/EV67 Microarchitecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–1
2.1.1 Instruction Fetch, Issue, and Retire Unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–2
2.1.1.1 Virtual Program Counter Logic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–2
2.1.1.2 Branch Predictor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–3
2.1.1.3 Instruction-Stream Translation Buffer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–5
2.1.1.4 Instruction Fetch Logic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–6
2.1.1.5 Register Rename Maps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–6
2.1.1.6 Integer Issue Queue . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–6
2.1.1.7 Floating-Point Issue Queue . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–7
2.1.1.8 Exception and Interrupt Logic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–8
2.1.1.9 Retire Logic. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–8
2.1.2 Integer Execution Unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–8
2.1.3 Floating-Point Execution Unit. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–10
2.1.4 External Cache and System Interface Unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–11
2.1.4.1 Victim Address File and Victim Data File . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–11
2.1.4.2 I/O Write Buffer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–11
2.1.4.3 Probe Queue. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–11
2.1.4.4 Duplicate Dcache Tag Array . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–11
2.1.5 Onchip Caches. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–11
2.1.5.1 Instruction Cache . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–11
2.1.5.2 Data Cache. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–12
2.1.6 Memory Reference Unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–12
2.1.6.1 Load Queue . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–13
2.1.6.2 Store Queue . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–13
2.1.6.3 Miss Address File . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–13
2.1.6.4 Dstream Translation Buffer. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–13
2.1.7 SROM Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–13
2.2 Pipeline Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–13
2.2.1 Pipeline Aborts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–16
2.3 Instruction Issue Rules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–16
2.3.1 Instruction Group Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–17
2.3.2 Ebox Slotting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–18
2.3.3 Instruction Latencies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–20
2.4 Instruction Retire Rules. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–21
2.4.1 Floating-Point Divide/Square Root Early Retire . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–22
2.5 Retire of Operate Instructions into R31/F31 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–22
2.6 Load Instructions to R31 and F31 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–23
2.6.1 Normal Prefetch: LDBU, LDF, LDG, LDL, LDT, LDWU, HW_LDL Instructions . . . . . . . 2–23
2.6.2 Prefetch with Modify Intent: LDS Instruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–23
2.6.3 Prefetch, Evict Next: LDQ and HW_LDQ Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . 2–24
2.6.4 Prefetch with the LDx_L / STx_C Instruction Sequence . . . . . . . . . . . . . . . . . . . . . . . . 2–24
2.7 Special Cases of Alpha Instruction Execution. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–24
2.7.1 Load Hit Speculation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–24
2.7.2 Floating-Point Store Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–26
2.7.3 CMOV Instruction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–26
2.8 Memory and I/O Address Space Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–27
2.8.1 Memory Address Space Load Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–27
2.8.2 I/O Address Space Load Instructions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–28
2.8.3 Memory Address Space Store Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–29
2.8.4 I/O Address Space Store Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–29
2.9 MAF Memory Address Space Merging Rules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–30
2.10 Instruction Ordering. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–30
2.11 Replay Traps. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–31
2.11.1 Mbox Order Traps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–31
2.11.1.1 Load-Load Order Trap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–32
2.11.1.2 Store-Load Order Trap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–32
2.11.2 Other Mbox Replay Traps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–32
2.12 I/O Write Buffer and the WMB Instruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–32
2.12.1 Memory Barrier (MB/WMB/TB Fill Flow) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–32
2.12.1.1 MB Instruction Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–33
2.12.1.2 WMB Instruction Processing. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–34
2.12.1.3 TB Fill Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–34
2.13 Performance Measurement Support—Performance Counters . . . . . . . . . . . . . . . . . . . . . . . 2–36
2.14 Floating-Point Control Register . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–36
2.15 AMASK and IMPLVER Instruction Values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–38
2.15.1 AMASK. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–38
2.15.2 IMPLVER . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–38
2.16 Design Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–39
3 Hardware Interface
3.1 21264/EV67 Microprocessor Logic Symbol . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3–1
3.2 21264/EV67 Signal Names and Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3–3
3.3 Pin Assignments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3–8
3.4 Mechanical Specifications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3–17
3.5 21264/EV67 Packaging. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3–18
4 Cache and External Interfaces
4.1 Introduction to the External Interfaces. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4–1
4.1.1 System Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4–3
4.1.1.1 Commands and Addresses. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4–4
4.1.2 Second-Level Cache (Bcache) Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4–4
4.2 Physical Address Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4–4
4.3 Bcache Structure. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4–7
4.3.1 Bcache Interface Signals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4–7
4.3.2 System Duplicate Tag Stores. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4–7
4.4 Victim Data Buffer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4–8
4.5 Cache Coherency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4–8
4.5.1 Cache Coherency Basics. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4–8
4.5.2 Cache Block States . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4–9
4.5.3 Cache Block State Transitions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4–10
4.5.4 Using SysDc Commands . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4–11
4.5.5 Dcache States and Duplicate Tags. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4–13
4.6 Lock Mechanism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4–14
4.6.1 In-Order Processing of LDx_L/STx_C Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4–15
4.6.2 Internal Eviction of LDx_L Blocks. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4–15
4.6.3 Liveness and Fairness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4–15
4.6.4 Managing Speculative Store Issues with Multiprocessor Systems . . . . . . . . . . . . . . . . 4–16
4.7 System Port. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4–16
4.7.1 System Port Pins . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4–17
4.7.2 Programming the System Interface Clocks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4–18
4.7.3 21264/EV67-to-System Commands. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4–19
4.7.3.1 Bank Interleave on Cache Block Boundary Mode . . . . . . . . . . . . . . . . . . . . . . . . . 4–19
4.7.3.2 Page Hit Mode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4–20
4.7.4 21264/EV67-to-System Commands Descriptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4–21
4.7.5 ProbeResponse Commands (Command[4:0] = 00001). . . . . . . . . . . . . . . . . . . . . . . . . 4–24
4.7.6 SysAck and 21264/EV67-to-System Commands Flow Control . . . . . . . . . . . . . . . . . . . 4–25
4.7.7 System-to-21264/EV67 Commands. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4–26
4.7.7.1 Probe Commands (Four Cycles) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4–26
4.7.7.2 Data Transfer Commands (Two Cycles). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4–28
4.7.8 Data Movement In and Out of the 21264/EV67 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4–30
4.7.8.1 21264/EV67 Clock Basics. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4–30
4.7.8.2 Fast Data Mode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4–31
4.7.8.3 Fast Data Disable Mode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4–33
4.7.8.4 SysDataInValid_L and SysDataOutValid_L . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4–34
4.7.8.5 SysFillValid_L . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4–35
4.7.8.6 Data Wrapping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4–36
4.7.9 Nonexistent Memory Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4–38
4.7.10 Ordering of System Port Transactions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4–40
4.7.10.1 21264/EV67 Commands and System Probes . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4–40
4.7.10.2 System Probes and SysDc Commands . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4–42
4.8 Bcache Port. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4–42
4.8.1 Bcache Port Pins . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4–43
4.8.2 Bcache Clocking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4–44
4.8.2.1 Setting the Period of the Cache Clock . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4–45
4.8.3 Bcache Transactions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4–47
4.8.3.1 Bcache Data Read and Tag Read Transactions . . . . . . . . . . . . . . . . . . . . . . . . . . 4–47
4.8.3.2 Bcache Data Write Transactions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4–48
4.8.3.3 Bubbles on the Bcache Data Bus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4–49
4.8.4 Pin Descriptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4–51
4.8.4.1 BcAdd_H[23:4] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4–51
4.8.4.2 Bcache Control Pins . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4–52
4.8.4.3 BcDataInClk_H and BcTagInClk_H . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4–53
4.8.5 Bcache Banking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4–54
4.8.6 Disabling the Bcache for Debugging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4–54
4.9 Interrupts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4–54
5 Internal Processor Registers
5.1 Ebox IPRs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5–3
5.1.1 Cycle Counter Register – CC. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5–3
5.1.2 Cycle Counter Control Register – CC_CTL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5–3
5.1.3 Virtual Address Register – VA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5–4
5.1.4 Virtual Address Control Register – VA_CTL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5–4
5.1.5 Virtual Address Format Register – VA_FORM. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5–5
5.2 Ibox IPRs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5–6
5.2.1 ITB Tag Array Write Register – ITB_TAG . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5–6
5.2.2 ITB PTE Array Write Register – ITB_PTE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5–6
5.2.3 ITB Invalidate All Process (ASM=0) Register – ITB_IAP . . . . . . . . . . . . . . . . . . . . . . . . 5–7
5.2.4 ITB Invalidate All Register – ITB_IA. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5–7
5.2.5 ITB Invalidate Single Register – ITB_IS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5–7
5.2.6 ProfileMe PC Register – PMPC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5–8
5.2.7 Exception Address Register – EXC_ADDR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5–8
5.2.8 Instruction Virtual Address Format Register – IVA_FORM. . . . . . . . . . . . . . . . . . . . . . 5–9
5.2.9 Interrupt Enable and Current Processor Mode Register – IER_CM. . . . . . . . . . . . . . . . 5–9
5.2.10 Software Interrupt Request Register – SIRR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5–10
5.2.11 Interrupt Summary Register – ISUM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5–11
5.2.12 Hardware Interrupt Clear Register – HW_INT_CLR . . . . . . . . . . . . . . . . . . . . . . . . . . . 5–12
5.2.13 Exception Summary Register – EXC_SUM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5–13
5.2.14 PAL Base Register – PAL_BASE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5–15
5.2.15 Ibox Control Register – I_CTL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5–15
5.2.16 Ibox Status Register – I_STAT. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5–18
5.2.17 Icache Flush Register – IC_FLUSH . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5–21
5.2.18 Icache Flush ASM Register – IC_FLUSH_ASM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5–21
5.2.19 Clear Virtual-to-Physical Map Register – CLR_MAP . . . . . . . . . . . . . . . . . . . . . . . . . . . 5–21
5.2.20 Sleep Mode Register – SLEEP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5–21
5.2.21 Process Context Register – PCTX. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5–21
5.2.22 Performance Counter Control Register – PCTR_CTL . . . . . . . . . . . . . . . . . . . . . . . . . . 5–23
5.3 Mbox IPRs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5–25
5.3.1 DTB Tag Array Write Registers 0 and 1 – DTB_TAG0, DTB_TAG1 . . . . . . . . . . . . . . . 5–25
5.3.2 DTB PTE Array Write Registers 0 and 1 – DTB_PTE0, DTB_PTE1 . . . . . . . . . . . . . . . 5–26
5.3.3 DTB Alternate Processor Mode Register – DTB_ALTMODE. . . . . . . . . . . . . . . . . . . . . 5–26
5.3.4 Dstream TB Invalidate All Process (ASM=0) Register – DTB_IAP . . . . . . . . . . . . . . . . 5–27
5.3.5 Dstream TB Invalidate All Register – DTB_IA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5–27
5.3.6 Dstream TB Invalidate Single Registers 0 and 1 – DTB_IS0,1 . . . . . . . . . . . . . . . . . . . 5–27
5.3.7 Dstream TB Address Space Number Registers 0 and 1 – DTB_ASN0,1 . . . . . . . . . . . 5–28
5.3.8 Memory Management Status Register – MM_STAT . . . . . . . . . . . . . . . . . . . . . . . . . . . 5–28
5.3.9 Mbox Control Register – M_CTL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5–29
5.3.10 Dcache Control Register – DC_CTL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5–30
5.3.11 Dcache Status Register – DC_STAT. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5–31
5.4 Cbox CSRs and IPRs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5–32
5.4.1 Cbox Data Register – C_DATA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5–33
5.4.2 Cbox Shift Register – C_SHFT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5–33
5.4.3 Cbox WRITE_ONCE Chain Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5–33
5.4.4 Cbox WRITE_MANY Chain Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5–38
5.4.5 Cbox Read Register (IPR) Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5–41
6 Privileged Architecture Library Code
6.1 PALcode Description. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6–1
6.2 PALmode Environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6–2
6.3 Required PALcode Function Codes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6–3
6.4 Opcodes Reserved for PALcode. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6–3
6.4.1 HW_LD Instruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6–3
6.4.2 HW_ST Instruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6–4
6.4.3 HW_RET Instruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6–5
6.4.4 HW_MFPR and HW_MTPR Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6–6
6.5 Internal Processor Register Access Mechanisms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6–7
6.5.1 IPR Scoreboard Bits. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6–8
6.5.2 Hardware Structure of Explicitly Written IPRs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6–8
6.5.3 Hardware Structure of Implicitly Written IPRs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6–9
6.5.4 IPR Access Ordering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6–9
6.5.5 Correct Ordering of Explicit Writers Followed by Implicit Readers. . . . . . . . . . . . . . . . . 6–10
6.5.6 Correct Ordering of Explicit Readers Followed by Implicit Writers. . . . . . . . . . . . . . . . . 6–11
6.6 PALshadow Registers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6–11
6.7 PALcode Emulation of the FPCR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6–11
6.7.1 Status Flags . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6–12
6.7.2 MF_FPCR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6–12
6.7.3 MT_FPCR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6–12
6.8 PALcode Entry Points . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6–12
6.8.1 CALL_PAL Entry Points . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6–12
6.8.2 PALcode Exception Entry Points . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6–13
6.9 Translation Buffer (TB) Fill Flows . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6–14
6.9.1 DTB Fill . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6–14
6.9.2 ITB Fill . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6–16
6.10 Performance Counter Support . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6–17
6.10.1 General Precautions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6–18
6.10.2 Aggregate Mode Programming Guidelines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6–18
6.10.2.1 Aggregate Mode Precautions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6–18
6.10.2.2 Operation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6–19
6.10.2.3 Aggregate Counting Mode Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6–20
6.10.2.3.1 Cycle counting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6–20
6.10.2.3.2 Retired instructions cycles. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6–20
6.10.2.3.3 Bcache miss or long latency probes cycles. . . . . . . . . . . . . . . . . . . . . . . . . . . 6–20
6.10.2.3.4 Mbox replay traps cycles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6–20
6.10.2.4 Counter Modes for Aggregate Mode. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6–20
6.10.3 ProfileMe Mode Programming Guidelines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6–20
6.10.3.1 ProfileMe Mode Precautions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6–20
6.10.3.2 Operation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6–21
6.10.3.3 ProfileMe Counting Mode Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6–23
6.10.3.3.1 Cycle counting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6–23
6.10.3.3.2 Inum retire delay cycles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6–23
6.10.3.3.3 Retired instructions cycles. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6–23
6.10.3.3.4 Bcache miss or long latency probes cycles. . . . . . . . . . . . . . . . . . . . . . . . . . . 6–23
6.10.3.3.5 Mbox replay traps cycles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6–23
6.10.3.4 Counter Modes for ProfileMe Mode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6–24
7 Initialization and Configuration
7.1 Power-Up Reset Flow and the Reset_L and DCOK_H Pins. . . . . . . . . . . . . . . . . . . . . . . . . 7–1
7.1.1 Power Sequencing and Reset State for Signal Pins . . . . . . . . . . . . . . . . . . . . . . . . . . . 7–3
7.1.2 Clock Forwarding and System Clock Ratio Configuration . . . . . . . . . . . . . . . . . . . . . . . 7–4
7.1.3 PLL Ramp Up. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7–6
7.1.4 BiST and SROM Load and the TestStat_H Pin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7–6
7.1.5 Clock Forward Reset and System Interface Initialization. . . . . . . . . . . . . . . . . . . . . . . . 7–7
7.2 Fault Reset Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7–8
7.3 Energy Star Certification and Sleep Mode Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7–9
7.4 Warm Reset Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7–11
7.5 Array Initialization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7–12
7.6 Initialization Mode Processing. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7–12
7.7 External Interface Initialization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7–14
7.8 Internal Processor Register Power-Up Reset State . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7–14
7.9 IEEE 1149.1 Test Port Reset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7–16
7.10 Reset State Machine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7–16
7.11 Phase-Lock Loop (PLL) Functional Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7–19
7.11.1 Differential Reference Clocks. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7–19
7.11.2 PLL Output Clocks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7–19
7.11.2.1 GCLK . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7–19
7.11.2.2 Differential 21264/EV67 Clocks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7–19
7.11.2.3 Nominal Operating Frequency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7–19
7.11.2.4 Power-Up/Reset Clocking. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7–20
8 Error Detection and Error Handling
8.1 Data Error Correction Code. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8–2
8.2 Icache Data or Tag Parity Error. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8–2
8.3 Dcache Tag Parity Error . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8–2
8.4 Dcache Data Single-Bit Correctable ECC Error . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8–3
8.4.1 Load Instruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8–3
8.4.2 Store Instruction (Quadword or Smaller) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8–4
8.4.3 Dcache Victim Extracts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8–4
8.5 Dcache Store Second Error . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8–4
8.6 Dcache Duplicate Tag Parity Error . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8–4
8.7 Bcache Tag Parity Error . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8–5
8.8 Bcache Data Single-Bit Correctable ECC Error . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8–5
8.8.1 Icache Fill from Bcache . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8–5
8.8.2 Dcache Fill from Bcache . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8–6
8.8.3 Bcache Victim Read. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8–6
8.8.3.1 Bcache Victim Read During a Dcache/Bcache Miss . . . . . . . . . . . . . . . . . . . . . . . 8–6
8.8.3.2 Bcache Victim Read During an ECB Instruction . . . . . . . . . . . . . . . . . . . . . . . . . . . 8–7
8.9 Memory/System Port Single-Bit Data Correctable ECC Error. . . . . . . . . . . . . . . . . . . . . . . . 8–7
8.9.1 Icache Fill from Memory. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8–7
8.9.2 Dcache Fill from Memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8–7
8.10 Bcache Data Single-Bit Correctable ECC Error on a Probe . . . . . . . . . . . . . . . . . . . . . . . . . 8–8
8.11 Double-Bit Fill Errors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8–9
8.12 Error Case Summary. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8–9
9 Electrical Data
9.1 Electrical Characteristics. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9–1
9.2 DC Characteristics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9–2
9.3 Power Supply Sequencing and Avoiding Potential Failure Mechanisms . . . . . . . . . . . . . . . 9–5
9.4 AC Characteristics. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9–6
10 Thermal Management
10.1 Operating Temperature. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10–1
10.2 Heat Sink Specifications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10–3
10.3 Thermal Design Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10–7
11 Testability and Diagnostics
11.1 Test Pins . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11–1
11.2 SROM/Serial Diagnostic Terminal Port . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11–2
11.2.1 SROM Load Operation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11–2
11.2.2 Serial Terminal Port . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11–2
11.3 IEEE 1149.1 Port. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11–3
11.4 TestStat_H Pin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11–4
11.5 Power-Up Self-Test and Initialization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11–5
11.5.1 Built-in Self-Test. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11–5
11.5.2 SROM Initialization. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11–5
11.5.2.1 Serial Instruction Cache Load Operation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11–6
11.6 Notes on IEEE 1149.1 Operation and Compliance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11–7
A Alpha Instruction Set
A.1 Alpha Instruction Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A–1
A.2 Reserved Opcodes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A–8
A.2.1 Opcodes Reserved for Compaq. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A–8
A.2.2 Opcodes Reserved for PALcode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A–9
A.3 IEEE Floating-Point Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A–9
A.4 VAX Floating-Point Instructions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A–11
A.5 Independent Floating-Point Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A–11
A.6 Opcode Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A–12
A.7 Required PALcode Function Codes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A–13
A.8 IEEE Floating-Point Conformance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A–14
B 21264/EV67 Boundary-Scan Register
B.1 Boundary-Scan Register . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B–1
B.1.1 BSDL Description of the Alpha 21264/EV67 Boundary-Scan Register . . . . . . . . . . . . . B–1
C Serial Icache Load Predecode Values
D PALcode Restrictions and Guidelines
D.1 Restriction 1 : Reset Sequence Required by Retire Logic and Mapper . . . . . . . . . . . . . . . D–1
D.2 Restriction 2 : No Multiple Writers to IPRs in Same Scoreboard Group . . . . . . . . . . . . . . . D–8
D.3 Restriction 4 : No Writers and Readers to IPRs in Same Scoreboard Group . . . . . . . . . . D–8
D.4 Guideline 6 : Avoid Consecutive Read-Modify-Write-Read-Modify-Write . . . . . . . . . . . . D–9
D.5 Restriction 7 : Replay Trap, Interrupt Code Sequence, and STF/ITOF . . . . . . . . . . . . . . . D–9
D.6 Restriction 9 : PALmode Istream Address Ranges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . D–10
D.7 Restriction 10: Duplicate IPR Mode Bits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . D–10
D.8 Restriction 11: Ibox IPR Update Synchronization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . D–11
D.9 Restriction 12: MFPR of Implicitly-Written IPRs EXC_ADDR, IVA_FORM, and EXC_SUM D–11
D.10 Restriction 13 : DTB Fill Flow Collision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . D–11
D.11 Restriction 14 : HW_RET . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . D–11
D.12 Guideline 16 : JSR-BAD VA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . D–12
D.13 Restriction 17: MTPR to DTB_TAG0/DTB_PTE0/DTB_TAG1/DTB_PTE1 . . . . . . . . . . . . . D–12
D.14 Restriction 18: No FP Operates, FP Conditional Branches, FTOI, or STF in Same Fetch Block as HW_MTPR . . . . D–12
D.15 Restriction 19: HW_RET/STALL After Updating the FPCR by way of MT_FPCR in PALmode D–12
D.16 Guideline 20 : I_CTL[SBE] Stream Buffer Enable. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . D–12
D.17 Restriction 21: HW_RET/STALL After HW_MTPR ASN0/ASN1. . . . . . . . . . . . . . . . . . . . . . D–12
D.18 Restriction 22: HW_RET/STALL After HW_MTPR IS0/IS1. . . . . . . . . . . . . . . . . . . . . . . . . . D–13
D.19 Restriction 23: HW_ST/P/CONDITIONAL Does Not Clear the Lock Flag. . . . . . . . . . . . . . . D–13
D.20 Restriction 24: HW_RET/STALL After HW_MTPR IC_FLUSH, IC_FLUSH_ASM, CLEAR_MAP D–14
D.21 Restriction 25: HW_MTPR ITB_IA After Reset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . D–14
D.22 Guideline 26: Conditional Branches in PALcode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . D–14
D.23 Restriction 27: Reset of ‘Force-Fail Lock Flag’ State in PALcode. . . . . . . . . . . . . . . . . . . . . D–15
D.24 Restriction 28: Enforce Ordering Between IPRs Implicitly Written by Loads and Subsequent Loads D–15
D.25 Guideline 29 : JSR, JMP, RET, and JSR_COR in PALcode. . . . . . . . . . . . . . . . . . . . . . . . . D–15
D.26 Restriction 30 : HW_MTPR and HW_MFPR to the Cbox CSR . . . . . . . . . . . . . . . . . . . . . . . D–15
D.27 Restriction 31 : I_CTL[VA_48] Update . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . D–17
D.28 Restriction 32 : PCTR_CTL Update . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . D–17
D.29 Restriction 33 : HW_LD Physical/Lock Use. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . D–18
D.30 Restriction 34 : Writing Multiple ITB Entries in the Same PALcode Flow . . . . . . . . . . . . . . . D–18
D.31 Guideline 35 : HW_INT_CLR Update . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . D–18
D.32 Restriction 36 : Updating I_CTL[SDE]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . D–18
D.33 Restriction 37 : Updating VA_CTL[VA_48] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . D–18
D.34 Restriction 38 : Updating PCTR_CTL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . D–18
D.35 Guideline 39: Writing Multiple DTB Entries in the Same PAL Flow. . . . . . . . . . . . . . . . . . . . D–19
D.36 Restriction 40: Scrubbing a Single-Bit Error . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . D–19
D.37 Restriction 41: MTPR ITB_TAG, MTPR ITB_PTE Must Be in the Same Fetch Block . . . . . D–21
D.38 Restriction 42: Updating VA_CTL, CC_CTL, or CC IPRs . . . . . . . . . . . . . . . . . . . . . . . . . . . D–21
D.39 Restriction 43: No Trappable Instructions Along with HW_MTPR. . . . . . . . . . . . . . . . . . . . . D–21
D.40 Restriction 44: Not Applicable to the 21264/EV67 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . D–21
D.41 Restriction 45: No HW_JMP or JMP Instructions in PALcode . . . . . . . . . . . . . . . . . . . . . . . D–21
D.42 Restriction 46: Avoiding Livelocks in Speculative Load CRD Handlers . . . . . . . . . . . . . . . D–22
D.43 Restriction 47: Cache Eviction for Single-Bit Cache Errors . . . . . . . . . . . . . . . . . . . . . . . . . D–22
D.44 Restriction 48: MB Bracketing of Dcache Writes to Force Bad Data ECC and Force Bad Tag Parity D–24
E 21264/EV67-to-Bcache Pin Interconnections
E.1 Forwarding Clock Pin Groupings. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . E–1
E.2 Late-Write Non-Bursting SSRAMs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . E–2
E.3 Dual-Data Rate SSRAMs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . E–3
Glossary
Index
Figures
2–1 21264/EV67 Block Diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–3
2–2 Branch Predictor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–4
2–3 Local Predictor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–4
2–4 Global Predictor. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–5
2–5 Choice Predictor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–5
2–6 Integer Execution Unit—Clusters 0 and 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–9
2–7 Floating-Point Execution Units . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–10
2–8 Pipeline Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–14
2–9 Pipeline Timing for Integer Load Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–25
2–10 Pipeline Timing for Floating-Point Load Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–26
2–11 Floating-Point Control Register . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–36
2–12 Typical Uniprocessor Configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–39
2–13 Typical Multiprocessor Configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–40
3–1 21264/EV67 Microprocessor Logic Symbol . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3–2
3–2 Package Dimensions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3–17
3–3 21264/EV67 Top View (Pin Down) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3–18
3–4 21264/EV67 Bottom View (Pin Up) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3–19
4–1 21264/EV67 System and Bcache Interfaces. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4–3
4–2 21264/EV67 Bcache Interface Signals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4–7
4–3 Cache Subset Hierarchy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4–9
4–4 System Interface Signals. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4–17
4–5 Fast Transfer Timing Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4–32
4–6 SysFillValid_L Timing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4–36
5–1 Cycle Counter Register . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5–3
5–2 Cycle Counter Control Register. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5–3
5–3 Virtual Address Register . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5–4
5–4 Virtual Address Control Register . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5–4
5–5 Virtual Address Format Register (VA_48 = 0, VA_FORM_32 = 0) . . . . . . . . . . . . . . . . . . . . 5–5
5–6 Virtual Address Format Register (VA_48 = 1, VA_FORM_32 = 0) . . . . . . . . . . . . . . . . . . . . 5–6
5–7 Virtual Address Format Register (VA_48 = 0, VA_FORM_32 = 1) . . . . . . . . . . . . . . . . . . . . 5–6
5–8 ITB Tag Array Write Register . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5–6
5–9 ITB PTE Array Write Register . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5–7
5–10 ITB Invalidate Single Register. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5–7
5–11 ProfileMe PC Register. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5–8
5–12 Exception Address Register . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5–8
5–13 Instruction Virtual Address Format Register (VA_48 = 0, VA_FORM_32 = 0) . . . . . . . . . . . 5–9
5–14 Instruction Virtual Address Format Register (VA_48 = 1, VA_FORM_32 = 0) . . . . . . . . . . . 5–9
5–15 Instruction Virtual Address Format Register (VA_48 = 0, VA_FORM_32 = 1) . . . . . . . . . . . 5–9
5–16 Interrupt Enable and Current Processor Mode Register . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5–10
5–17 Software Interrupt Request Register . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5–11
5–18 Interrupt Summary Register . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5–11
5–19 Hardware Interrupt Clear Register . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5–12
5–20 Exception Summary Register . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5–14
5–21 PAL Base Register . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5–15
5–22 Ibox Control Register. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5–16
5–23 Ibox Status Register . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5–19
5–24 Process Context Register . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5–22
5–25 Performance Counter Control Register . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5–23
5–26 DTB Tag Array Write Registers 0 and 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5–25
5–27 DTB PTE Array Write Registers 0 and 1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5–26
5–28 DTB Alternate Processor Mode Register . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5–26
5–29 Dstream Translation Buffer Invalidate Single Registers . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5–27
5–30 Dstream Translation Buffer Address Space Number Registers 0 and 1 . . . . . . . . . . . . . . . . 5–28
5–31 Memory Management Status Register . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5–28
5–32 Mbox Control Register. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5–29
5–33 Dcache Control Register . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5–31
5–34 Dcache Status Register. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5–32
5–35 Cbox Data Register . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5–33
5–36 Cbox Shift Register . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5–33
5–37 WRITE_MANY Chain Write Transaction Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5–39
6–1 HW_LD Instruction Format . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6–4
6–2 HW_ST Instruction Format . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6–4
6–3 HW_RET Instruction Format . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6–6
6–4 HW_MFPR and HW_MTPR Instructions Format . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6–6
6–5 Single-Miss DTB Instructions Flow Example. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6–14
6–6 ITB Miss Instructions Flow Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6–16
7–1 Power-Up Timing Sequence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7–3
7–2 Fault Reset Sequence of Operation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7–9
7–3 Sleep Mode Sequence of Operation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7–11
7–4 Example for Initializing Bcache . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7–13
7–5 21264/EV67 Reset State Machine State Diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7–17
10–1 Type 1 Heat Sink. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10–4
10–2 Type 2 Heat Sink. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10–5
10–3 Type 3 Heat Sink. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10–6
11–1 TAP Controller State Machine. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11–4
11–2 TestStat_H Pin Timing During Power-Up Built-In Self-Test (BiST) . . . . . . . . . . . . . . . . . . . 11–5
11–3 TestStat_H Pin Timing During Built-In Self-Initialization (BiSI) . . . . . . . . . . . . . . . . . . . . . . . 11–5
11–4 SROM Content Map . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11–6
Tables
1–1 Integer Data Types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1–2
2–1 Pipeline Abort Delay (GCLK Cycles). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–16
2–2 Instruction Name, Pipeline, and Types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–17
2–3 Instruction Group Definitions and Pipeline Unit. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–18
2–4 Instruction Class Latency in Cycles. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–20
2–5 Minimum Retire Latencies for Instruction Classes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–21
2–6 Instructions Retired Without Execution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–23
2–7 Rules for I/O Address Space Load Instruction Data Merging . . . . . . . . . . . . . . . . . . . . . . . . 2–28
2–8 Rules for I/O Address Space Store Instruction Data Merging. . . . . . . . . . . . . . . . . . . . . . . . 2–29
2–9 MAF Merging Rules. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–30
2–10 Memory Reference Ordering. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–31
2–11 I/O Reference Ordering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–31
2–12 TB Fill Flow Example Sequence 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–34
2–13 TB Fill Flow Example Sequence 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–35
2–14 Floating-Point Control Register Fields. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–36
2–15 21264/EV67 AMASK Values. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–38
2–16 AMASK Bit Assignments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–38
3–1 Signal Pin Types Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3–3
3–2 21264/EV67 Signal Descriptions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3–3
3–3 21264/EV67 Signal Descriptions by Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3–6
3–4 Pin List Sorted by Signal Name. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3–8
3–5 Pin List Sorted by PGA Location . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3–12
3–6 Ground and Power (VSS and VDD) Pin List . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3–16
4–1 Translation of Internal References to External Interface Reference . . . . . . . . . . . . . . . . . . . 4–5
4–2 21264/EV67-Supported Cache Block States . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4–9
4–3 Cache Block State Transitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4–10
4–4 System Responses to 21264/EV67 Commands . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4–10
4–5 System Responses to 21264/EV67 Commands and 21264/EV67 Reactions. . . . . . . . . . . . 4–11
4–6 System Port Pins. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4–17
4–7 Programming Values for System Interface Clocks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4–18
4–8 Program Values for Data-Sample/Drive CSRs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4–18
4–9 Forwarded Clocks and Frame Clock Ratio . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4–19
4–10 Bank Interleave on Cache Block Boundary Mode of Operation . . . . . . . . . . . . . . . . . . . . . . 4–19
4–11 Page Hit Mode of Operation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4–20
4–12 21264/EV67-to-System Command Fields Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4–20
4–13 Maximum Physical Address for Short Bus Format . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4–21
4–14 21264/EV67-to-System Commands Descriptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4–21
4–15 Programming INVAL_TO_DIRTY_ENABLE[1:0]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4–23
4–16 Programming SET_DIRTY_ENABLE[2:0]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4–24
4–17 21264/EV67 ProbeResponse Command
4–18 ProbeResponse Fields Descriptions
4–19 System-to-21264/EV67 Probe Commands
4–20 System-to-21264/EV67 Probe Commands Fields Descriptions
4–21 Data Movement Selection by Probe[4:3]
4–22 Next Cache Block State Selection by Probe[2:0]
4–23 Data Transfer Command Format
4–24 SysDc[4:0] Field Description
4–25 SYSCLK Cycles Between SysAddOut and SysData
4–26 Cbox CSR SYSDC_DELAY[4:0] Examples
4–27 Four Timing Examples
4–28 Data Wrapping Rules
4–29 System Wrap and Deliver Data
4–30 Wrap Interleave Order
4–31 Wrap Order for Double-Pumped Data Transfers
4–32 21264/EV67 Commands with NXM Addresses and System Response
4–33 21264/EV67 Response to System Probe and In-Flight Command Interaction
4–34 Rules for System Control of Cache Status Update Order
4–35 Range of Maximum Bcache Clock Ratios
4–36 Bcache Port Pins
4–37 BC_CPU_CLK_DELAY[1:0] Values
4–38 BC_CLK_DELAY[1:0] Values
4–39 Program Values to Set the Cache Clock Period (Single-Data)
4–40 Program Values to Set the Cache Clock Period (Dual-Data Rate)
4–41 Data-Sample/Drive Cbox CSRs
4–42 Programming the Bcache to Support Each Size of the Bcache
4–43 Programming the Bcache Control Pins
4–44 Control Pin Assertion for RAM_TYPE A
4–45 Control Pin Assertion for RAM_TYPE B
4–46 Control Pin Assertion for RAM_TYPE C
4–47 Control Pin Assertion for RAM_TYPE D
5–1 Internal Processor Registers
5–2 Cycle Counter Control Register Fields Description
5–3 Virtual Address Control Register Fields Description
5–4 ProfileMe PC Fields Description
5–5 IER_CM Register Fields Description
5–6 Software Interrupt Request Register Fields Description
5–7 Interrupt Summary Register Fields Description
5–8 Hardware Interrupt Clear Register Fields Description
5–9 Exception Summary Register Fields Description
5–10 PAL Base Register Fields Description
5–11 Ibox Control Register Fields Description
5–12 Ibox Status Register Fields Description
5–13 IPR Index Bits and Register Fields
5–14 Process Context Register Fields Description
5–15 Performance Counter Control Register Fields Description
5–16 Performance Counter Control Register Input Select Fields
5–17 DTB Alternate Processor Mode Register Fields Description
5–18 Memory Management Status Register Fields Description
5–19 Mbox Control Register Fields Description
5–20 Dcache Control Register Fields Description
5–21 Dcache Status Register Fields Description
5–22 Cbox Data Register Fields Description
5–23 Cbox Shift Register Fields Description
5–24 Cbox WRITE_ONCE Chain Order
5–25 Cbox WRITE_MANY Chain Order
5–26 Cbox Read IPR Fields Description
6–1 Required PALcode Function Codes
6–2 Opcodes Reserved for PALcode
6–3 HW_LD Instruction Fields Descriptions
6–4 HW_ST Instruction Fields Descriptions
6–5 HW_RET Instruction Fields Descriptions
6–6 HW_MFPR and HW_MTPR Instructions Fields Descriptions
6–7 Paired Instruction Fetch Order
6–8 PALcode Exception Entry Locations
6–9 IPRs Used for Performance Counter Support
6–10 Aggregate Mode Returned IPR Contents
6–11 Aggregate Mode Performance Counter IPR Input Select Fields
6–12 CMOV Decomposed
6–13 ProfileMe Mode Returned IPR Contents
6–14 ProfileMe Mode PCTR_CTL Input Select Fields
7–1 21264/EV67 Reset State Machine Major Operations
7–2 Signal Pin Reset State
7–3 Pin Signal Names and Initialization State
7–4 Power-Up Flow Signals and Their Constraints
7–5 Effect on IPRs After Fault Reset
7–6 Effect on IPRs After Transition Through Sleep Mode
7–7 Signals and Constraints for the Sleep Mode Sequence
7–8 Effect on IPRs After Warm Reset
7–9 WRITE_MANY Chain CSR Values for Bcache Initialization
7–10 Internal Processor Registers at Power-Up Reset State
7–11 21264/EV67 Reset State Machine State Descriptions
7–12 Differential Reference Clock Frequencies in Full-Speed Lock
8–1 21264/EV67 Error Detection Mechanisms
8–2 64-Bit Data and Check Bit ECC Code
8–3 Error Case Summary
9–1 Maximum Electrical Ratings
9–2 Signal Types
9–3 VDD (I_DC_POWER)
9–4 Input DC Reference Pin (I_DC_REF)
9–5 Input Differential Amplifier Receiver (I_DA)
9–6 Input Differential Amplifier Clock Receiver (I_DA_CLK)
9–7 Pin Type: Open-Drain Output Driver (O_OD)
9–8 Bidirectional, Differential Amplifier Receiver, Open-Drain Output Driver (B_DA_OD)
9–9 Pin Type: Open-Drain Driver for Test Pins (O_OD_TP)
9–10 Bidirectional, Differential Amplifier Receiver, Push-Pull Output Driver (B_DA_PP)
9–11 Push-Pull Output Driver (O_PP)
9–12 Push-Pull Output Clock Driver (O_PP_CLK)
9–13 AC Specifications
10–1 Operating Temperature at Heat Sink Center (Tc)
10–2 θca at Various Airflows for 21264/EV67
10–3 Maximum Ta for 21264/EV67 @ 600 MHz and @ 2.0 V with Various Airflows
10–4 Maximum Ta for 21264/EV67 @ 667 MHz and @ 2.0 V with Various Airflows
10–5 Maximum Ta for 21264/EV67 @ 700 MHz and @ 2.0 V with Various Airflows
10–6 Maximum Ta for 21264/EV67 @ 733 MHz and @ 2.0 V with Various Airflows
10–7 Maximum Ta for 21264/EV67 @ 750 MHz and @ 2.0 V with Various Airflows
10–8 Maximum Ta for 21264/EV67 @ 833 MHz and @ 2.0 V with Various Airflows
11–1 Dedicated Test Port Pins
11–2 IEEE 1149.1 Instructions and Opcodes
11–3 Icache Bit Fields in an SROM Line
A–1 Instruction Format and Opcode Notation
A–2 Architecture Instructions
A–3 Opcodes Reserved for Compaq
A–4 Opcodes Reserved for PALcode
A–5 IEEE Floating-Point Instruction Function Codes
A–6 VAX Floating-Point Instruction Function Codes
A–7 Independent Floating-Point Instruction Function Codes
A–8 Opcode Summary
A–9 Key to Opcode Summary Used in Table A–8
A–10 Required PALcode Function Codes
A–11 Exceptional Input and Output Conditions
E–1 Bcache Forwarding Clock Pin Groupings
E–2 Late-Write Non-Bursting SSRAMs Data Pin Usage
E–3 Late-Write Non-Bursting SSRAMs Tag Pin Usage
E–4 Dual-Data Rate SSRAM Data Pin Usage
E–5 Dual-Data Rate SSRAM Tag Pin Usage
Preface
Audience
This manual is for system designers and programmers who use the Alpha 21264/EV67
microprocessor (referred to as the 21264/EV67).
Content
This manual contains the following chapters and appendixes:
Chapter 1, Introduction, introduces the 21264/EV67 and provides an overview of the
Alpha architecture.
Chapter 2, Internal Architecture, describes the major hardware functions and the inter-
nal chip architecture. It describes performance measurement facilities, coding rules, and
design examples.
Chapter 3, Hardware Interface, lists and describes the internal hardware interface sig-
nals, and provides mechanical data and packaging information, including signal pin
lists.
Chapter 4, Cache and External Interfaces, describes the external bus functions and
transactions, lists bus commands, and describes the clock functions.
Chapter 5, Internal Processor Registers, lists and describes the internal processor regis-
ter set.
Chapter 6, Privileged Architecture Library Code, describes the privileged architecture
library code (PALcode).
Chapter 7, Initialization and Configuration, describes the initialization and configura-
tion sequence.
Chapter 8, Error Detection and Error Handling, describes error detection and error han-
dling.
Chapter 9, Electrical Data, provides electrical data and describes signal integrity issues.
Chapter 10, Thermal Management, provides information about thermal management.
Chapter 11, Testability and Diagnostics, describes chip and system testability features.
Appendix A, Alpha Instruction Set, summarizes the Alpha instruction set.
Appendix B, 21264/EV67 Boundary-Scan Register, presents the BSDL description of
the 21264/EV67 boundary-scan register.
Appendix C, Serial Icache Load Predecode Values, provides a pointer to the Alpha
Motherboards Software Developers Kit (SDK), which contains this information.
Appendix D, PALcode Restrictions and Guidelines, lists restrictions and guidelines
that must be adhered to when generating PALcode.
Appendix E, 21264/EV67-to-Bcache Pin Interconnections, provides the pin interface
between the 21264/EV67 and Bcache SSRAMs.
The Glossary lists and defines terms associated with the 21264/EV67.
An Index is provided at the end of the document.
Documentation Included by Reference
The companion volume to this manual, the Alpha Architecture Handbook, Version 4,
contains the instruction set architecture. You can access this document from the
following website:
ftp.digital.com/pub/Digital/info/semiconductor/literature/dsc-library.html
Also available is the Alpha Architecture Reference Manual, Third Edition, which con-
tains the complete architecture information. That manual is available at bookstores
from the Digital Press as EQ-W938E-DP.
Terminology and Conventions
This section defines the abbreviations, terminology, and other conventions used
throughout this document.
Abbreviations
Binary Multiples
The abbreviations K, M, and G (kilo, mega, and giga) represent binary multiples
and have the following values.
K = 2^10 (1024)
M = 2^20 (1,048,576)
G = 2^30 (1,073,741,824)
For example:
2KB = 2 kilobytes = 2 × 2^10 bytes
4MB = 4 megabytes = 4 × 2^20 bytes
8GB = 8 gigabytes = 8 × 2^30 bytes
2K pixels = 2 kilopixels = 2 × 2^10 pixels
4M pixels = 4 megapixels = 4 × 2^20 pixels
Register Access
The abbreviations used to indicate the type of access to register fields and bits have
the following definitions:
Abbreviation  Meaning
IGN           Ignore
              Bits and fields specified are ignored on writes.
MBZ           Must Be Zero
              Software must never place a nonzero value in bits and fields specified as
              MBZ. A nonzero read produces an Illegal Operand exception. Also, MBZ
              fields are reserved for future use.
RAZ           Read As Zero
              Bits and fields return a zero when read.
RC            Read Clears
              Bits and fields are cleared when read. Unless otherwise specified, such bits
              cannot be written.
RES           Reserved
              Bits and fields are reserved by Compaq and should not be used; however,
              zeros can be written to reserved fields that cannot be masked.
RO            Read Only
              The value may be read by software. It is written by hardware. Software write
              operations are ignored.
RO,n          Read Only, and takes the value n at power-on reset.
              The value may be read by software. It is written by hardware. Software write
              operations are ignored.
RW            Read/Write
              Bits and fields can be read and written.
RW,n          Read/Write, and takes the value n at power-on reset.
              Bits and fields can be read and written.
W1C           Write One to Clear
              If read operations are allowed to the register, then the value may be read by
              software. If it is a write-only register, then a read operation by software
              returns an UNPREDICTABLE result. Software write operations of a 1 cause
              the bit to be cleared by hardware. Software write operations of a 0 do not
              modify the state of the bit.
W1S           Write One to Set
              If read operations are allowed to the register, then the value may be read by
              software. If it is a write-only register, then a read operation by software
              returns an UNPREDICTABLE result. Software write operations of a 1 cause
              the bit to be set by hardware. Software write operations of a 0 do not modify
              the state of the bit.
WO            Write Only
              Bits and fields can be written but not read.
WO,n          Write Only, and takes the value n at power-on reset.
              Bits and fields can be written but not read.
Sign extension
SEXT(x) means x is sign-extended to the required size.
Addresses
Unless otherwise noted, all addresses and offsets are hexadecimal.
Aligned and Unaligned
The terms aligned and naturally aligned are interchangeable and refer to data objects
that are powers of two in size. An aligned datum of size 2^n is stored in memory at a
byte address that is a multiple of 2^n; that is, one that has n low-order zeros. For
example, an aligned 64-byte stack frame has a memory address that is a multiple of 64.
A datum of size 2^n is unaligned if it is stored in a byte address that is not a multiple of
2^n.
Bit Notation
Multiple-bit fields can include contiguous and noncontiguous bits contained in square
brackets ([]). Multiple contiguous bits are indicated by a pair of numbers separated by a
colon [:]. For example, [9:7,5,2:0] specifies bits 9, 8, 7, 5, 2, 1, and 0. Similarly, single bits
are frequently indicated with square brackets. For example, [27] specifies bit 27. See
also Field Notation.
Caution
Cautions indicate potential damage to equipment or loss of data.
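Two of the conventions above, Bit Notation and Aligned and Unaligned, translate directly into small helper routines. The following is a minimal sketch in Python; the function names expand_bits and is_aligned are hypothetical and are not part of this manual or of any Compaq tool.

```python
def expand_bits(spec: str) -> list[int]:
    """Expand a bit list such as "[9:7,5,2:0]" into individual bit numbers."""
    bits = []
    for part in spec.strip("[]").split(","):
        if ":" in part:
            hi, lo = (int(n) for n in part.split(":"))
            bits.extend(range(hi, lo - 1, -1))  # a hi:lo pair is inclusive
        else:
            bits.append(int(part))
    return bits


def is_aligned(address: int, size: int) -> bool:
    """Return True if a datum of `size` bytes (a power of two) is naturally aligned."""
    if size <= 0 or size & (size - 1):
        raise ValueError("size must be a power of two")
    # A multiple of 2^n has n low-order zero bits.
    return (address & (size - 1)) == 0


print(expand_bits("[9:7,5,2:0]"))  # [9, 8, 7, 5, 2, 1, 0]
print(is_aligned(0x1C0, 64))       # True: 0x1C0 is a multiple of 64
print(is_aligned(0x1C4, 64))       # False
```

The alignment test relies on the property stated above: an address that is a multiple of 2^n has n low-order zeros, so masking with size minus one is sufficient.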
Data Units
The following data unit terminology is used throughout this manual.
Term      Words  Bytes  Bits  Other
Byte      ½      1      8
Word      1      2      16
Longword  2      4      32    Dword
Quadword  4      8      64    2 longwords
Do Not Care (X)
A capital X represents any valid value.
External
Unless otherwise stated, external means not contained in the chip.
Field Notation
The names of single-bit and multiple-bit fields can be used rather than the actual bit
numbers (see Bit Notation). When the field name is used, it is contained in square
brackets ([]). For example, RegisterName[LowByte] specifies RegisterName[7:0].
Note
Notes emphasize particularly important information.
Numbering
All numbers are decimal or hexadecimal unless otherwise indicated. The prefix 0x
indicates a hexadecimal number. For example, 19 is decimal, but 0x19 and 0x19A are
hexadecimal (also see Addresses). Otherwise, the base is indicated by a subscript; for
example, 100₂ is a binary number.
Ranges and Extents
Ranges are specified by a pair of numbers separated by two periods (..) and are
inclusive. For example, a range of integers 0..4 includes the integers 0, 1, 2, 3, and 4.
Extents are specified by a pair of numbers in square brackets ([]) separated by a colon
(:) and are inclusive. Bit fields are often specified as extents. For example, bits [7:3]
specifies bits 7, 6, 5, 4, and 3.
Register Figures
The gray areas in register figures indicate reserved or unused bits and fields.
Bit ranges that are coupled with the field name specify the bits of the named field that
are included in the register. The bit range may, but need not necessarily, correspond to
the bit Extent in the register. See the explanation above Table 5–1 for more information.
Signal Names
The following examples describe signal-name conventions used in this document.
AlphaSignal[n:n]    Boldface, mixed-case type denotes signal names that are
                    assigned internal and external to the 21264/EV67 (that is,
                    the signal traverses a chip interface pin).
AlphaSignal_x[n:n]  When a signal has high and low assertion states, a lowercase
                    italic x represents the assertion states. For example,
                    SignalName_x[3:0] represents SignalName_H[3:0] and
                    SignalName_L[3:0].
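The inclusive extents described under Ranges and Extents and Field Notation correspond to a shift followed by a mask. The sketch below is illustrative only; extract_field is a hypothetical helper, and the value 0x19A is simply the hexadecimal example from the Numbering entry reused as register contents.

```python
def extract_field(value: int, hi: int, lo: int) -> int:
    """Return bits [hi:lo] of value (an inclusive extent), right-justified."""
    if hi < lo or lo < 0:
        raise ValueError("expected hi >= lo >= 0")
    width = hi - lo + 1                      # [7:3] covers 5 bits
    return (value >> lo) & ((1 << width) - 1)


reg = 0x19A
print(hex(extract_field(reg, 7, 0)))  # 0x9a, the [7:0] low byte
print(bin(extract_field(reg, 7, 3)))  # 0b10011, bits 7, 6, 5, 4, and 3
```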
UNDEFINED
Operations specified as UNDEFINED may vary from moment to moment, implementa-
tion to implementation, and instruction to instruction within implementations. The
operation may vary in effect from nothing to stopping system operation.
UNDEFINED operations may halt the processor or cause it to lose information. How-
ever, UNDEFINED operations must not cause the processor to hang, that is, reach an
unhalted state from which there is no transition to a normal state in which the machine
executes instructions.
UNPREDICTABLE
UNPREDICTABLE results or occurrences do not disrupt the basic operation of the pro-
cessor; it continues to execute instructions in its normal manner. Further:
Results or occurrences specified as UNPREDICTABLE may vary from moment to
moment, implementation to implementation, and instruction to instruction within
implementations. Software can never depend on results specified as UNPREDICT-
ABLE.
An UNPREDICTABLE result may acquire an arbitrary value subject to a few con-
straints. Such a result may be an arbitrary function of the input operands or of any
state information that is accessible to the process in its current access mode.
UNPREDICTABLE results may be unchanged from their previous values.
Operations that produce UNPREDICTABLE results may also produce exceptions.
An occurrence specified as UNPREDICTABLE may happen or not based on an
arbitrary choice function. The choice function is subject to the same constraints as
are UNPREDICTABLE results and, in particular, must not constitute a security
hole.
Specifically, UNPREDICTABLE results must not depend upon, or be a function of,
the contents of memory locations or registers that are inaccessible to the current
process in the current access mode.
Also, operations that may produce UNPREDICTABLE results must not:
Write or modify the contents of memory locations or registers to which the cur-
rent process in the current access mode does not have access, or
Halt or hang the system or any of its components.
For example, a security hole would exist if some UNPREDICTABLE result
depended on the value of a register in another process, on the contents of processor
temporary registers left behind by some previously running process, or on a
sequence of actions of different processes.
X
Do not care. A capital X represents any valid value.
1
Introduction
This chapter provides a brief introduction to the Alpha architecture, Compaq’s RISC
(reduced instruction set computing) architecture designed for high performance. The
chapter then summarizes the specific features of the Alpha 21264/EV67 microproces-
sor (hereafter called the 21264/EV67) that implements the Alpha architecture. Appen-
dix A provides a list of Alpha instructions.
The companion volume to this manual, the Alpha Architecture Handbook, Version 4,
contains the instruction set architecture. Also available is the Alpha Architecture Refer-
ence Manual, Third Edition, which contains the complete architecture information.
1.1 The Architecture
The Alpha architecture is a 64-bit load and store RISC architecture designed with par-
ticular emphasis on speed, multiple instruction issue, multiple processors, and software
migration from many operating systems.
All registers are 64 bits long and all operations are performed between 64-bit registers.
All instructions are 32 bits long. Memory operations are either load or store operations.
All data manipulation is done between registers.
The Alpha architecture supports the following data types:
8-, 16-, 32-, and 64-bit integers
IEEE 32-bit and 64-bit floating-point formats
VAX architecture 32-bit and 64-bit floating-point formats
In the Alpha architecture, instructions interact with each other only by one instruction
writing to a register or memory location and another instruction reading from that regis-
ter or memory location. This use of resources makes it easy to build implementations
that issue multiple instructions every CPU cycle.
The 21264/EV67 uses a set of subroutines, called privileged architecture library code
(PALcode), that is specific to a particular Alpha operating system implementation and
hardware platform. These subroutines provide operating system primitives for context
switching, interrupts, exceptions, and memory management. These subroutines can be
invoked by hardware or CALL_PAL instructions. CALL_PAL instructions use the
function field of the instruction to vector to a specified subroutine. PALcode is written
in standard machine code with some implementation-specific extensions to provide
direct access to low-level hardware functions. PALcode supports optimizations for mul-
tiple operating systems, flexible memory-management implementations, and multi-
instruction atomic sequences.
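As an illustration of how the function field selects a PALcode subroutine, the following Python sketch splits a 32-bit instruction word into its opcode and function fields. The field positions (opcode in bits [31:26], function in bits [25:0]) and the CALL_PAL opcode value of 0x00 are taken from the Alpha architecture instruction formats rather than from this section, and decode_call_pal is a hypothetical name.

```python
OPCODE_CALL_PAL = 0x00  # assumption: value from the Alpha instruction formats


def decode_call_pal(instruction: int):
    """Return (opcode, function) for a CALL_PAL word, or None for other opcodes."""
    if not 0 <= instruction < (1 << 32):
        raise ValueError("Alpha instructions are 32 bits long")
    opcode = instruction >> 26             # bits [31:26]
    function = instruction & 0x03FF_FFFF   # bits [25:0]
    if opcode != OPCODE_CALL_PAL:
        return None
    return opcode, function


print(decode_call_pal(0x0000_0083))  # (0, 131)
print(decode_call_pal(0x47FF_041F))  # None: opcode 0x11 is not CALL_PAL
```

The entry point that a given function value vectors to is determined by PALcode and hardware and is not modeled here.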
The Alpha architecture performs byte shifting and masking with normal 64-bit, regis-
ter-to-register instructions. The 21264/EV67 performs single-byte and single-word load
and store instructions.
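The shift-and-mask approach to byte manipulation can be modeled on a 64-bit value as follows. This is a sketch of the general technique, not of any specific Alpha instruction; the helper names are hypothetical, and byte 0 is assumed to be the least significant byte of the quadword.

```python
MASK64 = (1 << 64) - 1


def extract_byte(quadword: int, byte_offset: int) -> int:
    """Shift the selected byte down to bits [7:0], then mask off the rest."""
    shift = 8 * (byte_offset & 0x7)       # byte position within the quadword
    return ((quadword & MASK64) >> shift) & 0xFF


def insert_byte(quadword: int, byte_offset: int, value: int) -> int:
    """Clear the selected byte with a mask, then merge in the new value."""
    shift = 8 * (byte_offset & 0x7)
    cleared = quadword & ~(0xFF << shift) & MASK64
    return cleared | ((value & 0xFF) << shift)


q = 0x1122_3344_5566_7788
print(hex(extract_byte(q, 0)))       # 0x88
print(hex(extract_byte(q, 5)))       # 0x33
print(hex(insert_byte(q, 1, 0xAB)))  # 0x112233445566ab88
```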
1.1.1 Addressing
The basic addressable unit in the Alpha architecture is the 8-bit byte. The 21264/EV67
supports a 48-bit or 43-bit virtual address (selectable under IPR control).
Virtual addresses as seen by the program are translated into physical memory addresses
by the memory-management mechanism. The 21264/EV67 supports a 44-bit physical
address.
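The address widths above can be turned into address-space sizes with the binary multiples defined under Terminology and Conventions. The short Python sketch below does the arithmetic; the constant T (2^40) is not defined in this manual and is introduced here only for readability.

```python
K, M, G = 2**10, 2**20, 2**30  # binary multiples as defined in this manual
T = 2**40                      # assumption: tera as a binary multiple


def space_size(address_bits: int) -> int:
    """Number of addressable bytes for a given address width."""
    return 1 << address_bits


for label, bits in (("43-bit virtual", 43),
                    ("48-bit virtual", 48),
                    ("44-bit physical", 44)):
    print(f"{label}: {space_size(bits) // T} TB")
```

Under these definitions the 43-bit and 48-bit virtual address spaces span 8 TB and 256 TB, and the 44-bit physical address space spans 16 TB.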
1.1.2 Integer Data Types
Alpha architecture supports the four integer data types listed in Table 1–1.
Note: Alpha implementations may impose a significant performance penalty
when accessing operands that are not naturally aligned. Refer to the Alpha
Architecture Handbook, Version 4 for details.
1.1.3 Floating-Point Data Types
The 21264/EV67 supports the following floating-point data types:
Longword integer format in floating-point unit
Quadword integer format in floating-point unit
IEEE floating-point formats
– S_floating
– T_floating
VAX floating-point formats
– F_floating
–G_floating
D_floating (limited support)
Table 1–1 Integer Data Types
Data Type Description
Byte A byte is 8 contiguous bits that start at an addressable byte boundary.
A byte is an 8-bit value.
Word A word is 2 contiguous bytes that start at an arbitrary byte boundary.
A word is a 16-bit value.
Longword A longword is 4 contiguous bytes that start at an arbitrary byte boundary. A
longword is a 32-bit value.
Quadword A quadword is 8 contiguous bytes that start at an arbitrary byte boundary.
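The widths of the integer and IEEE floating-point data types listed above can be checked against the standard-size codes of Python's struct module. This is only a sketch: the mapping of S_floating to the 32-bit format and T_floating to the 64-bit format follows the Alpha architecture definitions, and the struct codes stand in for the Alpha types rather than reproducing their in-register layout.

```python
import struct

# Assumption: standard-size struct codes used as stand-ins for Alpha data types.
TYPE_CODES = {
    "byte": "B",
    "word": "H",
    "longword": "I",
    "quadword": "Q",
    "S_floating": "f",
    "T_floating": "d",
}


def width_bits(name: str) -> int:
    """Width of a data type in bits, using standard (not native) struct sizes."""
    return 8 * struct.calcsize("=" + TYPE_CODES[name])


for name in TYPE_CODES:
    print(f"{name:10s} {width_bits(name):2d} bits")
```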
1.2 21264/EV67 Microprocessor Features
The 21264/EV67 microprocessor is a superscalar pipelined processor. It is packaged in
a 587-pin PGA carrier and has removable application-specific heat sinks. A number of
configuration options allow its use in system designs ranging from extremely simple
uniprocessor systems with minimum component count to high-performance
multiprocessor systems with very high cache and memory bandwidth.
The 21264/EV67 can issue four Alpha instructions in a single cycle, thereby minimiz-
ing the average cycles per instruction (CPI). A number of low-latency and/or high-
throughput features in the instruction issue unit and the onchip components of the mem-
ory subsystem further reduce the average CPI.
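The relationship between issue width, best-case CPI, and peak instruction rate is simple arithmetic, sketched below in Python. The 667 MHz figure is one of the clock frequencies named in the thermal tables of this manual and is used only as an example; these are upper bounds on issue rate, not measured performance.

```python
ISSUE_WIDTH = 4  # instructions issued per CPU cycle, from the text above


def best_case_cpi() -> float:
    """Lowest possible average cycles per instruction at full issue width."""
    return 1 / ISSUE_WIDTH


def peak_mips(clock_mhz: int) -> int:
    """Peak rate in millions of instructions per second."""
    return ISSUE_WIDTH * clock_mhz


print(best_case_cpi())  # 0.25
print(peak_mips(667))   # 2668
```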
The 21264/EV67 and associated PALcode implement the IEEE single-precision and
double-precision and the VAX F_floating and G_floating data types, and support
longword (32-bit) and quadword (64-bit) integers. Byte (8-bit) and word (16-bit)
support is provided by byte-manipulation instructions. Limited hardware support is
provided for the VAX D_floating data type.
Other 21264/EV67 features include:
The ability to issue up to four instructions during each CPU clock cycle.
A peak instruction execution rate of four times the CPU clock frequency.
An onchip, demand-paged memory-management unit with translation buffer, which,
when used with PALcode, can implement a variety of page table structures and trans-
lation algorithms. The unit consists of a 128-entry, fully-associative data translation
buffer (DTB) and a 128-entry, fully-associative instruction translation buffer (ITB),
with each entry able to map a single 8KB page or a group of 8, 64, or 512 8KB
pages. The allocation scheme for the ITB and DTB is round-robin. The size of each
translation buffer entry’s group is specified by hint bits stored in the entry. The
DTB and ITB implement 8-bit address space numbers (ASN), MAX_ASN=255.
Two onchip, high-throughput pipelined floating-point units, capable of executing
both VAX and IEEE floating-point data types.
An onchip, 64KB virtually-addressed instruction cache with 8-bit ASNs
(MAX_ASN=255).
An onchip, virtually-indexed, physically-tagged dual-read-ported, 64KB data
cache.
Support for a 48-bit or 43-bit virtual address (program selectable).
Support for a 44-bit physical address.
An onchip I/O write buffer with four 64-byte entries for I/O write transactions.
An onchip, 8-entry victim data buffer.
An onchip, 32-entry load queue.
An onchip, 32-entry store queue.
An onchip, 8-entry miss address file for cache fill requests and I/O read
transactions.
An onchip, 8-entry probe queue, holding pending system port probe commands.
An onchip, duplicate tag array used to maintain level 2 cache coherency.
A 64-bit data bus with onchip parity and error correction code (ECC) support.
Support for an external second-level (Bcache) cache. The size and some timing
parameters of the Bcache are programmable.
An internal clock generator providing a high-speed clock used by the 21264/EV67,
and two clocks for use by the CPU module.
Onchip performance counters to measure and analyze CPU and system perfor-
mance.
Chip and module level test support, including an instruction cache test interface to
support chip and module level testing.
A 2.0-V external interface.
Refer to Chapter 9 for 21264/EV67 dc and ac electrical characteristics. Refer to the
Alpha Architecture Handbook, Version 4, Appendix E, for waivers and any other
implementation-dependent information.
2 Internal Architecture
This chapter provides both an overview of the 21264/EV67 microarchitecture and a sys-
tem designer’s view of the 21264/EV67 implementation of the Alpha architecture. The
combination of the 21264/EV67 microarchitecture and privileged architecture library
code (PALcode) defines the chip’s implementation of the Alpha architecture. If a certain
piece of hardware seems to be “architecturally incomplete,” the missing functionality is
implemented in PALcode. Chapter 6 provides more information on PALcode.
This chapter describes the major functional hardware units and is not intended to be a
detailed hardware description of the chip. It is organized as follows:
21264/EV67 microarchitecture
Pipeline organization
Instruction issue and retire rules
Load instructions to R31/F31 (software-directed instruction prefetch)
Special cases of Alpha instruction execution
Memory and I/O address space
Miss address file (MAF) and load-merging rules
Instruction ordering
Replay traps
I/O write buffer and the WMB instruction
Performance measurement support
Floating-point control register
AMASK and IMPLVER instruction values
Design examples
2.1 21264/EV67 Microarchitecture
The 21264/EV67 microprocessor is a high-performance third-generation implementa-
tion of the Compaq Alpha architecture. The 21264/EV67 consists of the following sec-
tions, as shown in Figure 2–1:
Instruction fetch, issue, and retire unit (Ibox)
Integer execution unit (Ebox)
Floating-point execution unit (Fbox)
Onchip caches (Icache and Dcache)
Memory reference unit (Mbox)
External cache and system interface unit (Cbox)
Pipeline operation sequence
2.1.1 Instruction Fetch, Issue, and Retire Unit
The instruction fetch, issue, and retire unit (Ibox) consists of the following subsections:
Virtual program counter logic
Branch predictor
Instruction-stream translation buffer (ITB)
Instruction fetch logic
Register rename maps
Integer and floating-point issue queues
Exception and interrupt logic
Retire logic
2.1.1.1 Virtual Program Counter Logic
The virtual program counter (VPC) logic maintains the virtual addresses for instruc-
tions that are in flight. There can be up to 80 instructions, in 20 successive fetch slots, in
flight between the register rename mappers and the end of the pipeline. The VPC logic
contains a 20-entry table to store these fetched VPC addresses.
Figure 2–1 21264/EV67 Block Diagram
2.1.1.2 Branch Predictor
The branch predictor is composed of three units: the local, global, and choice predic-
tors. Figure 2–2 shows how the branch predictor generates the predicted branch
address.
[Figure 2–1 block diagram contents: Ibox (fetch unit, branch predictor, VPC queue, ITB,
predecode, instruction cache, decode and rename registers, retire unit, 20-entry integer
issue queue, 15-entry FP issue queue); Ebox (integer units U0 and U1, address ALUs L0 and
L1, two 80-register integer register files); Fbox (FP add/divide/square root, FP multiply,
72 FP registers); Mbox (dual-ported 128-entry DTB, load queue, store queue, miss address
file) with the dual-ported data cache; Cbox (arbiter, victim buffer, IOWB, duplicate tag
store, probe queue) with 128-bit cache data, 20-bit cache index, 64-bit system bus, and
15-bit system address connections.]
Figure 2–2 Branch Predictor
Local Predictor
The local predictor uses a 2-level table that holds the history of individual branches.
The 2-level table design approaches the prediction accuracy of a larger single-level
table while requiring fewer total bits of storage. Figure 2–3 shows how the local pre-
dictor generates a prediction. Bits [11:2] of the VPC of the current branch are used as
the index to a 1K entry table in which each entry is a 10-bit value. This 10-bit value is
used as the index to a 1K entry table of 3-bit saturating counters. The value of the
saturating counter determines the prediction, taken/not-taken, of the current branch.
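The two-level lookup described above can be sketched in software. This is a minimal
model under stated assumptions, not the chip's circuit: the class and method names are
hypothetical, and the training rule (increment on taken, decrement on not-taken, predict
taken when the 3-bit counter is 4 or more) is a conventional choice that this manual
does not spell out.

```python
class LocalPredictor:
    """Two-level local predictor sketch: 1K x 10-bit histories, 1K x 3-bit counters."""

    def __init__(self):
        self.history = [0] * 1024   # per-branch 10-bit taken/not-taken history
        self.counters = [0] * 1024  # 3-bit saturating counters, values 0..7

    @staticmethod
    def _index(vpc):
        # Bits [11:2] of the VPC select the history table entry.
        return (vpc >> 2) & 0x3FF

    def predict(self, vpc):
        hist = self.history[self._index(vpc)]
        # Assumed threshold: upper half of the counter range means "taken".
        return self.counters[hist] >= 4

    def update(self, vpc, taken):
        i = self._index(vpc)
        hist = self.history[i]
        c = self.counters[hist]
        # Assumed training rule: saturating increment/decrement.
        self.counters[hist] = min(7, c + 1) if taken else max(0, c - 1)
        # Shift the newest outcome into the 10-bit history.
        self.history[i] = ((hist << 1) | int(taken)) & 0x3FF
```

Because the counter is selected by the branch's own history pattern, a branch that
strictly alternates between taken and not-taken becomes predictable once both history
patterns have been trained.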
Figure 2–3 Local Predictor
Global Predictor
The global predictor is indexed by a global history of all recent branches. The global
predictor correlates the local history of the current branch with all recent branches.
Figure 2–4 shows how the global predictor generates a prediction. The global path history
comprises the taken/not-taken state of the 12 most-recent branches. These 12
states are used to form an index into a 4K entry table of 2-bit saturating counters. The
value of the saturating counter determines the prediction, taken/not-taken, of the
current branch.
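As an illustration of the global lookup, the following sketch models the 12-bit path
history and the 4K table of 2-bit counters. The names are hypothetical, and the update
rule and the "counter of 2 or more means taken" threshold are assumptions made for the
sketch rather than details taken from this manual.

```python
class GlobalPredictor:
    """Global predictor sketch: 12-bit path history indexing 4K 2-bit counters."""

    def __init__(self):
        self.path_history = 0        # outcomes of the 12 most recent branches
        self.counters = [0] * 4096   # 2-bit saturating counters, values 0..3

    def predict(self):
        # Assumed threshold: 2 or 3 predicts taken, 0 or 1 predicts not-taken.
        return self.counters[self.path_history] >= 2

    def update(self, taken):
        h = self.path_history
        c = self.counters[h]
        # Assumed training rule: saturating increment/decrement.
        self.counters[h] = min(3, c + 1) if taken else max(0, c - 1)
        # Shift the newest outcome into the 12-bit path history.
        self.path_history = ((h << 1) | int(taken)) & 0xFFF
```

Unlike the local predictor, the index here does not depend on the address of the
branch being predicted, only on the outcomes of the branches that preceded it.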
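[Figure 2–2: the local predictor, global predictor, and choice predictor combine to
produce the predicted branch address.]
[Figure 2–3: VPC[11:2] supplies a 10-bit index into the local history table (1K x 10);
the selected 10-bit history indexes the local predictor (1K x 3), which yields the 1-bit
local branch prediction.]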
Figure 2–4 Global Predictor
Choice Predictor
The choice predictor monitors the history of the local and global predictors and chooses
the best of the two predictors for a particular branch. Figure 2–5 shows how the choice
predictor generates its choice of the result of the local or global prediction. The 12-bit
global path history (see Figure 2–4) is used to index a 4K entry table of 2-bit saturating
counters. The value of the saturating counter determines the choice between the outputs
of the local and global predictors.
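The selection step can be sketched as follows. The manual states only that a 4K table of
2-bit counters, indexed by the global path history, chooses between the two predictions.
The polarity (2 or more selects the global predictor) and the training rule (update only
when the two predictors disagree, moving toward the one that was right) are assumptions
of this sketch, and all names are hypothetical.

```python
class ChoicePredictor:
    """Choice predictor sketch: 4K 2-bit counters indexed by global path history."""

    def __init__(self):
        self.counters = [0] * 4096  # 2-bit saturating counters, values 0..3

    def choose(self, path_history, local_pred, global_pred):
        # Assumed polarity: 2 or 3 selects the global prediction.
        use_global = self.counters[path_history & 0xFFF] >= 2
        return global_pred if use_global else local_pred

    def update(self, path_history, local_pred, global_pred, taken):
        # Assumed rule: train only when the predictors disagree.
        if local_pred == global_pred:
            return
        i = path_history & 0xFFF
        c = self.counters[i]
        if global_pred == taken:
            self.counters[i] = min(3, c + 1)
        else:
            self.counters[i] = max(0, c - 1)
```

Training only on disagreement keeps the counters from drifting when both predictors are
equally right or equally wrong, so each counter records which predictor has been more
reliable for that path history.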
Figure 2–5 Choice Predictor
2.1.1.3 Instruction-Stream Translation Buffer
The Ibox includes a 128-entry, fully-associative instruction-stream translation buffer
(ITB) that is used to store recently used instruction-stream (Istream) address transla-
tions and page protection information. Each of the entries in the ITB can map 1, 8, 64,
or 512 contiguous 8KB pages. The allocation scheme is round-robin.
The ITB supports an 8-bit ASN and contains an ASM bit. The Icache is virtually
addressed and contains the access-check information, so the ITB is accessed only for
Istream references that miss in the Icache.
Istream transactions to I/O address space are UNDEFINED.
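The following sketch shows how the mapped region size follows from the group size, and
how an entry might be matched. It assumes the Alpha architecture's 2-bit granularity hint
encoding, in which a hint value N maps a group of 8**N pages, and the architectural
meaning of the ASM bit (the entry matches in every address space). The function names and
the flat argument list are hypothetical and do not describe the hardware comparators.

```python
PAGE_SIZE = 8 * 1024  # 8KB base page


def tb_entry_span(granularity_hint):
    """Bytes mapped by one entry: 1, 8, 64, or 512 contiguous 8KB pages."""
    if not 0 <= granularity_hint <= 3:
        raise ValueError("granularity hint is a 2-bit field")
    return PAGE_SIZE * 8 ** granularity_hint


def tb_hit(entry_va, entry_asn, entry_asm, hint, va, current_asn):
    """True if the entry translates va in the current address space."""
    span = tb_entry_span(hint)
    same_region = (entry_va // span) == (va // span)
    # An entry with ASM set matches regardless of the current ASN.
    asn_ok = entry_asm or entry_asn == current_asn
    return same_region and asn_ok
```

With the largest group size, a single entry covers 512 x 8KB = 4MB of virtual address
space.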
[Figure 2–4: the global path history supplies a 12-bit index into the global predictor
(4K x 2), which yields the 1-bit global branch prediction.]
[Figure 2–5: the global path history supplies a 12-bit index into the choice predictor
(4K x 2), which yields the choice prediction.]
2.1.1.4 Instruction Fetch Logic
The instruction prefetcher (predecode) reads an octaword, containing up to four natu-
rally aligned instructions per cycle, from the Icache. Branch prediction and line predic-
tion bits accompany the four instructions. The branch prediction scheme operates most
efficiently when only one branch instruction is contained among the four fetched
instructions. The line prediction scheme attempts to predict the Icache line that the
branch predictor will generate, and is described in Section 2.2.
An entry from the subroutine return prediction stack, together with set prediction bits
for use by the Icache stream controller, is fetched along with the octaword. The Icache
stream controller generates fetch requests for additional Icache lines and stores the
Istream data in the Icache. There is no separate buffer to hold Istream requests.
2.1.1.5 Register Rename Maps
The instruction prefetcher forwards instructions to the integer and floating-point regis-
ter rename maps. The rename maps perform the two functions listed here:
Eliminate register write-after-read (WAR) and write-after-write (WAW) data
dependencies while preserving true read-after-write (RAW) data dependencies, in
order to allow instructions to be dynamically rescheduled.
Provide a means of speculatively executing instructions before the control flow
previous to those instructions is resolved. Both exceptions and branch
mispredictions represent deviations from the control flow predicted by the
instruction prefetcher.
The map logic translates each instruction’s operand register specifiers from the virtual
register numbers in the instruction to the physical register numbers that hold the corre-
sponding architecturally-correct values. The map logic also renames each instruction’s
destination register specifier from the virtual number in the instruction to a physical
register number chosen from a list of free physical registers, and updates the register
maps.
The map logic can process four instructions per cycle. It does not return the physical
register, which holds the old value of an instruction’s virtual destination register, to the
free list until the instruction has been retired, indicating that the control flow up to that
instruction has been resolved.
If a branch mispredict or exception occurs, the map logic backs up the contents of the
integer and floating-point register rename maps to the state associated with the instruc-
tion that triggered the condition, and the prefetcher restarts at the appropriate VPC. At
most, 20 valid fetch slots containing up to 80 instructions can be in flight between the
register maps and the end of the machine’s pipeline, where the control flow is finally
resolved. The map logic is capable of backing up the contents of the maps to the state
associated with any of these 80 instructions in a single cycle.
The register rename logic places instructions into an integer or floating-point issue
queue, from which they are later issued to functional units for execution.
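The renaming steps described in this section can be sketched as a small model. It is
illustrative only: the class and method names are hypothetical, the 39 virtual registers
(31 integer registers plus 8 PALshadow registers) are assumed to be identity-mapped at
reset, and rollback is modeled as an undo log even though the hardware restores the maps
in a single cycle.

```python
class RenameMap:
    """Register rename sketch: 39 virtual registers mapped onto 80 physical ones."""

    def __init__(self, num_virtual=39, num_physical=80):
        self.map = list(range(num_virtual))                 # virtual -> physical
        self.free = list(range(num_virtual, num_physical))  # 41 free registers
        self.in_flight = []  # (virtual dest, old physical, new physical)

    def rename(self, srcs, dest):
        # Sources are looked up before the destination is remapped,
        # which preserves true read-after-write dependencies.
        phys_srcs = [self.map[s] for s in srcs]
        # A fresh destination register removes WAR and WAW dependencies.
        # (An empty free list would stall the hardware; here it raises IndexError.)
        new_phys = self.free.pop(0)
        self.in_flight.append((dest, self.map[dest], new_phys))
        self.map[dest] = new_phys
        return phys_srcs, new_phys

    def retire_oldest(self):
        # The old physical register is freed only when the instruction retires.
        _, old_phys, _ = self.in_flight.pop(0)
        self.free.append(old_phys)

    def rollback(self, keep):
        # Undo every rename younger than the first `keep` in-flight instructions.
        while len(self.in_flight) > keep:
            dest, old_phys, new_phys = self.in_flight.pop()
            self.map[dest] = old_phys
            self.free.insert(0, new_phys)
```

For example, two back-to-back writes to the same virtual register receive different
physical registers, so the second write does not have to wait for readers of the first.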
2.1.1.6 Integer Issue Queue
The 20-entry integer issue queue (IQ), associated with the integer execution units
(Ebox), issues the following types of instructions at a maximum rate of four per cycle:
Integer operate
Integer conditional branch
Unconditional branch – both displacement and memory format
Integer and floating-point load and store
PAL-reserved instructions: HW_MTPR, HW_MFPR, HW_LD, HW_ST,
HW_RET
Integer-to-floating-point (ITOFx) and floating-point-to-integer (FTOIx)
Each queue entry asserts four request signals, one for each of the Ebox subclusters. A
queue entry asserts a request when it contains an instruction that can be executed by the
subcluster, if the instruction's operand register values are available within the
subcluster.
There are two arbiters, one for the upper subclusters and one for the lower subclusters.
(Subclusters are described in Section 2.1.2.) Each arbiter picks two of the possible 20
requesters for service each cycle. A given instruction requests either the upper
subclusters or the lower subclusters, never both. Because many instructions can execute
in only one of the two types anyway, this restriction costs little performance.
For example, load and store instructions can only go to lower subclusters and shift
instructions can only go to upper subclusters. Other instructions, such as addition and
logic operations, can execute in either upper or lower subclusters and are statically
assigned before being placed in the IQ.
The IQ arbiters choose between simultaneous requesters of a subcluster based on the
age of the request—older requests are given priority over newer requests. If a given
instruction requests both lower subclusters, and no older instruction requests a lower
subcluster, then the arbiter assigns subcluster L0 to the instruction. If a given instruction
requests both upper subclusters, and no older instruction requests an upper subcluster,
then the arbiter assigns subcluster U1 to the instruction. This asymmetry between the
upper and lower subcluster arbiters is a circuit implementation optimization with negli-
gible overall performance effect.
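The age-based selection can be sketched as follows. This is a simplified model under
stated assumptions: each request is a hypothetical tuple of (program-order sequence
number, name, set of subclusters it can use), a smaller sequence number means an older
request, and the handling of the second grant is an assumption, since the text defines
only the preference applied to the oldest requester.

```python
def arbitrate(requests, preferred, other):
    """Grant up to two requesters, oldest first, for one pair of subclusters.

    preferred is the subcluster given to an unconstrained oldest request:
    "L0" for the lower arbiter, "U1" for the upper arbiter.
    """
    grants = {}
    for _seq, name, allowed in sorted(requests, key=lambda r: r[0]):
        for sub in (preferred, other):
            if sub in allowed and sub not in grants:
                grants[sub] = name
                break
        if len(grants) == 2:
            break
    return grants


# Hypothetical requests for the lower subclusters.
lower_requests = [
    (5, "ldq", {"L0", "L1"}),
    (9, "stq", {"L0", "L1"}),
    (2, "addq", {"L0", "L1"}),
]
lower_grants = arbitrate(lower_requests, "L0", "L1")
```

In this example the oldest request (addq) is assigned L0, the next oldest (ldq) takes L1,
and the youngest waits for a later cycle.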
2.1.1.7 Floating-Point Issue Queue
The 15-entry floating-point issue queue (FQ) associated with the Fbox issues the fol-
lowing instruction types:
Floating-point operates
Floating-point conditional branches
Floating-point stores
Floating-point register to integer register transfers (FTOIx)
Each queue entry has three request lines—one for the add pipeline, one for the multiply
pipeline, and one for the two store pipelines. There are three arbiters—one for each of
the add, multiply, and store pipelines. The add and multiply arbiters pick one requester
per cycle, while the store pipeline arbiter picks two requesters per cycle, one for each
store pipeline.
The FQ arbiters pick between simultaneous requesters of a pipeline based on the age of
the request—older requests are given priority over newer requests. Floating-point store
instructions and FTOIx instructions in even-numbered queue entries arbitrate for one
store port. Floating-point store instructions and FTOIx instructions in odd-numbered
queue entries arbitrate for the second store port.
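A minimal sketch of the floating-point queue arbitration follows. The dictionary keys,
the port names store0 and store1, and the assignment of even-numbered entries to store0
are hypothetical labels chosen for the sketch; only requests whose operands are ready are
assumed to be passed in.

```python
def fq_arbitrate(entries):
    """Pick the oldest requester for each of the add, multiply, and two store ports.

    Each entry is a dict with:
      slot - queue entry number (0..14)
      seq  - program order, smaller is older
      pipe - "add", "mul", or "store" (store also covers FTOIx)
    """
    grants = {}
    for e in sorted(entries, key=lambda e: e["seq"]):
        if e["pipe"] == "store":
            # Even-numbered entries use one store port, odd-numbered the other.
            port = "store0" if e["slot"] % 2 == 0 else "store1"
        else:
            port = e["pipe"]
        # The first (oldest) requester seen for a port keeps the grant.
        grants.setdefault(port, e["slot"])
    return grants
```

Note that two ready stores in even-numbered entries cannot both issue in one cycle under
this scheme, even if the other store port is idle.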
Floating-point store instructions and FTOIx instructions are queued in both the integer
and floating-point queues. They wait in the floating-point queue until their operand reg-
ister values are available. They subsequently request service from the store arbiter.
Upon being issued from the floating-point queue, they signal the corresponding entry in
the integer queue to request service. Upon being issued from the integer queue, the
operation is completed.
2.1.1.8 Exception and Interrupt Logic
There are two types of exceptions: faults and synchronous traps. Arithmetic exceptions
are precise and are reported as synchronous traps.
The four sources of interrupts are listed as follows:
Level-sensitive hardware interrupts sourced by the IRQ_H[5:0] pins
Edge-sensitive hardware interrupts generated by the serial line receive pin,
performance counter overflows, and hardware corrected read errors
Software interrupts sourced by the software interrupt request (SIRR) register
Asynchronous system traps (ASTs)
Interrupt sources can be individually masked. In addition, AST interrupts are qualified
by the current processor mode.
2.1.1.9 Retire Logic
The Ibox fetches instructions in program order, executes them out of order, and then
retires them in order. The Ibox retire logic maintains the architectural state of the
machine by retiring an instruction only if all previous instructions have executed with-
out generating exceptions or branch mispredictions. Retiring an instruction commits the
machine to any changes the instruction may have made to the software-visible state.
The three software-visible states are listed as follows:
Integer and floating-point registers
Memory
Internal processor registers (including control/status registers and translation
buffers)
The retire logic can sustain a maximum retire rate of eight instructions per cycle, and
can retire as many as 11 instructions in a single cycle.
2.1.2 Integer Execution Unit
The integer execution unit (Ebox) is a 4-path integer execution unit that is implemented
as two functional-unit “clusters” labeled 0 and 1. Each cluster contains a copy of an 80-
entry, physical-register file and two “subclusters”, named upper (U) and lower (L). Fig-
ure 2–6 shows the integer execution unit. In the figure, iop_wr is the cross-cluster bus
for moving integer result values between clusters.
Figure 2–6 Integer Execution Unit—Clusters 0 and 1
Most instructions have 1-cycle latency for consumers that execute within the same clus-
ter. Also, there is another 1-cycle delay associated with producing a value in one cluster
and consuming the value in the other cluster. The instruction issue queue minimizes the
performance effect of this cross-cluster delay. The Ebox contains the following
resources:
Four 64-bit adders that are used to calculate results for integer add instructions
(located in U0, U1, L0, and L1)
The adders in the lower subclusters that are used to generate the effective virtual
address for load and store instructions (located in L0 and L1)
Four logic units
Two barrel shifters and associated byte logic (located in U0 and U1)
Two sets of conditional branch logic (located in U0 and U1)
Two copies of an 80-entry register file
One pipelined multiplier (located in U1) with 7-cycle latency for all integer multiply
operations
One fully-pipelined unit (located in U0), with 3-cycle latency, that executes the fol-
lowing instructions:
CTLZ, CTPOP, CTTZ
PERR, MINxxx, MAXxxx, UNPKxx, PKxx
[Figure 2–6 shows clusters 0 and 1, each with a register file, a lower subcluster (L0 or
L1) that produces eff_VA, an upper subcluster (U0 or U1), a load/store data path, and
iop_wr connections.]
The Ebox has 80 register-file entries that contain storage for the values of the 31 Alpha
integer registers (the value of R31 is not stored), the values of 8 PALshadow registers,
and 41 results written by instructions that have not yet been retired.
Ignoring cross-cluster delay, the two copies of the Ebox register file contain identical
values. Each copy of the Ebox register file contains four read ports and six write ports.
The four read ports are used to source operands to each of the two subclusters within a
cluster. The six write ports are used as follows:
Two write ports are used to write results generated within the cluster.
Two write ports are used to write results generated by the other cluster.
Two write ports are used to write results from load instructions. These two ports
are also used for FTOIx instructions.
2.1.3 Floating-Point Execution Unit
The floating-point execution unit (Fbox) has two paths. The Fbox executes both VAX
and IEEE floating-point instructions. It supports IEEE S_floating-point and
T_floating-point data types and all rounding modes. It also supports VAX F_floating-point and
G_floating-point data types, and provides limited support for D_floating-point format.
The basic structure of the floating-point execution unit is shown in Figure 2–7.
Figure 2–7 Floating-Point Execution Units
The Fbox contains the following resources:
72-entry physical register file
Fully-pipelined multiplier with 4-cycle latency
Fully-pipelined adder with 4-cycle latency
Nonpipelined divide unit associated with the adder pipeline
Nonpipelined square root unit associated with the adder pipeline
The 72 Fbox register file entries contain storage for the values of the 31 Alpha floating-
point registers (F31 is not stored) and 41 values written by instructions that have not
been retired.
[Figure 2–7 shows the floating-point execution units: register file (Reg), FP Mul,
FP Add, FP Div, and SQRT.]
The Fbox register file contains six read ports and four write ports. Four read ports are
used to source operands to the add and multiply pipelines, and two read ports are used
to source data for store instructions. Two write ports are used to write results generated
by the add and multiply pipelines, and two write ports are used to write results from
floating-point load instructions.
2.1.4 External Cache and System Interface Unit
The interface for the system and external cache (Cbox) controls the Bcache and system
ports. It contains the following structures:
Victim address file (VAF)
Victim data file (VDF)
I/O write buffer (IOWB)
Probe queue (PQ)
Duplicate Dcache tag (DTAG)
2.1.4.1 Victim Address File and Victim Data File
The victim address file (VAF) and victim data file (VDF) together form an 8-entry vic-
tim buffer used for holding:
Dcache blocks to be written to the Bcache
Istream cache blocks from memory to be written to the Bcache
Bcache blocks to be written to memory
Cache blocks sent to the system in response to probe commands
2.1.4.2 I/O Write Buffer
The I/O write buffer (IOWB) consists of four 64-byte entries and associated address
and control logic used for buffering I/O write data between the store queue and the sys-
tem port.
2.1.4.3 Probe Queue
The probe queue (PQ) is an 8-entry queue that holds pending system port cache probe
commands and addresses.
2.1.4.4 Duplicate Dcache Tag Array
The duplicate Dcache tag (DTAG) array holds a duplicate copy of the Dcache tags and
is used by the Cbox when processing Dcache fills, Icache fills, and system port probes.
2.1.5 Onchip Caches
The 21264/EV67 contains two onchip primary-level caches.
2.1.5.1 Instruction Cache
The instruction cache (Icache) is a 64KB virtual-addressed, 2-way set-predict cache.
Set prediction is used to approximate the performance of a 2-set cache without slowing
the cache access time. Each Icache block contains:
16 Alpha instructions (64 bytes)
Virtual tag bits [47:15]
8-bit address space number (ASN) field
1-bit address space match (ASM) bit
1-bit PALcode bit to indicate physical addressing
Valid bit
Data and tag parity bits
Four access-check bits for the following modes: kernel, executive, supervisor, and
user (KESU)
Additional predecoded information to assist with instruction processing and fetch
control
2.1.5.2 Data Cache
The data cache (Dcache) is a 64KB, 2-way set-associative, virtually indexed, physically
tagged, write-back, read/write allocate cache with 64-byte blocks. During each cycle
the Dcache can perform one of the following transactions:
Two quadword (or shorter) read transactions to arbitrary addresses
Two quadword write transactions to the same aligned octaword
Two non-overlapping less-than-quadword writes to the same aligned quadword
One sequential read and write transaction from and to the same aligned octaword
Each Dcache block contains:
64 data bytes and associated quadword ECC bits
Physical tag bits
Valid, dirty, shared, and modified bits
Tag parity bit calculated across the tag, dirty, shared, and modified bits
One bit to control round-robin set allocation (one bit per two cache blocks)
The Dcache contains two sets, each with 512 rows containing 64-byte blocks per row
(that is, 32K bytes of data per set). The 21264/EV67 requires two additional bits of vir-
tual address beyond the bits that specify an 8KB page, in order to specify a Dcache row
index. A given virtual address might be found in four unique locations in the Dcache,
depending on the virtual-to-physical translation for those two bits. The 21264/EV67
prevents this aliasing by keeping only one of the four possible translated addresses in
the cache at any time.
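The arithmetic behind the four possible locations can be checked with a short sketch.
The function names are hypothetical. With 64-byte blocks and 512 rows, the row index is
address bits [14:6], while an 8KB page fixes only bits [12:0], so bits [14:13] depend on
the translation.

```python
BLOCK_BYTES = 64        # bytes per Dcache block
ROWS = 512              # rows per set
PAGE_BYTES = 8 * 1024   # 8KB page


def dcache_row(addr):
    """Row index of an address: bits [14:6]."""
    return (addr // BLOCK_BYTES) % ROWS


def alias_rows(va):
    """Rows that could hold va's data, one per value of address bits [14:13]."""
    offset = va % PAGE_BYTES
    return [dcache_row((hi << 13) | offset) for hi in range(4)]
```

For example, alias_rows(0x1234) returns rows 72, 200, 328, and 456, which are spaced 128
rows (one 8KB page) apart.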
2.1.6 Memory Reference Unit
The memory reference unit (Mbox) controls the Dcache and ensures architecturally
correct behavior for load and store instructions. The Mbox contains the following struc-
tures:
Load queue (LQ)
Store queue (SQ)
Miss address file (MAF)
Dstream translation buffer (DTB)
2.1.6.1 Load Queue
The load queue (LQ) is a reorder buffer for load instructions. It contains 32 entries and
maintains the state associated with load instructions that have been issued to the Mbox,
but for which results have not been delivered to the processor and the instructions
retired. The Mbox assigns load instructions to LQ slots based on the order in which
they were fetched from the Icache, then places them into the LQ after they are issued by
the IQ. The LQ helps ensure correct Alpha memory reference behavior.
2.1.6.2 Store Queue
The store queue (SQ) is a reorder buffer and graduation unit for store instructions. It
contains 32 entries and maintains the state associated with store instructions that have
been issued to the Mbox, but for which data has not been written to the Dcache and the
instruction retired. The Mbox assigns store instructions to SQ slots based on the order
in which they were fetched from the Icache and places them into the SQ after they are
issued by the IQ. The SQ holds data associated with store instructions issued from the
IQ until they are retired, at which point the store can be allowed to update the Dcache.
The SQ also helps ensure correct Alpha memory reference behavior.
2.1.6.3 Miss Address File
The 8-entry miss address file (MAF) holds physical addresses associated with pending
Icache and Dcache fill requests and pending I/O space read transactions.
2.1.6.4 Dstream Translation Buffer
The Mbox includes a 128-entry, fully associative Dstream translation buffer (DTB) used
to store Dstream address translations and page protection information. Each of the entries
in the DTB can map 1, 8, 64, or 512 contiguous 8KB pages. The allocation scheme is
round-robin. The DTB supports an 8-bit ASN and contains an ASM bit.
2.1.7 SROM Interface
The serial read-only memory (SROM) interface provides the initialization data load
path from a system SROM to the Icache. Refer to Chapter 7 for more information.
2.2 Pipeline Organization
The 7-stage pipeline provides an optimized environment for executing Alpha instruc-
tions. The pipeline stages (0 to 6) are shown in Figure 2–8 and described in the follow-
ing paragraphs.
Figure 2–8 Pipeline Organization
Stage 0 Instruction Fetch
The branch predictor uses a branch history algorithm to predict a branch instruction tar-
get address.
Up to four aligned instructions are fetched from the Icache, in program order. The
branch prediction tables are also accessed in this cycle. The branch predictor uses tables
and a branch history algorithm to predict a branch instruction target address for one
branch or memory format JSR instruction per cycle. Therefore, the prefetcher is limited
to fetching through one branch per cycle. If there is more than one branch within the
fetch line, and the branch predictor predicts that the first branch will not be taken, it will
predict through subsequent branches at the rate of one per cycle, until it predicts a taken
branch or predicts through the last branch in the fetch line.
The Icache array also contains a line prediction field, the contents of which are applied
to the Icache in the next cycle. The purpose of the line predictor is to remove the pipe-
line bubble which would otherwise be created when the branch predictor predicts a
branch to be taken. In effect, the line predictor attempts to predict the Icache line which
the branch predictor will generate. On fills, the line predictor value at each fetch line is
initialized with the index of the next sequential fetch line, and later retrained by the
branch predictor if necessary.
Stage 1 Instruction Slot
The Ibox maps four instructions per cycle from the 64KB 2-way set-predict Icache.
Instructions are mapped in order, executed dynamically, but are retired in order.
[Figure 2–8 shows pipeline stages 0 through 6: the branch predictor and 64KB 2-set
instruction cache; the integer and floating-point register rename maps; the 20-entry
integer and 15-entry floating-point issue queues; the integer and floating-point register
files; the integer ALU, shifter, multiplier, and address units together with the
floating-point add, divide, and square root unit and the floating-point multiply unit;
the 64KB data cache; and the bus interface unit with a 64-bit system bus, a 128-bit cache
bus, and a 44-bit physical address.]
In the slot stage, the branch predictor compares the next Icache index that it generates to
the index that was generated by the line predictor. If there is a mismatch, the branch
predictor wins—the instructions fetched during that cycle are aborted, and the index
predicted by the branch predictor is applied to the Icache during the next cycle. Line
mispredictions result in one pipeline bubble.
The line predictor takes precedence over the branch predictor during memory format
calls or jumps. If the line predictor was trained with a true (as opposed to predicted)
memory format call or jump target, then its contents take precedence over the target
hint field associated with these instructions. This allows dynamic calls or jumps to be
correctly predicted.
The instruction fetcher produces the full VPC address during the fetch stage of the pipe-
line. The Icache produces the tags for both Icache sets 0 and 1 each time it is accessed.
That enables the fetcher to separate set mispredictions from true Icache misses. If the
access was caused by a set misprediction, the instruction fetcher aborts the last two
fetched slots and refetches the slot in the next cycle. It also retrains the appropriate set
prediction bits.
The instruction data is transferred from the Icache to the integer and floating-point reg-
ister map hardware during this stage. When the integer instruction is fetched from the
Icache and slotted into the IQ, the slot logic determines whether the instruction is for
the upper or lower subclusters. The slot logic makes the decision based on the
resources needed by the (up to four) integer instructions in the fetch block. Although all
four instructions need not be issued simultaneously, distributing their resource usage
improves instruction loading across the units. For example, if a fetch block contains
two instructions that can be placed in either cluster followed by two instructions that
must execute in the lower cluster, the slot logic would designate that combination as
EELL and slot them as UULL. Slot combinations are described in Section 2.3.2 and
Table 2–3.
Stage 2 Map
Instructions are sent from the Icache to the integer and floating-point register maps dur-
ing the slot stage and register renaming is performed during the map stage. Also, each
instruction is assigned a unique 8-bit number, called an inum, which is used to identify
the instruction and its program order with respect to other instructions during the time
that it is in flight. Instructions are considered to be in flight between the time they are
mapped and the time they are retired.
Mapped instructions and their associated inums are placed in the integer and floating-
point queues by the end of the map stage.
Stage 3 Issue
The 20-entry integer issue queue (IQ) issues instructions at the rate of four per cycle.
The 15-entry floating-point issue queue (FQ) issues floating-point operate instructions,
conditional branch instructions, and store instructions, at the rate of two per cycle. Nor-
mally, instructions are deleted from the IQ or FQ two cycles after they are issued. For
example, if an instruction is issued in cycle n, it remains in the FQ or IQ in cycle n+1
but does not request service, and is deleted in cycle n+2.
Stage 4 Register Read
Instructions issued from the issue queues read their operands from the integer and float-
ing-point register files and receive bypass data.
Stage 5 Execute
The Ebox and Fbox pipelines begin execution.
Stage 6 Dcache Access
Memory reference instructions access the Dcache and data translation buffers. Nor-
mally load instructions access the tag and data arrays while store instructions only
access the tag arrays. Store data is written to the store queue where it is held until the
store instruction is retired. Most integer operate instructions write their register results
in this cycle.
2.2.1 Pipeline Aborts
The abort penalty as given is measured from the cycle after the fetch stage of the
instruction which triggers the abort to the fetch stage of the new target, ignoring any
Ibox pipeline stalls or queuing delay that the triggering instruction might experience.
Table 2–1 lists the timing associated with each common source of pipeline abort.
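As a rough illustration of how the penalties in Table 2–1 combine with instruction latency, the following Python sketch encodes the table as a lookup. The names ABORT_PENALTY and abort_penalty are hypothetical and are not part of the 21264/EV67 documentation; only the cycle counts come from Table 2–1.

```python
# Penalties in GCLK cycles, copied from Table 2-1.
# The dictionary keys and the function name are illustrative assumptions.
ABORT_PENALTY = {
    "branch_mispredict": 7,
    "jsr_mispredict": 8,
    "mbox_order_trap": 14,
    "other_mbox_replay_trap": 13,
    "dtb_miss": 13,
    "itb_miss": 7,
    "integer_arith_trap": 12,
    "fp_arith_trap": 13,  # plus the latency of the trapping instruction
}

def abort_penalty(condition, instruction_latency=0):
    """Return the abort penalty in cycles for one abort condition."""
    penalty = ABORT_PENALTY[condition]
    if condition == "fp_arith_trap":
        # Table 2-1 gives this case as "13+latency".
        penalty += instruction_latency
    return penalty
```

For example, a floating-point arithmetic trap on an instruction with a 4-cycle latency would give abort_penalty("fp_arith_trap", 4), that is, 13 + 4 cycles.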
Table 2–1 Pipeline Abort Delay (GCLK Cycles)
Abort Condition                 Penalty (Cycles)  Comments
Branch misprediction            7                 Integer or floating-point conditional branch misprediction.
JSR misprediction               8                 Memory format JSR or HW_RET.
Mbox order trap                 14                Load-load order or store-load order.
Other Mbox replay traps         13
DTB miss                        13
ITB miss                        7
Integer arithmetic trap         12
Floating-point arithmetic trap  13+latency        Add latency of instruction. See Section 2.3.3 for instruction latencies.
2.3 Instruction Issue Rules
This section defines instruction classes, the functional unit pipelines to which they are
issued, and their associated latencies.
2.3.1 Instruction Group Definitions
Table 2–2 lists the instruction class, the pipeline assignments, and the instructions
included in the class.
Table 2–2 Instruction Name, Pipeline, and Types
Class Name  Pipeline                  Instruction Type
ild         L0, L1                    All integer load instructions
fld         L0, L1                    All floating-point load instructions
ist         L0, L1                    All integer store instructions
fst         FST0, FST1, L0, L1        All floating-point store instructions
lda         L0, L1, U0, U1            LDA, LDAH
mem_misc    L1                        WH64, ECB, WMB
rpcc        L1                        RPCC
rx          L1                        RS, RC
mxpr        L0, L1 (depends on IPR)   HW_MTPR, HW_MFPR
ibr         U0, U1                    Integer conditional branch instructions
jsr         L0                        BR, BSR, JMP, CALL, RET, COR, HW_RET, CALL_PAL
iadd        L0, U0, L1, U1            Instructions with opcode 10₁₆, except CMPBGE
ilog        L0, U0, L1, U1            AND, BIC, BIS, ORNOT, XOR, EQV, CMPBGE
ishf        U0, U1                    Instructions with opcode 12₁₆
cmov        L0, U0, L1, U1            Integer CMOV, either cluster
imul        U1                        Integer multiply instructions
imisc       U0                        CTLZ, CTPOP, CTTZ, PERR, MINxxx, MAXxxx, PKxx, UNPKxx
fbr         FA                        Floating-point conditional branch instructions
fadd        FA                        All floating-point operate instructions except multiply, divide, square root, and conditional move instructions
fmul        FM                        Floating-point multiply instruction
fcmov1      FA                        Floating-point CMOV, first half
fcmov2      FA                        Floating-point CMOV, second half
fdiv        FA                        Floating-point divide instruction
fsqrt       FA                        Floating-point square root instruction
nop         None                      TRAP, EXCB, UNOP - LDQ_U R31, 0(Rx)
ftoi        FST0, FST1, L0, L1        FTOIS, FTOIT
itof        L0, L1                    ITOFS, ITOFF, ITOFT
mx_fpcr     FM                        Instructions that move data from the floating-point control register
2.3.2 Ebox Slotting
Instructions that are issued from the IQ, and could execute in either upper or lower
Ebox subclusters, are slotted to one pair or the other during the pipeline mapping stage
based on the instruction mixture in the fetch line. The codes that are used in Table 2–3
are as follows:
U: The instruction only executes in an upper subcluster.
L: The instruction only executes in a lower subcluster.
E: The instruction could execute in either an upper or lower subcluster.
Table 2–3 defines the slotting rules. The table field Instruction Class 3, 2, 1 and 0
identifies each instruction's location in the fetch line by the value of bits [3:2] in its PC.
Table 2–3 Instruction Group Definitions and Pipeline Unit
Instruction Class  Slotting   Instruction Class  Slotting
3 2 1 0            3 2 1 0    3 2 1 0            3 2 1 0
E E E E U L U L L L L L L L L L
E E E L U L U L L L L U L L L U
E E E U U L L U L L U E L L U U
E E L E U L L U L L U L L L U L
E E L L U U L L L L U U L L U U
E E L U U L L U L U E E L U L U
E E U E U L U L L U E L L U U L
E E U L U L U L L U E U L U L U
E E U U L L U U L U L E L U L U
E L E E U L U L L U L L L U L L
E L E L U L U L L U L U L U L U
E L E U U L L U L U U E L U U L
E L L E U L L U L U U L L U U L
E L L L U L L L L U U U L U U U
E L L U U L L U U E E E U L U L
E L U E U L U L U E E L U L U L
E L U L U L U L U E E U U L L U
E L U U L L U U U E L E U L L U
E U E E L U L U U E L L U U L L
E U E L L U U L U E L U U L L U
E U E U L U L U U E U E U L U L
E U L E L U L U U E U L U L U L
E U L L U U L L U E U U U L U U
E U L U L U L U U L E E U L U L
E U U E L U U L U L E L U L U L
E U U L L U U L U L E U U L L U
E U U U L U U U U L L E U L L U
L E E E L U L U U L L L U L L L
L E E L L U U L U L L U U L L U
L E E U L U L U U L U E U L U L
L E L E L U L U U L U L U L U L
L E L L L U L L U L U U U L U U
L E L U L U L U U U E E U U L L
L E U E L U U L U U E L U U L L
L E U L L U U L U U E U U U L U
L E U U L L U U U U L E U U L L
L L E E L L U U U U L L U U L L
L L E L L L U L U U L U U U L U
L L E U L L U U U U U E U U U L
L L L E L L L U U U U L U U U L
U U U U U U U U
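The slotting rules in Table 2–3 amount to a fixed lookup from the four instruction classes in a fetch line to four subcluster assignments. The Python sketch below copies only a handful of rows from Table 2–3 to show the idea; the names SLOTTING, slot, and respects_constraints are hypothetical, and the partial dictionary is an illustration rather than a model of the actual slot logic.

```python
# A few rows copied from Table 2-3. Keys and values list instructions
# 3, 2, 1, 0 from left to right. The table is deliberately incomplete.
SLOTTING = {
    "EEEE": "ULUL",
    "EELL": "UULL",
    "EEUU": "LLUU",
    "ELLU": "ULLU",
    "LLLL": "LLLL",
    "UUUU": "UUUU",
}

def slot(classes):
    """Look up the subcluster assignment for a 4-instruction fetch line."""
    return SLOTTING[classes]

def respects_constraints(classes, slots):
    """True if every U or L instruction kept its required subcluster."""
    return all(c == "E" or c == s for c, s in zip(classes, slots))
```

For example, slot("EELL") returns "UULL". The respects_constraints check expresses the property that only E instructions are free to be assigned to either subcluster pair.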
2.3.3 Instruction Latencies
After an instruction is placed in the IQ or FQ, its issue point is determined by the avail-
ability of its register operands, functional unit(s), and relationship to other instructions
in the queue. There are register producer-consumer dependencies and dynamic func-
tional unit availability dependencies that affect instruction issue. The mapper removes
register producer-producer dependencies.
The latency to produce a register result is generally fixed. The one exception is for load
instructions that miss the Dcache. Table 2–4 lists the latency, in cycles, for each
instruction class.
Table 2–4 Instruction Class Latency in Cycles
Class   Latency  Comments
ild     3        Dcache hit.
        13+      Dcache miss, latency with 6-cycle Bcache. Add additional Bcache loop latency if Bcache latency is greater than 6 cycles.
fld     4        Dcache hit.
        14+      Dcache miss, latency with 6-cycle Bcache. Add additional Bcache loop latency if Bcache latency is greater than 6 cycles.
ist              Does not produce register value.
fst              Does not produce register value.
rpcc    1        Possible 1-cycle cross-cluster delay.
rx      1
mxpr    1 or 3   HW_MFPR: Ebox IPRs = 1. Ibox and Mbox IPRs = 3. HW_MTPR does not produce a register value.
icbr             Conditional branch. Does not produce register value.
ubr     3        Unconditional branch. Does not produce register value.
jsr     3
iadd    1        Possible 1-cycle Ebox cross-cluster delay.
ilog    1        Possible 1-cycle Ebox cross-cluster delay.
ishf    1        Possible 1-cycle Ebox cross-cluster delay.
cmov1   1        Only consumer is cmov2. Possible 1-cycle Ebox cross-cluster delay.
cmov2   1        Possible 1-cycle Ebox cross-cluster delay.
imul    7        Possible 1-cycle Ebox cross-cluster delay.
imisc   3        Possible 1-cycle Ebox cross-cluster delay.
fcbr             Does not produce register value.
fadd    4        Consumer other than fst or ftoi.
        6        Consumer fst or ftoi. Measured from when an fadd is issued from the FQ to when an fst or ftoi is issued from the IQ.
2.4 Instruction Retire Rules
An instruction is retired when it has been executed to completion, and all previous
instructions have been retired. The execution pipeline stage in which an instruction
becomes eligible to be retired depends upon the instruction’s class.
Table 2–5 gives the minimum retire latencies (assuming that all previous instructions
have been retired) for various classes of instructions.
Table 2–4 Instruction Class Latency in Cycles (Continued)
Class   Latency  Comments
fmul    4        Consumer other than fst or ftoi.
        6        Consumer fst or ftoi. Measured from when an fmul is issued from the FQ to when an fst or ftoi is issued from the IQ.
fcmov1  4        Only consumer is fcmov2.
fcmov2  4        Consumer other than fst.
        6        Consumer fst or ftoi. Measured from when an fcmov2 is issued from the FQ to when an fst or ftoi is issued from the IQ.
fdiv    12       Single precision - latency to consumer of result value.
        9        Single precision - latency to using divider again.
        15       Double precision - latency to consumer of result value.
        12       Double precision - latency to using divider again.
fsqrt   18       Single precision - latency to consumer of result value.
        15       Single precision - latency to using unit again.
        33       Double precision - latency to consumer of result value.
        30       Double precision - latency to using unit again.
ftoi    3
itof    4
nop              Does not produce register value.
Table 2–5 Minimum Retire Latencies for Instruction Classes
Instruction Class           Retire Stage  Comments
Integer conditional branch  7
Integer multiply            7/13          Latency is 13 cycles for the MUL/V instruction.
Integer operate             7
Memory                      10
Floating-point add          11
Floating-point multiply     11
2.4.1 Floating-Point Divide/Square Root Early Retire
The floating-point divider and square root unit can detect that, for many combinations
of source operand values, no exception can be generated. Instructions with these oper-
ands can be retired before the result is generated. When detected, they are retired with
the same latency as the FP add class. Early retirement is not possible for the following
instruction/operand/architecture state conditions:
Instruction is not a DIV or SQRT.
SQRT source operand is negative.
Divide operand exponent_a is 0.
Either operand is NaN or INF.
Divide operand exponent_b is 0.
Trapping mode is /I (inexact).
INE status bit is 0.
Early retirement is also not possible for divide instructions if the resulting exponent has
any of the following characteristics (EXP is the result exponent):
DIVT, DIVG: (EXP >= 3FF₁₆) OR (EXP <= 2₁₆)
DIVS, DIVF: (EXP >= 7F₁₆) OR (EXP <= 382₁₆)
2.5 Retire of Operate Instructions into R31/F31
Many instructions that have R31 or F31 as their destination are retired immediately
upon decode (stage 3). These instructions do not produce a result and are removed from
the pipeline as well. They do not occupy a slot in the issue queues and do not occupy a
functional unit. Table 2–6 lists these instructions and some of their characteristics. The
instruction type in Table 2–6 is from Table C-6 in Appendix C of the Alpha Architecture
Handbook, Version 4.
Table 2–5 Minimum Retire Latencies for Instruction Classes (Continued)
Instruction Class                  Retire Stage  Comments
Floating-point DIV/SQRT            11 + latency  Add latency of unit reuse for the instruction indicated in Table 2–4. For example, latency for a single-precision fdiv would be 11 plus 9 from Table 2–4. Latency is 11 if hardware detects that no exception is possible (see Section 2.4.1).
Floating-point conditional branch  11            Branch instruction mispredict is reported in stage 7.
BSR/JSR                            10            JSR instruction mispredict is reported in stage 8.
2.6 Load Instructions to R31 and F31
This section describes how the 21264/EV67 processes software-directed prefetch trans-
actions and load instructions with a destination of R31 and F31.
Prefetches allocate a MAF entry. How the MAF entry is allocated is what distinguishes
the type of prefetch. A normal prefetch is equivalent to a normal load MAF (that is, a
MAF entry that puts the block into the Dcache in a readable state). A prefetch with
modify intent is equivalent to a normal store MAF (that is, a MAF entry that puts the
block into the Dcache in a writeable state). A prefetch, evict next, is equivalent to a nor-
mal load MAF, with the additional behavior described in Section 2.6.3, below.
A prefetch is not performed if the prefetch hits in the Dcache (as if it were a normal
load).
Load operations to R31 and F31 may generate exceptions. These exceptions must be
dismissed by PALcode.
The following sections describe the operational prefetch behavior of these instructions.
2.6.1 Normal Prefetch: LDBU, LDF, LDG, LDL, LDT, LDWU, HW_LDL Instructions
The 21264/EV67 processes these instructions as normal cache line prefetches. If the
load instruction hits the Dcache, the instruction is dismissed; otherwise, the addressed
cache block is allocated into the Dcache.
The HW_LDL instruction construct equates to the HW_LD instruction with the LEN
field clear. See Table 6–3.
2.6.2 Prefetch with Modify Intent: LDS Instruction
The 21264/EV67 processes an LDS instruction, with F31 as the destination, as a
prefetch with modify intent transaction (ReadBlkMod command). If the transaction hits
a dirty Dcache block, the instruction is dismissed. Otherwise, the addressed cache block
is allocated into the Dcache for write access, with its dirty and modified bits set.
Table 2–6 Instructions Retired Without Execution
Instruction Type        Notes
INTA, INTL, INTM, INTS  All with R31 as destination.
FLTI, FLTL, FLTV        All with F31 as destination. MT_FPCR is not included because it has no destination; it is never removed from the pipeline.
LDQ_U                   All with R31 as destination.
MISC                    TRAPB and EXCB are always removed. Others are never removed.
FLTS                    All (SQRT, ITOF) with F31 as destination.
2.6.3 Prefetch, Evict Next: LDQ and HW_LDQ Instructions
The 21264/EV67 processes this instruction like a normal prefetch transaction (Read-
BlkSpec command), with one exception—if the load misses the Dcache, the addressed
cache block is allocated into the Dcache, but the Dcache set allocation pointer is left
pointing to this block. The next miss to the same Dcache line will evict the block. For
example, this instruction might be used when software is reading an array that is known
to fit in the offchip Bcache, but will not fit into the onchip Dcache. In this case, the
instruction ensures that the hardware provides the desired prefetch function without dis-
placing useful cache blocks stored in the other set within the Dcache.
The HW_LDQ instruction construct equates to the HW_LD instruction with the LEN
field set. See Table 6–3.
2.6.4 Prefetch with the LDx_L / STx_C Instruction Sequence
A prefetch within a dynamic 80-instruction window of a LDx_L instruction can cause
the subsequent STx_C to incorrectly succeed when all three references are to the same
64-byte cache block. Within that 80-instruction window, the proximity of the prefetch
to the LDx_L instruction directly affects the possibility of the incorrect behavior. Fur-
ther, if the prefetch issues before the LDx_L, the error cannot occur, and if the prefetch
issues after the LDx_L, the error can only occur when another processor is simulta-
neously acquiring the same lock.
2.7 Special Cases of Alpha Instruction Execution
This section describes the mechanisms that the 21264/EV67 uses to process irregular
instructions in the Alpha instruction set, and cases in which the 21264/EV67 processes
instructions in a non-intuitive way.
2.7.1 Load Hit Speculation
The latency of integer load instructions that hit in the Dcache is three cycles. Figure 2–
9 shows the pipeline timing for these integer load instructions. In Figure 2–9:
Symbol Meaning
Q Issue queue
R Register file read
E Execute
D Dcache access
B Data bus active
Figure 2–9 Pipeline Timing for Integer Load Instructions
There are two cycles in which the IQ may speculatively issue instructions that use load
data before Dcache hit information is known. Any instructions that are issued by the IQ
within this 2-cycle speculative window are kept in the IQ with their requests inhibited
until the load instruction’s hit condition is known, even if they are not dependent on the
load operation. If the load instruction hits, then these instructions are removed from the
queue. If the load instruction misses, then the execution of these instructions is aborted
and the instructions are allowed to request service again.
For example, in Figure 2–9, instruction 1 and instruction 2 are issued within the specu-
lative window of the load instruction. If the load instruction hits, then both instructions
will be deleted from the queue by the start of cycle 7—one cycle later than normal for
instruction 1 and at the normal time for instruction 2. If the load instruction misses, both
instructions are aborted from the execution pipelines and may request service again in
cycle 6.
IQ-issued instructions are aborted if issued within the speculative window of an integer
load instruction that missed in the Dcache, even if they are not dependent on the load
data. However, if software misses are likely, the 21264/EV67 can still benefit from
scheduling the instruction stream for Dcache miss latency. The 21264/EV67 includes a
4-bit saturating counter that is incremented by one when a load instruction hits and
decremented by two when a load instruction misses. When the upper bit of the counter
equals zero, the integer load latency is increased to five cycles and the speculative
window is removed.
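The behavior of the load hit counter can be sketched as follows. This is a minimal Python model under stated assumptions: the class name LoadHitPredictor, its method names, and the reset value are hypothetical (the text does not give the initial counter value); the 4-bit width, the +1 and -2 updates, and the latency values are taken from the description above.

```python
class LoadHitPredictor:
    """4-bit saturating counter: +1 on a Dcache hit, -2 on a miss."""

    MAX = 15  # largest 4-bit value

    def __init__(self, initial=MAX):
        # The reset value is an assumption; the text does not state it.
        self.count = initial

    def record(self, hit):
        """Update the counter for one load, saturating at 0 and 15."""
        if hit:
            self.count = min(self.MAX, self.count + 1)
        else:
            self.count = max(0, self.count - 2)

    def speculate(self):
        """The speculative window exists while the upper bit (bit 3) is 1."""
        return ((self.count >> 3) & 1) == 1

    def integer_load_latency(self):
        """Three cycles with speculation, five cycles without."""
        return 3 if self.speculate() else 5
```

Because a miss subtracts two and a hit adds only one, a load stream that misses more than about one time in three drives the counter toward zero and removes the speculative window.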
Since load instructions to R31 do not produce a result, they do not create a speculative
window when they execute and, therefore, never waste IQ-issue cycles if they miss.
Floating-point load instructions that hit in the Dcache have a latency of four cycles. Fig-
ure 2–10 shows the pipeline timing for floating-point load instructions. In Figure 2–10:
Symbol Meaning
Q Issue queue
R Register file read
E Execute
D Dcache access
B Data bus active
[Figure 2–9 diagram: cycle numbers 1 through 8 across the top; rows ILD, Instruction 1, and Instruction 2, with stage letters Q R E D B, Q R, and Q respectively; the Hit indication is marked. FM-05814.AI4]
Figure 2–10 Pipeline Timing for Floating-Point Load Instructions
The speculative window for floating-point load instructions is one cycle wide.
FQ-issued instructions that are issued within the speculative window of a floating-point
load instruction that has missed, are only aborted if they depend on the load being suc-
cessful.
For example, in Figure 2–10 instruction 1 is issued in the speculative window of the
load instruction.
If instruction 1 is not a user of the data returned by the load instruction, then it is
removed from the queue at its normal time (at the start of cycle 7).
If instruction 1 is dependent on the load instruction data and the load instruction hits,
instruction 1 is removed from the queue one cycle later (at the start of cycle 8). If the
load instruction misses, then instruction 1 is aborted from the Fbox pipeline and may
request service again in cycle 7.
2.7.2 Floating-Point Store Instructions
Floating-point store instructions are duplicated and loaded into both the IQ and the FQ
from the mapper. Each IQ entry contains a control bit, fpWait, that when set prevents
that entry from asserting its requests. This bit is initially set for each floating-point store
instruction that enters the IQ, unless it was the target of a replay trap. The instruction’s
FQ clone is issued when its Ra register is about to become clean, resulting in its IQ
clone’s fpWait bit being cleared and allowing the IQ clone to issue and be executed by
the Mbox. This mechanism ensures that floating-point store instructions are always
issued to the Mbox, along with the associated data, without requiring the floating-point
register dirty bits to be available within the IQ.
2.7.3 CMOV Instruction
For the 21264/EV67, the Alpha CMOV instruction has three operands, and so presents
a special case. The required operation is to move either the value in register Rb or the
value from the old physical destination register into the new destination register, based
upon the value in Ra. Since neither the mapper nor the Ebox and Fbox data paths are
otherwise required to handle three operand instructions, the CMOV instruction is
decomposed by the Ibox pipeline into two 2-operand instructions:
The Alpha architecture instruction:    CMOV  Ra, Rb → Rc
Becomes the 21264/EV67 instructions:   CMOV1 Ra, oldRc → newRc1
                                       CMOV2 newRc1, Rb → newRc2
[Figure 2–10 diagram: cycle numbers 1 through 8 across the top; rows FLD, Instruction 1, and Instruction 2, with stage letters Q R E D B, Q R, and Q respectively; the Hit indication is marked. FM-05815.AI4]
The first instruction, CMOV1, tests the value of Ra and records the result of this test in
a 65th bit of its destination register, newRc1. It also copies the value of the old physical
destination register, oldRc, to newRc1.
The second instruction, CMOV2, then copies either the value in newRc1 or the value in
Rb into a second physical destination register, newRc2, based on the CMOV predicate
bit stored in newRc1.
In summary, the original CMOV instruction is decomposed into two dependent instruc-
tions that each use a physical register from the free list.
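The data flow of the two component instructions can be illustrated with a small Python model. This is a sketch only: PhysReg, cmov1, cmov2, and cmoveq are hypothetical names, and the model uses CMOVEQ (move if Ra equals zero) as one example of the CMOV predicate. The extra pred field stands in for the 65th bit described above.

```python
from dataclasses import dataclass

MASK64 = (1 << 64) - 1

@dataclass
class PhysReg:
    value: int          # 64-bit data
    pred: bool = False  # the extra "65th bit" written by CMOV1

def cmov1(ra, old_rc, test):
    """Test Ra, record the outcome, and carry the old destination value."""
    return PhysReg(value=old_rc & MASK64, pred=test(ra))

def cmov2(new_rc1, rb):
    """Select Rb if the recorded predicate is true, else the old value."""
    return PhysReg(value=(rb & MASK64) if new_rc1.pred else new_rc1.value)

def cmoveq(ra, rb, old_rc):
    """CMOVEQ Ra, Rb, Rc modeled as the CMOV1/CMOV2 pair."""
    return cmov2(cmov1(ra, old_rc, lambda x: x == 0), rb).value
```

Each component reads only two register operands, which is the point of the decomposition: CMOV1 reads Ra and oldRc, and CMOV2 reads newRc1 and Rb.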
To further simplify this operation, the two component instructions of a CMOV instruc-
tion are driven through the mappers in successive cycles. Hence, if a fetch line contains
n CMOV instructions, it takes n+1 cycles to run that fetch line through the mappers.
For example, the following fetch line:
ADD CMOVx SUB CMOVy
Results in the following three map cycles:
ADD CMOVx1
CMOVx2 SUB CMOVy1
CMOVy2
The Ebox executes integer CMOV instructions as two distinct 1-cycle latency opera-
tions. The Fbox add pipeline executes floating-point CMOV instructions as two distinct
4-cycle latency operations.
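The n+1 map cycle rule can be reproduced with a short Python sketch. The function name map_cycles and the convention of recognizing CMOV instructions by name prefix are assumptions made for illustration; the splitting rule itself follows the fetch line example above.

```python
def map_cycles(fetch_line):
    """Split a fetch line into per-cycle mapper groups.

    Instructions whose name starts with "CMOV" are split into two halves
    that pass through the mappers in successive cycles.
    """
    cycles = [[]]
    for insn in fetch_line:
        if insn.startswith("CMOV"):
            cycles[-1].append(insn + "1")  # first half ends the current cycle
            cycles.append([insn + "2"])    # second half starts the next cycle
        else:
            cycles[-1].append(insn)
    return cycles
```

Applied to the fetch line ADD CMOVx SUB CMOVy, the function returns three groups that match the three map cycles listed above. A fetch line with no CMOV instructions yields a single group.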
2.8 Memory and I/O Address Space Instructions
This section provides an overview of the way the 21264/EV67 processes memory and I/
O address space instructions.
The 21264/EV67 supports, and internally recognizes, a 44-bit physical address space
that is divided equally between memory address space and I/O address space. Memory
address space resides in the lower half of the physical address space (PA[43]=0)
and I/O address space resides in the upper half of the physical address space
(PA[43]=1).
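The split of the physical address space can be expressed as a one-bit test. In the Python sketch below, the names PA_BITS, IO_SELECT_BIT, and is_io_address are hypothetical; the 44-bit width and the meaning of PA[43] come from the paragraph above.

```python
PA_BITS = 44
IO_SELECT_BIT = 43  # PA[43]: 0 = memory address space, 1 = I/O address space

def is_io_address(pa):
    """Return True if a 44-bit physical address lies in I/O address space."""
    if not 0 <= pa < (1 << PA_BITS):
        raise ValueError("physical address does not fit in 44 bits")
    return ((pa >> IO_SELECT_BIT) & 1) == 1
```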
The IQ can issue any combination of load and store instructions to the Mbox at the rate
of two per cycle. The two lower Ebox subclusters, L0 and L1, generate the
48-bit effective virtual address for these instructions.
An instruction is defined to be newer than another instruction if it follows that instruc-
tion in program order and is older if it precedes that instruction in program order.
2.8.1 Memory Address Space Load Instructions
The Mbox begins execution of a load instruction by translating its virtual address to a
physical address using the DTB and by accessing the Dcache. The Dcache is virtually
indexed, allowing these two operations to be done in parallel. The Mbox puts informa-
tion about the load instruction, including its physical address, destination register, and
data format, into the LQ.
If the requested physical location is found in the Dcache (a hit), the data is formatted
and written into the appropriate integer or floating-point register. If the location is not in
the Dcache (a miss), the physical address is placed in the miss address file (MAF) for
processing by the Cbox. The MAF performs a merging function in which a new miss
address is compared to miss addresses already held in the MAF. If the new miss address
points to the same Dcache block as a miss address in the MAF, then the new miss
address is discarded.
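The MAF merging function described above can be sketched in Python as a comparison of block-aligned addresses. The class and method names are hypothetical, the model ignores MAF capacity and the merging rules of Table 2–9, and it assumes the 64-byte Dcache block size used elsewhere in this chapter.

```python
BLOCK_BYTES = 64  # Dcache block size in bytes

class MissAddressFile:
    """Toy model: tracks outstanding miss addresses by Dcache block."""

    def __init__(self):
        self.blocks = []  # block-aligned miss addresses, oldest first

    def record_miss(self, pa):
        """Return True if a new entry was allocated, False if merged."""
        block = pa & ~(BLOCK_BYTES - 1)
        if block in self.blocks:
            return False  # same Dcache block: the new miss address is discarded
        self.blocks.append(block)
        return True
```

Two misses to addresses 0x1000 and 0x103F fall in the same 64-byte block, so the second is merged; a miss to 0x1040 allocates a new entry.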
When Dcache fill data is returned to the Dcache by the Cbox, the Mbox satisfies the
requesting load instructions in the LQ.
2.8.2 I/O Address Space Load Instructions
Because I/O space load instructions may have side effects, they cannot be performed
speculatively. When the Mbox receives an I/O space load instruction, the Mbox places
the load instruction in the LQ, where it is held until it retires. The Mbox replays retired
I/O space load instructions from the LQ to the MAF in program order, at a rate of one
per GCLK cycle.
The Mbox allocates a new MAF entry to an I/O load instruction and increases I/O band-
width by attempting to merge I/O load instructions in a merge register. Table 2–7 shows
the rules for merging data. The columns represent the load instructions replayed to the
MAF while the rows represent the size of the load in the merge register.
In summary, Table 2–7 shows some of the following rules:
Byte/word load instructions and different size load instructions are not allowed to
merge.
A stream of ascending non-overlapping, but not necessarily consecutive, longword
load instructions are allowed to merge into naturally aligned 32-byte blocks.
A stream of ascending non-overlapping, but not necessarily consecutive, quadword
load instructions are allowed to merge into naturally aligned 64-byte blocks.
Merging of quadwords can be limited to naturally-aligned 32-byte blocks based on
the Cbox WRITE_ONCE chain 32_BYTE_IO field.
Issued MB, WMB, and I/O load instructions close the I/O register merge window.
To minimize latency, the merge window is also closed when a timer detects no I/O
store instruction activity for 1024 cycles.
After the Mbox I/O register has closed its merge window, the Cbox sends I/O read
requests offchip in the order that they were received from the Mbox.
Table 2–7 Rules for I/O Address Space Load Instruction Data Merging
Merge Register/
Replayed Instruction   Load Byte/Word   Load Longword          Load Quadword
Byte/Word              No merge         No merge               No merge
Longword               No merge         Merge up to 32 bytes   No merge
Quadword               No merge         No merge               Merge up to 64 bytes
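The merging rules summarized in Table 2–7 can be approximated by the following Python sketch. It is a toy model under several assumptions: the class name IoMergeRegister and its fields are hypothetical, addresses are assumed to be naturally aligned for their size, and closing of the merge window by MB, WMB, or the inactivity timer is not modeled.

```python
class IoMergeRegister:
    """Toy model of the I/O merge rules summarized in Table 2-7."""

    LIMIT = {4: 32, 8: 64}  # access size in bytes -> merge block size in bytes

    def __init__(self, limit_quadwords_to_32=False):
        self.limit = dict(self.LIMIT)
        if limit_quadwords_to_32:  # models the 32_BYTE_IO option
            self.limit[8] = 32
        self.size = None       # size of the accesses merged so far
        self.block = None      # base of the naturally aligned merge block
        self.next_free = None  # first byte after the last merged access

    def _block_of(self, addr, size):
        return addr & ~(self.limit[size] - 1)

    def try_merge(self, addr, size):
        """Return True if the access merges, False if it opens a new window."""
        if (size == self.size and size in self.limit
                and self._block_of(addr, size) == self.block
                and addr >= self.next_free):
            self.next_free = addr + size
            return True
        # Byte/word accesses, size changes, a different block, or a
        # non-ascending address all start a new merge window.
        self.size = size
        self.block = self._block_of(addr, size) if size in self.limit else None
        self.next_free = addr + size
        return False
```

In this model, longwords at 0x100 and 0x108 merge (ascending, non-overlapping, same 32-byte block) even though they are not consecutive, while a following longword at 0x104 does not merge because it is not ascending.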
2.8.3 Memory Address Space Store Instructions
The Mbox begins execution of a store instruction by translating its virtual address to a
physical address using the DTB and by probing the Dcache. The Mbox puts informa-
tion about the store instruction, including its physical address, its data and the results of
the Dcache probe, into the store queue (SQ).
If the Mbox does not find the addressed location in the Dcache, it places the address
into the MAF for processing by the Cbox. If the Mbox finds the addressed location in a
Dcache block that is not dirty, then it places a ChangeToDirty request into the MAF.
A store instruction can write its data into the Dcache when it is retired, and when the
Dcache block containing its address is dirty and not shared. SQ entries that meet these
two conditions can be placed into the writable state. These SQ entries are placed into
the writable state in program order at a maximum rate of two entries per cycle. The
Mbox transfers writable store queue entry data from the SQ to the Dcache in program
order at a maximum rate of two entries per cycle. Dcache lines associated with writable
store queue entries are locked by the Mbox. System port probe commands cannot evict
these blocks until their associated writable SQ entries have been transferred into the
Dcache. This restriction assists in STx_C instruction and Dcache ECC processing.
SQ entry data that has not been transferred to the Dcache may source data to newer load
instructions. The Mbox compares the virtual Dcache index bits of incoming load
instructions to queued SQ entries, and sources the data from the SQ, bypassing the
Dcache, when necessary.
2.8.4 I/O Address Space Store Instructions
The Mbox begins processing I/O space store instructions, like memory space store
instructions, by translating the virtual address and placing the state associated with the
store instruction into the SQ.
The Mbox replays retired I/O space store entries from the SQ to the IOWB in program
order at a rate of one per GCLK cycle. The Mbox never allows queued I/O space store
instructions to source data to subsequent load instructions.
The Cbox maximizes I/O bandwidth when it allocates a new IOWB entry to an I/O
store instruction by attempting to merge I/O store instructions in a merge register. Table
2–8 shows the rules for I/O space store instruction data merging. The columns represent
the store instructions replayed to the IOWB while the rows represent the size of the store
in the merge register.
Table 2–8 Rules for I/O Address Space Store Instruction Data Merging
Merge Register/
Replayed Instruction   Store Byte/Word   Store Longword         Store Quadword
Byte/Word              No merge          No merge               No merge
Longword               No merge          Merge up to 32 bytes   No merge
Quadword               No merge          No merge               Merge up to 64 bytes
Table 2–8 shows some of the following rules:
Byte/word store instructions and different size store instructions are not allowed to
merge.
A stream of ascending non-overlapping, but not necessarily consecutive, longword
store instructions are allowed to merge into naturally aligned 32-byte blocks.
A stream of ascending non-overlapping, but not necessarily consecutive, quadword
store instructions are allowed to merge into naturally aligned 64-byte blocks.
Merging of quadwords can be limited to naturally-aligned 32-byte blocks based on
the Cbox WRITE_ONCE chain 32_BYTE_IO field.
Issued MB, WMB, and I/O load instructions close the I/O register merge window.
To minimize latency, the merge window is also closed when a timer detects no I/O
store instruction activity for 1024 cycles.
After the IOWB merge register has closed its merge window, the Cbox sends I/O space
store requests offchip in the order that they were received from the Mbox.
2.9 MAF Memory Address Space Merging Rules
Because all memory transactions are to 64-byte blocks, efficiency is improved by merg-
ing several small data transactions into a single larger data transaction. Table 2–9 lists
the rules the 21264/EV67 uses when merging memory transactions into 64-byte natu-
rally aligned data block transactions. Rows represent the merged instruction in the
MAF and columns represent the new issued transaction.
In summary, Table 2–9 shows that only like instruction types, with the exception of
load instructions merging with store instructions, are merged.
Table 2–9 MAF Merging Rules
MAF/New   LDx     STx     STx_C   WH64    ECB     Istream
LDx       Merge   -       -       -       -       -
STx       Merge   Merge   -       -       -       -
STx_C     -       -       Merge   -       -       -
WH64      -       -       -       Merge   -       -
ECB       -       -       -       -       Merge   -
Istream   -       -       -       -       -       Merge
2.10 Instruction Ordering
In the absence of explicit instruction ordering, such as with MB or WMB instructions,
the 21264/EV67 maintains a default instruction ordering relationship between pairs of
load and store instructions.
The 21264/EV67 maintains the default memory data instruction ordering as shown in
Table 2–10 (assume address X and address Y are different).
The 21264/EV67 maintains the default I/O instruction ordering as shown in Table 2–11
(assume address X and address Y are different).
2.11 Replay Traps
There are some situations in which a load or store instruction cannot be executed due to
a condition that occurs after that instruction issues from the IQ or FQ. The instruction is
aborted (along with all newer instructions) and restarted from the fetch stage of the
pipeline. This mechanism is called a replay trap.
2.11.1 Mbox Order Traps
Load and store instructions may be issued from the IQ in a different order than they
were fetched from the Icache, while the architecture dictates that Dstream memory
transactions to the same physical bytes must be completed in order. Usually, the Mbox
manages the memory reference stream by itself to achieve architecturally correct
behavior, but the two cases in which the Mbox uses replay traps to manage the memory
stream are load-load and store-load order traps.
Table 2–10 Memory Reference Ordering
First Instruction in Pair    Second Instruction in Pair   Reference Order
Load memory to address X     Load memory to address X     Maintained (litmus test 1)
Load memory to address X     Load memory to address Y     Not maintained
Store memory to address X    Store memory to address X    Maintained
Store memory to address X    Store memory to address Y    Maintained
Load memory to address X     Store memory to address X    Maintained
Load memory to address X     Store memory to address Y    Not maintained
Store memory to address X    Load memory to address X     Maintained
Store memory to address X    Load memory to address Y     Not maintained
Table 2–11 I/O Reference Ordering
First Instruction in Pair    Second Instruction in Pair   Reference Order
Load I/O to address X        Load I/O to address X        Maintained
Load I/O to address X        Load I/O to address Y        Maintained
Store I/O to address X       Store I/O to address X       Maintained
Store I/O to address X       Store I/O to address Y       Maintained
Load I/O to address X        Store I/O to address X       Maintained
Load I/O to address X        Store I/O to address Y       Not maintained
Store I/O to address X       Load I/O to address X        Maintained
Store I/O to address X       Load I/O to address Y        Not maintained
2.11.1.1 Load-Load Order Trap
The Mbox ensures that load instructions that read the same physical byte(s) ultimately
issue in the correct order by using the load-load order trap. The Mbox compares the
address of each load instruction, as it is issued, to the addresses of all load instructions
in the load queue. If the Mbox finds a newer load instruction with a matching address in
the load queue, it invokes a load-load order trap on the newer instruction. This is a
replay trap that aborts the target of the trap and all newer instructions from the machine
and refetches instructions starting at the target of the trap.
2.11.1.2 Store-Load Order Trap
The Mbox ensures that a load instruction ultimately issues after an older store instruc-
tion that writes some portion of its memory operand by using the store-load order trap.
The Mbox compares the address of each store instruction, as it is issued, to the addresses
of all load instructions in the load queue. If the Mbox finds a newer load instruction with
a matching address in the load queue, it invokes a store-load order trap on the load
instruction. This is a replay trap that functions like the load-load order trap.
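The two order-trap checks described above can be sketched as one comparison against the load queue. This is a simplified model under stated assumptions: `seq` is a hypothetical program-order sequence number (larger means newer), addresses are compared for exact equality rather than by overlapping bytes, and when several newer loads match, the oldest of them is taken as the trap target because all newer instructions are aborted anyway.

```python
from dataclasses import dataclass


@dataclass
class LoadEntry:
    seq: int      # hypothetical program-order sequence number; larger = newer
    address: int  # physical address referenced by the load


def find_order_trap(load_queue, issuing_kind, issuing_seq, issuing_address):
    """Check an issuing load or store against loads already in the load queue.

    Returns (trap name, seq of the trap target) or None if no trap is needed.
    """
    newer_matches = [
        entry for entry in load_queue
        if entry.seq > issuing_seq and entry.address == issuing_address
    ]
    if not newer_matches:
        return None
    # Assumption: the trap targets the oldest of the newer matching loads.
    target = min(newer_matches, key=lambda entry: entry.seq)
    trap = "load-load" if issuing_kind == "load" else "store-load"
    return trap, target.seq
```

For instance, with loads of sequence 5 and 7 already in the queue, an older load (sequence 3) issuing to the address of load 5 yields `("load-load", 5)`.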
The Ibox contains extra hardware to reduce the frequency of the store-load trap. There
is a 1-bit by 1024-entry VPC-indexed table in the Ibox called the stWait table. When an
Icache instruction is fetched, the associated stWait table entry is fetched along with the
Icache instruction. The stWait table produces 1 bit for each instruction accessed from
the Icache. When a load instruction gets a store-load order replay trap, its associated bit
in the stWait table is set during the cycle that the load is refetched. Hence, the trapping
load instruction's stWait bit will be set the next time it is fetched.
The IQ will not issue load instructions whose stWait bit is set while there are older unis-
sued store instructions in the queue. A load instruction whose stWait bit is set can be
issued the cycle immediately after the last older store instruction is issued from the
queue. All the bits in the stWait table are unconditionally cleared every 16384 cycles, or
every 65536 cycles if I_CTL[ST_WAIT_64K] is set.
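The stWait behavior described above can be modeled in a few lines. This is a sketch, not the Ibox implementation: the text only says the table is VPC-indexed, so the index function here (instruction address divided by the 4-byte instruction size, modulo 1024) is an assumption, and the class and method names are hypothetical.

```python
class StWaitTable:
    """Software model of the 1-bit by 1024-entry stWait table."""

    ENTRIES = 1024

    def __init__(self, st_wait_64k=False):
        self.bits = [False] * self.ENTRIES
        # Clear interval: 16384 cycles, or 65536 if I_CTL[ST_WAIT_64K] is set.
        self.clear_period = 65536 if st_wait_64k else 16384
        self.cycle = 0

    @classmethod
    def index(cls, vpc):
        # Assumed indexing: drop the two byte-offset bits, keep ten index bits.
        return (vpc >> 2) % cls.ENTRIES

    def record_store_load_trap(self, vpc):
        """Set the bit for a load that took a store-load order replay trap."""
        self.bits[self.index(vpc)] = True

    def load_may_issue(self, vpc, older_unissued_stores):
        """A load with its stWait bit set waits for all older stores to issue."""
        return not (self.bits[self.index(vpc)] and older_unissued_stores > 0)

    def tick(self, cycles=1):
        """Advance time; all bits are cleared unconditionally each period."""
        for _ in range(cycles):
            self.cycle += 1
            if self.cycle % self.clear_period == 0:
                self.bits = [False] * self.ENTRIES
```

The periodic clear keeps a load that trapped once from being held behind older stores indefinitely.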
2.11.2 Other Mbox Replay Traps
The Mbox also uses replay traps to control the flow of the load queue and store queue,
and to ensure that there are never multiple outstanding misses to different physical
addresses that map to the same Dcache or Bcache line. Unlike the order traps, however,
these replay traps are invoked on the incoming instruction that triggered the condition.
2.12 I/O Write Buffer and the WMB Instruction
The I/O write buffer (IOWB) consists of four 64-byte entries with the associated
address and control logic used to buffer I/O write data between the store queue (SQ)
and the system port.
2.12.1 Memory Barrier (MB/WMB/TB Fill Flow)
The Cbox CSR SYSBUS_MB_ENABLE bit determines if MB instructions produce
external system port transactions. When the SYSBUS_MB_ENABLE bit equals 0, the
Cbox CSR MB_CNT[3:0] field contains the number of pending uncommitted transac-
tions. The counter will increment for each of the following commands:
RdBlk, RdBlkMod, RdBlkI
RdBlkSpec (valid), RdBlkModSpec (valid), RdBlkSpecI (valid)
RdBlkVic, RdBlkModVic, RdBlkVicI
CleanToDirty, SharedToDirty, STChangeToDirty, InvalToDirty
FetchBlk, FetchBlkSpec (valid), Evict
RdByte, RdLw, RdQw, WrByte, WrLW, WrQW
The counter is decremented with the C (commit) bit in the Probe and SysDc commands
(see Section 4.7.7). Systems can assert the C bit in the SysDc fill response to the com-
mands that originally incremented the counter, or attached to the last probe seen by that
command when it reached the system serialization point. If the number of uncommitted
transactions reaches 15 (saturating the counter), the Cbox will stall MAF and IOWB
processing until at least one of the pending transactions has been committed. Probe pro-
cessing is not interrupted by the state of this counter.
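The counter behavior described above (used when SYSBUS_MB_ENABLE equals 0) can be sketched as a saturating counter. The class below is a hypothetical model: it ignores the valid/invalid distinction for the speculative commands and treats each commit as decrementing the count by one.

```python
# Commands that increment MB_CNT, taken from the list in this section.
COUNTED_COMMANDS = {
    "RdBlk", "RdBlkMod", "RdBlkI",
    "RdBlkSpec", "RdBlkModSpec", "RdBlkSpecI",
    "RdBlkVic", "RdBlkModVic", "RdBlkVicI",
    "CleanToDirty", "SharedToDirty", "STChangeToDirty", "InvalToDirty",
    "FetchBlk", "FetchBlkSpec", "Evict",
    "RdByte", "RdLw", "RdQw", "WrByte", "WrLW", "WrQW",
}


class MbCounter:
    """Model of the 4-bit MB_CNT[3:0] uncommitted-transaction counter."""

    MAX = 15  # a 4-bit counter saturates at 15

    def __init__(self):
        self.count = 0

    def stalled(self):
        """MAF and IOWB processing stalls while the counter is saturated."""
        return self.count >= self.MAX

    def issue(self, command):
        """Try to issue a counted command; returns False if processing is stalled."""
        if command not in COUNTED_COMMANDS:
            raise ValueError(f"{command} does not increment MB_CNT")
        if self.stalled():
            return False
        self.count += 1
        return True

    def commit(self):
        """A C (commit) bit in a Probe or SysDc command decrements the counter."""
        if self.count > 0:
            self.count -= 1
```

Probe processing is deliberately not modeled here, since the text states it is not interrupted by the counter state.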
2.12.1.1 MB Instruction Processing
When an MB instruction is fetched in the predicted instruction execution path, it stalls
in the map stage of the pipeline. This also stalls all instructions after the MB, and con-
trol of instruction flow is based upon the value in Cbox CSR SYSBUS_MB_ENABLE
as follows:
If Cbox CSR SYSBUS_MB_ENABLE is clear, the Cbox waits until the IQ is
empty and then performs the following actions:
a. Sends all pending MAF and IOWB entries to the system port.
b. Monitors Cbox CSR MB_CNT[3:0], a 4-bit counter of outstanding uncommitted
events. When the counter decrements from one to zero, the Cbox marks the
youngest probe queue entry.
c. Waits until the MAF contains no more Dstream references and the SQ, LQ, and
IOWB are empty.
When all of the above have occurred and a probe response has been sent to the sys-
tem for the marked probe queue entry, instruction execution continues with the
instruction after the MB.
If Cbox CSR SYSBUS_MB_ENABLE is set, the Cbox waits until the IQ is empty
and then performs the following actions:
a. Sends all pending MAF and IOWB entries to the system port.
b. Sends the MB command to the system port.
c. Waits until the MB command is acknowledged, then marks the youngest entry
in the probe queue.
d. Waits until the MAF contains no more Dstream references and the SQ, LQ, and
IOWB are empty.
When all of the above have occurred and a probe response has been sent to the sys-
tem for the marked probe queue entry, instruction execution continues with the
instruction after the MB.
Because the MB instruction is executed speculatively, MB processing can begin
and the original MB can be killed. In the internal acknowledge case, the MB may
have already been sent to the system interface, and the system is still expected to
respond to the MB.
2.12.1.2 WMB Instruction Processing
Write memory barrier (WMB) instructions are issued into the Mbox store-queue, where
they wait until they are retired and all prior store instructions become writable. The
Mbox then stalls the writable pointer and informs the Cbox. The Cbox closes the IOWB
merge register and responds in one of the following two ways:
If Cbox CSR SYSBUS_MB_ENABLE is clear, the Cbox performs the following
actions:
a. Stalls further MAF and IOWB processing.
b. Monitors Cbox CSR MB_CNT[3:0], a 4-bit counter of outstanding uncommitted
events. When the counter decrements from one to zero, the Cbox marks the
youngest probe queue entry.
c. When a probe response has been sent to the system for the marked probe queue
entry, the Cbox considers the WMB to be satisfied.
If Cbox CSR SYSBUS_MB_ENABLE is set, the Cbox performs the following
actions:
a. Stalls further MAF and IOWB processing.
b. Sends the MB command to the system port.
c. Waits until the MB command is acknowledged by the system with a SysDc
MBDone command, then sends acknowledge and marks the youngest entry in
the probe queue.
d. When a probe response has been sent to the system for the marked probe queue
entry, the Cbox considers the WMB to be satisfied.
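The MB and WMB flows above differ mainly in how they start and in whether the barrier is acknowledged internally or by the system. The following sketch lists the Cbox steps for each combination as paraphrased strings; `barrier_steps` is a hypothetical helper for comparing the flows side by side, not a model of timing.

```python
def barrier_steps(kind, sysbus_mb_enable):
    """Return the ordered Cbox steps for an MB or WMB, per Section 2.12.1."""
    if kind not in ("MB", "WMB"):
        raise ValueError("kind must be 'MB' or 'WMB'")

    steps = []
    if kind == "MB":
        steps.append("wait until the IQ is empty")
        steps.append("send all pending MAF and IOWB entries to the system port")
    else:
        steps.append("close the IOWB merge register")
        steps.append("stall further MAF and IOWB processing")

    if sysbus_mb_enable:
        # External acknowledge: the system answers the MB command.
        steps.append("send the MB command to the system port")
        steps.append("wait for the MB acknowledge, then mark the youngest probe queue entry")
    else:
        # Internal acknowledge: MB_CNT reaching zero stands in for the system.
        steps.append("wait for MB_CNT to go from one to zero, then mark the youngest probe queue entry")

    if kind == "MB":
        steps.append("wait until the MAF holds no Dstream references and the SQ, LQ, and IOWB are empty")

    steps.append("wait for the probe response for the marked probe queue entry")
    return steps
```

Calling `barrier_steps("WMB", False)` gives the shortest flow (four steps), while `barrier_steps("MB", True)` gives the longest (six steps).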
2.12.1.3 TB Fill Flow
Load instructions (HW_LDs) to a virtual page table entry (VPTE) are processed by the
21264/EV67 to avoid litmus test problems associated with the ordering of memory
transactions from another processor against loading of a page table entry and the subse-
quent virtual-mode load from this processor.
Consider the sequence shown in Table 2–12. The data could be in the Bcache. Pj should
fetch datai if it is using PTEi.
Table 2–12 TB Fill Flow Example Sequence 1
Pi             Pj
Write Datai    Load/Store datai
MB             <TB miss>
Write PTEi     Load-PTE
               <write TB>
               Load/Store (restart)
Also consider the related sequence shown in Table 2–13. In this case, the data could be
cached in the Bcache; Pj should fetch datai if it is using PTEi.
The 21264/EV67 processes Dstream loads to the PTE by injecting, in hardware, some
memory barrier processing between the PTE transaction and any subsequent load or
store instruction. This is accomplished by the following mechanism:
1. The integer queue issues a HW_LD instruction with VPTE.
2. The integer queue issues a HW_MTPR instruction with a DTB_PTE0, that is data-
dependent on the HW_LD instruction with a VPTE, and is required in order to fill
the DTBs. The HW_MTPR instruction, when queued, sets IPR scoreboard bits [4]
and [0].
3. When a HW_MTPR instruction with a DTB_PTE0 is issued, the Ibox signals the
Cbox indicating that a HW_LD instruction with a VPTE has been processed. This
causes the Cbox to begin processing the MB instruction. The Ibox prevents any
subsequent memory operations from being issued by not clearing IPR scoreboard bit
[0]. IPR scoreboard bit [0] is one of the scoreboard bits associated with the
HW_MTPR instruction with DTB_PTE0.
4. When the Cbox completes processing the MB instruction (using one of the above
sequences, depending upon the state of SYSBUS_MB_ENABLE), the Cbox sig-
nals the Ibox to clear IPR scoreboard bit [0].
The 21264/EV67 uses a similar mechanism to process Istream TB misses and fills to
the PTE for the Istream.
1. The integer queue issues a HW_LD instruction with VPTE.
2. The IQ issues a HW_MTPR instruction with an ITB_PTE that is data-dependent
upon the HW_LD instruction with VPTE. This is required in order to fill the ITB.
The HW_MTPR instruction, when queued, sets IPR scoreboard bits [4] and [0].
3. The Cbox issues a HW_MTPR instruction for the ITB_PTE and signals the Ibox
that a HW_LD/VPTE instruction has been processed, causing the Cbox to start pro-
cessing the MB instruction. The Mbox stalls Ibox fetching from when the HW_LD/
VPTE instruction finishes until the probe queue is drained.
4. When the 21264/EV67 is finished (SYS_MB selects one of the above sequences),
the Cbox directs the Ibox to clear IPR scoreboard bit [0]. Also, the Mbox directs the
Ibox to start prefetching.
Inserting MB instruction processing within the TB fill flow is only required for multi-
processor systems. Uniprocessor systems can disable MB instruction processing by
deasserting Ibox CSR I_CTL[TB_MB_EN].
Table 2–13 TB Fill Flow Example Sequence 2
Pi             Pj
Write Datai    Istream read datai
MB             <TB miss>
Write PTEi     Load-PTE
               <write TB>
               Istream read (restart) - will miss the Icache
2.13 Performance Measurement Support—Performance Counters
The 21264/EV67 provides hardware support for two methods of obtaining program
performance feedback information. The two methods do not require program modifica-
tion. The first method offers similar capabilities to earlier microprocessor performance
counters. The second method supports the new ProfileMe way of statistically sampling
individual instructions during program execution to develop a model of program execu-
tion. Both methods use the same hardware registers.
See Section 6.10 for information about counter control.
2.14 Floating-Point Control Register
The floating-point control register (FPCR) is shown in Figure 2–11.
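As an illustration of the register layout, the sketch below decodes a 64-bit FPCR value into named fields. The bit positions, the DYN encodings, and the UNFD/UNDZ combinations are those given in Table 2–14 of this section; the function names `decode_fpcr` and `underflow_behavior` are hypothetical.

```python
# Single-bit FPCR fields and their bit positions (Table 2-14).
FPCR_FLAGS = {
    "SUM": 63, "INED": 62, "UNFD": 61, "UNDZ": 60,
    "IOV": 57, "INE": 56, "UNF": 55, "OVF": 54, "DZE": 53, "INV": 52,
    "OVFD": 51, "DZED": 50, "INVD": 49, "DNZ": 48,
}

# DYN[59:58] dynamic rounding mode encodings (Table 2-14).
DYN_MODES = {
    0b00: "Chopped",
    0b01: "Minus infinity",
    0b10: "Normal",
    0b11: "Plus infinity",
}


def decode_fpcr(value):
    """Split a 64-bit FPCR value into its named fields."""
    fields = {name: (value >> bit) & 1 for name, bit in FPCR_FLAGS.items()}
    fields["DYN"] = DYN_MODES[(value >> 58) & 0b11]
    return fields


def underflow_behavior(fields):
    """Describe the underflow handling selected by UNFD and UNDZ."""
    if not fields["UNFD"]:
        return "Underflow trap"
    if not fields["UNDZ"]:
        return "Trap to supply a possible denormal result"
    return "Underflow trap suppressed; destination written with a true zero (+0.0)"
```

For example, a value with bits 61 and 60 set and DYN equal to binary 10 decodes to UNFD = 1, UNDZ = 1, and the "Normal" rounding mode, which selects the trap-suppressed underflow behavior.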
Figure 2–11 Floating-Point Control Register
[Figure LK99-0050A: FPCR bit layout. From bit 63 down to bit 48, the fields are SUM,
INED, UNFD, UNDZ, DYN (bits 59:58), IOV, INE, UNF, OVF, DZE, INV, OVFD, DZED, INVD,
and DNZ.]
The floating-point control register fields are described in Table 2–14.
Table 2–14 Floating-Point Control Register Fields
Name  Extent   Type  Description
SUM   [63]     RW    Summary bit. Records bit-wise OR of FPCR exception bits.
INED  [62]     RW    Inexact Disable. If this bit is set and a floating-point instruction
                     that enables trapping on inexact results generates an inexact value,
                     the result is placed in the destination register and the trap is
                     suppressed.
UNFD  [61]     RW    Underflow Disable. The 21264/EV67 hardware cannot generate IEEE
                     compliant denormal results. UNFD is used in conjunction with UNDZ as
                     follows:
                     UNFD  UNDZ  Result
                     0     X     Underflow trap.
                     1     0     Trap to supply a possible denormal result.
                     1     1     Underflow trap suppressed. Destination is written
                                 with a true zero (+0.0).
UNDZ  [60]     RW    Underflow to zero. When UNDZ is set together with UNFD, underflow
                     traps are disabled and the 21264/EV67 places a true zero in the
                     destination register. See UNFD, above.
DYN   [59:58]  RW    Dynamic rounding mode. Indicates the rounding mode to be used by an
                     IEEE floating-point instruction when the instruction specifies
                     dynamic rounding mode:
                     Bits  Meaning
                     00    Chopped
                     01    Minus infinity
                     10    Normal
                     11    Plus infinity
IOV   [57]     RW    Integer overflow. An integer arithmetic operation or a conversion
                     from floating-point to integer overflowed the destination precision.
INE   [56]     RW    Inexact result. A floating-point arithmetic or conversion operation
                     gave a result that differed from the mathematically exact result.
UNF   [55]     RW    Underflow. A floating-point arithmetic or conversion operation gave
                     a result that underflowed the destination exponent.
OVF   [54]     RW    Overflow. A floating-point arithmetic or conversion operation gave a
                     result that overflowed the destination exponent.
DZE   [53]     RW    Divide by zero. An attempt was made to perform a floating-point
                     divide with a divisor of zero.
INV   [52]     RW    Invalid operation. An attempt was made to perform a floating-point
                     arithmetic operation and one or more of its operand values were
                     illegal.
OVFD  [51]     RW    Overflow disable. If this bit is set and a floating-point arithmetic
                     operation generates an overflow condition, then the appropriate IEEE
                     nontrapping result is placed in the destination register and the
                     trap is suppressed.
DZED  [50]     RW    Division by zero disable. If this bit is set and a floating-point
                     divide by zero is detected, the appropriate IEEE nontrapping result
                     is placed in the destination register and the trap is suppressed.
INVD  [49]     RW    Invalid operation disable. If this bit is set and a floating-point
                     operate generates an invalid operation condition and 21264/EV67 is
                     capable of producing the correct IEEE nontrapping result, that
                     result is placed in the destination register and the trap is
                     suppressed.
Table 2–14 Floating-Point Control Register Fields (Continued)
Name      Extent      Type  Description
DNZ       [48]        RW    Denormal operands to zero. If this bit is set, treat all
                            Denormal operands as a signed zero value with the same sign
                            as the Denormal operand.
Reserved  [47:0] (1)  —     —
(1) Alpha architecture FPCR bit 47 (DNOD) is not implemented by the 21264/EV67.
2.15 AMASK and IMPLVER Instruction Values
The AMASK and IMPLVER instructions return the supported architecture extensions and
the processor type, respectively.
2.15.1 AMASK
The 21264/EV67 returns the AMASK instruction values provided in Table 2–15. The
I_CTL register reports the 21264/EV67 pass level (see I_CTL[CHIP_ID], Section
5.2.15).
Table 2–15 21264/EV67 AMASK Values
21264/EV67 Pass Level             AMASK Feature Mask Value
See I_CTL[CHIP_ID], Table 5–11    307 (hexadecimal)
The bits set in the AMASK value in Table 2–15 are defined in Table 2–16.
Table 2–16 AMASK Bit Assignments
Bit  Meaning
0    Support for the byte/word extension (BWX)
     The instructions that comprise the BWX extension are LDBU, LDWU, SEXTB,
     SEXTW, STB, and STW.
1    Support for the square-root and floating-point convert extension (FIX)
     The instructions that comprise the FIX extension are FTOIS, FTOIT, ITOFF, ITOFS,
     ITOFT, SQRTF, SQRTG, SQRTS, and SQRTT.
2    Support for the count extension (CIX)
     The instructions that comprise the CIX extension are CTLZ, CTPOP, and CTTZ.
8    Support for the multimedia extension (MVI)
     The instructions that comprise the MVI extension are MAXSB8, MAXSW4,
     MAXUB8, MAXUW4, MINSB8, MINSW4, MINUB8, MINUW4, PERR, PKLB,
     PKWB, UNPKBL, and UNPKBW.
9    Support for precise arithmetic trap reporting in hardware. The trap PC is the same as
     the instruction PC after the trapping instruction is executed.
2.15.2 IMPLVER
For the 21264/EV67, the IMPLVER instruction returns the value 2.
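As a cross-check of Tables 2–15 and 2–16, the feature mask can be decoded bit by bit. The sketch below reads the Table 2–15 value as 307 hexadecimal and maps each set bit to its Table 2–16 meaning. It decodes the listed mask value only and does not model the operand handling of the AMASK instruction itself; the names `AMASK_BITS` and `decode_amask` are hypothetical.

```python
# Bit assignments from Table 2-16.
AMASK_BITS = {
    0: "BWX",
    1: "FIX",
    2: "CIX",
    8: "MVI",
    9: "precise arithmetic trap reporting",
}

EV67_AMASK = 0x307  # feature mask value from Table 2-15
EV67_IMPLVER = 2    # value returned by IMPLVER on the 21264/EV67


def decode_amask(mask):
    """List the features whose bits are set in an AMASK feature mask."""
    return [name for bit, name in sorted(AMASK_BITS.items()) if (mask >> bit) & 1]
```

Decoding `EV67_AMASK` yields all five features, which matches the five bits (0, 1, 2, 8, and 9) defined in Table 2–16.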
2.16 Design Examples
The 21264/EV67 can be designed into many different uniprocessor and multiprocessor
system configurations. Figures 2–12 and 2–13 illustrate two possible configurations.
These configurations employ additional system/memory controller chipsets.
Figure 2–12 shows a typical uniprocessor system with a second-level cache. This sys-
tem configuration could be used in standalone or networked workstations.
Figure 2–12 Typical Uniprocessor Configuration
[Block diagram FM-05573-EV67. Labeled blocks: 21264; L2 cache (tag store, data store);
21272 core logic chipset (control chips, data slice chips, host PCI bridge chip);
duplicate tag store (optional); DRAM arrays; 64-bit PCI bus. Labeled signal paths: tag,
address out, address in, address, data.]
Figure 2–13 shows a typical multiprocessor system, each processor with a second-level
cache. Each interface controller must employ a duplicate tag store to maintain cache
coherency. This system configuration could be used in a networked database server
application.
Figure 2–13 Typical Multiprocessor Configuration
[Block diagram FM-05574-EV67. Labeled blocks: two 21264 processors, each with an L2
cache; 21272 core logic chipset (control chip, data slice chips, two host PCI bridge
chips); two DRAM arrays, each with address and data paths; two 64-bit PCI buses.]
3 Hardware Interface
This chapter contains the 21264/EV67 microprocessor logic symbol and provides infor-
mation about signal names, their function, and their location. This chapter also
describes the mechanical specifications of the 21264/EV67. It is organized as follows:
The 21264/EV67 logic symbol
The 21264/EV67 signal names and functions
Lists of the signal pins, sorted by name and PGA location
The specifications for the 21264/EV67 mechanical package
The top and bottom views of the 21264/EV67 pinouts
3.1 21264/EV67 Microprocessor Logic Symbol
Figure 3–1 shows the logic symbol for the 21264/EV67 chip.
Figure 3–1 21264/EV67 Microprocessor Logic Symbol
[Logic symbol LK99-0051A. The symbol groups the 21264 signals as follows.]
System Interface:
SysAddIn_L[14:0], SysAddInClk_L, SysAddOut_L[14:0], SysAddOutClk_L, SysVref,
SysData_L[63:0], SysCheck_L[7:0], SysDataInClk_H[7:0], SysDataOutClk_L[7:0],
SysDataInValid_L, SysDataOutValid_L, SysFillValid_L
Bcache Interface:
BcAdd_H[23:4], BcData_H[127:0], BcCheck_H[15:0], BcDataInClk_H[7:0],
BcDataOutClk_x[3:0], BcDataOE_L, BcDataWr_L, BcTag_H[42:20], BcTagInClk_H,
BcTagOutClk_x, BcVref, BcTagDirty_H, BcTagParity_H, BcTagShared_H, BcTagValid_H,
BcTagOE_L, BcTagWr_L, BcLoad_L
Clocks:
ClkIn_x, FrameClk_x, EV6Clk_x, PLL_VDD (3.3 V)
Miscellaneous:
IRQ_H[5:0], ClkFwdRst_H, SromData_H, Tms_H, Trst_L, Tck_H, Tdi_H, PllBypass_H,
MiscVref, Reset_L, DCOK_H, SromClk_H, SromOE_L, TestStat_H, Tdo_H
3.2 21264/EV67 Signal Names and Functions
Table 3–1 defines the 21264/EV67 signal types referred to in this section.
Table 3–2 lists all signal pins in alphabetic order and provides a full functional descrip-
tion of the pins. Table 3–4 lists the signal pins and their corresponding pin grid array
(PGA) locations in alphabetic order for the signal type. Table 3–5 lists the pin grid array
locations in alphabetical order.
Table 3–1 Signal Pin Types Definitions
Signal Type   Definition
Inputs
I_DC_REF      Input DC reference pin
I_DA          Input differential amplifier receiver
I_DA_CLK      Input clock pin
Outputs
O_OD          Open drain output driver
O_OD_TP       Open drain driver for test pins
O_PP          Push/pull output driver
O_PP_CLK      Push/pull output clock driver
Bidirectional
B_DA_OD       Bidirectional differential amplifier receiver with open drain output
B_DA_PP       Bidirectional differential amplifier receiver with push/pull output
Other
Spare         Reserved to Compaq