R3000 Manual

Page Count: 354

Table of Contents

IDT R30xx Family
Software Reference Manual

Revision 1.0

© 1994 Integrated Device Technology, Inc.
Portions © 1994 Algorithmics, Ltd.
Chapter 16 contains some material that is © 1988 Prentice-Hall.
Appendices A & B contain material that is © 1994 by Mips Technology, Inc.


About IDT
Integrated Device Technology, Inc. has been a MIPS semiconductor
partner since 1988, and has led efforts to bring the high-performance
inherent in the MIPS architecture to embedded systems engineers. These
efforts include derivatives of MIPS R3xxx and R4xxx CPUs, development
tools, and applications support.
Additional information about IDT’s RISC family can be obtained from
your local sales representative. Alternatively, IDT can be reached directly at:
Corporate Marketing: (800) 345-7015
RISC Applications "Hotline": (408) 492-8208
RISC Applications FAX: (408) 492-8469
RISC Applications Internet: rischelp@idtinc.com

About Algorithmics
Much of this manual was written by Dominic Sweetman and Nigel
Stephens of Algorithmics Ltd in London, England, under contract to IDT.
Algorithmics were early enthusiasts for the MIPS architecture, designing
their first MIPS systems and system software in 1986/87. A small
engineering company, Algorithmics provide enabling technologies for
companies designing in both R30xx family CPUs and the 64-bit R4x00
architecture. This includes training, toolkits, GNU C support, and
evaluation boards. Dominic Sweetman can be reached at the following address:
Dominic Sweetman
Algorithmics Ltd
3 Drayton Park
London N5 1NU
ENGLAND.

phone: +44 71 700 3301
fax: +44 71 700 3400
email: dom@algor.co.uk


About This Manual

This manual is targeted to a systems programmer building an R30xx-based
system. It contains the architecture-specific operations and
programming conventions relevant to such a programmer.
This manual is not intended to be a tutorial on structured programming,
real-time operating systems, any particular high-level programming
language, or any particular toolchain. Other references are better suited to
those topics.
This manual does contain specific code fragments and the most
common programming conventions that are specific to the IDT R30xx
RISController family. The manual was consciously limited to the R30xx
family; some information relevant to the R4xxx family of processors may be
found here, but the device-specific programs (such as cache management,
exception handling, etc.) shown as examples are specific to the R30xx
family.
This manual contains references to the toolchains most commonly used
by the authors (IDT, Inc., and Algorithmics, Ltd.). Code fragments shown
are typically from software used by and/or provided by these companies,
including development tools such as IDT/c and software utilities (such as
IDT/kit, IDT/sim, and Micromonitor). A wide variety of other, third-party
products are also available to support R30xx development, under the
Advantage-IDT program. The reader of this manual is encouraged to look
at all the available tools to determine which toolchains and utilities best fit
the system development requirements.
Additional information on the IDT family of RISC processors, and their
support tools, is available from your local IDT salesman.


Integrated Device Technology, Inc. reserves the right to make changes to its products or specifications at
any time, without notice, in order to improve design or performance and to supply the best possible product.
IDT does not assume any responsibility for use of any circuitry described other than the circuitry embodied
in an IDT product. The Company makes no representations that circuitry described herein is free from patent
infringement or other rights of third parties which may result from its use. No license is granted by implication or otherwise under any patent, patent rights or other rights, of Integrated Device Technology, Inc.

LIFE SUPPORT POLICY
Integrated Device Technology's products are not authorized for use as critical components in life
support devices or systems unless a specific written agreement pertaining to such intended use is
executed between the manufacturer and an officer of IDT.
1. Life support devices or systems are devices or systems which (a) are intended for surgical implant
into the body or (b) support or sustain life and whose failure to perform, when properly used in
accordance with instructions for use provided in the labeling, can be reasonably expected to result in
a significant injury to the user.
2. A critical component is any component of a life support device or system whose failure to
perform can be reasonably expected to cause the failure of the life support device or system, or to
affect its safety or effectiveness.
The IDT logo is a registered trademark and BiCameral, BurstRAM, BUSMUX, CacheRAM, DECnet,
Double-Density, FASTX, Four-Port, FLEXI-CACHE, Flexi-PAK, Flow-thruEDC, IDT/c, IDTenvY, IDT/sae,
IDT/sim, IDT/ux, MacStation, MICROSLICE, Orion, PalatteDAC, REAL8, R3041, R3051, R3052, R3081,
R3721, R4600, RISCompiler, RISController, RISCore, RISC Subsystem, RISC Windows, SARAM, SmartLogic,
SyncFIFO, SyncBiFIFO, SPC, TargetSystem and WideBus are trademarks of Integrated Device Technology,
Inc.
MIPS is a registered trademark of MIPS Computer Systems, Inc.
All others are trademarks of their respective companies.


IDT R30xx Family
Software Reference Manual
Table of Contents
Introduction........................................................................................................................1
What is a RISC?......................................................................................................... 1-1
PIPELINES ................................................................................................................ 1-2
The IDT R3xxx Family CPUs ................................................................................... 1-3
MIPS Architecture Levels.......................................................................................... 1-4
MIPS-1 Compared with CISC Architectures............................................................. 1-4
Unusual Instruction Encoding Features ............................................................... 1-5
Addressing and Memory Accesses ...................................................................... 1-5
Operations not Directly Supported ...................................................................... 1-6
Multiply and Divide Operations ................................................................................ 1-7
Programmer-visible Pipeline Effects ......................................................................... 1-7
A Note on Machine and Assembler Language .......................................................... 1-8
MIPS-1 (R30xx) Architecture............................................................................................2
Programmer’s View of the Processor Architecture.................................................... 2-1
Registers..................................................................................................................... 2-1
Conventional Names and Uses of General-Purpose Registers .................................. 2-2
Notes on Conventional Register Names ............................................................. 2-2
Integer Multiply Unit and Registers .......................................................................... 2-3
Instruction Types ....................................................................................................... 2-4
Loading and Storing: Addressing Modes .................................................................. 2-5
Data types in Memory and Registers ......................................................................... 2-6
Integer Data Types .............................................................................................. 2-6
Unaligned Loads and Stores ............................................................................... 2-6
Floating Point Data in Memory .......................................................................... 2-7
Basic Address Space .................................................................................................. 2-8
Summary of System Addressing................................................................................ 2-9
Kernel vs. User Mode .......................................................................................... 2-9
Memory map for CPUs without MMU Hardware............................................. 2-10
Subsegments in the R3041 – Memory Width Configuration ...................... 2-10
System Control Coprocessor Architecture......................................................................3
CPU Control Summary .............................................................................................. 3-1
CPU Control and "CO-PROCESSOR 0"................................................................. 3-2
CPU Control Instructions..................................................................................... 3-2
Standard CPU control registers............................................................................ 3-3
PRId Register ................................................................................................ 3-4
SR Register .................................................................................................... 3-4
Cause Register ............................................................................................... 3-7
EPC Register ................................................................................................. 3-8
BadVaddr Register ........................................................................................ 3-8
R3041, R3071, and R3081 Specific Registers..................................................... 3-8
Count and Compare Registers (R3041 only) .................................................3-8
Config Register (R3071 and R3081) .............................................................3-8
Config Register (R3041) ...............................................................................3-9
BusCtrl Register (R3041 only) ....................................................................3-10
PortSize Register (R3041 only) ...................................................................3-11
What registers are relevant when?......................................................................3-11
Exception Management.....................................................................................................4
Exceptions ..................................................................................................................4-1
Precise Exceptions................................................................................................4-1
When Exceptions Happen ....................................................................................4-2
Exception vectors .................................................................................................4-2
Exception Handling – Basics................................................................................4-3
Nesting Exceptions ...............................................................................................4-4
An Exception Routine ..........................................................................................4-4
Interrupts...................................................................................................................4-12
Conventions and Examples ................................................................................4-14
Cache Management ...........................................................................................................5
Caches and Cache Management .................................................................................5-1
Cache Isolation and Swapping .............................................................................5-3
Initializing and Sizing the Caches ........................................................................5-4
Invalidation...........................................................................................................5-6
Testing and Probing..............................................................................................5-8
Configuration (R3041/71/81 only) .......................................................................5-8
Write Buffer................................................................................................................5-9
Implementing wbflush()......................................................................................5-10
Memory Management and the TLB ................................................................................6
Memory Management and the TLB ...........................................................................6-1
MMU Registers Described ...................................................................................6-3
EntryHi, EntryLo ...........................................................................................6-3
Index ..............................................................................................................6-4
Random ..........................................................................................................6-4
Context ...........................................................................................................6-4
MMU Control Instructions ...................................................................................6-5
Programming Interface to the TLB.......................................................................6-5
How Refill Happens ......................................................................................6-5
Using ASIDs ..................................................................................................6-6
The Random Register and Wired Entries ......................................................6-6
Memory Translation – Setup ................................................................................6-6
TLB Exception Sample Code ...............................................................................6-7
Basic Exception Handler ...............................................................................6-7
Fast kuseg Refill from Page Table ................................................................6-7
Simulating Dirty Bits............................................................................................6-8
Use of TLB in Debugging ..........................................................................................6-8
TLB Management Utilities.........................................................................................6-9
Reset Initialization.............................................................................................................7
Starting Up..................................................................................................................7-1
Probing and Recognizing the CPU .......................................................................7-4
Bootstrap Sequences .............................................................................................7-5
Starting Up an Application ...................................................................................7-5
Floating Point Coprocessor...............................................................................................8
The IEEE754 Standard and its Background .............................................................. 8-1
What is Floating Point?.............................................................................................. 8-2
IEEE exponent field and bias............................................................................... 8-3
IEEE mantissa and normalization........................................................................ 8-3
Strange values use reserved exponent values ...................................................... 8-3
MIPS FP Data formats ......................................................................................... 8-4
MIPS Implementation of IEEE754............................................................................ 8-5
Floating Point Registers............................................................................................. 8-6
Floating Point Exceptions/Interrupts.......................................................................... 8-6
The Floating Point Control/Status Register ............................................................... 8-6
Floating Point Implementation/Revision Register..................................................... 8-8
Guide to FP Instructions ............................................................................................ 8-8
Load/Store............................................................................................................ 8-8
Move Between Registers ..................................................................................... 8-9
3-Operand Arithmetic Operations........................................................................ 8-9
Unary (sign-changing) Operations..................................................................... 8-10
Conversion Operations....................................................................................... 8-10
Conditional Branch and Test Instructions.......................................................... 8-10
Instruction Timing Requirements ............................................................................ 8-12
Instruction Timing for Speed ................................................................................... 8-12
Initialization and Enable On Demand...................................................................... 8-12
Floating Point Emulation ......................................................................................... 8-13
Assembler Language Programming.................................................................................9
Syntax Overview........................................................................................................ 9-1
Key Points to Note ............................................................................................... 9-1
Register-to-Register Instructions ............................................................................... 9-2
Immediate (Constant) Operands ................................................................................ 9-3
Multiply/Divide.......................................................................................................... 9-4
Load/Store Instructions.............................................................................................. 9-5
Unaligned Loads and Stores................................................................. 9-5
Addressing Modes ..................................................................................................... 9-6
Gp-Relative Addressing....................................................................................... 9-6
Jumps, Subroutine Calls and Branches...................................................................... 9-8
Conditional Branches................................................................................................. 9-8
Co-processor Conditional Branches .................................................................... 9-9
Compare and Set ........................................................................................................ 9-9
Coprocessor Transfers ............................................................................................... 9-9
Coprocessor Hazards ......................................................................................... 9-10
Assembler Directives ............................................................................................... 9-10
Sections .............................................................................................................. 9-10
.text, .rdata, .data ......................................................................................... 9-10
.lit4, .lit8 ...................................................................................................... 9-10
Program Segments in Memory ................................................................... 9-11
.bss .............................................................................................................. 9-12
.sdata, .sbss .................................................................................................. 9-12
Stack and Heap ........................................................................................... 9-12
Special Symbols .......................................................................................... 9-12
Data Definition and Alignment.......................................................................... 9-12
.byte, .half, .word ........................................................................................ 9-13
.float, .double .............................................................................................. 9-13
.ascii, .asciiz ................................................................................................ 9-13
.align ............................................................................................................ 9-13
.comm, .lcomm ........................................................................................... 9-13
.space ........................................................................................................... 9-14
Symbol Binding Attributes ................................................................................ 9-14
.globl ........................................................................................................... 9-14
.extern .......................................................................................................... 9-15
.weakext ...................................................................................................... 9-15
Function Directives............................................................................................ 9-15
.ent, .end ...................................................................................................... 9-15
.aent ............................................................................................................. 9-16
.frame, .mask, .fmask .................................................................................. 9-16
Assembler Control (.set) .................................................................................... 9-17
.set noreorder/reorder .................................................................................. 9-17
.set volatile/novolatile ................................................................................. 9-17
.set noat/at ................................................................................................... 9-18
.set nomacro/macro ..................................................................................... 9-18
.set nobopt/bopt ........................................................................................... 9-18
The Complete Guide to Assembler Instructions...................................................... 9-18
Alphabetic List of Assembler Instructions .............................................................. 9-30
C Programming................................................................................................................10
The Stack, Subroutine Linkage, Parameter Passing ................................................ 10-1
Stack Argument Structure.................................................................................. 10-1
Which Arguments go in What Registers ........................................................... 10-1
Examples from the C Library ............................................................................ 10-2
Exotic Example; Passing Structures .................................................................. 10-2
How Printf() and Varargs Work ........................................................................ 10-3
Returning Value from a Function ...................................................................... 10-4
Macros for Prologues and Epilogues ................................................................. 10-4
Stack-Frame Allocation ..................................................................................... 10-4
Leaf Functions ............................................................................................ 10-4
Non-Leaf Functions .................................................................................... 10-5
Functions Needing Run-Time Computed Stack Locations ........................ 10-7
Shared and Non-Shared Libraries............................................................................ 10-9
Sharing Code in Single-Address Space Systems ............................................... 10-9
Sharing Code Across Address Spaces ............................................................. 10-10
An Introduction to Optimization............................................................................ 10-11
Common Optimizations ................................................................................... 10-11
How to Prevent Unwanted Effects From Optimization................................... 10-14
Optimizer-Unfriendly Code and How to Avoid It........................................... 10-15
Portability Considerations ..............................................................................................11
Writing Portable C ................................................................................................... 11-1
C Language Standards ...................................................................................... 11-1
C Library Functions and POSIX ....................................................................... 11-2
Data Representations and Alignment....................................................................... 11-3
Notes on Structure Layout and Padding ............................................................ 11-3
Isolating System Dependencies ............................................................................... 11-5
Locating System Dependencies ......................................................................... 11-5
Fixing Up Dependencies.................................................................................... 11-5
Isolating Non-Portable Code ....................................................................... 11-6
Using Assembler................................................................................................ 11-6
Endianness ............................................................................................................... 11-7
What It Means to the Programmer..................................................................... 11-8
Bitfield Layout and Endianness .................................................................. 11-9
Changing the Endianness of a MIPS CPU....................................................... 11-10
Designing and Specifying for Configurable Endianness ................................. 11-10
Read-Only Instruction Memory ................................................................ 11-10
Writable (Volatile) Memory ..................................................................... 11-11
Byte-Lane Swapping ................................................................................. 11-11
Configurable IO Controllers ..................................................................... 11-12
Portability and Endianness-Independent Code ................................................ 11-13
Endianness-Independent Code .................................................................. 11-13
Compatibility Within the R30XX Family.............................................................. 11-13
Porting to MIPS: Frequently Encountered Issues.................................................. 11-15
Considerations for Portability to Future Devices................................................... 11-16
Writing Power-On Diagnostics.......................................................................................12
Golden Rules for Diagnostics Programming ........................................................... 12-1
What Should Tests Do? ........................................................................................... 12-2
How to Test the Diagnostic Tests? .......................................................................... 12-3
Overview of Algorithmics’ Power-On Selftest........................................................ 12-3
Starting Points.................................................................................................... 12-3
Control and Environment Variables .................................................................. 12-4
Reporting............................................................................................................ 12-4
Unexpected Exceptions During Test Sequence ................................................. 12-5
Driving Test Output Devices ............................................................................. 12-5
Restarting the System ........................................................................................ 12-5
Standard Test Sequence ..................................................................................... 12-5
Notes on the Test Sequence ............................................................................... 12-6
Annotated Examples from the Test Code .......................................................... 12-9
Instruction Timing and Optimization............................................................................13
Notes and Examples........................................................................................... 13-1
Additional Hazards .................................................................................................. 13-2
Early Modification of HI and LO ...................................................................... 13-2
Bitfields in CPU Control Registers.................................................................... 13-3
Non-Obvious Hazards........................................................................................ 13-3
Software Tools for Board Bring-Up...............................................................................14
Tools Used in Debug ............................................................................................... 14-1
Initial Debugging ..................................................................................................... 14-2
Porting Micromonitor .............................................................................................. 14-2
Running Micromonitor ............................................................................................ 14-2
Initial IDT/SIM Activity .......................................................................................... 14-2
A Final Note on IDT/KIT ........................................................................................ 14-3
Software Design Examples ..............................................................................................15
Application Software ............................................................................................... 15-1
Memory Map ..................................................................................................... 15-1
Starting Up ......................................................................................................... 15-1
C Library Functions ........................................................................................... 15-2
Input and Output ......................................................................................... 15-3
Character Class Tests .................................................................................. 15-3
String Functions .......................................................................................... 15-3
Mathematical Functions .............................................................................. 15-3
Utility Functions ......................................................................................... 15-3
Diagnostics .................................................................................................. 15-4
Variable Argument Lists ............................................................................. 15-4
Non-Local Jumps ........................................................................................ 15-4
Signals ......................................................................................................... 15-4
Date and Time ............................................................................................. 15-4
Running the Program ......................................................................................... 15-4
Debugging the Program ..................................................................................... 15-5
Embedded System Software .................................................................................... 15-5
Memory Map ..................................................................................................... 15-6
Starting Up ......................................................................................................... 15-6
Embedded System Library Functions................................................................ 15-7
Trap and Interrupt Handling ....................................................................... 15-8
Simple Interrupt Routines ........................................................................... 15-8
Floating-Point Traps and Interrupts ............................................................ 15-9
Emulating Floating Point Instructions ...................................................... 15-10
Debugging........................................................................................................ 15-10
Unix-Like System S/W .......................................................................................... 15-11
Terminology..................................................................................................... 15-11
Components of a Process ................................................................................. 15-12
System Calls and Protection ............................................................................ 15-13
What the Kernel Does...................................................................................... 15-13
Virtual Memory Implementation for MIPS ..................................................... 15-14
Interrupt Handling for MIPS............................................................................ 15-15
How it Works ............................................................................................ 15-16
Assembly Language Programming Tips........................................................................16
32-bit Address or Constant Values .................................................................... 16-1
Use of “Set” Instructions ................................................................................... 16-1
Use of “Set” with Complex Branch Operations ......................................... 16-2
Carry, Borrow, Overflow, and Multi-Precision Math ................................. 16-2
Machine Instructions Reference (Appendix A)..............................................................A
CPU Instruction Overview.................................................................................. A-1
Instruction Classes .............................................................................................. A-1
Instruction Formats ............................................................................................. A-2
Instruction Notation Conventions ....................................................................... A-2
Instruction Notation Examples ..................................................................... A-3
Load and Store Instructions ................................................................................ A-4
Jump and Branch Instructions............................................................................. A-5
Coprocessor Instructions..................................................................................... A-5
System Control Coprocessor (CP0) Instructions ................................................ A-6
Instruction Set Details.............................................................................................. A-6
Instruction Summary......................................................................................... A-79
FPA Instruction Reference (Appendix B).......................................................................B
FPU Instruction Set Details .................................................................................B-1
FPU Instructions ...........................................................................................B-1
Floating-Point Data Transfer ........................................................................B-1
Floating-Point Conversions ..........................................................................B-1
Floating-Point Arithmetic .............................................................................B-2
Floating-Point Register-to-Register Move ....................................................B-2
Floating-Point Branch ...................................................................................B-2
FP Computational Instructions and Valid Operands ...........................................B-2
FP Compare and Condition values ......................................................................B-3
FPU Register Specifiers.......................................................................................B-3
32-bit CP1 registers..............................................................................................B-4
FPU Register Access for 32-bit CP1 Registers..............................................B-5
Instruction Notation Conventions ..................................................................B-5
Load and Store Memory ......................................................................................B-6
Instruction Descriptions .......................................................................................B-6
FPA Instruction Set Summary ...........................................................................B-27
CP0 Operation Reference (Appendix C) ........................................................................C
CP0 Operation Details .........................................................................................C-1
MMU Operations .................................................................................................C-1
Exception Operations...........................................................................................C-1
Dand Register Movement Operations............................................................C-1
Operation Descriptions ........................................................................................C-1
Assembler Language Syntax (Appendix D)....................................................................D
Object Code Formats (Appendix E)................................................................................E
Sections and Segments...............................................................................................E-1
ECOFF Object File Format (RISC/OS).....................................................................E-1
File Header...........................................................................................................E-2
Optional a.out Header ..........................................................................................E-2
Example Loader ...................................................................................................E-3
Further Reading ...................................................................................................E-4
ELF (MIPS ABI)........................................................................................................E-4
File Header...........................................................................................................E-4
Program Header ...................................................................................................E-5
Example Loader ...................................................................................................E-6
Further Reading ...................................................................................................E-7
Object Code Tools .....................................................................................................E-7
Glossary of Common "MIPS" Terms............................................................................. F
DRAWINGS
1.1
MIPS 5-Stage Pipeline.......................................................................................... 1-2
1.2
The Pipeline and Branch Delays.......................................................................... 1-7
1.3
The Pipeline and Load Delays ............................................................................. 1-8
3.1
PRId Register Fields ............................................................................................ 3-4
3.2
Fields in Status Register....................................................................................... 3-4
3.3
Fields in the Cause Register................................................................................. 3-7
3.4
Fields in the R3071/81 Config Register............................................................... 3-8
3.5
Fields in the R3041 Config (Cache Configuration)Register................................ 3-9
3.6
Fields in the R3041 Bus Control (BusCtrl) Register ......................................... 3-10
5.1
Direct Mapped Cache .......................................................................................... 5-1
6.1
EntryHi and EntryLo Register Fields .................................................................. 6-3
6.2
EntryHi and EntryLo Register Fields .................................................................. 6-3
6.3
Fields in the Index Register ................................................................................. 6-4
6.4
Fields in the Random Register............................................................................. 6-4
6.5
Fields in the Context Register.............................................................................. 6-4
8.1
FPA Control/Status Register Fields ..................................................................... 8-6
8.2
FPA Implementation/Revision Register .............................................................. 8-8
9.1
Program Segments in Memory .......................................................................... 9-11
10.1
Stackframe for a Non-Leaf Function ................................................................. 10-5
11.1
Structure Layout and Padding in Memory......................................................... 11-3
11.2
Data Representation with #pragma Pack(1) ...................................................... 11-4
11.3
Data Representation with #pragma Pack(2) ...................................................... 11-5
11.4
Typical Big-Endians Picture .............................................................................. 11-8
11.5
Little Endians Picture......................................................................................... 11-8
11.6
Bitfields and Big-Endian.................................................................................... 11-9
11.7
Bitfields and Little-Endian............................................................................... 11-10
11.8
Garbled String Storage when Mixing Modes .................................................. 11-11
11.9
Byte-Lane Swapper.......................................................................................... 11-12
15.1
Memory Layout of a BSD Process .................................................................. 15-12
A.1
CPU Instruction Formats .................................................................................... A-2

TABLES
1.1
R30xx Family Members Compared..................................................................... 1-4
2.1
Conventional Names of Registers with Usage Mnemonics................................. 2-2
3.1
Summary of CPU Control Registers (Not MMU) ............................................... 3-3
3.2
ExcCode Values: Different kinds of Exceptions ................................................. 3-7
4.1
Reset and Exception Entry Points (Vectors) for R30xx Family .......................... 4-3
4.2
Interrupt Bitfields and Interrupt Pins ................................................................. 4-13
6.1
CPU Control Registers for Memory Management .............................................. 6-3
8.1
Floating Point Data Formats ................................................................................ 8-4
8.2
Rounding Modes Encoded in FP Control/Status Register................................... 8-7
8.4
FP Move Instructions........................................................................................... 8-9
8.5
FPA 3-Operand Arithmetic................................................................................ 8-10
8.6
FPA Sign-Changing Operators .......................................................................... 8-10
8.7
FPA Data Conversion Operations...................................................................... 8-10
8.8
FP Test Instructions ........................................................................................... 8-11
9.1
Assembler Register and Identifier Conventions ................................................ 9-20
9.2
Assembler Instructions....................................................................................... 9-20
12.1 Test Sequence in Brief ....................................................................................... 12-5
16.1 32-bit Immediate Values.................................................................................... 16-1
16.2 Add-With-Carry................................................................................................. 16-2
16.3 Subtract-with-Borrow Operation ....................................................................... 16-3
A.1
CPU Instruction Operation Notations................................................................. A-3
A.2
Load and Store Common Function ..................................................................... A-4
A.3
Access Type Specifications for Load/Store........................................................ A-5
B.1
Format Field Decoding ........................................................................................B-2
B.2
Logical Negation of Predicates by Condition True/False....................................B-3
B.3
Valid FP Operand Specifiers with 32-bit Coprocessor 1 Registers.....................B-4
B.4
Load and Store Common Functions ....................................................................B-6
INTRODUCTION

CHAPTER 1

Integrated Device Technology, Inc.

1

IDT’s R30xx family of RISC microcontrollers includes the R3051,
R3052, R3071, R3081 and R3041 processors. The different members of
the family offer different price/performance trade-offs, but are all basically
integrated versions of the MIPS R3000A CPU. The R3000A CPU is well
known for the high-performance Unix systems implemented around it; less
publicized but equally impressive is the performance it has brought to a
wide variety of embedded applications.
IDT’s RISController family also includes devices built around MIPS
R4000 64-bit microprocessor technology. These devices, such as the IDT
R4600 Orion microprocessor, offer even higher levels of performance than
the R3000A derivative family. However, these devices also feature slightly
different OS models, and allow 64-bit kernels and applications. Thus, they
are sufficiently different from the R30xx family that this manual is focused
exclusively on the R30xx family.
This manual is aimed at the programmer dealing with the IDT R30xx
family components. Although most programming occurs using a high-level
language (usually “C”), and with little awareness of the underlying system
or processor architecture, certain operations require the programmer to
use assembly programming, and/or be aware of the underlying system or
processor structure. This manual is designed to be consulted when
addressing these types of issues.

WHAT IS A RISC?
The MIPS CPU is one of the “RISC’’ CPUs, born out of a particularly
fertile period of academic research and development. RISC CPUs
(‘‘Reduced Instruction Set Computer’’) share a number of architectural
attributes to facilitate the implementation of high-performance processors.
Most new architectures (as opposed to implementations) since 1986 owe
their remarkable performance to features developed a few years earlier by
a couple of seminal research projects. Someone commented that ‘‘a RISC
is any computer architecture defined after 1984’’; although meant as a jibe
at the industry’s use of the acronym, the comment’s truth also derives
from the widespread acceptance of the conclusions of that research.
One of these was the ‘‘MIPS’’ project at Stanford University. The project
name MIPS puns on the familiar ‘‘millions of instructions per second’’ by
taking its name from the key phrase ‘‘Microcomputer without Interlocked
Pipeline Stages’’. The Stanford group’s work showed that pipelining, a
well-known technique for speeding up computers, had been under-exploited by
earlier architectures.

PIPELINES

                 I-cache   register   ALU    D-cache   register
                           file                        file
Instruction
sequence
  instr 1        IF        RD         ALU    MEM       WB
  instr 2                  IF         RD     ALU       MEM       WB
  instr 3                             IF     RD        ALU       MEM       WB

                                                                      Time

Figure 1.1. MIPS 5-stage pipeline

Pipelined processors operate by breaking instruction execution into
multiple small independent “stages”; since the stages are independent,
multiple instructions can be in varying states of completion at any one
time. Also, this organization tends to facilitate higher frequencies of
operation, since very complex activities can be broken down into “bite-sized” chunks. The result is that multiple instructions are executing at any
one time, and that instructions are initiated (and completed) at very high
frequency. MIPS has consistently been among the most aggressive in the
utilization of these techniques.
Pipelining depends for its success on another technique: using caches
to reduce the amount of time spent waiting for memory. The MIPS R3000A
architecture uses separate instruction and data caches, so it can fetch an
instruction and read or write a memory variable in the same clock phase.
By mating high-frequency operation to high memory bandwidth, very
high performance is achieved.
In CISC architectures, caches are often seen as part of memory. A RISC
architecture makes more sense if the dual caches are regarded as very
much part of the CPU; in fact, the pipelines of virtually all RISC processors
require caches to maintain execution. The CPU normally runs from cache
and a cache miss (where data or instructions have to be fetched from
memory) is seen as an exceptional event.
For the R3000A and its derivatives, instruction execution is divided into
five phases (called pipestages), with each pipestage taking a fixed amount
of time (see “MIPS 5-stage pipeline” on page 1-2). Again, note that this
model assumes that instruction fetches and data accesses can be satisfied
from the processor caches at the processor operation frequency. All
instructions are rigidly defined to follow the same sequence of pipestages,
even where the instruction does nothing at some stage.
The net result is that, so long as it keeps hitting the cache, the CPU
starts an instruction every clock.
Figure 1.1, “MIPS 5-stage pipeline”, illustrates this operation.
Instruction execution activity can be described as occurring in the
individual pipestages:
• IF : (‘‘instruction fetch’’) gets the next instruction from the instruction
cache (I-cache).
• RD : (‘‘read registers’’) decodes the instruction and fetches the
contents of any CPU registers it uses.
• ALU : (‘‘arithmetic/logic unit’’) performs an arithmetic or logical
operation in one clock (floating point math and integer multiply/
divide can’t be done in one clock and are done differently; this is
described later).

• MEM : the stage where the instruction can read/write memory
variables in the data cache (D-cache). Note that for typical programs,
three out of four instructions do nothing in this stage; but allocating
the stage to each instruction ensures that the processor never has
two instructions wanting the data cache at the same time.
• WB : (‘‘write back’’) stores the value obtained from an operation back to
the register file.
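
The stage-by-stage description above can be sketched as a small model in C. This is an illustration only, not part of any IDT tool: the names stage_index, stage_name and clocks_to_complete are hypothetical, and the model assumes that every fetch and data access hits the cache, so one instruction starts on every clock and nothing stalls.

```c
#include <stddef.h>

/* The five pipestages, in the order every instruction passes through them. */
enum { NUM_STAGES = 5 };
static const char *const STAGE_NAMES[NUM_STAGES] = {
    "IF", "RD", "ALU", "MEM", "WB"
};

/*
 * Index (0 = IF ... 4 = WB) of the stage that instruction `instr` occupies
 * during clock `clock`, or -1 if it is not in the pipeline at that time.
 * Both arguments count from 0.
 * Assumption (hypothetical model): no stalls, one instruction issued per clock.
 */
int stage_index(int instr, int clock)
{
    int s = clock - instr;
    if (instr < 0 || s < 0 || s >= NUM_STAGES)
        return -1;
    return s;
}

/* Name of the stage, or NULL when the instruction is not in the pipeline. */
const char *stage_name(int instr, int clock)
{
    int s = stage_index(instr, clock);
    return s < 0 ? NULL : STAGE_NAMES[s];
}

/* Clocks needed to run n back-to-back instructions to completion. */
int clocks_to_complete(int n)
{
    return n <= 0 ? 0 : n + NUM_STAGES - 1;
}
```

In this model the second instruction (index 1) is in its ALU stage during the fourth clock (index 3), while the first instruction is in MEM; three instructions need seven clocks in total, which matches the seven columns of Figure 1.1.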
A rigid pipeline does limit the kinds of things instructions can do; in
particular:
• Instruction length : ALL instructions are 32 bits (exactly one machine
‘‘word’’) long, so that they can be fetched in a constant time. This itself
discourages complexity; there are not enough bits in the instruction
to encode really complicated addressing modes, for example.
• No arithmetic on memory variables : data from cache or memory is
obtained only in stage 4, which is much too late to be available to the
ALU. Memory accesses occur only as simple load or store instructions
which move the data to or from registers (this is described as a ‘‘load/
store architecture’’).
However, the MIPS project architects also attended to the best thinking
of the time about what makes a CPU an easy target for efficient optimizing
compilers. So MIPS CPUs have 32 general purpose registers, 3-operand
arithmetical/logical instructions and eschew complex special-purpose
instructions which compilers can’t usually generate.

THE IDT R3xxx FAMILY CPUS
MIPS Corporation was formed in 1984 to make a commercial version of
the Stanford MIPS CPU. The commercial CPU was enhanced with memory
management hardware, first appearing late in 1985 as the R2000. An
ambitious external floating point math co-processor (the R2010 FPA) first
shipped in mid-87. The R3000, shipped in 1988, is almost identical from
the programmer’s viewpoint (although small hardware enhancements
combined to give a substantial boost to performance). The R3000A followed
in 1989, to improve the frequency of operation over the original
R3000 (other minor enhancements were added, such as the ability for user
tasks to operate with the opposite “endianness” from the kernel).
The R2000/R3000 chips include a cache controller – the
implementation of external caches merely required a few industry
standard SRAMs and some address latches. The math co-processor shares
the cache buses to interpret instructions (in parallel with the integer CPU)
and transfer operands and results between the FPA and memory or the
integer CPU.
The division of function was ingenious, practical and workable, allowing
the R2000/3000 generation to be built without extravagant ultra-high pin-count packages. However, as clock speeds increased the very high-speed
signals in the cache interface increased design complexity and limited
operational frequency. In addition, overall chip count for the basic
execution core proved to be a limitation for area and power sensitive
embedded systems.
The R3051, R3052, R3071, R3081 and R3041 are the members (so far)
of a family of products defined, designed, and manufactured by IDT. The
chips integrate the functions of the R3000A CPU, cache memory and
(R3081 only) math co-processor. This means that all the fastest logic is on
chip; so the integrated chips are not only cheaper and smaller than the
original implementation, but also much easier to use.
The parts differ in their cache sizes, whether they include on-chip MMU
and/or FPA, clock rates and packaging options. In addition, although all
parts can be used pin-compatibly, certain products feature optional
enhancements in their bus-interface that may serve to reduce system cost
or complexity, and other subtle enhancements for cost or performance.
The major differences are summarized in Table 1.1, “R30xx family
members compared”.
Part    Cache (I+D)     MMU   FPA   Clock (MHz)   Package Options   System Interface
3051    4K + 1K         –     –     20-40         PLCC              32-bit MUX’ed A/D
3051E   4K + 1K         ×     –     20-40         PLCC              32-bit MUX’ed A/D
3052    8K + 2K         –     –     20-40         PLCC              32-bit MUX’ed A/D
3052E   8K + 2K         ×     –     20-40         PLCC              32-bit MUX’ed A/D
3081    16K+4K/8K+8K    –     ×     20-50         PLCC              Optional 1/2 frequency bus operation; Optional 1x Clock Input
3081E   16K+4K/8K+8K    ×     ×     20-50         PLCC              Optional 1/2 frequency bus operation; Optional 1x Clock Input
3071    16K+4K/8K+8K    –     –     33-50         PLCC              1/2 frequency bus operation; 1x Clock Input
3071E   16K+4K/8K+8K    ×     –     33-50         PLCC              1/2 frequency bus operation; 1x Clock Input
3041    2K + 0.5K       –     –     16-25         PLCC, TQFP        Variable port width interface

Table 1.1. R30xx family members compared

MIPS ARCHITECTURE LEVELS
There are multiple generations of the MIPS architecture. The most
commonly discussed are the MIPS-1, MIPS-2, and MIPS-3 architectures.
MIPS-1 is the ISA found in the R2000 and R3000 generation CPUs. It is
a 32-bit ISA, and defines the basic instruction set. Any application written
with the MIPS-1 instruction set will operate correctly on all generations of
the architecture.
The MIPS-2 ISA is also 32-bit. It adds some instructions to speed up
floating point data movement, branch-likely instructions, and other minor
enhancements. This was first implemented in the MIPS R6000 ECL
microprocessor.
The MIPS-3 ISA is a 64-bit ISA. In addition to supporting all MIPS-1 and
MIPS-2 instructions, the MIPS-3 ISA contains 64-bit equivalents of certain
earlier instructions that are sensitive to operand size (e.g. load double and
load word are both supported), including doubleword (64-bit) data
movement and arithmetic. This ISA was first implemented in the R4000 as
a clean (“seamless”) transition from the existing 32-bit architecture.
Note that these ISA levels do not necessarily imply a particular structure
for the MMU, caches, exception model, or other kernel specific resources.
Thus, different implementations of ISA compatible chips may require
different kernels.
In the case of the R30xx family, all devices implement the MIPS-1 ISA.
Many devices are also kernel compatible with the R3000A, but some
devices (most notably those without an MMU) may require small kernel
changes or different boot modules†.

† Historically, many embedded MIPS applications have run
exclusively out of the “kseg0 and kseg1” memory regions
(described later in the book). For these applications, the presence
or absence of the MMU is largely irrelevant.

MIPS-1 COMPARED WITH CISC ARCHITECTURES
Although the MIPS architecture is fairly straightforward, there are a few
features, visible only to assembly programmers, which may at first appear
surprising. In addition, operations familiar to CISC architectures are
irrelevant to the MIPS architecture. For example, the MIPS architecture
does not mandate a stack pointer or stack usage; thus, programmers may
be surprised to find that push/pop instructions do not exist directly.
The most notable of these features are summarized here.

Unusual instruction encoding features
• All instructions are 32 bits long : as mentioned above. This means, for
example, that it is impossible to incorporate a 32-bit constant into a
single instruction (there would be no instruction bits left to encode
the operation and the registers!). A ‘‘load immediate’’ instruction is
limited to a 16-bit value; a special ‘‘load upper immediate’’ must be
followed by an ‘‘or immediate’’ to put a 32-bit constant value into a
register.
• Instruction actions must fit the pipeline : actions can only be carried out
in the designated pipeline phase, and must be complete in one clock.
For example, the register writeback phase provides for just one value
to be stored in the register file, so instructions can only change one
register.
• 3-operand instructions : arithmetic/logical operations don’t have to
specify memory locations, so there are plenty of instruction bits to
define two independent source and one destination register.
Compilers love 3-operand instructions, which give optimizers more
scope to improve the code which handles complex expressions.
• 32 registers : the choice of 32 has become universal; compilers like a
large (but not necessarily too large) number of registers, but there is
a cost in context-saving and in encoding the registers to be used by
an instruction. Register $0 always returns zero, to give a compact
encoding of that useful constant.
• No condition codes : the MIPS architecture does not provide condition
code flags implicitly set by arithmetical operations. The motivation is
to make sure that execution state is stored in one place – the register
file. Conditional branches (in MIPS) test a single register for sign/zero,
or a pair of registers for equality.
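
The 16-bit immediate limit described in the first bullet can be illustrated with a short C model of the two-instruction sequence. This is a sketch only: the function names are hypothetical, and the model assumes the standard MIPS-1 behaviour in which ‘‘load upper immediate’’ clears the low 16 bits of the destination and ‘‘or immediate’’ zero-extends its operand.

```c
#include <stdint.h>

/* Upper 16 bits of the constant: the immediate carried by "load upper immediate". */
uint16_t upper_half(uint32_t value)
{
    return (uint16_t)(value >> 16);
}

/* Lower 16 bits of the constant: the immediate carried by "or immediate". */
uint16_t lower_half(uint32_t value)
{
    return (uint16_t)(value & 0xFFFFu);
}

/* Model of "load upper immediate": imm goes in bits 31..16, bits 15..0 are zero. */
uint32_t model_lui(uint16_t imm)
{
    return (uint32_t)imm << 16;
}

/* Model of "or immediate": the 16-bit immediate is zero-extended, then ORed in. */
uint32_t model_ori(uint32_t reg, uint16_t imm)
{
    return reg | (uint32_t)imm;
}

/* The two-instruction sequence that builds any 32-bit constant in a register. */
uint32_t load_constant(uint32_t value)
{
    uint32_t reg = model_lui(upper_half(value));
    return model_ori(reg, lower_half(value));
}
```

For the constant 0x12345678 the two immediates are 0x1234 and 0x5678. Because ‘‘or immediate’’ zero-extends, the second step cannot disturb the upper half set by the first, so no correction of the upper immediate is needed.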

Addressing and memory accesses
• Memory references are always register loads and stores : arithmetic on
memory variables upsets the pipeline, so is not done. Memory
references only occur due to explicit load or store instructions. The
large register file allows multiple variables to be “on-chip”
simultaneously.
• Only one data addressing mode : all loads and stores define the
memory location with a single base register value modified by a 16-bit
signed displacement. Note that the assembler/compiler tools can use
the $0 register, along with the immediate value, to synthesize
additional addressing modes from this one directly supported mode.
• Byte-addressed : the instruction set includes load/store operations
for 8- and 16-bit variables (referred to as byte and halfword). Partial-word load instructions come in two flavors – sign-extend and zero-extend.
• Loads/stores must be address-aligned : memory word operations can
only load or store data from a single 4-byte aligned word; halfword
operations must be aligned on half-word addresses. Many CISC
microprocessors will load/store a multi-byte item from any byte
address (although unaligned transfers always take longer).
Techniques to generate code which will handle unaligned data
efficiently will be explained later.
• Jump instructions : The smallest op-code field in a MIPS instruction is
6 bits; leaving 26 bits to define the target of a jump. Since all
instructions are 4-byte aligned in memory the two least-significant
address bits need not be stored, allowing an address range of 2^28 =
256Mbytes. Rather than make this branch PC-relative, this is
interpreted as an absolute address within a 256Mbyte ‘‘segment’’. In
theory, this could impose a limit on the size of a single program; in
reality, it hasn’t been a problem.
Branches out of segment can be achieved by using a jr instruction,
which uses the contents of a register as the target.
Conditional branches have only a 16-bit displacement field (2^18 byte
range since instructions are 4-byte aligned) which is interpreted as a
signed PC-relative displacement. Compilers can only code a simple
conditional branch instruction if they know that the target will be
within 128Kbytes of the instruction following the branch.
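
The jump and branch ranges described above can be checked with a few lines of C. This is only an illustrative sketch: the function names are invented for this example, and the code takes the upper four address bits of a jump from the address of the delay-slot instruction (the jump's address plus 4), which is how the MIPS architecture specifications define it. That differs from the jump's own address only when the jump is the last word of a 256Mbyte segment.

```c
#include <assert.h>
#include <stdint.h>

/* Target of j/jal: a 26-bit word address inside the 256Mbyte segment
 * that holds the delay-slot instruction (pc + 4). */
uint32_t jump_target(uint32_t pc, uint32_t target26)
{
    assert((pc & 3u) == 0);              /* instructions are word-aligned */
    assert(target26 < (1u << 26));
    return ((pc + 4u) & 0xF0000000u) | (target26 << 2);
}

/* Target of a conditional branch: a signed 16-bit word offset counted
 * from the instruction following the branch. */
uint32_t branch_target(uint32_t pc, int16_t offset16)
{
    assert((pc & 3u) == 0);
    return pc + 4u + (uint32_t)((int32_t)offset16 * 4);
}

/* Non-zero if a simple conditional branch at pc can reach dest. */
int branch_in_range(uint32_t pc, uint32_t dest)
{
    int64_t delta = (int64_t)dest - ((int64_t)pc + 4);
    return (delta % 4) == 0 && delta >= -131072 && delta <= 131068;
}
```

With these definitions an offset of -1 yields the address of the branch itself, and the reachable window is 32768 words behind to 32767 words ahead of the delay-slot instruction.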

Operations not directly supported
• No byte or halfword arithmetic : all arithmetical and logical operations
are performed on 32-bit quantities. Byte and/or halfword arithmetic
would require significant extra resources, many more op-codes, and
is an understandable omission. Most C programmers will use the int
data type for most arithmetic, and for MIPS an int is 32 bits and such
arithmetic will be efficient. C’s rules promote char and short operands
to int before arithmetic, so most expressions are evaluated as int anyway.
However, where a program explicitly does arithmetic as short the
compiler must insert extra code to make sure that wraparound and
overflows have the appropriate effect.
• No special stack support : conventional MIPS assembler usage does
define a sp register, but the hardware treats sp just like any other
register. There is a recommended format for the stack frame layout of
subroutines, so that programs can mix modules from different
languages and compilers; it is recommended that programmers stick
to these conventions, but they have no relationship to the hardware.
• Minimal subroutine overhead : there is one special feature; jump
instructions have a ‘‘jump and link’’ option which stores the return
address into a register. $31 is the default, so for convenience and by
convention $31 becomes the ‘‘return address’’ register.
• Minimal interrupt overhead : The MIPS architecture makes very few
presumptions about system exception handling, allowing fast
response and a wide variety of software models. In the R30xx family,
the CPU stashes away the restart location in the special register EPC,
modifies the machine state just enough to signal why the trap
happened and to disallow further interrupts; then it jumps to a single
predefined location† in low memory. Everything else is up to the
software.
Just to emphasize this: on an interrupt or trap a MIPS CPU does not
store anything on a stack, or write memory, or preserve any registers
by itself.
By convention, two registers ($k0, $k1; register conventions are
explained in chapter 2) are reserved so that interrupt/trap routines
can ‘‘bootstrap’’ themselves – it is impossible to do anything on a MIPS
CPU without using some registers. For a program running in any
system which takes interrupts or traps, the values of these registers
may change at any time, and thus should not be used.

† One particular kind of trap (a TLB miss on an address in the
user-privilege address space) has a different dedicated entry point.
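
As a rough illustration of the ‘‘extra code’’ that arithmetic on short values needs (the first item in the list above), the following C sketch performs a 16-bit signed add using only 32-bit operations. The function name is hypothetical, and the mask-and-subtract step merely stands in for whatever sequence a real compiler emits, for instance a shift left by 16 followed by an arithmetic shift right by 16.

```c
#include <assert.h>
#include <stdint.h>

/* Add two values held as C shorts, using only 32-bit operations. */
int32_t add_as_short(int32_t a, int32_t b)
{
    assert(a >= -32768 && a <= 32767);   /* operands are valid shorts */
    assert(b >= -32768 && b <= 32767);

    uint32_t sum = (uint32_t)a + (uint32_t)b;   /* ordinary 32-bit add */

    /* The extra step: keep bits 0..15 and rebuild the sign from
     * bit 15, so the result wraps exactly as a 16-bit register would. */
    int32_t low = (int32_t)(sum & 0xFFFFu);
    return (low ^ 0x8000) - 0x8000;
}
```

Here 32767 + 1 wraps to -32768, which is the behaviour a program written in terms of 16-bit variables would expect.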

Multiply and divide operations
The MIPS CPU does have an integer multiply/divide unit; worth
mentioning because many RISC machines don’t have multiply hardware.
The multiply unit is relatively independent of the rest of the CPU, with its
own special output registers.

Programmer-visible pipeline effects
In addition to the discussion above, programmers of R3xxx architecture
CPUs also must be aware of certain effects of the MIPS pipeline.
Specifically, the results of certain operations may not be available in the
immediately subsequent instruction; the programmer may need to be
explicitly aware of such cases.

Figure 1.2. The pipeline and branch delays
(Pipeline diagram: the branch, branch delay and branch target instructions
each pass through the IF, RF, ALU, MEM and WB stages, with the ‘‘branch
addr’’ path marked.)

• Delayed branches : the pipeline structure of the MIPS CPU (see "Figure
1.2. The pipeline and branch delays”) means that when a jump
instruction reaches the ‘‘execute’’ phase and a new program counter
is generated, the instruction after the jump will already have been
decoded. Rather than discard this potentially useful work, the
architecture rules state that the instruction after a branch is always
executed before the instruction at the target of the branch.
“Figure 1.2. The pipeline and branch delays” shows that a special path
is provided through the ALU to make the branch address available a
half-clock early, ensuring that there is only a one cycle delay before
the outcome of the branch is determined and the appropriate
instruction flow (branch taken or not taken) is initiated.
It is the responsibility of the compiler system or the assembler
programmer to allow for and even to exploit this “branch delay slot”;
it turns out that it is usually possible to arrange code such that the
instruction in the ‘‘delay slot’’ does useful work. Quite often, the
instruction which would otherwise have been placed before the
branch can be moved into the delay slot.
This can be a bit tricky on a conditional branch, where the branch
delay instruction must be (at least) harmless on the path where it isn’t
wanted. Where nothing useful can be done the delay slot is filled with
a ‘‘nop’’ (no-op, or no-operation) instruction.
Many MIPS assemblers will hide this feature from the programmer
unless explicitly told not to, as described later.
• Load data not available to next instruction : another consequence of
the pipeline is that a load instruction’s data arrives from the cache/
memory system AFTER the next instruction’s ALU phase starts – so it
is not possible to use the data from a load in the following instruction.
See "Figure 1.3. The pipeline and load delays” for how this works. On
the MIPS-1 architecture, the programmer must ensure that this rule
is not violated.

Figure 1.3. The pipeline and load delays
(Pipeline diagram: the load, load delay and ‘‘use data’’ instructions each
pass through the IF, RD, ALU, MEM and WB stages, with the D-cache read
marked in the MEM stage of the load.)

Again, most assemblers will hide this if they can. Frequently, the
assembler can move an instruction which is independent of the load
into the load delay slot; in the worst case, it can insert a NOP to ensure
proper program execution.
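
The load delay rule lends itself to a mechanical check. The sketch below is not taken from any real assembler: the instruction record is a hypothetical, much-simplified one, and the function only counts how many nops a naive tool would have to insert. It ignores the special case of consecutive lwl/lwr instructions, which the hardware handles itself.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, much-simplified instruction record. */
struct insn {
    int is_load;   /* non-zero for lw, lh, lb and friends        */
    int dest;      /* register written, 0..31                    */
    int src1;      /* registers read, 0..31; use 0 when unused   */
    int src2;
};

/* Count the nops needed so that no instruction reads the destination
 * of the load immediately before it. */
size_t load_delay_nops(const struct insn *code, size_t n)
{
    size_t nops = 0;
    for (size_t i = 0; i + 1 < n; i++) {
        const struct insn *ld = &code[i];
        const struct insn *next = &code[i + 1];
        assert(ld->dest >= 0 && ld->dest < 32);
        /* A load into $0 can never create a dependency. */
        if (ld->is_load && ld->dest != 0 &&
            (next->src1 == ld->dest || next->src2 == ld->dest))
            nops++;
    }
    return nops;
}
```

A smarter tool would first look for an independent instruction to move into the slot, and fall back to a nop only when none is available.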

A NOTE ON MACHINE AND ASSEMBLER LANGUAGE
To simplify assembly level programming, the MIPS Corp’s assembler
(and many other MIPS assemblers) provides a set of “synthetic”
instructions. Typically, a synthetic instruction is a common assembly level
operation that the assembler will map into one or more true instructions.
This mapping can be more intelligent than a mere macro expansion. For
example, the programmer just writes a ‘‘load immediate’’ instruction
and the assembler will figure out whether it can get by with one
machine instruction (if the immediate datum is small enough) or needs
to generate several. Used this way, synthetic instructions can
dramatically simplify assembly level programming.
This is obviously useful, but can be confusing. This manual will try to
use synthetic instructions sparingly, and indicate when it happens.
Moreover, the instruction tables below will consistently distinguish
between synthetic and machine instructions.
These features are there to help human programmers; most compilers
generate instructions which are one-for-one with machine code. However,
some compilers will in fact generate synthetic instructions.
Helpful things the assembler does:
• 32-bit load immediates : The programmer can code a load with any
value (including a memory location which will be computed at link
time), and the assembler will break it down into two instructions to
load the high and low half of the value.
• Load from memory location : The programmer can code a load from a
memory-resident variable. The assembler will normally replace this
by loading a temporary register with the high-order half of the
variable’s address, followed by a load whose displacement is the
low-order half of the address.
Of course, this does not apply to variables defined inside C functions,
which are implemented either in registers or on the stack.
• Efficient access to memory variables : some C programs contain many
references to static or extern variables, and a two-instruction
sequence to load/store any of them is expensive. Some compilation
systems, with run-time support, get around this. Certain variables
are selected at compile/assemble time (by default MIPS Corp’s
assembler selects variables which occupy 8 or fewer bytes of storage)
and kept together in a single section of memory which must end up
smaller than 64Kbytes. The run-time system then initializes one
register ($28 or gp (global pointer) by convention) to point to the
middle of this section.
Loads and stores to these variables can now be coded as a single gp
relative load or store.
• More types of branch condition : the assembler synthesizes a full set of
branches conditional on an arithmetic test between two registers.
• Simple or different forms of instructions : unary operations such as not
and neg are produced as a nor or sub with the zero-valued register $0.
Two-operand forms of 3-operand instructions can be written; the
assembler will put the result back into the first-specified register.
• Hiding the branch delay slot: in normal coding most assemblers will
not allow access to the branch delay slot. MIPS Corp.’s assembler, in
particular, is exceptionally ingenious and may re-organize the
instruction sequence substantially in search of something useful to
do in the delay slot. An assembler directive ‘‘.noreorder’’ is available
where this must not happen.
• Hiding the load delay: many assemblers will detect an attempt to use
the result of a load in the next instruction, and will either move code
around or insert a nop.
• Unaligned transfers: the ‘‘unaligned’’ load/store instructions will
fetch halfword and word quantities correctly, even if the target
address turns out to be unaligned.
• Other pipeline corrections: some instructions (such as those which use
the integer multiply unit) have additional constraints that are
implementation specific (see the Appendix on hazards). Many
assemblers will just “handle” these cases automatically, or at least
warn the programmer about possible hazard violations.
• Other optimizations: some MIPS instructions (particularly floating
point) take multiple clocks to produce results. However, the hardware
is ‘‘interlocked’’, so the programmer does not need to be aware of these
delays to write correct programs. But MIPS Corp.’s assembler is
particularly aggressive in these circumstances, and will perform
substantial code movement to try to make it run faster. This may need
to be considered when debugging.
In general, it is best to use a dis-assembler utility to disassemble a
resulting binary during debug. This will show the system designers the
true code sequence being executed, and thus “uncover” the modifications
made by the assembler or compiler.
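
To make the first two items in the list above concrete, the following C sketch shows how a 32-bit constant or address might be split into the halves used by a two-instruction sequence. The function names are invented for this example, and the selection rule in li_instruction_count is one plausible choice rather than the behaviour of any particular assembler. The point worth noticing is that a load displacement is sign-extended, so the high half has to be rounded up whenever bit 15 of the address is set.

```c
#include <assert.h>
#include <stdint.h>

/* How many machine instructions a "load immediate" might need. */
int li_instruction_count(uint32_t value)
{
    if (value <= 0xFFFFu)        return 1;   /* ori   rd, $0, value   */
    if (value >= 0xFFFF8000u)    return 1;   /* addiu rd, $0, value   */
    if ((value & 0xFFFFu) == 0)  return 1;   /* lui   rd, value >> 16 */
    return 2;                                /* lui, then ori         */
}

/* lui + ori: ori zero-extends its immediate, so the value is simply
 * cut in two. */
uint16_t hi_for_ori(uint32_t value) { return (uint16_t)(value >> 16); }
uint16_t lo_for_ori(uint32_t value) { return (uint16_t)(value & 0xFFFFu); }

/* lui + load: the displacement is sign-extended, so the high half is
 * rounded up whenever bit 15 of the address is set. */
uint16_t hi_for_load(uint32_t addr)
{
    return (uint16_t)((addr + 0x8000u) >> 16);
}

int32_t lo_for_load(uint32_t addr)
{
    int32_t low = (int32_t)(addr & 0xFFFFu);
    return (low ^ 0x8000) - 0x8000;          /* -32768 .. 32767 */
}

/* Address formed by the two-instruction sequence; always equals addr. */
uint32_t rebuilt_address(uint32_t addr)
{
    uint32_t a = ((uint32_t)hi_for_load(addr) << 16)
               + (uint32_t)lo_for_load(addr);
    assert(a == addr);
    return a;
}
```

For the address 0x1234ABCD the load sequence uses a high half of 0x1235 and a displacement of -21555, whereas the lui/ori pair uses 0x1234 and 0xABCD.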

CHAPTER 2
MIPS-1 (R30xx) ARCHITECTURE

PROGRAMMER’S VIEW OF THE PROCESSOR ARCHITECTURE
This chapter describes the assembly programmer’s view of the CPU
architecture, in terms of registers, instructions, and computational
resources. This viewpoint corresponds, for example, to an assembly
programmer writing user applications (although more typically, such a
programmer would use a high-level language).
Information about kernel software development (such as handling
interrupts, traps, and cache and memory management) is described in
later chapters.

Registers
There are 32 general purpose registers: $0 to $31. Two, and only two,
are special to the hardware:
• $0 always returns zero, no matter what software attempts to store to
it.
• $31 is used by the normal subroutine-calling instruction (jal) for the
return address. Note that the call-by-register version (jalr) can use
ANY register for the return address, though practice is to use only
$31.
In all other respects all registers are identical and can be used in any
instruction ($0 can be used as the destination of instructions; the value of
$0 will remain unchanged, however, so the instruction would be effectively
a NOP).
In the MIPS architecture the ‘‘program counter’’ is not a register, and it
is probably better to not think of it that way. The return address of a jal is
two instructions later in sequence (the instruction after the jump delay slot
instruction); the instruction after the call is the call’s ‘‘delay slot’’ and is
typically used to set up the last parameter.
There are no condition codes and nothing in the ‘‘status register’’ or
other CPU internals is of any consequence to the user-level programmer.
There are two registers associated with the integer multiplier. These
registers, referred to as “HI” and “LO”, contain the 64-bit product result of
a multiply operation, or the quotient and remainder of a divide.
The floating point math co-processor (called FPA for floating point
accelerator), if available, adds 32 floating point registers†; in simple
assembler language they are just called $0 to $31 again – the fact that
these are floating point registers is implicitly defined by the instruction.
Actually, only the 16 even-numbered registers are usable for math; but
they can be used for either single-precision (32 bit) or double-precision
(64-bit) numbers. When performing double-precision arithmetic, odd
numbered register $N+1 holds the remaining bits of the even numbered
register identified $N. Only moves between integer and FPA, or FPA load/
store instructions, ever refer to odd-numbered registers (and even then the
assembler helps the programmer forget...)

† The FPA also has a different set of registers called ‘‘co-processor
1 registers’’ for control purposes. These are typically used to
manage the actions/state of the FPA, and should not be confused
with the FPA data registers.

Conventional names and uses of general-purpose registers
Although the hardware makes few rules about the use of registers, their
practical use is governed by a number of conventions. These conventions
allow inter-changeability of tools, operating systems, and library modules.
It is strongly recommended that these conventions be followed.
Reg No   Name    Used for
0        zero    Always returns 0
1        at      (assembler temporary) Reserved for use by assembler
2-3      v0-v1   Value (except FP) returned by subroutine
4-7      a0-a3   (arguments) First four parameters for a subroutine
8-15     t0-t7   (temporaries) subroutines may use without saving
24-25    t8-t9   (temporaries) subroutines may use without saving
16-23    s0-s7   Subroutine ‘‘register variables’’; a subroutine which will
                 write one of these must save the old value and restore it
                 before it exits, so the calling routine sees their values
                 preserved.
26-27    k0-k1   Reserved for use by interrupt/trap handler - may change
                 under your feet
28       gp      global pointer - some runtime systems maintain this to give
                 easy access to (some) ‘‘static’’ or ‘‘extern’’ variables.
29       sp      stack pointer
30       s8/fp   9th register variable. Subroutines which need one can use
                 this as a ‘‘frame pointer’’.
31       ra      Return address for subroutine

Table 2.1. Conventional names of registers with usage mnemonics

With the conventional uses of the registers go a set of conventional
names. Given the need to fit in with the conventions, use of the
conventional names is pretty much mandatory. The common names are
described in Table 2.1, “Conventional names of registers with usage
mnemonics”.
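
Tools such as disassemblers and debuggers usually carry this table around in code. A minimal sketch in C, assuming nothing beyond the names listed in Table 2.1 (the function names reg_name and reg_number are hypothetical):

```c
#include <assert.h>
#include <string.h>

/* Conventional names, indexed by hardware register number. */
static const char *const reg_names[32] = {
    "zero", "at", "v0", "v1", "a0", "a1", "a2", "a3",
    "t0",   "t1", "t2", "t3", "t4", "t5", "t6", "t7",
    "s0",   "s1", "s2", "s3", "s4", "s5", "s6", "s7",
    "t8",   "t9", "k0", "k1", "gp", "sp", "s8", "ra"
};

/* Conventional name of register number 0..31. */
const char *reg_name(int number)
{
    assert(number >= 0 && number < 32);
    return reg_names[number];
}

/* Register number for a conventional name, or -1 if unknown. */
int reg_number(const char *name)
{
    if (strcmp(name, "fp") == 0)        /* fp is an alias for s8 */
        return 30;
    for (int i = 0; i < 32; i++)
        if (strcmp(name, reg_names[i]) == 0)
            return i;
    return -1;
}
```

Note that t8 and t9 are registers 24 and 25, not 16 and 17; the temporaries are split around the s registers.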
Notes on conventional register names
• at : this register is reserved for use inside the synthetic instructions
generated by the assembler. If the programmer must use it explicitly
the directive .noat stops the assembler from using it, but then there
are some things the assembler won’t be able to do.
• v0-v1 : used when returning non-floating-point values from a
subroutine. To return anything bigger than 2×32 bits, memory must
be used (described in a later chapter).
• a0-a3 : used to pass the first four non-FP parameters to a subroutine.
That’s an occasionally-false oversimplification; the actual convention
is fully described in a later chapter.
• t0-t9 : by convention, subroutines may use these values without
preserving them. This makes them easy to use as ‘‘temporaries’’ when
evaluating expressions – but a caller must remember that they may
be destroyed by a subroutine call.
• s0-s8 : by convention, subroutines must guarantee that the values of
these registers on exit are the same as they were on entry – either by
not using them, or by saving them on the stack and restoring before
exit.
This makes them eminently suitable for use as ‘‘register variables’’ or
for storing any value which must be preserved over a subroutine call.

• k0-k1 : reserved for use by the trap/interrupt routines, which will not
restore their original value; so they are of little use to anyone else.
• gp : (global pointer). If present, it will point to a load-time-determined
location in the midst of your static data. This means that loads and
stores to data lying within 32Kbytes either side of the gp value can be
performed in a single instruction using gp as the base register.
Without the global pointer, loading data from a static memory area
takes two instructions: one to load the most significant bits of the
32-bit constant address computed by the compiler and loader, and one
to do the data load.
To use gp a compiler must know at compile time that a datum will end
up linked within a 64Kbyte range of memory locations. In practice it
can’t know, only guess. The usual practice is to put ‘‘small’’ global
data items in the area pointed to by gp, and to get the linker to
complain if it still gets too big. The definition of what is “small” can
typically be specified with a compiler switch (most compilers use
“-G”). The most common default size is 8 bytes or less.
Not all compilation systems or OS loaders support gp.
• sp : (stack pointer). Since it takes explicit instructions to raise and
lower the stack pointer, it is generally done only on subroutine entry
and exit; and it is the responsibility of the subroutine being called to
do this. sp is normally adjusted, on entry, to the lowest point that the
stack will need to reach at any point in the subroutine. Now the
compiler can access stack variables by a constant offset from sp.
Stack usage conventions are explained in a later chapter.
• fp : (also known as s8). A subroutine will use a ‘‘frame pointer’’ to keep
track of the stack if it wants to use operations which involve extending
the stack by an amount which is determined at run-time. Some
languages may do this explicitly; assembler programmers are always
welcome to experiment; and (for many toolchains) C programs which
use the ‘‘alloca’’ library routine will find themselves doing so.
In this case it is not possible to access stack variables from sp, so fp
is initialized by the function prologue to a constant position relative
to the function’s stack frame. Note that a ‘‘frame pointer’’ subroutine
may call or be called by subroutines which do not use the frame
pointer; so long as the functions it calls preserve the value of fp (as
they should) this is OK.
• ra : (return address). On entry to any subroutine, ra holds the address
to which control should be returned – so a subroutine typically ends
with the instruction ‘‘jr ra’’.
Subroutines which themselves call subroutines must first save ra,
usually on the stack.
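
The gp convention described in the list above amounts to a simple range test, sketched here in C. Both function names are hypothetical, and placing gp exactly 32768 bytes above the start of the small-data area is only one possible choice; the text requires no more than that gp points into the midst of the data.

```c
#include <assert.h>
#include <stdint.h>

/* One possible gp value: the middle of a 64Kbyte small-data area. */
uint32_t gp_for_section(uint32_t section_start)
{
    assert(section_start <= 0xFFFFFFFFu - 0x8000u);   /* no wrap-around */
    return section_start + 0x8000u;
}

/* Non-zero if addr can be reached by a single load or store that uses
 * gp as the base register and a signed 16-bit displacement. */
int gp_reachable(uint32_t gp, uint32_t addr)
{
    int64_t delta = (int64_t)addr - (int64_t)gp;
    return delta >= -32768 && delta <= 32767;
}
```

With gp in the middle, every byte of a 64Kbyte section starting at section_start is reachable; a datum one byte beyond either end is not, and needs the two-instruction sequence.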

Integer multiply unit and registers
MIPS’ architects decided that integer multiplication was important
enough to deserve a hard-wired instruction. This is not so common in
RISCs, which might instead:
• implement a ‘‘multiply step’’ which fits in the standard integer
execution pipeline, and require software routines for every
multiplication (e.g. Sparc or AM29000); or
• perform integer multiplication in the floating point unit – a good
solution but which compromises the optional nature of the MIPS
floating point ‘‘co-processor’’.
The multiply unit consumes a small amount of die area, but
dramatically improves performance (and cache performance) over
“multiply step” operations. Its basic operation is to multiply two 32-bit
values together to produce a 64-bit result, which is stored in two 32-bit
registers (called ‘‘hi’’ and ‘‘lo’’) which are private to the multiply unit.
Instructions mfhi, mflo are defined to copy the result out into general
registers.
Unlike results for integer operations, the multiply result registers are
interlocked. An attempt to read out the results before the multiplication is
complete results in the CPU being stopped until the operation completes.
The integer multiply unit will also perform an integer division between
values in two general-purpose registers; in this case the ‘‘lo’’ register stores
the quotient, and the ‘‘hi’’ register the remainder.
In the R30xx family, multiply operations take 12 clocks and division
takes 35. The assembler has a synthetic multiply operation which starts
the multiply and then retrieves the result into an ordinary register. Note
that MIPS Corp.’s assembler may even substitute a series of shifts and
adds for multiplication by a constant, to improve execution speed.
Multiply/divide results are written into ‘‘hi’’ and ‘‘lo’’ as soon as they are
available; the effect is not deferred until the writeback pipeline stage, as
with writes to general purpose (GP) registers. If a mfhi or mflo instruction
is interrupted by some kind of exception before it reaches the writeback
stage of the pipeline, it will be aborted with the intention of restarting it.
However, a subsequent multiply instruction which has passed the ALU
stage will continue (in parallel with exception processing) and would
overwrite the ‘‘hi’’ and ‘‘lo’’ register values, so that the re-execution of the
mfhi would get wrong (i.e. new) data. For this reason it is recommended
that a multiply should not be started within two instructions of an mfhi/
mflo. The assembler will avoid doing this where it can.
Integer multiply and divide operations never produce an exception,
though divide by zero produces an undefined result. Compilers will often
generate code to trap on errors, particularly on divide by zero. Frequently,
this instruction sequence is placed after the divide is initiated, to allow it
to execute concurrently with the divide (and avoid a performance loss).
Instructions mthi, mtlo are defined to setup the internal registers from
general-purpose registers. They are essential to restore the values of ‘‘hi’’
and ‘‘lo’’ when returning from an exception, but probably not for anything
else.
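
The contents of ‘‘hi’’ and ‘‘lo’’ after a signed multiply or divide can be modelled in C as follows. This is a sketch of the results only, not of the timing or interlocks; the type and function names are invented for the example. Because the hardware result of a divide by zero is undefined, the model simply refuses that case, much as compiler-generated checks do.

```c
#include <assert.h>
#include <stdint.h>

struct hilo {
    uint32_t hi;
    uint32_t lo;
};

/* Signed multiply: 32 x 32 -> 64-bit product, upper word in hi. */
struct hilo model_mult(int32_t a, int32_t b)
{
    uint64_t product = (uint64_t)((int64_t)a * (int64_t)b);
    struct hilo r = { (uint32_t)(product >> 32), (uint32_t)product };
    return r;
}

/* Signed divide: quotient in lo, remainder in hi. */
struct hilo model_div(int32_t a, int32_t b)
{
    assert(b != 0);                         /* undefined in hardware   */
    assert(!(a == INT32_MIN && b == -1));   /* not representable in C  */
    struct hilo r = { (uint32_t)(a % b), (uint32_t)(a / b) };
    return r;
}
```

A program would read these values back with mflo and mfhi; for example, dividing 7 by 2 leaves 3 in ‘‘lo’’ and 1 in ‘‘hi’’.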

Instruction types
A full list of R30xx family integer instructions is presented in Appendix
A. Floating point instructions are listed in Appendix B of this manual.
Currently, floating point instructions are only available in the R3081, and
are described in the R3081 User’s Manual.
The MIPS-1 ISA uses only three basic instruction encoding formats; this
is one of the keys to the high-frequencies attained by RISC architectures.
Instructions are mostly in numerical order; to simplify reading, the list
is occasionally re-ordered for clarity.
Throughout this manual, the description of various instructions will
also refer to various subfields of the instruction. In general, the following
typical nomenclature is used:
op        The basic op-code, which is 6 bits long. Instructions which have
          large sub-fields (for example, large immediate values, such as
          required for the ‘‘long’’ j/jal instructions, or arithmetic with
          a 16-bit constant) have a unique ‘‘op’’ field. Other instructions
          are classified in groups sharing an ‘‘op’’ value, distinguished
          by other fields (‘‘op2’’ etc.).
rs, rs1, rs2
          One or two fields identifying source registers.
rd        The register to be changed by this instruction.
sa        Shift-amount: How far to shift, used in shift-by-constant
          instructions.
op2       Sub-code field used for the 3-register arithmetic/logical group
          of instructions (op value of zero).
offset    16-bit signed word offset defining the destination of a
          ‘‘PC-relative’’ branch. The branch target will be the instruction
          ‘‘offset’’ words away from the ‘‘delay slot’’ instruction after
          the branch; so a branch-to-self has an offset of -1.
target    26-bit word address to be jumped to (it corresponds to a 28-bit
          byte address, which is always word-aligned). The long j
          instruction is rarely used, so this format is pretty much
          exclusively for function calls (jal).
          The high-order 4 bits of the target address can’t be specified by
          this instruction, and are taken from the address of the jump
          instruction. This means that these instructions can reach
          anywhere in the 256Mbyte region around the instructions’
          location. To jump further use a jr (jump register) instruction.
constant  16-bit integer constant for ‘‘immediate’’ arithmetic or logic
          operations.
mf        Yet another extended opcode field, this time used by
          ‘‘co-processor’’ type instructions.
rg        Field which may hold a source or destination register.
crg       Field to hold the number of a CPU control register (different
          from the integer register file). Called ‘‘crs’’/‘‘crd’’ in
          contexts where it must be a source/destination respectively.
The instruction encodings have been chosen to facilitate the design of a
high-frequency CPU, and they do reveal portions of the internal CPU
design. Although there are variable encodings, those fields which are
required very early in the pipeline are encoded in a very regular way:
• Source registers are always in the same place : so that the CPU can
fetch two operands from the integer register file without any
conditional decoding. Some instructions may not need both registers
– but since the register file is designed to provide two source values
on every clock nothing has been lost.
• 16-bit constant is always in the same place : permitting the
appropriate instruction bits to be fed directly into the ALU’s input
multiplexer, without conditional shifts.
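
The regularity of the encoding is easiest to see in code. The sketch below extracts every field unconditionally, exactly because each one always sits in the same place. The bit positions are the standard MIPS-1 layout rather than something stated in this section, the names rs and rt stand for the two source-register fields (rs1 and rs2 in the nomenclature above), and the struct and function names are invented for the example.

```c
#include <assert.h>
#include <stdint.h>

struct fields {
    unsigned op;        /* bits 31..26                          */
    unsigned rs;        /* bits 25..21, first source register   */
    unsigned rt;        /* bits 20..16, second source register  */
    unsigned rd;        /* bits 15..11                          */
    unsigned sa;        /* bits 10..6                           */
    unsigned op2;       /* bits 5..0                            */
    uint16_t constant;  /* bits 15..0, overlaps rd/sa/op2       */
    uint32_t target;    /* bits 25..0, overlaps everything but op */
};

/* Pull every field out of a 32-bit instruction word; which of them
 * are meaningful depends on the instruction format. */
struct fields decode(uint32_t insn)
{
    struct fields f;
    f.op       = insn >> 26;
    f.rs       = (insn >> 21) & 31u;
    f.rt       = (insn >> 16) & 31u;
    f.rd       = (insn >> 11) & 31u;
    f.sa       = (insn >> 6) & 31u;
    f.op2      = insn & 63u;
    f.constant = (uint16_t)(insn & 0xFFFFu);
    f.target   = insn & 0x03FFFFFFu;
    return f;
}

/* Build a 3-register arithmetic/logical instruction (op value zero). */
uint32_t encode_3reg(unsigned op2, unsigned rd, unsigned rs, unsigned rt)
{
    assert(op2 < 64 && rd < 32 && rs < 32 && rt < 32);
    return ((uint32_t)rs << 21) | ((uint32_t)rt << 16) |
           ((uint32_t)rd << 11) | op2;
}
```

For instance, the word 0x00221821 (addu $3, $1, $2) decodes to op 0, source registers 1 and 2, destination 3 and an op2 value of 0x21.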

Loading and storing: addressing modes
As mentioned above, there is only one basic ‘‘addressing mode’’. Any
load or store machine instruction can be written as:
operation dest-reg, offset(src-reg)
e.g.: lw $1, offset($2); sw $3, offset($4)

Any of the GP registers can be used for the destination and source. The
offset is a signed, 16-bit number (so can be anywhere between -32768 and
32767); the program address used for the load is the sum of src-reg and
the offset. This address mode is normally enough to pick out a particular
member of a C structure (‘‘offset’’ being the distance between the start of
the structure and the member required); it implements an array indexed
by a constant; it is enough to reference function variables from the stack
or frame pointer; and it provides a reasonable sized global area around
the gp value for static and extern variables.
The assembler provides the semblance of a simple direct addressing
mode, to load the values of memory variables whose address can be
computed at link time.


More complex modes such as double-register or scaled index must be
implemented with sequences of instructions.
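
A small C model may help here. The first function is the one hardware addressing mode; the second shows, as a sketch under the assumption of power-of-two element sizes, how a scaled-index access has to be assembled from separate steps. The function names are hypothetical and the instruction names in the comments only indicate the kind of machine instruction each step corresponds to.

```c
#include <assert.h>
#include <stdint.h>

/* The one hardware addressing mode: base register plus sign-extended
 * 16-bit displacement. */
uint32_t effective_address(uint32_t base, int16_t offset)
{
    return base + (uint32_t)(int32_t)offset;
}

/* Address of element 'index' in a table of (1 << log2_size)-byte
 * elements, built the way a machine-code sequence would build it. */
uint32_t scaled_index_address(uint32_t base, uint32_t index,
                              unsigned log2_size)
{
    assert(log2_size < 32);
    uint32_t tmp = index << log2_size;    /* sll  tmp, index, log2_size */
    tmp += base;                          /* addu tmp, tmp, base        */
    return effective_address(tmp, 0);     /* lw   dest, 0(tmp)          */
}
```

A reference to element 3 of a table of words at 0x2000 therefore costs three instructions rather than one.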

Data types in Memory and registers
The R30xx family CPUs can load or store between 1 and 4 bytes in a
single operation. Naming conventions are used in the documentation and
to build instruction mnemonics:
‘‘C’’ name   MIPS name   Size(bytes)   Assembler mnemonic
int          word        4             ‘‘w’’ as in lw
long         word        4             ‘‘w’’ as in lw
short        halfword    2             ‘‘h’’ as in lh
char         byte        1             ‘‘b’’ as in lb

Integer data types
Byte and halfword loads come in two flavors:
• Sign-extend : lb and lh load the value into the least significant bits of
the 32-bit register, but fill the high order bits by copying the ‘‘sign bit’’
(bit 7 of a byte, bit 15 of a half-word). This correctly converts a signed
value to a 32-bit signed integer.
• Zero-extend : instructions lbu and lhu load the value into the least
significant bits of a 32-bit register, with the high order bits filled with
zero. This correctly converts an unsigned value in memory to the
corresponding 32-bit unsigned integer value; so byte value 254
becomes 32-bit value 254.
If the byte-wide memory location whose address is in t1 contains the
value 0xFE (-2, or 254 if interpreted as unsigned), then:
lb      t2, 0(t1)
lbu     t3, 0(t1)

will leave t2 holding the value 0xFFFF FFFE (-2 as signed 32-bit) and t3
holding the value 0x0000 00FE (254 as signed or unsigned 32-bit).
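
The two flavors of partial-word load can be expressed in C as follows. This is a sketch of the register result only; the function names are invented for the example and the argument is the byte or halfword already fetched from memory.

```c
#include <assert.h>
#include <stdint.h>

/* Sign-extend the low 'bits' bits of value to 32 bits. */
uint32_t sign_extend(uint32_t value, unsigned bits)
{
    assert(bits >= 1 && bits < 32);
    uint32_t sign = 1u << (bits - 1);   /* bit 7 for a byte, 15 for a halfword */
    value &= (sign << 1) - 1u;          /* keep only the low bits   */
    return (value ^ sign) - sign;       /* copy the sign bit upward */
}

/* Register contents after each kind of load. */
uint32_t model_lb(uint8_t m)   { return sign_extend(m, 8); }
uint32_t model_lbu(uint8_t m)  { return m; }
uint32_t model_lh(uint16_t m)  { return sign_extend(m, 16); }
uint32_t model_lhu(uint16_t m) { return m; }
```

Applied to the byte 0xFE, model_lb gives 0xFFFFFFFE and model_lbu gives 0x000000FE, matching the lb/lbu example in the text.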
Subtle differences in the way shorter integers are extended to longer
ones are a historical cause of C portability problems, and the modern C
standards have elaborate rules. On machines like the MIPS, which does
not perform 8- or 16-bit precision arithmetic directly, expressions
involving short or char variables are less efficient than word operations.
Unaligned loads and stores
Normal loads and stores in the MIPS architecture must be aligned;
half-words may be loaded only from 2-byte boundaries, and words only from
4-byte boundaries. A load instruction with an unaligned address will
produce a trap. Because CISC architectures such as the MC680x0 and
iAPXx86 do handle unaligned loads and stores, this could complicate
porting software from one of these architectures. The MIPS architecture
does provide mechanisms to support this type of operation; in extremity,
software can provide a trap handler which will emulate the desired load
operation and hide this feature from the application.
All data items declared by C code will be correctly aligned.
But when it is known in advance that the program will transfer a word
from an address whose alignment is unknown and will be computed at run
time, the architecture does allow for a special 2-instruction sequence
(much more efficient than a series of byte loads, shifts and assembly). This
sequence is normally generated by the macro-instruction ulw (unaligned
load word).


(A macro-instruction ulh, unaligned load half, is also provided, and is
synthesized by two loads, a shift, and a bitwise ‘‘or’’ operation.)
The special machine instructions are lwl and lwr (load word left, load
word right). ‘‘Left’’ and ‘‘right’’ are arithmetical directions, as in ‘‘shift left’’;
‘‘left’’ is movement towards more significant bits, ‘‘right’’ is towards less
significant bits.
These instructions do three things:
• load 1, 2, 3 or 4 bytes from within one aligned 4-byte (word) location;
• shift that data to move the byte selected by the address to either the
most-significant (lwl) or least-significant (lwr) end of a 32-bit field;
• merge the bytes fetched from memory with the data already in the
destination.
This breaks most of the rules the architecture usually sticks by; it does
a logical operation on a memory variable, for example. Special hardware
allows the lwl, lwr pair to be used in consecutive instructions, even though
the second instruction uses the value generated by the first.
For example, on a CPU configured as big-endian the assembler
instruction:
ulw     t1, 0(t2)
add     t4, t3, t1

is implemented as:

lwl     t1, 0(t2)
lwr     t1, 3(t2)
nop
add     t4, t3, t1

Where:
• the lwl picks up the lowest-addressed byte of the unaligned 4-byte
region, together with however many more bytes fit into an
aligned word. It then shifts them left, to form the most-significant
bytes of the register value.
• the lwr is aimed at the highest-addressed byte in the unaligned 4-byte
region. It loads it, together with any bytes which precede it in the
same memory word, and shifts it right to get the least significant bits
of the register value. The merge leaves the high-order bits unchanged.
• Although special hardware ensures that a nop is not required between
the lwl and lwr, there is still a load delay between the second of them
and a normal instruction.
Note that if t2 was in fact 4-byte aligned, then both instructions load the
entire word; duplicating effort, but achieving the desired effect.
CPU behavior when operating with little-endian byte order is described
in a later chapter.
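The three steps listed above (fetch from one aligned word, shift, merge) can be modelled in a few lines of C. This is only a host-side sketch of the big-endian behavior described here, not a description of the hardware implementation; the function names (load_aligned_be, lwl_be, lwr_be, ulw_be) and the use of a byte array to stand in for memory are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Fetch the aligned word containing addr, assembling bytes big-endian. */
static uint32_t load_aligned_be(const uint8_t *mem, uint32_t addr)
{
    const uint8_t *p = mem + (addr & ~3u);
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* lwl: the addressed byte becomes the most-significant byte of the
 * result; the register bits below the loaded bytes are left unchanged. */
static uint32_t lwl_be(const uint8_t *mem, uint32_t addr, uint32_t reg)
{
    unsigned shift = 8u * (addr & 3u);          /* 0, 8, 16 or 24 */
    uint32_t keep = (1u << shift) - 1u;         /* low bits to preserve */
    return (load_aligned_be(mem, addr) << shift) | (reg & keep);
}

/* lwr: the addressed byte becomes the least-significant byte of the
 * result; the register bits above the loaded bytes are left unchanged. */
static uint32_t lwr_be(const uint8_t *mem, uint32_t addr, uint32_t reg)
{
    unsigned shift = 8u * (3u - (addr & 3u));   /* 24, 16, 8 or 0 */
    uint32_t loaded = 0xFFFFFFFFu >> shift;     /* bits supplied by memory */
    return (reg & ~loaded) | (load_aligned_be(mem, addr) >> shift);
}

/* ulw: the lwl/lwr pair used by the assembler macro (big-endian case). */
static uint32_t ulw_be(const uint8_t *mem, uint32_t addr)
{
    uint32_t t1 = 0;
    assert(mem != 0);
    t1 = lwl_be(mem, addr, t1);        /* lwl t1, 0(t2) */
    t1 = lwr_be(mem, addr + 3u, t1);   /* lwr t1, 3(t2) */
    return t1;
}
```

When addr is already 4-byte aligned, both helpers compute a shift of zero and each one loads the whole word, which matches the note about duplicated effort.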
Floating point data in memory
Loads into floating point registers from 4-byte aligned memory move
data without any interpretation – a program can load an invalid floating
point number and no FP error will result until an arithmetic operation is
requested with it as an operand.
This allows a programmer to load single-precision values by a load into
an even-numbered floating point register; but the programmer can also
load a double-precision value by a macro instruction, so that:
    ldc1    $f2, 24(t1)

is expanded to two loads to consecutive registers:
    lwc1    $f2, 24(t1)
    lwc1    $f3, 28(t1)

CHAPTER 2

MIPS-1 (R30xx) ARCHITECTURE

The C compiler aligns 8-byte long double-precision floating point
variables to 8-byte boundaries. R30xx family hardware does not require
this alignment; but it is done to avoid compatibility problems with
implementations of MIPS-2 or MIPS-3 CPUs such as the IDT R4600
(Orion), where the ldc1 instruction is part of the machine code, and the
alignment is necessary.

BASIC ADDRESS SPACE
The way in which MIPS processors use and handle addresses is subtly
different from that of traditional CISC CPUs, and may appear confusing.
Read the first part of this section carefully. Here are some guidelines:
• The addresses put into programs are rarely the same as the physical
addresses which come out of the chip (sometimes they’re close, but
not the same). This manual will refer to them as program addresses
and physical addresses respectively. A more common name for
program addresses is “virtual addresses”; note that the use of the
term “virtual address” does not necessarily imply that an operating
system must perform virtual memory management (e.g. demand
paging from disks...), but rather that the address undergoes some
transformation before being presented to physical memory. Although
virtual address is a proper term, this manual will typically use the
term “program address” to avoid confusing virtual addresses with
virtual memory management requirements.
• A MIPS-1 CPU has two operating modes: user and kernel. In user
mode, any address above 2Gbytes (most-significant bit of the address
set) is illegal and causes a trap. Also, some instructions cause a trap
in user mode.
• The 32-bit program address space is divided into four big areas with
traditional names; and different things happen according to the area
an address lies in:
kuseg 0x0000 0000 – 7FFF FFFF (low 2Gbytes): these are the addresses
permitted in user mode. In machines with an MMU (“E” versions
of the R30xx family), they will always be translated (more about
the R30xx MMU in a later chapter). Software should not attempt
to use these addresses unless the MMU is set up.
For machines without an MMU (“base” versions of the R30xx
family), the kuseg “program address” is transformed to a
physical address by adding a 1GB offset; the address
transformations for “base versions” of the R30xx family are
described later in this chapter. Note, however, that many
embedded applications do not use this address segment (those
applications which do not require that the kernel and its
resources be protected from user tasks).
kseg0 0x8000 0000 – 9FFF FFFF (512 Mbytes): these addresses are
‘‘translated’’ into physical addresses by merely stripping off the
top bit, mapping them contiguously into the low 512 Mbytes of
physical memory. This transformation operates the same for
both “base” and “E” family members. This segment is referred to
as “unmapped” because “E” version devices cannot redirect this
translation to a different area of physical memory.
Addresses in this region are always accessed through the cache,
so may not be used until the caches are properly initialized. They
will be used for most programs and data in systems using “base”
family members; and will be used for the OS kernel for systems
which do use the MMU (“E” version devices).

kseg1 0xA000 0000 – BFFF FFFF (512 Mbytes): these addresses are
mapped into physical addresses by stripping off the leading three
bits, giving a duplicate mapping of the low 512 Mbytes of
physical memory. However, kseg1 program address accesses will
not use the cache.
The kseg1 region is the only chunk of the memory map which is
guaranteed to behave properly from system reset; that’s why the
after-reset starting point ( 0xBFC0 0000, commonly called the
“reset exception vector”) lies within it. The physical address of
the starting point is 0x1FC0 0000 – which means that the
hardware should place the boot ROM at this physical address.
Software will therefore use this region for the initial program
ROM, and most systems also use it for I/O registers. In general,
I/O devices should always be mapped to addresses that are
accessible from kseg1, and system ROM is always mapped to
contain the reset exception vector. Note that code in the ROM
can then be accessed uncacheably (during boot up) using kseg1
program addresses, and also can be accessed cacheably (for
normal operation) using kseg0 program addresses.
kseg2 0xC000 0000 – FFFF FFFF (1 Gbyte): this area is only
accessible in kernel mode. As for kuseg, in “E” devices program
addresses are translated by the MMU into physical addresses;
thus, these addresses must not be referenced prior to MMU
initialization. For “base versions”, physical addresses are
generated to be the same as program addresses for kseg2.
Note that many systems will not need this region. In “E” versions,
it frequently contains OS structures such as page tables; simpler
OS’es probably will have little need for kseg2.
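The segment boundaries listed above lend themselves to a small decoding routine. The following C sketch classifies a program address and performs the fixed kseg0/kseg1 mapping onto the low 512 Mbytes of physical memory; the names segment_of, kseg01_to_phys and traps_in_user_mode are hypothetical helpers chosen for illustration, and the routine deliberately does not attempt the kuseg or kseg2 translations, which depend on the CPU version.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef enum { KUSEG, KSEG0, KSEG1, KSEG2 } segment_t;

/* Classify a 32-bit program address by the area it lies in. */
static segment_t segment_of(uint32_t prog_addr)
{
    if (prog_addr < 0x80000000u) return KUSEG;   /* low 2 Gbytes   */
    if (prog_addr < 0xA0000000u) return KSEG0;   /* cached         */
    if (prog_addr < 0xC0000000u) return KSEG1;   /* uncached       */
    return KSEG2;                                /* top 1 Gbyte    */
}

/* kseg0 strips the top bit, kseg1 the top three bits; both land in the
 * low 512 Mbytes, so a single mask covers the two cases. */
static uint32_t kseg01_to_phys(uint32_t prog_addr)
{
    segment_t seg = segment_of(prog_addr);
    assert(seg == KSEG0 || seg == KSEG1);
    return prog_addr & 0x1FFFFFFFu;
}

/* In user mode any address with the most-significant bit set traps. */
static bool traps_in_user_mode(uint32_t prog_addr)
{
    return (prog_addr & 0x80000000u) != 0;
}
```

For example, the reset entry point 0xBFC0 0000 (kseg1) and its cacheable alias 0x9FC0 0000 (kseg0) both resolve to physical address 0x1FC0 0000.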

SUMMARY OF SYSTEM ADDRESSING
MIPS program addresses are rarely simply the same as physical
addresses, but simple embedded software will probably use addresses in
kseg0 and kseg1, where the program address is related in an obvious and
unchangeable way to physical addresses.
Physical memory locations from 0x2000 0000 (512Mbyte) upward may
be difficult to access. In “E” versions of the R30xx family, the only way to
reach these addresses is through the MMU. In “base” family members,
certain of these physical addresses can be reached using kseg2 or kuseg
addresses: the address transformations for base R30xx family members are
described later in this chapter.

Kernel vs. user mode
In kernel mode (the CPU resets into this state), all program addresses
are accessible.
In user mode:
• Program addresses above 2Gbytes (top bit set) are illegal and will
cause a trap.
Note that if the CPU has an MMU, this means all valid user mode
addresses must be translated by the MMU; thus, User mode for “E”
devices typically requires the use of a memory-mapped OS.
For “base” CPUs, kuseg addresses are mapped to a distinct area of
physical memory. Thus, kernel memory resources (including IO
devices) can be made inaccessible to User mode software, without
requiring a memory-mapping function from the OS. Alternately, the
hardware can choose to “ignore” high-order address bits when
performing address decoding, thus “condensing” kuseg, kseg2, kseg1,
and kseg0 into the same physical memory.

• Instructions beyond the standard user set become illegal. Specifically,
the kernel can prevent User mode software from accessing the on-chip
CP0 (system control coprocessor, which controls exception and
machine state and performs the memory management functions of
the CPU).
Thus, the primary differences between User and Kernel modes are:
• User mode tasks can be inhibited from accessing kernel memory
resources, including OS data structures and IO devices. This also
means that various user tasks can be protected from each other.
• User mode tasks can be inhibited from modifying the basic machine
state, by prohibiting accesses to CP0.
Note that the kernel/user mode bit does not change the interpretation
of anything – just some things cease to be allowed in user mode. In kernel
mode the CPU can access low addresses just as if it was in user mode, and
they will be translated in the same way.

Memory map for CPUs without MMU hardware
The treatment of kseg0 and kseg1 addresses is the same for all IDT
R30xx CPUs. If the system can be implemented using only physical
addresses in the low 512Mbytes, and system software can be written to use
only kseg0 and kseg1, then the choice of “base” vs. “E” versions of the
R30xx family is not relevant.
For versions without the MMU (“base versions”), addresses in kuseg and
kseg2 will undergo a fixed address translation, and provide the system
designer the option to provide additional memory.
The base members of the R30xx family provide the following address
translations for kuseg and kseg2 program addresses:
• kuseg: this region (the low 2Gbytes of program addresses) is
translated to a contiguous 2Gbyte physical region between 1 and
3Gbytes. In effect, a 1GB offset is added to each kuseg program
address. In hex:
Program address                    Physical Address
0x0000 0000 – 0x7FFF FFFF     →    0x4000 0000 – 0xBFFF FFFF

• kseg2: these program addresses are genuinely untranslated. So
program addresses from 0xC000 0000 – 0xFFFF FFFF emerge as
identical physical addresses.
This means that “base” versions can generate most physical addresses
(without the use of an MMU), except for a gap between 512Mbyte and
1Gbyte (0x2000 0000 through 0x3FFF FFFF). As noted above, many
systems may ignore high-order address bits when performing address
decoding, thus condensing all physical memory into the lowest 512MB
addresses.
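Taken together with the kseg0/kseg1 rule, the fixed translations of a “base” CPU can be written as a single function. The C sketch below is an illustration of the mapping described in this section only; base_prog_to_phys and base_phys_reachable are hypothetical names, and the sketch says nothing about how external hardware decodes the resulting physical address.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Program address to physical address for a "base" (no MMU) CPU. */
static uint32_t base_prog_to_phys(uint32_t prog_addr)
{
    if (prog_addr < 0x80000000u)
        return prog_addr + 0x40000000u;   /* kuseg: add a 1GB offset     */
    if (prog_addr < 0xC0000000u)
        return prog_addr & 0x1FFFFFFFu;   /* kseg0/kseg1: low 512Mbytes  */
    return prog_addr;                     /* kseg2: untranslated         */
}

/* True if some program address can generate this physical address.
 * The only unreachable region is the gap from 512Mbyte to 1Gbyte. */
static bool base_phys_reachable(uint32_t phys_addr)
{
    bool in_gap = phys_addr >= 0x20000000u && phys_addr < 0x40000000u;
    assert(in_gap || phys_addr < 0x20000000u || phys_addr >= 0x40000000u);
    return !in_gap;
}
```

The second helper simply restates the gap between 0x2000 0000 and 0x3FFF FFFF; a system that ignores high-order address bits in its decoder will not observe this gap at all.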
Subsegments in the R3041 – memory width configuration
The R3041 CPU can be configured to access different regions of memory
as either 32-, 16- or 8-bits wide. Where the program requests a 32-bit
operation to a narrow memory (either with an uncached access, or a cache
miss, or a store), the CPU may break a transaction into multiple data
phases, to match the datum size to the memory port width.
The width configuration is applied independently to subsegments of the
normal kseg regions, as follows:
• kseg0 and kseg1: as usual, these are both mapped onto the low
512Mbytes. This common region is split into 8 subsegments
(64Mbytes each), each of which can be programmed as 8-, 16- or
32-bits wide. The width assignment affects both kseg0 and kseg1
accesses (that is, one can view these as subsegments of the
corresponding “physical” addresses).

• kuseg: is divided into four 512Mbyte subsegments, each
independently programmable for width. Thus, kuseg can be broken
into multiple portions, which may have varying widths. An example of
this may be a 32-bit main memory with some 16-bit PCMCIA font
cards and an 8-bit NVRAM.
• kseg2: is divided into two 512Mbyte subsegments, independently
programmable for width. Again, this means that kseg2 can support
multiple memory subsystems, of varying port width.
Note that once the various memory port widths have been configured
(typically at boot time), software does not have to be aware of the actual
width of any memory system. It can choose to treat all memory as 32-bit
wide, and the CPU will automatically adjust when an access is made to a
narrower memory region. This simplifies software development, and also
facilitates porting to various system implementations (which may or may
not choose the same memory port widths).
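The subsegment arithmetic described above (eight 64Mbyte pieces for kseg0/kseg1, four 512Mbyte pieces for kuseg, two 512Mbyte pieces for kseg2) can be sketched in C as follows. The numbering of subsegments from zero upward and the names region_t, subsegment_t and r3041_subsegment are assumptions made for this illustration; the actual bit assignments in the PortSize register are documented in the R3041 User's Manual and are not reproduced here.

```c
#include <assert.h>
#include <stdint.h>

typedef enum { REGION_KSEG01, REGION_KUSEG, REGION_KSEG2 } region_t;

typedef struct {
    region_t region;   /* which group of subsegments            */
    unsigned index;    /* subsegment number within that group   */
} subsegment_t;

/* Find the width-configuration subsegment a program address falls in. */
static subsegment_t r3041_subsegment(uint32_t prog_addr)
{
    subsegment_t s;
    if (prog_addr < 0x80000000u) {
        s.region = REGION_KUSEG;                      /* 4 x 512Mbytes */
        s.index = prog_addr >> 29;
        assert(s.index < 4);
    } else if (prog_addr < 0xC0000000u) {
        s.region = REGION_KSEG01;                     /* 8 x 64Mbytes  */
        s.index = (prog_addr & 0x1FFFFFFFu) >> 26;
        assert(s.index < 8);
    } else {
        s.region = REGION_KSEG2;                      /* 2 x 512Mbytes */
        s.index = (prog_addr - 0xC0000000u) >> 29;
        assert(s.index < 2);
    }
    return s;
}
```

Because kseg0 and kseg1 share one set of subsegments, the boot ROM at 0xBFC0 0000 and its cached alias at 0x9FC0 0000 land in the same subsegment and therefore share one width setting.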

MIPS-1 (R30xx) ARCHITECTURE

CHAPTER 2

Integrated Device Technology, Inc.

PROGRAMMER’S VIEW OF THE PROCESSOR
ARCHITECTURE
This chapter describes the assembly programmer’s view of the CPU
architecture, in terms of registers, instructions, and computational
resources. This viewpoint corresponds, for example, to an assembly
programmer writing user applications (although more typically, such a
programmer would use a high-level language).
Information about kernel software development (such as handling
interrupts, traps, and cache and memory management) is described in
later chapters.

Registers
There are 32 general purpose registers: $0 to $31. Two, and only two,
are special to the hardware:
• $0 always returns zero, no matter what software attempts to store to
it.
• $31 is used by the normal subroutine-calling instruction (jal) for the
return address. Note that the call-by-register version (jalr) can use
ANY register for the return address, though practice is to use only
$31.
In all other respects all registers are identical and can be used in any
instruction ($0 can be used as the destination of instructions; the value of
$0 will remain unchanged, however, so the instruction would be effectively
a NOP).
In the MIPS architecture the ‘‘program counter’’ is not a register, and it
is probably better to not think of it that way. The return address of a jal is
two instructions later in sequence (the instruction after the jump delay slot
instruction); the instruction after the call is the call’s ‘‘delay slot’’ and is
typically used to set up the last parameter.
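The two hardware-special registers can be illustrated with a toy register file in C. This is a minimal sketch for exposition, not an emulator: the cpu_t structure, the branch_target field and the helper names are all invented here, and the delay slot itself is not executed.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint32_t r[32];          /* general purpose registers $0..$31        */
    uint32_t pc;             /* address of the instruction being issued  */
    uint32_t branch_target;  /* where control goes after the delay slot  */
} cpu_t;

/* $0 always reads as zero. */
static uint32_t reg_read(const cpu_t *c, unsigned n)
{
    assert(n < 32);
    return n == 0 ? 0 : c->r[n];
}

/* Writes to $0 are silently discarded. */
static void reg_write(cpu_t *c, unsigned n, uint32_t value)
{
    assert(n < 32);
    if (n != 0)
        c->r[n] = value;
}

/* jal: the return address is two instructions on, i.e. the instruction
 * after the delay slot, and is always placed in $31. */
static void jal(cpu_t *c, uint32_t target)
{
    reg_write(c, 31, c->pc + 8);
    c->branch_target = target;
}
```

A jal issued at 0x8000 1000 therefore leaves 0x8000 1008 in $31, skipping both the jal itself and the delay-slot instruction at 0x8000 1004.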
There are no condition codes and nothing in the ‘‘status register’’ or
other CPU internals is of any consequence to the user-level programmer.
There are two registers associated with the integer multiplier. These
registers, referred to as “HI” and “LO”, contain the 64-bit product result of
a multiply operation, or the quotient and remainder of a divide.
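As a sketch of how results are split across the two registers, the C fragment below follows the usual MIPS convention: HI receives the upper 32 bits of a product and LO the lower 32 bits, while a divide leaves the quotient in LO and the remainder in HI. That assignment is the standard one for the instruction set rather than something stated in this paragraph, and the type hilo_t and the function names are illustrative only.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint32_t hi;
    uint32_t lo;
} hilo_t;

/* Signed multiply: 64-bit product split across HI (upper) and LO (lower). */
static hilo_t mult(int32_t a, int32_t b)
{
    int64_t product = (int64_t)a * b;
    hilo_t r = { (uint32_t)((uint64_t)product >> 32), (uint32_t)product };
    return r;
}

/* Unsigned multiply. */
static hilo_t multu(uint32_t a, uint32_t b)
{
    uint64_t product = (uint64_t)a * b;
    hilo_t r = { (uint32_t)(product >> 32), (uint32_t)product };
    return r;
}

/* Signed divide: quotient in LO, remainder in HI. The cases excluded by
 * the assertion are undefined in C, so this sketch does not model them. */
static hilo_t divide(int32_t a, int32_t b)
{
    hilo_t r;
    assert(b != 0 && !(a == INT32_MIN && b == -1));
    r.lo = (uint32_t)(a / b);
    r.hi = (uint32_t)(a % b);
    return r;
}
```

Software reads the results back with the mfhi and mflo instructions; the sketch returns both halves together purely for convenience.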
The floating point math co-processor (called FPA for floating point
accelerator), if available, adds 32 floating point registers†; in simple
assembler language they are just called $0 to $31 again – the fact that
these are floating point registers is implicitly defined by the instruction.
Actually, only the 16 even-numbered registers are usable for math; but
they can be used for either single-precision (32 bit) or double-precision
(64-bit) numbers. When performing double-precision arithmetic, odd
numbered register $N+1 holds the remaining bits of the even numbered
register identified $N. Only moves between integer and FPA, or FPA load/
store instructions, ever refer to odd-numbered registers (and even then the
assembler helps the programmer forget...)

† The FPA also has a different set of registers called ‘‘co-processor
1 registers’’ for control purposes. These are typically used to
manage the actions/state of the FPA, and should not be confused
with the FPA data registers.
SYSTEM CONTROL COPROCESSOR ARCHITECTURE

CHAPTER 3


This chapter concentrates on the aspects of the R30xx family
architecture that must be managed by the OS programmer. Note that most
of these features are transparent to the user program author; however, the
nature of embedded systems is such that most embedded systems
programmers will have a view of the underlying CPU and system
architecture, and thus will find this material important.
Co-processors
MIPS uses the term “co-processor” both in a traditional fashion, and also
in a non-traditional fashion. Specifically, the FPA device is a traditional
microprocessor co-processor: it is an optional part of the architecture,
with its own particular instruction set.
Opcodes are reserved and instruction fields defined for up to four
‘‘co-processors’’. Architecturally, the co-processors can be tightly coupled to
the base integer CPU; for example, the ISA defines instructions to move
data directly between memory and the coprocessor, rather than requiring
it to be moved into the integer processor first.
However, MIPS also uses the term “co-processor” for the functions
required to manage the CPU environment, including exception
management, cache control, and memory management. This
segmentation ensures that the chip architecture can be varied (e.g. cache
architecture, interrupt controller, etc.), without impacting user mode
software compatibility.
These functions are grouped by MIPS into the on-chip “co-processor 0”,
or ‘‘system control co-processor’’ - and these instructions implement the
whole CPU control system. Note that co-processor 0 has no independent
existence, and is certainly not optional. It provides a standard way of
encoding the instructions which access the CPU status register; so that,
although the definition of the status register changes among
implementations, programmers can use the same assembler for all of the
CPUs. Similarly, the exception and memory management strategies can
be varied among implementations, and these effects isolated to particular
portions of the OS kernel.

CPU CONTROL SUMMARY
This chapter, coupled with chapters on cache management, memory
management, and exception processing, provides details on managing the
machine and OS state. The areas of interest include:
• CPU control and co-processor : how privileged instructions are
organized, with shortform descriptions. There are relatively few
privileged instructions; most of the low-level control over the CPU is
exercised by reading and writing bit-fields within special registers.
• Exceptions : external interrupts, invalid operations, arithmetic errors
– all result in ‘‘exceptions’’, where control is transferred to an
exception handler routine.
MIPS exceptions are extremely simple – the hardware does the
absolute minimum, allowing the programmer to tailor the exception
mechanism to the needs of the particular system.
A later chapter describes MIPS exceptions, why they are ‘‘precise’’,
exception vectors, and conventions about how to code exception
handling routines.
Special problems can arise with nested exceptions: exceptions
occurring while the CPU is still handling an earlier exception.

Hardware interrupts have their own style and rules.
The Exception Management chapter includes an annotated example
of a moderately-complicated exception handler.
• Caches and cache management : all R30xx implementations have dual
caches (the I-cache for instructions, the D-cache for data). On-chip
hardware is provided to manage the caches, and the programmer
working with I/O devices, particularly with DMA devices, may need to
explicitly manage the caches in particular situations.
To manipulate the caches, the CPU allows software to isolate them,
inhibiting cache/memory traffic and allowing the processor to access
cache as if it were simple memory; and the CPU can swap the roles of
the I-cache and D-cache (the only way to make the I-cache writable).
Caches must sometimes be cleared of stale or invalid/uninitialized
data. Even following power-up, the R30xx caches are in a random
state and must be cleaned up before they can be used. A later chapter
will discuss the techniques used by software to manage the on-chip
cache resources.
In addition, techniques to determine the on-chip cache sizes will be
shown (greatest flexibility is achieved if software can be written to be
independent of cache sizes).
For the diagnostics programmer, techniques to test the cache memory
and probe for particular entries will be discussed.
On some CPU implementations the system designer may make
configuration choices about the cache (e.g. the R3081 and R3071
allow the cache organization to be selected between 16kB of I-cache/
4kB of D-cache and 8kB each of I- and D- cache). The cache
management chapter will also discuss some of the considerations to
apply to make a proper selection.
• Write buffer : on R30xx family CPUs the D-cache is always write
through; all writes go to main memory as well as the cache. This
simplifies the caches, but main memory won’t be able to accept data
as fast as the CPU can write it. Much of the performance loss can be
made up by using a FIFO store which holds a number of ‘‘write cycles’’
(it stores both address and data). In the R30xx family, this FIFO,
called the write buffer, is integrated on-chip.
System programmers may need to know that writes happen later than
the code sequence suggests. The chapter on cache management
discusses this.
• Starting up : at reset almost nothing is defined, so the software must
build carefully. In MIPS CPUs, reset is implemented in almost exactly
the same way as the exceptions.
A later chapter on reset initialization discusses ways of finding out
which CPU is executing the software, and how to get a ROM program
to run.
An example of a C runtime environment, attending to the stack and
special registers, is provided.
• Memory management and the TLB : A later chapter will discuss
address translation and managing the translation hardware (the
TLB). This section is mostly for OS programmers.

CPU CONTROL AND ‘‘CO-PROCESSOR 0’’
CPU control instructions
Most control functions are implemented with registers (most of which
consist of multiple bitfields). The MIPS architecture has an escape
mechanism to define instructions for ‘‘co-processors’’ – and the CPU
control instructions are coded for ‘‘co-processor 0’’.

There are several CPU control instructions used in the memory
management implementation, which are described in a later chapter. But
leaving aside the MMU, CPU control defines just one instruction beyond
the necessary move to and from the control registers.
mtc0 rs, <nn> – Move to co-processor zero
Loads ‘‘co-processor 0’’ register number nn from CPU general register rs. It
is unusual, and not good practice, to refer to CPU control registers by their
number in assembler sources; normal practice is to use the names listed
in Table 3.1, “Summary of CPU control registers (not MMU)”. In some
tool-chains the names are defined by a C-style ‘‘include’’ file, and the C
pre-processor run as a front-end to the assembler; the assembler manual
should provide guidance on how to do this. This is the only way of setting
bits in a CPU control register.
mfc0 rd, <nn> – Move from co-processor zero
General register rd is loaded with the values from CPU control register
number nn. Once again, it is common to use a symbolic name and a
macro-processor to save remembering the numbers. This is the only way
of inspecting bits in a control register.
rfe – Restore from exception
Note that this is not ‘‘return from exception’’. This instruction restores the
status register to go back to the state prior to the trap. To understand what
it does, refer to the status register SR defined later in this chapter. The only
secure way of returning to user mode from an exception is to return with
a jr instruction which has the rfe in its delay slot.

Standard CPU control registers
This table describes the general CPU control registers (ignoring the
MMU control registers). Also note that typical convention is to reserve k0
and k1 for exception processing, although they are proper GP registers of
the integer CPU unit.
Register   Description                                       CP0
Mnemonic                                                     reg no.

PRId       CP0 type and rev level                            15
SR         (status register) CPU mode flags                  12
Cause      Describes the most recently recognized            13
           exception
EPC        Return address from trap                          14
BadVaddr   Contains the last invalid program address         8
           which caused a trap. It is set by address
           errors of all kinds, even if there is no MMU
Config     CPU configuration (R3081 and R3041 only)          3
BusCtrl    (R3041 only) configure bus interface signals.     2
           Needs to be setup to match the hardware
           implementation.
PortSize   (R3041 only) used to flag some program            10
           address regions as 8- or 16-bits wide. Must be
           programmed to match the hardware
           implementation.
Count      (R3041 only, read/write) a 24-bit counter         9
           incrementing with the CPU clock.
Compare    (R3041 only, read/write) a 24-bit value used      11
           to wraparound the Count value and set an
           output signal.

Table 3.1. Summary of CPU control registers (not MMU)
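The register numbers in Table 3.1 are usually captured once as symbolic names. The C sketch below does that and, as an aside, shows how the mfc0 and mtc0 instruction words are formed from a general register number and a control register number. The enum and function names are hypothetical; the encodings used are the standard MIPS co-processor move formats rather than something defined in this chapter, so treat them as an illustrative assumption and check them against the instruction set appendix before relying on them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* CP0 register numbers from Table 3.1 (non-MMU registers only). */
enum cp0_reg {
    CP0_BUSCTRL  = 2,   /* R3041 only */
    CP0_CONFIG   = 3,
    CP0_BADVADDR = 8,
    CP0_COUNT    = 9,   /* R3041 only */
    CP0_PORTSIZE = 10,  /* R3041 only */
    CP0_COMPARE  = 11,  /* R3041 only */
    CP0_SR       = 12,
    CP0_CAUSE    = 13,
    CP0_EPC      = 14,
    CP0_PRID     = 15
};

/* Map a register number back to its mnemonic; NULL if not in the table. */
static const char *cp0_reg_name(unsigned nn)
{
    switch (nn) {
    case CP0_BUSCTRL:  return "BusCtrl";
    case CP0_CONFIG:   return "Config";
    case CP0_BADVADDR: return "BadVaddr";
    case CP0_COUNT:    return "Count";
    case CP0_PORTSIZE: return "PortSize";
    case CP0_COMPARE:  return "Compare";
    case CP0_SR:       return "SR";
    case CP0_CAUSE:    return "Cause";
    case CP0_EPC:      return "EPC";
    case CP0_PRID:     return "PRId";
    default:           return NULL;
    }
}

/* mfc0 rd, <nn>: general register in bits 20:16, CP0 register in 15:11. */
static uint32_t encode_mfc0(unsigned gpr, unsigned nn)
{
    assert(gpr < 32 && nn < 32);
    return 0x40000000u | ((uint32_t)gpr << 16) | ((uint32_t)nn << 11);
}

/* mtc0 rs, <nn>: same layout with the "move to" sub-opcode set. */
static uint32_t encode_mtc0(unsigned gpr, unsigned nn)
{
    assert(gpr < 32 && nn < 32);
    return 0x40800000u | ((uint32_t)gpr << 16) | ((uint32_t)nn << 11);
}
```

In assembler sources the same effect is normally obtained with preprocessor definitions of the mnemonics, so that the numbers never appear in the code itself.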

Encoding of control registers
The next section describes the format of the control registers, with a
sketch of the function of each field. In most cases, more information
about how things work is to be found in separate sections or chapters
later.
A note about reserved fields is in order here. Many unused control
register fields are marked ‘‘0’’. Bits in such fields are guaranteed to read
zero, and should be written as zero. Other reserved fields are marked
‘‘reserved’’ or ‘‘×’’; software must always write them as zero, and should
not assume that it will get back zero or any other particular value.
Registers specific to the memory management system are described in a
later chapter.

PRId Register

  Bit(s)   31-16      15-8   7-0
  Field    reserved   Imp    Rev

Figure 3.1. PRId Register fields

Figure 3.1, “PRId Register fields” shows the layout of the PRId register,
a read-only register to be consulted to identify the CPU type (more
properly, this register describes CP0, allowing the kernel to dynamically
configure itself for various CPU implementations). ‘‘Imp’’ should be related
to the CPU control register set. The encoding of Imp is described below:
CPU type                                  ‘‘Imp’’ value

R3000A (including R3051, R3052, R3071,    3
and R3081)
IDT unique (R3041)                        7

Note that when the Imp field indicates IDT unique, the revision number
can be used to distinguish among various CP0 implementations. Refer to
the R3041 User’s manual for the revision level appropriate for that device.
Since the R3051, 52, 71, and 81 are kernel compatible with the R3000A,
they share the same Imp value.
When printing the value of this register, it is conventional to print it
out as ‘‘x.y’’ where ‘‘x’’ and ‘‘y’’ are the decimal values of Imp and Rev
respectively. Try not to use this register and the CPU manuals to size
things, or to establish the presence or absence of particular features;
software will be more portable and robust if it is designed to include code
sequences to probe for the existence of individual features. This manual
will provide numerous examples designed to determine cache sizes,
presence or absence of TLB, FPA, etc.
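Extracting the two fields and printing them in the conventional ‘‘x.y’’ form might look like the following C sketch. It assumes the PRId value has already been read (for example by an mfc0 in a small assembler stub); the helper names are invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Imp occupies bits 15:8 of PRId. */
static unsigned prid_imp(uint32_t prid)
{
    return (prid >> 8) & 0xFFu;
}

/* Rev occupies bits 7:0 of PRId. */
static unsigned prid_rev(uint32_t prid)
{
    return prid & 0xFFu;
}

/* Imp == 7 marks an IDT-unique CP0 (the R3041 in the table above). */
static bool prid_is_idt_unique(uint32_t prid)
{
    return prid_imp(prid) == 7u;
}

/* Format as "x.y" with both fields in decimal; returns snprintf's result. */
static int prid_format(uint32_t prid, char *buf, size_t len)
{
    assert(buf != NULL && len > 0);
    return snprintf(buf, len, "%u.%u", prid_imp(prid), prid_rev(prid));
}
```

As the text advises, a value decoded this way is best used for reporting; feature detection is more robust when done by probing.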
SR Register

  Bit(s)   31   30   29   28   27-26  25  24-23  22   21  20  19  18  17   16
  Field    CU3  CU2  CU1  CU0  0      RE  0      BEV  TS  PE  CM  PZ  SwC  IsC

  Bit(s)   15-8  7-6  5    4    3    2    1    0
  Field    IM    0    KUo  IEo  KUp  IEp  KUc  IEc

Figure 3.2. Fields in status register (SR)

The MIPS CPU has remarkably few mode bits; those that exist are
defined by fields in the CPU status register SR, as shown in Figure 3.2,
“Fields in status register (SR)”.
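The layout in Figure 3.2 translates directly into a set of bit masks. The definitions below are a sketch of how a C header might express them; the macro and function names are not taken from any particular tool-chain and should be read as illustrative.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit positions follow Figure 3.2. */
#define SR_CU(n)     (1u << (28 + (n)))   /* co-processor n usable, n = 0..3 */
#define SR_RE        (1u << 25)
#define SR_BEV       (1u << 22)
#define SR_TS        (1u << 21)
#define SR_PE        (1u << 20)
#define SR_CM        (1u << 19)
#define SR_PZ        (1u << 18)
#define SR_SWC       (1u << 17)
#define SR_ISC       (1u << 16)
#define SR_IM_SHIFT  8
#define SR_IM_MASK   (0xFFu << SR_IM_SHIFT)
#define SR_KUO       (1u << 5)
#define SR_IEO       (1u << 4)
#define SR_KUP       (1u << 3)
#define SR_IEP       (1u << 2)
#define SR_KUC       (1u << 1)
#define SR_IEC       (1u << 0)

/* Is co-processor cp enabled by its CUx bit? */
static bool sr_cop_usable(uint32_t sr, unsigned cp)
{
    assert(cp < 4);
    return (sr & SR_CU(cp)) != 0;
}

/* Extract the 8-bit interrupt mask field. */
static unsigned sr_int_mask(uint32_t sr)
{
    return (sr & SR_IM_MASK) >> SR_IM_SHIFT;
}

/* Return sr with the interrupt mask replaced and every other bit kept. */
static uint32_t sr_set_int_mask(uint32_t sr, unsigned im)
{
    assert(im <= 0xFFu);
    return (sr & ~SR_IM_MASK) | ((uint32_t)im << SR_IM_SHIFT);
}
```

A read-modify-write through helpers of this kind leaves the zero and reserved fields exactly as they were read, which is the simplest way to respect the rules on reserved bits.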
Note that there are no modes such as non-translated or non-cached in
MIPS CPUs; all translation and caching decisions are made on the basis of
the program address. Fields are:
CU3, CU2 Bits (31:30) control the usability of ‘‘co-processors’’ 3 and 2
respectively. In the R30xx family, these might be enabled if
software wishes to use the BrCond(3:2) input pins for polling, or
to speed exception decoding.
CU1 ‘‘co-processor 1 usable’’: 1 to use FPA if present, 0 to disable.
When 0, all FPA instructions cause an exception, even for the
kernel. It can be useful to turn off an FPA even when one is
available; it may also be enabled in devices which do not include
an FPA, if the intent is to use the BrCond(1) pin as a polled input.
CU0 ‘‘co-processor 0 usable’’: set 1 to be able to use some
nominally-privileged instructions in user mode (this is rarely if ever done).
The CPU control instructions encoded as ‘‘co-processor 0’’ type
are always usable in kernel mode, regardless of the setting of this
bit.
RE ‘‘reverse endianness in user mode’’. The MIPS processors can be
configured, at reset time, with either ‘‘endianness’’ (byte ordering
convention, discussed in the various CPU’s User’s Manuals and
later in this manual). The RE bit allows binaries intended to be
run with one byte ordering convention to be run in systems with
the opposite convention, presuming OS software provided the
necessary support.
When RE is active, user-privilege software runs as if the CPU had
been configured with the opposite endianness.
However, achieving cross-universe running would require a large
software effort as well, and should not be necessary in embedded
systems.
BEV ‘‘boot exception vectors’’: when BEV == 1, the CPU uses the ROM
(kseg1) space exception entry point (described in a later chapter).
BEV is usually set to zero in running systems; this relocates the
exception vectors to RAM addresses, speeding accesses and
allowing the use of “user supplied” exception service routines.
TS ‘‘TLB shutdown’’: In devices which implement the full R3000A
MMU, TS gets set if a program address simultaneously matches
two TLB entries. Prolonged operation in this state, in some
implementations, could cause internal contention and damage
to the chip. TLB shutdown is terminal, and can be cleared only
by a hardware reset.
In base family members, which do not include the TLB, this bit
is set by reset; software can rely on this feature to determine the
presence or absence of TLB support hardware.
PE set if a cache parity error has occurred. No exception is
generated by this condition, which is really only useful for
diagnostics. The MIPS architecture has cache diagnostic
facilities because earlier versions of the CPU used external
caches, and this provided a way to verify the timing of a
particular system. For those implementations the cache parity
error bit was an essential design debug tool.
For CPUs with on-chip caches this feature is rarely needed; only
the R3071 and R3081 implement parity over the on-chip caches.

CM shows the result of the last load operation performed with the
D-cache isolated (described in the chapter on cache management).
CM is set if the cache really contained data for the addressed
memory location (i.e. if the load would have hit in the cache even
if the cache had not been isolated).
PZ When set, cache parity bits are written as zero and not checked.
This was useful in old R3000A systems which required external
cache RAMs, but is of little relevance to the R30xx family.
SwC, IsC ‘‘swap caches’’ and ‘‘isolate (data) cache’’. Cache mode bits for
cache management and diagnostics; their use is described in
detail in a later chapter on cache management. In simple terms:
• IsC set 1: makes all loads and stores access only the data
cache, and never memory; and in this mode a partial-word
store invalidates the cache entry. Note that when
this bit is set, even uncached data accesses will not be
seen on the bus; further, this bit is not initialized by reset.
Boot-up software must ensure this bit is properly
initialized before relying on external data references.
• SwC set 1: reverses the roles of the I-cache and D-cache,
so that software can access and invalidate I-cache entries.
IM ‘‘interrupt mask’’: an 8 bit field defining which interrupt sources,
when active, will be allowed to cause an exception. Six of the
interrupt sources are external pins (one may be used by the FPA,
which although it lives on the same chip is logically external); the
other two are the software-writable interrupt bits in the Cause
register.
No interrupt prioritization is provided by the CPU: the hardware
treats all interrupt bits the same. This is described in greater
detail in the chapter dealing with exceptions.

KUc, IEc The two basic CPU protection bits.
KUc is set 0 when running with kernel privileges, 1 for user
mode. In kernel mode, software can get at the whole program
address space, and use privileged (‘‘co-processor 0’’)
instructions. User mode restricts software to program addresses
between 0x0000 0000 and 0x7FFF FFFF, and can be denied
permission to run privileged instructions; attempts to break the
rules result in an exception.
IEc is set 0 to prevent the CPU taking any interrupt, 1 to enable.
KUp, IEp ‘‘KU previous, IE previous’’:
on an exception, the hardware takes the values of KUc and IEc
and saves them here; at the same time as changing the values of
KUc, IEc to [0, 0] (kernel mode, interrupts disabled). The
instruction rfe can be used to copy KUp, IEp back into KUc, IEc.
KUo, IEo ‘‘KU old, IE old’’:
on an exception the KUp, IEp bits are saved here. Effectively, the
six KU/IE bits are operated as a 3-deep, 2-bit wide stack which
is pushed on an exception and popped by an rfe.
This provides a chance of recovering cleanly from an exception
occurring so early in an exception handling routine that the first
exception has not yet saved SR. The circumstances in which this
can be done are limited, and it is probably only really of use in
allowing the user TLB refill code to be made a little shorter, as
described in the chapter on memory management.
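The push and pop of the KU/IE stack can be modelled in a few lines of C. This is only a sketch: it assumes the usual R3000 placement of the six bits in the low end of SR (IEc in bit 0, KUc in bit 1, then the ‘‘previous’’ and ‘‘old’’ pairs), and that exception entry clears both current bits. The function names are invented for illustration.

```c
#include <stdint.h>

/* Assumed layout of the low six SR bits:
 * bit 0 IEc, bit 1 KUc, bit 2 IEp, bit 3 KUp, bit 4 IEo, bit 5 KUo. */
#define SR_KUIE_MASK 0x3Fu

/* Exception entry: current -> previous, previous -> old, old is lost.
 * The new current pair is cleared (assumed: kernel mode, interrupts off). */
uint32_t sr_on_exception(uint32_t sr)
{
    uint32_t stack = (sr & SR_KUIE_MASK) << 2;
    return (sr & ~SR_KUIE_MASK) | (stack & SR_KUIE_MASK);
}

/* rfe: previous -> current, old -> previous; the old pair is unchanged. */
uint32_t sr_on_rfe(uint32_t sr)
{
    uint32_t stack = sr & SR_KUIE_MASK;
    stack = (stack & 0x30u) | (stack >> 2);
    return (sr & ~SR_KUIE_MASK) | stack;
}
```

Applying sr_on_exception() and then sr_on_rfe() to a value returns the original current pair, which is the property an exception handler relies on.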

SYSTEM CONTROL CO-PROCESSOR ARCHITECTURE

CHAPTER 3

Cause Register
  31    30    29-28   27-16   15-8    7     6-2       1-0
  BD    0     CE      0       IP      0     ExcCode   0

Figure 3.3. Fields in the Cause register

Figure 3.3, “Fields in the Cause register” shows the fields in the Cause
register, which are consulted to determine the kind of exception which
happened and will be used to decide which exception routine to call.
BD
‘‘branch delay’’: if set, this bit indicates that the EPC does not
point to the actual “exception” instruction, but rather to the
branch instruction which immediately precedes it.
When the exception restart point is an instruction which is in the
‘‘delay slot’’ following a branch, EPC has to point to the branch
instruction; it is harmless to re-execute the branch, but if the
CPU returned from the exception to the branch delay instruction
itself the branch would not be taken and the exception would
have broken the interrupted program.
The only time software might be sensitive to this bit is if it must
analyze the ‘‘offending’’ instruction (if BD == 1 then the
instruction is at EPC + 4). This would occur if the instruction
needs to be emulated (e.g. a floating point instruction in a device
with no hardware FPA; or a breakpoint placed in a branch delay
slot).
CE
‘‘co-processor error’’: if the exception is taken because a
‘‘co-processor’’ format instruction was for a ‘‘co-processor’’ which is
not enabled by the CUx bit in SR, then this field has the
co-processor number from that instruction.
IP
‘‘Interrupt Pending’’: shows the interrupts which are currently
asserted (but may be “masked” from actually signalling an
exception). These bits follow the CPU inputs for the six hardware
levels. Bits 9 and 8 of Cause (the two software interrupt bits) are
read/writable, and contain the value last written to them. Any of the
8 bits, if active while enabled by the appropriate IM bit and the
global interrupt enable flag IEc in SR, will cause an interrupt.
IP is subtly different from the rest of the Cause register fields; it
doesn’t indicate what happened when the exception took place,
but rather shows what is happening now.
ExcCode
A 5-bit code which indicates what kind of exception happened,
as detailed in Table 3.2, “ExcCode values: different kinds of
exceptions”.
ExcCode
Value     Mnemonic   Description
0         Int        Interrupt
1         Mod        ‘‘TLB modification’’
2         TLBL       ‘‘TLB load/TLB store’’
3         TLBS
4         AdEL       Address error (on load/I-fetch or store respectively).
5         AdES       Either an attempt to access outside kuseg when in user
                     mode, or an attempt to read a word or half-word at a
                     misaligned address.
6         IBE        Bus error (instruction fetch or data load, respectively).
7         DBE        External hardware has signalled an error of some kind;
                     proper exception handling is system-dependent. The
                     R30xx family CPUs can’t take a bus error on a store;
                     the write buffer would make such an exception
                     “imprecise”.
8         Syscall    Generated unconditionally by a syscall instruction.
9         Bp         Breakpoint - a break instruction.
10        RI         ‘‘reserved instruction’’
11        CpU        ‘‘Co-Processor unusable’’
12        Ov         ‘‘arithmetic overflow’’. Note that ‘‘unsigned’’ versions of
                     instructions (e.g. addu) never cause this exception.
13-31     -          reserved. Some are already defined for MIPS CPUs such
                     as the R6000 and R4xxx

Table 3.2. ExcCode values: different kinds of exceptions
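As an illustration of how a handler might pick the Cause register apart, the following C sketch extracts each field using the bit positions of Figure 3.3, and applies the BD rule to locate the instruction that actually took the exception. The macro and function names are invented here and are not part of any IDT-supplied header.

```c
#include <stdint.h>

/* Field positions taken from Figure 3.3; names are illustrative only. */
#define CAUSE_BD         0x80000000u   /* bit 31 */
#define CAUSE_CE_SHIFT   28            /* bits 29..28 */
#define CAUSE_CE_MASK    0x3u
#define CAUSE_IP_SHIFT   8             /* bits 15..8 */
#define CAUSE_IP_MASK    0xFFu
#define CAUSE_EXC_SHIFT  2             /* bits 6..2 */
#define CAUSE_EXC_MASK   0x1Fu

unsigned cause_exccode(uint32_t cause)
{
    return (cause >> CAUSE_EXC_SHIFT) & CAUSE_EXC_MASK;
}

unsigned cause_ip(uint32_t cause)
{
    return (cause >> CAUSE_IP_SHIFT) & CAUSE_IP_MASK;
}

unsigned cause_ce(uint32_t cause)
{
    return (cause >> CAUSE_CE_SHIFT) & CAUSE_CE_MASK;
}

/* Address of the instruction which took the exception:
 * EPC itself, or EPC + 4 when it sits in a branch delay slot. */
uint32_t exception_victim_pc(uint32_t cause, uint32_t epc)
{
    return (cause & CAUSE_BD) ? epc + 4u : epc;
}
```

A dispatcher would typically switch on cause_exccode() and only call exception_victim_pc() for the cases where the instruction has to be decoded or emulated.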

EPC Register
This is a 32-bit register containing the 32-bit address of the return point
for this exception. The instruction causing (or suffering) the exception is at
EPC, unless BD is set in Cause, in which case EPC points to the previous
(branch) instruction.
BadVaddr Register
A 32-bit register containing the address whose reference led to an
exception; set on any MMU-related exception, on an attempt by a user
program to access addresses outside kuseg, or if an address is wrongly
aligned for the datum size referenced.
After any other exception this register is undefined. Note in particular
that it is not set after a bus error.

R3041, R3071, and R3081 specific registers
Count and Compare Registers (R3041 only)
Only present in the R3041, these provide a simple 24-bit counter/timer
running at CPU cycle rate. Count counts up, and then wraps around to
zero once it has reached the value in the Compare register. As it wraps
around the Tc* CPU output is asserted. According to CPU configuration
(bit TC of the BusCtrl register), Tc* will either remain active until reset by
software (re-write Compare), or will pulse. In either case the counter just
keeps counting. To generate an interrupt Tc* must be connected to one of
the interrupt inputs.
From reset Compare is set up to its maximum value (0xFF FFFF), so the
counter runs up to 2^24-1 before wrapping around.
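The arithmetic for choosing a Compare value can be sketched in C. This assumes that Count runs from 0 up to and including the Compare value before wrapping, so that one period lasts Compare + 1 CPU cycles; that reading is consistent with the reset behaviour just described but should be checked against the R3041 hardware manual. The function names and the clock figure in the usage note are hypothetical.

```c
#include <stdint.h>

#define TIMER_MASK 0x00FFFFFFu   /* Count and Compare are 24 bits wide */

/* CPU cycles between wrap-arounds, assuming Count runs 0..Compare. */
uint32_t timer_period_cycles(uint32_t compare)
{
    return (compare & TIMER_MASK) + 1u;
}

/* Compare value giving roughly ticks_per_second wrap-arounds a second,
 * clamped to the longest period a 24-bit counter can produce. */
uint32_t timer_compare_for(uint32_t cpu_hz, uint32_t ticks_per_second)
{
    uint32_t cycles;

    if (ticks_per_second == 0u)
        return TIMER_MASK;
    cycles = cpu_hz / ticks_per_second;
    if (cycles == 0u)
        cycles = 1u;
    if (cycles > TIMER_MASK + 1u)
        cycles = TIMER_MASK + 1u;
    return cycles - 1u;
}
```

For example, with a hypothetical 20MHz clock, timer_compare_for(20000000, 100) yields 199999 for a 100Hz tick.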
Config Register (R3071 and R3081)
  31     30         29          28-26    25     24    23    22-0
  Lock   Slow Bus   DB Refill   FPInt    Halt   RF    AC    reserved

Figure 3.4. Fields in the R3071/81 Config Register

• Lock : set this bit to write to the register for the last time; all future
writes to Config will be ignored. The intention is that initialization
software will set the register and can then lock it in case some ill-behaved piece of software developed on some earlier version of the
MIPS architecture tries to stomp on Config; this would have had no
effect on earlier CPUs.
• Slow Bus : hardware may require that this bit be set. It only matters
when the CPU performs a store while running from a cached location.
The system hardware design determines the proper setting for this
bit; setting it to ‘1’ should be permissible for any system, but loses
some performance in memory systems able to support more
aggressive bus performance.
If set 1, an idle bus cycle is guaranteed between any read and write
transfer. This enables additional time for bus tri-stating, control logic
generation, etc.
• DB : ‘‘data cache block refill’’, set 1 to reload 4 words into the data
cache on any miss, set 0 to reload just one word. Can be initialized
either way on the R3081, by a reset-time hardware input.
• FPInt : controls the CPU interrupt level on which FPA interrupts are
reported. On original R3000 CPUs the FPA was external and this was
determined by wiring; but the R3081’s FPA is on the chip and it would
be inefficient (and jeopardize pin-compatibility) to send the interrupt
off chip and on again.
Set FPInt to the binary value of the CPU interrupt pin number which
is dedicated to FPA interrupts. By default the field is initialized to
“011” to select the pin Int3†; MIPS convention put the FPA on
external interrupt pin 3. For whichever pin is dedicated to the FPA,
the CPU will then ignore the value on the external pin; the IP field of
the cause register will simply follow the FPA.
On the R3071, this field is “reserved”, and must be written as “000”.
• Halt : set to bring the CPU to a standstill. It will start again as soon as
any interrupt input is asserted (regardless of the state of the interrupt
mask). This is useful for power reduction, and can also be used to
emulate old MC68000 “Halt” operation.
• RF : slows the CPU to 1/16th of the normal clock rate, to reduce power
consumption. Illegal unless the CPU is running at 33MHz or higher.
Note that the CPU’s output clock (which is normally used to
synchronize all the interface logic) slows down too; the hardware
design should also accommodate this feature if software desires to
use it.
• AC : ‘‘alternate cache’’. 0 for 16K I-cache/4K D-cache, but set 1 for 8K
I-cache/8K D-cache.
• Reserved : must only be written as zero. It will probably read as zero,
but software should not rely on this.
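A start-up routine might build the R3071/R3081 Config value along the following lines. This is a sketch only: the masks follow the bit positions of Figure 3.4, the names are invented, and the choice of settings in any real system is dictated by the hardware design. The second helper captures the relationship between an external interrupt pin number and its bit within the 8-bit IP and IM fields, where the two software interrupts occupy bits 0 and 1.

```c
#include <stdint.h>

/* Bit positions from Figure 3.4; names are illustrative only. */
#define CFG_LOCK         0x80000000u   /* bit 31 */
#define CFG_SLOW_BUS     0x40000000u   /* bit 30 */
#define CFG_DB_REFILL    0x20000000u   /* bit 29 */
#define CFG_FPINT_SHIFT  26            /* bits 28..26 */
#define CFG_FPINT_MASK   (0x7u << CFG_FPINT_SHIFT)
#define CFG_HALT         0x02000000u   /* bit 25 */
#define CFG_RF           0x01000000u   /* bit 24 */
#define CFG_AC           0x00800000u   /* bit 23 */

/* Compose a Config value; reserved bits 22..0 are written as zero. */
uint32_t r3081_config(unsigned fp_pin, int slow_bus, int db_refill,
                      int alt_cache, int lock)
{
    uint32_t v = ((uint32_t)fp_pin << CFG_FPINT_SHIFT) & CFG_FPINT_MASK;

    if (slow_bus)
        v |= CFG_SLOW_BUS;
    if (db_refill)
        v |= CFG_DB_REFILL;
    if (alt_cache)
        v |= CFG_AC;
    if (lock)
        v |= CFG_LOCK;
    return v;
}

/* External pin IntN appears as bit N + 2 of the IP and IM fields. */
unsigned int_pin_to_ip_bit(unsigned pin)
{
    return pin + 2u;
}
```

Because a locked register ignores later writes, a cautious initialization sequence would write the value once with lock clear, verify it, and set Lock only on the final write.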
Config Register (R3041)
  31     30    29     28-20    19     18-0
  Lock   1     DBR    0        FDM    0

Figure 3.5. Fields in the R3041 Config (Cache Configuration) Register

† Take care: the external pin Int3 corresponds to the bit numbered
‘‘5’’ in IP of the Cause register or IM of the SR register. That’s
because both the Cause and SR fields support two ‘‘software
interrupts’’ numbered as bits 0 and 1.

• Lock: set 1 to finally configure register (additional writes will not have
any effect until the CPU is reset).
• 1 and 0 : set fields to exactly the value shown.
• DBR: ‘‘DBlockRefill’’, set 1 to read 4 words into the cache on a miss,
0 to refill just the word missed on. The proper setting for a given
system is dependent on a number of factors, and may best be
determined by measuring performance in each mode and selecting
the best one. Note that it is possible for software to dynamically
reconfigure the refill algorithm depending on the current code
executing, presuming the register has not been “locked”.
• FDM: “Force D-Cache Miss”, set 1 for an R3041-specific cache mode,
where all loads result in data being fetched from memory (missing in
the data cache), but the incoming data is still used to refill the cache.
Stores continue to write the cache. This is useful when software
desires to obtain the high-bandwidth of the cache and cache refills,
but the corresponding main memory is “volatile” (e.g. a FIFO, or
updated by DMA).
BusCtrl Register (R3041 only)
The R3041 CPU has many hardware interface options not available on
other members of the R30xx family, which are intended to allow the use of
simpler and cheaper interface and memory components. The BusCtrl
register does most of the configuration work. It needs to be set strictly in
accordance with the needs of the hardware implementation. Note also that
its default settings (from reset) leave the interface compatible with other
R30xx family members.
Figure 3.6, “Fields in the R3041 Bus Control (BusCtrl) Register” shows
the layout of the fields, and their uses are provided for completeness.
  31     30-28   27-26   25-24   23-22   21     20   19   18-16   15-14   13    12   11   10-0
  Lock   10      Mem     ED      IO      BE16   1    BE   11      BTA     DMA   TC   BR   0x300

Figure 3.6. Fields in the R3041 Bus Control (BusCtrl) Register

• Lock: when software has initialized BusCtrl to its desired state it may
write this bit to prevent its contents being changed again until the
system is reset.
• 10 and other numbers : write exactly the specified bit pattern to this
field (hex used for big ones, but others are given as binary). Improper
values may cause test modes and other unexpected side effects.
• Mem : ‘‘MemStrobe* control’’. Set this field to xy binary, where x set
means the strobe activates on reads, and y set makes it active on
writes.
• ED: ‘‘ExtDataEn* control’’. Encoded as for ‘‘Mem’’. Note that the BR
bit must be zero for this pin to function as an output.
• IO: ‘‘IOStrobe* control’’. Encoded as for ‘‘Mem’’. Note that the BR bit
must be zero for this pin to function as an output.
• BE16: ‘‘BE16(1:0)* read control’’ – 0 to make these pins active on
write cycles only.
• BE: ‘‘BE(3:0)* read control’’ – 0 to make these pins active on write
cycles only.
• BTA: ‘‘Bus turn around time’’. Program with a binary number
between 0 and 3, for 0-3 cycles of guaranteed delay between the end
of a read cycle and the start of the address phase of the next cycle.
This field enables the use of devices with slow tri-state time, and
enables the system designer to save cost by omitting data
transceivers.

• DMA: ‘‘DMA Protocol Control’’, enables ‘‘DMA pulse protocol’’. When
set, the CPU uses its DMA control pins to communicate its desire for
the bus even while a DMA is in progress.
• TC: ‘‘TC* negation control’’. TC* is the output pin which is activated
when the internal timer register Count reaches the value stored in
Compare. Set TC zero to make the TC* pin just pulse for a couple of
clock periods; leave TC as 1, and TC* will be asserted on a compare
and remain asserted until software explicitly clears it (by re-writing
Compare with any value).
If TC* is used to generate a timer interrupt, then use the default (TC
== 1). The pulse is more useful when the output is being used by
external logic (e.g. to signal a DRAM refresh).
• BR: ‘‘SBrCond(3:2) control’’. Set zero to recycle the SBrCond(3:2)
pins as IOStrobe and ExtDataEn respectively.
PortSize Register (R3041 only)
The PortSize register is used to flag different parts of the program
address space for accesses to 8-, 16- or 32-bit wide memory.
Settings of this register have to be made at a time and to values which
will be mandated by the hardware design. See ‘‘IDT79R3041 Hardware
User’s Manual’’ for details.

What registers are relevant when?
The various CP0 registers and their fields provide support at specific
times during system operation.
• After hardware reset: software must initialize SR to get the CPU into
the right state to bootstrap itself.
• Hardware configuration at start-up: an R3041, R3071, or R3081
requires initialization of Config, BusCtrl, and/or PortSize before very
much will work. The system hardware implementation will dictate the
proper configuration of these registers.
• After any exception: any MIPS exception (apart from one particular
MMU event) invokes a single common ‘‘general exception handler’’
routine, at a fixed address.
On entry, no program registers are saved, only the return address in
EPC. The MIPS hardware knows nothing about stacks. In any case the
exception routine cannot use the user-mode stack for any purpose;
the exception might have been a TLB miss on stack memory.
Exception software will need to use at least one of k0 and k1 to point
to some ‘‘safe’’ (exception-proof) memory space. Key information can
be saved, using the other k0 or k1 register to stage data from control
registers where necessary.
Consult the Cause register to find out what kind of exception it was
and dispatch accordingly.
• Returning from exception: control must eventually be returned to the
value stored in EPC on entry.
Whatever kind of exception it was, software will have to adjust SR
back upon return from exception. The special instruction rfe does the
job; but note that it does not transfer control. To make the jump back
software must load the original EPC value back into a general-purpose register and use a jr operation.
• Interrupts: SR is used to adjust the interrupt masks, to determine
which (if any) interrupts will be allowed ‘‘higher priority’’ than the
current one. The hardware offers no interrupt prioritization, but the
software can do whatever it likes.
• Instructions which always cause exceptions: these are often used (for
system calls, breakpoints, and to emulate some kinds of instruction).
Handling them sometimes requires partial decoding of the offending
instruction, which can usually be found at the location EPC. But
there is a complication; suppose that an exception occurs just after a
branch but in time to prevent the branch delay slot instruction from
running. Then EPC will point to the branch instruction (resuming
execution starting at the delay slot would cause the branch to be
ignored), and the BD bit in the Cause register will be set.
To find the instruction at which the exception occurred, add 4 to
the EPC value when the BD bit is set.
• Cache management routines: SR contains bits defining special modes
for cache management. In particular they allow software to isolate the
data cache, and to swap the roles of the instruction and data caches.
The subsequent chapters will describe appropriate treatment of these
registers, and provide software examples of their use.
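Since the hardware leaves interrupt prioritization to software, an interrupt handler has to choose which pending source to service first. One possible policy is sketched below in C: it treats the highest-numbered bit as the most urgent, which is purely an assumption made for this example. It also assumes that IM occupies SR bits 15..8, lining up with IP in Cause bits 15..8. IEc is deliberately not consulted, because it has already been cleared by the time the handler runs.

```c
#include <stdint.h>

#define INT_FIELD_SHIFT 8      /* IP in Cause and IM in SR, bits 15..8 */
#define INT_FIELD_MASK  0xFFu

/* Return the number (0..7) of the highest-numbered interrupt which is
 * both pending in Cause and enabled in SR, or -1 if there is none.
 * "Highest number wins" is this sketch's policy, not the hardware's. */
int pick_interrupt(uint32_t sr, uint32_t cause)
{
    uint32_t active = (cause >> INT_FIELD_SHIFT) &
                      (sr >> INT_FIELD_SHIFT) & INT_FIELD_MASK;
    int bit;

    for (bit = 7; bit >= 0; bit--) {
        if (active & (1u << bit))
            return bit;
    }
    return -1;
}
```

A handler which wants to allow ‘‘higher priority’’ interrupts while servicing the chosen one would then rewrite IM to leave only the bits above the returned number enabled before setting IEc again.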

CHAPTER 2

MIPS-1 (R30xx) ARCHITECTURE

Conventional names and uses of general-purpose registers
Although the hardware makes few rules about the use of registers, their
practical use is governed by a number of conventions. These conventions
allow inter-changeability of tools, operating systems, and library modules.
It is strongly recommended that these conventions be followed.
Reg No   Name    Used for
0        zero    Always returns 0
1        at      (assembler temporary) Reserved for use by assembler
2-3      v0-v1   Value (except FP) returned by subroutine
4-7      a0-a3   (arguments) First four parameters for a subroutine
8-15     t0-t7   (temporaries) subroutines may use without saving
24-25    t8-t9
16-23    s0-s7   Subroutine ‘‘register variables’’; a subroutine which will write
                 one of these must save the old value and restore it before it
                 exits, so the calling routine sees their values preserved.
26-27    k0-k1   Reserved for use by interrupt/trap handler - may change
                 under your feet
28       gp      global pointer - some runtime systems maintain this to give
                 easy access to (some) ‘‘static’’ or ‘‘extern’’ variables.
29       sp      stack pointer
30       s8/fp   9th register variable. Subroutines which need one can use
                 this as a ‘‘frame pointer’’.
31       ra      Return address for subroutine

Table 2.1. Conventional names of registers with usage mnemonics

With the conventional uses of the registers go a set of conventional
names. Given the need to fit in with the conventions, use of the
conventional names is pretty much mandatory. The common names are
described in Table 2.1, “Conventional names of registers with usage
mnemonics”.
Notes on conventional register names
• at : this register is reserved for use inside the synthetic instructions
generated by the assembler. If the programmer must use it explicitly
the directive .noat stops the assembler from using it, but then there
are some things the assembler won’t be able to do.
• v0-v1 : used when returning non-floating-point values from a
subroutine. To return anything bigger than 2×32 bits, memory must
be used (described in a later chapter).
• a0-a3 : used to pass the first four non-FP parameters to a subroutine.
That’s an occasionally-false oversimplification; the actual convention
is fully described in a later chapter.
• t0-t9 : by convention, subroutines may use these values without
preserving them. This makes them easy to use as ‘‘temporaries’’ when
evaluating expressions – but a caller must remember that they may
be destroyed by a subroutine call.
• s0-s8 : by convention, subroutines must guarantee that the values of
these registers on exit are the same as they were on entry – either by
not using them, or by saving them on the stack and restoring before
exit.
This makes them eminently suitable for use as ‘‘register variables’’ or
for storing any value which must be preserved over a subroutine call.
• k0-k1 : reserved for use by the trap/interrupt routines, which will not
restore their original value; so they are of little use to anyone else.
• gp : (global pointer). If present, it will point to a load-time-determined
location in the midst of your static data. This means that loads and
stores to data lying within 32Kbytes either side of the gp value can be
performed in a single instruction using gp as the base register.
Without the global pointer, loading data from a static memory area
takes two instructions: one to load the most significant bits of the 32-bit constant address computed by the compiler and loader, and one
to do the data load.
To use gp a compiler must know at compile time that a datum will end
up linked within a 64Kbyte range of memory locations. In practice it
can’t know, only guess. The usual practice is to put ‘‘small’’ global
data items in the area pointed to by gp, and to get the linker to
complain if it still gets too big. The definition of what is “small” can
typically be specified with a compiler switch (most compilers use “-G”). The most common default size is 8 bytes or less.
Not all compilation systems or OS loaders support gp.
• sp : (stack pointer). Since it takes explicit instructions to raise and
lower the stack pointer, it is generally done only on subroutine entry
and exit; and it is the responsibility of the subroutine being called to
do this. sp is normally adjusted, on entry, to the lowest point that the
stack will need to reach at any point in the subroutine. Now the
compiler can access stack variables by a constant offset from sp.
Stack usage conventions are explained in a later chapter.
• fp : (also known as s8). A subroutine will use a ‘‘frame pointer’’ to keep
track of the stack if it wants to use operations which involve extending
the stack by an amount which is determined at run-time. Some
languages may do this explicitly; assembler programmers are always
welcome to experiment; and (for many toolchains) C programs which
use the ‘‘alloca’’ library routine will find themselves doing so.
In this case it is not possible to access stack variables from sp, so fp
is initialized by the function prologue to a constant position relative
to the function’s stack frame. Note that a ‘‘frame pointer’’ subroutine
may call or be called by subroutines which do not use the frame
pointer; so long as the functions it calls preserve the value of fp (as
they should) this is OK.
• ra : (return address). On entry to any subroutine, ra holds the address
to which control should be returned – so a subroutine typically ends
with the instruction ‘‘jr ra’’.
Subroutines which themselves call subroutines must first save ra,
usually on the stack.

Integer multiply unit and registers
MIPS’ architects decided that integer multiplication was important
enough to deserve a hard-wired instruction. This is not so common in
RISCs, which might instead:
• implement a ‘‘multiply step’’ which fits in the standard integer
execution pipeline, and require software routines for every
multiplication (e.g. Sparc or AM29000); or
• perform integer multiplication in the floating point unit – a good
solution but which compromises the optional nature of the MIPS
floating point ‘‘co-processor’’.
The multiply unit consumes a small amount of die area, but
dramatically improves performance (and cache performance) over
“multiply step” operations. Its basic operation is to multiply two 32-bit
values together to produce a 64-bit result, which is stored in two 32-bit
registers (called ‘‘hi’’ and ‘‘lo’’) which are private to the multiply unit.
Instructions mfhi, mflo are defined to copy the result out into general
registers.
Unlike results for integer operations, the multiply result registers are
interlocked. An attempt to read out the results before the multiplication is
complete results in the CPU being stopped until the operation completes.
The integer multiply unit will also perform an integer division between
values in two general-purpose registers; in this case the ‘‘lo’’ register stores
the quotient, and the ‘‘hi’’ register the remainder.
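The way results are split between ‘‘hi’’ and ‘‘lo’’ can be modelled in C as shown below. This is a behavioural sketch, not a description of the hardware: the type and function names are invented, division is assumed to truncate towards zero as C does, and the cases which the hardware leaves undefined or which C cannot express (division by zero, and the most negative value divided by -1) are given placeholder results.

```c
#include <stdint.h>

typedef struct {
    uint32_t hi;
    uint32_t lo;
} hilo_t;

/* Signed 32x32 multiply: hi gets the upper word, lo the lower word. */
hilo_t model_mult(int32_t a, int32_t b)
{
    uint64_t p = (uint64_t)((int64_t)a * (int64_t)b);
    hilo_t r = { (uint32_t)(p >> 32), (uint32_t)p };
    return r;
}

/* Unsigned 32x32 multiply. */
hilo_t model_multu(uint32_t a, uint32_t b)
{
    uint64_t p = (uint64_t)a * (uint64_t)b;
    hilo_t r = { (uint32_t)(p >> 32), (uint32_t)p };
    return r;
}

/* Signed divide: lo gets the quotient, hi the remainder. */
hilo_t model_div(int32_t a, int32_t b)
{
    hilo_t r = { 0u, 0u };               /* placeholder for b == 0 */

    if (b == 0)
        return r;
    if (a == INT32_MIN && b == -1) {     /* not representable in C */
        r.lo = 0x80000000u;
        return r;
    }
    r.lo = (uint32_t)(a / b);
    r.hi = (uint32_t)(a % b);
    return r;
}
```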
In the R30xx family, multiply operations take 12 clocks and division
takes 35. The assembler has a synthetic multiply operation which starts
the multiply and then retrieves the result into an ordinary register. Note
that MIPS Corp.’s assembler may even substitute a series of shifts and
adds for multiplication by a constant, to improve execution speed.
Multiply/divide results are written into ‘‘hi’’ and ‘‘lo’’ as soon as they are
available; the effect is not deferred until the writeback pipeline stage, as
with writes to general purpose (GP) registers. If a mfhi or mflo instruction
is interrupted by some kind of exception before it reaches the writeback
stage of the pipeline, it will be aborted with the intention of restarting it.
However, a subsequent multiply instruction which has passed the ALU
stage will continue (in parallel with exception processing) and would
overwrite the ‘‘hi’’ and ‘‘lo’’ register values, so that the re-execution of the
mfhi would get wrong (i.e. new) data. For this reason it is recommended
that a multiply should not be started within two instructions of an mfhi/
mflo. The assembler will avoid doing this where it can.
Integer multiply and divide operations never produce an exception,
though divide by zero produces an undefined result. Compilers will often
generate code to trap on errors, particularly on divide by zero. Frequently,
this instruction sequence is placed after the divide is initiated, to allow it
to execute concurrently with the divide (and avoid a performance loss).
Instructions mthi, mtlo are defined to set up the internal registers from
general-purpose registers. They are essential to restore the values of ‘‘hi’’
and ‘‘lo’’ when returning from an exception, but probably not for anything
else.

Instruction types
A full list of R30xx family integer instructions is presented in Appendix
A. Floating point instructions are listed in Appendix B of this manual.
Currently, floating point instructions are only available in the R3081, and
are described in the R3081 User’s Manual.
The MIPS-1 ISA uses only three basic instruction encoding formats; this
is one of the keys to the high frequencies attained by RISC architectures.
Instructions are mostly in numerical order; to simplify reading, the list
is occasionally re-ordered for clarity.
Throughout this manual, the description of various instructions will
also refer to various subfields of the instruction. In general, the following
typical nomenclature is used:
op
The basic op-code, which is 6 bits long. Instructions which have large
sub-fields (for example, large immediate values, such as required
for the ‘‘long’’ j/jal instructions, or arithmetic with a 16-bit
constant) have a unique ‘‘op’’ field. Other instructions are
classified in groups sharing an ‘‘op’’ value, distinguished by
other fields (‘‘op2’’ etc.).
rs, rs1, rs2
One or two fields identifying source registers.
rd
The register to be changed by this instruction.
sa
Shift-amount: How far to shift, used in shift-by-constant
instructions.

op2

Sub-code field used for the 3-register arithmetic/logical group of
instructions (op value of zero).
offset 16-bit signed word offset defining the destination of a ‘‘PC-relative’’ branch. The branch target will be the instruction
‘‘offset’’ words away from the ‘‘delay slot’’ instruction after the
branch; so a branch-to-self has an offset of -1.
target 26-bit word address to be jumped to (it corresponds to a 28-bit
byte address, which is always word-aligned). The long j
instruction is rarely used, so this format is pretty much
exclusively for function calls (jal).
The high-order 4 bits of the target address can’t be specified by
this instruction, and are taken from the address of the jump
instruction. This means that these instructions can reach
anywhere in the 256Mbyte region around the instructions’
location. To jump further use a jr (jump register) instruction.
constant
16-bit integer constant for ‘‘immediate’’ arithmetic or logic
operations.
mf
Yet another extended opcode field, this time used by ‘‘co-processor’’ type instructions.
rg
Field which may hold a source or destination register.
crg
Field to hold the number of a CPU control register (different from
the integer register file). Called ‘‘crs’’/‘‘crd’’ in contexts where it
must be a source/destination respectively.
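The ‘‘offset’’ and ‘‘target’’ definitions above amount to two small address calculations, sketched here in C. The function names are invented. For the jump, this sketch takes the high-order four bits from the address of the delay slot instruction (the jump’s own address plus 4), which gives the same answer as using the jump’s address except when the jump is the last word of a 256Mbyte region.

```c
#include <stdint.h>

/* Destination of a PC-relative branch: the 16-bit field is a signed
 * word count measured from the delay slot instruction. */
uint32_t branch_target(uint32_t branch_pc, uint16_t offset_field)
{
    int32_t words = (int16_t)offset_field;      /* sign-extend */
    return branch_pc + 4u + ((uint32_t)words << 2);
}

/* Destination of j/jal: 26-bit word address within the current
 * 256Mbyte region. */
uint32_t jump_target(uint32_t jump_pc, uint32_t target_field)
{
    uint32_t region = (jump_pc + 4u) & 0xF0000000u;
    return region | ((target_field & 0x03FFFFFFu) << 2);
}
```

As a check on the definition, a branch-to-self (offset field 0xFFFF, that is -1) comes back to the address of the branch itself.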
The instruction encodings have been chosen to facilitate the design of a
high-frequency CPU, and they do reveal portions of the internal CPU
design. Although there are variable encodings, those fields which are
required very early in the pipeline are encoded in a very regular way:
• Source registers are always in the same place : so that the CPU can
fetch two operands from the integer register file without any
conditional decoding. Some instructions may not need both registers
– but since the register file is designed to provide two source values
on every clock nothing has been lost.
• 16-bit constant is always in the same place : permitting the
appropriate instruction bits to be fed directly into the ALU’s input
multiplexer, without conditional shifts.

Loading and storing: addressing modes
As mentioned above, there is only one basic ‘‘addressing mode’’. Any
load or store machine instruction can be written as:
operation dest-reg, offset(src-reg)
e.g.: lw $1, offset($2); sw $3, offset($4)

Any of the GP registers can be used for the destination and source. The
offset is a signed, 16-bit number (so can be anywhere between -32768 and
32767); the program address used for the load or store is the sum of
src-reg and the offset. This address mode is normally enough to pick out
a particular member of a C structure (‘‘offset’’ being the distance
between the start of the structure and the member required); it
implements an array indexed by a constant; it is enough to reference
function variables from the stack or frame pointer; and it provides a
reasonable sized global area around the gp value for static and extern
variables.
The assembler provides the semblance of a simple direct addressing
mode, to load the values of memory variables whose address can be
computed at link time.

More complex modes such as double-register or scaled index must be
implemented with sequences of instructions.
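The address calculation performed by every load and store, and the extra work needed for an indexed access, can be written out in C as follows. The function names are invented for this sketch; the second function stands for the shift and add which software must issue before a load with a zero offset.

```c
#include <stdint.h>

/* The one hardware addressing mode: base register plus a
 * sign-extended 16-bit offset. */
uint32_t effective_address(uint32_t base, uint16_t offset_field)
{
    int32_t offset = (int16_t)offset_field;     /* -32768..32767 */
    return base + (uint32_t)offset;
}

/* A scaled-index access to a word array has no hardware support;
 * the address is first computed by ordinary instructions. */
uint32_t indexed_word_address(uint32_t base, uint32_t index)
{
    return base + (index << 2);
}
```

So a structure member 8 bytes from the start of a structure whose address is in a register needs a single load with offset 8, while an array element selected by a run-time index costs the two extra instructions modelled by indexed_word_address().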

Data types in Memory and registers
The R30xx family CPUs can load or store between 1 and 4 bytes in a
single operation. Naming conventions are used in the documentation and
to build instruction mnemonics:
‘‘C’’ name   MIPS name   Size(bytes)   Assembler mnemonic
int          word        4             ‘‘w’’ as in lw
long         word        4             ‘‘w’’ as in lw
short        halfword    2             ‘‘h’’ as in lh
char         byte        1             ‘‘b’’ as in lb

Integer data types
Byte and halfword loads come in two flavors:
• Sign-extend : lb and lh load the value into the least significant bits of
the 32-bit register, but fill the high order bits by copying the ‘‘sign bit’’
(bit 7 of a byte, bit 15 of a half-word). This correctly converts a signed
value to a 32-bit signed integer.
• Zero-extend : instructions lbu and lhu load the value into the least
significant bits of a 32-bit register, with the high order bits filled with
zero. This correctly converts an unsigned value in memory to the
corresponding 32-bit unsigned integer value; so byte value 254
becomes 32-bit value 254.
If the byte-wide memory location whose address is in t1 contains the
value 0xFE (-2, or 254 if interpreted as unsigned), then:
lb      t2, 0(t1)
lbu     t3, 0(t1)

will leave t2 holding the value 0xFFFF FFFE (-2 as signed 32-bit) and t3
holding the value 0x0000 00FE (254 as signed or unsigned 32-bit).
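The same two flavors of extension can be expressed in C, which may help when reasoning about what a compiler will generate for char and short variables. The function names are invented for this sketch; each returns the 32-bit register contents which the corresponding load would produce for the given memory value.

```c
#include <stdint.h>

/* lb: copy bit 7 into the upper 24 bits. */
uint32_t model_lb(uint8_t mem_byte)
{
    return (uint32_t)(int32_t)(int8_t)mem_byte;
}

/* lbu: fill the upper 24 bits with zero. */
uint32_t model_lbu(uint8_t mem_byte)
{
    return (uint32_t)mem_byte;
}

/* lh: copy bit 15 into the upper 16 bits. */
uint32_t model_lh(uint16_t mem_half)
{
    return (uint32_t)(int32_t)(int16_t)mem_half;
}

/* lhu: fill the upper 16 bits with zero. */
uint32_t model_lhu(uint16_t mem_half)
{
    return (uint32_t)mem_half;
}
```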
Subtle differences in the way shorter integers are extended to longer
ones are a historical cause of C portability problems, and the modern C
standards have elaborate rules. On machines like the MIPS, which does
not perform 8- or 16-bit precision arithmetic directly, expressions
involving short or char variables are less efficient than word operations.
Unaligned loads and stores
Normal loads and stores in the MIPS architecture must be aligned; half-words may be loaded only from 2-byte boundaries, and words only from 4-byte boundaries. A load instruction with an unaligned address will
produce a trap. Because CISC architectures such as the MC680x0 and
iAPXx86 do handle unaligned loads and stores, this could complicate
porting software from one of these architectures. The MIPS architecture
does provide mechanisms to support this type of operation; in extremity,
software can provide a trap handler which will emulate the desired load
operation and hide this feature from the application.
All data items declared by C code will be correctly aligned.
But when it is known in advance that the program will transfer a word
from an address whose alignment is unknown and will be computed at run
time, the architecture does allow for a special 2-instruction sequence
(much more efficient than a series of byte loads, shifts and assembly). This
sequence is normally generated by the macro-instruction ulw (unaligned
load word).


(A macro-instruction ulh, unaligned load half, is also provided, and is
synthesized by two loads, a shift, and a bitwise ‘‘or’’ operation.)
The special machine instructions are lwl and lwr (load word left, load
word right). ‘‘Left’’ and ‘‘right’’ are arithmetical directions, as in ‘‘shift left’’;
‘‘left’’ is movement towards more significant bits, ‘‘right’’ is towards less
significant bits.
These instructions do three things:
• load 1, 2, 3 or 4 bytes from within one aligned 4-byte (word) location;
• shift that data to move the byte selected by the address to either the
most-significant (lwl) or least-significant (lwr) end of a 32-bit field;
• merge the bytes fetched from memory with the data already in the
destination.
This breaks most of the rules the architecture usually sticks by; it does
a logical operation on a memory variable, for example. Special hardware
allows the lwl, lwr pair to be used in consecutive instructions, even though
the second instruction uses the value generated by the first.
For example, on a CPU configured as big-endian the assembler
instruction:
        ulw     t1, 0(t2)
        add     t4, t3, t1

is implemented as:
        lwl     t1, 0(t2)
        lwr     t1, 3(t2)
        nop
        add     t4, t3, t1

Where:
• the lwl picks up the lowest-addressed byte of the unaligned 4-byte
region, together with however many more bytes which fit into an
aligned word. It then shifts them left, to form the most-significant
bytes of the register value.
• the lwr is aimed at the highest-addressed byte in the unaligned 4-byte
region. It loads it, together with any bytes which precede it in the
same memory word, and shifts it right to get the least significant bits
of the register value. The merge leaves the high-order bits unchanged.
• Although special hardware ensures that a nop is not required between
the lwl and lwr, there is still a load delay between the second of them
and a normal instruction.
Note that if t2 was in fact 4-byte aligned, then both instructions load the
entire word; duplicating effort, but achieving the desired effect.
CPU behavior when operating with little-endian byte order is described
in a later chapter.
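The big-endian behavior of the lwl/lwr pair described above can be modelled in C as follows. This is a sketch for illustration, assuming memory is presented as a byte array; the function names are hypothetical, and the real instructions perform the merge in hardware.

```c
#include <stdint.h>

/* Fetch the aligned big-endian word containing addr. */
uint32_t load_aligned_be(const uint8_t *mem, uint32_t addr)
{
    uint32_t base = addr & ~3u;
    return ((uint32_t)mem[base] << 24) | ((uint32_t)mem[base + 1] << 16) |
           ((uint32_t)mem[base + 2] << 8) | (uint32_t)mem[base + 3];
}

/* Model of lwl: bytes from addr to the end of its word go to the
 * most-significant end of the register; low-order bytes are kept. */
uint32_t lwl_be(uint32_t reg, const uint8_t *mem, uint32_t addr)
{
    uint32_t k = addr & 3u;
    uint32_t word = load_aligned_be(mem, addr);
    if (k == 0)
        return word;                          /* whole word loaded */
    uint32_t keep = (1u << (8 * k)) - 1u;     /* low k bytes unchanged */
    return (word << (8 * k)) | (reg & keep);
}

/* Model of lwr: bytes from the start of the word up to addr go to the
 * least-significant end of the register; high-order bytes are kept. */
uint32_t lwr_be(uint32_t reg, const uint8_t *mem, uint32_t addr)
{
    uint32_t k = addr & 3u;
    uint32_t word = load_aligned_be(mem, addr);
    if (k == 3)
        return word;                          /* whole word loaded */
    uint32_t shift = 8 * (3 - k);
    uint32_t loaded = 0xFFFFFFFFu >> shift;   /* bytes replaced by the load */
    return (word >> shift) | (reg & ~loaded);
}

/* Model of the ulw macro: lwl at offset 0, then lwr at offset 3. */
uint32_t ulw_be(const uint8_t *mem, uint32_t addr)
{
    uint32_t r = lwl_be(0, mem, addr);
    return lwr_be(r, mem, addr + 3);
}
```

When addr is already 4-byte aligned, both halves load the entire word, which matches the note above about duplicated effort.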
Floating point data in memory
Loads into floating point registers from 4-byte aligned memory move
data without any interpretation – a program can load an invalid floating
point number and no FP error will result until an arithmetic operation is
requested with it as an operand.
This allows a programmer to load single-precision values by a load into
an even-numbered floating point register; but the programmer can also
load a double-precision value by a macro instruction, so that:
        ldc1    $f2, 24(t1)

is expanded to two loads to consecutive registers:
        lwc1    $f2, 24(t1)
        lwc1    $f3, 28(t1)

The C compiler aligns 8-byte long double-precision floating point
variables to 8-byte boundaries. R30xx family hardware does not require
this alignment; but it is done to avoid compatibility problems with
implementations of MIPS-2 or MIPS-3 CPUs such as the IDT R4600
(Orion), where the ldc1 instruction is part of the machine code, and the
alignment is necessary.

BASIC ADDRESS SPACE
The way in which MIPS processors use and handle addresses is subtly
different from that of traditional CISC CPUs, and may appear confusing.
Read the first part of this section carefully. Here are some guidelines:
• The addresses put into programs are rarely the same as the physical
addresses which come out of the chip (sometimes they’re close, but
not the same). This manual will refer to them as program addresses
and physical addresses respectively. A more common name for
program addresses is “virtual addresses”; note that the use of the
term “virtual address” does not necessarily imply that an operating
system must perform virtual memory management (e.g. demand
paging from disks...), but rather that the address undergoes some
transformation before being presented to physical memory. Although
virtual address is a proper term, this manual will typically use the
term “program address” to avoid confusing virtual addresses with
virtual memory management requirements.
• A MIPS-1 CPU has two operating modes: user and kernel. In user
mode, any address above 2Gbytes (most-significant bit of the address
set) is illegal and causes a trap. Also, some instructions cause a trap
in user mode.
• The 32-bit program address space is divided into four big areas with
traditional names; and different things happen according to the area
an address lies in:
kuseg 0x0000 0000 – 7FFF FFFF (low 2Gbytes): these are the addresses
permitted in user mode. In machines with an MMU (“E” versions
of the R30xx family), they will always be translated (more about
the R30xx MMU in a later chapter). Software should not attempt
to use these addresses unless the MMU is set up.
For machines without an MMU (“base” versions of the R30xx
family), the kuseg “program address” is transformed to a
physical address by adding a 1GB offset; the address
transformations for “base versions” of the R30xx family are
described later in this chapter. Note, however, that many
embedded applications do not use this address segment (those
applications which do not require that the kernel and its
resources be protected from user tasks).
kseg0 0x8000 0000 – 9FFF FFFF (512 Mbytes): these addresses are
‘‘translated’’ into physical addresses by merely stripping off the
top bit, mapping them contiguously into the low 512 Mbytes of
physical memory. This transformation operates the same for
both “base” and “E” family members. This segment is referred to
as “unmapped” because “E” version devices cannot redirect this
translation to a different area of physical memory.
Addresses in this region are always accessed through the cache,
so may not be used until the caches are properly initialized. They
will be used for most programs and data in systems using “base”
family members; and will be used for the OS kernel for systems
which do use the MMU (“E” version devices).


kseg1 0xA000 0000 – BFFF FFFF (512 Mbytes): these addresses are
mapped into physical addresses by stripping off the leading three
bits, giving a duplicate mapping of the low 512 Mbytes of
physical memory. However, kseg1 program address accesses will
not use the cache.
The kseg1 region is the only chunk of the memory map which is
guaranteed to behave properly from system reset; that’s why the
after-reset starting point ( 0xBFC0 0000, commonly called the
“reset exception vector”) lies within it. The physical address of
the starting point is 0x1FC0 0000 – which means that the
hardware should place the boot ROM at this physical address.
Software will therefore use this region for the initial program
ROM, and most systems also use it for I/O registers. In general,
IO devices should always be mapped to addresses that are
accessible from kseg1, and system ROM is always mapped to
contain the reset exception vector. Note that code in the ROM
can then be accessed uncacheably (during boot up) using kseg1
program addresses, and also can be accessed cacheably (for
normal operation) using kseg0 program addresses.
kseg2 0xC000 0000 – FFFF FFFF (1 Gbyte): this area is only
accessible in kernel mode. As for kuseg, in “E” devices program
addresses are translated by the MMU into physical addresses;
thus, these addresses must not be referenced prior to MMU
initialization. For “base versions”, physical addresses are
generated to be the same as program addresses for kseg2.
Note that many systems will not need this region. In “E” versions,
it frequently contains OS structures such as page tables; simpler
OSes probably will have little need for kseg2.
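The segment boundaries and the fixed kseg0/kseg1 mappings listed above can be summarized in a few lines of C. This is a sketch only; the enum and function names are illustrative and are not taken from any IDT header.

```c
#include <assert.h>
#include <stdint.h>

enum segment { KUSEG, KSEG0, KSEG1, KSEG2 };

/* Classify a 32-bit program address by the area it lies in. */
enum segment segment_of(uint32_t addr)
{
    if (addr < 0x80000000u) return KUSEG;   /* low 2 Gbytes */
    if (addr < 0xA0000000u) return KSEG0;   /* 512 Mbytes, cached */
    if (addr < 0xC0000000u) return KSEG1;   /* 512 Mbytes, uncached */
    return KSEG2;                           /* top 1 Gbyte */
}

/* kseg0 and kseg1 both map onto the low 512 Mbytes of physical memory. */
uint32_t kseg01_to_phys(uint32_t addr)
{
    assert(segment_of(addr) == KSEG0 || segment_of(addr) == KSEG1);
    return addr & 0x1FFFFFFFu;
}

/* Cached and uncached program addresses for a low physical address. */
uint32_t phys_to_kseg0(uint32_t phys)
{
    assert(phys < 0x20000000u);
    return phys | 0x80000000u;
}

uint32_t phys_to_kseg1(uint32_t phys)
{
    assert(phys < 0x20000000u);
    return phys | 0xA0000000u;
}

/* In user mode only kuseg is legal; any address with the top bit set traps. */
int user_mode_legal(uint32_t addr)
{
    return (addr & 0x80000000u) == 0;
}
```

For instance, the boot ROM at physical address 0x1FC0 0000 is reached uncached at 0xBFC0 0000 and cached at 0x9FC0 0000.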

SUMMARY OF SYSTEM ADDRESSING
MIPS program addresses are rarely simply the same as physical
addresses, but simple embedded software will probably use addresses in
kseg0 and kseg1, where the program address is related in an obvious and
unchangeable way to physical addresses.
Physical memory locations from 0x2000 0000 (512Mbyte) upward may
be difficult to access. In “E” versions of the R30xx family, the only way to
reach these addresses is through the MMU. In “base” family members,
certain of these physical addresses can be reached using kseg2 or kuseg
addresses: the address transformations for base R30xx family members are
described later in this chapter.

Kernel vs. user mode
In kernel mode (the CPU resets into this state), all program addresses
are accessible.
In user mode:
• Program addresses above 2Gbytes (top bit set) are illegal and will
cause a trap.
Note that if the CPU has an MMU, this means all valid user mode
addresses must be translated by the MMU; thus, User mode for “E”
devices typically requires the use of a memory-mapped OS.
For “base” CPUs, kuseg addresses are mapped to a distinct area of
physical memory. Thus, kernel memory resources (including IO
devices) can be made inaccessible to User mode software, without
requiring a memory-mapping function from the OS. Alternately, the
hardware can choose to “ignore” high-order address bits when
performing address decoding, thus “condensing” kuseg, kseg2, kseg1,
and kseg0 into the same physical memory.


• Instructions beyond the standard user set become illegal. Specifically,
the kernel can prevent User mode software from accessing the on-chip
CP0 (system control coprocessor, which controls exception and
machine state and performs the memory management functions of
the CPU).
Thus, the primary differences between User and Kernel modes are:
• User mode tasks can be inhibited from accessing kernel memory
resources, including OS data structures and IO devices. This also
means that various user tasks can be protected from each other.
• User mode tasks can be inhibited from modifying the basic machine
state, by prohibiting accesses to CP0.
Note that the kernel/user mode bit does not change the interpretation
of anything – just some things cease to be allowed in user mode. In kernel
mode the CPU can access low addresses just as if it was in user mode, and
they will be translated in the same way.

Memory map for CPUs without MMU hardware
The treatment of kseg0 and kseg1 addresses is the same for all IDT
R30xx CPUs. If the system can be implemented using only physical
addresses in the low 512Mbytes, and system software can be written to use
only kseg0 and kseg1, then the choice of “base” vs. “E” versions of the
R30xx family is not relevant.
For versions without the MMU (“base versions”), addresses in kuseg and
kseg2 will undergo a fixed address translation, and provide the system
designer the option to provide additional memory.
The base members of the R30xx family provide the following address
translations for kuseg and kseg2 program addresses:
• kuseg: this region (the low 2Gbytes of program addresses) is
translated to a contiguous 2Gbyte physical region between 1Gbyte
and 3Gbytes. In effect, a 1GB offset is added to each kuseg program
address. In hex:
Program address                     Physical Address
0x0000 0000 – 0x7FFF FFFF     →     0x4000 0000 – 0xBFFF FFFF

• kseg2: these program addresses are genuinely untranslated. So
program addresses from 0xC000 0000 – 0xFFFF FFFF emerge as
identical physical addresses.
This means that “base” versions can generate most physical addresses
(without the use of an MMU), except for a gap between 512Mbyte and
1Gbyte (0x2000 0000 through 0x3FFF FFFF). As noted above, many
systems may ignore high-order address bits when performing address
decoding, thus condensing all physical memory into the lowest 512MB
addresses.
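The complete fixed translation for “base” family members, as described in this section, can be sketched as a single C function. The function name is illustrative only.

```c
#include <stdint.h>

/* Program address to physical address for "base" (no MMU) R30xx CPUs. */
uint32_t base_program_to_phys(uint32_t addr)
{
    if (addr < 0x80000000u)          /* kuseg: 1 Gbyte offset added */
        return addr + 0x40000000u;
    if (addr < 0xC0000000u)          /* kseg0 and kseg1: low 512 Mbytes */
        return addr & 0x1FFFFFFFu;
    return addr;                     /* kseg2: genuinely untranslated */
}
```

No program address produces a physical address in the range 0x2000 0000 through 0x3FFF FFFF, which is the gap noted above.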
Subsegments in the R3041 – memory width configuration
The R3041 CPU can be configured to access different regions of memory
as either 32-, 16- or 8-bits wide. Where the program requests a 32-bit
operation to a narrow memory (either with an uncached access, or a cache
miss, or a store), the CPU may break a transaction into multiple data
phases, to match the datum size to the memory port width.
The width configuration is applied independently to subsegments of the
normal kseg regions, as follows:
• kseg0 and kseg1: as usual, these are both mapped onto the low
512Mbytes. This common region is split into 8 subsegments
(64Mbytes each), each of which can be programmed as 8-, 16- or
32-bits wide. The width assignment affects both kseg0 and kseg1
accesses (that is, one can view these as subsegments of the
corresponding “physical” addresses).


• kuseg: is divided into four 512Mbyte subsegments, each
independently programmable for width. Thus, kuseg can be broken
into multiple portions, which may have varying widths. An example of
this may be a 32-bit main memory with some 16-bit PCMCIA font
cards and an 8-bit NVRAM.
• kseg2: is divided into two 512Mbyte subsegments, independently
programmable for width. Again, this means that kseg2 can support
multiple memory subsystems, of varying port width.
Note that once the various memory port widths have been configured
(typically at boot time), software does not have to be aware of the actual
width of any memory system. It can choose to treat all memory as 32-bit
wide, and the CPU will automatically adjust when an access is made to a
narrower memory region. This simplifies software development, and also
facilitates porting to various system implementations (which may or may
not choose the same memory port widths).
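As a rough illustration of how the subsegments divide the address map, the following C sketch computes which subsegment a program address falls in. The numbering (0 for the lowest-addressed subsegment of each region) and the function names are assumptions made for this example; the actual configuration register layout is not shown here.

```c
#include <assert.h>
#include <stdint.h>

/* kseg0/kseg1: the shared low 512 Mbytes is split into 8 x 64 Mbytes. */
unsigned r3041_kseg01_subsegment(uint32_t addr)
{
    assert(addr >= 0x80000000u && addr < 0xC0000000u);
    return (addr & 0x1FFFFFFFu) >> 26;      /* 0..7 */
}

/* kuseg: 2 Gbytes split into four 512 Mbyte subsegments. */
unsigned r3041_kuseg_subsegment(uint32_t addr)
{
    assert(addr < 0x80000000u);
    return addr >> 29;                      /* 0..3 */
}

/* kseg2: 1 Gbyte split into two 512 Mbyte subsegments. */
unsigned r3041_kseg2_subsegment(uint32_t addr)
{
    assert(addr >= 0xC0000000u);
    return (addr - 0xC0000000u) >> 29;      /* 0..1 */
}
```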


EXCEPTION MANAGEMENT

CHAPTER 4

Integrated Device Technology, Inc.

1

This chapter describes the software techniques used to recognize and
decode exceptions, save state, dispatch exception service routines, and
return from exception. Various code examples are provided.

EXCEPTIONS
In the MIPS architecture interrupts, traps, system calls and everything
else which disrupts the normal flow of execution are called ‘‘exceptions’’
and handled by a single mechanism. These kinds of events include:
• External events : interrupts, or a bus error on a read. Note that for the
R30xx floating point exceptions are reported as interrupts, since
when the R3000A was originally implemented the FPA was indeed
external.
Interrupts are the only exception conditions which can be disabled
under software control.
• Program errors and unusual conditions : non-existent instructions
(including ‘‘co-processor’’ instructions executed while the
corresponding co-processor enable bit in SR is clear), integer
overflow, address alignment errors, accesses outside kuseg in user
mode.
• Memory translation exceptions : using an invalid translation, or a write
to a write-protected page; and access to a page for which there is no
translation in the TLB.
• System calls and traps : exceptions deliberately generated by software
to access kernel facilities in a secure way (syscalls, conditional traps
planted by careful code, and breakpoints).
Some things do not cause exceptions, although other CPU architectures
may handle them that way. Software must use other mechanisms to
detect:
• bus errors on write cycles (R30xx CPUs don’t detect these as
exceptions at all; the use of a write buffer would make such an
exception “imprecise”, in that the instruction which generated the
store data is not guaranteed to be the one which recognizes the
exception).
• parity errors detected in the cache (the PE bit in SR is set, but no
exception is signalled).

Precise exceptions
The MIPS architecture implements precise exceptions. This is quite a
useful feature, as it provides:
• Unambiguous proof of cause : after an exception caused by any
internal error, the EPC points to the instruction which caused the
error (it might point to the preceding branch for an instruction which
is in a branch delay slot, but will signal occurrence of this using the
BD bit).
• Exceptions are seen in instruction sequence : exceptions can arise at
several different stages of execution, creating a potential hazard. For
example, if a load instruction suffers a TLB miss the exception won’t
be signalled until the ‘‘MEM’’ pipestage; if the next instruction suffers
an instruction TLB miss (at the ‘‘IF’’ pipestage) the logically second
exception will be signalled first (since the IF occurs earlier in the pipe
than MEM).


To avoid this problem, early-detected exceptions are not activated
until it is known that all previous instructions will complete
successfully; in this case, the instruction TLB miss is suppressed and
the exception caused by the earlier instruction handled. The second
exception will likely happen again upon return from handling the data
fault.
• Subsequent instructions nullified : because of the pipelining,
instructions lying in sequence after the EPC may well have been
started. But the architecture guarantees that no effects produced by
these instructions will be visible in the registers or CPU state; and no
effect at all will occur which will prevent execution being restarted at
the EPC.
Note that this isn’t quite true of, for example, the result registers in
the integer multiply unit (logically, the architecture considers these
changed by the initiation of a multiply or divide). But provided that
the instruction arrangement rules required by the assembler are
followed, no problems will arise.
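The relationship between EPC and the BD bit can be expressed as a one-line calculation. The sketch below assumes BD is bit 31 of the Cause register; the macro and function names are illustrative.

```c
#include <stdint.h>

#define CAUSE_BD 0x80000000u   /* set when the exception hit a branch delay slot */

/* Address of the instruction that actually caused the exception.
 * When BD is set, EPC points at the preceding branch, so the
 * offending instruction is the one in its delay slot. */
uint32_t faulting_instruction(uint32_t epc, uint32_t cause)
{
    return (cause & CAUSE_BD) ? epc + 4u : epc;
}
```

Execution is still restarted at EPC itself, so that the branch is re-executed along with its delay slot.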
The implementation of precise exceptions requires a number of clever
techniques. For example, the FPA cannot update the register file until it
knows that the operation will not generate an exception. However, the
R30xx family contains logic to allow multi-cycle FPA operations to occur
concurrently with integer operations, yet maintain precise exceptions.

When exceptions happen
Since exceptions are precise, the architecture determines that an
exception seems to have happened just before the execution of the
instruction which caused it. The first fetch from the exception routine will
be made within 1 clock of the time when the faulting instruction would
have finished; in practice it is often faster.
On an interrupt, the last instruction to be completed before interrupt
processing starts will be the one which has just finished its MEM stage
when the interrupt is detected. The EPC target will be the one which has
just finished its ALU stage.
However, take care; some of the interrupt inputs to R30xx family CPUs
are resynchronised internally (to support interrupt signalling from
asynchronous sources) and the interrupt will be detected only on the rising
edge of the second clock after the interrupt becomes active.

Exception vectors
Unlike most CISC processors, the MIPS CPU does no part of the job of
dispatching exceptions to specialist routines to deal with individual
conditions. The rationale for this is twofold:
• on CISC CPUs this feature is not so useful in practice as one might
hope. For example, most interrupts are likely to share code for saving
registers and it is common for CISC microcode to spend time
dispatching to different interrupt entry points, where system software
loads a code number and jumps back to a common handler.
• on a RISC CPU ordinary code is fast enough to be used in preference
to microcode.
Only one exception is handled differently; a TLB miss on an address in
kuseg. Although the architecture uses software to handle this condition
(which occurs very frequently in a heavily-used multi-tasking, virtual
memory OS), there is significant architectural support for a ‘‘preferred’’
scheme for TLB refill. The preferred refill scheme can be completed in
about 13 clocks.
It is also useful to have two alternate pairs of entry points. It is essential
for high performance to locate the vectors in cached memory for OS use,
but this is highly undesirable at start-up; the need for a robust and
self-diagnosing start-up sequence mandates the use of uncached read-only
memory for vectors.


So the exception system adds four more “magic” addresses to the one
used for system start-up. The reset mechanism on the MIPS CPU is
remarkably like the exception mechanism, and is sometimes referred to as
the reset exception. The complete list of exception vector addresses is
shown in Table 4.1, “Reset and exception entry points (vectors) for R30xx
family”:
Program        ‘‘segment’’   Physical       Description
address                      Address

0x8000 0000    kseg0         0x0000 0000    TLB miss on kuseg reference only.
0x8000 0080    kseg0         0x0000 0080    All other exceptions.
0xbfc0 0100    kseg1         0x1fc0 0100    Uncached alternative kuseg TLB miss
                                            entry point (used if SR bit BEV set).
0xbfc0 0180    kseg1         0x1fc0 0180    Uncached alternative for all other
                                            exceptions (used if SR bit BEV set).
0xbfc0 0000    kseg1         0x1fc0 0000    The ‘‘reset exception’’.

Table 4.1. Reset and exception entry points (vectors) for R30xx family

The 128 byte (0x80) gap between the two exception vectors is because
the MIPS architects felt that 32 instructions would be enough to code the
user-space TLB miss routine, saving a branch instruction without wasting
too much memory.
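The choice among the four exception entry points in Table 4.1 depends only on the BEV bit and on whether the exception is a kuseg TLB miss. A minimal C sketch of that selection follows; the function name and argument convention are assumptions for illustration.

```c
#include <stdint.h>

/* Program address of the exception entry point.
 * bev:            non-zero if SR bit BEV is set (uncached ROM vectors)
 * kuseg_tlb_miss: non-zero for a TLB miss on a kuseg reference */
uint32_t exception_vector(int bev, int kuseg_tlb_miss)
{
    uint32_t base = bev ? 0xBFC00100u : 0x80000000u;
    return kuseg_tlb_miss ? base : base + 0x80u;   /* 32 instructions apart */
}
```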
So on an exception, the CPU:
1) sets up EPC to point to the restart location.
2) saves the pre-existing user-mode and interrupt-enable flags in SR
   by pushing the 3-entry stack inside SR, and changes to kernel mode
   with interrupts disabled.
3) sets up Cause so that software can see the reason for the
   exception. On address exceptions BadVaddr is also set. Memory
   management system exceptions set up some of the MMU registers
   too; see the chapter on memory management for more detail.
4) transfers control to the exception entry point.
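The 3-entry stack inside SR mentioned in step 2 can be modelled in C. The sketch assumes the usual R30xx layout in which the low six bits of SR hold the KUo, IEo, KUp, IEp, KUc and IEc flags (bit 5 down to bit 0), with KUc = 0 meaning kernel mode; the function names are illustrative.

```c
#include <stdint.h>

#define SR_KUIE_STACK 0x3Fu   /* KUo IEo KUp IEp KUc IEc in bits 5..0 */

/* Effect of taking an exception: current -> previous, previous -> old,
 * and the new current pair is kernel mode with interrupts disabled. */
uint32_t sr_on_exception(uint32_t sr)
{
    uint32_t stack = ((sr & SR_KUIE_STACK) << 2) & SR_KUIE_STACK;
    return (sr & ~SR_KUIE_STACK) | stack;
}

/* Effect of rfe: previous -> current, old -> previous.
 * The old pair itself is left unchanged. */
uint32_t sr_on_rfe(uint32_t sr)
{
    uint32_t stack = sr & SR_KUIE_STACK;
    stack = (stack & 0x30u) | (stack >> 2);
    return (sr & ~SR_KUIE_STACK) | stack;
}
```

A user-mode program running with interrupts enabled (low bits 0x03) therefore enters the handler with low bits 0x0C, and rfe restores 0x03.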

Exception handling – basics
Any MIPS exception handler has to go through the same stages:
• Bootstrapping : on entry to the exception handler very little of the state
of the interrupted program has been saved, so the first job is to
provide room to preserve relevant state information.
Almost inevitably, this is done by using the k0 and k1 registers (which
are reserved for ‘‘kernel mode’’ use, and therefore should contain no
application program state), to reference a piece of memory which can
be used for other register saves.
• Dispatching different exceptions : consult the Cause register. The
initial decision is likely to be made on the ‘‘ExcCode’’ field, which is
thoughtfully aligned so that its code value (between 0 and 31) can be
used to index an array of words without a shift. The code will be
something like this:
        mfc0    t1, C0_CAUSE
        and     t2, t1, 0x3f
        lw      t2, tablebase(t2)
        jr      t2

• Constructing the exception processing environment : complex exception
handling routines may be written in a high level language; in addition,
software may wish to be able to use standard library routines. To do
this, software will have to switch to a suitable stack, and save the
values of all registers which “called subroutines” may use.
• Processing the exception : this is system and cause dependent.
• Returning from an exception : The return address is contained in the
EPC register on exception entry; the value must be placed into a
general purpose register for return from exception (note that the EPC
value may have been placed on the stack at exception entry).
Returning control is now done with a jr instruction, and the change
of state back from kernel to the previous mode is done by an rfe
instruction after the jr, in the delay slot.
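Once a C environment has been constructed, the dispatch stage can be written as a table of function pointers indexed by ExcCode. The following is a sketch under stated assumptions: it mirrors the 0x3f mask used in the assembler fragment above (so the table has 16 slots), uses the standard ExcCode values 0 for interrupt and 8 for syscall, and all handler names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define EXC_INT 0   /* external interrupt */
#define EXC_SYS 8   /* syscall instruction */

typedef int (*exc_handler_t)(uint32_t cause);

/* Hypothetical handlers; each returns its own code so the demo is checkable. */
static int handle_interrupt(uint32_t cause) { (void)cause; return EXC_INT; }
static int handle_syscall(uint32_t cause)   { (void)cause; return EXC_SYS; }

static const exc_handler_t handler_table[16] = {
    [EXC_INT] = handle_interrupt,
    [EXC_SYS] = handle_syscall,
};

/* ExcCode sits two bits up in Cause, so (cause & 0x3f) is already a
 * byte offset into a table of words; in C we divide it back down. */
unsigned exc_code(uint32_t cause)
{
    return (cause & 0x3Fu) >> 2;
}

/* Call the handler for this cause; -1 means no handler is installed. */
int dispatch_exception(uint32_t cause)
{
    exc_handler_t h = handler_table[exc_code(cause)];
    return h != NULL ? h(cause) : -1;
}
```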

Nesting exceptions
In many cases the system may wish to permit (or will not be able to
avoid) further exceptions occurring within the exception processing
routine – nested exceptions.
If improperly handled, this could cause chaos; vital state for the
interrupted program is held in EPC and SR, and another exception would
overwrite them. To permit nested exceptions, these values must be saved
elsewhere. Moreover, once exceptions are re-enabled, software can no
longer rely on the values of k0 and k1, since a subsequent (nested)
exception may alter their values.
The normal approach to this is to define an exception frame; a
memory-resident data structure with fields to store incoming register values, so
that they can be retrieved on return. Exception frames are usually
arranged logically as a stack.
Stack resources are consumed by each exception, so arbitrarily nested
exceptions cannot be tolerated. Most systems sort exceptions into a
priority order, and arrange that while an exception is being processed only
higher-priority exceptions are permitted. Such systems need have only as
many exception frames as there are priority levels.
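A fixed pool of exception frames, one per priority level, might be sketched in C as below. The frame layout, the number of levels and the function names are all assumptions made for illustration; a real system would fill in the frame from assembler code before any C code runs.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_PRIORITY_LEVELS 4        /* hypothetical number of levels */

struct exc_frame {
    uint32_t epc;                    /* restart address */
    uint32_t sr;                     /* status register at entry */
    uint32_t regs[32];               /* general purpose registers */
};

static struct exc_frame frames[NUM_PRIORITY_LEVELS];
static int depth;                    /* number of frames currently in use */

/* Claim the next frame; only higher-priority exceptions may nest, so the
 * depth can never exceed the number of priority levels. */
struct exc_frame *frame_push(uint32_t epc, uint32_t sr)
{
    assert(depth < NUM_PRIORITY_LEVELS);
    struct exc_frame *f = &frames[depth++];
    f->epc = epc;
    f->sr = sr;
    return f;
}

/* Release the most recent frame and return it so state can be restored. */
struct exc_frame *frame_pop(void)
{
    assert(depth > 0);
    return &frames[--depth];
}

int frame_depth(void)
{
    return depth;
}
```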
Software can inhibit certain exceptions, as follows:
• Interrupts : can be individually masked by software to conform to
system priority rules;
• Privilege Violations : can’t happen in kernel mode; virtually all
exception service routines will execute in kernel mode;
• Addressing errors and TLB misses : software must be written to
ensure that these never happen when processing higher priority
exceptions.
Typical system priorities are (lowest first): non-exception code, TLB miss
on kuseg address, TLB miss on kseg2 address, interrupt (lowest)...
interrupt (highest), illegal instructions and traps, bus errors.

An exception routine
The following is an exception routine from IDT/sim.
It receives exceptions, saves all state, and calls the appropriate service
routine. It also shows the code used to install the exception handler in
memory.
/*
**  exception.s - contains functions for setting up and
**                handling exceptions
**
**  Copyright 1989 Integrated Device Technology, Inc.
**  All Rights Reserved
**
*/

#include "iregdef.h"
#include "idtcpu.h"
#include "idtmon.h"
#include "setjmp.h"
#include "excepthdr.h"

/*
**  move_exc_code() - moves the exception code to the utlb and gen
**                    exception vectors
*/
FRAME(move_exc_code,sp,0,ra)
        .set    noreorder
        la      t1,exc_utlb_code
        la      t2,exc_norm_code
        li      t3,UT_VEC
        li      t4,E_VEC
        li      t5,VEC_CODE_LENGTH
1:
        lw      t6,0(t1)
        lw      t7,0(t2)
        sw      t6,0(t3)
        sw      t7,0(t4)
        addiu   t1,4
        addiu   t3,4
        addiu   t4,4
        subu    t5,4
        bne     t5,zero,1b
        addiu   t2,4
        move    t5,ra                   # assumes clear_cache doesnt use t5
        li      a0,UT_VEC
        jal     clear_cache
        li      a1,VEC_CODE_LENGTH
        nop
        li      a0,E_VEC
        jal     clear_cache
        li      a1,VEC_CODE_LENGTH
        move    ra,t5                   # restore ra
        j       ra
        nop
        .set    reorder
ENDFRAME(move_exc_code)
/*
**  enable_int(mask) - enables interrupts - mask is positioned so it only
**                     needs to be or'ed into the status reg. This
**                     also does some other things !!!! caution should
**                     be used if invoking this while in the middle
**                     of a debugging session where the client may have
**                     nested interrupts.
**
*/
FRAME(enable_int,sp,0,ra)
        .set    noreorder
        la      t0,client_regs
        lw      t1,R_SR*4(t0)
        nop
        or      t1,0x4
        or      t1,a0
        sw      t1,R_SR*4(t0)
        mfc0    t0,C0_SR
        or      a0,1
        or      t0,a0
        mtc0    t0,C0_SR
        j       ra
        nop
        .set    reorder
ENDFRAME(enable_int)
/*
**  disable_int(mask) - disable the interrupt - mask is the complement
**                      of the bits to be cleared - i.e. to clear ext int
**                      5 the mask would be - 0xffff7fff
*/
FRAME(disable_int,sp,0,ra)
        .set    noreorder
        la      t0,client_regs
        lw      t1,R_SR*4(t0)
        nop
        and     t1,a0
        sw      t1,R_SR*4(t0)
        mfc0    t0,C0_SR
        nop
        and     t0,a0
        mtc0    t0,C0_SR
        j       ra
        nop
        .set    reorder
ENDFRAME(disable_int)
/*
** the following sections of code are copied to the vector area
**   at location 0x80000000 (utlb miss) and location 0x80000080
**   (general exception).
**
*/
        .set    noreorder
        .set    noat                    # must be set so la does not use at

FRAME(exc_norm_code,sp,0,ra)
        la      k0,except_regs
        sw      AT,R_AT*4(k0)
        sw      gp,R_GP*4(k0)
        sw      v0,R_V0*4(k0)
        li      v0,NORM_EXCEPT
        la      AT,exception
        j       AT
        nop
ENDFRAME(exc_norm_code)
FRAME(exc_utlb_code,sp,0,ra)
        la      k0,except_regs
        sw      AT,R_AT*4(k0)
        sw      gp,R_GP*4(k0)
        sw      v0,R_V0*4(k0)
        li      v0,UTLB_EXCEPT
        la      AT,exception
        j       AT
        nop
        .set    reorder

/*
** common exception handling code
** Save various registers so we can print informative messages
** for faults (whether in monitor or client mode)
**   Reg.(k0) points to the exception register save area.
**   If we are in client mode then some of these values will
**   have to be copied to the client register save area.
*/
        .set    noreorder
exception:
        sw      v0,R_EXCTYPE*4(k0)      # save exception type (gen or utlb)
        sw      v1,R_V1*4(k0)
        mfc0    v0,C0_EPC
        mfc0    v1,C0_SR
        sw      v0,R_EPC*4(k0)          # save the pc at the time of the exception
        sw      v1,R_SR*4(k0)
        .set    noat
        la      AT,client_regs          # get address of client reg save area
        mfc0    v0,C0_BADVADDR
        mfc0    v1,C0_CAUSE
        sw      v0,R_BADVADDR*4(k0)
        sw      v0,R_BADVADDR*4(AT)
        sw      v1,R_CAUSE*4(k0)
        sw      v1,R_CAUSE*4(AT)
        sw      sp,R_SP*4(k0)
        sw      sp,R_SP*4(AT)
        lw      v0,user_int_fast        # see if a client wants a shot at it
        sw      a0,R_A0*4(k0)
        sw      a0,R_A0*4(AT)
        sw      ra,R_RA*4(k0)
        sw      ra,R_RA*4(AT)
        lw      sp,fault_stack          # use "fault" stack
        beq     v0,zero,1f              # skip the following if no client
        nop
        move    a0,AT
        jal     v0
        nop
        la      k0,except_regs
        la      AT,client_regs
        beq     v0,zero,1f              # returns false if user did not handle
        nop
        la      v1,except_regs
        lw      ra,R_RA*4(v1)
        lw      AT,R_AT*4(v1)
        lw      gp,R_GP*4(v1)
        lw      v0,R_V0*4(v1)
        lw      sp,R_SP*4(v1)
        lw      a0,R_A0*4(v1)
        lw      k0,R_EPC*4(v1)
        lw      v1,R_V1*4(v1)
        j       k0
        rfe
/*
** Save registers if in client mode
** then change mode to prom mode currently k0 is pointing
** exception reg. save area - v0, v1, AT, gp, sp regs were saved
** epc, sr, badvaddr and cause were also saved.
*/
1:
        lw      v0,R_MODE*4(AT)         # get the current op. mode
        lw      v1,R_EXCTYPE*4(k0)
        sw      v0,R_MODE*4(k0)         # save the current prom mode
        sw      v1,R_EXCTYPE*4(AT)
        li      v1,MODE_MONITOR         # see if it
        beq     v0,v1,nosave            # was in prom mode
        nop
        li      v0,MODE_MONITOR
        sw      v0,R_MODE*4(AT)         # now in prom mode
        lw      v0,R_GP*4(k0)
        lw      v1,R_EPC*4(k0)
        sw      v0,R_GP*4(AT)
        sw      v1,R_EPC*4(AT)
        lw      v0,R_SR*4(k0)
        lw      v1,R_AT*4(k0)

        sw      v0,R_SR*4(AT)
        sw      v1,R_AT*4(AT)
        lw      v0,R_V0*4(k0)
        lw      v1,R_V1*4(k0)
        sw      v0,R_V0*4(AT)
        sw      v1,R_V1*4(AT)
        sw      a1,R_A1*4(AT)
        sw      a2,R_A2*4(AT)
        sw      a3,R_A3*4(AT)
        sw      t0,R_T0*4(AT)
        sw      t1,R_T1*4(AT)
        sw      t2,R_T2*4(AT)
        sw      t3,R_T3*4(AT)
        sw      t4,R_T4*4(AT)
        sw      t5,R_T5*4(AT)
        sw      t6,R_T6*4(AT)
        sw      t7,R_T7*4(AT)
        sw      s0,R_S0*4(AT)
        sw      s1,R_S1*4(AT)
        sw      s2,R_S2*4(AT)
        sw      s3,R_S3*4(AT)
        sw      s4,R_S4*4(AT)
        sw      s5,R_S5*4(AT)
        sw      s6,R_S6*4(AT)
        sw      s7,R_S7*4(AT)
        sw      t8,R_T8*4(AT)
        li      v0,0xbababadd           # This reg (k0) is invalid
        sw      t9,R_T9*4(AT)
        sw      v0,R_K0*4(AT)           # should be obvious
        sw      k1,R_K1*4(AT)
        sw      fp,R_FP*4(AT)
        lw      v0,status_base
        move    v1,AT
        and     v0,SR_CU1
        beq     v0,zero,1f              # only save fpu regs if present
        move    AT,v1
        lw      v1,R_SR*4(AT)
        and     v0,v1
        mtc0    v0,C0_SR
        nop
        cfc1    v0,$30
        cfc1    v1,$31
        sw      v0,R_FEIR*4(AT)
        sw      v1,R_FCSR*4(AT)
        swc1    fp0,R_F0*4(AT)
        swc1    fp1,R_F1*4(AT)
        swc1    fp2,R_F2*4(AT)
        swc1    fp3,R_F3*4(AT)
        swc1    fp4,R_F4*4(AT)
        swc1    fp5,R_F5*4(AT)
        swc1    fp6,R_F6*4(AT)
        swc1    fp7,R_F7*4(AT)
        swc1    fp8,R_F8*4(AT)
        swc1    fp9,R_F9*4(AT)
        swc1    fp10,R_F10*4(AT)
        swc1    fp11,R_F11*4(AT)
        swc1    fp12,R_F12*4(AT)
        swc1    fp13,R_F13*4(AT)
        swc1    fp14,R_F14*4(AT)
        swc1    fp15,R_F15*4(AT)
        swc1    fp16,R_F16*4(AT)
        swc1    fp17,R_F17*4(AT)
        swc1    fp18,R_F18*4(AT)
        swc1    fp19,R_F19*4(AT)
        swc1    fp20,R_F20*4(AT)
        swc1    fp21,R_F21*4(AT)
        swc1    fp22,R_F22*4(AT)

        swc1    fp23,R_F23*4(AT)
        swc1    fp24,R_F24*4(AT)
        swc1    fp25,R_F25*4(AT)
        swc1    fp26,R_F26*4(AT)
        swc1    fp27,R_F27*4(AT)
        swc1    fp28,R_F28*4(AT)
        swc1    fp29,R_F29*4(AT)
        swc1    fp30,R_F30*4(AT)
        swc1    fp31,R_F31*4(AT)
1:
        mflo    v0
        mfhi    v1
        sw      v0,R_MDLO*4(AT)
        sw      v1,R_MDHI*4(AT)
        mfc0    v0,C0_INX
        mfc0    v1,C0_RAND
        sw      v0,R_INX*4(AT)
        sw      v1,R_RAND*4(AT)
        mfc0    v0,C0_TLBLO
        mfc0    v1,C0_TLBHI
        sw      v0,R_TLBLO*4(AT)
        mfc0    v0,C0_CTXT
        sw      v1,R_TLBHI*4(AT)
        sw      v0,R_CTXT*4(AT)
        .set    at
nosave:
        .set    reorder
        j       exception_handler
ENDFRAME(exc_utlb_code)
/*
** resume -- resume execution of client code
*/
FRAME(resume,sp,0,ra)
        jal     install_sticky
        jal     clr_extern_brk
        jal     clear_remote_int
        .set    noat
        .set    noreorder
        la      AT,client_regs
        lw      v0,status_base
        move    v1,AT
        and     v0,SR_CU1
        beq     v0,zero,1f              # only save fpu regs if present
        move    AT,v1
        lw      v1,R_SR*4(AT)
        nop
        or      v0,v1
        mtc0    v0,C0_SR
        lw      v1,R_FCSR*4(AT)
        lwc1    fp0,R_F0*4(AT)
        ctc1    v1,$31
        lwc1    fp1,R_F1*4(AT)
        lwc1    fp2,R_F2*4(AT)
        lwc1    fp3,R_F3*4(AT)
        lwc1    fp4,R_F4*4(AT)
        lwc1    fp5,R_F5*4(AT)
        lwc1    fp6,R_F6*4(AT)
        lwc1    fp7,R_F7*4(AT)
        lwc1    fp8,R_F8*4(AT)
        lwc1    fp9,R_F9*4(AT)
        lwc1    fp10,R_F10*4(AT)
        lwc1    fp11,R_F11*4(AT)
        lwc1    fp12,R_F12*4(AT)
        lwc1    fp13,R_F13*4(AT)
        lwc1    fp14,R_F14*4(AT)
        lwc1    fp15,R_F15*4(AT)
        lwc1    fp16,R_F16*4(AT)

4–9

CHAPTER 4

EXCEPTION MANAGEMENT
lwc1
lwc1
lwc1
lwc1
lwc1
lwc1
lwc1
lwc1
lwc1
lwc1
lwc1
lwc1
lwc1
lwc1
lwc1

fp17,R_F17*4(AT)
fp18,R_F18*4(AT)
fp19,R_F19*4(AT)
fp20,R_F20*4(AT)
fp21,R_F21*4(AT)
fp22,R_F22*4(AT)
fp23,R_F23*4(AT)
fp24,R_F24*4(AT)
fp25,R_F25*4(AT)
fp26,R_F26*4(AT)
fp27,R_F27*4(AT)
fp28,R_F28*4(AT)
fp29,R_F29*4(AT)
fp30,R_F30*4(AT)
fp31,R_F31*4(AT)

1:
lw
lw
lw
lw
lw
lw
lw
lw
lw
lw
lw
lw
lw
lw
lw
lw
lw
lw
lw
lw
lw
lw
lw
lw
lw
lw
lw
lw
mtlo
mthi
lw
lw
mtc0
mtc0
lw
lw
mtc0
mtc0
lw
lw
mtc0
move
and
intr */
mtc0
li
move
sw
lw
lw
lw
lw

a0,R_A0*4(AT)
a1,R_A1*4(AT)
a2,R_A2*4(AT)
a3,R_A3*4(AT)
t0,R_T0*4(AT)
t1,R_T1*4(AT)
t2,R_T2*4(AT)
t3,R_T3*4(AT)
t4,R_T4*4(AT)
t5,R_T5*4(AT)
t6,R_T6*4(AT)
t7,R_T7*4(AT)
s0,R_S0*4(AT)
s1,R_S1*4(AT)
s2,R_S2*4(AT)
s3,R_S3*4(AT)
s4,R_S4*4(AT)
s5,R_S5*4(AT)
s6,R_S6*4(AT)
s7,R_S7*4(AT)
t8,R_T8*4(AT)
t9,R_T9*4(AT)
k1,R_K1*4(AT)
gp,R_GP*4(AT)
fp,R_FP*4(AT)
ra,R_RA*4(AT)
v0,R_MDLO*4(AT)
v1,R_MDHI*4(AT)
v0
v1
v0,R_INX*4(AT)
v1,R_TLBLO*4(AT)
v0,C0_INX
v1,C0_TLBLO
v0,R_TLBHI*4(AT)
v1,R_CTXT*4(AT)
v0,C0_TLBHI
v1,C0_CTXT
v0,R_CAUSE*4(AT)
v1,R_SR*4(AT)
v0,C0_CAUSE
/* only sw0 and 1 writable */
v0,AT
v1,~(SR_KUC|SR_IEC|SR_PE)/* make sure we aren't
v1,C0_SR
k0,MODE_USER
AT,v0
k0,R_MODE*4(AT)
v1,R_V1*4(AT)
sp,R_SP*4(AT)
k0,R_EPC*4(AT)
v0,R_V0*4(AT)

/* reset

mode */

4–10

EXCEPTION MANAGEMENT

CHAPTER 4
lw
AT,R_AT*4(AT)
j
k0
rfe
.set
reorder
.set
at
ENDFRAME(resume)
/*
** do_call(procedure, arg1, arg2, arg3, arg4, arg5, arg6, arg7,
**         arg8)
** interface for call command to client code
** copies arguments to new frame and sets up gp for client
*/
#define CALLFRM ((8*4)+4+4)
FRAME(do_call, sp,CALLFRM,ra)
        subu    sp,CALLFRM
        sw      ra,CALLFRM-4(sp)
        sw      gp,CALLFRM-8(sp)
        move    v0,a0
        move    a0,a1
        move    a1,a2
        move    a2,a3
        lw      a3,CALLFRM+(4*4)(sp)
        lw      v1,CALLFRM+(5*4)(sp)
        sw      v1,4*4(sp)
        lw      v1,CALLFRM+(6*4)(sp)
        sw      v1,5*4(sp)
        lw      v1,CALLFRM+(7*4)(sp)
        sw      v1,6*4(sp)
        lw      v1,CALLFRM+(8*4)(sp)
        sw      v1,7*4(sp)
        la      t1,client_regs
        lw      gp,R_GP*4(t1)
        jal     v0
        lw      gp,CALLFRM-8(sp)
        lw      ra,CALLFRM-4(sp)
        addu    sp,CALLFRM
        j       ra
ENDFRAME(do_call)
/*
** clear_stat() -- clear status register
** returns current sr
*/
FRAME(clear_stat,sp,0,ra)
        .set    noreorder
        lw      v1,status_base
        mfc0    v0,C0_SR
        mtc0    v1,C0_SR
        j       ra
        nop
ENDFRAME(clear_stat)
        .set    reorder

/*
** setjmp(jmp_buf) -- save current context for non-local goto's
** return 0
*/
FRAME(setjmp,sp,0,ra)
        sw      ra,JB_PC*4(a0)
        sw      sp,JB_SP*4(a0)
        sw      fp,JB_FP*4(a0)
        sw      s0,JB_S0*4(a0)
        sw      s1,JB_S1*4(a0)
        sw      s2,JB_S2*4(a0)
        sw      s3,JB_S3*4(a0)
        sw      s4,JB_S4*4(a0)
        sw      s5,JB_S5*4(a0)
        sw      s6,JB_S6*4(a0)
        sw      s7,JB_S7*4(a0)
        move    v0,zero
        j       ra
ENDFRAME(setjmp)

/*
** longjmp(jmp_buf, rval)
*/
FRAME(longjmp,sp,0,ra)
        lw      ra,JB_PC*4(a0)
        lw      sp,JB_SP*4(a0)
        lw      fp,JB_FP*4(a0)
        lw      s0,JB_S0*4(a0)
        lw      s1,JB_S1*4(a0)
        lw      s2,JB_S2*4(a0)
        lw      s3,JB_S3*4(a0)
        lw      s4,JB_S4*4(a0)
        lw      s5,JB_S5*4(a0)
        lw      s6,JB_S6*4(a0)
        lw      s7,JB_S7*4(a0)
        move    v0,a1
        j       ra
ENDFRAME(longjmp)
/*
** wbflush() flush the write buffer - this is specific for each
** hardware configuration.
*/
FRAME(wbflush,sp,0,ra)
        .set    noreorder
        lw      t0,wbflush              # read an uncached memory location
        j       ra
        nop
        .set    reorder
ENDFRAME(wbflush)

INTERRUPTS
The MIPS CPUs are provided with 6 individual hardware interrupt bits,
activated by CPU input pins (in the case of the R3081, one pin is used
internally by the FPA), and 2 additional software-settable interrupt bits. An
active level on any pin is sensed in each cycle, and will cause an exception
if enabled.
The interrupt enable comes in two parts:
• The global interrupt enable bit (IEc) in the status register – when zero
no interrupt exception will occur. Simple, fast and comprehensive,
this is what prevents interrupts occurring during the early and
vulnerable stages of processing exceptions. Also, the global interrupt
enable is usually switched back on by an rfe instruction at the end of
an exception routine; this means that the interrupt cannot take effect
until the CPU has returned from the exception and finished with the
EPC register, avoiding undesirable recursion in the interrupt routine.
• The individual interrupt mask bits IM in the status register, one for
each interrupt. Set the bit to 1 to enable the corresponding interrupt.
These are manipulated by software to allow whichever interrupts are
appropriate to the system.

Changes to the individual bits are usually made “under cover”, with
the global interrupt enable off.
What are the software interrupt bits for?
One commonly asked question is: “Why does the CPU provide two bits in
the Cause register which, when set, immediately cause an interrupt
unless masked?”
The clue is in ‘‘unless masked’’. Typically this is used as a mechanism for
high-priority interrupt routines to flag actions which will be performed by
lower-priority interrupt routines, once the system has dealt with all high
priority business. As the high-priority processing completes, the software
will open up the interrupt mask, and the pending software interrupt will
occur.
There is no definitive reason why the same effect should not be simulated
by system software (using flags in memory, for example) but the soft
interrupt bits are convenient because they fit in with the already
provided interrupt handling mechanism.
Pin      SR/Cause bit no   Notes
-------  ----------------  ---------------------------------------------
         8                 software interrupt
         9                 software interrupt
Int0*    10                Cause bit reads 1 when pin low (active)
Int1*    11
Int2*    12
Int3*    13
Int4*    14
Int5*    15                Usual choice for FPA. The pin corresponding to
                           the interrupt selected for FPA interrupts on an
                           R3081 is effectively a no-connect.

Table 4.2. Interrupt bitfields and interrupt pins

Interrupt processing proper begins after an exception is received and the
Type field in Cause signals that it was caused by an interrupt. Table 4.2,
“Interrupt bitfields and interrupt pins” describes the relationship between
Cause bits and input pins.
Once the interrupt exception is “recognized” by the CPU, the stages are:
• Consult the Cause register IP field, logically-‘‘and’’ it with the current
interrupt masks in the SR IM field to obtain a bit-map of active,
enabled interrupt requests. There may be more than one, and any of
them would have caused the interrupt.
• Select one active, enabled interrupt for attention. The selection can be
done simply by using fixed priorities; however, software is free to
implement whatever priority mechanism is appropriate for the
system.
• Software needs to save the old interrupt mask bits of the SR register,
but it is quite likely that the whole SR register was saved in the main
exception routine.
• Change IM in SR to ensure that the current interrupt and all
interrupts of equal or lesser priority are inhibited.
• If not already performed by the main exception routine, save the state
required for nested exception processing.
• Set the global interrupt enable bit IEc in SR to allow higher-priority
interrupts to be processed.
• Call the particular interrupt service routine for the selected, current
interrupt.
• On return, disable interrupts again by clearing IEc in SR, before
returning to the normal exception stream.
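
The first few stages above are plain bit manipulation, and can be sketched in C. This is only an illustration, not code from IDT/sim: the function names are hypothetical, the register values are passed in as ordinary integers rather than read with mfc0, and the fixed-priority rule (highest-numbered bit wins) is an assumption, since the text leaves the priority scheme to software.

```c
#include <stdint.h>

#define SR_IEC   0x00000001u   /* global interrupt enable, current (SR bit 0) */
#define INT_MASK 0x0000ff00u   /* SR IM field and Cause IP field: bits 8..15 */

/* Bit-map of interrupts that are both pending (Cause IP) and enabled (SR IM). */
static uint32_t pending_enabled(uint32_t cause, uint32_t sr)
{
    return cause & sr & INT_MASK;
}

/* Assumed fixed priority: the highest-numbered bit wins.
 * Returns the bit number (8..15), or -1 if nothing is pending. */
static int select_interrupt(uint32_t pending)
{
    for (int bit = 15; bit >= 8; bit--)
        if (pending & (1u << bit))
            return bit;
    return -1;
}

/* SR value to use while servicing `bit`: the selected interrupt and all
 * lower-numbered ones are masked, higher ones keep their old enable
 * state, and IEc is set so that they can nest. */
static uint32_t sr_for_service(uint32_t sr, int bit)
{
    uint32_t higher = INT_MASK & ~((1u << (bit + 1)) - 1u);
    return (sr & ~INT_MASK) | (sr & higher) | SR_IEC;
}
```

The caller is expected to have saved the old SR first, and to write it back (with IEc clear) once the service routine returns.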

Conventions and Examples
The following is as simple as an exception routine can be. It does nothing
except increment a counter on each exception:
        .set    noreorder
        .set    noat
xcptgen:
        la      k0,xcptcount            # get address of counter
        lw      k1,0(k0)                # load counter
        nop                             # (load delay)
        addu    k1,1                    # increment counter
        sw      k1,0(k0)                # store counter
        mfc0    k0,C0_EPC               # get EPC
        nop                             # (load delay, mfc0 slow)
        j       k0                      # return to program
        rfe                             # branch delay slot
        .set    at
        .set    reorder

Note that this routine cannot survive a nested exception (the original
return address in EPC would be lost, for example). It never re-enables
interrupts, so an interrupt cannot cause one; but the counter xcptcount
must still be at an address which can’t possibly suffer a TLB miss.

CHAPTER 5

CACHES AND CACHE MANAGEMENT
R30xx family CPUs implement separate on-chip caches for instructions
(I-cache) and data (D-cache). Following RISC principles, hardware
functions are provided only for normal operation of the caches; software
routines must be provided to initialize the cache following system start-up,
and to invalidate cache data when required†.
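Figure 5.1. Direct mapped cache
[Figure: the low bits of the memory address form an index into both the
tag store and the cache data store; the stored tag is compared with the
higher address bits to decide hit or miss, and the indexed data is
returned.]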

The cache’s job is to hold a copy of memory data which has been recently
read or written, so it can be returned quickly to the CPU; in the R30xx
architecture data accesses in the cache take just one clock, and an I-cache
and a D-cache operation can occur together.
When a cacheable location is read (a data load):
• It will be returned from the D-cache if the cache contains the
corresponding physical address and the cache line is valid there
(called a cache ‘‘hit’’). In this case nothing happens at the CPU’s
memory interface, so the read is invisible to the outside world.
• If the data is not found in the D-cache (called a cache “miss”), the data
will be read from external memory. According to the CPU type and
how it is set up, it may read one or more words from memory. The
data is loaded into the cache, and normal operation then resumes.
In normal operation, cache miss processing will overwrite whatever
valid data was already present in the targeted cache line.
In the R30xx caches, cache data is never more up-to-date than
memory (because the cache is write-through, described below), so the
previously cached data can be discarded without any trouble.

† Note that the R3071 and R3081 do implement a DMA protocol
that allows automatic, hardware-based data cache invalidation.

When data is loaded from an uncacheable location, it is always obtained
from external memory (or a memory-mapped IO location). Most systems
never access the same data locations as cached and uncached; however,
the results of such a system would be predictable. On an uncacheable load
cache data is neither used nor updated.
When software writes a cached location:
• If the CPU is doing a 32-bit store, the cache is always updated
(possibly discarding data from a previously cached location).
• For byte or half-word stores, the cache will only be updated if the
reference hits in the cache; then data will be extracted from the cache,
merged with the store data, and written back†.
• If the partial-word store misses in the cache, then the cache is left
alone.
• In all cases, the write is also made to main memory.
When the store target is an uncached location the cache is not consulted
or modified.
Figure 5.1, “Direct mapped cache” is a diagrammatic representation of
the way the MIPS cache works. Both caches are:
• Physically indexed, physically tagged: the CPU’s program address
(virtual address) is translated to a physical address, just as is used to
address real memory, before being used for the cache lookup. The
TAG comparison (checking for a hit) is also based on physical
addresses.
On certain other CPU families the cache index is based on program
addresses (which are available a bit earlier); some CPUs even use
virtual TAGs, which then require that the cache be flushed at context
switch. But physical caches are easier to manage.
• Direct mapped : Each physical address has only one location in each
cache where it may reside. At each cache index there is only one data
item stored – this will be just one word in the D-cache but is usually
a 4-word line for the I-cache (see Figure 5.1, “Direct mapped cache”).
Next to the data is kept the tag, which stores the memory address for
which this data is a copy.
If the tag matches the high-order (higher number) address bits then
the cache line contains the data the CPU is looking for; the data is
returned and execution continues.
For an I-cache access, the CPU must select one of the four words
based on the lowest address bits.
This is a direct mapped cache because there is only one tag/data pair
at each cache index. More complex caches may have more than one
tag field, and compare them simultaneously with the physical
address.
A direct-mapped cache is very simple, but can suffer from cache
thrashing; so the CPU can run slowly if a program loop is regularly
accessing a pair of locations whose low-order addresses happen to be
equal. To avoid this situation, the R30xx family implements relatively
large caches, which minimize the probability of reasonable program
loops causing cache thrashing.
• Cache lines : the line size is the number of data elements stored with
each tag. For R30xx family CPUs the I-cache implements a 4-word
line size; the D-cache always has 1-word lines.

† In the R30xx family, the data will be merged in the D-Cache.
However, the CPU bus will perform the store only to the bytes
which were actually changed (i.e. the store datum size), facilitating
debugging.

When a cache miss occurs the whole line must be filled from memory.
But it is quite possible to fetch more than a line’s worth of data; and
R30xx family CPUs can be configured to fetch 4 words of data on a D-cache
miss, refilling 4 1-word ‘‘lines’’.
• Write through : the D-cache is write-through, meaning that all store
operations result in a store to main memory. This means that all data
in the cache is duplicated in main memory, and can therefore be
discarded at any time. In particular, when data is being read following
a cache miss it can always be stored in the cache without regard for
the data which was previously stored at the same index.
• Partial word write implementations : when the CPU writes only part of
a word, it is essential that any valid cache data should still end up as
a duplicate of main memory. One simple approach is to invalidate the
cache line and to write only to main memory (the main memory must
be byte-addressable). But the R30xx family uses a more efficient
strategy:
a)
if the location being written is present in the cache (cache hit) the
cache data is read into the CPU, the partial-word data merged
with it, the whole word written back to the cache, and the
partial-word written to memory.
b)
where the write misses in the cache the partial-word write is
performed to memory only, and the cache left alone.
Note that this takes an extra clock, so a partial-word write which hits
in the cache is slower than a whole-word write.
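
The lookup and store rules described above can be collected into a small software model. The C below is a sketch for illustration only: the type and function names are hypothetical, the cache size is an arbitrary 1 Kbyte, `mem` is an ordinary array standing in for main memory, and the byte-lane arithmetic assumes a little-endian configuration.

```c
#include <stdint.h>
#include <stdbool.h>
#include <string.h>

#define CACHE_WORDS 256u   /* hypothetical 1 Kbyte D-cache with 1-word lines */

typedef struct {
    bool     valid[CACHE_WORDS];
    uint32_t tag[CACHE_WORDS];
    uint32_t data[CACHE_WORDS];
} dcache_t;

static void cache_init(dcache_t *c) { memset(c, 0, sizeof *c); }

/* Low address bits select the line; the remaining high bits form the tag. */
static uint32_t cache_index(uint32_t paddr) { return (paddr >> 2) % CACHE_WORDS; }
static uint32_t cache_tag(uint32_t paddr)   { return paddr / (CACHE_WORDS * 4u); }

static bool cache_hit(const dcache_t *c, uint32_t paddr)
{
    uint32_t i = cache_index(paddr);
    return c->valid[i] && c->tag[i] == cache_tag(paddr);
}

/* Load: on a miss the line is refilled from memory, discarding whatever
 * was there (safe, because the cache is write-through). */
static uint32_t load_word(dcache_t *c, const uint32_t *mem, uint32_t paddr)
{
    uint32_t i = cache_index(paddr);
    if (!cache_hit(c, paddr)) {
        c->valid[i] = true;
        c->tag[i]   = cache_tag(paddr);
        c->data[i]  = mem[paddr >> 2];
    }
    return c->data[i];
}

/* Word store: the cache line is always updated, and memory is written. */
static void store_word(dcache_t *c, uint32_t *mem, uint32_t paddr, uint32_t v)
{
    uint32_t i = cache_index(paddr);
    c->valid[i] = true;
    c->tag[i]   = cache_tag(paddr);
    c->data[i]  = v;
    mem[paddr >> 2] = v;
}

/* Byte store: merged into the cache only on a hit; memory always written. */
static void store_byte(dcache_t *c, uint32_t *mem, uint32_t paddr, uint8_t b)
{
    uint32_t shift = (paddr & 3u) * 8u;      /* little-endian lane (assumed) */
    uint32_t mask  = 0xffu << shift;
    uint32_t i     = cache_index(paddr);
    if (cache_hit(c, paddr))
        c->data[i] = (c->data[i] & ~mask) | ((uint32_t)b << shift);
    mem[paddr >> 2] = (mem[paddr >> 2] & ~mask) | ((uint32_t)b << shift);
}
```

Two addresses that differ by a multiple of the cache size share an index, so alternating between them in this model evicts the line every time; that is the thrashing case mentioned above.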

Cache isolation and swapping
No special instructions are provided to explicitly access the caches;
everything has to be done with load and store instructions.
To distinguish operations for cache management from regular memory
references, without having to dedicate a special address region for this
purpose, the R30xx architecture provides bits in the SR to support cache
management:
• The SR mode bit “IsC” will isolate the D-cache; in this mode loads and
stores affect only the cache, and loads also ‘‘hit’’ regardless of whether
the tag matches. As a special mechanism, with the D-cache isolated
a partial-word write will invalidate the appropriate cache line.
Caution: when the D-cache is isolated, not even loads/stores marked
by their address or TLB entry as ‘‘uncached’’ will operate normally.
One consequence of this is that the cache management routines must
not make any data accesses; they are typically written in assembler,
using only register variables.
• The CPU provides a mode where the caches are swapped (SR SwC
bit), to allow the I-Cache to be targeted by store instructions; then the
D-cache acts as an I-cache, and the I-cache acts as the D-cache. Once
the caches are swapped and isolated I-cache entries may be read,
written and invalidated (invalidation uses the same partial word write
mechanism described above).
Note that cache isolation does not stop instruction fetches from
referencing main memory.
The D-cache behaves ‘‘perfectly’’ as an I-cache (provided it was
sufficiently initialized to work as a D-cache) but the I-cache does not
behave properly as a D-cache. It is unlikely that it will ever be useful
to have the caches swapped but not isolated.
If software does use a swapped I-cache for word stores (a partial-word
store invalidates the line, as before) it must make sure those locations
are invalidated before returning to normal operation.

Initializing and sizing the caches
At machine start-up the caches are in a random state, so the result of a
cached read is unpredictable. In addition, following a reset the status
register SwC and IsC bits are also in a random state, so start-up software
had better set them to a known state before attempting any load or store
(even uncached).
Different members of the R3051 family have different cache sizes.
Software will be more portable if it dynamically determines the size of the
I-cache and D-cache at initialization time, rather than hard-wiring a
particular value.
A number of algorithms are possible. Shown below is the code contained
in IDT/sim for cache sizing. The basic algorithm works as follows:
• Isolate the D-cache.
• Swap the caches when sizing the I-cache.
• Write a marker into the initial cache entry.
• Start with the smallest permissible cache size.
• Read memory at the location for the current cache size. If it contains
the marker, that is the correct size. Otherwise, double the size to try
and repeat this step until the marker is found.
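
The reason the probe works is that an isolated cache ignores the address bits above its own size, so an access at offset "size" lands back on the first entry. The C model below is a sketch of that idea only; the structure, the helper names and the MINCACHE and MAXCACHE values are assumptions made for the example, and the initial "is there a cache at all" check is left out.

```c
#include <stdint.h>

#define MINCACHE 512u             /* smallest size probed, bytes (assumed) */
#define MAXCACHE (256u * 1024u)   /* largest size probed, bytes (assumed) */

/* Model of an isolated cache: every access "hits", and the address
 * simply wraps modulo the true cache size. */
typedef struct {
    uint32_t size;                /* true size in bytes, a power of two */
    uint32_t words[MAXCACHE / 4];
} isolated_cache_t;

static void ic_store(isolated_cache_t *c, uint32_t off, uint32_t v)
{
    c->words[(off % c->size) / 4] = v;
}

static uint32_t ic_load(const isolated_cache_t *c, uint32_t off)
{
    return c->words[(off % c->size) / 4];
}

/* Returns the detected size in bytes, or 0 if no marker was found. */
static uint32_t size_cache(isolated_cache_t *c)
{
    uint32_t size;

    /* Clear every candidate boundary so stale data cannot pass as the marker. */
    for (size = MINCACHE; size <= MAXCACHE; size <<= 1)
        ic_store(c, size, 0);

    ic_store(c, 0, 0xffffffffu);  /* marker in the first cache entry */

    for (size = MINCACHE; size <= MAXCACHE; size <<= 1)
        if (ic_load(c, size) != 0)
            return size;          /* wrapped round onto the marker */
    return 0;
}
```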
/*
** Config_cache() -- determine sizes of i and d caches
** Sizes stored in globals dcache_size and icache_size
*/
#define CONFIGFRM ((4*4)+4+4)
FRAME(config_cache,sp, CONFIGFRM, ra)
        .set    noreorder
        subu    sp,CONFIGFRM
        sw      ra,CONFIGFRM-4(sp)      # save return address
        sw      s0,4*4(sp)              # save s0 in first regsave slot
        mfc0    s0,C0_SR                # save SR
        mtc0    zero,C0_SR              # disable interrupts
        .set    reorder
        jal     _size_cache
        sw      v0,dcache_size
        li      v0,SR_SWC               # swap caches
        .set    noreorder
        mtc0    v0,C0_SR
        jal     _size_cache
        nop
        sw      v0,icache_size
        mtc0    zero,C0_SR              # swap back caches
        and     s0,~SR_PE               # do not inadvertently clear PE
        mtc0    s0,C0_SR                # restore SR
        .set    reorder
        lw      s0,4*4(sp)              # restore s0
        lw      ra,CONFIGFRM-4(sp)      # restore ra
        addu    sp,CONFIGFRM            # pop stack
        j       ra
ENDFRAME(config_cache)
/*
** _size_cache()
** return size of current data cache
*/
FRAME(_size_cache,sp,0,ra)
        .set    noreorder
        mfc0    t0,C0_SR                # save current sr
        and     t0,~SR_PE               # do not inadvertently clear PE
        or      v0,t0,SR_ISC            # isolate cache
        mtc0    v0,C0_SR
        /*
         * First check if there is a cache there at all
         */
        move    v0,zero
        li      v1,0xa5a5a5a5           # distinctive pattern
        sw      v1,K0BASE               # try to write into cache
        lw      t1,K0BASE               # try to read from cache
        nop
        mfc0    t2,C0_SR
        nop
        .set    reorder
        and     t2,SR_CM
        bne     t2,zero,3f              # cache miss, must be no cache
        bne     v1,t1,3f                # data not equal -> no cache
        /*
         * Clear cache size boundaries to known state.
         */
        li      v0,MINCACHE
1:
        sw      zero,K0BASE(v0)
        sll     v0,1
        ble     v0,MAXCACHE,1b

        li      v0,-1
        sw      v0,K0BASE(zero)         # store marker in cache
        li      v0,MINCACHE             # MIN cache size
2:
        lw      v1,K0BASE(v0)           # Look for marker
        bne     v1,zero,3f              # found marker
        sll     v0,1                    # cache size * 2
        ble     v0,MAXCACHE,2b          # keep looking
        move    v0,zero                 # must be no cache
        .set    noreorder
3:
        mtc0    t0,C0_SR                # restore sr
        j       ra
        nop
ENDFRAME(_size_cache)
        .set    reorder

In a properly initialized cache, every cache entry is either invalid or
correctly corresponds to a memory location, and also contains correct
parity. Again, the sample code shown is from IDT/sim. The code works as
follows:
• Check that SR bit PZ is cleared to zero (1 disables parity; the R3071
and R3081 contain parity bits, and thus PZ=1 could cause the caches
to be initialized improperly).
• Isolate the D-cache, swap to access the I-cache.
• For each word of the cache: first write a word value (writing correct
tag, data and parity), then write a byte (invalidating the line).
Note that for an I-cache with 4 words per line this is inefficient; it
would be enough to write just one byte in the line to invalidate the
entry. Unless the system uses the invalidate routine often it doesn’t
seem worth the trouble.
FRAME(flush_cache,sp,0,ra)
        lw      t1,icache_size
        lw      t2,dcache_size
        .set    noreorder
        mfc0    t3,C0_SR                # save SR
        nop
        and     t3,~SR_PE               # dont inadvertently clear PE
        beq     t1,zero,_check_dcache   # if no i-cache check d-cache
        nop
        li      v0,SR_ISC|SR_SWC        # disable intr, isolate and swap
        mtc0    v0,C0_SR
        li      t0,K0BASE
        .set    reorder
        or      t1,t0,t1
1:
        sb      zero,0(t0)
        sb      zero,4(t0)
        sb      zero,8(t0)
        sb      zero,12(t0)
        sb      zero,16(t0)
        sb      zero,20(t0)
        sb      zero,24(t0)
        addu    t0,32
        sb      zero,-4(t0)
        bne     t0,t1,1b
        /*
         * flush data cache
         */
_check_dcache:
        li      v0,SR_ISC               # isolate and swap back caches
        .set    noreorder
        mtc0    v0,C0_SR
        nop
        beq     t2,zero,_flush_done
        .set    reorder
        li      t0,K0BASE
        or      t1,t0,t2
1:
        sb      zero,0(t0)
        sb      zero,4(t0)
        sb      zero,8(t0)
        sb      zero,12(t0)
        sb      zero,16(t0)
        sb      zero,20(t0)
        sb      zero,24(t0)
        addu    t0,32
        sb      zero,-4(t0)
        bne     t0,t1,1b

        .set    noreorder
_flush_done:
        mtc0    t3,C0_SR                # un-isolate, enable interrupts
        .set    reorder
        j       ra
ENDFRAME(flush_cache)

Invalidation
Invalidation refers to the act of setting specified cache lines to contain
no valid references to main memory, but to otherwise be consistent (e.g.
valid parity). Software needs to invalidate:
• the D-cache when memory contents have been changed by something
other than store operations from the CPU. Typically this is done when
some DMA device is reading into memory.
• the I-cache when instructions have been either written by the CPU or
obtained by DMA. The hardware does nothing to prevent the same
locations being used in the I- and D-cache; and an update by the
processor will not change the I-cache contents.
Note that the system could be constructed to use uncached accesses to
those variables shared with a DMA device; the only difference is in
performance. In general small areas where DMA is frequent compared to
CPU activity should be mapped uncached; and larger areas where CPU
activity predominates should be invalidated by the driver at appropriate
points. Bear in mind that invalidating a word of data in the cache is faster
(probably 4-7 times faster) than an uncached load.
To invalidate the cache:
• Figure out the address range to invalidate. Invalidating a region larger
than the cache size is a waste of time.
• isolate the D-cache. Once it is isolated, the system must guard at all
costs against an exception (since the memory interface will be
temporarily disabled). Disable interrupts and ensure that software
which follows cannot cause a memory access exception;
• to work on the I-cache, swap the caches;
• write a byte value to each cache line in the range;
• (unswap and) unisolate.
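
The range arithmetic in the first and fourth steps can be sketched in C as follows. This is a model only: the array `line_valid` stands in for the cache's valid bits, the cache size is a hypothetical 2 Kbytes, and clearing an array element takes the place of the byte store that the real routine performs with the cache isolated.

```c
#include <stdint.h>
#include <stdbool.h>

#define DCACHE_BYTES 2048u                     /* hypothetical D-cache size */
#define LINE_BYTES   4u                        /* D-cache lines are 1 word */
#define NLINES       (DCACHE_BYTES / LINE_BYTES)

static bool line_valid[NLINES];                /* model of the valid bits */

/* Invalidate every line that could hold data from [base, base+nbytes).
 * Returns the number of lines touched. */
static uint32_t invalidate_range(uint32_t base, uint32_t nbytes)
{
    uint32_t touched = 0;

    /* A region larger than the cache only revisits the same lines. */
    if (nbytes > DCACHE_BYTES)
        nbytes = DCACHE_BYTES;

    uint32_t first = base & ~(LINE_BYTES - 1u);
    uint32_t end   = base + nbytes;            /* ending address + 1 */

    for (uint32_t addr = first; addr < end; addr += LINE_BYTES) {
        line_valid[(addr / LINE_BYTES) % NLINES] = false;
        touched++;
    }
    return touched;
}
```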
The invalidate routine is normally executed with its instructions
cacheable. This sounds like a lot of trouble; but in fact shouldn’t require
any extra steps to run cached. An invalidation routine in uncached space
will run 4-10 times slower.
Again, the example code fragment shown is taken from IDT/sim:
/*
** clear_cache(base_addr, byte_count)
** flush portion of cache
*/
FRAME(clear_cache,sp,0,ra)

        /*
         * flush instruction cache
         */
        lw      t1,icache_size
        lw      t2,dcache_size
        .set    noreorder
        mfc0    t3,C0_SR                # save SR
        and     t3,~SR_PE               # dont inadvertently clear PE
        nop
        nop
        li      v0,SR_ISC|SR_SWC        # disable intr, isolate and swap
        mtc0    v0,C0_SR
        .set    reorder
        bltu    t1,a1,1f                # cache is smaller than region
        move    t1,a1
1:
        addu    t1,a0                   # ending address + 1
        move    t0,a0
1:
        sb      zero,0(t0)
        sb      zero,4(t0)
        sb      zero,8(t0)
        sb      zero,12(t0)
        sb      zero,16(t0)
        sb      zero,20(t0)
        sb      zero,24(t0)
        addu    t0,32
        sb      zero,-4(t0)
        bltu    t0,t1,1b

        /*
         * flush data cache
         */
        .set    noreorder
        nop
        li      v0,SR_ISC               # isolate and swap back caches
        mtc0    v0,C0_SR
        nop
        .set    reorder
        bltu    t2,a1,1f                # cache is smaller than region
        move    t2,a1
1:
        addu    t2,a0                   # ending address + 1
        move    t0,a0
1:
        sb      zero,0(t0)
        sb      zero,4(t0)
        sb      zero,8(t0)
        sb      zero,12(t0)
        sb      zero,16(t0)
        sb      zero,20(t0)
        sb      zero,24(t0)
        addu    t0,32
        sb      zero,-4(t0)
        bltu    t0,t2,1b

        .set    noreorder
        mtc0    t3,C0_SR                # un-isolate, enable interrupts
        .set    reorder
        j       ra
ENDFRAME(clear_cache)

Testing and probing
During test, debug or when profiling, it may be useful to build up a
picture of the cache contents. Software cannot read the tag value directly,
but, for a valid line, can determine the tag value by exhaustive search:
• isolate the cache;
• load from the cache line at each possible line start address (low order
bits fixed, high order bits ranging over physical memory which exists
in the system). After each load consult the CM bit in SR, which will be
‘‘0’’ only when the tag value matches.
This takes a long time by computer terms; but to fully search a 1K
D-cache with 4Mbytes of cacheable physical memory on a 20MHz processor
will take only a couple of seconds, and will provide very valuable debugging
information. IDT/sim provides this capability.
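
The search can be modelled in C to show where the cost comes from. This is a sketch under stated assumptions, not the IDT/sim implementation: `probe_misses()` stands in for "load from the isolated cache, then read CM from SR", and the array of `line_t` entries stands in for the tag store that real software cannot read directly.

```c
#include <stdint.h>
#include <stdbool.h>

#define CACHE_BYTES 1024u                    /* the 1K D-cache of the example */
#define MEM_BYTES   (4u * 1024u * 1024u)     /* 4 Mbytes of cacheable memory */
#define NLINES      (CACHE_BYTES / 4u)       /* 1-word lines */

typedef struct { bool valid; uint32_t tag; } line_t;

/* Stand-in for an isolated load followed by a test of SR CM:
 * returns true when the access would miss. */
static bool probe_misses(const line_t *lines, uint32_t paddr)
{
    const line_t *l = &lines[(paddr / 4u) % NLINES];
    return !(l->valid && l->tag == paddr / CACHE_BYTES);
}

/* Try every physical address that maps to line `idx`. Returns the address
 * currently cached there, or UINT32_MAX if the line matches nothing. */
static uint32_t find_cached_address(const line_t *lines, uint32_t idx)
{
    for (uint32_t base = 0; base < MEM_BYTES; base += CACHE_BYTES) {
        uint32_t paddr = base + idx * 4u;
        if (!probe_misses(lines, paddr))
            return paddr;
    }
    return UINT32_MAX;
}

/* Probes needed when no line matches: lines x candidate tags per line. */
static uint32_t worst_case_probes(void)
{
    return NLINES * (MEM_BYTES / CACHE_BYTES);
}
```

For the figures quoted above this is 256 lines times 4096 candidate tags, or about a million probes.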

Configuration (R3041/71/81 only)
The R3041, R3071, and R3081 processors allow the programmer to
make choices about the cache by setting fields in the Config register:
• Cache refill burst size (R3041/71/81) : by default the R3041 refills
only 1 word in the D-cache on a cache miss; but software can program
it to use 4-word burst reads instead, by setting the Config DBR bit.
The bit can be changed at any time, without needing to invalidate the
cache.
The refill of R3071 and R3081 processors can be configured by
hardware at reset-time, but software can override that choice.
This support is provided in the hope of enhancing performance. The
proper selection for a given system will depend on both the hardware
and the application. Some systems may find an advantage in
“toggling” the bit for various portions of the software. In general, the
proper burst size selection can be determined as follows:
Burst reads make most sense when the memory is capable of
returning a burst of data significantly faster than it can return 4
individual words. Many DRAM systems are like this; most ROM and
static RAM memories are not. Similarly, data accessed from narrow
memory ports should rarely be configured for a multi-word burst.
If programs tend to access memory sequentially (working up or down
a large array, for example) then the burst refill will offer a very useful
degree of data prefetch, and performance will be enhanced. If cache
access is more random, the burst refill may actually reduce
performance (since it involves overwriting cached data with memory
data the program may never use).
As a general rule, the bigger the D-cache, the smaller the penalty for
burst refills.
• Bigger I-cache in exchange for smaller D-cache (R3071/81) : the R3081
cache can be organized either with both I-cache and D-cache 8Kbytes
in size, or with a 16Kbyte I-cache and 4Kbyte D-cache. The
configuration is programmed using the AC bit in the Config register.

After changing the cache configuration both caches should be reinitialized, while running uncached. This means that most systems
will not dynamically reconfigure the caches.
Which configuration is best for a given system is mainly dependent on
the software. Cache effects are extremely hard to predict, and it is
recommended that both configurations be tried and measured, while
running as much of the real system as possible.
As a general rule: with large applications (like in a big OS) the big
I-cache will probably be best. If the system spends most of its time
manipulating lots of data from tight program loops, the big D-cache
may be better.

WRITE BUFFER
The write-through cache common to all R30xx family CPUs can be a big
performance bottleneck. In the average C program only about 10% of
instructions are stores, but these accesses tend to come in bursts; for
example, when a function prologue saves a few registers.
DRAM memory frequently has the characteristic that the first write of a
group takes quite a long time (5-10 clocks typical on these CPUs), and
subsequent ones are relatively fast so long as they follow quickly.
If the CPU simply waits for all writes to complete, the performance hit
will be significant. So the R30xx provides a write buffer, a FIFO store which
keeps a number of entries each containing both data to be written, and the
address at which to write it. The 4-entry queue provided by R30xx family
CPUs is efficient for well-tuned DRAM.
In general, the operation of the write buffer is completely transparent to
software. Occasionally, the programmer needs to be aware of what is
happening:
• Timing relations for IO register accesses : When software performs a
store to write an IO register, the store reaches memory after a small,
but indeterminate, delay. Some consequences are:
— other communication with the IO system (e.g. interrupts) may
happen more quickly – for example, the CPU may get an interrupt
from a device ‘‘after’’ it has been programmed to generate no
interrupts.
— if the IO device needs some time to recover after a write the program
must ensure that the write buffer FIFO is empty before counting
out that time period.
— at the end of interrupt service, when writing to an IO device to clear
the interrupt it is asserting, software must ensure that the
command is actually written to the device, and that the device has had
time to respond, before re-enabling that interrupt; otherwise, spurious
interrupts may be signalled.
In these cases the programmer must ensure that the CPU waits while
the write buffer empties. It is good practice to define a subroutine
which does this job; it is traditionally called wbflush(). Hints on
implementing this function are provided later in this chapter.
On CPUs outside the R30xx family, even stranger things can happen:
• Reads overtaking writes : a load instruction (uncached or missing in
the cache) executed while the write buffer FIFO is not empty gives the
CPU a choice: should it finish off the write, or use the memory
interface to fetch data for the load?
The R3041, R3051, R3052 and R3081 all have the same rule, which
avoids potential problems: the write buffer is emptied before the load
occurs.
Although it seems tempting to instead implement a scheme which
checks for conflicts, and allows the read to progress if no write buffer
entry matches the read target address, such a scheme does not avoid
the possible system problems. Specifically, writes to locations which
may have side effects (e.g. semaphores, IO registers, etc.), are not
detected under such a scheme, and can cause great headaches to the
programmer.
• Byte gathering : some write buffers watch for partial-word writes
within the same memory word, and will combine those partial writes
into a single operation. This is not done by any current R30xx family
CPU, because such operation would pose problems with IO register
writes.

Implementing wbflush()
IDT R30xx family CPUs enforce strict write priority (all pending writes
retired to memory before main memory is read). Thus, implementing
wbflush() is as simple as implementing an uncached load (e.g. from the
boot PROM vector). This will stall the CPU until the writes have finished,
and the load finished too. Alternately, the overhead can be minimized by
performing an uncached load from the fastest memory available in the
system.
The code fragment below shows an implementation of wbflush(), taken
from IDT/sim:
/*
** wbflush() flush the write buffer - this is specific for each
** hardware configuration.
*/
FRAME(wbflush,sp,0,ra)
        .set    noreorder
        lw      t0,wbflush              # read an uncached memory location
        j       ra
        nop
        .set    reorder
ENDFRAME(wbflush)
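
Why an uncached load is sufficient can be seen from a small model of the write buffer. The C below is a simulation sketch only; the `bus_t` structure, the 64-word "memory" and all function names are invented for the example, and a full FIFO is simply emptied in one step rather than one entry at a time.

```c
#include <stdint.h>

#define WB_DEPTH  4                /* write buffer depth on R30xx CPUs */
#define MEM_WORDS 64               /* tiny model memory (assumed size) */

typedef struct { uint32_t addr, data; } wb_entry_t;

typedef struct {
    wb_entry_t fifo[WB_DEPTH];
    int        count;              /* entries waiting to be written */
    uint32_t   mem[MEM_WORDS];
} bus_t;

/* Retire all pending writes to memory, oldest first. */
static void wb_drain(bus_t *b)
{
    for (int i = 0; i < b->count; i++)
        b->mem[(b->fifo[i].addr / 4u) % MEM_WORDS] = b->fifo[i].data;
    b->count = 0;
}

/* A store only queues the write; the CPU carries on immediately. */
static void cpu_store(bus_t *b, uint32_t addr, uint32_t data)
{
    if (b->count == WB_DEPTH)
        wb_drain(b);               /* model of the CPU stalling on a full FIFO */
    b->fifo[b->count].addr = addr;
    b->fifo[b->count].data = data;
    b->count++;
}

/* Strict write priority: pending writes complete before any read. */
static uint32_t uncached_load(bus_t *b, uint32_t addr)
{
    wb_drain(b);
    return b->mem[(addr / 4u) % MEM_WORDS];
}

/* In this model wbflush() is therefore just a dummy uncached load. */
static void wbflush(bus_t *b)
{
    (void)uncached_load(b, 0);
}
```

In a driver, the pattern is a store to the device register, then wbflush(), and only then any delay loop or interrupt re-enable.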

CHAPTER 6

MEMORY MANAGEMENT AND THE TLB
Some R30xx family processors (“E” versions) have on-chip memory
management hardware. This provides a mechanism for dynamically
translating program addresses in the kuseg and kseg2 regions. The key
piece of hardware is the ‘‘TLB†’’.
The memory management is paged, with a fixed page size of 4Kbytes.
The low-order 12 bits of the program address are used directly as the low
order bits of the physical address, so address translation operates in 4K
chunks.
The TLB is a 64-entry associative memory. Each entry in an associative
memory consists of a key field and a data field; when presented with a key,
the memory returns the data of any entry where the key matches.
In the R30xx family, the TLB is referred to as ‘‘fully-associative’’; this
emphasizes that all keys are really compared with the input value in
parallel.
The TLB’s key field contains two sections:
• Virtual page number (VPN) : this is just a program address with the low
12 bits cut off, since the low-order bits don’t participate in the
translation process.
• Address space identifier (ASID) : this is a magic number used to
stamp translations, and (optionally) is compared with an extended
part of the key. Why?
In multi-tasking systems it is common to have all user-level tasks
executing at the same sort of program addresses (though of course
they are using different physical addresses); they are said to be using
different address spaces. So translation records for different tasks
will often share the same value of ‘‘VPN’’. If the TLB mechanism was
not supported with an ASID, when the OS switches from one task to
another, it would have to find and invalidate all TLB translations
relating to the old task’s address space, to prevent them from being
erroneously used for the new one. This would be desperately
inefficient.
Instead, the OS assigns a 6-bit unique code to each task’s distinct
address space. During normal running this code is kept in the ASID
field of the EntryHi register, and is used together with the program
address to form the lookup key; so a translation with an ASID code
which doesn’t match is quietly ignored.
Since the ASID is only 6 bits long, OS software does have to lend a
hand if there are ever more than 64 address spaces in concurrent use;
but it probably won’t happen too often. In such a system, new tasks
are assigned new ASIDs until all 64 are assigned; at that time, all
tasks have their ASIDs “de-assigned” and the TLB is flushed; as each
task is re-entered, a new ASID is given. Thus, ASID flushing is
relatively infrequent.
The TLB data field includes:
• Physical frame number (PFN) : the physical address with the low 12
bits cut off. In an address translation, the VPN bits are replaced by
the corresponding PFN bits to form the true physical address.
• Cache control bit (N) : set 1 to make the page uncacheable.

† This is an acronym for ‘‘translation lookaside buffer’’, which is a
look-up table of virtual to physical address translations.
• Write control bit (D) : set 1 to allow stores to this page to happen. The
‘‘D’’ comes from this being called the ‘‘dirty bit’’; a later section on
“Simulating dirty bits” describes a typical use for these bits.
• Valid bit (V) : set 0 to make this entry unusable. This seems pretty
pointless; why have a record loaded into the TLB if the translation is
not usable? But an access to an invalid page produces a different trap
from a TLB refill exception, so making a page invalid means that some
strange conditions can be made to take a different trap, which does
not have to be handled by the superfast refill code.
• Global bit (G) : set to disable the ASID-matching scheme, allowing an
OS to map some program addresses to the same physical address for
all tasks; it can be useful to have some corner of each address space
mapped to the same physical locations. Sharp-eyed or experienced
readers will notice that this means that the global bit is really more
like part of the key than part of the data; the distinction tends to get
blurred in associative memories.
Translating an address is now simple, and goes like this:
• CPU generates a program address : either for an instruction fetch, a
load or a store, in one of the translated address regions. The low 12
bits are separated off, and the resulting VPN together with the current
value of the ASID field in EntryHi is used as the key to the TLB.
• TLB matches key : selecting the matching entry. The PFN is glued to
the low-order bits of the program address to form a complete physical
address.
• Valid? : the V and D bits are consulted. If it isn’t valid, or a store is
being attempted with D cleared, the CPU takes a trap. As with all
translation traps, the BadVaddr register will be filled with the
offending program address and TLB registers Context and EntryHi
pre-filled with relevant information. The system software can use
these registers to obtain data for exception service.
• Cached? : if the N bit is clear the CPU looks in the cache for a copy of
the physical location’s data; if it isn’t there it will be fetched from
memory and a copy left in the cache. Where the N bit is set the CPU
neither looks in nor refills the cache.
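
The translation steps above can be modelled in a few lines of C. This is only a host-side sketch for illustration, not a description of the hardware implementation: the structure, the function name and the result codes are all hypothetical, and the loop stands in for a comparison which the real TLB performs on all entries in parallel.

```c
#include <stdint.h>

#define TLB_ENTRIES 64
#define PAGE_SHIFT  12

/* One TLB entry, split into the key and data fields described above. */
struct tlb_entry {
    uint32_t vpn;          /* key: program address >> 12 */
    uint8_t  asid;         /* key: 6-bit address space identifier */
    uint32_t pfn;          /* data: physical address >> 12 */
    uint8_t  n, d, v, g;   /* flag bits, each 0 or 1 */
};

/* Hypothetical result codes for the model. */
enum tlb_result {
    TLB_HIT_CACHED,        /* translated, N clear */
    TLB_HIT_UNCACHED,      /* translated, N set */
    TLB_REFILL,            /* no entry matched the key */
    TLB_INVALID,           /* entry matched but V is clear */
    TLB_STORE_DENIED       /* store attempted with D clear */
};

/* Model one translation; *paddr is written only on a hit. */
static enum tlb_result tlb_translate(const struct tlb_entry tlb[TLB_ENTRIES],
                                     uint32_t vaddr, uint8_t cur_asid,
                                     int is_store, uint32_t *paddr)
{
    uint32_t vpn = vaddr >> PAGE_SHIFT;

    for (int i = 0; i < TLB_ENTRIES; i++) {
        const struct tlb_entry *e = &tlb[i];

        /* The key matches on VPN, and on ASID unless the entry is global. */
        if (e->vpn != vpn || (!e->g && e->asid != cur_asid))
            continue;
        if (!e->v)
            return TLB_INVALID;
        if (is_store && !e->d)
            return TLB_STORE_DENIED;

        /* Glue the PFN onto the untranslated low 12 bits. */
        *paddr = (e->pfn << PAGE_SHIFT) | (vaddr & ((1u << PAGE_SHIFT) - 1));
        return e->n ? TLB_HIT_UNCACHED : TLB_HIT_CACHED;
    }
    return TLB_REFILL;
}
```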
Of course, there are only 64 entries in the TLB, which can hold
translations for a maximum of 256 Kbytes of program addresses. This is
far short of enough for most systems. The TLB is almost always going to be
used as a software-maintained ‘‘cache’’ for a much larger set of
translations.
When a program address lookup in the TLB fails, a TLB refill trap is
taken. System software has the job of:
• figuring out whether there is a correct translation; if not the trap will
be dispatched to the software which handles address errors.
• if there is a correct translation, constructing a TLB entry which will
implement it;
• if the TLB is already full (and it almost always is full in running
systems), selecting an entry which can be discarded;
• writing the new entry into the TLB.

See below for how this can be tackled; but note here that although
special CPU features help out with one particular class of
implementations, the software can refill the TLB any way it likes.
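
As an illustration of the software's job, the lookup half of a refill can be sketched in C. This is one possible arrangement under stated assumptions, not the method used by any particular OS: it assumes a flat, memory-held table indexed by VPN whose entries are already in EntryLo format, and the function name and parameters are hypothetical. Writing the resulting pair into the TLB is left to the caller.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Find the translation for a faulting program address.
 * Returns 0 and fills in the EntryHi/EntryLo images on success, or -1 if
 * the table holds no translation (the caller would then treat the access
 * as an address error).
 */
static int refill_lookup(const uint32_t *page_table, size_t n_pages,
                         uint32_t badvaddr, uint32_t asid,
                         uint32_t *entryhi, uint32_t *entrylo)
{
    uint32_t vpn = badvaddr >> 12;

    if (vpn >= n_pages)
        return -1;                                   /* no correct translation */

    *entryhi = (vpn << 12) | ((asid & 0x3fu) << 6);  /* key: VPN and ASID */
    *entrylo = page_table[vpn];                      /* data: PFN and flag bits */
    return 0;
}
```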
Register     CP0
Mnemonic     reg no   Description

EntryHi      10       Together these registers hold a TLB entry. All reads and
EntryLo      2        writes to the TLB must be staged through them.
                      EntryHi also remembers the current ASID.

Index        0        Determines which TLB entry will be read/written by
                      appropriate instructions.

Random       1        Pseudo-random value (actually a free-running counter)
                      used by a tlbwr to write a new TLB entry into a
                      ‘‘randomly’’ selected location.

Context      4        Convenience register provided to speed up the processing
                      of TLB refill traps. The high-order bits are read/write;
                      the low-order 21 bits reflect the BadVaddr value.
                      (The register is designed so that, if the system uses the
                      ‘‘favored’’ arrangement of memory-held copies of memory
                      translation records, it will be set up by a TLB refill
                      trap to point to the memory location of the record needed
                      to map the offending address. This speeds up the process
                      of finding the current memory mapping, and arranging
                      EntryHi/Lo properly.)

Table 6.1. CPU control registers for memory management

MMU registers described
EntryHi, EntryLo

EntryHi Register (TLB key fields)
    Bits     31-12    11-6     5-0
    Field    VPN      ASID     0
Figure 6.1. EntryHi and EntryLo register fields

EntryLo Register (TLB data fields)
    Bits     31-12    11    10    9     8     7-0
    Field    PFN      N     D     V     G     0
Figure 6.2. EntryHi and EntryLo register fields

These two registers represent a TLB entry, and are best considered as a
pair. Fields in EntryHi are:
• VPN : ‘‘virtual page number’’, the high-order bits of a program address.
On a refill exception this field is set up automatically to match the
program address which could not be translated. To write a different
TLB entry, or attempt a TLB probe, software must set it up
“manually”.
• ASID : ‘‘address space identifier’’, normally left holding the OS’ value
for the current address space. This is not changed by exceptions.
Most software systems will deliberately write this field only to set up
the current address space.
However, software must be careful when using tlbr to inspect TLB
entries; the operation overwrites the whole of EntryHi, so software
needs to restore the correct current ASID value afterwards.

Fields in EntryLo are:
• PFN : the high-order bits of the physical address to which values
matching EntryHi’s VPN will be translated.
• N : ‘‘noncacheable’’; 0 to make the access cacheable, 1 for
uncacheable.
• D : ‘‘dirty’’, but really a write-enable bit. 1 to allow writes; if 0, any
store using this translation will be trapped.
• V : ‘‘valid’’, if 0 any address matching this entry will cause an
exception.
• G : ‘‘global’’. When the G bit in a TLB entry is set, that TLB entry will
match solely on the VPN field, regardless of whether the TLB entry’s
ASID field matches the value in EntryHi.
• Fields called ‘‘0’’ : these fields always return zero; but unlike many
reserved fields, they do not need to be written as zero (nothing
happens regardless of the data written). This is important; it means
that the memory-resident data which is used to generate EntryLo
when refilling the TLB can contain some software-interpreted data in
these fields, which the TLB hardware will ignore without the need to
spend precious CPU cycles masking it.
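
The field layouts shown in Figures 6.1 and 6.2 translate directly into C masks and shifts. The sketch below is illustrative only: the macro and function names are hypothetical and are not taken from any IDT header file.

```c
#include <stdint.h>

/* Field positions as drawn in Figures 6.1 and 6.2. */
#define ENTRYHI_ASID_SHIFT  6
#define ENTRYHI_ASID_MASK   0x3fu
#define ENTRYLO_N           (1u << 11)   /* noncacheable */
#define ENTRYLO_D           (1u << 10)   /* write enable ("dirty") */
#define ENTRYLO_V           (1u << 9)    /* valid */
#define ENTRYLO_G           (1u << 8)    /* global */
#define PAGE_MASK           0xfffu       /* low 12 bits, not translated */

/* Build an EntryHi image from a program address and an ASID. */
static inline uint32_t make_entryhi(uint32_t vaddr, uint32_t asid)
{
    return (vaddr & ~PAGE_MASK) |
           ((asid & ENTRYHI_ASID_MASK) << ENTRYHI_ASID_SHIFT);
}

/* Build an EntryLo image from a physical address and N/D/V/G flags. */
static inline uint32_t make_entrylo(uint32_t paddr, uint32_t flags)
{
    return (paddr & ~PAGE_MASK) |
           (flags & (ENTRYLO_N | ENTRYLO_D | ENTRYLO_V | ENTRYLO_G));
}

/* Extract the ASID field from an EntryHi image. */
static inline uint32_t entryhi_asid(uint32_t entryhi)
{
    return (entryhi >> ENTRYHI_ASID_SHIFT) & ENTRYHI_ASID_MASK;
}
```

For example, make_entrylo(paddr, ENTRYLO_V | ENTRYLO_D) describes a valid, writable, cacheable, non-global page.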
Index

    Bits     31     30-14    13-8     7-0
    Field    P      ×        Index    ×
Figure 6.3. Fields in the Index register

The ‘‘P’’ field is set when a tlbp instruction (tlb probe, used to see if the
TLB can translate a particular VPN) failed to find a valid translation; since
it is the top bit it appears to make the 32-bit value negative, which is easy
to test for.
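
In C, a value read from the Index register can be decoded as sketched below. The function names are hypothetical, and the value is assumed to have been read from CP0 register 0 by some other means (for example an mfc0 in an assembler stub). Assembler code would normally test the sign directly with a branch-on-negative; the mask test used here gives the same answer portably.

```c
#include <stdint.h>

/* Nonzero if the P bit (bit 31) is set, i.e. the last tlbp found no match. */
static inline int index_probe_failed(uint32_t index_reg)
{
    return (index_reg & 0x80000000u) != 0;
}

/* TLB entry number held in bits 13..8 of the Index register. */
static inline unsigned index_entry(uint32_t index_reg)
{
    return (index_reg >> 8) & 0x3fu;
}
```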
Random

    Bits     31-14    13-8      7-0
    Field    ×        Random    ×
Figure 6.4. Fields in the Random register

Most systems never have to read or write the Random register, shown as
Figure 6.4, “Fields in the Random register”, in normal use; but it may be
useful for diagnostics. The hardware initializes the Random field to its
maximum value (63) on reset, and it decrements every clock period until it
reaches 8, when it wraps back to 63 and starts again.
Context

    Bits     31-21      20-2       1-0
    Field    PTEBase    Bad VPN    0
Figure 6.5. Fields in the Context Register

• PTEBase : a location which just stores what is put in it. In the
‘‘standard’’ refill handler, this will be the high-order bits of the
(2Mbyte aligned) starting address of a memory-resident page table.
• Bad VPN : following an addressing exception this holds the high-order
bits of the address; exactly the same as the high-order bits of
BadVaddr. However, if the system uses the ‘‘standard’’ TLB refill
exception handling code, the 32-bit value formed by Context is directly
usable as a pointer to the memory-resident page table, considerably
shortening the refill exception code.
• Fields marked 0 : can be written with any value, but they will always
read zero.
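
The way the Context value is assembled can be sketched in C as follows. This models what the hardware does on a refill trap and is not code which software needs to run. It assumes that the 19-bit Bad VPN field holds program address bits 30 through 12, and that each memory-resident translation record is one 4-byte word; the function name is hypothetical.

```c
#include <stdint.h>

/*
 * Model of the Context register contents after an addressing exception:
 * PTEBase in bits 31..21, Bad VPN in bits 20..2, and zero in bits 1..0.
 * Because the low two bits are zero, the result is word aligned and can
 * be used directly as the address of a 4-byte page table entry.
 */
static inline uint32_t context_value(uint32_t ptebase, uint32_t badvaddr)
{
    uint32_t badvpn = (badvaddr >> 12) & 0x7ffffu;  /* address bits 30..12 (assumed) */

    return (ptebase & 0xffe00000u) | (badvpn << 2);
}
```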

MMU control instructions
tlbr – Read TLB entry at index
tlbwi – Write TLB entry at index
    The above two instructions move MMU data between the TLB entry
    selected by the Index register and the EntryHi and EntryLo registers.
tlbwr – Write TLB entry selected by Random
    Copies the contents of EntryHi & EntryLo into the TLB entry indexed
    by the Random register. This saves time when using the
    recommended random replacement policy. In practice, tlbwr will be
    used to write a new TLB entry in a TLB refill exception handler; tlbwi
    will be used anywhere else.
tlbp – TLB lookup
    Searches (probes) the TLB for an entry whose virtual page number
    and ASID match those currently in EntryHi, and stores the index
    of that entry in the Index register (Index is set to a negative value if
    nothing matches). If more than one entry matches, anything might
    happen. Note that tlbp does not fetch data from the TLB. The
    instruction following a tlbp must not be a load or store.

Programming interface to the TLB
TLB entries are set up by writing the required fields into EntryHi and
EntryLo and using a tlbwr or tlbwi instruction to copy that entry into the
TLB proper.
When handling a TLB refill exception, EntryHi has been set up
automatically, with the current ASID and the required VPN.
Be very careful not to create two entries which will match the same
program address/ASID pair. If the TLB contains duplicate entries an
attempt to translate such an address, or probe for it, produces a fatal ‘‘TLB
shutdown’’ condition (indicated by the TS bit in SR being set). It can be
cleared only by a hardware reset.
System software often won’t need to read TLB entries at all. But if
necessary, software can find the TLB entry matching some particular
program address using tlbp to setup the Index register. Don’t forget to save
EntryHi and restore it afterwards because its ASID field is likely to be
important.
Use a tlbr to read the TLB entry into EntryHi and EntryLo.
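
One way to stay clear of duplicate entries is to probe before writing, and to overwrite the matching entry if there is one. The C model below sketches that policy on an in-memory array rather than the real TLB; the structure and function names are hypothetical, tlb_probe() stands in for tlbp, and the store into the array stands in for tlbwi. Treating a global entry as matching any ASID is an assumption carried over from the description of the G bit.

```c
#include <stdint.h>

#define TLB_ENTRIES 64

/* Key part of a modelled TLB entry. */
struct tlb_key {
    uint32_t vpn;
    uint8_t  asid;
    uint8_t  g;
};

/* Model of tlbp: index of the entry matching (vpn, asid), or -1 if none. */
static int tlb_probe(const struct tlb_key tlb[TLB_ENTRIES],
                     uint32_t vpn, uint8_t asid)
{
    for (int i = 0; i < TLB_ENTRIES; i++)
        if (tlb[i].vpn == vpn && (tlb[i].g || tlb[i].asid == asid))
            return i;
    return -1;
}

/*
 * Install a mapping without creating a duplicate: reuse the matching
 * slot if the probe finds one, otherwise use the caller's chosen slot.
 * Returns the index actually written.
 */
static int tlb_install(struct tlb_key tlb[TLB_ENTRIES],
                       uint32_t vpn, uint8_t asid, int fallback_index)
{
    int i = tlb_probe(tlb, vpn, asid);

    if (i < 0)
        i = fallback_index;
    tlb[i].vpn = vpn;
    tlb[i].asid = asid;
    tlb[i].g = 0;
    return i;
}
```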
How refill happens
When a program makes an access in kuseg or kseg2 to a page for which
no translation record is present, the CPU takes a TLB refill exception. The
assumption is that system software is maintaining a large number of page
translations and is using the TLB as a cache of recently-used translations;
so the refill exception will normally be handled by finding a correct
translation, installing it, and returning to user code.
In ‘‘CISC’’ CPUs the TLB is a cache (usually implemented by microcode),
and the CPU automatically reads memory-resident ‘‘page tables’’ whose
structure is part of the CPU architecture.
In the MIPS architecture software is fast enough, and offers greater
flexibility.
To save time on user-program TLB refill exceptions (which will happen
frequently in a ‘‘big’’ OS):
• refill exceptions on kuseg program addresses are vectored through a
low-memory address used for no other exception;

• special exception rules permit the kuseg refill handler to risk a nested
TLB refill exception on a kseg2 address.
The problem is that before an exception routine can itself suffer an
exception it must first save the previous program state, represented
by the EPC return address and some SR bits. This is helped out by a
hardware feature and a software convention:
a) The KUo, IEo bits in the status register act as a third level of the
   processor-state stack, so that the CPU state already saved as a
   result of the kuseg refill exception can be preserved during the
   nested exception.
b) The kuseg refill handler copies EPC into the k1 register; the
   general exception code and kseg2 refill handler are then careful
   to preserve its value, enabling a clean return.
Refill exceptions on kseg2 addresses are expected to be rare enough that
it will not matter if they share in the overhead of the ‘‘all other exceptions’’
entry point. However, once software determines the type of exception the
handling is similar.
Using ASIDs
TLB entries set up with a particular ASID and with the EntryLo G bit
zero will only ever match a program address when the ASID field of
EntryHi holds the same value. This allows software to map up to 64
different address spaces simultaneously, without requiring that the OS
clear out the TLB on a context change.
In typical usage, new tasks are assigned an “un-initialized” ASID. The
first time the task is invoked, it will presumably miss in the TLB, allowing
the assignment of an ASID. If the system does run out of new ASIDs, it will
flush the TLB and mark all tasks as “new”. Thus, as each task is
re-entered, it will be assigned a new ASID. This sequence is expected to
happen infrequently if ever.
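
The allocation scheme just described might be sketched as follows. This is one possible implementation under stated assumptions, not code from any particular OS: all names are hypothetical, the ‘‘mark all tasks as new’’ step is implemented with a generation counter (a design choice made here so that no task list has to be walked), and flush_tlb_model() merely counts calls where a real system would invalidate the TLB.

```c
#include <stdint.h>

#define ASID_COUNT 64
#define ASID_NONE  (-1)

struct task {
    int      asid;         /* ASID_NONE until first assigned */
    unsigned generation;   /* generation in which asid was assigned */
};

static unsigned cur_generation = 1;
static int      next_asid = 0;
static unsigned tlb_flush_count;   /* stand-in for a real TLB flush */

static void flush_tlb_model(void)
{
    tlb_flush_count++;
}

/* Return a usable ASID for the task, assigning a fresh one if needed. */
static int task_asid(struct task *t)
{
    if (t->asid != ASID_NONE && t->generation == cur_generation)
        return t->asid;             /* still valid from an earlier run */

    if (next_asid == ASID_COUNT) {  /* all 64 ASIDs are in use */
        flush_tlb_model();
        cur_generation++;           /* every existing task becomes "new" */
        next_asid = 0;
    }
    t->asid = next_asid++;
    t->generation = cur_generation;
    return t->asid;
}
```

A context switch would call task_asid() for the incoming task and write the result into the ASID field of EntryHi.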
The Random register and wired entries
The hardware offers no way of finding out which TLB entries have been
used most recently. When the system needs to replace a mapping
dynamically (using the TLB as a cache) the only practicable strategy is to
replace an entry at random. The CPU makes this easy by maintaining the
Random register, which counts (down) with every processor cycle.
However, it is often useful to have some TLB entries which are
guaranteed to stay there unless explicitly removed. These may be useful to
map pages which are known to be required very often; they are critical
because they allow the system to map pages and guarantee that no refill
exception will be generated on them.
The stable TLB entries are described as ‘‘wired’’ and on R30xx family
CPUs consist of TLB entries 0 through 7. There is nothing special about
these entries; the magic is in the Random register, which never takes
values 0-7; it cycles directly from 63 down to 8 before reloading with 63.
So conventional random replacement leaves TLB entries 0 through 7
unaffected, and entries written there will stay until explicitly removed.
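
The behavior of the Random field can be modelled in a couple of lines of C. This is a host-side illustration only, with hypothetical names; on the CPU the counting is done by hardware.

```c
#define TLB_ENTRIES 64   /* entries 0..63 */
#define TLB_WIRED   8    /* entries 0..7 are never selected */

/* One clock period of the Random field: 63 down to 8, then back to 63. */
static unsigned random_next(unsigned r)
{
    return (r <= TLB_WIRED) ? TLB_ENTRIES - 1 : r - 1;
}
```

Starting from the reset value of 63, repeated application never yields a value below 8, which is why a tlbwr cannot disturb the wired entries.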

Memory translation – setup
The following code fragment initializes the TLB to ensure no match on
any kuseg or kseg2 address. This is important, and is preferable to
initializing every entry with all zeros (zero is a kuseg address, and the
identical entries would cause multiple matches if it were referenced):
LEAF(mips_init_tlb)
        mfc0    t0,C0_ENTRYHI   # save asid
        mtc0    zero,C0_ENTRYLO # tlblo = !valid
li
a1,NTLBID<vaddr) >> VMPGSHIFT;
unsigned vpn = xcp->vaddr >> VMPGSHIFT;
unsigned asid = 0;
/* write a random tlb (entryhi, entrylo) pair */
/* mark it valid, global, uncached, and not writable/dirty */
r3k_tlbwr ((vpn <
