R3000 Manual

Page Count: 354

IDT R30xx Family
Software Reference Manual

Revision 1.0

©1994 Integrated Device Technology, Inc.
Portions ©1994 Algorithmics, Ltd.
Chapter 16 contains some material that is ©1988 Prentice-Hall.
Appendices A & B contain material that is ©1994 by MIPS Technology, Inc.

About IDT
Integrated Device Technology, Inc. has been a MIPS semiconductor
partner since 1988, and has led efforts to bring the high performance
inherent in the MIPS architecture to embedded systems engineers. These
efforts include derivatives of MIPS R3xxx and R4xxx CPUs, development
tools, and applications support.
Additional information about IDT’s RISC family can be obtained from
your local sales representative. Alternatively, IDT can be reached directly at:
Corporate Marketing: (800) 345-7015
RISC Applications "Hotline": (408) 492-8208
RISC Applications FAX: (408) 492-8469
RISC Applications Internet: rischelp@idtinc.com

About Algorithmics
Much of this manual was written by Dominic Sweetman and Nigel
Stephens of Algorithmics Ltd in London, England, under contract to IDT.
Algorithmics were early enthusiasts for the MIPS architecture, designing
their first MIPS systems and system software in 1986/87. A small
engineering company, Algorithmics provide enabling technologies for
companies designing in both R30xx family CPUs and the 64-bit R4x00
architecture. This includes training, toolkits, GNU C support, and
evaluation boards. Dominic Sweetman can be reached at the following address:
Dominic Sweetman
Algorithmics Ltd
3 Drayton Park
London N5 1NU
ENGLAND.

phone: +44 71 700 3301
fax: +44 71 700 3400
email: dom@algor.co.uk

About This Manual

This manual is targeted to a systems programmer building an R30xx-based
system. It contains the architecture-specific operations and
programming conventions relevant to such a programmer.
This manual is not intended to be a tutorial on structured programming,
real-time operating systems, any particular high-level programming
language, or any particular toolchain. Other references are better suited to
those topics.
This manual does contain specific code fragments and the most
common programming conventions that are specific to the IDT R30xx
RISController family. The manual was consciously limited to the R30xx
family; information relevant to the R4xxx family of processors may be
found, but the device-specific programs (such as cache management,
exception handling, etc.) shown as examples are specific to the R30xx
family.
This manual contains references to the toolchains most commonly used
by the authors (IDT, Inc., and Algorithmics, Ltd.). Code fragments shown
are typically from software used by and/or provided by these companies,
including development tools such as IDT/c and software utilities (such as
IDT/kit, IDT/sim, and Micromonitor). A wide variety of other third-party
products is also available to support R30xx development under the
Advantage-IDT program. The reader of this manual is encouraged to look
at all the available tools to determine which toolchains and utilities best fit
the system development requirements.
Additional information on the IDT family of RISC processors, and their
support tools, is available from your local IDT sales representative.

Integrated Device Technology, Inc. reserves the right to make changes to its products or specifications at
any time, without notice, in order to improve design or performance and to supply the best possible product.
IDT does not assume any responsibility for use of any circuitry described other than the circuitry embodied
in an IDT product. The Company makes no representations that circuitry described herein is free from patent
infringement or other rights of third parties which may result from its use. No license is granted by implication or otherwise under any patent, patent rights or other rights, of Integrated Device Technology, Inc.

LIFE SUPPORT POLICY
Integrated Device Technology's products are not authorized for use as critical components in life
support devices or systems unless a specific written agreement pertaining to such intended use is
executed between the manufacturer and an officer of IDT.
1. Life support devices or systems are devices or systems which (a) are intended for surgical implant
into the body or (b) support or sustain life and whose failure to perform, when properly used in
accordance with instructions for use provided in the labeling, can be reasonably expected to result in
a significant injury to the user.
2. A critical component is any component of a life support device or system whose failure to
perform can be reasonably expected to cause the failure of the life support device or system, or to
affect its safety or effectiveness.
The IDT logo is a registered trademark and BiCameral, BurstRAM, BUSMUX, CacheRAM, DECnet,
Double-Density, FASTX, Four-Port, FLEXI-CACHE, Flexi-PAK, Flow-thruEDC, IDT/c, IDTenvY, IDT/sae,
IDT/sim, IDT/ux, MacStation, MICROSLICE, Orion, PalatteDAC, REAL8, R3041, R3051, R3052, R3081,
R3721, R4600, RISCompiler, RISController, RISCore, RISC Subsystem, RISC Windows, SARAM, SmartLogic,
SyncFIFO, SyncBiFIFO, SPC, TargetSystem and WideBus are trademarks of Integrated Device Technology,
Inc.
MIPS is a registered trademark of MIPS Computer Systems, Inc.
All others are trademarks of their respective companies.

IDT R30xx Family
Software Reference Manual
Table of Contents
Chapter 1: Introduction
What is a RISC?
PIPELINES
The IDT R3xxx Family CPUs
MIPS Architecture Levels
MIPS-1 Compared with CISC Architectures
Unusual Instruction Encoding Features
Addressing and Memory Accesses
Operations not Directly Supported
Multiply and Divide Operations
Programmer-visible Pipeline Effects
A Note on Machine and Assembler Language
Chapter 2: MIPS-1 (R30xx) Architecture
Programmer’s View of the Processor Architecture
Registers
Conventional Names and Uses of General-Purpose Registers
Notes on Conventional Register Names
Integer Multiply Unit and Registers
Instruction Types
Loading and Storing: Addressing Modes
Data types in Memory and Registers
Integer Data Types
Unaligned Loads and Stores
Floating Point Data in Memory
Basic Address Space
Summary of System Addressing
Kernel vs. User Mode
Memory map for CPUs without MMU Hardware
Subsegments in the R3041 – Memory Width Configuration
Chapter 3: System Control Coprocessor Architecture
CPU Control Summary
CPU Control and “CO-PROCESSOR 0”
CPU Control Instructions
Standard CPU control registers
PRId Register
SR Register
Cause Register
EPC Register
BadVaddr Register
R3041, R3071, and R3081 Specific Registers
Count and Compare Registers (R3041 only)
Config Register (R3071 and R3081)
Config Register (R3041)
BusCtrl Register (R3041 only)
PortSize Register (R3041 only)
What registers are relevant when?
Chapter 4: Exception Management
Exceptions
Precise Exceptions
When Exceptions Happen
Exception vectors
Exception Handling – Basics
Nesting Exceptions
An Exception Routine
Interrupts
Conventions and Examples
Chapter 5: Cache Management
Caches and Cache Management
Cache Isolation and Swapping
Initializing and Sizing the Caches
Invalidation
Testing and Probing
Configuration (R3041/71/81 only)
Write Buffer
Implementing wbflush()
Chapter 6: Memory Management and the TLB
Memory Management and the TLB
MMU Registers Described
EntryHi, EntryLo
Index
Random
Context
MMU Control Instructions
Programming Interface to the TLB
How Refill Happens
Using ASIDs
The Random Register and Wired Entries
Memory Translation – Setup
TLB Exception Sample Code
Basic Exception Handler
Fast kuseg Refill from Page Table
Simulating Dirty Bits
Use of TLB in Debugging
TLB Management Utilities
Chapter 7: Reset Initialization
Starting Up
Probing and Recognizing the CPU
Bootstrap Sequences
Starting Up an Application
Chapter 8: Floating Point Coprocessor
The IEEE754 Standard and its Background
What is Floating Point?
IEEE exponent field and bias
IEEE mantissa and normalization
Strange values use reserved exponent values
MIPS FP Data formats
MIPS Implementation of IEEE754
Floating Point Registers
Floating Point Exceptions/Interrupts
The Floating Point Control/Status Register
Floating Point Implementation/Revision Register
Guide to FP Instructions
Load/Store
Move Between Registers
3-Operand Arithmetic Operations
Unary (sign-changing) Operations
Conversion Operations
Conditional Branch and Test Instructions
Instruction Timing Requirements
Instruction Timing for Speed
Initialization and Enable On Demand
Floating Point Emulation
Chapter 9: Assembler Language Programming
Syntax Overview
Key Points to Note
Register-to-Register Instructions
Immediate (Constant) Operands
Multiply/Divide
Load/Store Instructions
Unaligned Loads and Stores
Addressing Modes
Gp-Relative Addressing
Jumps, Subroutine Calls and Branches
Conditional Branches
Co-processor Conditional Branches
Compare and Set
Coprocessor Transfers
Coprocessor Hazards
Assembler Directives
Sections
.text, .rdata, .data
.lit4, .lit8
Program Segments in Memory
.bss
.sdata, .sbss
Stack and Heap
Special Symbols
Data Definition and Alignment
.byte, .half, .word
.float, .double
.ascii, .asciiz
.align
.comm, .lcomm
.space
Symbol Binding Attributes
.globl
.extern
.weakext
Function Directives
.ent, .end
.aent
.frame, .mask, .fmask
Assembler Control (.set)
.set noreorder/reorder
.set volatile/novolatile
.set noat/at
.set nomacro/macro
.set nobopt/bopt
The Complete Guide to Assembler Instructions
Alphabetic List of Assembler Instructions
Chapter 10: C Programming
The Stack, Subroutine Linkage, Parameter Passing
Stack Argument Structure
Which Arguments go in What Registers
Examples from the C Library
Exotic Example; Passing Structures
How Printf() and Varargs Work
Returning Value from a Function
Macros for Prologues and Epilogues
Stack-Frame Allocation
Leaf Functions
Non-Leaf Functions
Functions Needing Run-Time Computed Stack Locations
Shared and Non-Shared Libraries
Sharing Code in Single-Address Space Systems
Sharing Code Across Address Spaces
An Introduction to Optimization
Common Optimizations
How to Prevent Unwanted Effects From Optimization
Optimizer-Unfriendly Code and How to Avoid It
Chapter 11: Portability Considerations
Writing Portable C
C Language Standards
C Library Functions and POSIX
Data Representations and Alignment
Notes on Structure Layout and Padding
Isolating System Dependencies
Locating System Dependencies
Fixing Up Dependencies
Isolating Non-Portable Code
Using Assembler
Endianness
What It Means to the Programmer
Bitfield Layout and Endianness
Changing the Endianness of a MIPS CPU
Designing and Specifying for Configurable Endianness
Read-Only Instruction Memory
Writable (Volatile) Memory
Byte-Lane Swapping
Configurable IO Controllers
Portability and Endianness-Independent Code
Endianness-Independent Code
Compatibility Within the R30XX Family
Porting to MIPS: Frequently Encountered Issues
Considerations for Portability to Future Devices
Chapter 12: Writing Power-On Diagnostics
Golden Rules for Diagnostics Programming
What Should Tests Do?
How to Test the Diagnostic Tests?
Overview of Algorithmics’ Power-On Selftest
Starting Points
Control and Environment Variables
Reporting
Unexpected Exceptions During Test Sequence
Driving Test Output Devices
Restarting the System
Standard Test Sequence
Notes on the Test Sequence
Annotated Examples from the Test Code
Chapter 13: Instruction Timing and Optimization
Notes and Examples
Additional Hazards
Early Modification of HI and LO
Bitfields in CPU Control Registers
Non-Obvious Hazards
Chapter 14: Software Tools for Board Bring-Up
Tools Used in Debug
Initial Debugging
Porting Micromonitor
Running Micromonitor
Initial IDT/SIM Activity
A Final Note on IDT/KIT
Software Design Examples ..............................................................................................15
Application Software ............................................................................................... 15-1
Memory Map ..................................................................................................... 15-1
Starting Up ......................................................................................................... 15-1
C Library Functions ........................................................................................... 15-2
Input and Output ......................................................................................... 15-3
Character Class Tests .................................................................................. 15-3
String Functions .......................................................................................... 15-3
Mathematical Functions .............................................................................. 15-3
Utility Functions ......................................................................................... 15-3
Diagnostics .................................................................................................. 15-4
Variable Argument Lists ............................................................................. 15-4
Non-Local Jumps ........................................................................................ 15-4
Signals ......................................................................................................... 15-4
Date and Time ............................................................................................. 15-4
Running the Program ......................................................................................... 15-4
Debugging the Program ..................................................................................... 15-5
Embedded System Software .................................................................................... 15-5
Memory Map ..................................................................................................... 15-6
Starting Up ......................................................................................................... 15-6
Embedded System Library Functions................................................................ 15-7
Trap and Interrupt Handling ....................................................................... 15-8
Simple Interrupt Routines ........................................................................... 15-8
Floating-Point Traps and Interrupts ............................................................ 15-9
Emulating Floating Point Instructions ...................................................... 15-10
Debugging........................................................................................................ 15-10
Unix-Like System S/W .......................................................................................... 15-11
Terminology..................................................................................................... 15-11
Components of a Process ................................................................................. 15-12
System Calls and Protection ............................................................................ 15-13
What the Kernel Does...................................................................................... 15-13
Virtual Memory Implementation for MIPS ..................................................... 15-14
Interrupt Handling for MIPS............................................................................ 15-15
How it Works ............................................................................................ 15-16
Assembly Language Programming Tips........................................................................16
32-bit Address or Constant Values .................................................................... 16-1
Use of “Set” Instructions ................................................................................... 16-1
Use of “Set” with Complex Branch Operations ......................................... 16-2
Carry, Borrow, Overflow, and Multi-Precision Math ................................. 16-2
Machine Instructions Reference (Appendix A)..............................................................A
CPU Instruction Overview.................................................................................. A-1
Instruction Classes .............................................................................................. A-1
Instruction Formats ............................................................................................. A-2
Instruction Notation Conventions ....................................................................... A-2
Instruction Notation Examples ..................................................................... A-3
Load and Store Instructions ................................................................................ A-4
Jump and Branch Instructions............................................................................. A-5
Coprocessor Instructions..................................................................................... A-5
System Control Coprocessor (CP0) Instructions ................................................ A-6
Instruction Set Details....................................................................................... A-6
Instruction Summary......................................................................................... A-79
FPA Instruction Reference (Appendix B).......................................................................B
FPU Instruction Set Details .................................................................................B-1
FPU Instructions ...........................................................................................B-1
Floating-Point Data Transfer ........................................................................B-1
Floating-Point Conversions ..........................................................................B-1
Floating-Point Arithmetic .............................................................................B-2
Floating-Point Register-to-Register Move ....................................................B-2
Floating-Point Branch ...................................................................................B-2
FP Computational Instructions and Valid Operands ...........................................B-2
FP Compare and Condition values ......................................................................B-3
FPU Register Specifiers.......................................................................................B-3
32-bit CP1 registers..............................................................................................B-4
FPU Register Access for 32-bit CP1 Registers..............................................B-5
Instruction Notation Conventions ..................................................................B-5
Load and Store Memory ......................................................................................B-6
Instruction Descriptions .......................................................................................B-6
FPA Instruction Set Summary ...........................................................................B-27
CP0 Operation Reference (Appendix C) ........................................................................C
CP0 Operation Details .........................................................................................C-1
MMU Operations .................................................................................................C-1
Exception Operations...........................................................................................C-1
Dand Register Movement Operations............................................................C-1
Operation Descriptions ........................................................................................C-1
Assembler Language Syntax (Appendix D)....................................................................D
Object Code Formats (Appendix E)................................................................................E
Sections and Segments...............................................................................................E-1
ECOFF Object File Format (RISC/OS).....................................................................E-1
File Header...........................................................................................................E-2
Optional a.out Header ..........................................................................................E-2
Example Loader ...................................................................................................E-3
Further Reading ...................................................................................................E-4
ELF (MIPS ABI)........................................................................................................E-4
File Header...........................................................................................................E-4
Program Header ...................................................................................................E-5
Example Loader ...................................................................................................E-6
Further Reading ...................................................................................................E-7
Object Code Tools .....................................................................................................E-7
Glossary of Common "MIPS" Terms............................................................................. F
DRAWINGS
1.1 MIPS 5-Stage Pipeline .................................................. 1-2
1.2 The Pipeline and Branch Delays ......................................... 1-7
1.3 The Pipeline and Load Delays ........................................... 1-8
3.1 PRId Register Fields ................................................... 3-4
3.2 Fields in Status Register .............................................. 3-4
3.3 Fields in the Cause Register ........................................... 3-7
3.4 Fields in the R3071/81 Config Register ................................. 3-8
3.5 Fields in the R3041 Config (Cache Configuration) Register .............. 3-9
3.6 Fields in the R3041 Bus Control (BusCtrl) Register ..................... 3-10
5.1 Direct Mapped Cache .................................................... 5-1
6.1 EntryHi and EntryLo Register Fields .................................... 6-3
6.2 EntryHi and EntryLo Register Fields .................................... 6-3
6.3 Fields in the Index Register ........................................... 6-4
6.4 Fields in the Random Register .......................................... 6-4
6.5 Fields in the Context Register ......................................... 6-4
8.1 FPA Control/Status Register Fields ..................................... 8-6
8.2 FPA Implementation/Revision Register ................................... 8-8
9.1 Program Segments in Memory ............................................. 9-11
10.1 Stackframe for a Non-Leaf Function .................................... 10-5
11.1 Structure Layout and Padding in Memory ................................ 11-3
11.2 Data Representation with #pragma Pack(1) .............................. 11-4
11.3 Data Representation with #pragma Pack(2) .............................. 11-5
11.4 Typical Big-Endians Picture ........................................... 11-8
11.5 Little Endians Picture ................................................ 11-8
11.6 Bitfields and Big-Endian .............................................. 11-9
11.7 Bitfields and Little-Endian ........................................... 11-10
11.8 Garbled String Storage when Mixing Modes .............................. 11-11
11.9 Byte-Lane Swapper ..................................................... 11-12
15.1 Memory Layout of a BSD Process ........................................ 15-12
A.1 CPU Instruction Formats ................................................ A-2

TABLES
1.1 R30xx Family Members Compared .......................................... 1-4
2.1 Conventional Names of Registers with Usage Mnemonics ................... 2-2
3.1 Summary of CPU Control Registers (Not MMU) ............................. 3-3
3.2 ExcCode Values: Different kinds of Exceptions .......................... 3-7
4.1 Reset and Exception Entry Points (Vectors) for R30xx Family ............ 4-3
4.2
Interrupt Bitfields and Interrup Pins .................................................................. 4-13
6.1 CPU Control Registers for Memory Management ............................ 6-3
8.1 Floating Point Data Formats ............................................ 8-4
8.2 Rounding Modes Encoded in FP Control/Status Register ................... 8-7
8.4 FP Move Instructions ................................................... 8-9
8.5 FPA 3-Operand Arithmetic ............................................... 8-10
8.6 FPA Sign-Changing Operators ............................................ 8-10
8.7 FPA Data Conversion Operations ......................................... 8-10
8.8 FP Test Instructions ................................................... 8-11
9.1 Assembler Register and Identifier Conventions .......................... 9-20
9.2 Assembler Instructions ................................................. 9-20
12.1 Test Sequence in Brief ....................................................................................... 12-5
16.1 32-bit Immediate Values.................................................................................... 16-1
16.2 Add-With-Carry................................................................................................. 16-2
16.3 Subtract-with-Borrow Operation ....................................................................... 16-3
A.1 CPU Instruction Operation Notations .................................... A-3
A.2 Load and Store Common Function ......................................... A-4
A.3 Access Type Specifications for Load/Store .............................. A-5
B.1 Format Field Decoding .................................................. B-2
B.2 Logical Negation of Predicates by Condition True/False ................. B-3
B.3 Valid FP Operand Specifiers with 32-bit Coprocessor 1 Registers ........ B-4
B.4 Load and Store Common Functions ........................................ B-6
INTRODUCTION

CHAPTER 1

Integrated Device Technology, Inc.

IDT’s R30xx family of RISC microcontrollers includes the R3051,
R3052, R3071, R3081 and R3041 processors. The different members of
the family offer different price/performance trade-offs, but are all basically
integrated versions of the MIPS R3000A CPU. The R3000A CPU is well
known for the high-performance Unix systems implemented around it; less
publicized but equally impressive is the performance it has brought to a
wide variety of embedded applications.
IDT’s RISController family also includes devices built around MIPS
R4000 64-bit microprocessor technology. These devices, such as the IDT
R4600 Orion microprocessor, offer even higher levels of performance than
the R3000A derivative family. However, these devices also feature slightly
different OS models, and allow 64-bit kernels and applications. Thus, they
are sufficiently different from the R30xx family that this manual is focused
exclusively on the R30xx family.
This manual is aimed at the programmer dealing with the IDT R30xx
family components. Although most programming occurs using a high-level
language (usually “C”), and with little awareness of the underlying system
or processor architecture, certain operations require the programmer to
use assembly programming, and/or be aware of the underlying system or
processor structure. This manual is designed to be consulted when
addressing these types of issues.

WHAT IS A RISC?
The MIPS CPU is one of the “RISC’’ CPUs, born out of a particularly
fertile period of academic research and development. RISC CPUs
(‘‘Reduced Instruction Set Computer’’) share a number of architectural
attributes to facilitate the implementation of high-performance processors.
Most new architectures (as opposed to implementations) since 1986 owe
their remarkable performance to features developed a few years earlier by
a couple of seminal research projects. Someone commented that ‘‘a RISC
is any computer architecture defined after 1984’’; although meant as a jibe
at the industry’s use of the acronym, the comment’s truth also derives
from the widespread acceptance of the conclusions of that research.
One of these was the ‘‘MIPS’’ project at Stanford University. The project
name MIPS puns on the familiar ‘‘millions of instructions per second’’ by
taking its name from the key phrase ‘‘Microprocessor without Interlocked
Pipeline Stages’’. The Stanford group’s work showed that pipelining, a
well-known technique for speeding up computers, had been under-exploited by
earlier architectures.


PIPELINES

Figure 1.1. MIPS 5-stage pipeline
(Diagram: instruction sequence against time; instr 1, instr 2 and instr 3
each pass through stages labeled I-cache, ALU and D-cache/MEM.)

Pipelined processors operate by breaking instruction execution into
multiple small independent “stages”; since the stages are independent,
multiple instructions can be in varying states of completion at any one
time. Also, this organization tends to facilitate higher frequencies of
operation, since very complex activities can be broken down into
“bite-sized” chunks. The result is that multiple instructions are executing at any
one time, and that instructions are initiated (and completed) at very high
frequency. MIPS has consistently been among the most aggressive in the
utilization of these techniques.
Pipelining depends for its success on another technique: using caches
to reduce the amount of time spent waiting for memory. The MIPS R3000A
architecture uses separate instruction and data caches, so it can fetch an
instruction and read or write a memory variable in the same clock phase.
By mating high-frequency operation to high memory-bandwidth, very
high-performance is achieved.
In CISC architectures, caches are often seen as part of memory. A RISC
architecture makes more sense if the dual caches are regarded as very
much part of the CPU; in fact, the pipelines of virtually all RISC processors
require caches to maintain execution. The CPU normally runs from cache
and a cache miss (where data or instructions have to be fetched from
memory) is seen as an exceptional event.
For the R3000A and its derivatives, instruction execution is divided into
five phases (called pipestages), with each pipestage taking a fixed amount
of time (see “MIPS 5-stage pipeline” on page 1-2). Again, note that this
model assumes that instruction fetches and data accesses can be satisfied
from the processor caches at the processor operation frequency. All
instructions are rigidly defined to follow the same sequence of pipestages,
even where the instruction does nothing at some stage.
The net result is that, so long as it keeps hitting the cache, the CPU
starts an instruction every clock.
“Figure 1.1. MIPS 5-stage pipeline” illustrates this operation.
Instruction execution activity can be described as occurring in the
individual pipestages:
• IF : (‘‘instruction fetch’’) gets the next instruction from the instruction
cache (I-cache).
• RD : (‘‘read registers’’) decodes the instruction and fetches the
contents of any CPU registers it uses.
• ALU : (‘‘arithmetic/logic unit’’) performs an arithmetic or logical
operation in one clock (floating point math and integer multiply/
divide can’t be done in one clock and are done differently; this is
described later).


• MEM : the stage where the instruction can read/write memory
variables in the data cache (D-cache). Note that for typical programs,
three out of four instructions do nothing in this stage; but allocating
the stage to each instruction ensures that the processor never has
two instructions wanting the data cache at the same time.
• WB : (‘‘write back’’) stores the value obtained from an operation back to
the register file.
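
The stage-by-stage overlap described above can be sketched in a few lines of C. This is only an illustrative model, not anything taken from the R30xx hardware: the function names (stage_at, total_clocks, print_pipeline) are hypothetical, and the model assumes that every fetch and data access hits the cache, so that instruction i enters IF on clock i and no stalls occur.

```c
#include <assert.h>
#include <stdio.h>

/* Pipestage names, in the order given in the text. */
static const char *const STAGES[] = { "IF", "RD", "ALU", "MEM", "WB" };
enum { NSTAGES = 5 };

/* Pipestage occupied by instruction `instr` (0-based) during clock `clk`
 * (0-based), or NULL if the instruction is not in the pipeline then.
 * Assumption: no cache misses, so instruction i starts on clock i. */
static const char *stage_at(int instr, int clk)
{
    assert(instr >= 0 && clk >= 0);
    int s = clk - instr;
    return (s < NSTAGES && s >= 0) ? STAGES[s] : NULL;
}

/* Clocks needed to complete n back-to-back instructions with no stalls:
 * one clock per instruction, plus the time to drain the pipeline. */
static int total_clocks(int n)
{
    return n > 0 ? n + NSTAGES - 1 : 0;
}

/* Print one row per instruction and one column per clock. */
static void print_pipeline(FILE *out, int n)
{
    for (int i = 0; i < n; i++) {
        fprintf(out, "instr %d:", i + 1);
        for (int c = 0; c < total_clocks(n); c++) {
            const char *s = stage_at(i, c);
            fprintf(out, " %-4s", s ? s : ".");
        }
        fprintf(out, "\n");
    }
}
```

Calling print_pipeline(stdout, 3) from a main function prints a staircase of stages similar in spirit to Figure 1.1. Under the no-miss assumption, three instructions complete in 3 + 5 - 1 = 7 clocks, and on every clock each pipestage holds at most one instruction.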
A rigid pipeline does limit the kinds of things instructions can do; in
particular:
• Instruction length : ALL instructions are 32 bits (exactly one machine
‘‘word’’) long, so that they can be fetched in a constant time. This itself
discourages complexity; there are not enough bits in the instruction
to encode really complicated addressing modes, for example.
• No arithmetic on memory variables : data from cache or memory is
obtained only in stage 4, which is much too late to be available to the
ALU. Memory accesses occur only as simple load or store instructions
which move the data to or from registers (this is described as a ‘‘load/
store architecture’’).
However, the MIPS project architects also attended to the best thinking
of the time about what makes a CPU an easy target for efficient optimizing
compilers. So MIPS CPUs have 32 general purpose registers, 3-operand
arithmetical/logical instructions and eschew complex special-purpose
instructions which compilers can’t usually generate.

THE IDT R3xxx FAMILY CPUS
MIPS Corporation was formed in 1984 to make a commercial version of
the Stanford MIPS CPU. The commercial CPU was enhanced with memory
management hardware, first appearing late in 1985 as the R2000. An
ambitious external floating point math co-processor (the R2010 FPA) first
shipped in mid-87. The R3000, shipped in 1988, is almost identical from
the programmer’s viewpoint (although small hardware enhancements
combined to give a substantial boost to performance). The R3000A followed
in 1989, to improve the frequency of operation over the original
R3000 (other minor enhancements were added, such as the ability for user
tasks to operate with the opposite “endianness” from the kernel).
The R2000/R3000 chips include a cache controller – the
implementation of external caches merely required a few industry
standard SRAMs and some address latches. The math co-processor shares
the cache buses to interpret instructions (in parallel with the integer CPU)
and transfer operands and results between the FPA and memory or the
integer CPU.
The division of function was ingenious, practical and workable, allowing
the R2000/3000 generation to be built without extravagant ultra-high pin-count packages. However, as clock speeds increased the very high-speed
signals in the cache interface increased design complexity and limited
operational frequency. In addition, overall chip count for the basic
execution core proved to be a limitation for area and power sensitive
embedded systems.
The R3051, R3052, R3071, R3081 and R3041 are the members (so far)
of a family of products defined, designed, and manufactured by IDT. The
chips integrate the functions of the R3000A CPU, cache memory and
(R3081 only) math co-processor. This means that all the fastest logic is on
chip; so the integrated chips are not only cheaper and smaller than the
original implementation, but also much easier to use.
The parts differ in their cache sizes, whether they include on-chip MMU
and/or FPA, clock rates and packaging options. In addition, although all
parts can be used pin-compatibly, certain products feature optional
enhancements in their bus-interface that may serve to reduce system cost
or complexity, and other subtle enhancements for cost or performance.
The major differences are summarized in “Table 1.1. R30xx family
members compared”.

Part    Cache (I+D)      MMU  FPA  Clock (MHz)  Package Options  System Interface
3051    4K + 1K          –    –    20-40        PLCC             32-bit MUX’ed A/D
3051E   4K + 1K          ×    –    20-40        PLCC             32-bit MUX’ed A/D
3052    8K + 2K          –    –    20-40        PLCC             32-bit MUX’ed A/D
3052E   8K + 2K          ×    –    20-40        PLCC             32-bit MUX’ed A/D
3081    16K+4K / 8K+8K   –    ×    20-50        PLCC             Optional 1/2 frequency bus operation; optional 1x Clock Input
3081E   16K+4K / 8K+8K   ×    ×    20-50        PLCC             Optional 1/2 frequency bus operation; optional 1x Clock Input
3071    16K+4K / 8K+8K   –    –    33-50        PLCC             1/2 frequency bus operation; 1x Clock Input
3071E   16K+4K / 8K+8K   ×    –    33-50        PLCC             1/2 frequency bus operation; 1x Clock Input
3041    2K + 0.5K        –    –    16-25        PLCC, TQFP       Variable port width interface

Table 1.1. R30xx family members compared

MIPS ARCHITECTURE LEVELS
There are multiple generations of the MIPS architecture. The most
commonly discussed are the MIPS-1, MIPS-2, and MIPS-3 architectures.
MIPS-1 is the ISA found in the R2000 and R3000 generation CPUs. It is
a 32-bit ISA, and defines the basic instruction set. Any application written
with the MIPS-1 instruction set will operate correctly on all generations of
the architecture.
The MIPS-2 ISA is also 32-bit. It adds some instructions to speed up
floating point data movement, branch-likely instructions, and other minor
enhancements. This was first implemented in the MIPS R6000 ECL
microprocessor.
The MIPS-3 ISA is a 64-bit ISA. In addition to supporting all MIPS-1 and
MIPS-2 instructions, the MIPS-3 ISA contains 64-bit equivalents of certain
earlier instructions that are sensitive to operand size (e.g. load double and
load word are both supported), including doubleword (64-bit) data
movement and arithmetic. This ISA was first implemented in the R4000 as
a clean (“seamless”) transition from the existing 32-bit architecture.
Note that these ISA levels do not necessarily imply a particular structure
for the MMU, caches, exception model, or other kernel specific resources.
Thus, different implementations of ISA compatible chips may require
different kernels.
In the case of the R30xx family, all devices implement the MIPS-1 ISA.
Many devices are also kernel compatible with the R3000A, but some
devices (most notably those without an MMU) may require small kernel
changes or different boot modules†.

† Historically, many embedded MIPS applications have run
exclusively out of the “kseg0 and kseg1” memory regions
(described later in the book). For these applications, the presence
or absence of the MMU is largely irrelevant.

MIPS-1 COMPARED WITH CISC ARCHITECTURES
Although the MIPS architecture is fairly straightforward, there are a few
features, visible only to assembly programmers, which may at first appear
surprising. In addition, some operations familiar from CISC architectures
are irrelevant to the MIPS architecture. For example, the MIPS architecture
does not mandate a stack pointer or stack usage; thus, programmers may
be surprised to find that push/pop instructions do not exist directly.
The most notable of these features are summarized here.

Unusual instruction encoding features
• All instructions are 32 bits long : as mentioned above. This means, for
example, that it is impossible to incorporate a 32-bit constant into a
single instruction (there would be no instruction bits left to encode
the operation and the registers!). A ‘‘load immediate’’ instruction is
limited to a 16-bit value; a special ‘‘load upper immediate’’ must be
followed by an ‘‘or immediate’’ to put a 32-bit constant value into a
register.
• Instruction actions must fit the pipeline : actions can only be carried out
in the designated pipeline phase, and must be complete in one clock.
For example, the register writeback phase provides for just one value
to be stored in the register file, so instructions can only change one
register.
• 3-operand instructions : arithmetic/logical operations don’t have to
specify memory locations, so there are plenty of instruction bits to
define two independent source and one destination register.
Compilers love 3-operand instructions, which give optimizers more
scope to improve the code which handles complex expressions.
• 32 registers : the choice of 32 has become universal; compilers like a
large (but not necessarily too large) number of registers, but there is
a cost in context-saving and in encoding the registers to be used by
an instruction. Register $0 always returns zero, to give a compact
encoding of that useful constant.
• No condition codes : the MIPS architecture does not provide condition
code flags implicitly set by arithmetical operations. The motivation is
to make sure that execution state is stored in one place – the register
file. Conditional branches (in MIPS) test a single register for sign/zero,
or a pair of registers for equality.
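
As an illustration of the first point above, the following C sketch models how a 32-bit constant is split into the two 16-bit immediates carried by a ‘‘load upper immediate’’ followed by an ‘‘or immediate’’. The helper names (hi16, lo16, model_lui, model_ori, load_const32) are hypothetical and exist only for this example; the code models the arithmetic of the two instructions and is not a description of any particular assembler’s output.

```c
#include <assert.h>
#include <stdint.h>

/* Upper 16 bits: the immediate a "load upper immediate" would carry. */
static uint16_t hi16(uint32_t value) { return (uint16_t)(value >> 16); }

/* Lower 16 bits: the immediate the following "or immediate" would carry. */
static uint16_t lo16(uint32_t value) { return (uint16_t)(value & 0xFFFFu); }

/* Model of load upper immediate: the immediate becomes the upper halfword
 * of the register and the lower halfword is cleared. */
static uint32_t model_lui(uint16_t imm) { return (uint32_t)imm << 16; }

/* Model of or immediate: the zero-extended immediate is ORed into the
 * register value. */
static uint32_t model_ori(uint32_t reg, uint16_t imm) { return reg | imm; }

/* Build a 32-bit constant the way the two-instruction sequence would. */
static uint32_t load_const32(uint32_t value)
{
    uint32_t reg = model_lui(hi16(value));   /* first instruction  */
    reg = model_ori(reg, lo16(value));       /* second instruction */
    assert(reg == value);                    /* the halves recombine exactly */
    return reg;
}
```

For example, load_const32(0x12345678) uses 0x1234 as the upper immediate and 0x5678 as the lower one. Because the ‘‘or immediate’’ zero-extends its operand, the two halves never interfere with each other.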

Addressing and memory accesses
• Memory references are always register loads and stores : arithmetic on
memory variables upsets the pipeline, so is not done. Memory
references only occur due to explicit load or store instructions. The
large register file allows multiple variables to be “on-chip”
simultaneously.
• Only one data addressing mode : all loads and stores define the
memory location with a single base register value modified by a 16-bit
signed displacement. Note that the assembler/compiler tools can use
the $0 register, along with the immediate value, to synthesize
additional addressing modes from this one directly supported mode.
• Byte-addressed : the instruction set includes load/store operations
for 8- and 16-bit variables (referred to as byte and halfword). Partial-word
load instructions come in two flavors – sign-extend and zero-extend.
• Loads/stores must be address-aligned : memory word operations can
only load or store data from a single 4-byte aligned word; halfword
operations must be aligned on half-word addresses. Many CISC
microprocessors will load/store a multi-byte item from any byte
address (although unaligned transfers always take longer).
Techniques to generate code which will handle unaligned data
efficiently will be explained later.
• Jump instructions : The smallest op-code field in a MIPS instruction is
6 bits; leaving 26 bits to define the target of a jump. Since all
instructions are 4-byte aligned in memory the two least-significant
1–5

CHAPTER 1

INTRODUCTION

address bits need not be stored, allowing an address range of 228 =
256Mbytes. Rather than make this branch PC-relative, this is
interpreted as an absolute address within a 256Mbyte ‘‘segment’’. In
theory, this could impose a limit on the size of a single program; in
reality, it hasn’t been a problem.
Branches out of segment can be achieved by using a jr instruction,
which uses the contents of a register as the target.
Conditional branches have only a 16-bit displacement field (218 byte
range since instructions are 4-byte aligned) which is interpreted as a
signed PC-relative displacement. Compilers can only code a simple
conditional branch instruction if they know that the target will be
within 128Kbytes of the instruction following the branch.
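
The address arithmetic for jumps and conditional branches can be sketched in a few lines of Python. This is only an illustration and is not part of any toolchain: the function names are hypothetical, and the sketch assumes the usual MIPS rule that a jump takes its top four address bits from the instruction following the jump (which is the same segment as the jump itself, except right at a 256Mbyte boundary).

```python
MASK32 = 0xFFFFFFFF

def sign_extend16(value):
    """Interpret the low 16 bits of value as a signed number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value

def jump_target(pc, target26):
    """Absolute target of a j/jal at address pc with a 26-bit target field."""
    segment = (pc + 4) & 0xF0000000      # top 4 bits select the 256Mbyte segment
    return segment | ((target26 & 0x03FFFFFF) << 2)

def branch_target(pc, offset16):
    """Target of a conditional branch at address pc with a 16-bit offset field."""
    # The displacement counts words from the instruction after the branch.
    return (pc + 4 + (sign_extend16(offset16) << 2)) & MASK32
```

With these definitions a branch whose offset field holds -1 (0xFFFF) branches to itself, and the largest positive offset reaches just under 128Kbytes beyond the instruction following the branch.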

Operations not directly supported
• No byte or halfword arithmetic : all arithmetical and logical operations
are performed on 32-bit quantities. Byte and/or halfword arithmetic
would require significant extra resources, many more op-codes, and
is an understandable omission. Most C programmers will use the int
data type for most arithmetic, and for MIPS an int is 32 bits and such
arithmetic will be efficient. C’s rules are to perform arithmetic in int
whenever any source or destination variable is as long as int.
However, where a program explicitly does arithmetic as short the
compiler must insert extra code to make sure that wraparound and
overflows have the appropriate effect.
• No special stack support : conventional MIPS assembler usage does
define a sp register, but the hardware treats sp just like any other
register. There is a recommended format for the stack frame layout of
subroutines, so that programs can mix modules from different
languages and compilers; it is recommended that programmers stick
to these conventions, but they have no relationship to the hardware.
• Minimal subroutine overhead : there is one special feature; jump
instructions have a ‘‘jump and link’’ option which stores the return
address into a register. $31 is the default, so for convenience and by
convention $31 becomes the ‘‘return address’’ register.
• Minimal interrupt overhead : The MIPS architecture makes very few
presumptions about system exception handling, allowing fast
response and a wide variety of software models. In the R30xx family,
the CPU stashes away the restart location in the special register EPC,
modifies the machine state just enough to signal why the trap
happened and to disallow further interrupts; then it jumps to a single
predefined location† in low memory. Everything else is up to the
software.
Just to emphasize this: on an interrupt or trap a MIPS CPU does not
store anything on a stack, or write memory, or preserve any registers
by itself.
By convention, two registers ($k0, $k1; register conventions are
explained in chapter 2) are reserved so that interrupt/trap routines
can ‘‘bootstrap’’ themselves – it is impossible to do anything on a MIPS
CPU without using some registers. For a program running in any
system which takes interrupts or traps, the values of these registers
may change at any time, and thus should not be used.

† One particular kind of trap (a TLB miss on an address in the
user-privilege address space) has a different dedicated entry point.

Multiply and divide operations
The MIPS CPU does have an integer multiply/divide unit; worth
mentioning because many RISC machines don’t have multiply hardware.
The multiply unit is relatively independent of the rest of the CPU, with its
own special output registers.

Programmer-visible pipeline effects
In addition to the discussion above, programmers of R3xxx architecture
CPUs also must be aware of certain effects of the MIPS pipeline.
Specifically, the results of certain operations may not be available in the
immediately subsequent instruction; the programmer may need to be
explicitly aware of such cases.

Figure 1.2. The pipeline and branch delays
(Diagram rows: branch, branch delay, branch target. Labelled stages: ALU,
MEM. Labelled path: branch addr.)

• Delayed branches : the pipeline structure of the MIPS CPU (see “Figure
1.2. The pipeline and branch delays”) means that when a jump
instruction reaches the ‘‘execute’’ phase and a new program counter
is generated, the instruction after the jump will already have been
decoded. Rather than discard this potentially useful work, the
architecture rules state that the instruction after a branch is always
executed before the instruction at the target of the branch.
“Figure 1.2. The pipeline and branch delays” shows that a special path
is provided through the ALU to make the branch address available a
half-clock early, ensuring that there is only a one cycle delay before
the outcome of the branch is determined and the appropriate
instruction flow (branch taken or not taken) is initiated.
It is the responsibility of the compiler system or the assembler-programmer
to allow for and even to exploit this “branch delay slot”;
it turns out that it is usually possible to arrange code such that the
instruction in the ‘‘delay slot’’ does useful work. Quite often, the
instruction which would otherwise have been placed before the
branch can be moved into the delay slot.
This can be a bit tricky on a conditional branch, where the branch
delay instruction must be (at least) harmless on the path where it isn’t
wanted. Where nothing useful can be done the delay slot is filled with
a ‘‘nop’’ (no-op, or no-operation) instruction.
Many MIPS assemblers will hide this feature from the programmer
unless explicitly told not to, as described later.
• Load data not available to next instruction : another consequence of
the pipeline is that a load instruction’s data arrives from the cache/
memory system AFTER the next instruction’s ALU phase starts – so it
is not possible to use the data from a load in the following instruction.
See “Figure 1.3. The pipeline and load delays” for how this works. On
the MIPS-1 architecture, the programmer must ensure that this rule
is not violated.

Figure 1.3. The pipeline and load delays
(Diagram rows: load, load delay, use data. Labelled stages: ALU, MEM,
D-cache MEM rd.)

Again, most assemblers will hide this if they can. Frequently, the
assembler can move an instruction which is independent of the load
into the load delay slot; in the worst case, it can insert a NOP to ensure
proper program execution.
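
The worst-case behavior (inserting a NOP) can be sketched as below. This is a simplified model rather than any real assembler's algorithm: the instruction representation (a tuple of mnemonic, destination register and source registers) and the function name are assumptions made for the illustration, and no attempt is made to move an independent instruction into the slot.

```python
LOADS = {"lw", "lh", "lhu", "lb", "lbu"}

def fill_load_delays(program):
    """Insert a nop wherever an instruction uses the result of the load before it.

    Each instruction is a tuple: (mnemonic, destination, list of source registers).
    """
    result = []
    for index, instr in enumerate(program):
        result.append(instr)
        mnemonic, dest, _ = instr
        if mnemonic in LOADS and index + 1 < len(program):
            next_sources = program[index + 1][2]
            if dest in next_sources:
                # The loaded value is not yet available: pad the delay slot.
                result.append(("nop", None, []))
    return result
```

A load followed by an instruction that does not read the loaded register is left alone, so the padding only costs a cycle where the rule would otherwise be violated.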

A NOTE ON MACHINE AND ASSEMBLER LANGUAGE
To simplify assembly level programming, the MIPS Corp’s assembler
(and many other MIPS assemblers) provides a set of “synthetic”
instructions. Typically, a synthetic instruction is a common assembly level
operation that the assembler will map into one or more true instructions.
This mapping can be more intelligent than a mere macro expansion. For
example, the programmer just writes a ‘‘load immediate’’ instruction and
the assembler will figure out whether it can get by with just one machine
instruction or needs to generate several, depending on the size of the
immediate datum. Used this way, synthetic instructions can dramatically
simplify assembly level programming.
This is obviously useful, but can be confusing. This manual will try to
use synthetic instructions sparingly, and indicate when it happens.
Moreover, the instruction tables below will consistently distinguish
between synthetic and machine instructions.
These features are there to help human programmers; most compilers
generate instructions which are one-for-one with machine code. However,
some compilers will in fact generate synthetic instructions.
Helpful things the assembler does:
• 32-bit load immediates : The programmer can code a load with any
value (including a memory location which will be computed at link
time), and the assembler will break it down into two instructions to
load the high and low half of the value.
• Load from memory location : The programmer can code a load from a
memory-resident variable. The assembler will normally replace this
by loading a temporary register with the high-order half of the
variable’s address, followed by a load whose displacement is the low-order
half of the address.
Of course, this does not apply to variables defined inside C functions,
which are implemented either in registers or on the stack.
• Efficient access to memory variables : some C programs contain many
references to static or extern variables, and a two-instruction
sequence to load/store any of them is expensive. Some compilation
systems, with run-time support, get around this. Certain variables
are selected at compile/assemble time (by default MIPS Corp’s
assembler selects variables which occupy 8 or fewer bytes of storage)
and kept together in a single section of memory which must end up
smaller than 64Kbytes. The run-time system then initializes one
register ($28 or gp (global pointer) by convention) to point to the
middle of this section.
Loads and stores to these variables can now be coded as a single gp
relative load or store.
• More types of branch condition : the assembler synthesizes a full set of
branches conditional on an arithmetic test between two registers.
• Simple or different forms of instructions : unary operations such as not
and neg are produced as a nor or sub with the zero-valued register $0.
Two-operand forms of 3-operand instructions can be written; the
assembler will put the result back into the first-specified register.
• Hiding the branch delay slot: in normal coding most assemblers will
not allow access to the branch delay slot. MIPS Corp.’s assembler, in
particular, is exceptionally ingenious and may re-organize the
instruction sequence substantially in search of something useful to
do in the delay slot. An assembler directive ‘‘.set noreorder’’ is available
where this must not happen.
• Hiding the load delay: many assemblers will detect an attempt to use
the result of a load in the next instruction, and will either move code
around or insert a nop.
• Unaligned transfers: the ‘‘unaligned’’ load/store instructions will
fetch halfword and word quantities correctly, even if the target
address turns out to be unaligned.
• Other pipeline corrections: some instructions (such as those which use
the integer multiply unit) have additional constraints that are
implementation specific (see the Appendix on hazards). Many
assemblers will just “handle” these cases automatically, or at least
warn the programmer about possible hazard violations.
• Other optimizations: some MIPS instructions (particularly floating
point) take multiple clocks to produce results. However, the hardware
is ‘‘interlocked’’, so the programmer does not need to be aware of these
delays to write correct programs. But MIPS Corp.’s assembler is
particularly aggressive in these circumstances, and will perform
substantial code movement to try to make it run faster. This may need
to be considered when debugging.
In general, it is best to use a disassembler utility to disassemble a
resulting binary during debug. This will show the system designers the
true code sequence being executed, and thus “uncover” the modifications
made by the assembler or compiler.
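
As an illustration of the first two items in the list above, the sketch below shows how a 32-bit value might be split into two 16-bit halves. It is not taken from any particular assembler and the function names are hypothetical. It assumes the common pairing of lui with ori for a load immediate (ori zero-extends its constant), and of lui with a load or store for a memory variable (where the displacement is signed, so the upper half must compensate when the lower half is negative).

```python
def split_for_load_immediate(value):
    """Halves for a lui/ori pair: ori zero-extends its 16-bit constant."""
    value &= 0xFFFFFFFF
    return value >> 16, value & 0xFFFF

def split_for_load_address(address):
    """Halves for lui plus a load/store, whose 16-bit displacement is signed."""
    address &= 0xFFFFFFFF
    low = address & 0xFFFF
    if low & 0x8000:
        low -= 0x10000                        # displacement will be negative
    high = ((address - low) >> 16) & 0xFFFF   # compensate in the upper half
    return high, low
```

In both cases the original value is recovered as (high << 16) + low, truncated to 32 bits.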

CHAPTER 2

MIPS-1 (R30xx) ARCHITECTURE

PROGRAMMER’S VIEW OF THE PROCESSOR ARCHITECTURE
This chapter describes the assembly programmer’s view of the CPU
architecture, in terms of registers, instructions, and computational
resources. This viewpoint corresponds, for example, to an assembly
programmer writing user applications (although more typically, such a
programmer would use a high-level language).
Information about kernel software development (such as handling
interrupts, traps, and cache and memory management) is described in
later chapters.

Registers
There are 32 general purpose registers: $0 to $31. Two, and only two,
are special to the hardware:
• $0 always returns zero, no matter what software attempts to store to
it.
• $31 is used by the normal subroutine-calling instruction (jal) for the
return address. Note that the call-by-register version (jalr) can use
ANY register for the return address, though practice is to use only
$31.
In all other respects all registers are identical and can be used in any
instruction ($0 can be used as the destination of instructions; the value of
$0 will remain unchanged, however, so the instruction would be effectively
a NOP).
In the MIPS architecture the ‘‘program counter’’ is not a register, and it
is probably better to not think of it that way. The return address of a jal is
two instructions later in sequence (the instruction after the jump delay slot
instruction); the instruction after the call is the call’s ‘‘delay slot’’ and is
typically used to set up the last parameter.
There are no condition codes and nothing in the ‘‘status register’’ or
other CPU internals is of any consequence to the user-level programmer.
There are two registers associated with the integer multiplier. These
registers, referred to as “HI” and “LO”, contain the 64-bit product result of
a multiply operation, or the quotient and remainder of a divide.
The floating point math co-processor (called FPA for floating point
accelerator), if available, adds 32 floating point registers†; in simple
assembler language they are just called $0 to $31 again – the fact that
these are floating point registers is implicitly defined by the instruction.
Actually, only the 16 even-numbered registers are usable for math; but
they can be used for either single-precision (32-bit) or double-precision
(64-bit) numbers. When performing double-precision arithmetic, odd-numbered
register $N+1 holds the remaining bits of the even-numbered
register identified as $N. Only moves between integer and FPA, or FPA load/
store instructions, ever refer to odd-numbered registers (and even then the
assembler helps the programmer forget...)

† The FPA also has a different set of registers called ‘‘co-processor
1 registers’’ for control purposes. These are typically used to
manage the actions/state of the FPA, and should not be confused
with the FPA data registers.

Conventional names and uses of general-purpose registers
Although the hardware makes few rules about the use of registers, their
practical use is governed by a number of conventions. These conventions
allow inter-changeability of tools, operating systems, and library modules.
It is strongly recommended that these conventions be followed.
Reg No   Name    Used for

0        zero    Always returns 0
1        at      (assembler temporary) Reserved for use by assembler
2-3      v0-v1   Value (except FP) returned by subroutine
4-7      a0-a3   (arguments) First four parameters for a subroutine
8-15     t0-t7   (temporaries) subroutines may use without saving
24-25    t8-t9
16-23    s0-s7   Subroutine ‘‘register variables’’; a subroutine which will
                 write one of these must save the old value and restore it
                 before it exits, so the calling routine sees their values
                 preserved.
26-27    k0-k1   Reserved for use by interrupt/trap handler - may change
                 under your feet
28       gp      global pointer - some runtime systems maintain this to give
                 easy access to (some) ‘‘static’’ or ‘‘extern’’ variables.
29       sp      stack pointer
30       s8/fp   9th register variable. Subroutines which need one can use
                 this as a ‘‘frame pointer’’.
31       ra      Return address for subroutine

Table 2.1. Conventional names of registers with usage mnemonics

With the conventional uses of the registers go a set of conventional
names. Given the need to fit in with the conventions, use of the
conventional names is pretty much mandatory. The common names are
described in Table 2.1, “Conventional names of registers with usage
mnemonics”.
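
For tools which need to translate between the conventional names and the register numbers, the table can be captured as a simple lookup. The sketch below is only an illustration; the names REGISTER_NAMES and register_number are hypothetical and do not come from any assembler.

```python
REGISTER_NAMES = (
    ["zero", "at", "v0", "v1", "a0", "a1", "a2", "a3"]
    + ["t%d" % n for n in range(8)]          # $8-$15  are t0-t7
    + ["s%d" % n for n in range(8)]          # $16-$23 are s0-s7
    + ["t8", "t9", "k0", "k1", "gp", "sp", "fp", "ra"]
)

def register_number(name):
    """Return the register number for a conventional name such as 'sp' or '$sp'."""
    name = name.lstrip("$")
    if name == "s8":                         # s8 is another name for fp ($30)
        name = "fp"
    return REGISTER_NAMES.index(name)
```

Note that t8 and t9 are numbered after the s registers, so the temporaries are not contiguous.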
Notes on conventional register names
• at : this register is reserved for use inside the synthetic instructions
generated by the assembler. If the programmer must use it explicitly
the directive .set noat stops the assembler from using it, but then there
are some things the assembler won’t be able to do.
• v0-v1 : used when returning non-floating-point values from a
subroutine. To return anything bigger than 2×32 bits, memory must
be used (described in a later chapter).
• a0-a3 : used to pass the first four non-FP parameters to a subroutine.
That’s an occasionally-false oversimplification; the actual convention
is fully described in a later chapter.
• t0-t9 : by convention, subroutines may use these values without
preserving them. This makes them easy to use as ‘‘temporaries’’ when
evaluating expressions – but a caller must remember that they may
be destroyed by a subroutine call.
• s0-s8 : by convention, subroutines must guarantee that the values of
these registers on exit are the same as they were on entry – either by
not using them, or by saving them on the stack and restoring before
exit.
This makes them eminently suitable for use as ‘‘register variables’’ or
for storing any value which must be preserved over a subroutine call.
• k0-k1 : reserved for use by the trap/interrupt routines, which will not
restore their original value; so they are of little use to anyone else.
• gp : (global pointer). If present, it will point to a load-time-determined
location in the midst of your static data. This means that loads and
stores to data lying within 32Kbytes either side of the gp value can be
performed in a single instruction using gp as the base register.
Without the global pointer, loading data from a static memory area
takes two instructions: one to load the most significant bits of the 32-bit
constant address computed by the compiler and loader, and one
to do the data load.
To use gp a compiler must know at compile time that a datum will end
up linked within a 64Kbyte range of memory locations. In practice it
can’t know, only guess. The usual practice is to put ‘‘small’’ global
data items in the area pointed to by gp, and to get the linker to
complain if it still gets too big. The definition of what is “small” can
typically be specified with a compiler switch (most compilers use “-G”).
The most common default size is 8 bytes or less.
Not all compilation systems or OS loaders support gp.
• sp : (stack pointer). Since it takes explicit instructions to raise and
lower the stack pointer, it is generally done only on subroutine entry
and exit; and it is the responsibility of the subroutine being called to
do this. sp is normally adjusted, on entry, to the lowest point that the
stack will need to reach at any point in the subroutine. Now the
compiler can access stack variables by a constant offset from sp.
Stack usage conventions are explained in a later chapter.
• fp : (also known as s8). A subroutine will use a ‘‘frame pointer’’ to keep
track of the stack if it wants to use operations which involve extending
the stack by an amount which is determined at run-time. Some
languages may do this explicitly; assembler programmers are always
welcome to experiment; and (for many toolchains) C programs which
use the ‘‘alloca’’ library routine will find themselves doing so.
In this case it is not possible to access stack variables from sp, so fp
is initialized by the function prologue to a constant position relative
to the function’s stack frame. Note that a ‘‘frame pointer’’ subroutine
may call or be called by subroutines which do not use the frame
pointer; so long as the functions it calls preserve the value of fp (as
they should) this is OK.
• ra : (return address). On entry to any subroutine, ra holds the address
to which control should be returned – so a subroutine typically ends
with the instruction ‘‘jr ra’’.
Subroutines which themselves call subroutines must first save ra,
usually on the stack.

Integer multiply unit and registers
MIPS’ architects decided that integer multiplication was important
enough to deserve a hard-wired instruction. This is not so common in
RISCs, which might instead:
• implement a ‘‘multiply step’’ which fits in the standard integer
execution pipeline, and require software routines for every
multiplication (e.g. Sparc or AM29000); or
• perform integer multiplication in the floating point unit – a good
solution but which compromises the optional nature of the MIPS
floating point ‘‘co-processor’’.
The multiply unit consumes a small amount of die area, but
dramatically improves performance (and cache performance) over
“multiply step” operations. Its basic operation is to multiply two 32-bit
values together to produce a 64-bit result, which is stored in two 32-bit
registers (called ‘‘hi’’ and ‘‘lo’’) which are private to the multiply unit.
Instructions mfhi, mflo are defined to copy the result out into general
registers.
Unlike results for integer operations, the multiply result registers are
interlocked. An attempt to read out the results before the multiplication is
complete results in the CPU being stopped until the operation completes.
The integer multiply unit will also perform an integer division between
values in two general-purpose registers; in this case the ‘‘lo’’ register stores
the quotient, and the ‘‘hi’’ register the remainder.
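
The way results are divided between ‘‘hi’’ and ‘‘lo’’ can be modelled as below. This sketch covers only the unsigned forms of the operations, and the function names are chosen to echo the multu and divu mnemonics; it models the register contents, not the timing or the interlock.

```python
MASK32 = 0xFFFFFFFF

def multu(a, b):
    """Unsigned 32x32 multiply; returns (hi, lo) halves of the 64-bit product."""
    product = (a & MASK32) * (b & MASK32)
    return product >> 32, product & MASK32

def divu(a, b):
    """Unsigned divide; returns (hi, lo) = (remainder, quotient)."""
    a &= MASK32
    b &= MASK32
    if b == 0:
        # The hardware raises no exception here; the model refuses instead.
        raise ValueError("hardware result is undefined for divide by zero")
    return a % b, a // b
```

A program would follow either operation with mfhi and/or mflo to copy the half it needs into a general-purpose register.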
In the R30xx family, multiply operations take 12 clocks and division
takes 35. The assembler has a synthetic multiply operation which starts
the multiply and then retrieves the result into an ordinary register. Note
that MIPS Corp.’s assembler may even substitute a series of shifts and
adds for multiplication by a constant, to improve execution speed.
Multiply/divide results are written into ‘‘hi’’ and ‘‘lo’’ as soon as they are
available; the effect is not deferred until the writeback pipeline stage, as
with writes to general purpose (GP) registers. If a mfhi or mflo instruction
is interrupted by some kind of exception before it reaches the writeback
stage of the pipeline, it will be aborted with the intention of restarting it.
However, a subsequent multiply instruction which has passed the ALU
stage will continue (in parallel with exception processing) and would
overwrite the ‘‘hi’’ and ‘‘lo’’ register values, so that the re-execution of the
mfhi would get wrong (i.e. new) data. For this reason it is recommended
that a multiply should not be started within two instructions of an mfhi/
mflo. The assembler will avoid doing this where it can.
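
A check for this recommendation is easy to sketch. The function below is a hypothetical illustration, not the assembler's actual algorithm: it takes a list of mnemonics in program order and reports any multiply or divide that starts within two instructions after an mfhi or mflo.

```python
def find_hilo_hazards(mnemonics):
    """Return indexes of multiply/divide instructions that start within two
    instructions after an mfhi or mflo."""
    starters = {"mult", "multu", "div", "divu"}
    readers = {"mfhi", "mflo"}
    hazards = []
    for index, mnemonic in enumerate(mnemonics):
        if mnemonic in starters:
            # Look back at the (up to) two instructions just before this one.
            recent = mnemonics[max(0, index - 2):index]
            if any(m in readers for m in recent):
                hazards.append(index)
    return hazards
```

A tool using such a check could separate the offending pair with nops or with unrelated instructions.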
Integer multiply and divide operations never produce an exception,
though divide by zero produces an undefined result. Compilers will often
generate code to trap on errors, particularly on divide by zero. Frequently,
this instruction sequence is placed after the divide is initiated, to allow it
to execute concurrently with the divide (and avoid a performance loss).
Instructions mthi, mtlo are defined to setup the internal registers from
general-purpose registers. They are essential to restore the values of ‘‘hi’’
and ‘‘lo’’ when returning from an exception, but probably not for anything
else.

Instruction types
A full list of R30xx family integer instructions is presented in Appendix
A. Floating point instructions are listed in Appendix B of this manual.
Currently, floating point instructions are only available in the R3081, and
are described in the R3081 User’s Manual.
The MIPS-1 ISA uses only three basic instruction encoding formats; this
is one of the keys to the high frequencies attained by RISC architectures.
Instructions are mostly in numerical order; to simplify reading, the list
is occasionally re-ordered for clarity.
Throughout this manual, the description of various instructions will
also refer to various subfields of the instruction. In general, the following
typical nomenclature is used:
op        The basic op-code, which is 6 bits long. Instructions which have
          large sub-fields (for example, large immediate values, such as
          required for the ‘‘long’’ j/jal instructions, or arithmetic with a
          16-bit constant) have a unique ‘‘op’’ field. Other instructions are
          classified in groups sharing an ‘‘op’’ value, distinguished by
          other fields (‘‘op2’’ etc.).
rs, rs1, rs2
          One or two fields identifying source registers.
rd        The register to be changed by this instruction.
sa        Shift-amount: How far to shift, used in shift-by-constant
          instructions.
op2       Sub-code field used for the 3-register arithmetic/logical group of
          instructions (op value of zero).
offset    16-bit signed word offset defining the destination of a
          ‘‘PC-relative’’ branch. The branch target will be the instruction
          ‘‘offset’’ words away from the ‘‘delay slot’’ instruction after the
          branch; so a branch-to-self has an offset of -1.
target    26-bit word address to be jumped to (it corresponds to a 28-bit
          byte address, which is always word-aligned). The long j
          instruction is rarely used, so this format is pretty much
          exclusively for function calls (jal).
          The high-order 4 bits of the target address can’t be specified by
          this instruction, and are taken from the address of the jump
          instruction. This means that these instructions can reach
          anywhere in the 256Mbyte region around the instructions’
          location. To jump further use a jr (jump register) instruction.
constant  16-bit integer constant for ‘‘immediate’’ arithmetic or logic
          operations.
mf        Yet another extended opcode field, this time used by
          ‘‘co-processor’’ type instructions.
rg        Field which may hold a source or destination register.
crg       Field to hold the number of a CPU control register (different from
          the integer register file). Called ‘‘crs’’/‘‘crd’’ in contexts where it
          must be a source/destination respectively.
The instruction encodings have been chosen to facilitate the design of a
high-frequency CPU, and they do reveal portions of the internal CPU
design. Although there are variable encodings, those fields which are
required very early in the pipeline are encoded in a very regular way:
• Source registers are always in the same place : so that the CPU can
fetch two operands from the integer register file without any
conditional decoding. Some instructions may not need both registers
– but since the register file is designed to provide two source values
on every clock nothing has been lost.
• 16-bit constant is always in the same place : permitting the
appropriate instruction bits to be fed directly into the ALU’s input
multiplexer, without conditional shifts.
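
Because the fields sit at fixed positions, every candidate field can be extracted from an instruction word without first decoding the op-code. The sketch below assumes the standard MIPS bit positions and borrows the field names used in this chapter where it can; the function name is hypothetical, and which fields are meaningful depends on the instruction.

```python
def decode_fields(word):
    """Split a 32-bit instruction word into its candidate fields."""
    return {
        "op":       (word >> 26) & 0x3F,     # basic op-code
        "rs1":      (word >> 21) & 0x1F,     # first source register
        "rs2":      (word >> 16) & 0x1F,     # second source (or immediate-form dest)
        "rd":       (word >> 11) & 0x1F,     # destination of 3-register forms
        "sa":       (word >> 6) & 0x1F,      # shift amount
        "op2":      word & 0x3F,             # sub-code when op is zero
        "constant": word & 0xFFFF,           # 16-bit immediate or branch offset
        "target":   word & 0x03FFFFFF,       # 26-bit jump target
    }
```

For example, the word 0x00851020 (add $2, $4, $5) yields op 0, source registers 4 and 5, destination 2 and op2 0x20, while 0x8C410008 (lw $1, 8($2)) yields op 0x23, base register 2 and constant 8.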

Loading and storing: addressing modes
As mentioned above, there is only one basic ‘‘addressing mode’’. Any
load or store machine instruction can be written as:
    operation dest-reg, offset(src-reg)
e.g.:
    lw $1, offset($2)
    sw $3, offset($4)

Any of the GP registers can be used for the destination and source. The
offset is a signed, 16-bit number (so can be anywhere between -32768 and
32767); the program address used for the load or store is the sum of
src-reg and the offset. This address mode is normally enough to pick out a
particular member of a C structure (‘‘offset’’ being the distance between
the start of the structure and the member required); it implements an array
indexed by a constant; it is enough to reference function variables from
the stack or frame pointer; and it provides a reasonable sized global area
around the gp value for static and extern variables.
The assembler provides the semblance of a simple direct addressing
mode, to load the values of memory variables whose address can be
computed at link time.

More complex modes such as double-register or scaled index must be
implemented with sequences of instructions.
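
The address calculation can be modelled in a few lines. This is a sketch with hypothetical function names: effective_address shows the one mode the hardware provides, and indexed_address shows how a scaled-index access might be built from separate steps (here assumed to be a shift followed by an add) before the final load or store.

```python
MASK32 = 0xFFFFFFFF

def effective_address(base, offset16):
    """Address formed by a load/store: base register plus signed 16-bit offset."""
    offset16 &= 0xFFFF
    if offset16 & 0x8000:
        offset16 -= 0x10000
    return (base + offset16) & MASK32

def indexed_address(base, index, scale_shift):
    """Scaled-index addressing built from separate steps, as a sll/addu pair would."""
    scaled = (index << scale_shift) & MASK32     # sll  tmp, index, scale_shift
    return (base + scaled) & MASK32              # addu tmp, base, tmp
```

With $0 as the base register the single mode reaches the lowest 32Kbytes and the highest 32Kbytes of the address space; with gp as the base it reaches 32Kbytes either side of the gp value.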

Data types in Memory and registers
The R30xx family CPUs can load or store between 1 and 4 bytes in a
single operation. Naming conventions are used in the documentation and
to build instruction mnemonics:
‘‘C’’ name   MIPS name   Size(bytes)   Assembler mnemonic

int          word        4             ‘‘w’’ as in lw
long         word        4             ‘‘w’’ as in lw
short        halfword    2             ‘‘h’’ as in lh
char         byte        1             ‘‘b’’ as in lb

Integer data types
Byte and halfword loads come in two flavors:
• Sign-extend : lb and lh load the value into the least significant bits of
the 32-bit register, but fill the high order bits by copying the ‘‘sign bit’’
(bit 7 of a byte, bit 15 of a half-word). This correctly converts a signed
value to a 32-bit signed integer.
• Zero-extend : instructions lbu and lhu load the value into the least
significant bits of a 32-bit register, with the high order bits filled with
zero. This correctly converts an unsigned value in memory to the
corresponding 32-bit unsigned integer value; so byte value 254
becomes 32-bit value 254.
If the byte-wide memory location whose address is in t1 contains the
value 0xFE (-2, or 254 if interpreted as unsigned), then:
    lb   t2, 0(t1)
    lbu  t3, 0(t1)

will leave t2 holding the value 0xFFFF FFFE (-2 as signed 32-bit) and t3
holding the value 0x0000 00FE (254 as signed or unsigned 32-bit).
Subtle differences in the way shorter integers are extended to longer
ones are a historical cause of C portability problems, and the modern C
standards have elaborate rules. On machines like the MIPS, which does
not perform 8- or 16-bit precision arithmetic directly, expressions
involving short or char variables are less efficient than word operations.
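
The two kinds of extension, and the extra work implied by arithmetic on short variables, can be modelled as follows. This is an illustrative sketch with hypothetical function names; registers are represented as unsigned 32-bit values, and add_as_short returns the signed value a C short would hold.

```python
def load_byte_signed(byte):
    """Model of lb: copy bit 7 into the upper 24 bits of the register."""
    byte &= 0xFF
    return (byte | 0xFFFFFF00) if byte & 0x80 else byte

def load_byte_unsigned(byte):
    """Model of lbu: upper 24 bits of the register are zero."""
    return byte & 0xFF

def add_as_short(a, b):
    """32-bit add followed by the extra step needed to wrap like a 16-bit short."""
    total = (a + b) & 0xFFFF                 # keep the low halfword
    return total - 0x10000 if total & 0x8000 else total
```

The final truncate-and-extend step in add_as_short is the kind of extra code a compiler must insert when a program insists on 16-bit arithmetic.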
Unaligned loads and stores
Normal loads and stores in the MIPS architecture must be aligned; half-words
may be loaded only from 2-byte boundaries, and words only from 4-byte
boundaries. A load instruction with an unaligned address will
produce a trap. Because CISC architectures such as the MC680x0 and
iAPXx86 do handle unaligned loads and stores, this could complicate
porting software from one of these architectures. The MIPS architecture
does provide mechanisms to support this type of operation; in extremity,
software can provide a trap handler which will emulate the desired load
operation and hide this feature from the application.
All data items declared by C code will be correctly aligned.
But when it is known in advance that the program will transfer a word
from an address whose alignment is unknown and will be computed at run
time, the architecture does allow for a special 2-instruction sequence
(much more efficient than a series of byte loads, shifts and assembly). This
sequence is normally generated by the macro-instruction ulw (unaligned
load word).

(A macro-instruction ulh, unaligned load half, is also provided, and is
synthesized by two loads, a shift, and a bitwise ‘‘or’’ operation.)
The special machine instructions are lwl and lwr (load word left, load
word right). ‘‘Left’’ and ‘‘right’’ are arithmetical directions, as in ‘‘shift left’’;
‘‘left’’ is movement towards more significant bits, ‘‘right’’ is towards less
significant bits.
These instructions do three things:
• load 1, 2, 3 or 4 bytes from within one aligned 4-byte (word) location;
• shift that data to move the byte selected by the address to either the
most-significant (lwl) or least-significant (lwr) end of a 32-bit field;
• merge the bytes fetched from memory with the data already in the
destination.
This breaks most of the rules the architecture usually sticks by; it does
a logical operation on a memory variable, for example. Special hardware
allows the lwl, lwr pair to be used in consecutive instructions, even though
the second instruction uses the value generated by the first.
For example, on a CPU configured as big-endian the assembler
instruction:
    ulw  t1, 0(t2)
    add  t4, t3, t1

is implemented as:
    lwl  t1, 0(t2)
    lwr  t1, 3(t2)
    nop
    add  t4, t3, t1

Where:
• the lwl picks up the lowest-addressed byte of the unaligned 4-byte
region, together with however many more bytes which fit into an
aligned word. It then shifts them left, to form the most-significant
bytes of the register value.
• the lwr is aimed at the highest-addressed byte in the unaligned 4-byte
region. It loads it, together with any bytes which precede it in the
same memory word, and shifts it right to get the least significant bits
of the register value. The merge leaves the high-order bits unchanged.
• Although special hardware ensures that a nop is not required between
the lwl and lwr, there is still a load delay between the second of them
and a normal instruction.
Note that if t2 was in fact 4-byte aligned, then both instructions load the
entire word; duplicating effort, but achieving the desired effect.
CPU behavior when operating with little-endian byte order is described
in a later chapter.
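
The shift-and-merge behavior of the two instructions on a big-endian CPU can be modelled as below. This is a sketch of the behavior described above, not a description of the hardware: memory is represented as a Python bytes object, registers as unsigned 32-bit integers, and the function names simply echo the mnemonics.

```python
def read_word(memory, address):
    """Read the aligned big-endian word containing address."""
    start = address & ~3
    return int.from_bytes(memory[start:start + 4], "big")

def lwl(memory, register, address):
    """Load word left: fill the most-significant bytes of the register."""
    shift = 8 * (address & 3)
    kept = register & ((1 << shift) - 1)         # low bytes left untouched
    return ((read_word(memory, address) << shift) & 0xFFFFFFFF) | kept

def lwr(memory, register, address):
    """Load word right: fill the least-significant bytes of the register."""
    shift = 8 * (3 - (address & 3))
    kept = register & ~(0xFFFFFFFF >> shift) & 0xFFFFFFFF   # high bytes untouched
    return (read_word(memory, address) >> shift) | kept

def ulw(memory, address):
    """Unaligned load word, as the lwl/lwr pair."""
    register = lwl(memory, 0, address)
    return lwr(memory, register, address + 3)
```

With memory holding the bytes 0x10, 0x11, 0x12 and so on from address 0, an unaligned load from address 1 assembles 0x11121314 from two different aligned words, while a load from the aligned address 4 fetches the same word twice.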
Floating point data in memory
Loads into floating point registers from 4-byte aligned memory move
data without any interpretation – a program can load an invalid floating
point number and no FP error will result until an arithmetic operation is
requested with it as an operand.
This allows a programmer to load single-precision values by a load into
an even-numbered floating point register; but the programmer can also
load a double-precision value by a macro instruction, so that:
    ldc1    $f2, 24(t1)

is expanded to two loads to consecutive registers:
    lwc1    $f2, 24(t1)
    lwc1    $f3, 28(t1)

The C compiler aligns 8-byte long double-precision floating point
variables to 8-byte boundaries. R30xx family hardware does not require
this alignment; but it is done to avoid compatibility problems with
implementations of MIPS-2 or MIPS-3 CPUs such as the IDT R4600
(Orion), where the ldc1 instruction is part of the machine code, and the
alignment is necessary.

BASIC ADDRESS SPACE
The way in which MIPS processors use and handle addresses is subtly
different from that of traditional CISC CPUs, and may appear confusing.
Read the first part of this section carefully. Here are some guidelines:
• The addresses put into programs are rarely the same as the physical
addresses which come out of the chip (sometimes they’re close, but
not the same). This manual will refer to them as program addresses
and physical addresses respectively. A more common name for
program addresses is “virtual addresses”; note that the use of the
term “virtual address” does not necessarily imply that an operating
system must perform virtual memory management (e.g. demand
paging from disks...), but rather that the address undergoes some
transformation before being presented to physical memory. Although
virtual address is a proper term, this manual will typically use the
term “program address” to avoid confusing virtual addresses with
virtual memory management requirements.
• A MIPS-1 CPU has two operating modes: user and kernel. In user
mode, any address above 2Gbytes (most-significant bit of the address
set) is illegal and causes a trap. Also, some instructions cause a trap
in user mode.
• The 32-bit program address space is divided into four big areas with
traditional names; and different things happen according to the area
an address lies in:
kuseg 0x0000 0000 – 0x7FFF FFFF (low 2Gbytes): these are the addresses
permitted in user mode. In machines with an MMU (“E” versions
of the R30xx family), they will always be translated (more about
the R30xx MMU in a later chapter). Software should not attempt
to use these addresses unless the MMU is set up.
For machines without an MMU (“base” versions of the R30xx
family), the kuseg “program address” is transformed to a
physical address by adding a 1GB offset; the address
transformations for “base versions” of the R30xx family are
described later in this chapter. Note, however, that many
embedded applications do not use this address segment (those
applications which do not require that the kernel and its
resources be protected from user tasks).
kseg0 0x8000 0000 – 0x9FFF FFFF (512 Mbytes): these addresses are
‘‘translated’’ into physical addresses by merely stripping off the
top bit, mapping them contiguously into the low 512 Mbytes of
physical memory. This transformation operates the same for
both “base” and “E” family members. This segment is referred to
as “unmapped” because “E” version devices cannot redirect this
translation to a different area of physical memory.
Addresses in this region are always accessed through the cache,
so may not be used until the caches are properly initialized. They
will be used for most programs and data in systems using “base”
family members; and will be used for the OS kernel for systems
which do use the MMU (“E” version devices).

kseg1 0xA000 0000 – 0xBFFF FFFF (512 Mbytes): these addresses are
mapped into physical addresses by stripping off the leading three
bits, giving a duplicate mapping of the low 512 Mbytes of
physical memory. However, kseg1 program address accesses will
not use the cache.
The kseg1 region is the only chunk of the memory map which is
guaranteed to behave properly from system reset; that’s why the
after-reset starting point ( 0xBFC0 0000, commonly called the
“reset exception vector”) lies within it. The physical address of
the starting point is 0x1FC0 0000 – which means that the
hardware should place the boot ROM at this physical address.
Software will therefore use this region for the initial program
ROM, and most systems also use it for I/O registers. In general,
IO devices should always be mapped to addresses that are
accessible from Kseg1, and system ROM is always mapped to
contain the reset exception vector. Note that code in the ROM
can then be accessed uncacheably (during boot up) using kseg1
program addresses, and also can be accessed cacheably (for
normal operation) using kseg0 program addresses.
kseg2 0xC000 0000 – 0xFFFF FFFF (1 Gbyte): this area is only
accessible in kernel mode. As for kuseg, in “E” devices program
addresses are translated by the MMU into physical addresses;
thus, these addresses must not be referenced prior to MMU
initialization. For “base versions”, physical addresses are
generated to be the same as program addresses for kseg2.
Note that many systems will not need this region. In “E” versions,
it frequently contains OS structures such as page tables; simpler
OS’es probably will have little need for kseg2.
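
The division into four areas can be expressed as a few comparisons on the program address. The sketch below is illustrative only; the type and function names are hypothetical and are not part of any toolchain.

```c
#include <assert.h>
#include <stdint.h>

typedef enum { SEG_KUSEG, SEG_KSEG0, SEG_KSEG1, SEG_KSEG2 } mips_segment;

/* Classify a 32-bit program address by the area it lies in. */
mips_segment segment_of(uint32_t prog_addr)
{
    if (prog_addr < 0x80000000u) return SEG_KUSEG;
    if (prog_addr < 0xA0000000u) return SEG_KSEG0;
    if (prog_addr < 0xC0000000u) return SEG_KSEG1;
    return SEG_KSEG2;
}

/* kseg0 and kseg1 both map onto the low 512 Mbytes of physical memory.
   Clearing the top three bits gives the physical address in either case. */
uint32_t unmapped_to_physical(uint32_t prog_addr)
{
    mips_segment seg = segment_of(prog_addr);
    assert(seg == SEG_KSEG0 || seg == SEG_KSEG1);
    return prog_addr & 0x1FFFFFFFu;
}

/* Only kseg1 accesses are guaranteed to bypass the cache. */
int is_uncached_segment(uint32_t prog_addr)
{
    return segment_of(prog_addr) == SEG_KSEG1;
}

/* In user mode only kuseg addresses are legal. */
int legal_in_user_mode(uint32_t prog_addr)
{
    return segment_of(prog_addr) == SEG_KUSEG;
}
```

For instance, the reset starting point 0xBFC0 0000 is classified as kseg1 and yields physical address 0x1FC0 0000; the kseg0 address 0x9FC0 0000 yields the same physical address, but is accessed through the cache.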

SUMMARY OF SYSTEM ADDRESSING
MIPS program addresses are rarely simply the same as physical
addresses, but simple embedded software will probably use addresses in
kseg0 and kseg1, where the program address is related in an obvious and
unchangeable way to physical addresses.
Physical memory locations from 0x2000 0000 (512Mbyte) upward may
be difficult to access. In “E” versions of the R30xx family, the only way to
reach these addresses is through the MMU. In “base” family members,
certain of these physical addresses can be reached using kseg2 or kuseg
addresses: the address transformations for base R30xx family members are
described later in this chapter.

Kernel vs. user mode
In kernel mode (the CPU resets into this state), all program addresses
are accessible.
In user mode:
• Program addresses above 2Gbytes (top bit set) are illegal and will
cause a trap.
Note that if the CPU has an MMU, this means all valid user mode
addresses must be translated by the MMU; thus, User mode for “E”
devices typically requires the use of a memory-mapped OS.
For “base” CPUs, kuseg addresses are mapped to a distinct area of
physical memory. Thus, kernel memory resources (including IO
devices) can be made inaccessible to User mode software, without
requiring a memory-mapping function from the OS. Alternately, the
hardware can choose to “ignore” high-order address bits when
performing address decoding, thus “condensing” kuseg, kseg2, kseg1,
and kseg0 into the same physical memory.

• Instructions beyond the standard user set become illegal. Specifically,
the kernel can prevent User mode software from accessing the on-chip
CP0 (system control coprocessor, which controls exception and
machine state and performs the memory management functions of
the CPU).
Thus, the primary differences between User and Kernel modes are:
• User mode tasks can be inhibited from accessing kernel memory
resources, including OS data structures and IO devices. This also
means that various user tasks can be protected from each other.
• User mode tasks can be inhibited from modifying the basic machine
state, by prohibiting accesses to CP0.
Note that the kernel/user mode bit does not change the interpretation
of anything – just some things cease to be allowed in user mode. In kernel
mode the CPU can access low addresses just as if it was in user mode, and
they will be translated in the same way.

Memory map for CPUs without MMU hardware
The treatment of kseg0 and kseg1 addresses is the same for all IDT
R30xx CPUs. If the system can be implemented using only physical
addresses in the low 512Mbytes, and system software can be written to use
only kseg0 and kseg1, then the choice of “base” vs. “E” versions of the
R30xx family is not relevant.
For versions without the MMU (“base versions”), addresses in kuseg and
kseg2 will undergo a fixed address translation, and provide the system
designer the option to provide additional memory.
The base members of the R30xx family provide the following address
translations for kuseg and kseg2 program addresses:
• kuseg: this region (the low 2Gbytes of program addresses) is
translated to a contiguous 2Gbyte physical region between 1 and
3Gbytes. In effect, a 1GB offset is added to each kuseg program
address. In hex:
Program address                   Physical address
0x0000 0000 – 0x7FFF FFFF    →    0x4000 0000 – 0xBFFF FFFF

• kseg2: these program addresses are genuinely untranslated. So
program addresses from 0xC000 0000 – 0xFFFF FFFF emerge as
identical physical addresses.
This means that “base” versions can generate most physical addresses
(without the use of an MMU), except for a gap between 512Mbyte and
1Gbyte (0x2000 0000 through 0x3FFF FFFF). As noted above, many
systems may ignore high-order address bits when performing address
decoding, thus condensing all physical memory into the lowest 512MB
addresses.
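
The fixed translations for “base” family members can be summarized in a few lines of C. This is a sketch of the mapping as described in this section; the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Program address to physical address for a "base" (no MMU) CPU. */
uint32_t base_program_to_physical(uint32_t prog_addr)
{
    if (prog_addr < 0x80000000u)
        return prog_addr + 0x40000000u;   /* kuseg: add a 1GB offset */
    if (prog_addr < 0xC0000000u)
        return prog_addr & 0x1FFFFFFFu;   /* kseg0, kseg1: low 512Mbytes */
    return prog_addr;                     /* kseg2: untranslated */
}

/* A physical address can be generated unless it lies in the gap between
   512Mbyte and 1Gbyte. */
int base_physical_reachable(uint32_t phys_addr)
{
    return phys_addr < 0x20000000u || phys_addr >= 0x40000000u;
}

/* Inverse of the kuseg mapping: the kuseg program address which reaches a
   physical address in the 1Gbyte to 3Gbyte range. */
uint32_t kuseg_for_physical(uint32_t phys_addr)
{
    assert(phys_addr >= 0x40000000u && phys_addr <= 0xBFFFFFFFu);
    return phys_addr - 0x40000000u;
}
```

A system which ignores high-order address bits in its decoder would see all of these as aliases of the same memory, in which case the distinction matters only to software.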
Subsegments in the R3041 – memory width configuration
The R3041 CPU can be configured to access different regions of memory
as either 32-, 16- or 8-bits wide. Where the program requests a 32-bit
operation to a narrow memory (either with an uncached access, or a cache
miss, or a store), the CPU may break a transaction into multiple data
phases, to match the datum size to the memory port width.
The width configuration is applied independently to subsegments of the
normal kseg regions, as follows:
• kseg0 and kseg1: as usual, these are both mapped onto the low
512Mbytes. This common region is split into 8 subsegments
(64Mbytes each), each of which can be programmed as 8-, 16- or
32-bits wide. The width assignment affects both kseg0 and kseg1
accesses (that is, one can view these as subsegments of the
corresponding “physical” addresses).

• kuseg: is divided into four 512Mbyte subsegments, each
independently programmable for width. Thus, kuseg can be broken
into multiple portions, which may have varying widths. An example of
this may be a 32-bit main memory with some 16-bit PCMCIA font
cards and an 8-bit NVRAM.
• kseg2: is divided into two 512Mbyte subsegments, independently
programmable for width. Again, this means that kseg2 can support
multiple memory subsystems, of varying port width.
Note that once the various memory port widths have been configured
(typically at boot time), software does not have to be aware of the actual
width of any memory system. It can choose to treat all memory as 32-bit
wide, and the CPU will automatically adjust when an access is made to a
narrower memory region. This simplifies software development, and also
facilitates porting to various system implementations (which may or may
not choose the same memory port widths).
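
As an illustration of how the subsegments divide the address space, the following sketch finds the subsegment which contains a given program address. The region names and the numbering of subsegments (counting upward from the lowest address in each region) are assumptions made for this example; they are not taken from the PortSize register definition.

```c
#include <assert.h>
#include <stdint.h>

typedef enum { WIDTH_KUSEG, WIDTH_KSEG01, WIDTH_KSEG2 } width_region;

typedef struct {
    width_region region;   /* which group of subsegments */
    unsigned index;        /* hypothetical index, lowest address first */
} width_subsegment;

width_subsegment width_subsegment_of(uint32_t prog_addr)
{
    width_subsegment s;
    if (prog_addr < 0x80000000u) {
        s.region = WIDTH_KUSEG;                        /* four of 512Mbytes */
        s.index = prog_addr >> 29;
    } else if (prog_addr < 0xC0000000u) {
        s.region = WIDTH_KSEG01;                       /* eight of 64Mbytes */
        s.index = (prog_addr & 0x1FFFFFFFu) >> 26;
    } else {
        s.region = WIDTH_KSEG2;                        /* two of 512Mbytes */
        s.index = (prog_addr - 0xC0000000u) >> 29;
    }
    assert(s.index < 8u);
    return s;
}
```

Note that a kseg0 address and the kseg1 address which aliases it fall in the same subsegment, since the width is assigned to the common physical region.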

SYSTEM CONTROL COPROCESSOR ARCHITECTURE

CHAPTER 3

Integrated Device Technology, Inc.

This chapter concentrates on the aspects of the R30xx family
architecture that must be managed by the OS programmer. Note that most
of these features are transparent to the user program author; however, the
nature of embedded systems is such that most embedded systems
programmers will have a view of the underlying CPU and system
architecture, and thus will find this material important.
Co-processors
MIPS uses the term “co-processor” both in a traditional fashion, and also
in a non-traditional fashion. Specifically, the FPA device is a traditional
microprocessor co-processor: it is an optional part of the architecture,
with its own particular instruction set.
Opcodes are reserved and instruction fields defined for up to four
‘‘co-processors’’. Architecturally, the co-processors can be tightly coupled to
the base integer CPU; for example, the ISA defines instructions to move
data directly between memory and the co-processor, rather than requiring
it to be moved into the integer processor first.
However, MIPS also uses the term “co-processor” for the functions
required to manage the CPU environment, including exception
management, cache control, and memory management. This
segmentation ensures that the chip architecture can be varied (e.g. cache
architecture, interrupt controller, etc.), without impacting user mode
software compatibility.
These functions are grouped by MIPS into the on-chip “co-processor 0”,
or ‘‘system control co-processor’’ - and these instructions implement the
whole CPU control system. Note that co-processor 0 has no independent
existence, and is certainly not optional. It provides a standard way of
encoding the instructions which access the CPU status register; so that,
although the definition of the status register changes among
implementations, programmers can use the same assembler for both
CPUs. Similarly, the exception and memory management strategies can
be varied among implementations, and these effects isolated to particular
portions of the OS kernel.

CPU CONTROL SUMMARY
This chapter, coupled with chapters on cache management, memory
management, and exception processing, provides details on managing the
machine and OS state. The areas of interest include:
• CPU control and co-processor : how privileged instructions are
organized, with shortform descriptions. There are relatively few
privileged instructions; most of the low-level control over the CPU is
exercised by reading and writing bit-fields within special registers.
• Exceptions : external interrupts, invalid operations, arithmetic errors
– all result in ‘‘exceptions’’, where control is transferred to an
exception handler routine.
MIPS exceptions are extremely simple – the hardware does the
absolute minimum, allowing the programmer to tailor the exception
mechanism to the needs of the particular system.
A later chapter describes MIPS exceptions, why they are ‘‘precise’’,
exception vectors, and conventions about how to code exception
handling routines.
Special problems can arise with nested exceptions: exceptions
occurring while the CPU is still handling an earlier exception.

Hardware interrupts have their own style and rules.
The Exception Management chapter includes an annotated example
of a moderately-complicated exception handler.
• Caches and cache management : all R30xx implementations have dual
caches (the I-cache for instructions, the D-cache for data). On-chip
hardware is provided to manage the caches, and the programmer
working with I/O devices, particularly with DMA devices, may need to
explicitly manage the caches in particular situations.
To manipulate the caches, the CPU allows software to isolate them,
inhibiting cache/memory traffic and allowing the processor to access
cache as if it were simple memory; and the CPU can swap the roles of
the I-cache and D-cache (the only way to make the I-cache writable).
Caches must sometimes be cleared of stale or invalid/uninitialized
data. Even following power-up, the R30xx caches are in a random
state and must be cleaned up before they can be used. A later chapter
will discuss the techniques used by software to manage the on-chip
cache resources.
In addition, techniques to determine the on-chip cache sizes will be
shown (greatest flexibility is achieved if software can be written to be
independent of cache sizes).
For the diagnostics programmer, techniques to test the cache memory
and probe for particular entries will be discussed.
On some CPU implementations the system designer may make
configuration choices about the cache (e.g. the R3081 and R3071
allow the cache organization to be selected between 16kB of I-cache/
4kB of D-cache and 8kB each of I- and D- cache). The cache
management chapter will also discuss some of the considerations to
apply to make a proper selection.
• Write buffer : on R30xx family CPUs the D-cache is always write
through; all writes go to main memory as well as the cache. This
simplifies the caches, but main memory won’t be able to accept data
as fast as the CPU can write it. Much of the performance loss can be
made up by using a FIFO store which holds a number of ‘‘write cycles’’
(it stores both address and data). In the R30xx family, this FIFO,
called the write buffer, is integrated on-chip.
System programmers may need to know that writes happen later than
the code sequence suggests. The chapter on cache management
discusses this.
• Starting up : at reset almost nothing is defined, so the software must
build carefully. In MIPS CPUs, reset is implemented in almost exactly
the same way as the exceptions.
A later chapter on reset initialization discusses ways of finding out
which CPU is executing the software, and how to get a ROM program
to run.
An example of a C runtime environment, attending to the stack and
special registers, is provided.
• Memory management and the TLB : A later chapter will discuss
address translation and managing the translation hardware (the
TLB). This section is mostly for OS programmers.

CPU CONTROL AND ‘‘CO-PROCESSOR 0’’
CPU control instructions
Most control functions are implemented with registers (most of which
consist of multiple bitfields). The MIPS architecture has an escape
mechanism to define instructions for ‘‘co-processors’’ – and the CPU
control instructions are coded for ‘‘co-processor 0’’.

There are several CPU control instructions used in the memory
management implementation, which are described in a later chapter. But
leaving aside the MMU, CPU control defines just one instruction beyond
the necessary move to and from the control registers.
mtc0 rs, nn – Move to co-processor zero
Loads ‘‘co-processor 0’’ register number nn from CPU general register rs. It
is unusual, and not good practice, to refer to CPU control registers by their
number in assembler sources; normal practice is to use the names listed
in Table 3.1, “Summary of CPU control registers (not MMU)”. In some
toolchains the names are defined by a C-style ‘‘include’’ file, and the C
preprocessor run as a front-end to the assembler; the assembler manual
should provide guidance on how to do this. This is the only way of setting
bits in a CPU control register.
mfc0 rd, nn – Move from co-processor zero
General register rd is loaded with the value from CPU control register
number nn. Once again, it is common to use a symbolic name and a
macro-processor to save remembering the numbers. This is the only way
of inspecting bits in a control register.
rfe – Restore from exception
Note that this is not ‘‘return from exception’’. This instruction restores the
status register to go back to the state prior to the trap. To understand what
it does, refer to the status register SR defined later in this chapter. The only
secure way of returning to user mode from an exception is to return with
a jr instruction which has the rfe in its delay slot.
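
From C, the move instructions are usually wrapped in small accessor functions. The sketch below is one way this might look, under several assumptions: the GNU-style inline assembler syntax, the hypothetical function names, and the use of register number 12 for SR (the usual R3000 assignment, which should be checked against Table 3.1). When compiled for anything other than a MIPS target, a plain variable stands in for the register so that the logic can be tried on a host machine.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__mips__) && defined(__GNUC__)
/* MIPS target: read and write CP0 register 12 directly (kernel mode only).
   The trailing nop allows for the delay before the value is usable. */
static inline uint32_t read_sr(void)
{
    uint32_t value;
    __asm__ volatile("mfc0 %0, $12\n\tnop" : "=r"(value));
    return value;
}
static inline void write_sr(uint32_t value)
{
    __asm__ volatile("mtc0 %0, $12\n\tnop" : : "r"(value));
}
#else
/* Host build: a variable stands in for the control register. */
static uint32_t simulated_sr;
static inline uint32_t read_sr(void) { return simulated_sr; }
static inline void write_sr(uint32_t value) { simulated_sr = value; }
#endif

/* Read-modify-write of SR: clear the bits in 'clear', then set the bits in
   'set'. Returns the previous value so that the caller can restore it. */
uint32_t update_sr(uint32_t clear, uint32_t set)
{
    uint32_t old;
    assert((clear & set) == 0u);   /* a bit cannot be both cleared and set */
    old = read_sr();
    write_sr((old & ~clear) | set);
    return old;
}
```

Since mtc0 and mfc0 are the only way to change or inspect a control register, every field update takes this read-modify-write form.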

Standard CPU control registers
This table describes the general CPU control registers (ignoring the
MMU control registers). Also note that typical convention is to reserve k0
and k1 for exception processing, although they are proper GP registers of
the integer CPU unit.
Register   Description                                      CP0
Mnemonic                                                    reg no.

PRId       CP0 type and rev level                           15

SR         (status register) CPU mode flags                 12

Cause      Describes the most recently recognized           13
           exception

EPC        Return address from trap                         14

BadVaddr   Contains the last invalid program address        8
           which caused a trap. It is set by address
           errors of all kinds, even if there is no MMU

Config     CPU configuration (R3081 and R3041 only)         3

BusCtrl    (R3041 only) configure bus interface signals.    2
           Needs to be setup to match the hardware
           implementation.

PortSize   (R3041 only) used to flag some program           10
           address regions as 8- or 16-bits wide. Must be
           programmed to match the hardware
           implementation.

Count      (R3041 only, read/write) a 24-bit counter        9
           incrementing with the CPU clock.

Compare    (R3041 only, read/write) a 24-bit value used     11
           to wraparound the Count value and set an
           output signal.

Table 3.1. Summary of CPU control registers (not MMU)

Encoding of control registers
The next section describes the format of the control registers, with a
sketch of the function of each field. In most cases, more information
about how things work is to be found in separate sections or chapters
later.
A note about reserved fields is in order here. Many unused control
register fields are marked ‘‘0’’. Bits in such fields are guaranteed to read
zero, and should be written as zero. Other reserved fields are marked
‘‘reserved’’ or ‘‘×’’; software must always write them as zero, and should
not assume that it will get back zero or any other particular value.
Registers specific to the memory management system are described in a
later chapter.

PRId Register
31         16 15          8 7           0
   reserved        Imp          Rev

Figure 3.1. PRId Register fields

Figure 3.1, “PRId Register fields” shows the layout of the PRId register,
a read-only register to be consulted to identify the CPU type (more
properly, this register describes CP0, allowing the kernel to dynamically
configure itself for various CPU implementations). ‘‘Imp’’ should be related
to the CPU control register set. The encoding of Imp is described below:
CPU type                                             ‘‘Imp’’ value

R3000A (including R3051, R3052, R3071, and R3081)

IDT unique (R3041)

Note that when the Imp field indicates IDT unique, the revision number
can be used to distinguish among various CP0 implementations. Refer to
the R3041 User’s manual for the revision level appropriate for that device.
Since the R3051, 52, 71, and 81 are kernel compatible with the R3000A,
they share the same Imp value.
When printing the value of this register, it is conventional to print it
out as ‘‘x.y’’ where ‘‘x’’ and ‘‘y’’ are the decimal values of Imp and Rev
respectively. Try not to use this register and the CPU manuals to size
things, or to establish the presence or absence of particular features;
software will be more portable and robust if it is designed to include code
sequences to probe for the existence of individual features. This manual
will provide numerous examples designed to determine cache sizes,
presence or absence of TLB, FPA, etc.
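
The ‘‘x.y’’ printing convention might be coded as follows. This is a sketch only: the function names are hypothetical, and the field positions (Imp in bits 15:8, Rev in bits 7:0) are assumed from the conventional MIPS PRId layout.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Extract the two fields of a PRId value (positions are an assumption). */
unsigned prid_imp(uint32_t prid) { return (prid >> 8) & 0xFFu; }
unsigned prid_rev(uint32_t prid) { return prid & 0xFFu; }

/* Format the register as "x.y", with both fields printed in decimal.
   Returns the snprintf result (the number of characters needed). */
int format_prid(char *buf, size_t len, uint32_t prid)
{
    assert(buf != NULL && len > 0);
    return snprintf(buf, len, "%u.%u", prid_imp(prid), prid_rev(prid));
}
```

For example, a made-up register value of 0x0000 0A05 (not the identity of any real device) is formatted as ‘‘10.5’’.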
SR Register
31   30   29   28   27-26  25  24-23  22   21  20  19  18  17   16
CU3  CU2  CU1  CU0    0    RE    0    BEV  TS  PE  CM  PZ  SwC  IsC

15-8  7-6   5    4    3    2    1    0
 IM    0   KUo  IEo  KUp  IEp  KUc  IEc

Figure 3.2. Fields in status register (SR)

The MIPS CPU has remarkably few mode bits; those that exist are
defined by fields in the CPU status register SR, as shown in Figure 3.2,
“Fields in status register (SR)”.
Note that there are no modes such as non-translated or non-cached in
MIPS CPUs; all translation and caching decisions are made on the basis of
the program address. Fields are:
CU3, CU2 Bits (31:30) control the usability of ‘‘co-processors’’ 3 and 2
respectively. In the R30xx family, these might be enabled if
software wishes to use the BrCond(3:2) input pins for polling, or
to speed exception decoding.
CU1 ‘‘co-processor 1 usable’’: 1 to use FPA if present, 0 to disable.
When 0, all FPA instructions cause an exception, even for the
kernel. It can be useful to turn off an FPA even when one is
available; it may also be enabled in devices which do not include
an FPA, if the intent is to use the BrCond(1) pin as a polled input.
CU0 ‘‘co-processor 0 usable’’: set 1 to be able to use some
nominally-privileged instructions in user mode (this is rarely if ever done).
The CPU control instructions encoded as ‘‘co-processor 0’’ type
are always usable in kernel mode, regardless of the setting of this
bit.
RE ‘‘reverse endianness in user mode’’. The MIPS processors can be
configured, at reset time, with either ‘‘endianness’’ (byte ordering
convention, discussed in the various CPU’s User’s Manuals and
later in this manual). The RE bit allows binaries intended to be
run with one byte ordering convention to be run in systems with
the opposite convention, presuming OS software provided the
necessary support.
When RE is active, user-privilege software runs as if the CPU had
been configured with the opposite endianness.
However, achieving cross-universe running would require a large
software effort as well, and should not be necessary in embedded
systems.
BEV ‘‘boot exception vectors’’: when BEV == 1, the CPU uses the ROM
(kseg1) space exception entry point (described in a later chapter).
BEV is usually set to zero in running systems; this relocates the
exception vectors to RAM addresses, speeding accesses and
allowing the use of “user supplied” exception service routines.
TS ‘‘TLB shutdown’’: In devices which implement the full R3000A
MMU, TS gets set if a program address simultaneously matches
two TLB entries. Prolonged operation in this state, in some
implementations, could cause internal contention and damage
to the chip. TLB shutdown is terminal, and can be cleared only
by a hardware reset.
In base family members, which do not include the TLB, this bit
is set by reset; software can rely on this feature to determine the
presence or absence of TLB support hardware.
PE set if a cache parity error has occurred. No exception is
generated by this condition, which is really only useful for
diagnostics. The MIPS architecture has cache diagnostic
facilities because earlier versions of the CPU used external
caches, and this provided a way to verify the timing of a
particular system. For those implementations the cache parity
error bit was an essential design debug tool.
For CPUs with on-chip caches this feature is rarely needed; only
the R3071 and R3081 implement parity over the on-chip caches.

CM shows the result of the last load operation performed with the
D-cache isolated (described in the chapter on cache management).
CM is set if the cache really contained data for the addressed
memory location (i.e. if the load would have hit in the cache even
if the cache had not been isolated).
PZ When set, cache parity bits are written as zero and not checked.
This was useful in old R3000A systems which required external
cache RAMs, but is of little relevance to the R30xx family.
SwC, IsC ‘‘swap caches’’ and ‘‘isolate (data) cache’’. Cache mode bits for
cache management and diagnostics; their use is described in
detail in a later chapter on cache management. In simple terms:
• IsC set 1: makes all loads and stores access only the data
cache, and never memory; and in this mode a partial-word
store invalidates the cache entry. Note that when
this bit is set, even uncached data accesses will not be
seen on the bus; further, this bit is not initialized by reset.
Boot-up software must ensure this bit is properly
initialized before relying on external data references.
• SwC set 1: reverses the roles of the I-cache and D-cache,
so that software can access and invalidate I-cache entries.
IM ‘‘interrupt mask’’: an 8 bit field defining which interrupt sources,
when active, will be allowed to cause an exception. Six of the
interrupt sources are external pins (one may be used by the FPA,
which although it lives on the same chip is logically external); the
other two are the software-writable interrupt bits in the Cause
register.
No interrupt prioritization is provided by the CPU: the hardware
treats all interrupt bits the same. This is described in greater
detail in the chapter dealing with exceptions.

KUc, IEc The two basic CPU protection bits.
KUc is set 0 when running with kernel privileges, 1 for user
mode. In kernel mode, software can get at the whole program
address space, and use privileged (‘‘co-processor 0’’)
instructions. User mode restricts software to program addresses
between 0x0000 0000 and 0x7FFF FFFF, and can be denied
permission to run privileged instructions; attempts to break the
rules result in an exception.
IEc is set 0 to prevent the CPU taking any interrupt, 1 to enable.
KUp, IEp ‘‘KU previous, IE previous’’:
on an exception, the hardware takes the values of KUc and IEc
and saves them here; at the same time as changing the values of
KUc, IEc to [0, 0] (kernel mode, interrupts disabled). The
instruction rfe can be used to copy KUp, IEp back into KUc, IEc.
KUo, IEo ‘‘KU old, IE old’’:
on an exception the KUp, IEp bits are saved here. Effectively, the
six KU/IE bits are operated as a 3-deep, 2-bit wide stack which
is pushed on an exception and popped by an rfe.
This provides a chance of recovering cleanly from an exception
occurring so early in an exception handling routine that the first
exception has not yet saved SR. The circumstances in which this
can be done are limited, and it is probably only really of use in
allowing the user TLB refill code to be made a little shorter, as
described in the chapter on memory management.
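
The push and pop behaviour of the six KU/IE bits can be modelled on an SR value held in a C variable. This is a sketch of the stack behaviour only, assuming the six bits occupy the low-order end of SR in the order KUo, IEo, KUp, IEp, KUc, IEc. The function names are hypothetical, and the pair of values which the hardware forces into KUc, IEc on exception entry is passed in as a parameter (entry_kuie) rather than fixed by the model.

```c
#include <assert.h>
#include <stdint.h>

#define SR_KUIE_MASK 0x3Fu   /* KUo IEo KUp IEp KUc IEc, assumed bits 5:0 */

/* Exception entry: current -> previous, previous -> old, old is lost.
   entry_kuie supplies the new KUc (bit 1) and IEc (bit 0). */
uint32_t sr_on_exception(uint32_t sr, uint32_t entry_kuie)
{
    uint32_t stack;
    assert(entry_kuie <= 3u);
    stack = ((sr & SR_KUIE_MASK) << 2) & SR_KUIE_MASK;
    return (sr & ~SR_KUIE_MASK) | stack | entry_kuie;
}

/* rfe: previous -> current, old -> previous; old keeps its value. */
uint32_t sr_on_rfe(uint32_t sr)
{
    uint32_t stack = sr & SR_KUIE_MASK;
    stack = (stack & 0x30u) | (stack >> 2);
    return (sr & ~SR_KUIE_MASK) | stack;
}
```

An exception followed by an rfe therefore restores the original current and previous pairs, while the original old pair is overwritten; this is why only one level of nesting can be survived without saving SR.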

Cause Register
31   30   29-28   27-16   15-8   7   6-2       1-0
BD   0    CE      0       IP     0   ExcCode   0

Figure 3.3. Fields in the Cause register

Figure 3.3, “Fields in the Cause register” shows the fields in the Cause
register, which are consulted to determine the kind of exception which
happened and will be used to decide which exception routine to call.
BD ‘‘branch delay’’: if set, this bit indicates that the EPC does not
point to the actual “exception” instruction, but rather to the
branch instruction which immediately precedes it.
When the exception restart point is an instruction which is in the
‘‘delay slot’’ following a branch, EPC has to point to the branch
instruction; it is harmless to re-execute the branch, but if the
CPU returned from the exception to the branch delay instruction
itself the branch would not be taken and the exception would
have broken the interrupted program.
The only time software might be sensitive to this bit is if it must
analyze the ‘‘offending’’ instruction (if BD == 1 then the
instruction is at EPC + 4). This would occur if the instruction
needs to be emulated (e.g. a floating point instruction in a device
with no hardware FPA; or a breakpoint placed in a branch delay
slot).
CE ‘‘co-processor error’’: if the exception is taken because a
‘‘co-processor’’ format instruction was for a ‘‘co-processor’’ which is
not enabled by the CUx bit in SR, then this field has the
co-processor number from that instruction.
IP ‘‘Interrupt Pending’’: shows the interrupts which are currently
asserted (but may be “masked” from actually signalling an
exception). These bits follow the CPU inputs for the six hardware
levels. Bits 9 and 8 are read/writable, and contain the value last
written to them. However, any of the 8 bits active when enabled
by the appropriate IM bit and the global interrupt enable flag IEc
in SR, will cause an interrupt.
IP is subtly different from the rest of the Cause register fields; it
doesn’t indicate what happened when the exception took place,
but rather shows what is happening now.
ExcCode
A 5-bit code which indicates what kind of exception happened,
as detailed in Table 3.2, “ExcCode values: different kinds of
exceptions”.
ExcCode
Value     Mnemonic   Description

0         Int        Interrupt
1         Mod        ‘‘TLB modification’’
2         TLBL       ‘‘TLB load/TLB store’’
3         TLBS
4         AdEL       Address error (on load/I-fetch or store respectively).
5         AdES       Either an attempt to access outside kuseg when in user
                     mode, or an attempt to read a word or half-word at a
                     misaligned address.
6         IBE        Bus error (instruction fetch or data load, respectively).
7         DBE        External hardware has signalled an error of some kind;
                     proper exception handling is system-dependent. The
                     R30xx family CPUs can’t take a bus error on a store;
                     the write buffer would make such an exception
                     “imprecise”.
8         Syscall    Generated unconditionally by a syscall instruction.
9         Bp         Breakpoint - a break instruction.
10        RI         ‘‘reserved instruction’’
11        CpU        ‘‘Co-Processor unusable’’
12        Ov         ‘‘arithmetic overflow’’. Note that ‘‘unsigned’’ versions of
                     instructions (e.g. addu) never cause this exception.
13-31     -          reserved. Some are already defined for MIPS CPUs such
                     as the R6000 and R4xxx

Table 3.2. ExcCode values: different kinds of exceptions
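
The Cause fields described above can be pulled apart with a few shifts and masks. The following C fragment is only a sketch: it assumes the usual R30xx Cause layout (BD in bit 31, CE in bits 29:28, IP in bits 15:8, ExcCode in bits 6:2), and the macro and function names are invented for this illustration rather than taken from any IDT header.

```c
#include <stdint.h>

/* Assumed Cause register field positions (illustrative names). */
#define CAUSE_BD         0x80000000u /* bit 31: exception hit a branch delay slot */
#define CAUSE_CE_SHIFT   28          /* bits 29:28: co-processor number */
#define CAUSE_CE_MASK    0x3u
#define CAUSE_IP_SHIFT   8           /* bits 15:8: interrupts pending now */
#define CAUSE_IP_MASK    0xffu
#define CAUSE_EXC_SHIFT  2           /* bits 6:2: ExcCode */
#define CAUSE_EXC_MASK   0x1fu

static int cause_bd(uint32_t cause)
{
    return (cause & CAUSE_BD) != 0;
}

static unsigned cause_ce(uint32_t cause)
{
    return (cause >> CAUSE_CE_SHIFT) & CAUSE_CE_MASK;
}

static unsigned cause_ip(uint32_t cause)
{
    return (cause >> CAUSE_IP_SHIFT) & CAUSE_IP_MASK;
}

static unsigned cause_exccode(uint32_t cause)
{
    return (cause >> CAUSE_EXC_SHIFT) & CAUSE_EXC_MASK;
}

/* Map an ExcCode value to its mnemonic from Table 3.2. */
static const char *exccode_name(unsigned code)
{
    static const char *const names[13] = {
        "Int", "Mod", "TLBL", "TLBS", "AdEL", "AdES", "IBE",
        "DBE", "Syscall", "Bp", "RI", "CpU", "Ov"
    };
    return code < 13 ? names[code] : "reserved";
}
```

A handler written in C would typically read Cause once (with an mfc0 wrapper of its own) and pass the value to helpers like these; remember that IP reflects the current state of the interrupt inputs, not their state at the time of the exception.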

EPC Register
This is a 32-bit register containing the 32-bit address of the return point
for this exception. The instruction causing (or suffering) the exception is at
EPC, unless BD is set in Cause, in which case EPC points to the previous
(branch) instruction.
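
As a small illustration of the EPC/BD rule, the address of the instruction which actually suffered the exception can be computed as below. This is a sketch only; the function name is hypothetical, and BD is assumed to be bit 31 of Cause.

```c
#include <stdint.h>

#define CAUSE_BD 0x80000000u /* assumed position of the BD bit in Cause */

/* Address of the instruction that caused (or suffered) the exception:
 * EPC itself, or EPC + 4 when the victim sat in a branch delay slot. */
static uint32_t faulting_insn_addr(uint32_t epc, uint32_t cause)
{
    return (cause & CAUSE_BD) ? epc + 4 : epc;
}
```

Note that the return point is still EPC in both cases; only code which needs to examine or emulate the offending instruction cares about the adjusted address.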
BadVaddr Register
A 32-bit register containing the address whose reference led to an
exception; set on any MMU-related exception, on an attempt by a user
program to access addresses outside kuseg, or if an address is wrongly
aligned for the datum size referenced.
After any other exception this register is undefined. Note in particular
that it is not set after a bus error.

R3041, R3071, and R3081 specific registers
Count and Compare Registers (R3041 only)
Only present in the R3041, these provide a simple 24-bit counter/timer
running at CPU cycle rate. Count counts up, and then wraps around to
zero once it has reached the value in the Compare register. As it wraps
around the Tc* CPU output is asserted. According to CPU configuration
(bit TC of the BusCtrl register), Tc* will either remain active until reset by
software (re-write Compare), or will pulse. In either case the counter just
keeps counting. To generate an interrupt Tc* must be connected to one of
the interrupt inputs.
From reset Compare is set up to its maximum value (0xFF FFFF), so the
counter runs up to 2^24 - 1 before wrapping around.
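
To make the arithmetic concrete, here is a rough C sketch of choosing a Compare value for a given tick rate. It assumes that Count runs from 0 up to Compare inclusive before wrapping, so that one period lasts Compare + 1 CPU cycles; that off-by-one should be checked against the R3041 hardware manual. The function names are hypothetical.

```c
#include <stdint.h>

#define R3041_COMPARE_MAX 0x00ffffffu /* 24-bit counter, 2^24 - 1 */

/* CPU cycles in one timer tick; 0 if tick_hz is 0. */
static uint32_t r3041_cycles_per_tick(uint32_t cpu_hz, uint32_t tick_hz)
{
    return tick_hz ? cpu_hz / tick_hz : 0;
}

/* Compare value giving a period of `cycles` CPU clocks, or 0 when the
 * period cannot be represented in 24 bits.
 * Assumption: period = Compare + 1 cycles. */
static uint32_t r3041_compare_for_period(uint32_t cycles)
{
    if (cycles < 2 || cycles - 1 > R3041_COMPARE_MAX)
        return 0;
    return cycles - 1;
}
```

For example, a 20MHz CPU asked for a 100Hz tick needs 200000 cycles per tick, comfortably inside the 24-bit range; a 1Hz tick would not fit and would need a software divider.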
Config Register (R3071 and R3081)
Figure 3.4. Fields in the R3071/81 Config Register
(register layout figure; the fields are described below: Lock, Slow Bus,
DB Refill, FPInt, Halt, RF, AC and reserved)


• Lock : set this bit to write to the register for the last time; all future
writes to Config will be ignored. The intention is that initialization
software will set the register and can then lock it in case some
ill-behaved piece of software developed on some earlier version of the
MIPS architecture tries to stomp on Config; this would have had no
effect on earlier CPUs.
• Slow Bus : hardware may require that this bit be set. It only matters
when the CPU performs a store while running from a cached location.
The system hardware design determines the proper setting for this
bit; setting it to ‘1’ should be permissible for any system, but loses
some performance in memory systems able to support more
aggressive bus performance.
If set 1, an idle bus cycle is guaranteed between any read and write
transfer. This enables additional time for bus tri-stating, control logic
generation, etc.
• DB : ‘‘data cache block refill’’, set 1 to reload 4 words into the data
cache on any miss, set 0 to reload just one word. Can be initialized
either way on the R3081, by a reset-time hardware input.
• FPInt : controls the CPU interrupt level on which FPA interrupts are
reported. On original R3000 CPUs the FPA was external and this was
determined by wiring; but the R3081’s FPA is on the chip and it would
be inefficient (and jeopardize pin-compatibility) to send the interrupt
off chip and on again.
Set FPInt to the binary value of the CPU interrupt pin number which
is dedicated to FPA interrupts. By default the field is initialized to
“011’’ to select the pin Int3†; MIPS convention put the FPA on
external interrupt pin 3. For whichever pin is dedicated to the FPA,
the CPU will then ignore the value on the external pin; the IP field of
the cause register will simply follow the FPA.
On the R3071, this field is “reserved”, and must be written as “000”.
• Halt : set to bring the CPU to a standstill. It will start again as soon as
any interrupt input is asserted (regardless of the state of the interrupt
mask). This is useful for power reduction, and can also be used to
emulate old MC68000 “Halt” operation.
• RF : slows the CPU to 1/16th of the normal clock rate, to reduce power
consumption. Illegal unless the CPU is running at 33MHz or higher.
Note that the CPU’s output clock (which is normally used to
synchronize all the interface logic) slows down too; the hardware
design should also accommodate this feature if software desires to
use it.
• AC : ‘‘alternate cache’’. 0 for 16K I-cache/4K D-cache, but set 1 for 8K
I-cache/8K D-cache.
• Reserved : must only be written as zero. It will probably read as zero,
but software should not rely on this.
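
The relationship between an external interrupt pin number (as written to FPInt) and the corresponding bit in the IP field of Cause, or the IM field of SR, is easy to get wrong by two. A minimal C sketch, assuming the two software interrupts occupy bits 0 and 1 of IP/IM and that both fields sit in bits 15:8 of their registers; the function names are made up for this example:

```c
#include <stdint.h>

/* Bit number within the 8-bit IP (or IM) field for external pin Int<pin>. */
static unsigned int_pin_to_ip_bit(unsigned pin)
{
    return pin + 2; /* bits 0 and 1 are the software interrupts */
}

/* Mask for the same interrupt within the full Cause or SR register. */
static uint32_t int_pin_to_reg_mask(unsigned pin)
{
    return 1u << (8 + int_pin_to_ip_bit(pin));
}
```

With the default FPInt value of “011’’, FPA interrupts therefore show up as bit 5 of IP, i.e. mask 0x2000 in Cause, and must be enabled with the same mask in SR.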
Config Register (R3041)
Figure 3.5. Fields in the R3041 Config (Cache Configuration) Register
(register layout figure; the fields are described below: Lock, DBR, FDM,
and fixed-value ‘‘1’’ and ‘‘0’’ fields)

† Take care: the external pin Int3 corresponds to the bit numbered
‘‘5’’ in IP of the Cause register or IM of the SR register. That’s
because both the Cause and SR fields support two ‘‘software
interrupts’’ numbered as bits 0 and 1.

• Lock: set 1 to finally configure register (additional writes will not have
any effect until the CPU is reset).
• 1 and 0 : set fields to exactly the value shown.
• DBR: ‘‘DBlockRefill’’, set 1 to read 4 words into the cache on a miss,
0 to refill just the word missed on. The proper setting for a given
system is dependent on a number of factors, and may best be
determined by measuring performance in each mode and selecting
the best one. Note that it is possible for software to dynamically
reconfigure the refill algorithm depending on the current code
executing, presuming the register has not been “locked”.
• FDM: “Force D-Cache Miss”, set 1 for an R3041-specific cache mode,
where all loads result in data being fetched from memory (missing in
the data cache), but the incoming data is still used to refill the cache.
Stores continue to write the cache. This is useful when software
desires to obtain the high-bandwidth of the cache and cache refills,
but the corresponding main memory is “volatile” (e.g. a FIFO, or
updated by DMA).
BusCtrl Register (R3041 only)
The R3041 CPU has many hardware interface options not available on
other members of the R30xx family, which are intended to allow the use of
simpler and cheaper interface and memory components. The BusCtrl
register does most of the configuration work. It needs to be set strictly in
accordance with the needs of the hardware implementation. Note also that
its default settings (from reset) leave the interface compatible with other
R30xx family members.
Figure 3.6, “Fields in the R3041 Bus Control (BusCtrl) Register” shows
the layout of the fields, and their uses are provided for completeness.
Figure 3.6. Fields in the R3041 Bus Control (BusCtrl) Register
(register layout figure, bits 31 to 0; the fields are described below:
Lock, Mem, ED, IO, BE16, BE, BTA, DMA, TC and BR, together with
fixed-value fields such as ‘‘10’’ and 0x300)

• Lock: when software has initialized BusCtrl to its desired state it may
write this bit to prevent its contents being changed again until the
system is reset.
• 10 and other numbers : write exactly the specified bit pattern to this
field (hex used for big ones, but others are given as binary). Improper
values may cause test modes and other unexpected side effects.
• Mem : ‘‘MemStrobe* control’’. Set this field to xy binary, where x set
means the strobe activates on reads, and y set makes it active on
writes.
• ED: ‘‘ExtDataEn* control’’. Encoded as for ‘‘Mem’’. Note that the BR
bit must be zero for this pin to function as an output.
• IO: ‘‘IOStrobe* control’’. Encoded as for ‘‘Mem’’. Note that the BR bit
must be zero for this pin to function as an output.
• BE16: ‘‘BE16(1:0)* read control’’ – 0 to make these pins active on
write cycles only.
• BE: ‘‘BE(3:0)* read control’’ – 0 to make these pins active on write
cycles only.
• BTA: ‘‘Bus turn around time’’. Program with a binary number
between 0 and 3, for 0-3 cycles of guaranteed delay between the end
of a read cycle and the start of the address phase of the next cycle.
This field enables the use of devices with slow tri-state time, and
enables the system designer to save cost by omitting data
transceivers.


• DMA: ‘‘DMA Protocol Control’’, enables ‘‘DMA pulse protocol’’. When
set, the CPU uses its DMA control pins to communicate its desire for
the bus even while a DMA is in progress.
• TC: ‘‘TC* negation control’’. TC* is the output pin which is activated
when the internal timer register Count reaches the value stored in
Compare. Set TC zero to make the TC* pin just pulse for a couple of
clock periods; leave TC as 1, and TC* will be asserted on a compare
and remain asserted until software explicitly clears it (by re-writing
Compare with any value).
If TC* is used to generate a timer interrupt, then use the default (TC
== 1), so that TC* stays asserted until software re-writes Compare.
The pulse (TC == 0) is more useful when the output is being used by
external logic (e.g. to signal a DRAM refresh).
• BR: ‘‘SBrCond(3:2) control’’. Set zero to recycle the SBrCond(3:2)
pins as IOStrobe and ExtDataEn respectively.
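
The two-bit ‘‘xy’’ encoding shared by the Mem, ED and IO fields, and the range check for BTA, can be sketched in C as follows. Only the field values are modelled; the positions of the fields within BusCtrl are deliberately left out, since they must be taken from the register figure or the hardware manual. All names here are hypothetical.

```c
/* Value for a 2-bit strobe control field (Mem, ED or IO):
 * x (high bit) set = strobe active on reads,
 * y (low bit) set  = strobe active on writes. */
static unsigned strobe_field(int on_reads, int on_writes)
{
    return (on_reads ? 2u : 0u) | (on_writes ? 1u : 0u);
}

/* BTA accepts 0 to 3 cycles of bus turn-around delay. */
static int bta_cycles_valid(unsigned cycles)
{
    return cycles <= 3;
}
```

Start-up code would combine such field values, shifted into place, with the mandatory fixed bit patterns before writing BusCtrl, and set Lock only once the value is final.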
PortSize Register (R3041 only)
The PortSize register is used to flag different parts of the program
address space for accesses to 8-, 16- or 32-bit wide memory.
Settings of this register have to be made at a time and to values which
will be mandated by the hardware design. See ‘‘IDT79R3041 Hardware
User’s Manual’’ for details.

What registers are relevant when?
The various CP0 registers and their fields provide support at specific
times during system operation.
• After hardware reset: software must initialize SR to get the CPU into
the right state to bootstrap itself.
• Hardware configuration at start-up: an R3041, R3071, or R3081
requires initialization of Config, BusCtrl, and/or PortSize before very
much will work. The system hardware implementation will dictate the
proper configuration of these registers.
• After any exception: any MIPS exception (apart from one particular
MMU event) invokes a single common ‘‘general exception handler’’
routine, at a fixed address.
On entry, no program registers are saved, only the return address in
EPC. The MIPS hardware knows nothing about stacks. In any case the
exception routine cannot use the user-mode stack for any purpose;
the exception might have been a TLB miss on stack memory.
Exception software will need to use at least one of k0 and k1 to point
to some ‘‘safe’’ (exception-proof) memory space. Key information can
be saved, using the other k0 or k1 register to stage data from control
registers where necessary.
Consult the Cause register to find out what kind of exception it was
and dispatch accordingly.
• Returning from exception: control must eventually be returned to the
value stored in EPC on entry.
Whatever kind of exception it was, software will have to adjust SR
back upon return from exception. The special instruction rfe does the
job; but note that it does not transfer control. To make the jump back
software must load the original EPC value back into a general-purpose
register and use a jr operation.
• Interrupts: SR is used to adjust the interrupt masks, to determine
which (if any) interrupts will be allowed ‘‘higher priority’’ than the
current one. The hardware offers no interrupt prioritization, but the
software can do whatever it likes.
• Instructions which always cause exceptions: these are often used (for
system calls, breakpoints, and to emulate some kinds of instruction).
Handling them sometimes requires partial decoding of the offending
instruction, which can usually be found at the location EPC. But
there is a complication; suppose that an exception occurs just after a
branch but in time to prevent the branch delay slot instruction from
running. Then EPC will point to the branch instruction (resuming
execution starting at the delay slot would cause the branch to be
ignored), and the BD bit will be set.
This Cause register bit flags this event; to find the instruction at
which the exception occurred, add 4 to the EPC value when the BD
bit is set.
• Cache management routines: SR contains bits defining special modes
for cache management. In particular they allow software to isolate the
data cache, and to swap the roles of the instruction and data caches.
The subsequent chapters will describe appropriate treatment of these
registers, and provide software examples of their use.


Reg No   Name     Used for

0        zero     Always returns 0
1        at       (assembler temporary) Reserved for use by assembler
2-3      v0-v1    Value (except FP) returned by subroutine
4-7      a0-a3    (arguments) First four parameters for a subroutine
8-15     t0-t7    (temporaries) subroutines may use without saving
24-25    t8-t9
16-23    s0-s7    Subroutine ‘‘register variables’’; a subroutine which will write
                  one of these must save the old value and restore it before it
                  exits, so the calling routine sees their values preserved.
26-27    k0-k1    Reserved for use by interrupt/trap handler - may change
                  under your feet
28       gp       global pointer - some runtime systems maintain this to give
                  easy access to (some) ‘‘static’’ or ‘‘extern’’ variables.
29       sp       stack pointer
30       s8/fp    9th register variable. Subroutines which need one can use
                  this as a ‘‘frame pointer’’.
31       ra       Return address for subroutine

Table 2.1. Conventional names of registers with usage mnemonics


More complex modes such as double-register or scaled index must be
implemented with sequences of instructions.

C name   MIPS name   Size(bytes)   Assembler mnemonic

int      word        4             ‘‘w’’ as in lw
long     word        4             ‘‘w’’ as in lw
short    halfword    2             ‘‘h’’ as in lh
char     byte        1             ‘‘b’’ as in lb

t2, 0(t1)
t3, 0(t1)


t1, 0(t2)
t4, t3, t1

is implemented as:
        lwl     t1, 0(t2)
        lwr     t1, 3(t2)
        nop
        add     t4, t3, t1

$f2, 24(t1)

is expanded to two loads to consecutive registers:
        lwc1    $f2, 24(t1)
        lwc1    $f3, 28(t1)


Physical Address
→

0x4000 0000 0xBFFF FFFF


EXCEPTION MANAGEMENT

CHAPTER 4

Integrated Device Technology, Inc.

This chapter describes the software techniques used to recognize and
decode exceptions, save state, dispatch exception service routines, and
return from exception. Various code examples are provided.

EXCEPTIONS
In the MIPS architecture interrupts, traps, system calls and everything
else which disrupts the normal flow of execution are called ‘‘exceptions’’
and handled by a single mechanism. These kinds of events include:
• External events : interrupts, or a bus error on a read. Note that for the
R30xx floating point exceptions are reported as interrupts, since
when the R3000A was originally implemented the FPA was indeed
external.
Interrupts are the only exception conditions which can be disabled
under software control.
• Program errors and unusual conditions : non-existent instructions
(including ‘‘co-processor’’ instructions executed with the appropriate
CUx enable bit in SR cleared), integer overflow, address alignment
errors, accesses outside kuseg in user mode.
• Memory translation exceptions : using an invalid translation, or a write
to a write-protected page; and access to a page for which there is no
translation in the TLB.
• System calls and traps : exceptions deliberately generated by software
to access kernel facilities in a secure way (syscalls, conditional traps
planted by careful code, and breakpoints).
Some things do not cause exceptions, although other CPU architectures
may handle them that way. Software must use other mechanisms to
detect:
• bus errors on write cycles (R30xx CPUs don’t detect these as
exceptions at all; the use of a write buffer would make such an
exception “imprecise”, in that the instruction which generated the
store data is not guaranteed to be the one which recognizes the
exception).
• parity errors detected in the cache (the PE bit in SR is set, but no
exception is signalled).

Precise exceptions
The MIPS architecture implements precise exceptions. This is quite a
useful feature, as it provides:
• Unambiguous proof of cause : after an exception caused by any
internal error, the EPC points to the instruction which caused the
error (it might point to the preceding branch for an instruction which
is in a branch delay slot, but will signal occurrence of this using the
BD bit).
• Exceptions are seen in instruction sequence : exceptions can arise at
several different stages of execution, creating a potential hazard. For
example, if a load instruction suffers a TLB miss the exception won’t
be signalled until the ‘‘MEM’’ pipestage; if the next instruction suffers
an instruction TLB miss (at the ‘‘IF’’ pipestage) the logically second
exception will be signalled first (since the IF occurs earlier in the pipe
than MEM).


To avoid this problem, early-detected exceptions are not activated
until it is known that all previous instructions will complete
successfully; in this case, the instruction TLB miss is suppressed and
the exception caused by the earlier instruction handled. The second
exception will likely happen again upon return from handling the data
fault.
• Subsequent instructions nullified : because of the pipelining,
instructions lying in sequence after the EPC may well have been
started. But the architecture guarantees that no effects produced by
these instructions will be visible in the registers or CPU state; and no
effect at all will occur which will prevent execution being restarted at
the EPC.
Note that this isn’t quite true of, for example, the result registers in
the integer multiply unit (logically, the architecture considers these
changed by the initiation of a multiply or divide). But provided that
the instruction arrangement rules required by the assembler are
followed, no problems will arise.
The implementation of precise exceptions requires a number of clever
techniques. For example, the FPA cannot update the register file until it
knows that the operation will not generate an exception. However, the
R30xx family contains logic to allow multi-cycle FPA operations to occur
concurrently with integer operations, yet maintain precise exceptions.

When exceptions happen
Since exceptions are precise, the architecture determines that an
exception seems to have happened just before the execution of the
instruction which caused it. The first fetch from the exception routine will
be made within 1 clock of the time when the faulting instruction would
have finished; in practice it is often faster.
On an interrupt, the last instruction to be completed before interrupt
processing starts will be the one which has just finished its MEM stage
when the interrupt is detected. The EPC target will be the one which has
just finished its ALU stage.
However, take care; some of the interrupt inputs to R30xx family CPUs
are resynchronised internally (to support interrupt signalling from
asynchronous sources) and the interrupt will be detected only on the rising
edge of the second clock after the interrupt becomes active.

Exception vectors
Unlike most CISC processors, the MIPS CPU does no part of the job of
dispatching exceptions to specialist routines to deal with individual
conditions. The rationale for this is twofold:
• on CISC CPUs this feature is not so useful in practice as one might
hope. For example, most interrupts are likely to share code for saving
registers and it is common for CISC microcode to spend time
dispatching to different interrupt entry points, where system software
loads a code number and jumps back to a common handler.
• on a RISC CPU ordinary code is fast enough to be used in preference
to microcode.
Only one exception is handled differently; a TLB miss on an address in
kuseg. Although the architecture uses software to handle this condition
(which occurs very frequently in a heavily-used multi-tasking, virtual
memory OS), there is significant architectural support for a ‘‘preferred’’
scheme for TLB refill. The preferred refill scheme can be completed in
about 13 clocks.
It is also useful to have two alternate pairs of entry points. It is essential
for high performance to locate the vectors in cached memory for OS use,
but this is highly undesirable at start-up; the need for a robust and
self-diagnosing start-up sequence mandates the use of uncached read-only
memory for vectors.


So the exception system adds four more “magic” addresses to the one
used for system start-up. The reset mechanism on the MIPS CPU is
remarkably like the exception mechanism, and is sometimes referred to as
the reset exception. The complete list of exception vector addresses is
shown in Table 4.1, “Reset and exception entry points (vectors) for R30xx
family”:
Program                     Physical
address       ‘‘segment’’   Address       Description

0x8000 0000   kseg0         0x0000 0000   TLB miss on kuseg reference only.
0x8000 0080   kseg0         0x0000 0080   All other exceptions.
0xbfc0 0100   kseg1         0x1fc0 0100   Uncached alternative kuseg TLB
                                          miss entry point (used if SR bit
                                          BEV set).
0xbfc0 0180   kseg1         0x1fc0 0180   Uncached alternative for all other
                                          exceptions (used if SR bit BEV set).
0xbfc0 0000   kseg1         0x1fc0 0000   The ‘‘reset exception’’.

Table 4.1. Reset and exception entry points (vectors) for R30xx family

The 128 byte (0x80) gap between the two exception vectors is because
the MIPS architects felt that 32 instructions would be enough to code the
user-space TLB miss routine, saving a branch instruction without wasting
too much memory.
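
The choice of entry point in Table 4.1 depends on only two things: the BEV bit in SR and whether the exception is a TLB miss on a kuseg address. A small C model of that selection follows; it is a sketch for illustration, and the function and macro names are not from any IDT source.

```c
#include <stdint.h>

#define RESET_VECTOR 0xbfc00000u /* the "reset exception" entry point */

/* Program address of the exception entry point.
 * bev:       non-zero if the BEV bit is set in SR
 * utlb_miss: non-zero for a TLB miss on a kuseg reference */
static uint32_t exception_vector(int bev, int utlb_miss)
{
    uint32_t base = bev ? 0xbfc00100u : 0x80000000u;

    /* The general exception vector sits 0x80 bytes (32 instructions)
     * above the kuseg TLB miss vector. */
    return utlb_miss ? base : base + 0x80u;
}
```

Such a function is mainly useful in simulators and debug monitors, for instance to decide where to plant breakpoints or where to copy handler code.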
So on an exception, the CPU:
1) sets up EPC to point to the restart location.
2) saves the pre-existing user-mode and interrupt-enable flags in SR
   by pushing the 3-entry stack inside SR, and changes to kernel
   mode with interrupts disabled.
3) sets up Cause so that software can see the reason for the
   exception. On address exceptions BadVaddr is also set. Memory
   management system exceptions set up some of the MMU
   registers too; see the chapter on memory management for more
   detail.
4) transfers control to the exception entry point.
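
Step 2 above, and its reversal by rfe, can be modelled in a few lines of C. The sketch assumes the low six bits of SR hold the stack as KUo, IEo, KUp, IEp, KUc, IEc (bit 5 down to bit 0), with KUc == 1 meaning user mode; the function names are invented for this example.

```c
#include <stdint.h>

/* What the CPU does to SR on an exception: push the KU/IE stack,
 * leaving KUc = 0 (kernel mode) and IEc = 0 (interrupts disabled).
 * The old "o" pair is lost. */
static uint32_t sr_on_exception(uint32_t sr)
{
    return (sr & ~0x3fu) | ((sr << 2) & 0x3cu);
}

/* What rfe does: pop the stack, copying the "p" pair into "c" and the
 * "o" pair into "p". The "o" pair itself is left unchanged. */
static uint32_t sr_on_rfe(uint32_t sr)
{
    return (sr & ~0x0fu) | ((sr >> 2) & 0x0fu);
}
```

Because the stack is only three entries deep, a handler which may take a second nested exception must save SR to memory first; two pushes in a row discard the flags of the original interrupted program.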

Exception handling – basics
Any MIPS exception handler has to go through the same stages:
• Bootstrapping : on entry to the exception handler very little of the state
of the interrupted program has been saved, so the first job is to
provide room to preserve relevant state information.
Almost inevitably, this is done by using the k0 and k1 registers (which
are reserved for ‘‘kernel mode’’ use, and therefore should contain no
application program state), to reference a piece of memory which can
be used for other register saves.
• Dispatching different exceptions : consult the Cause register. The
initial decision is likely to be made on the ‘‘ExcCode’’ field, which is
thoughtfully aligned so that its code value (between 0 and 31) can be
used to index an array of words without a shift. The code will be
something like this:
        mfc0    t1, C0_CAUSE
        and     t2, t1, 0x7c            # ExcCode is bits 6:2 of Cause
        lw      t2, tablebase(t2)
        jr      t2

• Constructing the exception processing environment : complex exception
handling routines may be written in a high level language; in addition,
software may wish to be able to use standard library routines. To do
this, software will have to switch to a suitable stack, and save the
values of all registers which “called subroutines” may use.
• Processing the exception : this is system and cause dependent.
• Returning from an exception : The return address is contained in the
EPC register on exception entry; the value must be placed into a
general purpose register for return from exception (note that the EPC
value may have been placed on the stack at exception entry).
Returning control is now done with a jr instruction, and the change
of state back from kernel to the previous mode is done by an rfe
instruction after the jr, in the delay slot.
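
The dispatch stage described above can also be expressed in C, once enough state has been saved to run compiled code. The following is a sketch under stated assumptions: ExcCode occupies bits 6:2 of Cause, so masking with 0x7c gives a byte offset into a table of 32-bit words; every name in it (exc_table, exc_install, exc_dispatch, count_syscall) is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define NUM_EXCCODES 32

typedef void (*exc_handler_t)(uint32_t cause, uint32_t epc);

static exc_handler_t exc_table[NUM_EXCCODES];

/* Byte offset into a table of words: ExcCode is already scaled by 4. */
static size_t exc_table_offset(uint32_t cause)
{
    return cause & 0x7cu;
}

/* Register a handler for one ExcCode; returns 0 on success. */
static int exc_install(unsigned code, exc_handler_t handler)
{
    if (code >= NUM_EXCCODES)
        return -1;
    exc_table[code] = handler;
    return 0;
}

/* Call the handler for this Cause value; returns -1 if none installed. */
static int exc_dispatch(uint32_t cause, uint32_t epc)
{
    exc_handler_t handler = exc_table[exc_table_offset(cause) / 4];

    if (handler == NULL)
        return -1;
    handler(cause, epc);
    return 0;
}

/* Example handler: just counts system calls. */
static unsigned syscall_count;

static void count_syscall(uint32_t cause, uint32_t epc)
{
    (void)cause;
    (void)epc;
    syscall_count++;
}
```

A system would call exc_install(8, count_syscall) during initialization and exc_dispatch() from its common handler; an unhandled code (return value -1) would normally be treated as fatal.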

Nesting exceptions
In many cases the system may wish to permit (or will not be able to
avoid) further exceptions occurring within the exception processing
routine – nested exceptions.
If improperly handled, this could cause chaos; vital state for the
interrupted program is held in EPC and SR, and another exception would
overwrite them. To permit nested exceptions, these values must be saved
elsewhere. Moreover, once exceptions are re-enabled, software can no
longer rely on the values of k0 and k1, since a subsequent (nested)
exception may alter their values.
The normal approach to this is to define an exception frame; a
memory-resident data structure with fields to store incoming register
values, so that they can be retrieved on return. Exception frames are
usually arranged logically as a stack.
Stack resources are consumed by each exception, so arbitrarily nested
exceptions cannot be tolerated. Most systems sort exceptions into a
priority order, and arrange that while an exception is being processed only
higher-priority exceptions are permitted. Such systems need have only as
many exception frames as there are priority levels.
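
A fixed pool of exception frames, one per priority level and used as a stack, might look like the C sketch below. The frame layout, the depth of 4 and all the names are assumptions made for illustration; a real system saves whatever registers its handlers may disturb, and does the first part of the save in assembler.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_NESTING 4 /* hypothetical number of priority levels */

struct exc_frame {
    uint32_t epc;      /* return point of the interrupted code */
    uint32_t sr;       /* SR as found on entry */
    uint32_t cause;    /* Cause as found on entry */
    uint32_t regs[32]; /* general-purpose registers */
};

static struct exc_frame frame_stack[MAX_NESTING];
static size_t frame_depth;

/* Claim the next frame and record the key CP0 state in it.
 * Returns NULL if nesting is already at its limit. */
static struct exc_frame *frame_push(uint32_t epc, uint32_t sr, uint32_t cause)
{
    struct exc_frame *f;

    if (frame_depth >= MAX_NESTING)
        return NULL;
    f = &frame_stack[frame_depth++];
    f->epc = epc;
    f->sr = sr;
    f->cause = cause;
    return f;
}

/* Release the most recent frame; returns NULL if none is in use. */
static struct exc_frame *frame_pop(void)
{
    return frame_depth ? &frame_stack[--frame_depth] : NULL;
}
```

Interrupts should be re-enabled only after frame_push() has captured EPC and SR, and disabled again before frame_pop() restores them.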
Software can inhibit certain exceptions, as follows:
• Interrupts : can be individually masked by software to conform to
system priority rules;
• Privilege Violations : can’t happen in kernel mode; virtually all
exception service routines will execute in kernel mode;
• Addressing errors and TLB misses : software must be written to
ensure that these never happen when processing higher priority
exceptions.
Typical system priorities are (lowest first): non-exception code, TLB miss
on kuseg address, TLB miss on kseg2 address, interrupt (lowest)...
interrupt (highest), illegal instructions and traps, bus errors.

An exception routine
The following is an exception routine from IDT/sim.
It receives exceptions, saves all state, and calls the appropriate service
routine. It also shows the code used to install the exception handler in
memory.
/*
** exception.s - contains functions for setting up and
**               handling exceptions
**
** Copyright 1989 Integrated Device Technology, Inc.
** All Rights Reserved
**
*/

#include "iregdef.h"
#include "idtcpu.h"
#include "idtmon.h"
#include "setjmp.h"
#include "excepthdr.h"

/*
** move_exc_code() - moves the exception code to the utlb and gen
**                   exception vectors
*/
FRAME(move_exc_code,sp,0,ra)
        .set    noreorder
        la      t1,exc_utlb_code
        la      t2,exc_norm_code
        li      t3,UT_VEC
        li      t4,E_VEC
        li      t5,VEC_CODE_LENGTH
1:
        lw      t6,0(t1)
        lw      t7,0(t2)
        sw      t6,0(t3)
        sw      t7,0(t4)
        addiu   t1,4
        addiu   t3,4
        addiu   t4,4
        subu    t5,4
        bne     t5,zero,1b
        addiu   t2,4
        move    t5,ra                   # assumes clear_cache doesnt use t5
        li      a0,UT_VEC
        jal     clear_cache
        li      a1,VEC_CODE_LENGTH
        nop
        li      a0,E_VEC
        jal     clear_cache
        li      a1,VEC_CODE_LENGTH
        move    ra,t5                   # restore ra
        j       ra
        nop
        .set    reorder
ENDFRAME(move_exc_code)
/*
** enable_int(mask) - enables interrupts - mask is positioned so it only
**                    needs to be or'ed into the status reg. This
**                    also does some other things !!!! caution should
**                    be used if invoking this while in the middle
**                    of a debugging session where the client may have
**                    nested interrupts.
**
*/
FRAME(enable_int,sp,0,ra)
        .set    noreorder
        la      t0,client_regs
        lw      t1,R_SR*4(t0)
        nop
        or      t1,0x4
        or      t1,a0
        sw      t1,R_SR*4(t0)
        mfc0    t0,C0_SR
        or      a0,1
        or      t0,a0
        mtc0    t0,C0_SR
        j       ra
        nop
        .set    reorder
ENDFRAME(enable_int)
/*
** disable_int(mask) - disable the interrupt - mask is the complement
**                     of the bits to be cleared - i.e. to clear ext int
**                     5 the mask would be - 0xffff7fff
*/
FRAME(disable_int,sp,0,ra)
        .set    noreorder
        la      t0,client_regs
        lw      t1,R_SR*4(t0)
        nop
        and     t1,a0
        sw      t1,R_SR*4(t0)
        mfc0    t0,C0_SR
        nop
        and     t0,a0
        mtc0    t0,C0_SR
        j       ra
        nop
        .set    reorder
ENDFRAME(disable_int)
/*
** the following sections of code are copied to the vector area
**      at location 0x80000000 (utlb miss) and location 0x80000080
**      (general exception).
**
*/
        .set    noreorder
        .set    noat                    # must be set so la does not use at

FRAME(exc_norm_code,sp,0,ra)
        la      k0,except_regs
        sw      AT,R_AT*4(k0)
        sw      gp,R_GP*4(k0)
        sw      v0,R_V0*4(k0)
        li      v0,NORM_EXCEPT
        la      AT,exception
        j       AT
        nop
ENDFRAME(exc_norm_code)
FRAME(exc_utlb_code,sp,0,ra)
        la      k0,except_regs
        sw      AT,R_AT*4(k0)
        sw      gp,R_GP*4(k0)
        sw      v0,R_V0*4(k0)
        li      v0,UTLB_EXCEPT
        la      AT,exception
        j       AT
        nop
        .set    reorder


/*
** common exception handling code
** Save various registers so we can print informative messages
** for faults (whether in monitor or client mode)
**      Reg.(k0) points to the exception register save area.
**      If we are in client mode then some of these values will
**      have to be copied to the client register save area.
*/
        .set    noreorder
exception:
        sw      v0,R_EXCTYPE*4(k0)      # save exception type (gen or utlb)
        sw      v1,R_V1*4(k0)
        mfc0    v0,C0_EPC
        mfc0    v1,C0_SR
        sw      v0,R_EPC*4(k0)          # save the pc at the time of the exception
        sw      v1,R_SR*4(k0)
        .set    noat
        la      AT,client_regs          # get address of client reg save area
        mfc0    v0,C0_BADVADDR
        mfc0    v1,C0_CAUSE
        sw      v0,R_BADVADDR*4(k0)
        sw      v0,R_BADVADDR*4(AT)
        sw      v1,R_CAUSE*4(k0)
        sw      v1,R_CAUSE*4(AT)
        sw      sp,R_SP*4(k0)
        sw      sp,R_SP*4(AT)
        lw      v0,user_int_fast        # see if a client wants a shot at it
        sw      a0,R_A0*4(k0)
        sw      a0,R_A0*4(AT)
        sw      ra,R_RA*4(k0)
        sw      ra,R_RA*4(AT)
        lw      sp,fault_stack          # use "fault" stack
        beq     v0,zero,1f              # skip the following if no client
        nop
        move    a0,AT
        jal     v0
        nop
        la      k0,except_regs
        la      AT,client_regs
        beq     v0,zero,1f              # returns false if user did not handle
        nop
        la      v1,except_regs
        lw      ra,R_RA*4(v1)
        lw      AT,R_AT*4(v1)
        lw      gp,R_GP*4(v1)
        lw      v0,R_V0*4(v1)
        lw      sp,R_SP*4(v1)
        lw      a0,R_A0*4(v1)
        lw      k0,R_EPC*4(v1)
        lw      v1,R_V1*4(v1)
        j       k0
        rfe
/*
** Save registers if in client mode
** then change mode to prom mode currently k0 is pointing
** exception reg. save area - v0, v1, AT, gp, sp regs were saved
** epc, sr, badvaddr and cause were also saved.
*/
1:
        lw      v0,R_MODE*4(AT)         # get the current op. mode
        lw      v1,R_EXCTYPE*4(k0)
        sw      v0,R_MODE*4(k0)         # save the current prom mode
        sw      v1,R_EXCTYPE*4(AT)
        li      v1,MODE_MONITOR         # see if it
        beq     v0,v1,nosave            # was in prom mode
        nop
        li      v0,MODE_MONITOR
        sw      v0,R_MODE*4(AT)         # now in prom mode
        lw      v0,R_GP*4(k0)
        lw      v1,R_EPC*4(k0)
        sw      v0,R_GP*4(AT)
        sw      v1,R_EPC*4(AT)
        lw      v0,R_SR*4(k0)
        lw      v1,R_AT*4(k0)
        sw      v0,R_SR*4(AT)
        sw      v1,R_AT*4(AT)
        lw      v0,R_V0*4(k0)
        lw      v1,R_V1*4(k0)
        sw      v0,R_V0*4(AT)
        sw      v1,R_V1*4(AT)
        sw      a1,R_A1*4(AT)
        sw      a2,R_A2*4(AT)
        sw      a3,R_A3*4(AT)
        sw      t0,R_T0*4(AT)
        sw      t1,R_T1*4(AT)
        sw      t2,R_T2*4(AT)
        sw      t3,R_T3*4(AT)
        sw      t4,R_T4*4(AT)
        sw      t5,R_T5*4(AT)
        sw      t6,R_T6*4(AT)
        sw      t7,R_T7*4(AT)
        sw      s0,R_S0*4(AT)
        sw      s1,R_S1*4(AT)
        sw      s2,R_S2*4(AT)
        sw      s3,R_S3*4(AT)
        sw      s4,R_S4*4(AT)
        sw      s5,R_S5*4(AT)
        sw      s6,R_S6*4(AT)
        sw      s7,R_S7*4(AT)
        sw      t8,R_T8*4(AT)
        li      v0,0xbababadd           # This reg (k0) is invalid
        sw      t9,R_T9*4(AT)
        sw      v0,R_K0*4(AT)           # should be obvious
        sw      k1,R_K1*4(AT)
        sw      fp,R_FP*4(AT)
        lw      v0,status_base
        move    v1,AT
        and     v0,SR_CU1
        beq     v0,zero,1f              # only save fpu regs if present
        move    AT,v1
        lw      v1,R_SR*4(AT)
        and     v0,v1
        mtc0    v0,C0_SR
        nop
        cfc1    v0,$30
        cfc1    v1,$31
        sw      v0,R_FEIR*4(AT)
        sw      v1,R_FCSR*4(AT)
        swc1    fp0,R_F0*4(AT)
        swc1    fp1,R_F1*4(AT)
        swc1    fp2,R_F2*4(AT)
        swc1    fp3,R_F3*4(AT)
        swc1    fp4,R_F4*4(AT)
        swc1    fp5,R_F5*4(AT)
        swc1    fp6,R_F6*4(AT)
        swc1    fp7,R_F7*4(AT)
        swc1    fp8,R_F8*4(AT)
        swc1    fp9,R_F9*4(AT)
        swc1    fp10,R_F10*4(AT)
        swc1    fp11,R_F11*4(AT)
        swc1    fp12,R_F12*4(AT)
        swc1    fp13,R_F13*4(AT)
        swc1    fp14,R_F14*4(AT)
        swc1    fp15,R_F15*4(AT)
        swc1    fp16,R_F16*4(AT)
        swc1    fp17,R_F17*4(AT)
        swc1    fp18,R_F18*4(AT)
        swc1    fp19,R_F19*4(AT)
        swc1    fp20,R_F20*4(AT)
        swc1    fp21,R_F21*4(AT)
        swc1    fp22,R_F22*4(AT)
        swc1    fp23,R_F23*4(AT)
        swc1    fp24,R_F24*4(AT)
        swc1    fp25,R_F25*4(AT)
        swc1    fp26,R_F26*4(AT)
        swc1    fp27,R_F27*4(AT)
        swc1    fp28,R_F28*4(AT)
        swc1    fp29,R_F29*4(AT)
        swc1    fp30,R_F30*4(AT)
        swc1    fp31,R_F31*4(AT)

1:
        mflo    v0
        mfhi    v1
        sw      v0,R_MDLO*4(AT)
        sw      v1,R_MDHI*4(AT)
        mfc0    v0,C0_INX
        mfc0    v1,C0_RAND
        sw      v0,R_INX*4(AT)
        sw      v1,R_RAND*4(AT)
        mfc0    v0,C0_TLBLO
        mfc0    v1,C0_TLBHI
        sw      v0,R_TLBLO*4(AT)
        mfc0    v0,C0_CTXT
        sw      v1,R_TLBHI*4(AT)
        sw      v0,R_CTXT*4(AT)
        .set    at
nosave:
        .set    reorder
        j       exception_handler
ENDFRAME(exc_utlb_code)
/*
** resume -- resume execution of client code
*/
FRAME(resume,sp,0,ra)
        jal     install_sticky
        jal     clr_extern_brk
        jal     clear_remote_int
        .set    noat
        .set    noreorder
        la      AT,client_regs
        lw      v0,status_base
        move    v1,AT
        and     v0,SR_CU1
        beq     v0,zero,1f              # only restore fpu regs if present
        move    AT,v1
        lw      v1,R_SR*4(AT)
        nop
        or      v0,v1
        mtc0    v0,C0_SR
        lw      v1,R_FCSR*4(AT)
        lwc1    fp0,R_F0*4(AT)
        ctc1    v1,$31
        lwc1    fp1,R_F1*4(AT)
        lwc1    fp2,R_F2*4(AT)
        lwc1    fp3,R_F3*4(AT)
        lwc1    fp4,R_F4*4(AT)
        lwc1    fp5,R_F5*4(AT)
        lwc1    fp6,R_F6*4(AT)
        lwc1    fp7,R_F7*4(AT)
        lwc1    fp8,R_F8*4(AT)
        lwc1    fp9,R_F9*4(AT)
        lwc1    fp10,R_F10*4(AT)
        lwc1    fp11,R_F11*4(AT)
        lwc1    fp12,R_F12*4(AT)
        lwc1    fp13,R_F13*4(AT)
        lwc1    fp14,R_F14*4(AT)
        lwc1    fp15,R_F15*4(AT)
        lwc1    fp16,R_F16*4(AT)
        lwc1    fp17,R_F17*4(AT)
        lwc1    fp18,R_F18*4(AT)
        lwc1    fp19,R_F19*4(AT)
        lwc1    fp20,R_F20*4(AT)
        lwc1    fp21,R_F21*4(AT)
        lwc1    fp22,R_F22*4(AT)
        lwc1    fp23,R_F23*4(AT)
        lwc1    fp24,R_F24*4(AT)
        lwc1    fp25,R_F25*4(AT)
        lwc1    fp26,R_F26*4(AT)
        lwc1    fp27,R_F27*4(AT)
        lwc1    fp28,R_F28*4(AT)
        lwc1    fp29,R_F29*4(AT)
        lwc1    fp30,R_F30*4(AT)
        lwc1    fp31,R_F31*4(AT)
1:
        lw      a0,R_A0*4(AT)
        lw      a1,R_A1*4(AT)
        lw      a2,R_A2*4(AT)
        lw      a3,R_A3*4(AT)
        lw      t0,R_T0*4(AT)
        lw      t1,R_T1*4(AT)
        lw      t2,R_T2*4(AT)
        lw      t3,R_T3*4(AT)
        lw      t4,R_T4*4(AT)
        lw      t5,R_T5*4(AT)
        lw      t6,R_T6*4(AT)
        lw      t7,R_T7*4(AT)
        lw      s0,R_S0*4(AT)
        lw      s1,R_S1*4(AT)
        lw      s2,R_S2*4(AT)
        lw      s3,R_S3*4(AT)
        lw      s4,R_S4*4(AT)
        lw      s5,R_S5*4(AT)
        lw      s6,R_S6*4(AT)
        lw      s7,R_S7*4(AT)
        lw      t8,R_T8*4(AT)
        lw      t9,R_T9*4(AT)
        lw      k1,R_K1*4(AT)
        lw      gp,R_GP*4(AT)
        lw      fp,R_FP*4(AT)
        lw      ra,R_RA*4(AT)
        lw      v0,R_MDLO*4(AT)
        lw      v1,R_MDHI*4(AT)
        mtlo    v0
        mthi    v1
        lw      v0,R_INX*4(AT)
        lw      v1,R_TLBLO*4(AT)
        mtc0    v0,C0_INX
        mtc0    v1,C0_TLBLO
        lw      v0,R_TLBHI*4(AT)
        lw      v1,R_CTXT*4(AT)
        mtc0    v0,C0_TLBHI
        mtc0    v1,C0_CTXT
        lw      v0,R_CAUSE*4(AT)
        lw      v1,R_SR*4(AT)
        mtc0    v0,C0_CAUSE             /* only sw0 and 1 writable */
        move    v0,AT
        and     v1,~(SR_KUC|SR_IEC|SR_PE) /* make sure we aren't intr */
        mtc0    v1,C0_SR
        li      k0,MODE_USER
        move    AT,v0
        sw      k0,R_MODE*4(AT)         /* reset mode */
        lw      v1,R_V1*4(AT)
        lw      sp,R_SP*4(AT)
        lw      k0,R_EPC*4(AT)
        lw      v0,R_V0*4(AT)
        lw      AT,R_AT*4(AT)
        j       k0
        rfe
        .set    reorder
        .set    at
ENDFRAME(resume)
/*
** do_call(procedure, arg1, arg2, arg3, arg4, arg5, arg6, arg7, arg8)
** interface for call command to client code
** copies arguments to new frame and sets up gp for client
*/
#define CALLFRM ((8*4)+4+4)
FRAME(do_call, sp,CALLFRM,ra)
        subu    sp,CALLFRM
        sw      ra,CALLFRM-4(sp)
        sw      gp,CALLFRM-8(sp)
        move    v0,a0
        move    a0,a1
        move    a1,a2
        move    a2,a3
        lw      a3,CALLFRM+(4*4)(sp)
        lw      v1,CALLFRM+(5*4)(sp)
        sw      v1,4*4(sp)
        lw      v1,CALLFRM+(6*4)(sp)
        sw      v1,5*4(sp)
        lw      v1,CALLFRM+(7*4)(sp)
        sw      v1,6*4(sp)
        lw      v1,CALLFRM+(8*4)(sp)
        sw      v1,7*4(sp)
        la      t1,client_regs
        lw      gp,R_GP*4(t1)
        jal     v0
        lw      gp,CALLFRM-8(sp)
        lw      ra,CALLFRM-4(sp)
        addu    sp,CALLFRM
        j       ra
ENDFRAME(do_call)
/*
** clear_stat() -- clear status register
** returns current sr
*/
FRAME(clear_stat,sp,0,ra)
        .set    noreorder
        lw      v1,status_base
        mfc0    v0,C0_SR
        mtc0    v1,C0_SR
        j       ra
        nop
ENDFRAME(clear_stat)
        .set    reorder

/*
** setjmp(jmp_buf) -- save current context for non-local goto's
** return 0
*/
FRAME(setjmp,sp,0,ra)
        sw      ra,JB_PC*4(a0)
        sw      sp,JB_SP*4(a0)
        sw      fp,JB_FP*4(a0)
        sw      s0,JB_S0*4(a0)
        sw      s1,JB_S1*4(a0)
        sw      s2,JB_S2*4(a0)
        sw      s3,JB_S3*4(a0)
        sw      s4,JB_S4*4(a0)
        sw      s5,JB_S5*4(a0)
        sw      s6,JB_S6*4(a0)
        sw      s7,JB_S7*4(a0)
        move    v0,zero
        j       ra
ENDFRAME(setjmp)

/*
** longjmp(jmp_buf, rval)
*/
FRAME(longjmp,sp,0,ra)
        lw      ra,JB_PC*4(a0)
        lw      sp,JB_SP*4(a0)
        lw      fp,JB_FP*4(a0)
        lw      s0,JB_S0*4(a0)
        lw      s1,JB_S1*4(a0)
        lw      s2,JB_S2*4(a0)
        lw      s3,JB_S3*4(a0)
        lw      s4,JB_S4*4(a0)
        lw      s5,JB_S5*4(a0)
        lw      s6,JB_S6*4(a0)
        lw      s7,JB_S7*4(a0)
        move    v0,a1
        j       ra
ENDFRAME(longjmp)
/*
** wbflush() flush the write buffer - this is specific for each hardware
** configuration.
*/
FRAME(wbflush,sp,0,ra)
        .set    noreorder
        lw      t0,wbflush              # read an uncached memory location
        j       ra
        nop
        .set    reorder
ENDFRAME(wbflush)

INTERRUPTS
The MIPS CPUs are provided with 6 individual hardware interrupt bits,
activated by CPU input pins (in the case of the R3081, one pin is used
internally by the FPA), and 2 additional software-settable interrupt bits. An
active level on any pin is sensed in each cycle, and will cause an exception
if enabled.
The interrupt enable comes in two parts:
• The global interrupt enable bit (IEc) in the status register – when zero
no interrupt exception will occur. Simple, fast and comprehensive,
this is what prevents interrupts occurring during the early and
vulnerable stages of processing exceptions. Also, the global interrupt
enable is usually switched back on by an rfe instruction at the end of
an exception routine; this means that the interrupt cannot take effect
until the CPU has returned from the exception and finished with the
EPC register, avoiding undesirable recursion in the interrupt routine.
• The individual interrupt mask bits IM in the status register, one for
each interrupt. Set the bit to 1 to enable the corresponding interrupt.
These are manipulated by software to allow whichever interrupts are
appropriate to the system.

Changes to the individual bits are usually made “under cover”, with
the global interrupt enable off.
What are the software interrupt bits for?
One commonly asked question is: “Why does the CPU provide two bits in
the Cause register which, when set, immediately cause an interrupt
unless masked?”
The clue is in ‘‘unless masked’’. Typically this is used as a mechanism for
high-priority interrupt routines to flag actions which will be performed by
lower-priority interrupt routines, once the system has dealt with all high
priority business. As the high-priority processing completes, the software
will open up the interrupt mask, and the pending software interrupt will
occur.
There is no definitive reason why the same effect should not be simulated
by system software (using flags in memory, for example) but the soft
interrupt bits are convenient because they fit in with the already
provided interrupt handling mechanism.
Pin       SR/Cause    Notes
          bit no
(none)    8           software interrupt
(none)    9           software interrupt
Int0*     10          Cause bit reads 1 when pin low (active)
Int1*     11
Int2*     12
Int3*     13          Usual choice for FPA. The pin corresponding to the
                      interrupt selected for FPA interrupts on an R3081 is
                      effectively a no-connect.
Int4*     14
Int5*     15

Table 4.2. Interrupt bitfields and interrupt pins

Interrupt processing proper begins after an exception is received and the
Type field in Cause signals that it was caused by an interrupt. Table 4.2,
“Interrupt bitfields and interrupt pins” describes the relationship between
Cause bits and input pins.
Once the interrupt exception is “recognized” by the CPU, the stages are:
• Consult the Cause register IP field, logically-‘‘and’’ it with the current
interrupt masks in the SR IM field to obtain a bit-map of active,
enabled interrupt requests. There may be more than one, and any of
them would have caused the interrupt.
• Select one active, enabled interrupt for attention. The selection can be
done simply by using fixed priorities; however, software is free to
implement whatever priority mechanism is appropriate for the
system.
• Software needs to save the old interrupt mask bits of the SR register,
but it is quite likely that the whole SR register was saved in the main
exception routine.
• Change IM in SR to ensure that the current interrupt and all
interrupts of equal or lesser priority are inhibited.
• If not already performed by the main exception routine, save the state
required for nested exception processing.
• Set the global interrupt enable bit IEc in SR to allow higher-priority
interrupts to be processed.
• Call the particular interrupt service routine for the selected, current
interrupt.
• On return, disable interrupts again by clearing IEc in SR, before
returning to the normal exception stream.
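
The selection and masking stages above can be sketched in C. This is only an illustrative model: the function names and the fixed-priority policy (a higher bit number wins) are assumptions made for the example and are not taken from IDT/sim. The bit positions used (IP and IM in bits 8 to 15 of Cause and SR, IEc in bit 0 of SR) follow the R30xx register layout.

```c
#include <assert.h>
#include <stdint.h>

#define IP_SHIFT 8              /* IP (Cause) and IM (SR) occupy bits 8..15 */
#define IP_MASK  0xffu          /* 2 software + 6 hardware interrupt bits */
#define SR_IEC   0x00000001u    /* global interrupt enable (current) */

/*
 * Return the position within the IP field (0..7) of the highest-priority
 * interrupt that is both active and enabled, or -1 if there is none.
 * Assumed policy: a higher bit number means a higher priority.
 */
int select_interrupt(uint32_t cause, uint32_t sr)
{
    uint32_t pending = ((cause & sr) >> IP_SHIFT) & IP_MASK;

    for (int bit = 7; bit >= 0; bit--) {
        if (pending & (1u << bit))
            return bit;
    }
    return -1;
}

/*
 * Build the SR value to run with while servicing interrupt 'bit': the
 * current interrupt and everything of equal or lower priority is masked,
 * and IEc is set so that higher-priority interrupts can still get in.
 */
uint32_t sr_for_service(uint32_t sr, int bit)
{
    assert(bit >= 0 && bit <= 7);

    uint32_t higher = (IP_MASK << (bit + 1)) & IP_MASK; /* strictly higher bits */
    uint32_t im = ((sr >> IP_SHIFT) & IP_MASK) & higher;

    sr &= ~((uint32_t)IP_MASK << IP_SHIFT);             /* drop the old IM field */
    sr |= im << IP_SHIFT;                               /* install the reduced mask */
    return sr | SR_IEC;
}
```

The caller is expected to have saved the original SR first, so that the old mask can be put back once the service routine returns and IEc has been cleared again.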

Conventions and Examples
The following is as simple as an exception routine can be. It does nothing
except increment a counter on each exception:
        .set    noreorder
        .set    noat
xcptgen:
        la      k0,xcptcount            # get address of counter
        lw      k1,0(k0)                # load counter
        nop                             # (load delay)
        addu    k1,1                    # increment counter
        sw      k1,0(k0)                # store counter
        mfc0    k0,C0_EPC               # get EPC
        nop                             # (load delay, mfc0 slow)
        j       k0                      # return to program
        rfe                             # branch delay slot
        .set    at
        .set    reorder

Note that this routine cannot survive a nested exception (the original
return address in EPC would be lost, for example). It never re-enables
interrupts, so an interrupt cannot nest; even so, the counter xcptcount
must be at an address which cannot possibly suffer a TLB miss.


CACHE MANAGEMENT

CHAPTER 5

Integrated Device Technology, Inc.

CACHES AND CACHE MANAGEMENT
R30xx family CPUs implement separate on-chip caches for instructions
(I-cache) and data (D-cache). Following RISC principles, hardware
functions are provided only for normal operation of the caches; software
routines must be provided to initialize the cache following system start-up,
and to invalidate cache data when required†.
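Figure 5.1. Direct mapped cache
(Diagram: the low bits of the memory address form the index into both the
tag store and the cache data store; the tag read out is compared with the
higher address bits to decide "hit?", and the data store supplies the data.)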

The cache’s job is to hold a copy of memory data which has been recently
read or written, so it can be returned quickly to the CPU; in the R30xx
architecture data accesses in the cache take just one clock, and an I-cache
and a D-cache operation can occur together.
When a cacheable location is read (a data load):
• It will be returned from the D-cache if the cache contains the
corresponding physical address and the cache line is valid there
(called a cache ‘‘hit’’). In this case nothing happens at the CPU's
memory interface, so the read is invisible to the outside world.
• If the data is not found in the D-cache (called a cache “miss”), the data
will be read from external memory. According to the CPU type and
how it is set up, it may read one or more words from memory. The
data is loaded into the cache, and normal operation then resumes.
In normal operation, the refill that follows a cache miss overwrites (in
effect “invalidates”) any valid data the targeted cache line already held.
In the R30xx caches, cache data is never more up-to-date than
memory (because the cache is write-through, described below), so the
previously cached data can be discarded without any trouble.

† Note that the R3071 and R3081 do implement a DMA protocol
that allows automatic, hardware-based data cache invalidation.

When data is loaded from an uncacheable location, it is always obtained
from external memory (or a memory-mapped IO location). Most systems
never access the same data location both cached and uncached; if one
does, however, the results are predictable. On an uncacheable load
cache data is neither used nor updated.
When software writes a cached location:
• If the CPU is doing a 32-bit store, the cache is always updated
(possibly discarding data from a previously cached location).
• For byte or half-word stores, the cache will only be updated if the
reference hits in the cache; then data will be extracted from the cache,
merged with the store data, and written back†.
• If the partial-word store misses in the cache, then the cache is left
alone.
• In all cases, the write is also made to main memory.
When the store target is an uncached location the cache is not consulted
or modified.
Figure 5.1, “Direct mapped cache” is a diagrammatic representation of
the way the MIPS cache works. Both caches are:
• Physically indexed, physically tagged: the CPU's program address
(virtual address) is translated to a physical address, just as is used to
address real memory, before being used for the cache lookup. The
TAG comparison (checking for a hit) is also based on physical
addresses.
On certain other CPU families the cache index is based on program
addresses (which are available a bit earlier); some CPUs even use
virtual TAGs, which then require that the cache be flushed at context
switch. But physical caches are easier to manage.
• Direct mapped : Each physical address has only one location in each
cache where it may reside. At each cache index there is only one data
item stored – this will be just one word in the D-cache but is usually
a 4-word line for the I-cache (see Figure 5.1, “Direct mapped cache”).
Next to the data is kept the tag, which stores the memory address for
which this data is a copy.
If the tag matches the high-order (higher number) address bits then
the cache line contains the data the CPU is looking for; the data is
returned and execution continues.
For an I-cache access, the CPU must select one of the four words
based on the lowest address bits.
This is a direct mapped cache because there is only one tag/data pair
at each cache index. More complex caches may have more than one
tag field, and compare them simultaneously with the physical
address.
A direct-mapped cache is very simple, but can suffer from cache
thrashing; so the CPU can run slowly if a program loop is regularly
accessing a pair of locations whose low-order addresses happen to be
equal. To avoid this situation, the R30xx family implements relatively
large caches, which minimize the probability of reasonable program
loops causing cache thrashing.
• Cache lines : the line size is the number of data elements stored with
each tag. For R30xx family CPUs the I-cache implements a 4-word
line size; the D-cache always has 1-word lines.

† In the R30xx family, the data will be merged in the D-Cache.
However, the CPU bus will perform the store only to the bytes
which were actually changed (i.e. the store datum size), facilitating
debugging.

When a cache miss occurs the whole line must be filled from memory.
But it is quite possible to fetch more than a line’s worth of data; and
R30xx family CPUs can be configured to fetch 4 words of data on a
D-cache miss, refilling four 1-word ‘‘lines’’.
• Write through : the D-cache is write-through, meaning that all store
operations result in a store to main memory. This means that all data
in the cache is duplicated in main memory, and can therefore be
discarded at any time. In particular, when data is being read following
a cache miss it can always be stored in the cache without regard for
the data which was previously stored at the same index.
• Partial word write implementations : when the CPU writes only part of
a word, it is essential that any valid cache data should still end up as
a duplicate of main memory. One simple approach is to invalidate the
cache line and to write only to main memory (the main memory must
be byte-addressable). But the R30xx family uses a more efficient
strategy:
a) if the location being written is present in the cache (cache hit) the
cache data is read into the CPU, the partial-word data merged
with it, the whole word written back to the cache, and the
partial-word written to memory.
b) where the write misses in the cache the partial-word write is
performed to memory only, and the cache left alone.
Note that this takes an extra clock, so a partial-word write which hits
in the cache is slower than a whole-word write.
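
The load and store rules described above can be collected into a small C model of a direct-mapped, write-through D-cache with 1-word lines. It is a sketch for illustration only: the cache and memory sizes, the little-endian byte lanes and all of the names are assumptions made here, and timing is not modelled at all.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define CACHE_WORDS 256u            /* assumed: 1 Kbyte D-cache, 1-word lines */
#define MEM_WORDS   4096u           /* assumed: 16 Kbytes of main memory */

struct dcache {
    uint32_t data[CACHE_WORDS];
    uint32_t tag[CACHE_WORDS];
    bool     valid[CACHE_WORDS];
};

static uint32_t memory[MEM_WORDS];  /* stands in for external memory */

static uint32_t index_of(uint32_t addr) { return (addr >> 2) % CACHE_WORDS; }
static uint32_t tag_of(uint32_t addr)   { return (addr >> 2) / CACHE_WORDS; }

static bool hits(const struct dcache *c, uint32_t addr)
{
    uint32_t i = index_of(addr);
    return c->valid[i] && c->tag[i] == tag_of(addr);
}

/* Cached word load: served from the cache on a hit, else refilled from memory. */
uint32_t load_word(struct dcache *c, uint32_t addr)
{
    assert(addr / 4 < MEM_WORDS);
    uint32_t i = index_of(addr);
    if (!hits(c, addr)) {
        c->data[i] = memory[addr >> 2];   /* old contents are simply discarded */
        c->tag[i] = tag_of(addr);
        c->valid[i] = true;
    }
    return c->data[i];
}

/* Word store: the cache line is always updated, and memory is always written. */
void store_word(struct dcache *c, uint32_t addr, uint32_t value)
{
    assert(addr / 4 < MEM_WORDS);
    uint32_t i = index_of(addr);
    c->data[i] = value;
    c->tag[i] = tag_of(addr);
    c->valid[i] = true;
    memory[addr >> 2] = value;            /* write-through */
}

/* Byte store: merged into the cache only on a hit; memory is always written. */
void store_byte(struct dcache *c, uint32_t addr, uint8_t value)
{
    assert(addr / 4 < MEM_WORDS);
    unsigned shift = (addr & 3u) * 8u;    /* assumed little-endian byte lanes */
    uint32_t mask = 0xffu << shift;
    uint32_t *m = &memory[addr >> 2];

    *m = (*m & ~mask) | ((uint32_t)value << shift);
    if (hits(c, addr)) {
        uint32_t i = index_of(addr);
        c->data[i] = (c->data[i] & ~mask) | ((uint32_t)value << shift);
    }                                     /* on a miss the cache is left alone */
}
```

Because every store also reaches memory, the model never needs a "dirty" flag: any line can be overwritten by a refill without first being written back.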

Cache isolation and swapping
No special instructions are provided to explicitly access the caches;
everything has to be done with load and store instructions.
To distinguish operations for cache management from regular memory
references, without having to dedicate a special address region for this
purpose, the R30xx architecture provides bits in the SR to support cache
management:
• The SR mode bit “IsC” will isolate the D-cache; in this mode loads and
stores affect only the cache, and loads also ‘‘hit’’ regardless of whether
the tag matches. As a special mechanism, with the D-cache isolated
a partial-word write will invalidate the appropriate cache line.
Caution: when the D-cache is isolated, not even loads/stores marked
by their address or TLB entry as ‘‘uncached’’ will operate normally.
One consequence of this is that the cache management routines must
not make any data accesses; they are typically written in assembler,
using only register variables.
• The CPU provides a mode where the caches are swapped (SR SwC
bit), to allow the I-Cache to be targeted by store instructions; then the
D-cache acts as an I-cache, and the I-cache acts as the D-cache. Once
the caches are swapped and isolated I-cache entries may be read,
written and invalidated (invalidation uses the same partial word write
mechanism described above).
Note that cache isolation does not stop instruction fetches from
referencing main memory.
The D-cache behaves ‘‘perfectly’’ as an I-cache (provided it was
sufficiently initialized to work as a D-cache) but the I-cache does not
behave properly as a D-cache. It is unlikely that it will ever be useful
to have the caches swapped but not isolated.
If software does use a swapped I-cache for word stores (a partial-word
store invalidates the line, as before) it must make sure those locations
are invalidated before returning to normal operation.


Initializing and sizing the caches
At machine start-up the caches are in a random state, so the result of a
cached read is unpredictable. In addition, following a reset the status
register SwC and IsC bits are also in a random state, so start-up software
had better set them to a known state before attempting any load or store
(even uncached).
Different members of the R3051 family have different cache sizes.
Software will be more portable if it dynamically determines the size of the
I-cache and D-cache at initialization time, rather than hard-wiring a
particular value.
A number of algorithms are possible. Shown below is the code contained
in IDT/sim for cache sizing. The basic algorithm works as follows:
• Isolate the D-cache.
• Swap the caches when sizing the I-cache.
• Write a marker into the initial cache entry.
• Start with the smallest permissible cache size.
• Read the location at an offset equal to the current cache size. If it
contains the marker, the address has wrapped round to the initial entry
and that is the correct size. Otherwise, double the size to try
and repeat this step until the marker is found.
/*
** Config_cache() -- determine sizes of i and d caches
** Sizes stored in globals dcache_size and icache_size
*/
#define CONFIGFRM ((4*4)+4+4)
FRAME(config_cache,sp, CONFIGFRM, ra)
        .set    noreorder
        subu    sp,CONFIGFRM
        sw      ra,CONFIGFRM-4(sp)      # save return address
        sw      s0,4*4(sp)              # save s0 in first regsave slot
        mfc0    s0,C0_SR                # save SR
        mtc0    zero,C0_SR              # disable interrupts
        .set    reorder
        jal     _size_cache
        sw      v0,dcache_size
        li      v0,SR_SWC               # swap caches
        .set    noreorder
        mtc0    v0,C0_SR
        jal     _size_cache
        nop
        sw      v0,icache_size
        mtc0    zero,C0_SR              # swap back caches
        and     s0,~SR_PE               # do not inadvertently clear PE
        mtc0    s0,C0_SR                # restore SR
        .set    reorder
        lw      s0,4*4(sp)              # restore s0
        lw      ra,CONFIGFRM-4(sp)      # restore ra
        addu    sp,CONFIGFRM            # pop stack
        j       ra
ENDFRAME(config_cache)
/*
** _size_cache()
** return size of current data cache
*/
FRAME(_size_cache,sp,0,ra)
        .set    noreorder
        mfc0    t0,C0_SR                # save current sr
        and     t0,~SR_PE               # do not inadvertently clear PE
        or      v0,t0,SR_ISC            # isolate cache
        mtc0    v0,C0_SR
        /*
         * First check if there is a cache there at all
         */
        move    v0,zero
        li      v1,0xa5a5a5a5           # distinctive pattern
        sw      v1,K0BASE               # try to write into cache
        lw      t1,K0BASE               # try to read from cache
        nop
        mfc0    t2,C0_SR
        nop
        .set    reorder
        and     t2,SR_CM
        bne     t2,zero,3f              # cache miss, must be no cache
        bne     v1,t1,3f                # data not equal -> no cache
        /*
         * Clear cache size boundaries to known state.
         */
        li      v0,MINCACHE
1:
        sw      zero,K0BASE(v0)
        sll     v0,1
        ble     v0,MAXCACHE,1b

        li      v0,-1
        sw      v0,K0BASE(zero)         # store marker in cache
        li      v0,MINCACHE             # MIN cache size
2:
        lw      v1,K0BASE(v0)           # Look for marker
        bne     v1,zero,3f              # found marker
        sll     v0,1                    # cache size * 2
        ble     v0,MAXCACHE,2b          # keep looking
        move    v0,zero                 # must be no cache
        .set    noreorder
3:
        mtc0    t0,C0_SR                # restore sr
        j       ra
        nop
ENDFRAME(_size_cache)
        .set    reorder
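
The sizing loop in _size_cache() works because, in an isolated direct-mapped cache, an address beyond the end of the cache simply wraps round to a lower index. The following C model is a sketch of that idea only; the structure, the MINCACHE and MAXCACHE values and the function names are assumptions made for the example, and the initial "is there a cache at all?" test is left out.

```c
#include <assert.h>
#include <stdint.h>

#define MINCACHE 512u                 /* assumed bounds, in bytes */
#define MAXCACHE (256u * 1024u)

/*
 * Model of an isolated direct-mapped cache: every access "hits", and the
 * byte offset wraps modulo the cache size, which the sizing code does not
 * know in advance.
 */
struct iso_cache {
    uint32_t size_bytes;              /* power of two, MINCACHE..MAXCACHE */
    uint32_t words[MAXCACHE / 4];
};

static uint32_t *slot(struct iso_cache *c, uint32_t offset)
{
    return &c->words[(offset % c->size_bytes) / 4];
}

/* Return the cache size in bytes, or 0 if no marker is ever found. */
uint32_t size_cache(struct iso_cache *c)
{
    assert(c->size_bytes >= MINCACHE && c->size_bytes <= MAXCACHE);

    /* Clear every candidate boundary so stale data cannot look like a marker. */
    for (uint32_t off = MINCACHE; off <= MAXCACHE; off <<= 1)
        *slot(c, off) = 0;

    *slot(c, 0) = 0xffffffffu;        /* marker in the initial entry */

    /* The first boundary that reads back the marker has wrapped to entry 0. */
    for (uint32_t off = MINCACHE; off <= MAXCACHE; off <<= 1) {
        if (*slot(c, off) != 0)
            return off;
    }
    return 0;
}
```

Clearing the boundaries before writing the marker matters: the marker must be the last thing written, otherwise the clearing store at the true cache size would wipe it out again.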

In a properly initialized cache, every cache entry is either invalid or
correctly corresponds to a memory location, and also contains correct
parity. Again, the sample code shown is from IDT/sim. The code works as
follows:
• Check that SR bit PZ is cleared to zero (1 disables parity; the R3071
and R3081 contain parity bits, and thus PZ=1 could cause the caches
to be initialized improperly).
• Isolate the D-cache, swap to access the I-cache.
• For each word of the cache: first write a word value (writing correct
tag, data and parity), then write a byte (invalidating the line).
Note that for an I-cache with 4 words per line this is inefficient; it
would be enough to write just one byte in the line to invalidate the
entry. Unless the system uses the invalidate routine often it doesn’t
seem worth the trouble.
FRAME(flush_cache,sp,0,ra)
        lw      t1,icache_size
        lw      t2,dcache_size
        .set    noreorder
        mfc0    t3,C0_SR                # save SR
        nop
        and     t3,~SR_PE               # dont inadvertently clear PE
        beq     t1,zero,_check_dcache   # if no i-cache check d-cache
        nop
        li      v0,SR_ISC|SR_SWC        # disable intr, isolate and swap
        mtc0    v0,C0_SR
        li      t0,K0BASE
        .set    reorder
        or      t1,t0,t1
1:
        sb      zero,0(t0)
        sb      zero,4(t0)
        sb      zero,8(t0)
        sb      zero,12(t0)
        sb      zero,16(t0)
        sb      zero,20(t0)
        sb      zero,24(t0)
        addu    t0,32
        sb      zero,-4(t0)
        bne     t0,t1,1b
        /*
         * flush data cache
         */
_check_dcache:
        li      v0,SR_ISC               # isolate and swap back caches
        .set    noreorder
        mtc0    v0,C0_SR
        nop
        beq     t2,zero,_flush_done
        .set    reorder
        li      t0,K0BASE
        or      t1,t0,t2
1:
        sb      zero,0(t0)
        sb      zero,4(t0)
        sb      zero,8(t0)
        sb      zero,12(t0)
        sb      zero,16(t0)
        sb      zero,20(t0)
        sb      zero,24(t0)
        addu    t0,32
        sb      zero,-4(t0)
        bne     t0,t1,1b

        .set    noreorder
_flush_done:
        mtc0    t3,C0_SR                # un-isolate, enable interrupts
        .set    reorder
        j       ra
ENDFRAME(flush_cache)

Invalidation
Invalidation refers to the act of setting specified cache lines to contain
no valid references to main memory, but to otherwise be consistent (e.g.
valid parity). Software needs to invalidate:
• the D-cache when memory contents have been changed by something
other than store operations from the CPU. Typically this is needed
after some DMA device has written data into memory.
• the I-cache when instructions have been either written by the CPU or
obtained by DMA. The hardware does nothing to prevent the same
locations being used in the I- and D-cache; and an update by the
processor will not change the I-cache contents.
Note that the system could be constructed to use uncached accesses to
those variables shared with a DMA device; the only difference is in
performance. In general small areas where DMA is frequent compared to
CPU activity should be mapped uncached; and larger areas where CPU
activity predominates should be invalidated by the driver at appropriate
points. Bear in mind that invalidating a word of data in the cache is faster
(probably 4-7 times faster) than an uncached load.
To invalidate the cache:
• Figure out the address range to invalidate. Invalidating a region larger
than the cache size is a waste of time.


• Isolate the D-cache. Once it is isolated, the system must at all
costs avoid taking an exception (since the memory interface will be
temporarily disabled). Disable interrupts and ensure that software
which follows cannot cause a memory access exception.
• To work on the I-cache, swap the caches.
• Write a byte value to each cache line in the range.
• (Unswap and) unisolate.
The invalidate routine is normally executed with its instructions
cacheable. This sounds like a lot of trouble; but in fact shouldn’t require
any extra steps to run cached. An invalidation routine in uncached space
will run 4-10 times slower.
Again, the example code fragment shown is taken from IDT/sim:
/*
** clear_cache(base_addr, byte_count)
** flush portion of cache
*/
FRAME(clear_cache,sp,0,ra)

        /*
         * flush instruction cache
         */
        lw      t1,icache_size
        lw      t2,dcache_size
        .set    noreorder
        mfc0    t3,C0_SR                # save SR
        and     t3,~SR_PE               # dont inadvertently clear PE
        nop
        nop
        li      v0,SR_ISC|SR_SWC        # disable intr, isolate and swap
        mtc0    v0,C0_SR
        .set    reorder
        bltu    t1,a1,1f                # cache is smaller than region
        move    t1,a1
1:
        addu    t1,a0                   # ending address + 1
        move    t0,a0
1:
        sb      zero,0(t0)
        sb      zero,4(t0)
        sb      zero,8(t0)
        sb      zero,12(t0)
        sb      zero,16(t0)
        sb      zero,20(t0)
        sb      zero,24(t0)
        addu    t0,32
        sb      zero,-4(t0)
        bltu    t0,t1,1b

        /*
         * flush data cache
         */
        .set    noreorder
        nop
        li      v0,SR_ISC               # isolate and swap back caches
        mtc0    v0,C0_SR
        nop
        .set    reorder
        bltu    t2,a1,1f                # cache is smaller than region
        move    t2,a1
1:
        addu    t2,a0                   # ending address + 1
        move    t0,a0
1:
        sb      zero,0(t0)
        sb      zero,4(t0)
        sb      zero,8(t0)
        sb      zero,12(t0)
        sb      zero,16(t0)
        sb      zero,20(t0)
        sb      zero,24(t0)
        addu    t0,32
        sb      zero,-4(t0)
        bltu    t0,t2,1b

        .set    noreorder
        mtc0    t3,C0_SR                # un-isolate, enable interrupts
        .set    reorder
        j       ra
ENDFRAME(clear_cache)

Testing and probing
During test, debug or when profiling, it may be useful to build up a
picture of the cache contents. Software cannot read the tag value directly,
but, for a valid line, can determine the tag value by exhaustive search:
• isolate the cache;
• load from the cache line at each possible line start address (low order
bits fixed, high order bits ranging over physical memory which exists
in the system). After each load consult the CM bit in SR, which will be
‘‘0’’ only when the tag value matches.
This takes a long time in computer terms; but to fully search a 1K
D-cache with 4Mbytes of cacheable physical memory on a 20MHz processor
will take only a couple of seconds, and will provide very valuable debugging
information. IDT/sim provides this capability.
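
The size of that search is easy to estimate. The sketch below counts the isolated loads needed to try every possible tag at every line, and converts the count to a time. The helper names are invented for the example, and the cost per probe in clocks is purely an assumption (it covers the load, the read of SR and the loop overhead); a figure of a few tens of clocks is a guess rather than a measurement.

```c
#include <assert.h>
#include <stdint.h>

/* Number of isolated loads needed to try every candidate tag for every line. */
uint64_t probe_count(uint64_t cache_bytes, uint64_t line_bytes, uint64_t mem_bytes)
{
    assert(line_bytes != 0 && cache_bytes % line_bytes == 0);
    assert(cache_bytes != 0 && mem_bytes % cache_bytes == 0);

    uint64_t lines = cache_bytes / line_bytes;        /* line start addresses */
    uint64_t tags_per_line = mem_bytes / cache_bytes; /* candidate high-order bits */
    return lines * tags_per_line;
}

/* Elapsed time in whole milliseconds for an assumed cost per probe. */
uint64_t probe_time_ms(uint64_t probes, uint64_t clocks_per_probe, uint64_t clock_hz)
{
    assert(clock_hz != 0);
    return probes * clocks_per_probe * 1000u / clock_hz;
}
```

For a 1K D-cache with 1-word lines and 4Mbytes of memory this gives 256 lines times 4096 tags, or 1,048,576 probes; at an assumed 20 clocks per probe on a 20MHz CPU that is about one second, and at 40 clocks per probe about two.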

Configuration (R3041/71/81 only)
The R3041, R3071, and R3081 processors allow the programmer to
make choices about the cache by setting fields in the Config register:
• Cache refill burst size (R3041/71/81) : by default the R3041 refills
only 1 word in the D-cache on a cache miss; but software can program
it to use 4-word burst reads instead, by setting the Config DBR bit.
The bit can be changed at any time, without needing to invalidate the
cache.
The refill of R3071 and R3081 processors can be configured by
hardware at reset-time, but software can override that choice.
This support is provided in the hope of enhancing performance. The
proper selection for a given system will depend on both the hardware
and the application. Some systems may find an advantage in
“toggling” the bit for various portions of the software. In general, the
proper burst size selection can be determined as follows:
Burst reads make most sense when the memory is capable of
returning a burst of data significantly faster than it can return 4
individual words. Many DRAM systems are like this; most ROM and
static RAM memories are not. Similarly, data accessed from narrow
memory ports should rarely be configured for a multi-word burst.
If programs tend to access memory sequentially (working up or down
a large array, for example) then the burst refill will offer a very useful
degree of data prefetch, and performance will be enhanced. If cache
access is more random, the burst refill may actually reduce
performance (since it involves overwriting cached data with memory
data the program may never use).
As a general rule, the bigger the D-cache, the smaller the penalty for
burst refills.
• Bigger I-cache in exchange for smaller D-cache (R3071/81) : the R3081
cache can be organized either with both I-cache and D-cache 8Kbytes
in size, or with a 16Kbyte I-cache and 4Kbyte D-cache. The
configuration is programmed using the AC bit in the Config register.


After changing the cache configuration both caches should be reinitialized, while running uncached. This means that most systems
will not dynamically reconfigure the caches.
Which configuration is best for a given system is mainly dependent on
the software. Cache effects are extremely hard to predict, and it is
recommended that both configurations be tried and measured, while
running as much of the real system as possible.
As a general rule: with large applications (like in a big OS) the big
I-cache will probably be best. If the system spends most of its time
manipulating lots of data from tight program loops, the big D-cache
may be better.

WRITE BUFFER
The write-through cache common to all R30xx family CPUs can be a big
performance bottleneck. In the average C program only about 10% of
instructions are stores, but these accesses tend to come in bursts; for
example, when a function prologue saves a few registers.
DRAM memory frequently has the characteristic that the first write of a
group takes quite a long time (5-10 clocks typical on these CPUs), and
subsequent ones are relatively fast so long as they follow quickly.
If the CPU simply waits for all writes to complete, the performance hit
will be significant. So the R30xx provides a write buffer, a FIFO store which
keeps a number of entries each containing both data to be written, and the
address at which to write it. The 4-entry queue provided by R30xx family
CPUs is efficient for well-tuned DRAM.
In general, the operation of the write buffer is completely transparent to
software. Occasionally, the programmer needs to be aware of what is
happening:
• Timing relations for IO register accesses : When software performs a
store to write an IO register, the store reaches memory after a small,
but indeterminate, delay. Some consequences are:
— other communication with the IO system (e.g. interrupts) may
happen more quickly – for example, the CPU may get an interrupt
from a device ‘‘after’’ it has been programmed to generate no
interrupts.
— if the IO device needs some time to recover after a write the program
must ensure that the write buffer FIFO is empty before counting
out that time period.
— at the end of interrupt service, when writing to an IO device to clear
the interrupt it is asserting, software must ensure that the
command is actually written to the device, and that it has had time to
respond, before re-enabling that interrupt; otherwise, spurious
interrupts may be signalled.
In these cases the programmer must ensure that the CPU waits while
the write buffer empties. It is good practice to define a subroutine
which does this job; it is traditionally called wbflush(). Hints on
implementing this function are provided later in this chapter.
On CPUs outside the R30xx family, even stranger things can happen:
• Reads overtaking writes : a load instruction (uncached or missing in
the cache) executed while the write buffer FIFO is not empty gives the
CPU a choice: should it finish off the write, or use the memory
interface to fetch data for the load?
The R3041, R3051, R3052 and R3081 all have the same rule, which
avoids potential problems: the write buffer is emptied before the load
occurs.
Although it seems tempting to instead implement a scheme which
checks for conflicts, and allows the read to progress if no write buffer
entry matches the read target address, such a scheme does not avoid
the possible system problems. Specifically, writes to locations which
may have side effects (e.g. semaphores, IO registers, etc.), are not
detected under such a scheme, and can cause great headaches to the
programmer.
• Byte gathering : some write buffers watch for partial-word writes
within the same memory word, and will combine those partial writes
into a single operation. This is not done by any current R30xx family
CPU, because such operation would pose problems with IO register
writes.

Implementing wbflush()
IDT R30xx family CPUs enforce strict write priority (all pending writes
retired to memory before main memory is read). Thus, implementing
wbflush() is as simple as implementing an uncached load (e.g. from the
boot PROM vector). This will stall the CPU until the writes have finished,
and the load finished too. Alternately, the overhead can be minimized by
performing an uncached load from the fastest memory available in the
system.
The code fragment below shows an implementation of wbflush(), taken
from IDT/sim:
/*
** wbflush() flush the write buffer - this is specific for each
** hardware configuration.
*/
FRAME(wbflush,sp,0,ra)
        .set noreorder
        lw      t0,wbflush      # read an uncached memory location
        j       ra
        nop
        .set reorder
ENDFRAME(wbflush)
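The same idea can be expressed in C. The fragment below is a minimal sketch, not the IDT/sim code: the helper names phys_to_kseg1() and wbflush_from() are hypothetical, and it assumes the usual MIPS arrangement in which kseg1 (starting at 0xA0000000) is the uncached window onto low physical memory.

```c
#include <stdint.h>

/* kseg1 is the unmapped, uncached window onto the low 512 Mbytes of
 * physical memory (assumed standard MIPS memory map). */
#define KSEG1_BASE 0xA0000000u

/* Hypothetical helper: kseg1 alias of a physical address below 512 Mbytes. */
static inline uint32_t phys_to_kseg1(uint32_t paddr)
{
    return paddr | KSEG1_BASE;
}

/* Hypothetical helper: perform one load and discard the value.  On real
 * hardware 'uncached' must point into kseg1; with strict write priority
 * the load cannot complete until the write buffer has drained. */
static inline void wbflush_from(volatile const uint32_t *uncached)
{
    (void)*uncached;    /* volatile forces the compiler to emit the load */
}
```

On a target, a call such as wbflush_from((volatile const uint32_t *)phys_to_kseg1(0x1FC00000u)) would read the boot PROM vector; a faster uncached location, where one exists, keeps the stall shorter.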

CHAPTER 6

MEMORY MANAGEMENT AND THE TLB

Integrated Device Technology, Inc.
Some R30xx family processors (“E” versions) have on-chip memory
management hardware. This provides a mechanism for dynamically
translating program addresses in the kuseg and kseg2 regions. The key
piece of hardware is the ‘‘TLB†’’.
Memory management is paged, with a fixed page size of 4 Kbytes.
The low-order 12 bits of the program address are used directly as the
low-order bits of the physical address, so address translation operates
in 4 Kbyte chunks.
The TLB is a 64-entry associative memory. Each entry in an associative
memory consists of a key field and a data field; when presented with a key,
the memory returns the data of any entry where the key matches.
In the R30xx family, the TLB is referred to as ‘‘fully-associative’’; this
emphasizes that all keys are really compared with the input value in
parallel.
The TLB’s key field contains two sections:
• Virtual page number : (VPN) this is just a program address with the low
12 bits cut off, since the low-order bits don’t participate in the
translation process.
• Address Space Identifier (ASID) : this is a magic number used to
stamp translations, and (optionally) is compared with an extended
part of the key. Why?
In multi-tasking systems it is common to have all user-level tasks
executing at the same sort of program addresses (though of course
they are using different physical addresses); they are said to be using
different address spaces. So translation records for different tasks
will often share the same value of ‘‘VPN’’. If the TLB mechanism was
not supported with an ASID, when the OS switches from one task to
another, it would have to find and invalidate all TLB translations
relating to the old task’s address space, to prevent them from being
erroneously used for the new one. This would be desperately
inefficient.
Instead, the OS assigns a 6-bit unique code to each task’s distinct
address space. During normal running this code is kept in the ASID
field of the EntryHi register, and is used together with the program
address to form the lookup key; so a translation with an ASID code
which doesn’t match is quietly ignored.
Since the ASID is only 6 bits long, OS software does have to lend a
hand if there are ever more than 64 address spaces in concurrent use;
but it probably won’t happen too often. In such a system, new tasks
are assigned new ASIDs until all 64 are assigned; at that time, all
tasks have their ASIDs “de-assigned” and the TLB is flushed; as each
task is re-entered, it is given a new ASID. Thus, ASID flushing
is relatively infrequent.
The TLB data field includes:
• Physical frame number (PFN) : the physical address with the low 12
bits cut off. In an address translation, the VPN bits are replaced by
the corresponding PFN bits to form the true physical address.
• Cache control bit (N) : set 1 to make the page uncacheable.

† This is an acronym for ‘‘translation lookaside buffer’’, which is a
look-up table of virtual to physical address translations.

• Write control bit (D) : set 1 to allow stores to this page to happen. The
‘‘D’’ comes from this being called the ‘‘dirty bit’’; a later section on
“Simulating dirty bits” describes a typical use for these bits.
• Valid bit (V) : set 0 to make this entry unusable. This seems pretty
pointless; why have a record loaded into the TLB if the translation is
not usable? But an access to an invalid page produces a different trap
from a TLB refill exception, so making a page invalid means that some
strange conditions can be made to take a different trap, which does
not have to be handled by the superfast refill code.
• Global bit (G) : set to disable the ASID-matching scheme, allowing an
OS to map some program addresses to the same physical address for
all tasks; it can be useful to have some corner of each address space
mapped to the same physical locations. Sharp-eyed or experienced
readers will notice that this means that the global bit is really more
like part of the key than part of the data; the distinction tends to get
blurred in associative memories.
Translating an address is now simple, and goes like this:
• CPU generates a program address : either for an instruction fetch, a
load or a store, in one of the translated address regions. The low 12
bits are separated off, and the resulting VPN together with the current
value of the ASID field in EntryHi used as the key to the TLB.
• TLB matches key : selecting the matching entry. The PFN is glued to
the low-order bits of the program address to form a complete physical
address.
• Valid? : the V and D bits are consulted. If it isn’t valid, or a store is
being attempted with D cleared, the CPU takes a trap. As with all
translation traps, the BadVaddr register will be filled with the
offending program address and TLB registers Context and EntryHi
pre-filled with relevant information. The system software can use
these registers to obtain data for exception service.
• Cached? : if the N bit is clear the CPU looks in the cache for a copy
of the physical location’s data; if it isn’t there it will be fetched
from memory and a copy left in the cache. Where the N bit is set the
CPU neither looks in nor refills the cache.
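The translation steps above can be modelled in a few lines of C. This is a sketch for study only, not a description of the hardware implementation: the bit positions are assumed from the R3000 EntryHi/EntryLo layout, the result names (TLB_REFILL and so on) are hypothetical, and the hardware compares all entries in parallel where this model loops.

```c
#include <stdint.h>

#define TLB_ENTRIES  64
#define HI_VPN_MASK  0xFFFFF000u   /* EntryHi bits 31..12 (assumed layout) */
#define HI_ASID_MASK 0x00000FC0u   /* EntryHi bits 11..6                   */
#define LO_PFN_MASK  0xFFFFF000u   /* EntryLo bits 31..12                  */
#define LO_N 0x800u                /* uncacheable  */
#define LO_D 0x400u                /* write enable */
#define LO_V 0x200u                /* valid        */
#define LO_G 0x100u                /* global       */

struct tlb_entry { uint32_t hi, lo; };

enum tlb_result { TLB_OK, TLB_REFILL, TLB_INVALID, TLB_MODIFIED };

/* Translate vaddr using the current EntryHi (only its ASID field is used).
 * The caller is assumed to pass only kuseg/kseg2 addresses. */
static enum tlb_result tlb_translate(const struct tlb_entry tlb[TLB_ENTRIES],
                                     uint32_t entryhi, uint32_t vaddr,
                                     int is_store,
                                     uint32_t *paddr, int *cached)
{
    for (int i = 0; i < TLB_ENTRIES; i++) {
        const struct tlb_entry *e = &tlb[i];

        if ((e->hi & HI_VPN_MASK) != (vaddr & HI_VPN_MASK))
            continue;                      /* VPN differs */
        if (!(e->lo & LO_G) && ((e->hi ^ entryhi) & HI_ASID_MASK))
            continue;                      /* ASID differs, not global */
        if (!(e->lo & LO_V))
            return TLB_INVALID;            /* matched, but not valid */
        if (is_store && !(e->lo & LO_D))
            return TLB_MODIFIED;           /* store with D clear */

        *paddr  = (e->lo & LO_PFN_MASK) | (vaddr & 0xFFFu);
        *cached = !(e->lo & LO_N);
        return TLB_OK;
    }
    return TLB_REFILL;                     /* no entry matched the key */
}
```

Note that an all-zero array would match VPN 0 in every slot, so a test harness should fill unused entries with distinct non-matching VPNs first.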
Of course, there are only 64 entries in the TLB, which can hold
translations for a maximum of 256 Kbytes of program addresses. This is
far short of enough for most systems. The TLB is almost always going to be
used as a software-maintained ‘‘cache’’ for a much larger set of
translations.
When a program address lookup in the TLB fails, a TLB refill trap is
taken. System software has the job of:
• figuring out whether there is a correct translation; if not the trap will
be dispatched to the software which handles address errors.
• if there is a correct translation, constructing a TLB entry which will
implement it;
• if the TLB is already full (and it almost always is full in running
systems), selecting an entry which can be discarded;
• writing the new entry into the TLB.


See below for how this can be tackled; but note here that although
special CPU features help out with one particular class of
implementations, the software can refill the TLB any way it likes.
Register    CP0
Mnemonic    reg no   Description

EntryHi     10       Together these registers hold a TLB entry. All reads and
EntryLo     2        writes to the TLB must be staged through them.
                     EntryHi also remembers the current ASID.

Index       0        Determines which TLB entry will be read/written by
                     appropriate instructions.

Random      1        Pseudo-random value (actually a free-running counter)
                     used by a tlbwr to write a new TLB entry into a
                     ‘‘randomly’’ selected location.

Context     4        Convenience register provided to speed up the processing
                     of TLB refill traps. The high-order bits are read/write;
                     the low-order 21 bits reflect the BadVaddr value.
                     (The register is designed so that, if the system uses the
                     ‘‘favored’’ arrangement of memory-held copies of memory
                     translation records, it will be set up by a TLB refill
                     trap to point to the memory location of the record needed
                     to map the offending address. This speeds up the process
                     of finding the current memory mapping, and arranging
                     EntryHi/Lo properly.)

Table 6.1. CPU control registers for memory management

MMU registers described
EntryHi, EntryLo
Figure 6.1. EntryHi register (TLB key fields): VPN, ASID
Figure 6.2. EntryLo register (TLB data fields): PFN, N, D, V, G

These two registers represent a TLB entry, and are best considered as a
pair. Fields in EntryHi are:
• VPN : ‘‘virtual page number’’, the high-order bits of a program address.
On a refill exception this field is set up automatically to match the
program address which could not be translated. To write a different
TLB entry, or attempt a TLB probe, software must set it up
“manually”.
• ASID : ‘‘address space identifier’’, normally left holding the OS’ value
for the current address space. This is not changed by exceptions.
Most software systems will deliberately write this field only to setup
the current address space.
However, software must be careful when using tlbr to inspect TLB
entries; the operation overwrites the whole of EntryHi, so software
needs to restore the correct current ASID value afterwards.


Fields in EntryLo are:
• PFN : the high-order bits of the physical address to which values
matching EntryHi’s VPN will be translated.
• N : ‘‘noncacheable’’; 0 to make the access cacheable, 1 for
uncacheable.
• D : ‘‘dirty’’, but really a write-enable bit. 1 to allow writes, 0 and any
store using this translation will be trapped.
• V : ‘‘valid’’, if 0 any address matching this entry will cause an
exception.
• G : ‘‘global’’. When the G bit in a TLB entry is set, that TLB entry will
match solely on the VPN field, regardless of whether the TLB entry’s
ASID field matches the value in EntryHi.
• Fields called ‘‘0’’ : these fields always return zero; but unlike many
reserved fields, they do not need to be written as zero (nothing
happens regardless of the data written). This is important; it means
that the memory-resident data which is used to generate EntryLo
when refilling the TLB can contain some software-interpreted data in
these fields, which the TLB hardware will ignore without the need to
spend precious CPU cycles masking it.
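As an illustration of the last point, a memory-resident page table entry can carry software flags in the bits that EntryLo ignores. The sketch below assumes the R3000 layout (hardware fields in bits 31..8, the ‘‘0’’ field in bits 7..0); the PTE_SW_* flags and both helper names are hypothetical.

```c
#include <stdint.h>

#define LO_PFN_SHIFT 12
#define LO_N         0x800u
#define LO_D         0x400u
#define LO_V         0x200u
#define LO_G         0x100u
#define LO_HW_MASK   0xFFFFFF00u   /* bits the TLB hardware keeps */

/* Hypothetical software-only flags kept in the ignored low bits. */
#define PTE_SW_ACCESSED 0x01u
#define PTE_SW_COW      0x02u

/* Build a page table entry that can be written to EntryLo unmasked. */
static inline uint32_t make_pte(uint32_t pfn, uint32_t hw_flags,
                                uint32_t sw_flags)
{
    return (pfn << LO_PFN_SHIFT) | (hw_flags & 0xF00u) | (sw_flags & 0xFFu);
}

/* What reading EntryLo back would return after the PTE was written. */
static inline uint32_t entrylo_readback(uint32_t pte)
{
    return pte & LO_HW_MASK;
}
```

The refill handler can therefore load the word from the page table and move it straight to EntryLo; the software flags stay in memory for the OS to inspect.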
Index
Figure 6.3. Fields in the Index register

The ‘‘P’’ field is set when a tlbp instruction (tlb probe, used to see if the
TLB can translate a particular VPN) failed to find a valid translation; since
it is the top bit it appears to make the 32-bit value negative, which is easy
to test for.
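In C the same test is a signed comparison. The helper below is a sketch: probe_result() is a hypothetical name, and the position of the Index field (bits 13..8) is assumed from the R3000 register layout.

```c
#include <stdint.h>

#define INDEX_SHIFT 8       /* Index field assumed to occupy bits 13..8 */
#define INDEX_MASK  0x3Fu

/* Returns the TLB slot found by tlbp, or -1 if the probe missed. */
static inline int probe_result(uint32_t index_reg)
{
    if ((int32_t)index_reg < 0)     /* P (bit 31) set: value looks negative */
        return -1;
    return (int)((index_reg >> INDEX_SHIFT) & INDEX_MASK);
}
```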
Random
Figure 6.4. Fields in the Random register

Most systems never have to read or write the Random register, shown as
Figure 6.4, “Fields in the Random register”, in normal use; but it may be
useful for diagnostics. The hardware initializes the Random field to its
maximum value (63) on reset, and it decrements every clock period until it
reaches 8, when it wraps back to 63 and starts again.
Context
Figure 6.5. Fields in the Context register (PTEBase, Bad VPN)

• PTEBase : a location which just stores what is put in it. In the
‘‘standard’’ refill handler, this will be the high-order bits of the
(1Mbyte aligned) starting address of a memory-resident page table.
• Bad VPN : following an addressing exception this holds the high-order
bits of the address; exactly the same as the high-order bits of
BadVaddr. However, if the system uses the ‘‘standard’’ TLB refill
exception handling code the 32-bit value formed by Context is directly
usable as a pointer to the memory-resident page table, considerably
shortening the refill exception code.
• Fields marked 0 : can be written with any value, but they will always
read zero.
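To see why Context is directly usable as a pointer, the value the hardware forms can be computed by hand. This sketch assumes the R3000 field positions (PTEBase in bits 31..21, Bad VPN in bits 20..2 holding address bits 30..12) and 4-byte page table entries; context_value() is a hypothetical helper, not a hardware or library interface.

```c
#include <stdint.h>

#define CTX_PTEBASE_MASK 0xFFE00000u   /* bits 31..21 (assumed layout) */
#define CTX_BADVPN_MASK  0x001FFFFCu   /* bits 20..2  (assumed layout) */

/* Value Context would hold after an addressing exception on badvaddr,
 * given what software previously wrote to the PTEBase field. */
static inline uint32_t context_value(uint32_t ptebase, uint32_t badvaddr)
{
    return (ptebase & CTX_PTEBASE_MASK) |
           ((badvaddr >> 10) & CTX_BADVPN_MASK);
}
```

Shifting the address right by 10 rather than 12 leaves the VPN multiplied by 4, so the result is the table base plus the byte offset of the entry for the faulting page.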

MMU control instructions
tlbr    –  Read TLB entry at index
tlbwi   –  Write TLB entry at index
    The above two instructions move MMU data between the TLB entry
    selected by the Index register and the EntryHi and EntryLo registers.
tlbwr   –  Write TLB entry selected by Random
    Copies the contents of EntryHi and EntryLo into the TLB entry indexed
    by the Random register. This saves time when using the
    recommended random replacement policy. In practice, tlbwr will be
    used to write a new TLB entry in a TLB refill exception handler; tlbwi
    will be used anywhere else.
tlbp    –  TLB lookup
    Searches (probes) the TLB for an entry whose virtual page number
    and ASID match those currently in EntryHi, and stores the index
    of that entry in the Index register (Index is set to a negative value
    if nothing matches). If more than one entry matches, anything might
    happen. Note that tlbp does not fetch data from the TLB. The
    instruction following a tlbp must not be a load or store.

Programming interface to the TLB
TLB entries are set up by writing the required fields into EntryHi and
EntryLo and using a tlbwr or tlbwi instruction to copy that entry into the
TLB proper.
When handling a TLB refill exception, EntryHi has been set up
automatically, with the current ASID and the required VPN.
Be very careful not to create two entries which will match the same
program address/ASID pair. If the TLB contains duplicate entries an
attempt to translate such an address, or probe for it, produces a fatal ‘‘TLB
shutdown’’ condition (indicated by the TS bit in SR being set). It can be
cleared only by a hardware reset.
System software often won’t need to read TLB entries at all. But if
necessary, software can find the TLB entry matching some particular
program address using tlbp to setup the Index register. Don’t forget to save
EntryHi and restore it afterwards because its ASID field is likely to be
important.
Use a tlbr to read the TLB entry into EntryHi and EntryLo.
How refill happens
When a program makes an access in kuseg or kseg2 to a page for which
no translation record is present, the CPU takes a TLB refill exception. The
assumption is that system software is maintaining a large number of page
translations and is using the TLB as a cache of recently-used translations;
so the refill exception will normally be handled by finding a correct
translation, installing it, and returning to user code.
In ‘‘CISC’’ CPUs the TLB is a cache that is refilled automatically
(usually by microcode): the CPU reads memory-resident ‘‘page tables’’
whose structure is part of the CPU architecture.
In the MIPS architecture refill is left to software, which is fast
enough and offers greater flexibility.
To save time on user-program TLB refill exceptions (which will happen
frequently in a ‘‘big’’ OS):
• refill exceptions on kuseg program addresses are vectored through a
low-memory address used for no other exception;


• special exception rules permit the kuseg refill handler to risk a nested
TLB refill exception on a kseg2 address.
The problem is that before an exception routine can itself suffer an
exception it must first save the previous program state, represented
by the EPC return address and some SR bits. This is helped out by a
hardware feature and a software convention:
a) The KUo, IEo bits in the status register act as a third level of the
processor-state stack, so that the CPU state already saved as a
result of the kuseg refill exception can be preserved during the
nested exception.
b) The kuseg refill handler copies EPC into the k1 register; the
general exception code and kseg2 refill handler are then careful
to preserve its value, enabling a clean return.
Refill exceptions on kseg2 addresses are expected to be rare enough that
it will not matter if they share in the overhead of the ‘‘all other exceptions’’
entry point. However, once software determines the type of exception the
handling is similar.
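The three-level state stack mentioned in a) can be modelled as shifts on the low six bits of SR. This is a sketch under the assumption of the R3000 bit order (KUo, IEo, KUp, IEp, KUc, IEc in bits 5..0, with KUc = 1 meaning user mode); the function names are hypothetical.

```c
#include <stdint.h>

#define SR_STACK_MASK 0x3Fu   /* KUo IEo KUp IEp KUc IEc */

/* Exception entry: current -> previous -> old; the new current state is
 * kernel mode with interrupts disabled (both bits zero). */
static inline uint32_t sr_on_exception(uint32_t sr)
{
    uint32_t stack = (sr << 2) & SR_STACK_MASK;
    return (sr & ~SR_STACK_MASK) | stack;
}

/* rfe: previous -> current, old -> previous; old keeps its value. */
static inline uint32_t sr_on_rfe(uint32_t sr)
{
    uint32_t stack = (sr & 0x30u) | ((sr >> 2) & 0x0Fu);
    return (sr & ~SR_STACK_MASK) | stack;
}
```

Starting from user mode with interrupts enabled (low bits 0x03), a kuseg refill followed by a nested kseg2 refill leaves the user state in KUo/IEo; a third exception taken before the state is saved would shift it out, which is why only one level of nesting is safe.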
Using ASIDs
By setting up TLB entries with a particular ASID setting and with the
EntryLo G bit zero, those entries will only ever match a program address
when the ASID field of EntryHi is set the same. This allows software to map
up to 64 different address spaces simultaneously, without requiring that
the OS clear out the TLB on a context change.
In typical usage, new tasks are assigned an “un-initialized” ASID. The
first time the task is invoked, it will presumably miss in the TLB, allowing
the assignment of an ASID. If the system does run out of new ASIDs, it will
flush the TLB and mark all tasks as “new”. Thus, as each task is
re-entered, it will be assigned a new ASID. This sequence is expected to
happen infrequently if ever.
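One possible way to implement this policy is sketched below. It is not taken from any particular OS: the structures and asid_for_task() are hypothetical, and a generation counter is used as one convenient way to mark every task as “new” without walking the task list.

```c
#include <stdint.h>

#define NUM_ASIDS 64

struct task {
    uint32_t asid;
    uint32_t asid_generation;   /* 0 means "never assigned" */
};

struct asid_state {
    uint32_t next_asid;         /* next unassigned ASID, 0..64 */
    uint32_t generation;        /* starts at 1 */
    unsigned tlb_flushes;       /* counts flushes, for illustration */
};

/* Return the ASID to load into EntryHi when 't' is about to run. */
static uint32_t asid_for_task(struct asid_state *st, struct task *t)
{
    if (t->asid_generation == st->generation)
        return t->asid;                 /* still valid */

    if (st->next_asid == NUM_ASIDS) {   /* all 64 in use: start over */
        st->generation++;               /* every task becomes "new" */
        st->next_asid = 0;
        st->tlb_flushes++;              /* real code flushes the TLB here */
    }
    t->asid = st->next_asid++;
    t->asid_generation = st->generation;
    return t->asid;
}
```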
The Random register and wired entries
The hardware offers no way of finding out which TLB entries have been
used most recently. When the system needs to replace a mapping
dynamically (using the TLB as a cache) the only practicable strategy is to
replace an entry at random. The CPU makes this easy by maintaining the
Random register, which counts (down) with every processor cycle.
However, it is often useful to have some TLB entries which are
guaranteed to stay there unless explicitly removed. These may be useful to
map pages which are known to be required very often; they are critical
because they allow the system to map pages and guarantee that no refill
exception will be generated on them.
The stable TLB entries are described as ‘‘wired’’ and on R30xx family
CPUs consist of TLB entries 0 through 7. There is nothing special about
these entries; the magic is in the Random register, which never takes
values 0-7; it cycles directly from 63 down to 8 before reloading with 63.
So conventional random replacement leaves TLB entries 0 through 7
unaffected, and entries written there will stay until explicitly removed.
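The counting behaviour is easy to model. The sketch below only restates the rule given above (count down from 63, reload after 8); random_next() is a hypothetical name.

```c
#include <stdint.h>

#define TLB_ENTRIES 64
#define TLB_WIRED    8     /* entries 0..7 are never selected */

/* Value of the Random field one clock after it held 'r'. */
static inline unsigned random_next(unsigned r)
{
    return (r <= TLB_WIRED) ? TLB_ENTRIES - 1 : r - 1;
}
```

The sequence therefore has a period of 56 clocks and visits only entries 8 through 63.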

Memory translation – setup
The following code fragment initializes the TLB to ensure no match on
any kuseg or kseg2 address. This is important, and is preferable to
initializing with all “0”’s (which is a kuseg address, and which would cause
multiple matches if referenced):
LEAF(mips_init_tlb)
        mfc0    t0,C0_ENTRYHI           # save asid
        mtc0    zero,C0_ENTRYLO         # tlblo = !valid
li
a1,NTLBID<vaddr) >> VMPGSHIFT;
unsigned vpn = xcp->vaddr >> VMPGSHIFT;
unsigned asid = 0;
/* write a random tlb (entryhi, entrylo) pair */
/* mark it valid, global, uncached, and not writable/dirty */
r3k_tlbwr ((vpn < double */

cvt.d.w fd,fs      fd = (double) fs;  /* int -> double */
cvt.s.d fd,fs      fd = (float) fs;   /* double -> float */
cvt.s.w fd,fs      fd = (float) fs;   /* int -> float */
cvt.w.s fd,fs      fd = (int) fs;     /* float -> int */
cvt.w.d fd,fs      fd = (int) fs;     /* double -> int */

Table 8.7. FPA data conversion operations

When converting from FP formats to 32-bit integer, the result produced
depends on the current rounding mode.

Conditional branch and test instructions
The FP test and branch instructions are separate. A test instruction
compares two FP values and sets the FPA condition bit accordingly (C in the
FP status register); the branch instructions branch on whether the bit is
set or unset.


The branch instructions are:
bc1f disp       Branch if C bit ‘‘false’’ (zero)
bc1t disp       Branch if C bit ‘‘true’’ (one)

Like the CPU’s other conditional branch instructions, disp is PC-relative,
with a signed 16-bit field as a word displacement. disp is usually coded as
the name of a label, which is unlikely to end up more than 128 Kbytes
away.
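The 128 Kbyte figure follows from the encoding: a signed 16-bit count of words covers 32K instructions in either direction. The sketch below assumes the standard MIPS rule that the displacement is applied to the address of the instruction after the branch; branch_target() is a hypothetical helper.

```c
#include <stdint.h>

/* Target of a branch located at branch_pc whose 16-bit field is
 * disp_field (a signed count of words). */
static inline uint32_t branch_target(uint32_t branch_pc, uint16_t disp_field)
{
    int32_t words = (int16_t)disp_field;          /* sign-extend */
    return branch_pc + 4u + (uint32_t)(words * 4);
}
```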
But before executing the branch, the condition bit must be set
appropriately. The comparison operators are:
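c.cond.d fs1,fs2    Compare fs1 and fs2 and set C
c.cond.s fs1,fs2
Where cond is any of 16 conditions called: eq, f, le, lt, nge, ngl, ngle,
ngt, ole, olt, seq, sf, ueq, ule, ult, un. Why so many? These test for any ‘‘OR’’
combination of three mutually incompatible conditions:
fs1 < fs2, fs1 == fs2, unordered (fs1, fs2)
The branch must not be the instruction immediately following the test;
another instruction (here a nop) has to separate them:
if (f0 > f2) goto foo; /* and trap if unordered */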


c.ole.d $f0, $f2
nop                     # the assembler will do this...
bc1f    foo

Fortunately, many assemblers recognize and manage this delay slot
properly.
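The way sixteen names arise from three conditions can be shown with a small model. This sketch assumes the standard MIPS encoding of the condition field (bit 0 unordered, bit 1 equal, bit 2 less-than) and models only the value of the C bit; the fourth bit, which distinguishes the second group of eight names, is ignored here. Function names are hypothetical.

```c
#include <stdint.h>

enum { C_UN = 1, C_EQ = 2, C_LT = 4 };

/* Mnemonics in encoding order (assumed from the MIPS instruction set). */
static const char *const fp_cond_names[16] = {
    "f",  "un",   "eq",  "ueq", "olt", "ult", "ole", "ule",
    "sf", "ngle", "seq", "ngl", "lt",  "nge", "le",  "ngt"
};

static inline const char *fp_cond_mnemonic(unsigned cond)
{
    return fp_cond_names[cond & 15u];
}

/* Value of the C bit after comparing fs1 with fs2 under 'cond'. */
static inline int fp_condition(unsigned cond, double fs1, double fs2)
{
    unsigned actual;

    if (fs1 != fs1 || fs2 != fs2)
        actual = C_UN;          /* at least one operand is a NaN */
    else if (fs1 == fs2)
        actual = C_EQ;
    else if (fs1 < fs2)
        actual = C_LT;
    else
        actual = 0;             /* fs1 > fs2: none of the three holds */

    return (cond & actual) != 0;
}
```

Exactly one of the three conditions (or none, when fs1 > fs2) holds for any pair of operands, so each predicate is just a choice of which of them should set C.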

INSTRUCTION TIMING REQUIREMENTS
FP arithmetic instructions are interlocked (the instruction flow “stalls”
automatically until results are available; the programmer does not need to
be explicitly aware of execution times), and there is no need to interpose
‘‘nops’’ or to reorganize code for correctness. However, optimal
performance will be achieved by code which lays out FP instructions to
make the best use of overlapped execution of integer instructions, and the
FP pipeline.
However, the compiler, assembler or (in the end) the programmer must
take care about the timing of:
• Operations on the FP control and status register: moves between FP
and integer registers complete late, and the resulting value cannot be
used in the following instruction.
• FP register loads: like integer loads, take effect late. The value can’t be
used in the following instruction.
• Test condition and branch: the test of the FP condition bit using the
bc1t, bc1f instructions must be carefully coded, because the
condition bit is tested a clock earlier than might be expected. So the
conditional branch cannot immediately follow a test instruction.

INSTRUCTION TIMING FOR SPEED
The R30xx family FPA takes more than one clock for most arithmetic
instructions, and so the pipelining becomes visible. The pipeline can show
up in three ways:
• Hazards: where the software must ensure the separation of
instructions to work correctly;
• Interlocks: where the hardware will protect the software by delaying
use of an operand until it is ready, but knowledgeable re-arrangement
of the code will improve performance;
• Overlapping: where the hardware is prepared to start one operation
before another has completed, provided there are no data
dependencies. This is discussed later.
Hazards and interlocks arise when instructions fail to stick to the
general MIPS rule of taking exactly one clock period between needing
operands and making results ready. Some instructions either need
operands earlier (branches, particularly, do this), or produce results late
(e.g. loads). All R30xx family instructions which can cause trouble are
tabulated in an appendix of this manual.

INITIALIZATION AND ENABLE ON DEMAND
Reset processing will normally initialize the CPU’s SR register to disable
all optional co-processors, which includes the FPA (alias coprocessor 1).
The SR bit CU1 has to be set for the FPA to work.


To determine availability of a hardware FPA, software should read the
FPA implementation register; if it reads zero, no FP is fitted and software
should run the system with CU1 off†. Once CU1 is enabled, software
should setup the control/status register FCR31 with the system choice of
rounding modes and trap enables.
Once the FPA is operating, the FP registers should be saved and restored
during interrupts and context switches. Since this is (relatively)
time-consuming, software can optimize this:
• Leave the FPA disabled by default when running a new task. Since the
task cannot now access the FPA, the OS doesn’t have to save and
restore registers.
• On a FP instruction trap, mark the task as an FP user and enable the
FP before returning to it.
• Disable FP operations while in the kernel, or in any software called
directly or indirectly from an interrupt routine. This avoids saving FP
registers on an interrupt; instead FP registers need be saved only
when context-switching to or from an FP-using task.
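The bookkeeping behind this scheme is small. The sketch below is illustrative only: the task structure and function names are hypothetical, CU1 is assumed to be bit 29 of SR, and the trap is assumed to be the exception raised when an FP instruction executes with CU1 clear.

```c
#include <stdint.h>

#define SR_CU1 0x20000000u   /* assumed: SR bit 29 enables the FPA */

struct task {
    uint32_t sr;        /* SR value restored when the task runs */
    int      uses_fp;   /* set once the task has executed an FP instruction */
};

/* New tasks start with the FPA disabled. */
static inline void task_init(struct task *t, uint32_t sr)
{
    t->sr = sr & ~SR_CU1;
    t->uses_fp = 0;
}

/* Called from the trap taken on the task's first FP instruction. */
static inline void fp_instruction_trap(struct task *t)
{
    t->uses_fp = 1;
    t->sr |= SR_CU1;    /* the instruction is retried on return */
}

/* Nonzero if a switch between these tasks must save or restore FP
 * registers. */
static inline int switch_needs_fp(const struct task *from,
                                  const struct task *to)
{
    return from->uses_fp || to->uses_fp;
}
```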

FLOATING POINT EMULATION
The low-cost members of the R30xx family do not have a hardware FPA.
Floating point functions for these processors are provided by software, and
are slower than the hardware. Software FP is useful for systems where
floating point is employed in some rarely-used routines.
There are two approaches:
• Soft-float: Some compilers can be requested to implement floating
point operations with software. In such a system, the instruction
stream does not contain actual floating point operations; instead,
when the software requests floating point from the compiler, the
compiler inserts a call to a dedicated floating point library. This
eliminates the overhead of emulating a floating point register file, and
also the overhead of decoding the requested operation.
• Run-time emulation: The compiler can produce the regular FP
instruction set. The CPU will then take a trap on each FP instruction,
which is caught by the FP emulator. The emulator decodes the
instruction and performs the requested operation in software.
Part of the emulator’s job will be emulating the FP register set in
memory.
This technique is much slower than the soft-float technique; however,
the binaries generated will automatically gain significant performance
when executed by an R3081, simplifying system upgrades.
As described above, a run-time emulator may also be required to back
up FP hardware for very small operands or obscure operations; and, for
maximal flexibility that emulator is usually complete. However, it will be
written to ensure exact IEEE compatibility and is only expected to be called
occasionally, so it will probably be coded for correctness rather than speed.
Compiled-in floating point (soft-float) is much more efficient on integer
only chips; the emulator has a high overhead on each instruction from the
trap handler, instruction decoder, and emulated register file.

† Some systems may still enable CP1, to use the BrCond(1) input
pin as an input port. The software must then ensure that no FPA
operations are actually required, since the CPU will presume that
they are actually executed.
CHAPTER 9

ASSEMBLER LANGUAGE PROGRAMMING

Integrated Device Technology, Inc.

This chapter details the techniques and conventions associated with
writing and reading MIPS assembler code. This is different from just
looking at the list of machine instructions because:
1)
MIPS assemblers provide a large number of extra ‘‘macro’’
instructions which provide a richer instruction set than in fact
exists at the machine level.
2)
Programmers need to know the exact syntax of directives to start
and end functions, define data, control instruction ordering and
optimization, etc.
Before reading much further, it may be a good idea to go back and review
Chapter 2 (MIPS Architecture). It describes the low-level machine
instruction set, data types, addressing modes, and conventional register
usage.

SYNTAX OVERVIEW
Appendix C of this manual contains the formal syntax for the original
MIPS Corp. assembler; most assemblers from other vendors follow this
closely, although they may differ in their support of certain directives.
These directives and conventions are similar to those found in other
assemblers, especially a UNIX† assembler.

Key points to note
• The assembler allows more than one statement on each line, as long
as they are separated by semi-colons.
• "White space" (tabs and spaces) is permitted between any symbols.
• All text from a ‘#’ to the end of the line is a comment and is ignored,
but do not put a ‘#’ in column 1.
• Identifiers for labels, variables, etc. can be any combination of alphanumeric characters plus ‘$’, ‘_’ and ‘.’, except for the first character
which cannot be numeric:
Good labels:
AVeryLongIdentifier     # lower case is different from upper case
frog$spawn              # dollars allowed in names
frog.spawn              # ’.’ is also valid
__peculiar2             # leading underscores often used to
                        # avoid name clashes in C
Bad labels:
7down                   # leading decimal
frog-spawn              # "-" not allowed

• The assembler allows the use of numbers (decimal between 1-99) as
a label. These are treated as ‘‘temporary’’, and are “re-usable”. In a
branch instruction ‘‘1f’’ (forward) refers to the next ‘‘1:’’ label in the
code, and ‘‘1b’’ (back) refers to the last-met ‘‘1:’’ label.
This eliminates the need for inventing unique but meaningless names
for little branches and loops. Many programmers reserve named
labels for subroutine entry points.

† UNIX is a trademark of Univel Inc.

• The MIPS Corp. assembler, among others, provides the conventional
register names (a0, t5, etc.) as C pre-processor macros; thus, the
programmer must pass the source through the C preprocessor and
include the file †.
• If the C preprocessor is indeed used, then typically it is permitted to
also use C-style /* comments */ and macros.
• Hexadecimal constants are numbers preceded by ‘‘0x’’ or ‘‘0X’’; octal
constants must be preceded by ‘‘0’’; be careful not to put a redundant
zero on the front of a decimal constant. Constants are:
0               # strictly octal zero, but who cares?
0x80000000      # the most negative 32-bit integer
0377            # 255 decimal, probably what was meant
08              # illegal (0 implies octal)
01024           # octal for 532, probably not what was meant

• Pointer values can be used; in a word context, a label or relocatable
symbol stands for its address as a 32-bit integer. The identifier ‘.’ (dot)
represents the current location counter.
Many assemblers even allow some limited arithmetic.
• Character constants and strings can contain the following special
characters, introduced by the backslash ‘\’ escape character:
character    generated code
\a           alert (bell)
\b           backspace
\e           escape
\f           formfeed
\n           newline
\r           carriage return
\t           horizontal tab
\v           vertical tab
\\           backslash
\’           single quote
\"           double quote
\0           null (integer 0)

A character can be represented as a one-, two-, or three-digit octal
number (\ followed by octal digits), or as a one-, two-, or three-digit
hexadecimal number ( \x followed by hexadecimal digits).
• The precedence of binary and unary operations in constant
expressions follows the C definition.
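The constant-prefix rules in the key points above are the same ones C uses, so a doubtful constant can be checked on a host machine. The sketch below relies on the standard C library function strtoul() with base 0, which applies the 0x and leading-0 conventions; parse_constant() is a hypothetical wrapper, and a particular assembler may of course differ in its error handling.

```c
#include <stdlib.h>

/* Parse 'text' using C prefix rules.  *ok is set to 0 if any character
 * could not be consumed (for example the '8' in "08"). */
static inline unsigned long parse_constant(const char *text, int *ok)
{
    char *end;
    unsigned long value = strtoul(text, &end, 0);

    *ok = (end != text) && (*end == '\0');
    return value;
}
```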

REGISTER-TO-REGISTER INSTRUCTIONS
Most MIPS machine instructions are three-register operations, i.e. they
are arithmetic or logical functions with two inputs and one output, for
example:

† In IDT/c version 5.0 and later, the header files exist in the
directory “/idtc”. The pre-processor is automatically invoked if the
extension of the filename is anything other than “.s”. To force the
pre-processor to be used with “.s” files, use the switch “xassemble-with-cpp” in the command line.
rd = rs op rt

• rd : is the destination register, which receives the result of function
op;
• rs : is a source register (operand);
• rt : is a second source register.
In MIPS assembly language this type of instruction is written:
opcode rd, rs, rt
For example:
addu    $2, $4, $5      # $2 = $4 + $5

Of course any or all of the register operands may be identical. To
produce a CISC-style, two-operand instruction just use the destination
register as a source operand; the assembler will do this automatically if
rs is omitted.
addu    $4, $5          →       addu    $4, $4, $5      # $4 = $4 + $5

Unary operations (e.g. neg, not) are always synthesized from one or
more of the three-register instructions. The assembler expects a maximum
of two operands for these instructions (dst and src):
neg     $2, $4          →       sub     $2, $0, $4      # $2 = -$4
not     $3              →       nor     $3, $0, $3      # $3 = ~$3

Probably the most common register-to-register operation is move. This
ubiquitous instruction is in fact implemented by an addu with the always
zero-valued register $0:
move    $3, $5          →       addu    $3, $5, $0      # $3 = $5

IMMEDIATE (CONSTANT) OPERANDS
An immediate operand is the traditional term for a constant value found
in a field of the instruction. Many of the MIPS arithmetic and logical
operations have an alternative form which uses a 16-bit immediate in place
of rt. The immediate value is first sign-extended or zero-extended to
32 bits, for arithmetic or logical operations respectively.
Although an immediate operand implies a different low-level machine
instruction from its three-register version (e.g. addi instead of add), there
is no need for the programmer to write this explicitly. The assembler will
spot the case when the final operand is an immediate, and use the correct
machine instruction. For example:
add     $2, $4, 64      →       addi    $2, $4, 64

If an immediate value is too large to fit into the 16-bit field in the
machine instruction, then the assembler helps out again. It automatically
loads the constant into the assembler temporary register $at/$1 and then
performs the operation using that.
add     $4, 0x12345     →       li      $at, 0x12345
                                add     $4, $4, $at

Note the li (load immediate) instruction, which again isn’t found in the
machine’s instruction set; li is a heavily-used macro instruction which
loads a 32-bit integer value into a register, without the programmer having
to worry about how it gets there:

• When the 32-bit value lies between ±32K it can use a single addiu
with $0; when bits 31-16 are all zero it can use ori; when bits 15-0
are all zero it will be lui; and when none of these is possible it will
be an lui/ori pair:
    li    $3, -5          →    addiu $3, $0, -5
    li    $4, 0x8000      →    ori   $4, $0, 0x8000
    li    $5, 0x120000    →    lui   $5, 0x12
    li    $6, 0x12345     →    lui   $6, 0x1
                               ori   $6, $6, 0x2345
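
The selection rule for li can be modelled in a few lines. The following is a minimal Python sketch, not the code of any particular assembler; the function name expand_li and the tuple representation of instructions are invented here for illustration.

```python
def expand_li(reg, value):
    """Return the machine instructions a typical assembler might emit for `li reg, value`."""
    value &= 0xFFFFFFFF                      # treat the constant as a 32-bit pattern
    upper, lower = value >> 16, value & 0xFFFF
    signed = value - (1 << 32) if value & 0x80000000 else value

    if -32768 <= signed <= 32767:            # fits a sign-extended 16-bit immediate
        return [("addiu", reg, "$0", signed)]
    if upper == 0:                           # bits 31-16 all zero
        return [("ori", reg, "$0", lower)]
    if lower == 0:                           # bits 15-0 all zero
        return [("lui", reg, upper)]
    return [("lui", reg, upper), ("ori", reg, reg, lower)]
```

Calling expand_li with the four constants used in the examples above should reproduce the four expansions shown there.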

MULTIPLY/DIVIDE
The multiply and divide machine instructions are unusual:
• they do not accept immediate operands;
• they do not perform overflow or divide-by-zero tests;
• they operate asynchronously – so other instructions can be executed
while they do their work;
• they store their results in two separate result registers (hi and lo),
which can only be read with the two special instructions mfhi and
mflo;
• the result registers are interlocked – they can be read at any time after
the operation is started, and the processor will stall until the result is
ready.
However, the conventional assembler multiply/divide instructions will
hide this: they are complex macro instructions which simulate a
three-operand instruction and perform overflow checking. A signed divide may
generate about 13 instructions, but they execute in parallel with the
hardware divider so that no time is wasted (the divide itself takes 35
cycles).
    Instruction    Description
    mul            simple unsigned multiply, no checking
    mulo           signed multiply, checks for overflow above 32-bits
    mulou          unsigned multiply, checks for overflow above 32-bits
    div            signed divide, checks for zero divisor or divisor of -1
                   with most negative dividend
    divu           unsigned divide, checks for zero divisor
    rem            signed remainder, checks for zero divisor or divisor of -1
                   with most negative dividend
    remu           unsigned remainder, checks for zero divisor

Some MIPS assemblers will convert constant multiplication, and
division/remainder by constant powers of two, into the appropriate shifts,
masks, etc. Don’t rely on this though, as most toolchains expect the
compiler or assembly-language programmer to spot this sort of
optimization.
To explicitly control the multiplication, specify a dst of $0. The
assembler will issue the raw machine instruction to start the operation; it
is then up to the programmer to fetch the result from hi and/or lo and, if
required, perform overflow checking.
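
As an illustration of what explicit overflow checking involves after a raw signed multiply, the Python sketch below models the hi/lo result pair. It assumes the usual test that a signed product fits in 32 bits only when hi is the sign extension of lo; the function names are hypothetical and are not part of any toolchain.

```python
MASK32 = 0xFFFFFFFF

def to_signed32(x):
    """Interpret a 32-bit pattern as a signed integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x

def mult(rs, rt):
    """Model of a signed multiply: returns (hi, lo) as unsigned 32-bit values."""
    product = (to_signed32(rs) * to_signed32(rt)) & 0xFFFFFFFFFFFFFFFF
    return (product >> 32) & MASK32, product & MASK32

def signed_overflow(hi, lo):
    """True when the 64-bit signed product does not fit in 32 bits."""
    expected_hi = MASK32 if lo & 0x80000000 else 0
    return hi != expected_hi
```

In assembler terms this corresponds to fetching both hi and lo and comparing hi against lo shifted right arithmetically by 31 places.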

LOAD/STORE INSTRUCTIONS
The following table lists all the assembler’s load/store instructions. The
signed load instructions sign-extend the memory data to 32-bits; the
unsigned instructions zero-extend.
    Load                  Store    Description
    Signed    Unsigned
    lw                    sw       word
    lh        lhu         sh       halfword
    lb        lbu         sb       byte
    ulw                   usw      unaligned word
    ulh       ulhu        ush      unaligned halfword
    lwl                   swl      word left
    lwr                   swr      word right
    l.d                   s.d      double precision floating-point
    l.s                   s.s      single precision floating-point (i.e.,
    lwc1                  swc1     coprocessor 1 register)

Don’t forget the architectural constraints of load/store instructions:
• Strict alignment: addresses must be aligned correctly (i.e. a multiple
of 4 for words, and 2 for halfwords), except for the special left, right
and unaligned variants (described below), or else they will cause an
exception.
• Load delay: all load instructions require at least one other instruction
between them and the instruction which uses their result – but most
assemblers should guarantee this by inserting a nop if necessary.
There is a special exception to this rule for lwl followed immediately
by lwr to the same register, or vice versa (the last instruction of the
pair will still have the delay slot, but no delay slot is required between
the instructions in the pair).

Unaligned loads and stores
As noted above, normal load and store instructions must have a
correctly aligned address. This can occasionally cause problems when
porting software from CISC architectures which allow unaligned
addresses.
All data structures that are declared as part of a standard C program
will be aligned correctly. But addresses computed at run-time, or data
structures declared using a non-standard language extension, may
require that software copes with unaligned addresses. While this can be
done by a combination of byte loads, shifts and adds, the MIPS
architecture provides the special purpose lwl, lwr, swl and swr
instructions. An unaligned word can be accessed using just two of these
special instructions as a pair; however, they are not usually used directly,
but are generated by the ulw (unaligned load word) and usw (unaligned
store word) macro instructions.
The ulh, ulhu, and ush unaligned halfword macro instructions do not
use the special instructions. Unaligned halfword loads generate two byte
loads, a shift (sll) and an or (4 instructions); stores generate two sb’s
and a shift (srl) (3 instructions).
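
The halfword case is simple enough to model directly. The Python sketch below assumes a byte-addressable memory held in a bytes object and shows how two byte loads, a left shift and an or combine into one halfword; the function name and its parameters are illustrative only.

```python
def load_unaligned_halfword(memory, addr, signed=False, big_endian=True):
    """Build a halfword from two byte loads, a shift and an or."""
    if big_endian:
        high, low = memory[addr], memory[addr + 1]
    else:
        high, low = memory[addr + 1], memory[addr]
    if signed and high & 0x80:
        high -= 0x100                         # a signed byte load sign-extends the upper byte
    return ((high << 8) | low) & 0xFFFFFFFF   # shift, or, then truncate to a 32-bit register
```

The lower byte is always loaded unsigned; only the upper byte decides the sign of the result.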

ADDRESSING MODES
As discussed above, the hardware supports only one addressing mode:
base_reg+offset, where offset is in the range –32768 to 32767. However the
assembler simulates direct and direct+index-reg addressing modes by
using two or three machine instructions, and the assembler-temporary
register.
    lw    $2, ($3)        →    lw    $2, 0($3)
    lw    $2, 8+4($3)     →    lw    $2, 12($3)
    lw    $2, addr        →    lui   $at, %hi_addr
                               lw    $2, %lo_addr($at)
    sw    $2, addr($3)    →    lui   $at, %hi_addr
                               addu  $at, $at, $3
                               sw    $2, %lo_addr($at)

The store instruction is written with the source register first and the
address second, to look like a load; for other operations the destination is
first.
The symbol addr in the above examples can be any of these things:
• a relocatable symbol – the name of a label or variable (whether in this
module or elsewhere);
• a relocatable symbol ± a constant expression;
• a 32-bit constant expression (e.g. the absolute address of a device
register).
The constructs ‘‘%hi_’’ and ‘‘%lo_’’ do not actually exist in the assembler,
but represent the high and low 16-bits of the address. This is not quite the
straightforward division into low and high words that it looks, because the
16-bit offset field of a lw is treated as signed. So if the ‘‘addr’’ value is such
that bit 15 is a ‘‘1’’, then the %lo_addr value will act as negative, and the
assembler needs to increment %hi_addr to compensate:
    addr          %hi_addr    %lo_addr
    0x12345678    0x1234      0x5678
    0x10008000    0x1001      0x8000
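
The adjustment can be checked with a short calculation. This Python sketch is one common way to express the split (adding 0x8000 before taking the upper half); the function names are made up for the example.

```python
def split_address(addr):
    """Return (%hi, %lo) so that (hi << 16) + sign_extend(lo) == addr."""
    lo = addr & 0xFFFF
    hi = ((addr + 0x8000) >> 16) & 0xFFFF     # rounds up when bit 15 of addr is set
    return hi, lo

def rebuild_address(hi, lo):
    """What lui followed by a load with a signed 16-bit offset computes."""
    signed_lo = lo - 0x10000 if lo & 0x8000 else lo
    return ((hi << 16) + signed_lo) & 0xFFFFFFFF
```

For 0x10008000 the low half 0x8000 acts as –32768, so the high half must be 0x1001 for the sum to come out right.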

The la (load address) macro instruction provides a similar service for
addresses as the li instruction provides for integer constants:
    la    $2, 4($3)       →    addiu $2, $3, 4
    la    $2, addr        →    lui   $at, %hi_addr
                               addiu $2, $at, %lo_addr
    la    $2, addr($3)    →    lui   $at, %hi_addr
                               addiu $2, $at, %lo_addr
                               addu  $2, $2, $3

In principle, la could avoid apparently-negative ‘‘%lo_’’ values by using
an ori instruction. But the linker has to be able to fix up addresses in the
signed ‘‘%lo_’’ format found for load/store instructions – so la uses the add
instruction so as to use the same kind of address fixup.

Gp-relative addressing
Loads and stores to global variables or constants usually require at least
two instructions, e.g.:
    lw    $2, addr        →    lui   $at, %hi_addr
                               lw    $2, %lo_addr($at)
    sw    $2, addr($3)    →    lui   $at, %hi_addr
                               addu  $at, $at, $3
                               sw    $2, %lo_addr($at)

A common low-level optimization supported by many toolchains is to
use gp-relative addressing. This technique requires the cooperation of the
compiler, assembler, linker and run-time start-up code to pool all of the
‘‘small’’ variables and constants into a single region of maximum size
64 Kbytes, and then set register $28 (known as the global pointer or gp
register) to point to the middle of this region†. With this knowledge the
assembler can reduce the number of instructions used to access any of
these small variables, e.g.:
    lw    $2, addr        →    lw    $2, addr – _gp($gp)
    sw    $2, addr($3)    →    addu  $at, $gp, $3
                               sw    $2, addr – _gp($at)
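
Why the region is limited to 64 Kbytes follows from the signed 16-bit offset field. The Python sketch below assumes, purely for illustration, that gp is set exactly 0x8000 bytes above the start of the small-data region; real toolchains may choose a slightly different value. The function name gp_offset and the example addresses are hypothetical.

```python
def gp_offset(addr, gp):
    """Return the 16-bit signed offset for a gp-relative access, or fail."""
    offset = addr - gp
    if not -32768 <= offset <= 32767:
        raise ValueError("address is out of range for gp-relative addressing")
    return offset

region_start = 0x80010000                     # hypothetical start of the small-data region
gp = region_start + 0x8000                    # assumption: gp points to the middle
```

With gp in the middle, every byte from region_start to region_start + 0xFFFF can be reached in a single load or store; anything beyond that needs the two-instruction form.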

By default most toolchains consider objects less than or equal to 8 bytes
in size to be ‘‘small’’. This limit can usually be controlled by the ‘-G n’
compiler/assembler option; specifying ‘-G 0’ will switch this optimization
off altogether.
While it is a useful optimization, there are some pitfalls to beware of:
• The programmer must take special care when writing assembler code
to declare global data items correctly:
a) Writable, initialized data of 8 bytes or less must be put explicitly
into the .sdata section.
b) Global common data must be declared with the correct size, e.g.:
    .comm   smallobj, 4
    .comm   bigobj, 100
c) Small external variables should also be explicitly declared, e.g.:
    .extern smallext, 4
d) Most assemblers are effectively one-pass, so make sure that the
program declares data before using it in the code, to get the most
out of the optimization.
• In C, global variables must be declared correctly in all modules which
use them. For external arrays either omit the size (e.g. extern int
extarray[]), or give the correct size (e.g. int cmnarray[NARRAY]).
Don’t just give a dummy size of 1.
• A very large number of small data items or constants may cause the
64 Kbyte limit to be exceeded, causing strange relocation errors when
linking. The simplest solution here is to completely disable gp-relative
addressing (i.e. use –G 0).
• Some real-time operating systems, and many PROM monitors, can be
entered by direct subroutine calls, rather than via a single ‘‘system
call’’ interface. This makes it impossible (or at least very difficult) to
switch back and forth between the two different values of gp that will
be used by the application, and by the o/s or monitor. In this case
either the applications or the o/s (but not necessarily both) must be
built with –G 0.
• When the –G 0 option has been used for compilation of any set of
modules, then it is usually essential that all libraries should also be
compiled that way, to avoid relocation errors.

† The actual handling may be toolchain dependent; this is the
most common technique.

JUMPS, SUBROUTINE CALLS AND BRANCHES
The MIPS architecture follows Motorola nomenclature:
• PC-relative instructions are called ‘‘branch’’, and absolute-addressed
instructions ‘‘jump’’; the operation mnemonics begin with a b or j.
• A subroutine call is ‘‘jump and link’’ or ‘‘branch and link’’, and the
mnemonics end ..al.
• All the branch instructions, even branch-and-link, are conditional,
testing one or two registers. They are therefore described in the next
section. However, unconditional versions can be readily synthesized,
e.g.: beq $0, $0, label.
Jump instructions are:
• j: this instruction (jump) transfers control unconditionally to an
absolute address. Actually, j doesn’t quite manage a 32-bit address;
the top 4 address bits of the target are not defined by the instruction
and the top 4 bits of the current ‘‘PC’’ value are used instead.
Most of the time this doesn’t matter: 28 bits still gives a maximum
code size of 256 Mbytes. It can be argued that it is useful in system
software, because it avoids changing the top 3 address bits which
select the address segment (described earlier in this manual).
To reach a really long way away, use the jr (jump to register)
instruction, which is also used for computed jumps.
• jal, jalr: these instructions implement a direct and indirect
subroutine call. As well as jumping to the specified address, they
store the current pc + 8 in register $31 (ra). Why add 8 to the program
counter? Remember that jump instructions, like branches, always
execute the following instruction (at pc + 4), so the return address is
the instruction after the branch delay slot. Subroutine return is
normally done with jr $31.
Position independent subroutine calls can use the bal, bgezal and
bltzal instructions.
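
The address arithmetic for j and jal can be written out explicitly. The Python sketch below assumes the 26-bit target field is shifted left by two and combined with the top 4 bits of the delay-slot instruction's address (pc + 4); the function names are illustrative.

```python
def jump_target(pc, target_field):
    """Destination of a j/jal at address pc with the given 26-bit target field."""
    delay_slot = (pc + 4) & 0xFFFFFFFF
    return (delay_slot & 0xF0000000) | ((target_field & 0x03FFFFFF) << 2)

def return_address(pc):
    """Value stored in $31 by a jal/jalr at address pc (skips the delay slot)."""
    return (pc + 8) & 0xFFFFFFFF
```

Because only the low 28 bits come from the instruction, a j executed in one 256 Mbyte region cannot leave it; jr has no such restriction.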

CONDITIONAL BRANCHES
The MIPS architecture does not include a condition code register.
Conditional branch machine instructions test one or two registers; and,
together with a small group of compare-and-set instructions, are used to
synthesize a complete set of arithmetic conditional branches.
Conditional branches are always PC-relative.
Branch instructions are listed below. Again there are architectural
considerations:
• Limited branch offset for PC-relative branches: the maximum branch
displacement is ±32768 instructions (±128K bytes), because a 16-bit
field is used for the offset.
• Branch delay slot: the instruction immediately after a branch (or a
jump) is always executed, whether or not the branch is taken. Many
assemblers will normally hide this from the programmer, and will try
to fill the branch delay slot with a useful instruction, or a nop if this
is not possible.
• No carry flag: due to the lack of condition codes. If software needs to
check for carry, it must compare the operands and results to work out
when it occurs (typically, this requires only one sltu instruction).
• No overflow flag: though the add and subtract instructions are
available in an optional form which causes a trap if the result
overflows into the sign bit. C compilers typically won’t generate those
instructions, but Fortran might.
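
Carry detection without a flag can be illustrated as follows. This Python sketch models a 32-bit unsigned add followed by an unsigned set-on-less-than: a carry occurred exactly when the truncated sum is smaller than one of the operands. The helper names mirror the instruction mnemonics but are otherwise invented for the example.

```python
MASK32 = 0xFFFFFFFF

def addu(a, b):
    """32-bit add with no overflow trap; the result wraps around."""
    return (a + b) & MASK32

def sltu(a, b):
    """Set on less than, unsigned: 1 if a < b, else 0."""
    return 1 if (a & MASK32) < (b & MASK32) else 0

def add_with_carry_out(a, b):
    """Return (sum, carry) using one add and one unsigned compare."""
    total = addu(a, b)
    return total, sltu(total, a)
```

The same pattern is the building block for multi-word addition: the carry from the low words is added into the sum of the high words.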

Co-processor conditional branches
There are four pairs of branches, testing true/false on four ‘‘coprocessor
condition’’ values CPCOND0-3. In the R3081, CPCOND1 is an internal flag
which tests the floating point condition set by the FP compare instructions.
Note that the coprocessor must be enabled for the branch instruction to be
executed.

COMPARE AND SET
The compare-and-set instructions conform to the C standard; they set
their destination to 1 if the condition is true, and zero otherwise. Their
mnemonics start with an ‘‘s’’: so seq rd, rs, rt sets rd to a 1 or zero
depending on whether rs is equal to rt. These instructions operate just like
any 3-operand MIPS instruction.
Floating point comparisons are done quite differently, and are described
in the Floating-Point Accelerator chapter.

COPROCESSOR TRANSFERS
CPU control functions are provided by a set of registers, which the
instruction set accesses as ‘‘co-processor 0’’ data registers. These registers
deal with catching exceptions and interrupts, and accessing the memory
management unit and caches. An R3051 family CPU has at least 12
registers; some have more. There’s much more about this in earlier
chapters.
The floating point accelerator is ‘‘co-processor 1’’, and is described in an
earlier chapter. It has 16 64-bit registers to hold single- or
double-precision FP values, which come apart into 32 32-bit registers when doing
loads, stores and transfers to/from the integer registers. There are also two
floating point control registers accessed with ctc1, cfc1 instructions.
‘‘Co-processor’’ instructions are encoded in a standard way, and the
assembler doesn’t have to know much about what they do.
There is a range of instructions for moving data to and from the
coprocessor data and control registers. The assembler expects register
numbers specified with ‘‘$’’ in front (except for floating point registers,
which are called $f0 to $f31); but most toolchains provide a header file for
the C preprocessor which provides meaningful names for the CPU control
and FP control registers.
The assembler syntax makes no special provisions for ‘‘co-processor’’
registers; so if the program contains “obvious” mistakes (like reversing the
CPU and special register names) the assembler will just silently do the
wrong thing.
    Instruction            Description
    mfc0 dst, dr           move from CPU control register (to integer register)
    mtc0 src, dr           move to CPU control register (from integer register)
    cfc1 dst, cr           move from fpa control register (to integer register)
    ctc1 src, cr           move to fpa control register (from integer register)
    mfc1 dst, dr           move from FP register to integer register
    mtc1 src, dr           move to FP register from integer register
    swc1 dr, offs(base)    store FP register (to memory)
    lwc1 dr, offs(base)    load FP register (from memory)

Like conventional load instructions, there must always be one
instruction after the move before the result can be used (the load-delay
slot), whichever direction data is being moved.

Coprocessor Hazards
A pipeline hazard occurs when the architecture definition allows the
internal pipelining to ‘‘show through’’ and affect the software: examples
being the load and branch delay slots. Most MIPS assemblers will usually
shield the programmer from hazards by moving instructions around or
inserting NOP’s, to ensure that the code executes as written.
However some CPU control register writes have side-effects which
require pipeline-aware programming; since most assemblers don’t
understand anything about what these instructions are doing, they may
not help.
One outstanding example is the use of interrupt control fields in the
Status and Cause registers. In these cases the programmer must account
for any side-effects, and the fact that they are delayed for up to three
instructions. For example, after an mtc0 to the Status register which
changes an interrupt mask bit, it will be two further instructions before the
interrupt is actually enabled or disabled. The same is also true when
enabling or disabling floating-point coprocessor instructions (i.e. changing
the CU1 bit).
Coping with these situations usually requires the programmer to take
explicit action to prevent the assembler from scheduling inappropriate
instructions after a dangerous mtc0. This is done by using the .set
noreorder directive, discussed below.
A comprehensive summary of pipeline hazards can be found later in this
chapter.

ASSEMBLER DIRECTIVES
Sections
The names of, and support for, different code and data sections are likely
to differ from one toolchain to another. Most will at least support the
original MIPS conventions, which are illustrated (for ROMable programs)
by Figure 9.1, “Program segments in memory”.
Within an assembler program the sections are selected as shown in
Figure 9.1, “Program segments in memory”.
.text, .rdata, .data
Simply put the appropriate section name before the data or instructions,
for example:
            .rdata
    msg:    .asciiz "Hello world!\n"

            .data
    table:  .word   1
            .word   2
            .word   3

            .text
    func:   sub     sp, 64
            ...

.lit4, .lit8
These sections cannot be selected explicitly by the programmer. They
are read-only data sections used implicitly by the assembler to hold
floating-point constants which are given as arguments to the li.s or li.d
macro instructions. Some assemblers and linkers will save space by
combining identical constants.

    Address     Symbol    Section    Contents

    ROM
                etext
                          .rdata     read-only data
                          .text      program code
    1fc00000    _ftext

    RAM
    ????????
                          stack      grows down from top of memory
                          heap       grows up towards stack
                end
                          .bss       uninitialized writable data
                          .sbss      uninitialized writable small data
                _fbss
                edata
                          .lit8      64-bit floating point constants
                          .lit4      32-bit floating point constants
                          .sdata     writable small data
                          .data      writable data
    00000200    _fdata
                                     exception vectors
    00000000

Figure 9.1: Program segments in memory

.bss
This section is used to collect uninitialized data, the equivalent of C and
Fortran’s common data. An uninitialized object is declared, together with
its size. The linker then allocates space for it in the .bss section, using the
maximum size from all those modules which declare it. If any module
declares it in a real, initialized data section, then all the sizes are ignored
and that definition is used.
    .comm   dbgflag, 4      # global common variable, 4 bytes
    .lcomm  sum, 4          # local common variable, 4 bytes
    .lcomm  array, 100      # local common variable, 100 bytes

“Uninitialized” is actually a misnomer: although these sections occupy
no space in the object file, the run-time start-up code or operating system
must clear the .bss area to zero before entering the program; most C
programs will rely on this behavior. Many toolchains will accommodate
this need through the start-up file provided with the tool, to be linked with
the user program†.
.sdata, .sbss
These sections are equivalent to the .data and .bss sections above, but
are used in some toolchains to hold small‡ data objects. This was
described earlier in this chapter, when the use of the gp was discussed.
Stack and heap
The stack and heap are not real sections that are recognized by the
assembler or linker. Typically they are initialized and maintained by the
run-time system by setting the sp register to the top of physical memory
(aligned to an 8-byte boundary), and setting the initial heap pointer (used
by the malloc functions) to the address of the end symbol.
Special symbols
Figure 9.1, “Program segments in memory” also shows a number of
special symbols which are automatically defined by the linker to allow
programs to discover the start and end of their various sections. Some of
these are part of the normal UNIX†† environment expected by many
programs; others are specific to the MIPS environment.
    Symbol    Standard?    Value
    _ftext                 start of text (code) segment
    etext     ✓            end of text (code) segment
    _fdata                 start of initialized data segment
    edata     ✓            end of initialized data segment
    _fbss                  start of uninitialized data segment
    end       ✓            end of uninitialized data segment

Data definition and alignment
Having selected the correct section, the data objects themselves are
specified using the directives described in this section.

† IDT/c provides this code in the file “/idtc/idt_csu.S”.
‡ The default for “small” is 8 bytes. This number can be changed
with the “-G” compiler/assembler switch.
†† UNIX is a trademark of Univel Inc.

.byte, .half, .word
These directives output integers which are 1, 2, or 4 bytes long,
respectively. A list of values may be given, separated by commas. Each
value may be repeated a number of times by following it with a colon and
a repeat count. For example:
    .byte   3               # 1 byte: 3
    .half   1, 2, 3         # 3 halfwords: 1 2 3
    .word   5 : 3, 6, 7     # 5 words: 5 5 5 6 7
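
The value-list syntax can be modelled with a few lines of Python. This is only a sketch of the "value : count" rule described above, restricted to plain integer literals; the function name is hypothetical and real assemblers accept full expressions.

```python
def expand_values(operands):
    """Expand a comma-separated operand list with optional ': count' repeats."""
    values = []
    for item in operands.split(","):
        value, _, count = item.partition(":")
        repeat = int(count.strip(), 0) if count.strip() else 1
        values.extend([int(value.strip(), 0)] * repeat)
    return values
```

Applied to the operand string of the .word line above, the function yields five values: three copies of 5 followed by 6 and 7.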

Note that the section’s location counter is automatically aligned to the
appropriate boundary before the data is emitted. To actually emit
unaligned data, explicit action must be taken using the .align directive
described below.
.float, .double
These output single or double precision floating-point values,
respectively. Multiple values and repeat counts may be used in the same
way as the integer directives.
    .float  1.4142175       # 1 single-precision value
    .double 1e+10, 3.1415   # 2 double-precision values

.ascii, .asciiz
These directives output ASCII strings, either without or with a
terminating null character respectively. The following example outputs two
identical strings:
    .ascii  "Hello\0"
    .asciiz "Hello"

.align
This directive allows the programmer to specify an alignment greater
than that which would normally be required for the next data directive. The
alignment is specified as a power of two, for example:
            .align  4       # align to 16-byte boundary (2^4)
    var:    .word   0

If a label (var in this case) comes immediately before the .align, then the
label will still be aligned correctly. For example, the following is exactly
equivalent to the above:
    var:    .align  4       # align to 16-byte boundary (2^4)
            .word   0
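
The effect of .align on the location counter is the familiar round-up-to-a-power-of-two calculation. A minimal Python sketch, with an invented function name:

```python
def align_up(location, power):
    """Round a location counter up to the next multiple of 2**power."""
    boundary = 1 << power
    return (location + boundary - 1) & ~(boundary - 1)
```

A location counter of 0x1003 followed by .align 4 therefore moves to 0x1010, while a counter already on a 16-byte boundary is left unchanged.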

For ‘‘packed’’ data structures this directive allows the programmer to
override the automatic alignment feature of .half, .word, etc., by specifying
a zero alignment. This will stay in effect until the next section change. For
example:
    .half   3               # correctly aligned halfword
    .align  0               # switch off auto-alignment
    .word   100             # word aligned on halfword boundary

.comm, .lcomm
These directives declare a common, or uninitialized data object by
specifying the object’s name and size.

An object declared with .comm is shared between all modules which
declare it: it is allocated space by the linker, which uses the largest
declared size. If any module declares it in one of the initialized .data,
.sdata or .rdata sections, then all the sizes are ignored and the initialized
definition is used instead†.
An object declared with .lcomm is local to the current module, and is
allocated space in the ‘‘uninitialized’’ .bss (or .sbss) section by the
assembler.
    .comm   dbgflag, 4      # global common variable, 4 bytes
    .lcomm  array, 100      # local uninitialized object, 100 bytes

.space
The .space directive increments the current section’s location counter
by a number of bytes, for example:
    struc:  .word   3
            .space  120     # 120 byte gap
            .word   -1

For normal data and text sections it just emits that many zero bytes, but
in assemblers which allow the programmer to declare new sections with
labels but no real content (like .bss), it will just increment the location
counter without emitting any data.

Symbol binding attributes
Symbols (i.e. labels in one of the code or data segments) can be made
visible and used by the linker which joins separate modules into a single
program. The linker binds a symbol to an address and substitutes the
address for assembler-language references to the symbol.
Symbols can have three levels of visibility:
• Local: invisible outside the module they are declared in, and unused
by the linker. The programmer does not need to worry about whether
the same local symbol name is used in another module.
• Global: made public for use by the linker. Programs can refer to a
global symbol in another module without defining any local space for
it, using the .extern directive.
• Weak global: obscure feature provided by some toolchains. This
allows the programmer to arrange that a symbol nominally referring
to a locally-defined space will actually refer to a global symbol, if the
linker finds one. If the linked program has no global symbol with that
name, the local version is used instead.
The preferred programming practice is to use the .comm directive
whenever possible.
.globl
Unlike C, where module-level data and functions are automatically
global unless declared with the static keyword, all assembler labels have
local binding unless explicitly modified by the .globl directive.
To define a label as having global binding that is visible to other
modules, use the directive as follows:
            .data
            .globl  status          # global variable
    status: .word   0

            .text
            .globl  set_status      # global function
    set_status:
            subu    sp, 24
            ...

† The actual handling may be toolchain dependent; this is the
most common technique.

Note that .globl is not required for objects declared with the .comm
directive; these automatically have global binding.
.extern
All references to labels which are not defined within the current module
are automatically assumed to be references to globally-bound symbols in
another module (i.e. external symbols). In some cases the assembler can
generate better code if it knows how big the referenced object is (e.g. the
global pointer, described earlier). An external object’s size is specified
using the .extern directive, as follows:
    .extern index, 4
    .extern array, 100
    lw      $3, index       # load a 4 byte (1 word) external
    lw      $2, array($3)   # load part of a 100 byte external
    sw      $2, value       # store in an unknown size external

.weakext
Some assemblers and toolchains support the concept of weak global
binding. This allows the program to specify a provisional binding for a
symbol, which may be overridden if a normal, or strong global definition is
encountered. For example:
            .data
            .weakext errno
    errno:  .word   0
            .text
            lw      $2, errno       # may use local or external
                                    # definition

This module, and others which access errno, will use this local definition
of errno, unless some other module also defines it with a .globl.
It is also possible to declare a local variable with one name, but make it
weakly global with a different name:
            .data
    myerrno: .word  0
            .weakext errno, myerrno
            .text
            lw      $2, myerrno     # always use local definition
            lw      $2, errno       # may use local definition, or
                                    # other

Function directives
Some MIPS assemblers expect the programmer to mark the start and
end of each function, and describe the stack frame which it uses. In some
toolchains this information is used by the debugger to perform stack
backtraces and the like.
.ent, .end
These directives mark the start and end of a function. A trivial leaf
function might look like this:
            .text
            .ent    localfunc
    localfunc:
            addu    v0, a1, a2      # return (arg1 + arg2)
            j       ra
            .end    localfunc

The label name may be omitted from the .end directive, which then
defaults to the name used in the last .ent. Specifying the name explicitly
allows the assembler to check that the programmer did not miss earlier
.ent or .end directives.
.aent
Some functions may provide multiple, alternative entry-points. The
.aent directive identifies labels as such. For example:
            .text
            .globl  memcpy
            .ent    memcpy
    memcpy: move    t0, a0          # swap first two arguments
            move    a0, a1
            move    a1, t0

            .globl  bcopy
            .aent   bcopy
    bcopy:  lb      t0, 0(a0)       # very slow byte copy
            sb      t0, 0(a1)
            addu    a0, 1
            addu    a1, 1
            subu    a2, 1
            bne     a2, zero, bcopy
            j       ra
            .end    memcpy

.frame, .mask, .fmask
Most functions need to allocate a stack frame in which to:
• save the return address register ($31);
• save any of the registers s0 - s8 and $f20 - $f31 which they modify
(known as the callee-saves registers);
• store local variables and temporaries;
• pass arguments to other functions.
In some CISC architectures the stack frame allocation, and possibly
register saving, is done by special purpose enter and leave instructions,
but in the MIPS architecture it is coded by the compiler or
assembly-language programmer. However, debuggers need to know the layout of each
stack frame to do stack backtraces and the like, and in the original MIPS
Corp. toolchain these directives provided this information; in other
toolchains they may be quietly ignored, and the stack layout determined
at run-time by disassembling the function prologue. Putting them in the
code is therefore not always essential, but does no harm and may make
the code more portable. Many toolchains supply a header file
which provides C-style macros to generate the appropriate directives, as
required (the procedure call protocol, and stack usage, is described in a
later chapter).
The .frame directive takes 3 operands:
• framereg: the register used to access the local stack frame – usually
$sp.
• returnreg: the register which holds the return address. Usually this is
$0, which indicates that the return address is stored in the stack
frame, or $31 if this is a leaf function (i.e. it doesn’t call any other
functions) and the return address is not saved.
• framesize: the total size of stack frame allocated by this function; it
should always be the case that $sp + framesize = previous $sp.

.frame framereg, framesize, returnreg

The .mask directive indicates where the function saves general registers
in the stack frame; .fmask does the same for floating-point registers. Their
first argument is regmask, a bitmap of which registers are being saved (i.e.
bit 1 set = $1, bit 2 set = $2, etc.); the second argument is regoffset, the
distance from framereg + framesize to the start of the register save area.
.mask regmask, regoffset
.fmask fregmask, fregoffs
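
Building the regmask operand is a matter of setting one bit per saved register. The Python sketch below is an illustration only; the function name is hypothetical and the register numbers in the usage note are an example, not a required save set.

```python
def register_mask(register_numbers):
    """Return the .mask/.fmask bitmap: bit n is set when register n is saved."""
    mask = 0
    for number in register_numbers:
        if not 0 <= number <= 31:
            raise ValueError("register number must be between 0 and 31")
        mask |= 1 << number
    return mask
```

A function which saves $16, $17 and the return address register $31 would therefore describe itself with a regmask of 0x80030000.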

How these directives relate to the stack frame layout, and examples of
their use, can be found in the next chapter. Remember that the directives
do not create the stack frame, they just describe its layout; that code still
has to be written explicitly by the compiler or assembly-language
programmer.

Assembler control (.set)
The original MIPS Corp. assembler is an ambitious program which
performs intelligent macro expansion of synthetic instructions, delay-slot
filling, peephole optimization, and sophisticated instruction reordering, or
scheduling, to minimize pipeline stalls. Many assemblers will be less
complex: modern optimizing compilers usually prefer to do this sort of
optimization themselves. However, in the interests of source code
compatibility, and to make the programmer’s life easier, most MIPS
assemblers perform macro expansion, insert extra nops as required to
hide branch and load delay-slots, and prevent pipeline hazards in normal
code (pipeline hazards are described in detail later).
With a reordering assembler it is sometimes necessary to restrict the
reordering, to guarantee correct timing, or to account for side-effects of
instructions which the assembler cannot know about (e.g. enabling and
disabling interrupts). The .set directives provide this control.
.set noreorder/reorder
By default most assemblers are in reorder mode, which allows them to
reorder instructions to avoid pipeline hazards and (perhaps) to achieve
better performance; in this mode the assembler will not allow the
programmer to insert nops. Conversely, code that is in a noreorder region
will not be optimized
or changed in any way. This means that the programmer can completely
control the instruction order, but the downside is that the code must now
be scheduled manually, and delay slots filled with useful instructions or
nops. For example:
        .set    noreorder
        lw      t0, 0(a0)
        nop                     # LDSLOT
        subu    t0, 1
        bne     t0, zero, loop
        nop                     # BDSLOT
        .set    reorder

.set volatile/novolatile
Any load or store instruction within a volatile region will not be moved
with respect to other loads and stores. This can be important for accesses
to memory mapped device registers, where the order of reads and writes is
important. For example, if the following code fragment did not use .set
volatile, then the assembler might decide to move the second lw before the
sw, to fill the first load delay-slot. Hazard avoidance and other
optimizations are not affected by this option.


        .set    volatile
        lw      t0,0(a0)
        sw      t0,0(a1)
        lw      t1,4(a0)
        .set    novolatile
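For comparison, C expresses a similar ordering constraint with the volatile qualifier: the compiler may not reorder or drop accesses made through volatile pointers relative to each other. The sketch below mirrors the three accesses of the fragment above; the function name and parameters are hypothetical, and it models only the compiler's view of ordering, not any hardware write buffer.

```c
#include <stdint.h>

/* Same access pattern as the assembler fragment:
 * read src[0], write it to dst[0], then read src[1].
 * Because src and dst are volatile, the second read may not be
 * hoisted above the write. */
void copy_then_read(volatile uint32_t *src, volatile uint32_t *dst,
                    uint32_t *next)
{
    uint32_t t0 = src[0];   /* lw t0,0(a0) */
    dst[0] = t0;            /* sw t0,0(a1) */
    *next = src[1];         /* lw t1,4(a0) */
}
```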

.set noat/at
The assembler reserves register $1 (known as the assembler temporary,
or $at register) to hold intermediate values when performing macro
expansions; if code attempts to use the register, a warning or error
message will be sent. It is not always obvious when the assembler will use
$at, and there are certain circumstances when the programmer may need
to ensure that it does not (for example in exception handlers before $1 has
been saved). Switching on noat will make the assembler generate an error
message if it needs to use $1 in a macro instruction, and allows the
programmer to use it explicitly without receiving warnings. For example:
xcptgen:
        .set    noat
        subu    k0,sp,XCP_SIZE
        sw      $at,XCP_AT(k0)
        .set    at

.set nomacro/macro
Most of the time the programmer will not care whether an assembler
statement generates more than one real machine instruction, but of course
there are exceptions. For instance when manually filling a branch
delay-slot in a noreorder region, it would almost certainly be wrong to use a
complex macro instruction; if the branch was taken, only the first
instruction of the macro would be executed. Switching on nomacro will
cause a warning if any statement expands to more than one machine
instruction. For example, compare the following two code fragments:
        .set    noreorder
        blt     a1,a2,loop
        .set    nomacro
        li      a0,0x1234
        .set    macro
        .set    reorder

        .set    noreorder
        blt     a1,a2,loop
        .set    nomacro
        li      a0,0x12345      # BDSLOT
        .set    macro
        .set    reorder

The first will assemble successfully, but the second will generate an
assembler error message, because its li is expanded into two machine
instructions (lui and ori). Some assemblers will catch this mistake
automatically.
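The difference between the two li statements can be sketched in C. The helper names are hypothetical; the single-instruction cases other than ori (a negative 16-bit value via addiu, or a value with a zero low half via lui alone) are assumptions about what a typical assembler does and are not taken from this manual.

```c
#include <stdint.h>

/* Number of machine instructions a typical assembler might need
 * for "li rd,imm" (an assumption, see the text above). */
int li_instruction_count(uint32_t imm)
{
    if (imm <= 0xFFFFu)        /* ori rd,$zero,imm */
        return 1;
    if (imm >= 0xFFFF8000u)    /* -32768..-1: addiu rd,$zero,imm */
        return 1;
    if ((imm & 0xFFFFu) == 0)  /* lui rd,imm>>16 */
        return 1;
    return 2;                  /* lui followed by ori */
}

/* The 16-bit halves used by the lui/ori pair. */
uint16_t li_hi(uint32_t imm) { return (uint16_t)(imm >> 16); }
uint16_t li_lo(uint32_t imm) { return (uint16_t)(imm & 0xFFFFu); }
```

Under these assumptions 0x1234 needs one instruction, while 0x12345 needs two: lui with 0x0001 and ori with 0x2345.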
.set nobopt/bopt
Setting the nobopt control prevents the assembler from carrying out
certain types of branch optimization. It is usually used only by compilers.

THE COMPLETE GUIDE TO ASSEMBLER INSTRUCTIONS
Table 9.2, “Assembler instructions” below shows, for every mnemonic
defined by the MIPS assemblers for the R3000 (MIPS 1) instruction set,
how it is likely to be implemented, and what it does.
Some naming conventions in the assembler may appear confusing:

• Unsigned versions: a ‘‘u’’ suffix on the assembler mnemonic is usually
to be read as ‘‘unsigned’’. Usually this follows the conventional
meaning; but the most common u-suffix instructions are addu and
subu: and here the u means that overflow into the sign bit will not
cause a trap. Regular add is never generated by C compilers.
Many compilers, not expecting there to be a run-time system to
handle overflow traps, will always use the ‘‘u’’ variant.
However, because the integer multiply instructions mult and multu
generate 64-bit results the signed and unsigned versions are really
different – and neither of the machine instructions produce a trap
under any circumstances.
• Immediate operands: as mentioned above, the programmer can use
immediate operands with most instructions (e.g. add rd, rs, 1); quite
a few arithmetic/logic instructions really do have ‘‘immediate’’
versions (called addi etc.). Most assemblers do not require the
programmer to explicitly know which machine instructions support
immediate variants.
• Building addresses, %lo_ and %hi_: synthesis of addressing modes
was described earlier. The table typically will list only one
address-mode variant for each instruction.
• What it does: the function of each instruction is described using ‘‘C’’
expression syntax; it is easy to get a rough idea, but a thorough
knowledge of C allows the exact behavior to be understood.
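The point about mult and multu can be illustrated with a small C model of the hi/lo register pair. This is a sketch only; the type hilo_t and the function names are hypothetical.

```c
#include <stdint.h>

/* Model of the multiply unit's result registers. */
typedef struct {
    uint32_t hi;
    uint32_t lo;
} hilo_t;

/* mult: signed 32x32 -> 64-bit product */
hilo_t model_mult(int32_t rs, int32_t rt)
{
    int64_t p = (int64_t)rs * (int64_t)rt;
    hilo_t r = { (uint32_t)((uint64_t)p >> 32), (uint32_t)(uint64_t)p };
    return r;
}

/* multu: unsigned 32x32 -> 64-bit product */
hilo_t model_multu(uint32_t rs, uint32_t rt)
{
    uint64_t p = (uint64_t)rs * (uint64_t)rt;
    hilo_t r = { (uint32_t)(p >> 32), (uint32_t)p };
    return r;
}
```

With the bit pattern 0xFFFFFFFF multiplied by 2, both forms leave 0xFFFFFFFE in lo, but mult leaves 0xFFFFFFFF in hi (the product is -2) while multu leaves 1. The low word is the same either way, which is why only the 64-bit result distinguishes the two instructions.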
The assembler descriptions use the following conventions:
Word                Used for

rs,rt               CPU registers used as operands

rd                  CPU register which receives the result

fs,ft               floating point register operands

fd                  floating point register which receives the result

imm                 16-bit ‘‘immediate’’ constant

label               the name of an entry point in the instruction stream

addr                one of a number of different address expressions

%hi_addr            where addr is a symbol defined in the data segment,
%lo_addr            ‘‘%hi_addr’’ and ‘‘%lo_addr’’ are as described above; that
                    is, they are the high and low parts of the value which can
                    be used in an lui/addiu sequence.

%gpoff_addr         the offset in the ‘‘small data’’ segment of an address

$at                 the ‘‘assembler temporary’’ register $1

$zero               register $0, which always reads as zero

$ra                 the ‘‘return address’’ register $31

RETURN              the point to where control returns to after a subroutine
                    call; this is the next instruction but one after the
                    branch/jump to subroutine, and is normally loaded into $ra
                    by the ‘‘.. and link’’ instructions.

trap(CAUSE, code)   Take a CPU trap; ‘‘CAUSE’’ determines the setting of the
                    Cause register, and ‘‘code’’ is a value not interpreted by
                    the hardware, but which system software can obtain by
                    looking at the trap instruction.
                    CAUSE values can be BREAK; FPINT (for floating point
                    exception); SYSCALL.

unordered(fs,ft)    some exceptional floating point values cannot be sensibly
                    compared; it is not sensible to ask whether one NaN is
                    bigger than another (NaN, ‘‘not a number’’, is produced
                    when the result of an operation is not defined). The
                    IEEE754 standard requires that for such a pair
                    ‘‘fs < ft’’, ‘‘fs == ft’’ and ‘‘fs > ft’’ shall all be false.
                    ‘‘unordered(fs,ft)’’ returns true for an unordered pair,
                    false otherwise.

fpcond              the floating point ‘‘condition bit’’ found in the FP
                    control/status register, and tested by the bc1f and bc1t
                    instructions.

Table 9.1: Assembler register and identifier conventions
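The split of an address into its %hi_ and %lo_ parts can be sketched in C. Because addiu (and the offset field of a load or store) sign-extends its 16-bit operand, the high part has to be rounded up whenever the low part will be read as negative. The function names below are hypothetical, and the rounding rule is stated here as the usual convention rather than quoted from this manual.

```c
#include <stdint.h>

/* Low part: the low 16 bits of addr, as the signed value that
 * addiu or a load/store offset will actually add. */
int32_t lo_part(uint32_t addr)
{
    int32_t lo = (int32_t)(addr & 0xFFFFu);
    return (lo >= 0x8000) ? lo - 0x10000 : lo;
}

/* High part: the upper 16 bits, adjusted by one when the low part
 * is negative, so that (hi << 16) + lo == addr. */
uint32_t hi_part(uint32_t addr)
{
    return ((addr + 0x8000u) >> 16) & 0xFFFFu;
}

/* What "lui rd,hi; addiu rd,rd,lo" computes. */
uint32_t rebuild(uint32_t addr)
{
    return (hi_part(addr) << 16) + (uint32_t)lo_part(addr);
}
```

For example, the address 0x12348000 splits into a high part of 0x1235 and a low part of -32768, which add back to the original value.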

Assembler           Expands To                  What it does

move rd,rs          addu rd,rs,$zero            rd = rs;

Branch (PC-relative, all conditional)

b label             beq $zero,$zero,label       goto label;

beq rs,rt,label                                 if (rs == rt) goto label;

bge rs,rt,label     slt $at,rs,rt               if ((signed) rs >= (signed) rt)
                    beq $at,$zero,label             goto label;

bgeu rs,rt,label    sltu $at,rs,rt              if ((unsigned) rs >= (unsigned) rt)
                    beq $at,$zero,label             goto label;

bgt rs,rt,label     slt $at,rt,rs               if ((signed) rs > (signed) rt)
                    bne $at,$zero,label             goto label;

bgtu rs,rt,label    sltu $at,rt,rs              if ((unsigned) rs > (unsigned) rt)
                    bne $at,$zero,label             goto label;

ble rs,rt,label     slt $at,rt,rs               if ((signed) rs <= (signed) rt)
                    beq $at,$zero,label             goto label;

bleu rs,rt,label    sltu $at,rt,rs              if ((unsigned) rs <= (unsigned) rt)
                    beq $at,$zero,label             goto label;

blt rs,rt,label     slt $at,rs,rt               if ((signed) rs < (signed) rt)
                    bne $at,$zero,label             goto label;

bltu rs,rt,label    sltu $at,rs,rt              if ((unsigned) rs < (unsigned) rt)
                    bne $at,$zero,label             goto label;

bne rs,rt,label                                 if (rs != rt) goto label;

beqz rs,label       beq rs,$zero,label          if (rs == 0) goto label;

bgez rs,label                                   if ((signed) rs >= 0) goto label;

bgtz rs,label                                   if ((signed) rs > 0) goto label;

blez rs,label                                   if ((signed) rs <= 0) goto label;
Table 9.2: Assembler instructions

bltz rs,label                                   if ((signed) rs < 0) goto label;

bnez rs,label

bne rs,$zero,label

if (rs != 0) goto label;

bal label

bgezal $zero,label

ra = RETURN;
goto label;

bgezal rs,label

if ((signed) rs >= 0) {
ra = RETURN;
goto label;
}

bltzal rs,label

if ((signed) rs <0) {
ra = RETURN;
goto label;
}

Unary arithmetic/logic instructions
abs rd,rs

sra $at,rs,31
xor rd,rs,$at
sub rd,rd,$at

rd = rs <0 ? -rs: rs;

abs rd

sra $at,rd,31
xor rd,rd,$at
sub rd,rd,$at

rd = rd <0 ? -rd: rd;

neg rd,rs

sub rd,$zero,rs

rd = -rs; /* trap on overflow */

neg rd

sub rd,$zero,rd

rd = -rd; /* trap on overflow */

negu rd,rs

subu rd,$zero,rs

rd = -rs; /* no trap */

negu rd

subu rd,$zero,rd

rd = -rd; /* no trap */

not rd,rs

nor rd,rs,$zero

rd = ~rs;

not rd

nor rd,rd,$zero

rd = ~rd;

Binary arithmetic/logical operations
add rd,rs,rt                                    rd = rs + rt; /* trap on overflow */

add rd,rs           add rd,rd,rs                rd += rs; /* trap on overflow */

addu rd,rs,rt

rd = rs + rt; /* no trap on overflow */

addu rd,rs

rd += rs; /* no trap on overflow */

and rd,rs,rt

rd = rs & rt;

and rd,rs

and rd,rd,rs

rd &= rs;

Table 9.2: Assembler instructions

div rd,rs,rt        div rs,rt                   rd = rs/rt;
                    bne rt,$zero,1f             /* trap divide by zero */
                    nop                         /* trap overflow conditions */
                    break 7
                    1:
                    li $at,-1
                    bne rt,$at,2f
                    nop
                    lui $at,0x8000
                    bne rs,$at,2f
                    nop
                    break 6
                    2:
                    mflo rd

div rd,rs           as above                    rd = rd/rs; /* trap on errors */

divu rd,rs,rt       divu rs,rt                  rd = rs/rt;
                    bne rt,$zero,1f             /* trap on divide by zero */
                    nop                         /* no check for overflow */
                    break 7
                    1:
                    mflo rd

or rd,rs,rt                                     rd = rs | rt;

mul rd,rs,rt        multu rs,rt                 rd = rs*rt; /* no checks */
                    mflo rd

mulo rd,rs,rt       mult rs,rt                  rd = rs * rt; /* signed */
                    mflo rd                     /* trap on overflow */
                    sra rd,rd,31
                    mfhi $at
                    beq rd,$at,1f
                    nop
                    break 6
                    1:
                    mflo rd

mulou rd,rs,rt      multu rs,rt                 rd = (unsigned) rs * rt;
                    mfhi $at                    /* trap on overflow */
                    mflo rd
                    beq $at,$zero,1f
                    nop
                    break 6
                    1:

nor rd,rs,rt                                    rd = ~(rs | rt);

Table 9.2: Assembler instructions

rem rd,rs,rt        div rs,rt                   rd = rs%rt;
                    bne rt,$zero,1f             /* trap if rt == 0 */
                    nop                         /* trap if it will overflow */
                    break 7
                    1:
                    li $at,-1
                    bne rt,$at,2f
                    nop
                    lui $at,0x8000
                    bne rs,$at,2f
                    nop
                    break 6
                    2:
                    mfhi rd

remu rd,rs,rt       divu rs,rt                  /* unsigned operation, ignore overflow */
                    bne rt,$zero,1f             rd = rs%rt;
                    nop                         /* trap if rt == 0 */
                    break 7
                    1:
                    mfhi rd

rol rd,rs,rt

negu $at,rt
srlv $at,rs,$at
sllv rd,rs,rt
or rd,rd,$at

/* rd = rs rotated left by rt */

ror rd,rs,rt

negu $at,rt
sllv $at,rs,$at
srlv rd,rs,rt
or rd,rd,$at

/* rd = rs rotated right by rt */

seq rd,rs,rt

xor rd,rs,rt
sltiu rd,rd,1

rd = (rs == rt) ? 1: 0;

sge rd,rs,rt

slt rd,rs,rt
xori rd,rd,1

rd = ((signed)rs >= (signed)rt) ? 1: 0;

sgeu rd,rs,rt

sltu rd,rs,rt
xori rd,rd,1

rd = ((unsigned)rs >= (unsigned)rt) ? 1: 0;

sgt rd,rs,rt

slt rd,rt,rs

rd = ((signed)rs > (signed)rt) ? 1: 0;

sgtu rd,rs,rt

sltu rd,rt,rs

rd = ((unsigned)rs > (unsigned)rt) ? 1: 0;

sle rd,rs,rt

slt rd,rt,rs
xori rd,rd,1

rd = ((signed)rs <= (signed)rt) ? 1: 0;

sleu rd,rs,rt

sltu rd,rt,rs
xori rd,rd,1

rd = ((unsigned)rs <= (unsigned)rt) ? 1: 0;

slt rd,rs,rt                                    rd = ((signed)rs < (signed)rt) ? 1: 0;

sltu rd,rs,rt                                   rd = ((unsigned)rs < (unsigned)rt) ? 1: 0;

sne rd,rs,rt        xor rd,rs,rt                rd = (rs != rt) ? 1: 0;
                    sltu rd,$zero,rd

Table 9.2: Assembler instructions

sll rd,rs,rt        sllv rd,rs,rt               rd = rs << rt;

srl rd,rs,rt

srlv rd,rs,rt

rd = ((unsigned) rs) >> rt;

sub rd,rs,rt

rd = rs - rt; /* trap on overflow */

subu rd,rs,rt

rd = rs - rt; /* no trap on overflow */

xor rd,rs,rt

rd = rs ^ rt;

Binary instructions with one constant operand (‘‘immediate’’)
addi opcode is legal but unnecessary
add rd,rs,imm

addi rd,rs,imm

lui rd,hi_imm
ori rd,rd,lo_imm
add rd,rs,rd

/* “add” traps on overflow */
/* when -32768 <= imm <32768 */
rd = rs + (signed) imm;
/* for big values add and ALL signed ops
* expand like this */
rd = imm & 0xFFFF0000;
rd |= imm & 0xFFFF;
rd = rs + rd;

addu
rd,rs,imm

addiu rd,rs,imm

/* “addu” won’t trap on overflow */
/* will expand if imm bigger than 16 bit */
rd = rs + (signed) imm;

sub rd,rs,imm

addi rd,rs,-imm

/* trap on overflow */
/* will expand if imm bigger than 16 bit */
rd = rs - (signed) imm;

subu
rd,rs,imm

addiu rd,rs,-imm

/* no trap on overflow */
/* will expand if imm bigger than 16 bit */
rd = rs - (signed) imm;

and rd,rs,imm

andi rd,rs,imm

rd = rs & imm; /* 0 <= imm < 65536 */

lui rd,hi_imm
ori rd,rd,lo_imm
and rd,rs,rd

/* for big values add and ALL unsigned
ops
* expand like this */
rd = imm & 0xFFFF0000;
rd |= imm & 0xFFFF;
rd = rs & rd;

or rd,rs,imm

ori rd,rs,imm

rd = rs | imm; /* 0 <= imm < 65536 */

slt rd,rs,imm

slti rd,rs,imm

/* -32768 <= imm <32768 */
rd = ((signed) rs <(signed) imm) ? 1: 0;
/* expanded as for add if imm big */

sltu rd,rs,imm

sltiu rd,rs,imm

rd = ((unsigned) rs <(unsigned) imm) ? 1:
0;
/* expanded as for “and”if imm big */

xor rd,rs,imm

xori rd,rs,imm

rd = rs ^ imm;

li rd,imm

ori rd,$zero,imm

rd = (unsigned) imm; /* imm <= 65535 */

lui rd,hi_imm
ori rd,$zero,lo_imm

/* for big imm value expand to... */
rd = imm & 0xFFFF0000;
rd |= imm & 0xFFFF;

lui rd,imm

rd = imm << 16;

Multiply/divide unit machine instructions
Table 9.2: Assembler instructions

mult rs,rt                                      /* Start signed multiply of rs and rt.
                                                 * Result can be retrieved, in a while,
                                                 * using mfhi/mflo
                                                 */

multu rs,rt                                     /* start unsigned multiply of rs and rt */

divd rs,rt                                      /* start signed divide rs/rt */

divdu rs,rt                                     /* start unsigned divide rs/rt */

mfhi rd                                         /* retrieve remainder from divide or
                                                 * high-order word of result of multiply */

mflo rd

/* retrieve result of divide or low-order
* word of result of multiply */

mthi rs

/* load multiply unit ‘‘hi’’ register */

mtlo rs

/* load multiply unit ‘‘lo’’ register */

Unconditional (absolute) branch and call
jal label

ra = RETURN;
goto label;

jalr rd,rs

rd = RETURN;
goto *rs;

jalr rs             jalr $ra,rs                 ra = RETURN;
                                                goto *rs;

jal rd,addr         lui $at,%hi_addr            rd = RETURN;
                    addiu $at,$at,%lo_addr      goto *at;
                    jalr rd,$at

j label                                         goto label;

jr rs

goto *rs;

No-op
nop                 sll $zero,$zero,0           /* no-op, instruction code == 0 */

Load address
la rd,label         lui rd,%hi_label            rd = %hi_label << 16;
                    addiu rd,rd,%lo_label       rd += (signed) %lo_label;

Address mode implementation for load/store
lw rd,label         lui rd,%hi_label            /* link-time determined location */
                    lw rd,%lo_label(rd)         /* note can use rd or $at for lw */

                    lw rd,%gpoff_addr($gp)      /* link-time location, in gp segment */

lw rd,offset(rs)    lw rd,offset(rs)            /* single instruction if offset fits
                                                 * in 16 bits */

                    lui rd,%hi_offset           /* sequence for big offset */
                    addu rd,rd,rs
                    lw rd,%lo_offset(rd)

Load and store instructions
Table 9.2: Assembler instructions


lw rd,addr

/* load word */
rd = *((int *) addr);

lh rd,addr

/* load half-word,sign-extend */
rd = *((short *) addr);

lhu rd,addr

/* load half-word,zero-extend */
rd = *((unsigned short *) addr);

lb rd,addr

/* load byte, sign-extend */
rd = *((signed char *) addr);

lbu rd,addr

/* load byte, zero-extend */
rd = *((unsigned char *) addr);

ld $t2,addr

lui $at,%hi_addr
addiu
$at,$at,%lo_addr
lw $t2,0($at)
lw $t3,4($at)

/* load 64-bit integer into pair of regs */

sw rs,addr

/* store word */
*((int *) addr) = rs;

sh rs,addr

/* store half-word */
*((short *) addr) = rs;

sb rs,addr

/* store byte */
*((char *) addr) = rs;

sd $t2,addr

lui $at,%hi_addr
addiu
$at,$at,%lo_addr
sw $t2,0($at)
sw $t3,4($at)

/* store 64-bit integer */

ulw rd,addr         lui $at,%hi_addr            /* load word unaligned */
                    addiu $at,$at,%lo_addr      /* if addr is aligned, does same load
                    lwl rd,0($at)                * twice */
                    lwr rd,3($at)

usw rs,addr         lui $at,%hi_addr            /* store word unaligned */
                    addiu $at,$at,%lo_addr      /* if addr is aligned, does same store
                    swl rs,0($at)                * twice */
                    swr rs,3($at)

lwl rd,addr                                     load/store word left/right, see “Unaligned
lwr rd,addr                                     loads and store” on page 1-5
swl rs,addr
swr rs,addr
l.s fd,addr

lui $at,%hi_addr
lwc1
fd,%lo_addr($at)

/* load FP single */
fd = *((float *) addr);

l.d $f6,addr

lui $at,%hi_addr
addiu
$at,$at,%lo_addr
lwc1 $f7,0($at)
lwc1 $f6,4($at)

/* load FP double into reg pair */
fd = *((double *) addr);

s.s fs,addr

swc1 fs,addr

/* store FP single */
*((float *) addr) = fs;

Table 9.2: Assembler instructions

s.d $f2,addr        lui $at,%hi_addr            /* store FP double from reg pair */
                    addiu $at,$at,%lo_addr      *((double *) addr) = fs;
                    swc1 $f3,0($at)
                    swc1 $f2,4($at)

Co-processor ‘‘condition’’ tests
bc0t label
bc2t label
bc3t label

/* goto label if corresponding BrCond
* input is active */

bc0f label
bc2f label
bc3f label

/* goto label if corresponding BrCond
* input is inactive */

Trap instructions
break code

trap(BREAK, code);

syscall

trap(SYSCALL, 0)

teq rs,rt,code

bne rs,rt,1f
nop
break code
1:

/* R4000 compatibility instruction */
if (rs == rt)
trap(BREAK, code);

tge rs,rt,code

slt $at,rs,rt
bne $at,$zero,1f
nop
break code
1:

if ((signed)rs >= (signed)rt)
trap(BREAK, code);

tgeu rs,rt,code

sltu $at,rs,rt
bne $at,$zero,1f
nop
break code
1:

if ((unsigned)rs >= (unsigned)rt)
trap(BREAK, code);

tlt rs,rt,code

slt $at,rs,rt
beq $at,$zero,1f
nop
break code
1:

if ((signed)rs <(signed)rt)
trap(BREAK, code);

tltu rs,rt,code

sltu $at,rs,rt
beq $at,$zero,1f
nop
break code
1:

if ((unsigned)rs <(unsigned)rt)
trap(BREAK, code);

tne rs,rt,code

beq rs,rt,1f
nop
break code
1:

if (rs != rt)
trap(BREAK, code);

Floating point instructions.
All come in both ‘‘.d’’ (64-bit) and ‘‘.s’’ (32-bit) forms
Only ‘‘.d’’ listed.
Test and set condition flag instructions
c.f.d

if (unordered(fs,ft))
trap(FPINT);
fpcond = 0;

c.sf.d

fpcond = 0;
Table 9.2: Assembler instructions


c.un.d

if (unordered(fs,ft))
trap(FPINT);
fpcond = unordered(fs,ft);

c.ngle.d

fpcond = unordered(fs,ft);

c.eq.d

if (unordered(fs,ft))
trap(FPINT);
fpcond = (fs == ft);

c.seq.d

fpcond = (fs == ft);

c.ueq.d

if (unordered(fs,ft))
fpcond = (fs == ft) || unordered(fs,ft);

c.ngl.d

fpcond = (fs == ft) || unordered(fs,ft);

c.olt.d

if (unordered(fs,ft))
trap(FPINT);
fpcond = (fs 0) ? fs: -fs;
abs.d fd,fd

neg.d fd,fs
neg.d fd

fd = (fd > 0) ? fd: -fd;
fd = -fs;

neg.d fd,fd

fd = -fd;

Convert between formats
cvt.X.Y should be read “convert TO X FROM Y”
cvt.d.s fd,fs                                   fd = (double) ((float) fs);

cvt.d.s fd          cvt.d.s fd,fd               fd = (double) ((float) fd);

cvt.d.w fd,fs                                   fd = (double) ((int) fs);

cvt.d.w fd          cvt.d.w fd,fd               fd = (double) ((int) fd);

cvt.s.d fd,fs                                   fd = (float) ((double) fs);

cvt.s.d fd          cvt.s.d fd,fd               fd = (float) ((double) fd);

cvt.s.w fd,fs                                   fd = (float)((int) fs);

cvt.s.w fd          cvt.s.w fd,fd               fd = (float)((int) fd);

cvt.w.d fd,fs                                   /* note integer value is chosen
                                                 * according to rounding mode */
                                                fd = (int)((double) fs);

cvt.w.d fd          cvt.w.d fd,fd               fd = (int)((double) fd);

cvt.w.s fd,fs                                   fd = (int)((float) fs);

cvt.w.s fd          cvt.w.s fd,fd               fd = (int)((float) fd);

Convert from floating-point to integer
using an explicit rounding mode.
Note: rt is used as a temporary.
ceil.w.d fd,fs,rt

cfc1 rt,$31
nop
ori $at,rt,3
xori $at,$at,1
ctc1 $at,$31
nop
cvt.w.d fd,fs
ctc1 rt,$31

fd = ceil((double) fd);

floor.w.d
fd,fs,rt

cfc1 rt,$31
nop
ori $at,rt,3
xori $at,$at,0
ctc1 $at,$31
nop
cvt.w.d fd,fs
ctc1 rt,$31

fd = floor((double) fd);

round.w.d
fd,fs,rt

cfc1 rt,$31
nop
ori $at,rt,3
xori $at,$at,3
ctc1 $at,$31
nop
cvt.w.d fd,fs
ctc1 rt,$31

fd = round((double) fd);

trunc.w.d
fd,fs,rt

cfc1 rt,$31
nop
ori $at,rt,3
xori $at,$at,2
ctc1 $at,$31
nop
cvt.w.d fd,fs
ctc1 rt,$31

fd = (int) ((double) fd);

ceil.w.s fd,fs,rt

see above

fd = ceil((float) fd);

floor.w.s
fd,fs,rt

see above

fd = floor((float) fd);

round.w.s
fd,fs,rt

see above

fd = round((float) fd);

trunc.w.s
fd,fs,rt

see above

fd = (int) ((float) fd);

Arithmetic operations
all can trap under some circumstances
Table 9.2: Assembler instructions

add.d fd,fs,ft                                  fd = fs + ft;

add.d fd,fs         add.d fd,fd,fs              fd += fs;

div.d fd,fs,ft                                  fd = fs/ft;

div.d fd,fs         div.d fd,fd,fs              fd /= fs;

mul.d fd,fs,ft                                  fd = fs*ft;

mul.d fd,fs         mul.d fd,fd,fs              fd *= fs;

sub.d fd,fs,ft                                  fd = fs - ft;

sub.d fd,fs         sub.d fd,fd,fs              fd -= fs;

Conditional branch following test
bc1f label

if (!fpcond)
goto label;

bc1t label

if (fpcond)
goto label;

Move data between FP and integer register
mfc1 rd,fs

/* no format conversion done, just copies
* bits. Can use odd-numbered fp
registers */
rd = fs;

mtc1 rs,fd

/* no format conversion done, just copies
* bits. Can use odd-numbered fp
registers */
fd = rs;

mfc1.d $t2,$f2      mfc1 $t2,$f3                /* move a double value (just bits, no
                    mfc1 $t3,$f2                 * conversion) from FP reg pair
                                                 * to integer register pair */

mtc1.d $t2,$f2      mtc1 $t2,$f3                /* move a double value (just bits, no
                    mtc1 $t3,$f2                 * conversion) from integer register pair
                                                 * to FP reg pair */

CPU control instructions (privileged mode only)
mfc0 rd, nn

rd = (contents of CPU control reg nn);

mtc0 rs, nn

(CPU control reg nn) = rs;

tlbr

These instructions are used to setup the
TLB (memory management hardware) and
are described in Chapters 2 & 3.

tlbwi
tlbwr
tlbp
rfe

Used at the end of an exception routine
Restores kernel-mode and global
interrupt enable bits from the 3-level
“stack” in the status register SR. See
chapter 3.
Table 9.2: Assembler instructions
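As an illustration of how to read the ‘‘Expands To’’ column, the three-instruction expansion listed for abs can be modeled in C, one statement per machine instruction. The function name is hypothetical, and the overflow trap that sub would take is not modeled.

```c
#include <stdint.h>

/* Model of: sra $at,rs,31 ; xor rd,rs,$at ; sub rd,rd,$at */
uint32_t model_abs(uint32_t rs)
{
    /* sra $at,rs,31: all ones if rs is negative, zero otherwise */
    uint32_t at = (rs & 0x80000000u) ? 0xFFFFFFFFu : 0u;
    /* xor rd,rs,$at: inverts every bit of a negative value */
    uint32_t rd = rs ^ at;
    /* sub rd,rd,$at: subtracting -1 completes the negation */
    return rd - at;
}
```

The sequence is branch-free: for a negative input it computes (~rs) + 1, and for a non-negative input it leaves rs unchanged. The one awkward input is 0x80000000, whose negation does not fit in 32 bits; the model simply wraps, whereas the real sub instruction would trap.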

ALPHABETIC LIST OF ASSEMBLER INSTRUCTIONS
In this list real hardware instructions are marked with a dagger.
abs rd,rs: integer absolute value
abs.d fd,fs†: FP double precision absolute value
abs.s fd,fs†: FP single precision absolute value

add rd,rs,rt_imm†: add, trap on overflow
add.d fd,fs,ft†: FP double precision add
add.s fd,fs1,fs2†: FP single precision add
addi rd,rs,imm†: add immediate, trap on overflow
addiu rd,rs,imm†: add immediate, never trap
addu rd,rs,rt_imm†: add, never trap
and rd,rs,rt_imm†: logical AND
andi rd,rs,imm†: logical AND immediate
bal label: PC-relative subroutine call
bc0f offset†: branch if CPCOND input signal inactive
bc0t offset†: branch if CPCOND input signal active
bc1f label†: branch if FP condition bit clear
bc1t label†: branch if FP condition bit set
beq rs,rt,label†: branch if rs == rt
beqz rs,label: branch if rs is zero
bge rs,rt,label: branch if rs ≥ rt (signed compare)
bgeu rs,rt,label: branch if rs ≥ rt (unsigned compare)
bgez rs,label†: branch if rs ≥ 0 (signed)
bgezal rs,label†: branch to subroutine if rs ≥ 0 (signed)
bgt rs,rt,label: branch if rs > rt (signed)
bgtu rs,rt,label: branch if rs > rt (unsigned)
bgtz rs,label†: branch if rs > 0 (signed)
ble rs,rt,label: branch if rs ≤ rt (signed)
bleu rs,rt,label: branch if rs ≤ rt (unsigned)
blez rs,label†: branch if rs ≤ 0 (signed)
blt rs,rt,label: branch rs rt (signed), 0 otherwise
sgtu rd,rs,rt: set rd to 1 if rs > rt (unsigned), 0 otherwise
sh rs2,offset(rs1)†: store half-word (16bits) to memory
sle rd,rs,rt: set rd to 1 if rs ≤ rt (signed), 0 otherwise
sleu rd,rs,rt: set rd to 1 if rs ≤ rt (unsigned), 0 otherwise
sll rd,rs,rt†: rd = rs shifted left (bigger) by rt (max 31)

sllv rd,rs1,rs2†: rd = rs1 shifted left (bigger) by rs2 (max 31)
slt rd,rs,rt_imm†: set rd to 1 if rs

sp+8

sp+4

address of "bearer"

sp+0

address of "bear"

There are less than 16 bytes of arguments, so they all fit in registers.
That seems like a complex way of deciding to put three arguments into
the usual registers. However, its value is clearer in the case of something
a bit more tricky from the math library:
double ldexp (double, int);
y = ldexp(x, 23); /* y = x * (2**23) */

The arguments come out as
Location

Contents

In register

sp+12

sp+8

sp+4

(double) x

$f12/$f13

sp+0

Exotic example; passing structures
C allows the programmer to use structure types as arguments (it is
much more common practice to pass pointers to structures instead, but
the language supports both). In MIPS the structure forms part of the
‘‘argument structure’’. In the following example:
struct thing {
    char letter;
    short count;
    int value;
} thing = {'z', 46, 100000};

10–2

C PROGRAMMING

CHAPTER 10

(void) processthing (thing);

Location

Contents

In register

sp+4

100000

sp+0

‘‘z’’

In a big-endian CPU, the result of this is that the char value in the
structure should end up in the most-significant 8 bits of the argument
register, but packed together with the short.

How printf() and varargs work
Consider this example:
printf ("length = %f, width = %f, num = %d\n", 1.414, 1.0, 12);

Location        Contents                    In register

sp+24           (int) 12

sp+16, sp+20    (double) 1.0

sp+8, sp+12     (double) 1.414              a2/a3

sp+4            (padding)

sp+0            pointer to format string    a0

Note:
• The padding at sp+4 is required to get correct alignment of the double
values (the C rule is that floating point arguments are always passed
as double unless the programmer explicitly asks otherwise with a
typecast or function prototype).
• Because the first argument is not a floating point value, the compiler
doesn’t use an FP register for the second argument either. The data
will instead be loaded into the two registers a2 and a3.
This turns out to be very useful.
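The offsets in the argument structure follow from one rule: each argument is placed at the next offset that satisfies its own alignment. A minimal C sketch of that rule is given below; place_arg is a hypothetical helper, and it deals only with offsets, not with the choice between integer and FP registers.

```c
#include <stddef.h>

/* Place one argument in the argument structure.
 * *next is the first free offset; size and align are in bytes
 * (align must be a power of two). Returns the argument's offset
 * from sp and advances *next past it. */
size_t place_arg(size_t *next, size_t size, size_t align)
{
    size_t off = (*next + align - 1) & ~(align - 1);
    *next = off + size;
    return off;
}
```

Applied to the printf() call above (a 4-byte pointer, two 8-byte doubles, then a 4-byte int), this yields offsets 0, 8, 16 and 24, with the gap at sp+4 appearing as alignment padding.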
The printf() subroutine is defined with the ‘‘stdarg’’ or ‘‘varargs’’ macro
package, which provides a portable cover for the register and stack
manipulation involved. The printf routine picks off the arguments by
taking the address of the first or second argument, and then can advance
up the argument structure to find further arguments.
However, the macro package also has to persuade the C compiler to copy
a0 through a3 into their ‘‘shadow’’ locations in the argument structure.
Some compilers will detect the use of the address of an argument and take
the hint; ANSI C compilers should react to ‘‘...’’ in the function definition;
others may need a ‘‘pragma’’.
This should clarify the value of placing the double value into the integer
registers; that way ‘‘stdarg’’ and the compiler can just store the registers
a0- a3 into the first 16 bytes of the argument structure, regardless of the
type or number of the arguments.
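A small function written with the standard ‘‘stdarg’’ macros shows the portable side of this mechanism. The function sum_mixed and its type-string convention are hypothetical, chosen only to exercise a mixture of int and double arguments; how va_arg walks the argument structure is left to the compiler.

```c
#include <stdarg.h>

/* Add up a variable list of arguments. The string 'types' has one
 * character per argument: 'd' for an int, 'f' for a double. */
double sum_mixed(const char *types, ...)
{
    va_list ap;
    double total = 0.0;

    va_start(ap, types);
    for (const char *t = types; *t != '\0'; t++) {
        if (*t == 'd')
            total += va_arg(ap, int);
        else if (*t == 'f')
            total += va_arg(ap, double);
    }
    va_end(ap);
    return total;
}
```

A call such as sum_mixed("ffd", 1.5, 1.0, 12) passes its arguments in the same shape as the printf() example: a pointer first, then two doubles and an int.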

10–3

CHAPTER 10

C PROGRAMMING

Returning value from a function
An integer or pointer return value will be in register v0 ($2). Register v1
($3) is reserved by the MIPS ABI but many compilers don’t use it. However,
expect it to be used for returning 64-bit integer values in certain compilers
(probably as a long long data type).
Any floating point result comes back in register $f0 (implicitly using $f1
if the value is double precision).
If a function is declared in C as returning a structure value, that value
is not returned in registers. Instead an additional implicit argument, a
pointer to a caller-supplied structure template, is prepended to the explicit
arguments; and the called function copies its return value to the template.
Following the normal rules for arguments the ‘‘implicit’’ first argument will
be in register a0 when the function is called. On return v0 points to the
returned structure, too.
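The structure-return convention can be pictured by writing the implicit argument out by hand. In the sketch below, divide() is ordinary C and divide_explicit() is a hand-written equivalent of what the convention describes; both names and the struct are hypothetical, and the actual code a compiler generates may differ in detail.

```c
/* A structure too large to return in v0 alone. */
struct pair {
    int quot;
    int rem;
};

/* As written in C: returns the structure by value. */
struct pair divide(int a, int b)
{
    struct pair p = { a / b, a % b };
    return p;
}

/* The same function with the implicit first argument made explicit:
 * the caller supplies the storage, and the pointer is handed back
 * as the (register-sized) return value. */
struct pair *divide_explicit(struct pair *result, int a, int b)
{
    result->quot = a / b;
    result->rem = a % b;
    return result;
}
```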

Macros for prologues and epilogues
Most assemblers seem to provide a partial prologue macro, which at
least hides the pseudo-ops required to define a function and to record, in
the object file, information for debuggers to use when conversing about
your function.

Stack-frame allocation
Provided that a function (written in any language) adheres to the calling
conventions, it can do anything it likes with the stack. There are some
additional conventions which, if adhered to, can ease the task of a
debugger while doing a stack backtrace. These conventions are not
described here; use of the recommended function prologue and epilogue
macros enables code to support them.
Functions can be divided into three classes; three different approaches
satisfy most programming needs.
Leaf functions
Functions which contain no calls to other functions are called leaf
functions. Because of this they don’t have to worry about setting up
argument structures and can safely maintain data in the non-preserved
registers t0 – t7, a0 – a3 and v0 – v1, and may use the stack for storage if
required. They can leave the return address in register ra and return
directly to it.
Most functions written in assembler for tuning reasons, or as
convenience functions for accessing features not visible in C, will be leaf
functions. The declaration of such a function is very simple, e.g.:
#include
#include
LEAF(myleaf)
        ...
        ...
        j       ra
END(myleaf)

Most toolchains can pass assembler source code through the C macro
pre-processor before assembling it. The files and include useful macros (like LEAF and END, above) for declaring
global functions and data; they also allow the use of software register
names, e.g. a0 instead of $4. If using the MIPS Corp. toolchain, for
example, the above fragment would be expanded to:
        .globl  myleaf
        .ent    myleaf,0
        ...
        ...
        j       $31
        .end    myleaf

Other toolchains may have different definitions for these macros, as
appropriate to their needs.
Non-leaf functions
Non-leaf functions are those which contain calls to other functions.
Normally the function starts with code (the ‘‘function prologue’’) to reset sp
to the low-water mark of argument structures for any functions which may
be called, and to save the incoming values of any of the registers s0 – s8
which the function uses. Stack locations must also be reserved for ra,
automatic (i.e. stack-based local) variables, and any further registers
whose value this function needs preserved over its own calls (if the values
of the argument registers a0 – a3 need to be preserved, they can be saved
into their standard positions on the ‘‘argument structure’’).
Note that, since sp is set only once (in the function prologue) all
stack-held locations can be referenced by fixed offsets from sp.
This is illustrated in the non-leaf function listed below, in conjunction
with the picture of the stackframe in Figure 10.1, “Stackframe for a
non-leaf function”.

[Figure 10.1. Stackframe for a non-leaf function. The figure shows the
stack from higher to lower addresses. At and above ‘‘sp on entry’’: more
arguments (if won’t fit in 16 bytes) and space for arg 4, arg 3, arg 2 and
arg 1. Within framesize, down to ‘‘sp while running’’: automatic (local)
variables, integer register save area (located by regoffs), f.p. register
save area (located by fregoffs), and space for building arguments for
nested calls (>= 16 bytes).]

#include
#include
#
# myfunc (arg1, arg2, arg3, arg4, arg5)
#
# framesize = locals + regsave (ra,s0) + pad + fregsave (f20/21)
#             + args + pad
myfunc_frmsz = 4 + 8 + 4 + 8 + (5 * 4) + 4
NESTED(myfunc, myfunc_frmsz, zero)
        subu    sp,myfunc_frmsz
        .mask   0x80010000, -4
        sw      ra,myfunc_frmsz-8(sp)
        sw      s0,myfunc_frmsz-12(sp)
        .fmask  0x00300000, -16
        s.d     $f20,myfunc_frmsz-24(sp)
        ...

        # local = otherfunc (arg5, arg2, arg3, arg4, arg1)
        sw      a0,16(sp)               # arg5 (out) = arg1 (in)
        lw      a0,myfunc_frmsz+16(sp)  # arg1 (out) = arg5 (in)
        jal     otherfunc
        sw      v0,myfunc_frmsz-4(sp)   # local = result
        ...
        l.d     $f20,myfunc_frmsz-24(sp)
        lw      s0,myfunc_frmsz-12(sp)
        lw      ra,myfunc_frmsz-8(sp)
        addu    sp,myfunc_frmsz
        jr      ra
END(myfunc)

Analyzing the above example, one step at a time:
#
# myfunc (arg1, arg2, arg3, arg4, arg5)
#

The function myfunc expects five arguments: on entry the first four of
these will be in registers a0 – a3, and the fifth will be at sp+16.
# framesize = locals + regsave (ra,s0) + pad + fregsave (f20/21)
#             + args + pad
myfunc_frmsz = 4 + 8 + 4 + 8 + 20 + 4

The total frame size is calculated as follows:
• locals (4 bytes): keep one local variable on the stack, rather than in a
register; the example may need to pass the address of the variable to
another function.
• regsave (8 bytes): save the return address register ra, because this
function calls another function; this function also plans to use the
callee-saved register s0.
• pad (4 bytes): the rules say that double precision floating-point values
must be 8-byte aligned, so add one word of padding to align the stack.
• fregsave (8 bytes): the function plans to use $f20, which is one of the
callee-saved floating-point registers.
• argsize (20 bytes): this function is going to call another function
which needs five argument words; this size must never be less than
16 bytes if a nested function is called, even if it takes no arguments.
• pad (4 bytes): the rules say that the stack pointer must always be
8-byte aligned, so add another word of padding to align it.
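The frame-size arithmetic in the list above can be sketched in C. This is only an illustration under stated assumptions: the struct and function names (frame_layout, frame_size, fregsave_offset) are hypothetical, and the placement of the two pads simply follows the two worked examples in this section rather than any rule quoted from a toolchain.

```c
#include <assert.h>

/* Sizes in bytes of each frame component (hypothetical helper type). */
struct frame_layout {
    unsigned locals;    /* automatic variables kept on the stack */
    unsigned regsave;   /* ra plus callee-saved integer registers */
    unsigned fregsave;  /* callee-saved f.p. registers, 8 bytes per pair */
    unsigned args;      /* outgoing argument build area */
};

/* Round n up to the next multiple of 8. */
static unsigned round_up8(unsigned n)
{
    return (n + 7u) & ~7u;
}

/* The argument build area is never smaller than 16 bytes. */
static unsigned arg_area(const struct frame_layout *f)
{
    return f->args < 16u ? 16u : f->args;
}

/* Total frame size, with one pad above the f.p. save area and one
 * pad above the argument build area, as in the example. */
unsigned frame_size(const struct frame_layout *f)
{
    assert(f->fregsave % 8u == 0);
    return round_up8(f->locals + f->regsave)
         + f->fregsave
         + round_up8(arg_area(f));
}

/* Offset from the running sp to the bottom of the f.p. save area. */
unsigned fregsave_offset(const struct frame_layout *f)
{
    return round_up8(arg_area(f));
}
```

With locals = 4, regsave = 8, fregsave = 8 and args = 20 this gives the 48-byte frame of the example, with $f20 stored at offset 24, which is myfunc_frmsz-24.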
NESTED(myfunc, myfunc_frmsz, zero)
        subu    sp,myfunc_frmsz

In the MIPS Corp. toolchain this would be expanded to:
        .globl  myfunc
        .ent    myfunc,0
        .frame  $29,myfunc_frmsz,$0
        subu    $29,myfunc_frmsz

This declares the start of the function, and makes it globally accessible.
The .frame directive tells the debugger the size of stack frame to be
created, and finally the subu instruction creates the stack frame itself.


.mask 0x80010000, -4
sw ra,myfunc_frmsz-8(sp)
sw s0,myfunc_frmsz-12(sp)

The function must save the return address and any callee-saved integer
registers used, in the stack frame. The .mask directive tells the debugger
which registers will be saved ($31 and $16, i.e. ra and s0), and the offset
from the top of the stack frame to the top of the save area: this
corresponds to regoffs. The sw instructions then save the registers: the
higher the register number, the higher up the stack it is placed (i.e. the
registers are saved in order).
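The bitmap operand of .mask and .fmask has bit n set when register n is saved. A minimal sketch in C of how such a value is formed (the function name save_mask is hypothetical, used only for illustration):

```c
#include <stddef.h>
#include <stdint.h>

/* Build a .mask/.fmask style bitmap: bit n is set for each saved
 * register number n (0..31). */
uint32_t save_mask(const unsigned *regs, size_t count)
{
    uint32_t mask = 0;
    size_t i;

    for (i = 0; i < count; i++)
        mask |= (uint32_t)1 << (regs[i] & 31u);
    return mask;
}
```

Passing the register numbers 31 and 16 (ra and s0) yields 0x80010000, and passing 20 and 21 yields the 0x00300000 used with .fmask for $f20/$f21.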
        .fmask  0x00300000, -16
        s.d     $f20,myfunc_frmsz-24(sp)

The code then does the same thing for the callee-saved floating-point
registers $f20 and (implicitly) $f21. The .fmask offset corresponds to
fregoffs, i.e. local variable area + integer register save area + padding word.
        # local = otherfunc (arg5, arg2, arg3, arg4, arg1)
        sw      a0,16(sp)               # arg5 (out) = arg1 (in)
        lw      a0,myfunc_frmsz+16(sp)  # arg1 (out) = arg5 (in)
        jal     otherfunc

This program calls the function otherfunc. Its arguments 2 to 4 are the
same as this program’s arguments 2 to 4, so these can pass straight
through without being moved. However, the code must swap argument 5
and argument 1, so it copies:
• its input arg1 (in register a0) to the arg5 position in the outgoing
argument build area (new sp + 16).
• its input arg5 (at old sp + 16) to outgoing argument 1 (register a0).
        sw      v0,myfunc_frmsz-4(sp)   # local = result

The return value from otherfunc is stored in the local (automatic)
variable, which is allocated the top 4 bytes of the stack frame.
        l.d     $f20,myfunc_frmsz-24(sp)
        lw      s0,myfunc_frmsz-12(sp)
        lw      ra,myfunc_frmsz-8(sp)
        addu    sp,myfunc_frmsz
        jr      ra
END(myfunc)

Finally the function epilogue reverses the prologue operations: restores
the floating-point, integer and return address registers; pops the stack
frame; and returns.
Functions needing run-time computed stack locations
In some languages dynamic variables can be created whose size varies
at run-time. Some C compilers support this, by using the useful library
function alloca. This means that sp has been lowered by an amount
unknown at compile time, so the compiler can’t use it to reach stack
locations.
In this case the function prologue grabs another register, s8, also known
as fp, and points it to the post-prologue value of sp.
Since fp is one of the saved registers, the prologue must also save its old
value. In the function body, all stack location references to automatic
variables, and saved-register positions are made via fp. But when calling
other functions, and putting data into the argument structure, that will be
done with relation to sp.


Assembler buffs may enjoy the observation that, when creating space
with alloca the address returned is actually a bit higher than sp, since the
compiler has still reserved space for the largest argument structure
required by any function call.
This example is a slightly modified version of the function used in the
last section, with the addition of a ‘‘call’’ to alloca.
#include
#include
#
# myfunc (arg1, arg2, arg3, arg4, arg5)
#
# framesize = locals + regsave (ra,s8,s0) + fregsave (f20/21)
#             + args + pad
myfunc_frmsz = 4 + 12 + 8 + (5 * 4) + 4
        .globl  myfunc
        .ent    myfunc,0
        .frame  fp,myfunc_frmsz,$0
        subu    sp,myfunc_frmsz
        .mask   0xc0010000, -4
        sw      ra,myfunc_frmsz-8(sp)
        sw      fp,myfunc_frmsz-12(sp)
        sw      s0,myfunc_frmsz-16(sp)
        .fmask  0x00300000, -16
        s.d     $f20,myfunc_frmsz-24(sp)
        move    fp,sp                   # save bottom of fixed frame
        ...
        # t6 = alloca (t5)
        addu    t5,7                    # make sure that size
        and     t5,~7                   # is multiple of 8
        subu    sp,t5                   # allocate stack
        addu    t6,sp,20                # leave room for args
        ...

        # local = otherfunc (arg5, arg2, arg3, arg4, arg1)
        sw      a0,16(sp)               # arg5 (out) = arg1 (in)
        lw      a0,myfunc_frmsz+16(fp)  # arg1 (out) = arg5 (in)
        jal     otherfunc
        sw      v0,myfunc_frmsz-4(fp)   # local = result
        ...
        move    sp,fp                   # restore stack pointer
        l.d     $f20,myfunc_frmsz-24(sp)
        lw      s0,myfunc_frmsz-16(sp)
        lw      fp,myfunc_frmsz-12(sp)
        lw      ra,myfunc_frmsz-8(sp)
        addu    sp,myfunc_frmsz
        jr      ra
END(myfunc)

There are a few notable differences from the previous example:
        .globl  myfunc
        .ent    myfunc,0
        .frame  fp,myfunc_frmsz,$0

The function can’t use the NESTED macro any more, since it is using a
separate frame pointer which must be explicitly declared using the .frame
directive.
        .mask   0xc0010000, -4
        sw      ra,myfunc_frmsz-8(sp)
        sw      fp,myfunc_frmsz-12(sp)
        sw      s0,myfunc_frmsz-16(sp)

Since the program will modify fp (= s8 = $30), it must save it in the
stackframe too.
        # t6 = alloca (t5)
        addu    t5,7                    # make sure that size
        and     t5,~7                   # is multiple of 8
        subu    sp,t5                   # allocate stack
        addu    t6,sp,20                # leave room for args

This sequence allocates a variable number of bytes on the stack, and
sets a register (t6) to point to it. The program must make sure that the size
is rounded up to a multiple of 8, so that the stack stays correctly aligned.
In addition, the address it returns must be 20 bytes above the new stack
pointer, to leave room for the five argument words that will be used in
future calls.
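The arithmetic of the alloca sequence can be modelled in C as a quick check. This is a sketch only: the names round_up8, alloca_result and model_alloca are hypothetical, and addresses are treated as plain 32-bit integers rather than real pointers.

```c
#include <stdint.h>

/* Round a byte count up to a multiple of 8, as done by the
 * "addu t5,7" / "and t5,~7" pair. */
uint32_t round_up8(uint32_t n)
{
    return (n + 7u) & ~(uint32_t)7;
}

/* Result of the modelled sequence: the new sp and the block address. */
struct alloca_result {
    uint32_t sp;      /* stack pointer after "subu sp,t5" */
    uint32_t block;   /* value left in t6 by "addu t6,sp,20" */
};

/* Model the sequence for a given sp, request size, and size of the
 * outgoing argument area (20 bytes in the example). */
struct alloca_result model_alloca(uint32_t sp, uint32_t size,
                                  uint32_t argbytes)
{
    struct alloca_result r;

    r.sp = sp - round_up8(size);
    r.block = r.sp + argbytes;
    return r;
}
```

For example, a 13-byte request with sp at 0x1000 lowers sp by 16 to 0xff0 and returns 0x1004, so the stack pointer itself remains 8-byte aligned.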
        sw      a0,16(sp)               # arg5 (out) = arg1 (in)
        lw      a0,myfunc_frmsz+16(fp)  # arg1 (out) = arg5 (in)
        jal     otherfunc
        sw      v0,myfunc_frmsz-4(fp)   # local = result

When building another function’s arguments, use the sp register; but
when accessing input arguments or local variables the program must use
the fp register.
        move    sp,fp                   # restore stack pointer
        l.d     $f20,myfunc_frmsz-24(sp)
        lw      s0,myfunc_frmsz-16(sp)
        lw      fp,myfunc_frmsz-12(sp)

Finally, at the start of the function epilogue, restore the stack pointer to
its post-prologue position, and then restore the registers (not forgetting to
restore the old value of fp, of course).

SHARED AND NON-SHARED LIBRARIES
A C object library is a collection of pre-compiled modules, which are
automatically linked into a program’s binary when it refers to a function or
variable whose name is defined in the module. Many standard C functions
like printf are defined in libraries.
Libraries provide a simple and powerful way of extending the language;
but in a multi-tasking OS every program will carry its own copy of the
library function. Modern library functions may be huge; for example the
graphics interface libraries to the widely-used X window system add about
300Kbytes to the size of a MIPS object, dwarfing the application code of
many simpler programs.
In response to this problem most modern operating systems provide some
way in which library code may be shared between different applications.
There are different approaches:

Sharing code in single-address space systems
In a single address-space OS like VxWorks†, programs can be linked to
library functions by deferring the link operation (which actually fixes up
the program code) until the program is loaded into system memory. In this
kind of system the library function becomes part of a single large program.
But:
• The libraries must be written to be ‘‘re-entrant’’; they may be used by
different tasks, and one task may be suspended in the middle of a
library function and that function re-used by another.
† VxWorks is a trademark of Wind River Systems, Inc.

For simple operations, re-entrancy is easily achieved by avoiding any
use of static modifiable data (so that all computation is done on the
stack and in machine registers). However, where library functions
must maintain internal data, life gets much more complicated;
accesses to shared variables must use the programming technique of
critical regions protected by semaphores.
This does mean that library programmers must respect these rules,
and can’t just recompile existing code into libraries without
modification.
• The run-time system must maintain a symbol table for loading.
System utilities such as the debugger also need access to the symbol
table and relocation information.
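The difference between a function that keeps static modifiable data and a re-entrant equivalent can be sketched in C. Both function names are hypothetical, and the multiplier and increment are just the familiar constants from the sample rand implementation in the C standard; nothing here is taken from a particular library.

```c
#include <stdint.h>

/* NOT re-entrant: the state is static modifiable data, shared by
 * every task that calls the function. */
static uint32_t hidden_seed = 1;

uint32_t next_random_shared(void)
{
    hidden_seed = hidden_seed * 1103515245u + 12345u;
    return hidden_seed;
}

/* Re-entrant: all state lives in storage supplied by the caller,
 * so each task can keep its own seed on its stack. */
uint32_t next_random_r(uint32_t *seed)
{
    *seed = *seed * 1103515245u + 12345u;
    return *seed;
}
```

Two tasks that each pass their own seed to next_random_r see the same sequence regardless of how their calls interleave, whereas calls to next_random_shared from different tasks disturb one another unless they are protected by a critical region.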
In such a system a little extra work at load time allows a single copy of
a library function to be freely used by the OS kernel, drivers and any
number of application tasks. Simple functions suffer very little run-time
overhead, although the convenient gp-relative addressing optimization
(described in the last chapter) cannot be used; the critical region
overhead for shared data is unavoidable.

Sharing code across address spaces
In a ‘‘protected’’ OS where separate applications run in separate virtual
address spaces, the problems are quite different. This section will outline
the way in which Unix-like systems conforming to the MIPS/ABI standard
provide libraries which can be shared between different applications, with
no restriction on how the libraries and applications can be programmed.
Every MIPS/ABI application runs in its own virtual address space. The
application code is fixed to particular locations in this address space when
it is linked. Library code is not built in; the application carries a table of
the names of library functions and variables which are used, but not yet
included. In addition, the application’s symbol table defines public items
which may be called from the library; under MIPS/ABI, library routines
may freely refer to public data, or call public functions, in application
code†.
In the MIPS/ABI model the binary application code must not be
modified; it may itself be shared by multiple invocations of the application
by multiple users.
It is not possible to predefine the actual virtual addresses at which a
library’s code and data will be located, but the offset from the start of its
code to the start of its data is fixed, and this permits a number of tricks to
be used.
• Position-independent code: the compiler and assembler (by a
command line option, used for library functions) can generate fully
‘‘position independent code’’ (PIC). All MIPS branch instructions are
PC-relative; somewhat more complex sequences must be used to load
a PC-relative address into a register, but if necessary it can be done:
la rd, label  →         bgezal  $zero, 1f
                        nop
                1:      addu    rd, $31, label - 1b

• Indirection and the Global offset table: PIC is suitable for references to
code within a single module of a library (because the module’s code is
loaded as a single entity into consecutive virtual addresses). Data, or
external functions, will be at locations which cannot be determined
until the application and library are loaded, and so their addresses
cannot be embedded in the program text.

† Though this may not be good programming practice.

Such addresses are held in a table built in each library’s per-process
data space, the ‘‘global offset table’’ (GOT). Since the data
space is not shared and is writable, the table can be built as the
application and its libraries are loaded.
A library function refers to a variable or external function through the
GOT at a table index fixed when the library was compiled and linked.
A load of the external integer type ‘‘errno’’ will come out as:
lw rd, errno  →         la      gp, ThisLibsGOTBase
                        lw      rd, errno_offset(gp)
                        nop
                        lw      rd, 0(rd)

Similarly, invocation of the shared-library function exit() would
look like this:
/* setup argument */
jal exit      →         la      gp, ThisLibsGOTBase
                        lw      t9, exit_offset(gp)
                        nop
                        jalr    t9

The register gp (or $28 ) is a good choice for the table base. Because
of its role in providing fast access to short variables it is not modified
by standard functions. As an optimization it is calculated only once
per function, in the function prologue. The calculation uses the fact
that the function’s actual virtual address will be in t9 (see previous
example), and that the library’s GOT is at a fixed offset from its code.
So a position-independent function prologue might start like this:
func:
        la      gp, _gp_disp
        addu    gp, gp, t9
        subu    sp, sp, framesize
        sw      gp, 32(sp)

In the above example, _gp_disp is a magic symbol which is
recognized by the linker when building a shared library: its value will
be the offset between the instruction and the GOT. The calculated
value is saved on the stack, and must be restored from there after a
call to an external function, since that function may itself have
modified gp.
There is much more that could be said about the way in which the
MIPS/ABI implementation is optimized. For example, no attempt is made
to link in libraries when an application is first loaded into memory; dummy
GOT entries are used instead. When and if the application uses a library
module, the reference is caught and fixed up in much the same way as a
virtual-memory system incrementally pages-in a program image.
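The idea of a dummy table entry that is fixed up on first use can be sketched in portable C with a table of function pointers. This is a toy model, not the MIPS/ABI mechanism itself: the one-slot table, the stub and the stand-in library function real_double are all hypothetical names introduced for illustration.

```c
#include <stddef.h>

typedef int (*got_fn)(int);

/* Stand-in for a function living in a shared library. */
static int real_double(int x)
{
    return 2 * x;
}

static int resolve_count;           /* how many fix-ups have happened */
static int lazy_stub(int x);

/* One-slot "GOT": it starts out pointing at the dummy entry. */
static got_fn got[1] = { lazy_stub };

/* Dummy entry: catch the first reference, fix up the table, and
 * then complete the call that triggered the fix-up. */
static int lazy_stub(int x)
{
    resolve_count++;
    got[0] = real_double;
    return got[0](x);
}

/* Every call goes indirectly through the table, as with jalr t9. */
int call_through_got(int x)
{
    return got[0](x);
}

int times_resolved(void)
{
    return resolve_count;
}
```

The first call pays for the fix-up; later calls go straight to the real function, and a function that is never called is never resolved at all.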

AN INTRODUCTION TO OPTIMIZATION
The compiler writer’s first responsibility is to ensure that the generated
code does precisely what the language semantics say it should; and that is
hard enough. In modern compilers, the optimizer has a secondary
purpose, which is to allow the compiler’s basic code generator to be simple
(and therefore easier to implement correctly).

Common optimizations
Most compilers will do all of the following. Occasionally the assembler
may get in on some of them too.


• Common sub-expression elimination (CSE): this detects when the code
is doing the same work twice. At first sight this looks like it is just
making up for dumb programming; but in fact CSE is critically
important, and tends to be run many times to tidy up after other
stages:
a) It is CSE which gives the compiler the ability to optimize across
the function. The basic code generator works through the
program expression-by-expression; even for well-written source
code, the expansion of simple C statements into multiple MIPS
instructions will lead to a lot of duplicated effort. The very first
CSE pass factors out the duplication and clears the way for
register allocation.
b) Most memory-reference optimization is actually done by CSE –
the code which fetches a variable from memory is itself a
sub-expression.
The enemy of CSE is unpredictable flow of control: the
conditional branch. Once code turns into spaghetti, the compiler
finds it difficult to know what computation has run before which;
with some straightforward exceptions, CSE can really only
operate inside basic blocks (a piece of code delimited by, but not
containing, either an entry point or a conditional branch). CSE
markedly improves both code density and run-time
performance.
Similar to CSE are the optimizations of constant folding, constant
propagation and arithmetic simplification. These pre-compute
arithmetic performed on constants, and modify other expressions
using standard algebraic rules so as to permit further constant
folding and better CSE.
• Jump optimization: removes redundant tests and jumps. Code
produced by earlier compiler stages often contains jumps to jumps,
jumps around unreachable code, redundant conditional jumps, and
so on. These optimizations will remove this redundancy.
• Strength reduction: means the replacement of computationally
expensive operations by cheaper ones. For example, multiplication by
a constant value can be replaced by a series of shifts and adds. This
actually tends to increase the code size while reducing run-time.
• Loop optimization: studies loops in the code, starting with the inner
ones (which, the compiler guesses, will be where most time is spent).
There are a number of useful things which can be done:
a) Sub-expressions which depend on variables which don’t change
inside the loop can be pre-computed before the loop starts.
b) Expressions which depend in some simple way on a loop variable
can be simplified. For example, in:
int i, x[NX];
for (i = 0; i
main (int argc, char **argv)
{
printf ("hello world!\n");
return (0);
}

Memory map
A simple stand-alone program will usually have all of memory to itself,
except for a small amount at the bottom (and possibly the top) which is
reserved for use by the PROM monitor.
In such an environment, the programmer will not have to worry about
virtual memory: the program can be linked to run in the cacheable kseg0
address region or, to see the program with a logic analyzer, in the
uncacheable kseg1 region. These regions map one-to-one with physical
memory.
A typical base address for the program code would be 0x80020000 (i.e.
at offset 0x20000 in the KSEG0 region). This leaves 128 Kbytes free for the
PROM monitor’s own data and stack, which is enough for IDT/sim. Above
this will come the program’s initialized data, then BSS (uninitialized data),
followed by its ‘‘heap’’ (free memory for use by malloc et al ). The PROM
monitor will usually put the stack pointer near the top of memory, and the
stack and heap will grow towards each other.
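Because kseg0 and kseg1 both map one-to-one onto the low 512 Mbytes of physical memory, converting between them is simple bit manipulation. A minimal sketch in C follows; the macro and function names are hypothetical, and addresses are treated as 32-bit integers.

```c
#include <stdint.h>

#define KSEG0_BASE 0x80000000u   /* cacheable, unmapped */
#define KSEG1_BASE 0xa0000000u   /* uncacheable, unmapped */
#define PHYS_MASK  0x1fffffffu   /* low 512 Mbytes of physical space */

/* Cacheable kseg0 address for a physical address. */
uint32_t phys_to_kseg0(uint32_t pa)
{
    return (pa & PHYS_MASK) | KSEG0_BASE;
}

/* Uncacheable kseg1 address for a physical address. */
uint32_t phys_to_kseg1(uint32_t pa)
{
    return (pa & PHYS_MASK) | KSEG1_BASE;
}

/* Physical address behind a kseg0 or kseg1 address. */
uint32_t kseg_to_phys(uint32_t va)
{
    return va & PHYS_MASK;
}
```

For instance, physical offset 0x20000 appears at 0x80020000 in kseg0 and at 0xa0020000 in kseg1.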

Starting up
Once the program has been downloaded to the evaluation board and the
PROM monitor has been told to start it, the monitor will set the stack
pointer to the top of memory and jump to the program’s entrypoint. The
entrypoint is often defined by a label with a standard name (e.g. _start),
or is simply the first address in the program.
The code following the entrypoint has to ensure that the run-time
environment required for a C program and library is set up. For a
downloaded program this is usually a simple matter of zeroing the BSS
segment, and initializing the $gp register and stack. It should then call the
program’s main function, after ensuring that its argc , argv and envp
arguments are initialized. If main returns, then its return value is passed
to the exit function, which will close open files and in turn call _exit. The
_exit function should transfer control back to the PROM monitor (the
exact manner in which this is done is system or tool dependent)†.
The following code fragment shows how a start-up module might be
implemented; it is commonly provided as part of the development system.

† The above functionality is provided by the “idt_csu.S” program
provided with IDT/c.
15–1

CHAPTER 15

SOFTWARE DESIGN EXAMPLES

        .comm   environ,4

        .data
#define ARGC 1
argv0:  .asciiz "prog"
argvec: .word   argv0, 0
envvec: .word   0

        .text
LEAF(_start)
        /* initialize $gp */
        la      gp,_gp

        /* clear the BSS */
        la      t0,_fbss
        la      t1,end
1:      sw      zero,0(t0)
        addu    t0,4
        bltu    t0,t1,1b

        /* make sure stack is in same KSEG as .data */
        and     t0,sp,0x1fffffff        # get stack physical address
        and     t1,~0x1fffffff          # get KSEG of "end"
        or      sp,t0,t1                # put sp in same KSEG

        /* align to 8 byte boundary and allocate an argsave area */
        and     sp,~7
        subu    sp,16

        /* initialize argc, argv, and environ (IDT/sim zeroes a0-a2) */
        li      a0,ARGC                 # dummy argc
        la      a1,argvec               # dummy argv
        la      a2,envvec               # dummy envp
        sw      a2,environ

        /* exit (main (argc, argv, environ)) */
        jal     main
        move    a0,v0
        jal     exit

        /* in case exit returns */
1:      break   1
        b       1b
END(_start)

LEAF(_exit)
        li      ra,0xbfc00000+(17*8)    # IDT/sim prom return vector
        j       ra
END(_exit)
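The stack-pointer adjustment in the start-up code is easy to get wrong, so it may help to see the same arithmetic written out in C. This is a model only, with addresses as 32-bit integers; the function name startup_sp and its parameter names are hypothetical.

```c
#include <stdint.h>

/* Model of the start-up stack fix: move sp into the same KSEG as
 * the program's data, align it to 8 bytes, then reserve the 16-byte
 * argument save area that main() is entitled to expect. */
uint32_t startup_sp(uint32_t sp, uint32_t end_addr)
{
    uint32_t phys = sp & 0x1fffffffu;           /* stack physical address */
    uint32_t kseg = end_addr & ~0x1fffffffu;    /* KSEG of "end" */

    sp = phys | kseg;       /* put sp in same KSEG */
    sp &= ~(uint32_t)7;     /* align to 8 byte boundary */
    sp -= 16;               /* allocate an argsave area */
    return sp;
}
```

If the monitor hands over sp = 0xa00ffffc (kseg1) and the program is linked in kseg0 with end at 0x80030000, the result is 0x800fffe8: the same physical stack, now reached through the cacheable region.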

C Library functions
Many C application programs will expect to have access to a C library
which conforms to the ANSI definition, as described in [reference K&R].
Most development systems will supply a library that conforms to at least
some parts of this standard. The rest of this section follows Appendix B of
[reference K&R] to warn the programmer about those areas where some
cross-development system libraries may deviate from the standard – refer
to the toolchain documentation for specific information.


Input and output
The header file is almost certain to be present, but the
library will often provide only a small subset of the expected standard i/o
facilities. In particular it will usually provide access to only a single console
device via stdin and stdout, with no file i/o. Some systems may provide
remote file access facilities, but this is often via a distinct set of
non-standard function calls†.
• File operations : are unlikely to be present, and if they are will usually
support only the system console device.
• Formatted output : the printf functions will usually be present, but
may omit some of the newer ANSI formatting options, and may not
support floating-point formats.
• Formatted input : the scanf functions are often absent.
• Character input and output : usually provided, but often only to the
system console.
• Direct input and output : sometimes provided, but often only to the
system console or serial I/O ports.
• File positioning : probably absent.
• Error handling : probably absent.
Character class tests
The header file and its associated functions and/or macros
are usually provided. The isxdigit function is sometimes absent or has
a different name.
String functions
The older string functions are usually present, although often not very
optimized. Some of the newer string functions such as strspn, strcspn,
strpbrk, strstr, strerror and strtok may be absent.
The mem... functions are sometimes absent, and in their place the older
bcopy, bcmp and bzero functions may be provided.
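Where only the older routines exist, thin wrappers can supply the ANSI names. The sketch below is self-contained, so it carries its own stand-ins (old_bcopy, old_bzero) for the library routines; on a real target the wrappers would call the toolchain's bcopy and bzero instead. All names here are hypothetical. Note that bcopy takes its source argument first, the reverse of memcpy.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the older bcopy: source first, overlap-safe. */
static void old_bcopy(const void *src, void *dst, size_t n)
{
    const unsigned char *s = src;
    unsigned char *d = dst;

    if ((uintptr_t)d < (uintptr_t)s) {
        while (n--)
            *d++ = *s++;        /* copy forwards */
    } else {
        s += n;
        d += n;
        while (n--)
            *--d = *--s;        /* copy backwards for overlap */
    }
}

/* Stand-in for the older bzero. */
static void old_bzero(void *dst, size_t n)
{
    unsigned char *d = dst;

    while (n--)
        *d++ = 0;
}

/* ANSI-style wrappers: destination first, and return it. */
void *shim_memmove(void *dst, const void *src, size_t n)
{
    old_bcopy(src, dst, n);
    return dst;
}

void *shim_memcpy(void *dst, const void *src, size_t n)
{
    return shim_memmove(dst, src, n);
}

/* bzero can only stand in for memset when the fill value is zero. */
void *shim_memzero(void *dst, size_t n)
{
    old_bzero(dst, n);
    return dst;
}
```

A general memset with a non-zero fill value, and memcmp's ordering result, cannot be built from bzero and bcmp and would need their own loops.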
Mathematical functions
The mathematical functions, if provided at all, will often be in a separate
maths library. If this library is supplied, it will probably implement all of
the required functions. Note that it may be impossible, or tricky, to run
floating-point code on CPUs which do not have an on-chip FPA. (Even if the
CPU does have one, the system may still need a trap handler for serious
floating-point use.) Some compilers can be instructed to implement
floating-point operations by making calls to an emulation library.
Utility functions
The strto... functions are sometimes absent, but the older atoi and
atol will usually be available. The floating point conversions may be
absent.
The following functions are often absent: rand, srand, atexit, system,
getenv, bsearch, qsort, labs, div and ldiv.
The malloc family will probably exist in some form, although realloc
is sometimes absent. At the lowest level they will probably call the sbrk
function to acquire memory from the system, which the programmer may
be required to implement. A simple sbrk will just return consecutive
chunks of memory starting from &end (i.e. just after the program’s declared
data), until it reaches somewhere near the bottom of the stack, as follows:

† The IDT/c toolchain does provide many of these and other
referenced functions. The programmer should consult the
reference manuals for a particular toolchain to determine which
functions are supported.
#include <errno.h>              /* errno, ENOMEM */
#include <string.h>             /* memset */

extern char end[];
static char *curbrk = end;      /* char *, so that pointer arithmetic */
static char *maxbrk = 0;        /* is in bytes */

#define MAXSTACK (64 * 1024)

void *
sbrk (int n)
{
    char *p;

    /* calculate limit for curbrk on first call */
    if (!maxbrk)
        maxbrk = (char *)&n - MAXSTACK; /* &n is approx value of $sp */

    /* check that there is room for this request */
    if (curbrk + n > maxbrk) {
        /* no room */
        errno = ENOMEM;
        return (void *)-1;
    }

    /* zero the requested region */
    memset (curbrk, 0, n);

    /* advance curbrk past region and return pointer to it */
    p = curbrk;
    curbrk += n;
    return p;
}
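To show how the malloc family might sit on top of such a function, here is a host-runnable sketch of a trivial "bump" allocator. It is an illustration under stated assumptions, not the allocator of any real library: sim_sbrk is a hypothetical stand-in that hands out memory from a small static arena so the example can run anywhere, and bump_alloc never frees.

```c
#include <stddef.h>
#include <stdint.h>

#define ARENA_BYTES 256

/* Static arena standing in for the memory between "end" and the
 * stack; uint64_t elements keep it 8-byte aligned. */
static uint64_t arena[ARENA_BYTES / 8];
static size_t arena_used;

/* Hypothetical stand-in for sbrk: consecutive chunks, or
 * (void *)-1 when the arena is exhausted. */
static void *sim_sbrk(size_t n)
{
    void *p;

    if (n > ARENA_BYTES - arena_used)
        return (void *)-1;
    p = (unsigned char *)arena + arena_used;
    arena_used += n;
    return p;
}

/* Minimal allocator: round each request up to 8 bytes so every
 * block is suitably aligned, then take it from sim_sbrk. */
void *bump_alloc(size_t n)
{
    size_t rounded = (n + 7u) & ~(size_t)7;
    void *p;

    if (rounded < n)            /* size overflowed while rounding */
        return NULL;
    p = sim_sbrk(rounded);
    return p == (void *)-1 ? NULL : p;
}
```

A real malloc would also keep block headers and a free list so that memory can be returned and reused; this sketch only shows the division of labour between the allocator and sbrk.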

Diagnostics
The assert macro is often absent.
Variable argument lists
Variable arguments are usually supported, but sometimes only via the
old varargs mechanism rather than the newer ANSI stdarg.
Non-local jumps
The setjmp and longjmp functions are usually supplied.
Signals
It is unlikely that the signal functions will be supported, although
sometimes a limited form is provided in order to support SIGINT only.
Date and time
It is likely that none of these functions will be available. Timing
benchmarks will often require a stop-watch, or some software mechanism
which is very specific to the PROM monitor and/or development system†.

Running the program
Having typed in the ‘‘hello world’’ program, the programmer must then
compile it, link it, and convert it into a form suitable for downloading to an
evaluation board. This process is very dependent on the particular

† IDT typically provides timer utilities as part of a utility disk
provided with an evaluation board, and also with the IDT/c
toolchain. These utilities are often board specific, since they rely
on an underlying hardware timer mechanism.

development system, which should provide some sort of automated
mechanism: many UNIX-hosted toolchains provide a set of makefiles
which control the whole process, via the well-known make utility.
When the compilation has completed successfully, a down-loadable file
is created (typically using S-records or other standard format).
Downloading this file will require the use of a terminal emulator (in IDT/
sim, use the “load” command on the board, and the “cp” command on the
host), or some other special utility, to transmit the file down an RS232 line
to the board. More advanced evaluation boards may provide an Ethernet
or parallel interface in order to download large programs at high speed.
Finally, it is then only necessary to instruct the PROM monitor to execute
the program.
So a complete edit, compile, download and run cycle on a SUN platform
using IDT/c might look like this:
On development system:
C> cd /idt/samples          change to source code directory
C> vi hello.c               enter/edit the source file
C> cp MakeBE Makefile       create the makefile from the template
C> vi Makefile              change “stanford” in template to “hello”
C> make                     compile and link for IDT 79RS385 board;
                            this creates a “hellof.srec” file

On eval board’s console:
IDT>> l -a tty1             srec download via RS232 port #1

On development system:
C> cat hellof.srec > /dev/ttyb      download via ttyb port

On eval board’s console:
IDT>> go                    start the program

Debugging the program
Hopefully not too much can go wrong with ‘‘hello world’’, but larger
application programs may require some debugging before they work.
Most PROM monitors, including IDT/sim, incorporate a command-line
driven, machine-level debugger. This will allow the programmer to
disassemble the code, examine registers and memory, set breakpoints and
single-step through code one machine-instruction at a time.
Source-level debuggers allow the programmer to work in terms of the
original source code and data structures instead of MIPS machine
instructions. These debuggers run on the host development system – so
that they can get at the source files and compiler-generated debugging
information. They operate the program on the evaluation board by ‘‘remote
control’’, via a serial line or network connection. Many PROM monitors will
incorporate a special protocol to support this feature, although some may
require that the code for it be downloaded along with the program.
Source-level debuggers may themselves be command-line driven (e.g.
MIPS dbx and IDT’s/GNU’s gdb), or may offer a multi-window, GUI
interface. In all cases they are very complex programs, with many different
commands and options. The development system’s documentation should
provide more details of how to use them with a target board.

EMBEDDED SYSTEM SOFTWARE
Many aspects of ‘‘embedded software’’ are the same as ‘‘application
software’’, and its early development may be carried out in exactly the
same way, on an evaluation board. But ultimately it is likely to be running
in EPROM, on custom hardware, and require lower-level access to the
processor in order to initialize it, test it, and handle machine traps and
interrupts.


Memory map
Compared to a program which is downloaded into RAM, embedded
software will (at least initially) have its code and read-only data in EPROM.
The EPROM, and thus the code, should be located at physical address
0x1fc00000, which corresponds to the processor’s reset vector of
0xbfc00000. The data area should probably be located near the bottom of
RAM (DRAM or SRAM), but just above the area used for the processor’s
(non-boot) exception vectors: 0x400 should be safe for all existing
RISController processors. Device registers should be decoded at high
physical addresses, but below 512 Mbytes. If the hardware engineer suggests
putting RAM at anywhere other than zero, or device registers anywhere
outside of the bottom 512 Mbytes, then complain loudly: it will make
software much more complicated, and performance may suffer.

Starting up
After a hardware reset, code will be running in KSEG1 (i.e. uncached),
with the caches, TLB (if present), internal registers, and RAM in an
undefined state. Its first job is to initialize these resources. A detailed
discussion and example of this can be found in earlier chapters of this
manual.
For higher performance, code will need to be located in the cacheable
KSEG0 region (i.e. at 0x9fc00000), rather than the uncached KSEG1
(0xbfc00000). This has implications for start-up code. Before the caches
are initialized, branches and absolute jumps (i.e. j and jal) are safe,
because they do not alter the top four bits of the program counter, but any
reference to data, or an attempt to take the address of a function for use
with jr or jalr, will generate a KSEG0 address before it is valid to do so. The
programmer must take care that any such references are explicitly
mapped to KSEG1, by logically or-ing in the KSEG1 base address (i.e.
0xa0000000). Once the caches are initialized, switch to running cached
by use of an explicit jr instruction, as follows:
/* switch to running cached, if so linked */
la t0,1f
jr t0
1:

Even running cached from EPROM will not give optimal performance,
since cache refill cycles from EPROM will be slower than from RAM. A
higher performance option is to link the code to run in RAM, and arrange
for the start-up code to copy itself and the rest of the software from ROM
to RAM. This is also useful when debugging the ROM, since it is not
possible to set breakpoints or single-step code which is in ROM. Note,
however, that this requires even more careful programming of the start-up
code. Even jumps cannot be used until the code has been moved: only
pc-relative branches are safe, and the bal instruction should be used in place
of jal (though beware its limited ±128 Kbyte range). Any attempt to access
data or take the address of a function must be relocated by explicitly
adding in the offset between the code in RAM and its temporary location in
EPROM. It is sensible to calculate this offset just once, and keep it in a
reserved register, such as $k1.
Another complication is initialized data. Initialized data can be declared
in assembler or C, e.g.
        .data
base:   .word   10

or

int base = 10;


The initialized data is writable, and so must be in RAM. But how does it
get there?
Some cross-development toolchains are not very helpful, and require
that all data must be either uninitialized, or if initialized then read-only.
Other toolchains provide various different mechanisms by which to
initialize this data. SDE-MIPS, for example, takes the straight-forward
step, when generating a ROM image, of placing a copy of the initialized
data segment (i.e. .data) at the next 16-byte boundary after the code. It is
then easy to copy this from ROM to its final location in RAM.
The following code fragment illustrates a flexible mechanism for
handling these different possibilities for moving code and data to RAM.
_reset_vec:
        b       _reset
        ...
_reset:
        move    k1,zero         # assume no relocation
        bal     1f              # ra := current pc
1:      la      t0,1b           # t0 := linked pc
        beq     t0,ra,2f        # when they match, then no reloc is
                                # correct

        /* executing at other than the linked address */
        li      k1,0xbfc00000   # k1 := actual EPROM base
        la      t0,_reset_vec   # t0 := linked EPROM base (may be RAM)
        subu    k1,t0           # k1 := reloc factor (actual – linked)
2:
        /* initialize CPU, RAM, caches, tlb & stack
           (hardware specific) */
        ...
        /* skip code move if it is linked for ROM */
        and     t0,k1,~0x20000000 # ignore simple KSEG1->KSEG0 reloc
        beqz    t0,3f
        /* copy code to linked address in RAM */
        la      a0,_ftext       # a0 := destination (RAM) address
        addu    a1,a0,k1        # a1 := source (ROM) address
        la      a2,etext        # a2 := code size (etext – _ftext)
        subu    a2,a0
        bal     memcpy
3:
        /* copy initialized data to RAM (SDE-MIPS specific) */
        la      a0,_fdata       # a0 := destination (RAM) address
        la      a1,etext        # a1 := source address (after ROM code)
        addu    a1,k1
        addu    a1,15           # round address up to 16-byte boundary
        and     a1,~15
        la      a2,edata        # a2 := data size (edata – _fdata)
        subu    a2,a0
        bal     memcpy
        /* jump to C start-up at linked address */
        la      t0,_start
        j       t0
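
The address arithmetic in the fragment above can be checked on a host machine with a small C model. This is a sketch only: the function names are invented for this example, and the addresses used to exercise it are hypothetical link addresses, not values taken from any real system.

```c
#include <stdint.h>

/* Models k1 in the fragment: actual EPROM base minus linked base,
 * computed modulo 2^32 exactly as subu does. */
static uint32_t reloc_factor(uint32_t actual_base, uint32_t linked_base)
{
    return actual_base - linked_base;
}

/* Nonzero when the code must be copied to RAM, i.e. when the
 * relocation is more than a simple KSEG1 -> KSEG0 difference. */
static int needs_code_copy(uint32_t reloc)
{
    return (reloc & ~0x20000000u) != 0;
}

/* ROM address of the initialized-data copy: the relocated end of the
 * code (etext), rounded up to the next 16-byte boundary. */
static uint32_t rom_data_source(uint32_t etext, uint32_t reloc)
{
    return (etext + reloc + 15u) & ~15u;
}
```

With the EPROM at 0xbfc00000 and an image linked at 0x9fc00000 the relocation factor is 0x20000000 and no copy is needed. With an image linked at a RAM address such as 0x80020000 the factor is 0x3fbe0000, and both the code and the data must be copied.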

Embedded system library functions
Embedded system software written in C or C++ will still need access to
the MIPS Coprocessor 0 registers and instructions in order to control
interrupts, catch exceptions, handle the caches and TLB, and so on. Some
cross-compiler vendors will supply a toolkit of low-level library routines to
do this, and sometimes it will include full source code. At a minimum such
a kit should include assembler functions which read and write each CPU
control register, initialize and update the TLB (if present), and initialize and
invalidate all or part of the caches. Unfortunately there are no standard
interfaces for these functions, and the programmer will have to read the
cross-development system’s documentation. The examples in this manual
could serve as a baseline reference for programmers who choose to
generate these functions themselves.†
Trap and interrupt handling
Beyond routines to manipulate the CPU control registers and caches,
the system software may need a mechanism by which to catch machine
exceptions (the generic name for traps and interrupts), and cause
appropriate C handler functions to be called. Vendor-supplied embedded
system toolkits probably contain some code to help with this, although it
is often at a very low level, and requires more work to interface to C-level
functions. SDE–MIPS includes some relatively high-level exception
handling code that allows the programmer to route different exceptions to
different C functions, and pass them a pointer to a structure which
contains the complete CPU context at the time of the exception.
Choices about stacks
An exception handler has several choices regarding its use of stacks:
1) Remain on the current stack, shared with the main, or current
application. This is usually adequate for simple, single-threaded
applications.
2) Have an exception stack, which it switches to upon receiving an
exception when not already at exception level. This avoids
overrunning an application’s stack, if it is small, and avoids
problems if the exception was caused by a bad stack
pointer value.
3) Have several exception stacks, one per ‘‘process’’. This is
essential in multi-processing applications.
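
As an illustration of the second choice, the decision of which stack to run on might be modelled as below. The structure and function names are hypothetical, and a real handler would make this decision in assembler before any C code can run; the sketch only shows the logic.

```c
#include <stdint.h>

/* Hypothetical bookkeeping: exception nesting depth and the top of a
 * dedicated exception stack. */
struct exc_state {
    unsigned  depth;          /* 0 = running application code */
    uintptr_t exc_stack_top;  /* initial sp for the exception stack */
};

/* Return the stack pointer the handler should use: switch to the
 * exception stack on the first exception, stay put when nested. */
static uintptr_t select_stack(struct exc_state *st, uintptr_t current_sp)
{
    uintptr_t sp = (st->depth == 0) ? st->exc_stack_top : current_sp;
    st->depth++;
    return sp;
}

/* Undo the bookkeeping on return from the exception. */
static void leave_exception(struct exc_state *st)
{
    st->depth--;
}
```

Because the application's stack pointer is never used once the exception stack has been selected, a corrupt application stack pointer cannot cause a second fault inside the handler.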
Simple interrupt routines
If any of the CPU’s six interrupt pins or two software interrupt bits are
active, and not masked by the CPU’s Status register, the CPU takes an
immediate Interrupt exception. Once the generic exception handler has
routed the exception to the specific Interrupt exception function, it is
this function’s responsibility to sort the interrupts into priority order and
dispatch to the correct device’s interrupt handler. The simplest technique
is to make interrupt priorities correspond directly to interrupt pin number,
allowing a simple bit-scan of the Cause register.
A very simple, fixed-priority interrupt handler might look something like
this:
extern void softclock(), softnet();
extern void diskintr();
extern void netintr();
extern void ttyintr();
extern void fpuintr();
extern void clkintr();
extern void dbgintr();

/* interrupt handler table */
void (*intrhand[8])() = {
    softclock,  /* [0] SInt0: clock */
    softnet,    /* [1] SInt1: network */
    diskintr,   /* [2] Intr0: disk controller */
    netintr,    /* [3] Intr1: network interface */
    ttyintr,    /* [4] Intr2: uart */
    fpuintr,    /* [5] Intr3: fpu interrupt */
    clkintr,    /* [6] Intr4: timer interrupt */
    dbgintr     /* [7] Intr5: bus errors, debug button, etc. */
};

/*
 * Interrupt exception handler.
 * 1) The xcp argument points to a structure which maps to
 *    the stack frame in which the CPU context (i.e. all
 *    registers) is saved.
 * 2) On entry all interrupts are masked (disabled).
 * 3) It calls the mips_setsr() function to modify the CPU
 *    Status register.
 */
void
interrupt (struct xcption *xcp)
{
    unsigned int pend, intrno;

    /* find all pending, unmasked interrupts */
    pend = xcp->cause & xcp->status & SR_IMASK;

    /* dispatch each pending interrupt, starting with
     * highest */
    for (intrno = 7; (pend & SR_IMASK) != 0;
         pend <<= 1, intrno--) {
        if (pend & SR_IBIT7) {
            /* enable only interrupts of higher priority
             * than this one; the final mask stops the shift
             * from spilling into Status bits above the
             * interrupt mask field. */
            unsigned int imask =
                (SR_IMASK << (intrno + 1)) & SR_IMASK;
            mips_setsr (imask | SR_IEC);
            /* call interrupt handler */
            (*intrhand[intrno]) (xcp);
        }
    }
    /* disable all interrupts */
    mips_setsr (0);
}

† Alternately, the programmer could obtain IDT/sim and/or IDT/kit from IDT, or a similar product from other 3rd party tools vendors.

Floating-point traps and interrupts
The previous section shows how to recognize a floating point interrupt.
Following the interrupt the EPC will either point at the FP instruction or (if
the FP instruction is in a branch delay slot) at the immediately preceding
branch.
To find out what happened, look first at the CPU Cause register. If it
shows a ‘‘co-processor unusable’’ condition, then the FPA instruction set
is not enabled. If it shows an interrupt at the FPA’s level, the handler can
get further details of exactly what has gone wrong by consulting the
floating point status register. However, there are only three cases of
interest:
• The FPA is disabled (CU1 == 0 in the CPU status register). If the CPU
does not have an FPA, the software might want to emulate the
instruction. If the FPA is available, the system might have been doing
an “enable-on-demand”†. If so enable it and return to retry the
instruction.

• The chip includes an FPA, and it’s enabled, and the FCR31 UnImp bit
is set. The FPA has interrupted because it can’t perform this
particular operation, with these particular operands. The normal
approach is to emulate the instruction – though in this case software
will want to put the result back in the real FP registers.
In theory there is a rather restricted range of operations and
operands which cause this condition: underflows, operations which
should produce the ‘‘illegal’’ NaN value, denormalised operands, NaN
operands, and infinite operands.
The system could put in special case code to handle just these
conditions. But it is very hard to get assurances about exactly when
the FPA may refuse an operation.
• The system has an enabled FPA, and the FP status register UnImp bit
is clear. It looks as if the FPA operation has produced an IEEE exception.
Software may need to report this to the application, in
some OS-dependent manner.
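
The three cases above can be distinguished with a few tests, sketched here in C. The bit positions are assumptions stated as this example understands the R30xx layout (CU1 as bit 29 of the Status register, the unimplemented-operation cause bit as bit 17 of the FP status register) and should be checked against the register descriptions elsewhere in this manual; the enum and function names are invented for the example.

```c
#include <stdint.h>

#define SR_CU1      0x20000000u  /* Status: coprocessor 1 usable (assumed bit 29) */
#define FCR31_UNIMP 0x00020000u  /* FP status: unimplemented op (assumed bit 17)  */

enum fp_case {
    FP_DISABLED,        /* emulate, or enable on demand and retry */
    FP_UNIMPLEMENTED,   /* FPA refused the operation: emulate it  */
    FP_IEEE_EXCEPTION   /* report to the application              */
};

/* Classify a floating-point trap from the saved Status register and
 * the FP status register read at the time of the exception. */
static enum fp_case classify_fp_trap(uint32_t status, uint32_t fcr31)
{
    if ((status & SR_CU1) == 0)
        return FP_DISABLED;
    if (fcr31 & FCR31_UNIMP)
        return FP_UNIMPLEMENTED;
    return FP_IEEE_EXCEPTION;
}
```

Note that the FP status register can only be read safely once the FPA is known to be enabled, which is why the CU1 test comes first.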
Emulating floating point instructions
• Locate the instruction: it will either be at EPC (when the BD bit of
the CPU Cause register is clear); or when BD is set, indicating that the
exception happened in a branch delay slot, the FP instruction will be
at EPC+4.
• Decode the instruction: The encoding of FP arithmetic instructions is
very regular.
• Fetch the operands: the instruction encoding tells which FP registers
hold the operand(s).
• Call the emulator: to perform the actual operation.
• Check for exceptions: if there are any enabled IEEE exceptions. If the
system architect knows that IEEE exceptions can’t usefully happen
(perhaps because there is no mechanism in place for applications to
catch them), skip this stage.
• Patch the result: back into the appropriate FP destination register.
• Hop over the emulated instruction: if BD was clear, just restart at
EPC+4.
But if BD was set the program is going to have to decode and emulate
the branch instruction (at EPC) too, and restart at the branch target
location.
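
The first two steps (locate and decode) might look like the sketch below. It assumes that the BD flag is bit 31 of the Cause register and that FP arithmetic instructions use the coprocessor 1 major opcode (0x11) with the fmt, ft, fs, fd and function fields in the positions shown; all names are invented for this example.

```c
#include <stdint.h>

#define CAUSE_BD 0x80000000u  /* exception occurred in a branch delay slot */

/* Fields of a coprocessor 1 arithmetic instruction. */
struct fp_insn {
    unsigned fmt;    /* operand format, e.g. 16 = single, 17 = double */
    unsigned ft;     /* second source register */
    unsigned fs;     /* first source register  */
    unsigned fd;     /* destination register   */
    unsigned funct;  /* operation, e.g. 0 = add */
};

/* Address of the faulting FP instruction, given EPC and Cause. */
static uint32_t fp_insn_addr(uint32_t epc, uint32_t cause)
{
    return (cause & CAUSE_BD) ? epc + 4 : epc;
}

/* Split an instruction word into its fields. Returns -1 if the word
 * does not carry the coprocessor 1 major opcode at all. */
static int decode_fp_insn(uint32_t insn, struct fp_insn *out)
{
    if ((insn >> 26) != 0x11)
        return -1;
    out->fmt   = (insn >> 21) & 0x1f;
    out->ft    = (insn >> 16) & 0x1f;
    out->fs    = (insn >> 11) & 0x1f;
    out->fd    = (insn >> 6)  & 0x1f;
    out->funct = insn & 0x3f;
    return 0;
}
```

As an example, the word 0x46241000 decodes as a double-precision add with fs = 2, ft = 4 and fd = 0. A complete emulator would also have to recognize the non-arithmetic coprocessor 1 encodings (moves and branches), which this sketch does not attempt.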

Debugging
Once the developer leaves behind the relative safety of a PROM monitor
and its debug support to develop the system PROM, finding out why the
code is not working correctly may become much more tedious.
At the worst, the programmer will have to use judicious calls to printf,
link the program in KSEG1 (i.e. uncached) and monitor CPU addresses
with a logic analyzer. Armed with a list of function addresses (e.g. the
output of the nm utility), and possibly a detailed disassembler listing for
the suspect function, it is often possible to deduce the bug. It is seldom
necessary to capture data values, although a few bits or a byte can be
useful if the analyzer has enough probes.
Some vendors offer R30xx disassemblers and special pods for an
analyzer to trace both instruction and data accesses.
Another technique is to include support for remote source-level
debugging in the new PROM. The use of a ROM emulator device may prove
helpful. This would allow the debugger to place “breakpoints” into the ROM
code.
† Some systems do this to avoid saving FP registers at context
switch if the application is not using the FPA.

UNIX-LIKE SYSTEM S/W
It is obviously impossible, in a few pages, to give a comprehensive
description of a big operating system. This section will provide some
background on what a portable big system does, and how it does it on
MIPS – so if the system needs to implement some fragment of this
functionality the programmer won’t be starting entirely from scratch. In
specific examples shown, the description below relates to the freely
redistributable ‘‘NetBSD’’ system, part of the Berkeley family.
The description is arranged as follows:
• Terminology: key words, often used with very particular meanings in
Unix-like systems:
• Components of a process:
• Protection: how the kernel protects itself and other processes from
misbehaving software;
• Kernel services:
• Virtual memory: how the MIPS architecture is used to build VM.
• Interrupts: how the CPU’s features relate to the needs of the OS.

Terminology
• Task: a thread of control, identified by a program counter and a stack.
In other contexts this may be called a ‘‘process’’ or ‘‘thread’’.
• Address space: the program memory context seen by an application.
For MIPS this is a single, simple 32-bit space, divided into two. The
lower 2Gbytes is accessible in user mode, but the upper 2Gbytes is
not usable except in kernel mode. Note that the address mapping
doesn’t change with CPU mode. There are no segments, no separate
I- and D-space.
This MIPS model fits very well onto the BSD family of Unix-like
systems, and was probably conceived with BSD’s requirements in
mind.
• Program: a bunch of code and data initialization, held on disc and
loaded when required.
A ‘‘process’’ combines all these three: it is a task in an address space
running a program.
• File: a named sequence of bytes coming from ‘‘outside’’. At its simplest
it is just data which can be written out to disc and subsequently read
back.
• Device: abstract, fairly unified interface to diverse real-world
peripherals. Devices are named like files, and offer the same basic
byte-stream model. Beneath this interface the kernel buffers data,
handles interrupts and hardware details, and also provides an escape
mechanism to keep device-specific functions tidy.
‘‘Device drivers’’ are the lowest layer of kernel software which deals
with hardware, and are supposed to isolate dependencies on
particular controllers/peripherals.
Network interfaces are handled differently, and networking code is
way beyond the scope of this chapter.
• Page fault: the OS maintains a mapping of program (virtual)
addresses to physical addresses. But it doesn’t have to keep all the
process pages in memory. Access to a page for which no translation
is defined causes a trap (a page fault) which invokes a piece of software
which checks that the address is legitimate, and if so brings the page
into memory. When a page is touched for the first time, it will either
be loaded from disc (if it is program text or initialized data) or supplied
set to zero (if it is uninitialized data or stack).

Components of a process
The above description refers to a BSD ‘‘process’’ as a task, address space
and program all at the same time. This is a restriction, but it does keep
things simpler.
“Processes” are laid out in memory as shown in Figure 11.1, “Memory
layout of a BSD process”.
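0xFFFF FFFF  +------------------------------+
             | per-process data             |  kseg2
0xC000 0000  +------------------------------+
             | IO registers (h/w dependent) |  kseg1
0xA000 0000  +------------------------------+
             | kernel data                  |
             | kernel code                  |  kseg0
0x8000 0000  +------------------------------+
             | stack (grows down)           |
             |                              |
             | heap (grows up)              |
             | declared data                |  kuseg
             | program code                 |
0x0000 0000  +------------------------------+

Figure H.1. Memory layout of a BSD process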
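
The region boundaries in the figure translate directly into a classification of any 32-bit program address. The following C sketch is only an illustration; the enum and function names are invented for this example.

```c
#include <stdint.h>

enum mips_seg { KUSEG, KSEG0, KSEG1, KSEG2 };

/* Classify a program address by the region it falls in. */
static enum mips_seg segment_of(uint32_t va)
{
    if (va < 0x80000000u) return KUSEG;  /* user space, mapped by TLB    */
    if (va < 0xa0000000u) return KSEG0;  /* kernel, unmapped, cached     */
    if (va < 0xc0000000u) return KSEG1;  /* kernel, unmapped, uncached   */
    return KSEG2;                        /* kernel, mapped by TLB        */
}

/* Only kuseg addresses may be referenced in user mode. */
static int user_accessible(uint32_t va)
{
    return segment_of(va) == KUSEG;
}
```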

• Program text: every process has a program in memory which it can
run (it may be ‘‘virtual memory’’, but to the process it seems to be
there).
• Stack: every process has a stack, which grows downwards from the
top of the user-accessible space. Since the MIPS architecture has no
architecture-specified stack pointer, the OS is always willing to
allocate pages of memory in the stack region if ever the program gets
a page fault.
• Declared data: the data declared in a C program is noted in the object
file, and explicitly accessed by compiled-in code. Initialized data is
paged from the program file as needed, uninitialized data is supplied
as zero-filled pages.
• Heap: this is the traditional name for data space allocated during
program run-time. At the top of the data section the kernel maintains
a boundary address (the break); on a page fault addresses above this
are rejected as invalid. To allocate extra data the process can invoke
the sbrk() system call; this is usually done implicitly when calling a
free-space manager function such as malloc().

• Kernel data structures: when a process in BSD makes a system call
the process continues execution, but in kernel mode. Some kernel
activity (such as interrupts) doesn’t run on a particular process
context, but most does.
So important parts of the process address space are inside the kernel,
and are not accessible while the process is running in user mode. In
particular, the process in kernel mode gets access to the whole kernel
code and data (mapped into kseg0) and to all IO registers (mapped in
kseg1).
It is a boon that, while a process is running in the kernel, all its user-mode
data is accessible at exactly the same addresses as in user
mode. Some architectures have to implement a special-case ‘‘copy
user data to/from kernel space’’ instruction.
• proc structure: lurking in the kernel data area are the two key data
structures which define the process. Why two? The smaller of these is
the proc structure and contains information which may be required
even when the process is not itself executing, and;
• per-process data area (u area): this is the larger process structure,
and is accessible only when the process is active. By a special trick of
the MMU, the per-process data area is mapped to a constant virtual
address inside the kernel, in the kseg2 region.
• kernel stack: attached to the per-process area, mapped into kseg2, is
the stack used by the process when executing in the kernel. It is this
stack which is ‘‘borrowed’’ by interrupts.

System calls and protection
One of the goals of BSD is protection for robustness; to ensure that a
user-level program which goes wrong cannot disrupt the rest of the
system. This is basically achieved by the process address space:
• In user mode, the process can only get at its user-mode virtual
addresses, which are only those pages allocated by the kernel.
• To get into kernel mode, the process has to drop through a system call
trap and can then perform only the function the system call allows. It
is the duty of the system call itself to check its arguments for sanity,
and to make sure that it behaves properly.
Interrupts and inadvertent traps behave much like system calls,
albeit ones which don’t work on behalf of the user process.
Of course, since the process has the whole kernel mapped it can at any
time attempt a reference to kernel code or data; but in user mode this will
be immediately trapped, and find its way to a memory reference error
handler – which by default will kill the process.
R30xx security features are pretty much the minimum that will support
a BSD-style OS. Many architectures offer much more; but portable OS’,
since they want to be portable, use only the lowest common denominator
of security functions – and since all significant microprocessor OS’ are now
portable, the extra functionality is wasted.

What the kernel does
In the BSD system the kernel is the essential common ground between
processes, and must share out access to any resource for which processes
compete (CPU time, memory, disc bandwidth etc.). It must also provide
basic mechanisms so that processes which want to co-operate can
communicate with each other. BSD and other Unix-like systems are
traditionally rather kernel-heavy; more modern OS’ try to provide only
minimum functions in the kernel (which is then often called a microkernel),
handing over other jobs to distinct ‘‘server’’ processes.
• File system: the kernel provides access to the file system, which is
based on open/read/write/close functions. In practice this splits into
two; resolving names and then implementing file I/O.
There will usually be multiple file system implementations (but each
offering the same service); a file I/O system call will be redirected to
the correct code according to whether the file is local, on NFS, on a
DOS floppy disc, etc.
• Scheduling: BSD decides which process to run. Most of the time,
processes will run until they need some input – and then they’ll make
a system call to get the input and block until the input is ready.
But sometimes a process needs to compute for longer; in this case it
will be time-sliced; it will be allowed to run only for a second or so and
then another process will be given a go.
To prevent a compute-bound process from clogging up the CPU,
processes are given priorities, and any process which uses up its time
slice has its priority reduced. A priority-based scheduling decision is
made often – potentially, after any interrupt.
• Paging: the kernel shares memory by picking pages of memory which
don’t appear to have been used for a while, and throwing them out. A
data page which has been written by a process since it came in must
first be saved to a disc swap file.
The MIPS architecture gives no direct help in tracing what happens to
pages; in many architectures the MMU hardware notes (separately)
whenever a page is either referenced or written. In MIPS this must be
simulated; so the kernel picks pages and marks them as (from the
point of view of the hardware) ‘‘read only’’ or ‘‘invalid’’. Then it waits;
if a process references or writes the page a trap will be generated, and
the trap handler will look at the page status and set a software
referenced/written bit.
In this way processes which are not active slowly migrate out of
memory.
• Caching and sharing code: it often happens, particularly in a multi-user
system, that there are multiple processes all running the same
program. NetBSD treats code pages (i.e. read-only pages marked as
loadable from a file) as sharable; when they are kept in memory they
are indexed by their disc location. During periods of relatively light
load (which is most of the time in most systems) much of memory has
nothing very useful in it; so code pages are allowed to stay there,
forming a least-recently used cache.
This means that a program which is repeatedly re-run to completion
goes much faster. Although each time a process must be created and
the whole program nominally ‘‘paged in’’, in practice all that is needed
is to construct a set of entries referencing the already memory-resident code.
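
The software simulation of referenced and written bits described under Paging above can be modelled as follows. This is a host-side sketch with invented names and a deliberately simplified page descriptor; a real kernel would keep this state in its page table entries and act on it from the TLB trap handlers.

```c
#include <stdbool.h>

/* Simplified, hypothetical page descriptor. */
struct page {
    bool hw_valid;     /* hardware may translate the page          */
    bool hw_writable;  /* hardware allows stores to the page       */
    bool may_write;    /* the process is really allowed to write   */
    bool referenced;   /* software-maintained "referenced" bit     */
    bool modified;     /* software-maintained "written" bit        */
};

/* The kernel arms a page so that the next access will trap. */
static void arm_page(struct page *p)
{
    p->hw_valid = false;
    p->hw_writable = false;
    p->referenced = false;
}

/* Called from the trap handler. Records the access and re-enables the
 * page; returns false for a genuine protection violation. */
static bool note_access(struct page *p, bool is_write)
{
    if (is_write && !p->may_write)
        return false;
    p->referenced = true;
    p->hw_valid = true;
    if (is_write) {
        p->modified = true;
        p->hw_writable = true;
    }
    return true;
}
```

A page whose referenced bit is still clear some time after being armed is a candidate for eviction; if its modified bit is set it must be written to the swap file first.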

Virtual memory implementation for MIPS
The R30xx hardware supports an arbitrary (though small) set of
translations in its 64-entry TLB. When an address is encountered which
doesn’t match with one of these, the CPU takes an exception (a tlbmiss)
and software must find a new translation and load it.
‘‘tlbmiss’’ events can occur very frequently when running large
programs, and the trap handler must run quickly. Misses for user-mode
addresses are vectored through a dedicated trap vector, to the utlbmiss
routine; since MIPS kernels can be built to run largely in the kseg0/kseg1
areas (which don’t require the TLB) the vast majority of TLB misses are
user ones.
To speed the trap handler, most systems will keep memory-resident
tables of page entries, in a format already bit-for-bit compatible with the
hardware-determined TLB entries.
It would be nice to do this by keeping a simple array of TLB entries,
indexed by virtual address. However, with a 2Gbyte range of user
addresses and 4Kbyte pages, the array would require 512K entries,
occupying 2Mbytes of memory. Since the program address space has huge
‘‘holes’’ in it, most of this 2Mbytes of memory would be full of nothing –
which is a lot of memory to dedicate.
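
The figures quoted above follow from simple arithmetic, shown here as a C sketch. The only assumption beyond the text is that each table entry occupies one 32-bit word, which is what the 2Mbyte total implies; the names are invented for this example.

```c
#include <stdint.h>

#define USER_SPACE_BYTES 0x80000000u  /* 2 Gbytes of user addresses  */
#define PAGE_BYTES       4096u        /* 4 Kbyte pages               */
#define PTE_BYTES        4u           /* one 32-bit word per entry   */

/* Number of entries in a flat table covering all user addresses. */
static uint32_t linear_table_entries(void)
{
    return USER_SPACE_BYTES / PAGE_BYTES;
}

/* Memory occupied by that table, in bytes. */
static uint32_t linear_table_bytes(void)
{
    return linear_table_entries() * PTE_BYTES;
}
```

This gives 524288 (512K) entries and 2097152 bytes (2Mbytes) for every process, regardless of how little of its address space the process actually uses.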
Two different solutions to this problem are used. MIPS Corp’s UMIPS
and RISC/os variants use a linear page table but don’t keep it all in
memory; NetBSD uses a memory-held secondary cache of page table
entries supporting a machine-independent data structure:
• Linear Page table not all in memory: the linear page table is located in
the virtual space kseg2. Although the whole page table is very large,
most of it is never referenced, never allocated a kseg2 translation, and
therefore costs nothing. The active parts of the page table correspond
with the stack, data and code parts of the process address space; and
for these the kseg2 translation is likely to remain live.
The CPU’s Context register is explicitly designed to do the work of
computing where the desired page table entry lies, saving a few more
instructions.
This does require that the utlbmiss handler can safely suffer a regular
trap, to cope with those occasions where the page table read falls on
a kseg2 address which is not currently translated by the TLB. This
nested exception is not allowed to happen in any other
circumstances; but its use here motivates another feature of the MIPS
hardware, and a convention:
a) The status register’s internal stack of processor state (2 bits for
kernel/user mode and interrupt on/off) is three deep; allowing
an exception to occur in an exception handler, before the status
register gets saved.
b) The ‘‘nested’’ exception overwrites the EPC value (return address)
from the original address reference, so the utlbmiss handler
saves EPC into the general-purpose register k1; the regular trap
handler which deals with kernel TLB misses has to detect the
double-exception and return to the right place.
This is why there are two registers (k0,k1) reserved for exception
handling: most of the time only one is needed.
• Secondary cache of page table entries: NetBSD uses a different
technique. Here the TLB miss handler consults a software cache of
recently used page table entries. The software cache is implemented
with a simple 2-set hashing function, with a fast path for translations
which are in the same set as their predecessor. A modestly large cache
gives an excellent hit rate – so those few translations which miss here
can be computed by a C-language routine using architecture-independent tables.
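
For the linear page table scheme, the work done by the Context register can be modelled in C as below. The field layout is an assumption stated as this example understands the R30xx register (page table base in bits 31..21, virtual page number of the failing address in bits 20..2, low two bits zero); it should be checked against the register description elsewhere in this manual. The names are invented for the example.

```c
#include <stdint.h>

#define CTX_PTEBASE_MASK 0xffe00000u  /* bits 31..21: set by the kernel   */
#define CTX_BADVPN_MASK  0x001ffffcu  /* bits 20..2: filled in by the CPU */

/* Value the Context register would hold after a miss on bad_vaddr:
 * the address of the 4-byte page table entry for that page. */
static uint32_t context_value(uint32_t pte_base, uint32_t bad_vaddr)
{
    /* Page number (address >> 12) times entry size (<< 2) is >> 10. */
    return (pte_base & CTX_PTEBASE_MASK) |
           ((bad_vaddr >> 10) & CTX_BADVPN_MASK);
}
```

With the table based at 0xc0000000 in kseg2, a miss on user address 0x00401234 yields 0xc0001004, so the utlbmiss handler can fetch the entry with a single load from the address held in Context.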

Interrupt handling for MIPS
Interrupt handling in Unix-like OS’ is descended from the priority-based
system implemented in hardware by DEC’s PDP-11 and VAX
architectures. Priorities are numbered from 0 to 7 (though not all are
always used) – more recently, the numeric priorities have been getting
names.
• Priority model and spl: kernel code is arranged so that, in general,
each piece of code is accessible only at or above a particular priority
level. So, for example, once a program is at level 4 the CPU will only
accept interrupt requests prioritized at level 5 and above.
Most of the kernel code used by system calls runs at level 0.
Device code which needs to lock itself against asynchronously-occurring
interrupt events can call a function such as spl4() (spl
stands for ‘‘set processor level’’): there is a separate call for each level.
spl4() returns a value representing the priority level when it was
called, so the code sequence:

p = spl4();
/* do something which can’t be interrupted */
splx(p);

restores whatever is required to lower the level again.
Note that interrupt handlers can get called at two points: either as
soon as the interrupt signal is activated, or (if the processor is
currently at a higher spl) the handler will be called when a call to
splx() lowers the level below the interrupt’s priority.
How it works
The MIPS interrupt hardware knows nothing of levels, with only an
unprioritized mask for the interrupt inputs. But if an spl level can be
assigned to each of the interrupt inputs, then each of the spl..() routines
can be implemented by setting the interrupt mask to a value enabling only
those interrupts allocated a higher level.
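
The mapping from spl levels to interrupt masks can be sketched in C as follows. Everything here is illustrative: the level assigned to each input is a made-up example, the Status register is modelled by an ordinary variable so the code can run on any host, and a real implementation would read and write the register with the toolkit's own access routines.

```c
#include <stdint.h>

#define SR_IMASK 0x0000ff00u  /* interrupt mask field of the Status register */

/* Hypothetical spl level for each of the eight interrupt inputs
 * (index 0 = SInt0 ... index 7 = Intr5). */
static const unsigned char input_level[8] = { 1, 1, 3, 4, 5, 6, 6, 7 };

/* Stand-ins for the CPU Status register and the current level. */
static uint32_t sim_status = SR_IMASK;
static unsigned sim_level = 0;

/* Mask bits left enabled at a given level: only inputs allocated a
 * strictly higher level may interrupt. */
static uint32_t mask_for_level(unsigned level)
{
    uint32_t mask = 0;
    for (unsigned i = 0; i < 8; i++)
        if (input_level[i] > level)
            mask |= 0x100u << i;
    return mask;
}

/* Set an absolute level; returns the previous level. */
static unsigned splx(unsigned level)
{
    unsigned old = sim_level;
    sim_level = level;
    sim_status = (sim_status & ~SR_IMASK) | mask_for_level(level);
    return old;
}

/* Raise to level 4. In this sketch the level is never lowered here,
 * which is a design choice of the example rather than a rule. */
static unsigned spl4(void)
{
    return splx(sim_level > 4 ? sim_level : 4);
}
```

With the example table, spl4() leaves only the four highest inputs enabled (mask 0xf000), and splx() with the saved value restores the full mask.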


ASSEMBLY LANGUAGE
PROGRAMMING TIPS

CHAPTER 16

Integrated Device Technology, Inc.

The MIPS-1 architecture found in the R30xx family is designed for
high-frequency, single-cycle instruction operation. Also, as noted earlier, the
MIPS architecture does not carry a status register, nor does it directly
support various addressing formats. As a result, some operations that may
have been found in older CISC architectures must be synthesized from
multiple instructions in the MIPS architecture. The net execution time is
typically improved, however, since these complex instructions were
inherently multi-cycle in these older CISC architectures.
This chapter describes common programming problems and their
implementation in the MIPS architecture. Many of these operations are
directly supported by the synthetic instructions, described earlier.
Also note that many of these instructions require the use of $at (the
assembler temporary register) described earlier.

32-bit Address or Constant Values
As noted earlier in this manual, the MIPS-1 instruction set does not
have enough room in the bit encoding to directly support 32-bit constants
or constant address values. Thus, programmers must use combinations of
instructions to generate 32-bit values.
Again, these are commonly handled using the synthetic la or li
instructions. Depending on the immediate value, the assembler will
generate one or two instructions to implement the immediate load into the
register:
Operand                    Instruction Sequence

Upper 16 bits all zero     ori  rd, $0, value15..0

Upper 17 bits all one      addi rd, $0, value15..0

Lower 16 bits all zero     lui  rd, value31..16

All other values           lui  rd, value31..16
                           ori  rd, rd, value15..0

Table 16.1. 32-bit immediate values
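
The choice between the rows of the table can be expressed as a small C function. This is a sketch of the selection rule only, with invented names; a given assembler may order the tests differently or prefer a different single instruction where more than one would do.

```c
#include <stdint.h>

enum li_form { LI_ORI, LI_ADDI, LI_LUI, LI_LUI_ORI };

/* Pick the instruction sequence for loading a 32-bit constant. */
static enum li_form li_form_for(uint32_t value)
{
    if ((value & 0xffff0000u) == 0)
        return LI_ORI;        /* upper 16 bits all zero             */
    if ((value & 0xffff8000u) == 0xffff8000u)
        return LI_ADDI;       /* upper 17 bits all one (small negative) */
    if ((value & 0x0000ffffu) == 0)
        return LI_LUI;        /* lower 16 bits all zero             */
    return LI_LUI_ORI;        /* everything else needs two instructions */
}

/* Number of machine instructions the load will occupy. */
static unsigned li_insn_count(uint32_t value)
{
    return li_form_for(value) == LI_LUI_ORI ? 2 : 1;
}
```

For instance 0x00001234 needs a single ori, 0xfffffff0 (that is, -16) a single addi, 0x12340000 a single lui, and 0x12345678 the full lui/ori pair.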

To jump to an absolute 32-bit address, a similar construct must be
used. The la synthetic instruction is used to load the target address into a
register; a jr (jump register) is then used to perform the jump.
Note that j and jal may be used in many instances. However, these
instructions take the high-order four bits of the current “PC” as the upper
four bits of the target address, and thus limit the program space that can
be reached. In practice, this limit may be larger than the address space of
most typical embedded applications.
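
Whether a given target can be reached by j or jal can be checked as sketched below. The example assumes, as the MIPS instruction definitions specify, that the upper four bits are taken from the address of the instruction in the delay slot (the jump's own address plus 4); the function name is invented for this example.

```c
#include <stdint.h>

/* Nonzero when a j/jal located at address pc can reach target:
 * the target must be word-aligned and lie in the same 256Mbyte
 * region as the delay-slot instruction. */
static int j_reachable(uint32_t pc, uint32_t target)
{
    return (((pc + 4) ^ target) & 0xf0000000u) == 0 &&
           (target & 3u) == 0;
}
```

Code running at 0x9fc00000 can therefore jump anywhere in 0x90000000 to 0x9ffffffc with j, but needs la and jr to reach its uncached alias at 0xbfc00000.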

Use of “Set” Instructions
The MIPS ISA provides a very powerful operation to enable the easy
synthesis of complex test operations.
The “set” instructions place a value of ‘1’ (true) or ‘0’ (false) into the
specified destination register to reflect the outcome of a specified
comparison operation. When used with conditional branch operations,
complex comparison sequences can be implemented, as well as
add-with-carry or subtract-with-borrow operations.

Use of “Set” with Complex Branch Operations
The MIPS instruction set directly implements branch comparisons for
the following cases:
- two registers equal
- two registers not equal
- register greater-than-or-equal to zero
- register less-than-or-equal to zero
- register greater-than zero
- register less-than zero
These branch comparisons implement a wide range of common
test conditions directly in hardware. However, in certain situations the
programmer may require a more complicated test between two non-zero
registers. This is where the “set” instructions are used.
For example, if the programmer wishes to branch conditionally if one
register is less than another, a two instruction sequence is used:
        slt     $at, $a, $b
        bne     $at, $0, target     # branch to target if a < b

Using analogous instruction sequences, the programmer can synthesize
virtually any comparison between two registers using the various set
instructions.
Similarly, comparisons with immediate values can be implemented. For
example, to compare whether a register value is less-than-or-equal-to an
immediate:
        slti    $at, $a, imm+1
        bne     $at, $0, target     # branch to target if a <= imm

Of course, if the immediate value is large, then the programmer must
first build it into a register as described earlier in this chapter, and then
perform the comparison.
Many of these common operations are already built into the synthetic
instruction set supported by a given toolchain assembler package. The
programmer is advised to consult the reference manual.
Carry, borrow, overflow, and multi-precision math
The MIPS-1 ISA does not directly support a carry bit. Instead, the effects
of a carry bit can be synthesized when needed using the “set” constructs.
This enables the programmer to implement tests for overflow,
multi-precision math, and add-with-carry operations.
For example, these constructs enable the programmer to perform tests
to determine whether an arithmetic operation resulted in a carry (or
borrow).
For add sequences, there are two cases to consider:
Case                       Instruction Sequence

No possible carry from     addu temp, A, B
previous operation         sltu carryout, temp, B   # carryout from A + B

Carry-in from previous     not  temp, A
operation                  sltu carryout, B, temp
                           xor  carryout, 1         # carry-out from A+B+1

Table 16.2. Add-with-carry

Subtract with borrow works analogously:

Case                       Instruction Sequence

No borrow-in               sltu borrow, A, B    # borrow-out from A-B

Borrow-in from previous    sltu borrow, B, A
                           xor  borrow, 1       # borrow-out from A-B-1

Table 16.3. Subtract-with-borrow operation
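
The carry and borrow tests can be checked on any host with unsigned 32-bit arithmetic in C, where each comparison plays the part of sltu. This is a sketch with invented function names. It relies on two facts: an unsigned sum has wrapped exactly when it is smaller than either operand, and a borrow out of A-B occurs exactly when A is less than B.

```c
#include <stdint.h>

/* Carry out of a + b (no carry in). */
static unsigned carry_out(uint32_t a, uint32_t b)
{
    uint32_t temp = a + b;   /* addu: wraps modulo 2^32 */
    return temp < b;         /* sltu */
}

/* Carry out of a + b + 1 (carry in). */
static unsigned carry_out_with_carry_in(uint32_t a, uint32_t b)
{
    uint32_t temp = ~a;      /* not  */
    return (b < temp) ^ 1u;  /* sltu, then xor with 1 */
}

/* Borrow out of a - b (no borrow in). */
static unsigned borrow_out(uint32_t a, uint32_t b)
{
    return a < b;
}

/* Borrow out of a - b - 1 (borrow in). */
static unsigned borrow_out_with_borrow_in(uint32_t a, uint32_t b)
{
    return (b < a) ^ 1u;
}

/* 64-bit addition built from 32-bit halves. */
static void add64(uint32_t ahi, uint32_t alo, uint32_t bhi, uint32_t blo,
                  uint32_t *rhi, uint32_t *rlo)
{
    *rlo = alo + blo;
    *rhi = ahi + bhi + carry_out(alo, blo);
}
```

The add64() function shows the typical use: the carry out of the low words is simply added into the sum of the high words.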

Testing for overflow also uses the set instructions, coupled with two
basic rules:
• An addition operation has overflowed if:
— the sign of both operands is the same
— the sign of the result differs from the sign of the operands
• A subtraction has overflowed if
— the signs of the two operands are different
— the sign of the result is different from the sign of the minuend
Testing for these conditions is a simple programming exercise. For
example, testing for overflow in signed addition:
/* branch to Label if t1+t2 overflows */
addu    t0, t1, t2      /* result in t0 */
xor     t3, t1, t2      /* check signs of operands */
bltz    t3, 1f          /* if signs differ, then no overflow */
xor     t3, t0, t1      /* check sign of result */
bltz    t3, Label       /* overflow... */
1:
/* no overflow */
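The two overflow rules can also be written as small C predicates, which may help when checking a hand-written assembly sequence against expected results. This is a sketch only; the function names `add_overflows` and `sub_overflows` are assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>

#define SIGN_BIT 0x80000000u

/* Signed 32-bit addition overflows when both operands have the same sign
   and the result has a different sign. */
static int add_overflows(uint32_t a, uint32_t b) {
    uint32_t sum = a + b;                    /* addu: no trap, wraps */
    if ((a ^ b) & SIGN_BIT) return 0;        /* operand signs differ */
    return ((sum ^ a) & SIGN_BIT) != 0;      /* result sign changed */
}

/* Signed 32-bit subtraction a - b overflows when the operands have
   different signs and the result sign differs from the minuend a. */
static int sub_overflows(uint32_t a, uint32_t b) {
    uint32_t diff = a - b;                   /* subu: no trap, wraps */
    if (!((a ^ b) & SIGN_BIT)) return 0;     /* operand signs equal */
    return ((diff ^ a) & SIGN_BIT) != 0;     /* result sign != minuend sign */
}
```

Operands are passed as unsigned words so that the wrap-around arithmetic is well defined in C; the sign is read from bit 31, as the `bltz` tests do.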

MACHINE INSTRUCTIONS
REFERENCE

APPENDIX A

Integrated Device Technology, Inc.

CPU Instruction Overview
This appendix provides a detailed description of the operation of each
user mode CPU Instruction for the MIPS I architecture. The instructions
are listed in alphabetical order.
Exceptions that may occur due to the execution of each instruction are
listed after the description of each instruction. Descriptions of the
immediate cause and manner of handling exceptions are omitted from the
instruction descriptions in this appendix.

Instruction Classes
CPU instructions are divided into the following classes:
• Load and Store instructions move data between memory and general
registers. They are all I-type instructions, since the only addressing
mode supported for the general registers is base register + 16-bit
immediate offset.
• Computational instructions perform arithmetic, logical and shift
operations on values in registers. They occur in both R-type (both
operands are registers) and I-type (one operand is a 16-bit immediate)
formats.
• Jump and Branch instructions change the control flow of a program.
Jumps are always made to absolute 26-bit word addresses (J-type
format), or register addresses (R-type), for returns and dispatches.
Branches have 16-bit offsets relative to the program counter (I-type).
Jump and Link instructions save their return address in register 31.
• Coprocessor instructions perform operations in the coprocessors.
Coprocessors have up to two register sets separate from the CPU.
Coprocessor loads and stores, similar to those for the general
registers, are defined for the coprocessors and are I-type.
Coprocessor computational instructions have coprocessor-dependent
formats.
• Special instructions perform a variety of tasks, including movement
of data between special and general registers, trap, and breakpoint.
They are always R-type.

Instruction Formats
Every CPU instruction consists of a single word (32 bits) aligned on a
word boundary. The major instruction formats are shown in Figure A.1.

I-Type (Immediate)
Bits     31..26   25..21   20..16   15..0
Field    op       rs       rt       immediate

J-Type (Jump)
Bits     31..26   25..0
Field    op       target

R-Type (Register)
Bits     31..26   25..21   20..16   15..11   10..6    5..0
Field    op       rs       rt       rd       shamt    funct

op           6-bit operation code
rs           5-bit source register specifier
rt           5-bit target (source/destination) or branch condition
immediate    16-bit immediate, branch displacement or address displacement
target       26-bit jump target address
rd           5-bit destination register specifier
shamt        5-bit shift amount
funct        6-bit function field

Figure A.1: CPU Instruction Formats
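The field boundaries of Figure A.1 translate directly into shifts and masks. The following is a minimal sketch of a decoder, assuming a hypothetical `mips_fields` structure and `decode` function that extract every field regardless of format; the caller is expected to use only the fields meaningful for the opcode.

```c
#include <assert.h>
#include <stdint.h>

/* All fields of the three formats; names follow Figure A.1. */
typedef struct {
    uint32_t op;         /* bits 31..26 */
    uint32_t rs;         /* bits 25..21 */
    uint32_t rt;         /* bits 20..16 */
    uint32_t rd;         /* bits 15..11 (R-type) */
    uint32_t shamt;      /* bits 10..6  (R-type) */
    uint32_t funct;      /* bits 5..0   (R-type) */
    uint32_t immediate;  /* bits 15..0  (I-type), not sign-extended */
    uint32_t target;     /* bits 25..0  (J-type) */
} mips_fields;

static mips_fields decode(uint32_t word) {
    mips_fields f;
    f.op        = (word >> 26) & 0x3Fu;
    f.rs        = (word >> 21) & 0x1Fu;
    f.rt        = (word >> 16) & 0x1Fu;
    f.rd        = (word >> 11) & 0x1Fu;
    f.shamt     = (word >> 6)  & 0x1Fu;
    f.funct     =  word        & 0x3Fu;
    f.immediate =  word        & 0xFFFFu;
    f.target    =  word        & 0x03FFFFFFu;
    return f;
}
```

For instance, the word 0x00851020 decodes as op 0 (SPECIAL), rs 4, rt 5, rd 2, funct 100000, which is the ADD encoding described later in this appendix.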

Instruction Notation Conventions
In this appendix, all variable subfields in an instruction format (such as
rs, rt, immediate, etc.) are shown in lowercase names.
For the sake of clarity, an alias is sometimes used for a variable subfield
in the formats of specific instructions. For example, rs = base is used in the
format for load and store instructions. Such an alias is always lowercase,
since it refers to a variable subfield.
In the instruction descriptions that follow, the Operation section
describes the operation performed by each instruction using a high-level
language notation.
Special symbols used in the notation are described below.

Symbol          Meaning

←               Assignment.

||              Bit string concatenation.

x^y             Replication of bit value x into a y-bit string (y is a
                superscript). Note: x is always a single-bit value.

x_y..z          Selection of bits y through z of bit string x (y..z is a
                subscript). Little-endian bit notation is always used. If y
                is less than z, this expression is an empty (zero length)
                bit string.

+ - *           2’s complement or floating-point arithmetic: addition,
                subtraction, multiplication.

div             2’s complement integer division.

mod             2’s complement modulo.

/               Floating-point division.

<               2’s complement less than comparison.

nor             Bit-wise logical NOR.

xor             Bit-wise logical XOR.

and             Bit-wise logical AND.

or              Bit-wise logical OR.

GPRlen          The length, in bits (32 for MIPS-I), of the CPU General
                Purpose Registers.

GPR[x]          General-Register x. The content of GPR[0] is always zero.
                Attempts to alter the content of GPR[0] have no effect.

FCC[cc]         Floating-Point condition code cc. FCC[0] has the same value
                as COC[1].

CPR[z,x]        Coprocessor unit z, general register x.

CCR[z,x]        Coprocessor unit z, control register x.

COC[z]          Coprocessor unit z condition signal.

BigEndianMem    Big-endian mode as configured at reset (0 → Little, 1 → Big).
                Specifies the endianness of the memory interface (see
                LoadMemory and StoreMemory), and the endianness of Kernel
                and Supervisor mode execution.

ReverseEndian   Signal to reverse the endianness of load and store
                instructions. This feature is available in User mode only,
                and is effected by setting the RE bit of the Status register.
                Thus, ReverseEndian may be computed as (SR_25 and User mode).

BigEndianCPU    The endianness for load and store instructions (0 → Little,
                1 → Big). In User mode, this endianness may be reversed by
                setting SR_25. Thus, BigEndianCPU may be computed as
                BigEndianMem XOR ReverseEndian.

LLbit           Only valid for MIPS-II instructions.

T+i:            Indicates the time steps between operations. Each of the
                statements within a time step are defined to be executed in
                sequential order (as modified by conditional and loop
                constructs). Operations which are marked T+i: are executed
                at instruction cycle i relative to the start of execution of
                the instruction. Thus, an instruction which starts at time j
                executes operations marked T+i: at time i + j. The
                interpretation of the order of execution between two
                instructions or two operations which execute at the same
                time should be pessimistic; the order is not defined.

Table A.4: CPU Instruction Operation Notations

Instruction Notation Examples
The following examples illustrate the application of some of the
instruction notation conventions:

Example #1:
GPR[rt] ← immediate || 0^16

Sixteen zero bits are concatenated with an immediate
value (typically 16 bits), and the 32-bit string (with the lower
16 bits set to zero) is assigned to General-Purpose Register rt.

Example #2:
(immediate_15)^16 || immediate_15..0

Bit 15 (the sign bit) of an immediate value is extended for
16 bit positions, and the result is concatenated with bits 15
through 0 of the immediate value to form a 32-bit sign
extended value.
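Readers more familiar with C than with the bit-string notation may find it useful to see the two examples written as shifts and masks. The helper names `concat_zero16` and `sign_extend16` below are assumptions made for this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Example #1: immediate || 0^16
   The 16-bit immediate becomes the upper half; the lower half is zero. */
static uint32_t concat_zero16(uint32_t immediate) {
    return (immediate & 0xFFFFu) << 16;
}

/* Example #2: (immediate_15)^16 || immediate_15..0
   Bit 15 is replicated into the upper 16 bits. */
static uint32_t sign_extend16(uint32_t immediate) {
    uint32_t low   = immediate & 0xFFFFu;          /* immediate_15..0 */
    uint32_t sign  = (low >> 15) & 1u;             /* immediate_15 */
    uint32_t upper = sign ? 0xFFFFu : 0x0000u;     /* replicate 16 times */
    return (upper << 16) | low;                    /* concatenate */
}
```

Concatenation corresponds to a shift followed by OR, and replication of a single bit to selecting an all-ones or all-zeros mask of the stated width.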

Load and Store Instructions
In R30xx family processors all loads are implemented with a delay of
one instruction. The instruction immediately following a load may not use
the destination register of the load instruction; at least one instruction
must come between load and use. The hardware does not enforce this
restriction nor detect a failure to follow it. One exception to the load delay
is that Load Word Right and Load Word Left may specify a destination
register that is the same register used as the destination of an immediately
preceding load. This allows an LWL, LWR pair without intervening
instructions.
The regular I-type load and store instructions use
base_register+offset addressing. In the load and store descriptions, the
functions listed below are used to summarize the handling of virtual
addresses and physical memory.
Function              Meaning

AddressTranslation    Determines the physical address given the virtual
                      address. The function fails and an exception is taken
                      if the required translation is not present in the TLB
                      (“E” parts only).

LoadMemory            Uses the cache and main memory to find the contents of
                      the word containing the specified physical address. The
                      low-order two bits of the address and the Access Type
                      field indicate which of the four bytes within the data
                      word need to be returned. If the cache is enabled for
                      this access, the entire word is returned and loaded
                      into the cache.

StoreMemory           Uses the cache, write buffer, and main memory to store
                      the word or part of word specified as data in the word
                      containing the specified physical address. The
                      low-order two bits of the address and the Access Type
                      field indicate which of the four bytes within the data
                      word should be stored.

Table A.5: Load and Store Common Functions

As shown below, the Access Type field indicates the size of the data item
to be loaded or stored. Regardless of access type or byte-numbering order
(endianness), the address specifies the byte which has the smallest byte
address in the addressed field. For a big-endian machine, this is the
leftmost byte and contains the sign for a 2’s complement number; for a
little-endian machine, this is the rightmost byte. Note that for R30xx CPUs,
the only valid sizes are word and smaller.
Access Type Mnemonic    Value    Meaning

DOUBLEWORD              7        8 bytes (64 bits)
SEPTIBYTE               6        7 bytes (56 bits)
SEXTIBYTE               5        6 bytes (48 bits)
QUINTIBYTE              4        5 bytes (40 bits)
WORD                    3        4 bytes (32 bits)
TRIPLEBYTE              2        3 bytes (24 bits)
HALFWORD                1        2 bytes (16 bits)
BYTE                    0        1 byte (8 bits)

Table A.6: Access Type Specifications for Loads/Stores

The bytes within the addressed doubleword which are used can be
determined directly from the access type and the three low-order bits of the
address.

Jump and Branch Instructions
All jump and branch instructions have an architectural delay of exactly
one instruction. That is, the instruction immediately following a jump or
branch (that is, occupying the delay slot) is always executed while the
target instruction is being fetched from storage. A delay slot may not itself
be occupied by a jump or branch instruction; however, this error is not
detected and the results of such an operation are undefined.
If an exception or interrupt prevents the completion of a legal instruction
in a branch delay slot, the hardware sets the EPC register to point at
the jump or branch instruction and sets an indication (the BD bit of the
Cause register) that the exception was caused by the instruction in the
delay slot. To continue the instruction stream and re-execute the
instruction that faulted, both the jump or branch instruction and the
instruction in the delay slot are re-executed.
Because jump and branch instructions may be restarted after
exceptions or interrupts, they must be restartable. Therefore, when a
jump or branch instruction stores a return link value, register 31 (the
register in which the link is stored) may not be used as a source register.
Since instructions must be word-aligned, a Jump Register or Jump
and Link Register instruction must use a register containing a valid word
address. If the two low-order bits are not zero, an address exception will
occur when the jump target instruction is subsequently fetched.

Coprocessor Instructions
Coprocessors are alternate execution units, which have register files
separate from the CPU. The MIPS architecture provides a uniform
abstraction for a few coprocessor units, some of which are implemented in
any particular processor. The coprocessors may have two register spaces,
each space containing up to thirty-two registers. Coprocessor
computational instructions may alter registers in either space.
• The first space, coprocessor general registers, may be directly loaded
from memory and stored into memory, and their contents may be
transferred between the coprocessor and processor general registers.
• The second space, coprocessor control registers, may only have their
contents transferred directly between the coprocessor and the
processor general registers.

System control for all MIPS processors is implemented as Coprocessor
0 (CP0) – the System Control Coprocessor. It provides the processor
control, memory management, and exception handling functions. The
CP0 instructions are specific to each CPU and are documented with the
CPU-specific information.
If a system includes a Floating Point Unit for floating-point computation,
it is implemented as Coprocessor 1 (CP1). The FPU instructions are
documented in Appendix B.

System Control Coprocessor (CP0) Instructions
Special limitations apply to operations involving CP0, which is
incorporated within the CPU. Load and store instructions are not
valid for CP0 registers; the move to/from coprocessor instructions are the
only valid mechanism for writing to and reading from the CP0 registers.

Instruction Set Details
The following pages contain an alphabetical listing of the CPU
instructions for the R30xx family.

ADD    Add Word

Bits     31..26    25..21   20..16   15..11   10..6    5..0
Field    SPECIAL   rs       rt       rd       0        ADD
Value    000000                               00000    100000
Width    6         5        5        5        5        6

Format:
ADD rd, rs, rt

Purpose:
Add two 32-bit values and produce a 32-bit result; arithmetic overflow causes an exception.

Description:
The word value in general register rt is added to the word value in general register rs and the result
word value is placed into general register rd. If the addition results in 32-bit 2’s complement
arithmetic overflow (carries out of bits 30 and 31 differ) then the destination register rd is not
modified and an integer overflow exception occurs.

Operation:
T:    GPR[rd] ← GPR[rs] + GPR[rt]

Exceptions:
Integer overflow exception

ADDI    Add Immediate Word

Bits     31..26   25..21   20..16   15..0
Field    ADDI     rs       rt       immediate
Value    001000
Width    6        5        5        16

Format:
ADDI rt, rs, immediate

Description:
The 16-bit immediate is sign-extended and added to the contents of general register rs to form the
result. The result is placed into general register rt.
An overflow exception occurs if carries out of bits 30 and 31 differ (2’s complement overflow). The
destination register rt is not modified when an integer overflow exception occurs.

Operation:
T:    GPR[rt] ← GPR[rs] + ((immediate_15)^16 || immediate_15..0)

Exceptions:
Integer overflow exception

ADDIU    Add Immediate Unsigned Word

Bits     31..26   25..21   20..16   15..0
Field    ADDIU    rs       rt       immediate
Value    001001
Width    6        5        5        16

Format:
ADDIU rt, rs, immediate

Description:
The 16-bit immediate is sign-extended and added to the contents of general register rs to form the
result. The result is placed into general register rt. No integer overflow exception occurs under any
circumstances.
The only difference between this instruction and the ADDI instruction is that ADDIU never causes
an overflow exception.

Operation:
T:    GPR[rt] ← GPR[rs] + ((immediate_15)^16 || immediate_15..0)

Exceptions:
None
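
Despite the name, ADDIU sign-extends its immediate exactly as ADDI does; "unsigned" only means that overflow is never trapped. A short C model makes this visible. The function name `addiu` is simply a label for this sketch, not a library routine.

```c
#include <assert.h>
#include <stdint.h>

/* Model of ADDIU rt, rs, immediate for a 32-bit register file.
   The immediate is sign-extended; the sum wraps modulo 2^32. */
static uint32_t addiu(uint32_t rs, uint16_t immediate) {
    uint32_t ext = (immediate & 0x8000u)
                 ? (0xFFFF0000u | immediate)   /* negative: upper bits set */
                 : (uint32_t)immediate;        /* non-negative */
    return rs + ext;                           /* no overflow exception */
}
```

An immediate of 0xFFFF therefore subtracts one from rs, and adding 1 to 0x7FFFFFFF silently yields 0x80000000.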

ADDU    Add Unsigned Word

Bits     31..26    25..21   20..16   15..11   10..6    5..0
Field    SPECIAL   rs       rt       rd       0        ADDU
Value    000000                               00000    100001
Width    6         5        5        5        5        6

Format:
ADDU rd, rs, rt

Description:
Add two 32-bit values and produce a 32-bit result; arithmetic overflow is ignored (does not cause
an exception).
The word value in general register rt is added to the word value in general register rs and the result
word value is placed into general register rd. ADDU differs from ADD only when an arithmetic
overflow occurs. If the addition results in 32-bit 2’s complement overflow (carries out of bits 30 and
31 differ), the result word value is placed into register rd and no exception occurs.

Operation:
T:    GPR[rd] ← GPR[rs] + GPR[rt]

Exceptions:
None

AND    And

Bits     31..26    25..21   20..16   15..11   10..6    5..0
Field    SPECIAL   rs       rt       rd       0        AND
Value    000000                               00000    100100
Width    6         5        5        5        5        6

Format:
AND rd, rs, rt

Description:
The contents of general register rs are combined with the contents of general register rt in a bit-wise
logical AND operation. The result is placed into general register rd.

Operation:
T:    GPR[rd] ← GPR[rs] and GPR[rt]

Exceptions:
None

ANDI    And Immediate

Bits     31..26   25..21   20..16   15..0
Field    ANDI     rs       rt       immediate
Value    001100
Width    6        5        5        16

Format:
ANDI rt, rs, immediate

Description:
The 16-bit immediate is zero-extended and combined with the contents of general register rs in a bitwise logical AND operation. The result is placed into general register rt.

Operation:
T:    GPR[rt] ← 0^16 || (immediate and GPR[rs]_15..0)

Exceptions:
None

BEQ    Branch On Equal

Bits     31..26   25..21   20..16   15..0
Field    BEQ      rs       rt       offset
Value    000100
Width    6        5        5        16

Format:
BEQ rs, rt, offset

Description:
A branch target address is computed from the sum of the address of the instruction in the delay slot
and the 16-bit offset, shifted left two bits and sign-extended. The contents of general register rs and
the contents of general register rt are compared. If the two registers are equal, then the program
branches to the target address, with a delay of one instruction.

Operation:
T:    target ← (offset_15)^14 || offset || 0^2
      condition ← (GPR[rs] = GPR[rt])
T+1:  if condition then
          PC ← PC + target
      endif

Exceptions:
None
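
The target calculation is the same for every conditional branch in this appendix, so it is worth spelling out once. The sketch below assumes a hypothetical helper `branch_target` that takes the address of the branch instruction itself; the delay slot is the following word.

```c
#include <assert.h>
#include <stdint.h>

/* Computes the address reached by a taken branch.
   branch_pc: address of the branch instruction (assumed word-aligned).
   offset:    the 16-bit offset field of the instruction. */
static uint32_t branch_target(uint32_t branch_pc, uint16_t offset) {
    uint32_t delay_slot = branch_pc + 4u;
    /* (offset_15)^14 || offset || 0^2 */
    uint32_t upper = (offset & 0x8000u) ? 0xFFFC0000u : 0u;
    uint32_t disp  = upper | ((uint32_t)offset << 2);
    return delay_slot + disp;
}
```

Because the displacement is relative to the delay slot, an offset of -1 (0xFFFF) branches back to the branch instruction itself, and an offset of 0 reaches the delay slot.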

BGEZ    Branch On Greater Than Or Equal To Zero

Bits     31..26   25..21   20..16   15..0
Field    REGIMM   rs       BGEZ     offset
Value    000001            00001
Width    6        5        5        16

Format:
BGEZ rs, offset

Description:
A branch target address is computed from the sum of the address of the instruction in the delay slot
and the 16-bit offset, shifted left two bits and sign-extended. If the contents of general register rs
have the sign bit cleared, then the program branches to the target address, with a delay of one
instruction.

Operation:
T:    target ← (offset_15)^14 || offset || 0^2
      condition ← (GPR[rs]_31 = 0)
T+1:  if condition then
          PC ← PC + target
      endif

Exceptions:
None

BGEZAL    Branch On Greater Than Or Equal To Zero And Link

Bits     31..26   25..21   20..16   15..0
Field    REGIMM   rs       BGEZAL   offset
Value    000001            10001
Width    6        5        5        16

Format:
BGEZAL rs, offset

Description:
A branch target address is computed from the sum of the address of the instruction in the delay slot
and the 16-bit offset, shifted left two bits and sign-extended. Unconditionally, the address of the
instruction after the delay slot is placed in the link register, r31. If the contents of general register
rs have the sign bit cleared, then the program branches to the target address, with a delay of one
instruction.
General register rs may not be general register 31, because such an instruction is not restartable. An
attempt to execute this instruction is not trapped, however.

Operation:
T:    target ← (offset_15)^14 || offset || 0^2
      condition ← (GPR[rs]_31 = 0)
      GPR[31] ← PC + 8
T+1:  if condition then
          PC ← PC + target
      endif

Exceptions:
None

BGTZ    Branch On Greater Than Zero

Bits     31..26   25..21   20..16   15..0
Field    BGTZ     rs       0        offset
Value    000111            00000
Width    6        5        5        16

Format:
BGTZ rs, offset

Description:
A branch target address is computed from the sum of the address of the instruction in the delay slot
and the 16-bit offset, shifted left two bits and sign-extended. The contents of general register rs are
compared to zero. If the contents of general register rs have the sign bit cleared and are not equal
to zero, then the program branches to the target address, with a delay of one instruction.

Operation:
T:    target ← (offset_15)^14 || offset || 0^2
      condition ← (GPR[rs]_31 = 0) and (GPR[rs] ≠ 0^32)
T+1:  if condition then
          PC ← PC + target
      endif

Exceptions:
None

BLEZ    Branch On Less Than Or Equal To Zero

Bits     31..26   25..21   20..16   15..0
Field    BLEZ     rs       0        offset
Value    000110            00000
Width    6        5        5        16

Format:
BLEZ rs, offset

Description:
A branch target address is computed from the sum of the address of the instruction in the delay slot
and the 16-bit offset, shifted left two bits and sign-extended. The contents of general register rs are
compared to zero. If the contents of general register rs have the sign bit set, or are equal to zero,
then the program branches to the target address, with a delay of one instruction.

Operation:
T:    target ← (offset_15)^14 || offset || 0^2
      condition ← (GPR[rs]_31 = 1) or (GPR[rs] = 0^32)
T+1:  if condition then
          PC ← PC + target
      endif

Exceptions:
None

BLTZ    Branch On Less Than Zero

Bits     31..26   25..21   20..16   15..0
Field    REGIMM   rs       BLTZ     offset
Value    000001            00000
Width    6        5        5        16

Format:
BLTZ rs, offset

Description:
A branch target address is computed from the sum of the address of the instruction in the delay slot
and the 16-bit offset, shifted left two bits and sign-extended. If the contents of general register rs
have the sign bit set, then the program branches to the target address, with a delay of one
instruction.

Operation:
T:    target ← (offset_15)^14 || offset || 0^2
      condition ← (GPR[rs]_31 = 1)
T+1:  if condition then
          PC ← PC + target
      endif

Exceptions:
None

BLTZAL    Branch On Less Than Zero And Link

Bits     31..26   25..21   20..16   15..0
Field    REGIMM   rs       BLTZAL   offset
Value    000001            10000
Width    6        5        5        16

Format:
BLTZAL rs, offset

Description:
A branch target address is computed from the sum of the address of the instruction in the delay slot
and the 16-bit offset, shifted left two bits and sign-extended. Unconditionally, the address of the
instruction after the delay slot is placed in the link register, r31. If the contents of general register
rs have the sign bit set, then the program branches to the target address, with a delay of one
instruction.
General register rs may not be general register 31, because such an instruction is not restartable. An
attempt to execute this instruction with register 31 specified as rs is not trapped, however.

Operation:
T:    target ← (offset_15)^14 || offset || 0^2
      condition ← (GPR[rs]_31 = 1)
      GPR[31] ← PC + 8
T+1:  if condition then
          PC ← PC + target
      endif

Exceptions:
None

BNE    Branch On Not Equal

Bits     31..26   25..21   20..16   15..0
Field    BNE      rs       rt       offset
Value    000101
Width    6        5        5        16

Format:
BNE rs, rt, offset

Description:
A branch target address is computed from the sum of the address of the instruction in the delay slot
and the 16-bit offset, shifted left two bits and sign-extended. The contents of general register rs and
the contents of general register rt are compared. If the two registers are not equal, then the program
branches to the target address, with a delay of one instruction.

Operation:
T:    target ← (offset_15)^14 || offset || 0^2
      condition ← (GPR[rs] ≠ GPR[rt])
T+1:  if condition then
          PC ← PC + target
      endif

Exceptions:
None

BREAK    Breakpoint

Bits     31..26    25..6    5..0
Field    SPECIAL   code     BREAK
Value    000000             001101
Width    6         20       6

Format:
BREAK

Description:
A breakpoint trap occurs, immediately and unconditionally transferring control to the exception
handler.
The code field is available for use as software parameters, but is retrieved by the exception handler
only by loading the contents of the memory word containing the instruction.

Operation:
T:    BreakpointException

Exceptions:
Breakpoint exception
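
Since the code field is only available by reading the instruction word back from memory, an exception handler typically extracts it with a shift and mask. The sketch below assumes the handler has already fetched the word; `is_break` and `break_code` are hypothetical helper names.

```c
#include <assert.h>
#include <stdint.h>

/* True if the word is a BREAK: op = SPECIAL (000000), funct = 001101. */
static int is_break(uint32_t instr) {
    return ((instr >> 26) & 0x3Fu) == 0x00u && (instr & 0x3Fu) == 0x0Du;
}

/* The 20-bit code field occupies bits 25..6. */
static uint32_t break_code(uint32_t instr) {
    return (instr >> 6) & 0xFFFFFu;
}
```

How an assembler places a BREAK operand within these 20 bits is toolchain specific, so the value returned here should be checked against the assembler in use.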

CFCz    Move Control From Coprocessor

Bits     31..26    25..21   20..16   15..11   10..0
Field    COPz      CF       rt       rd       0
Value    0100xx*   00010                      000 0000 0000
Width    6         5        5        5        11

Format:
CFCz rt, rd

Description:
The contents of coprocessor control register rd of coprocessor unit z are loaded into general register
rt.
This instruction is not valid for CP0.

Operation:
T:    data ← CCR[z,rd]
T+1:  GPR[rt] ← data

Exceptions:
Coprocessor unusable exception

*Opcode Bit Encoding:

         Bits 31..28    Bits 27..26         Bits 25..21
         Opcode         Coprocessor Unit    Coprocessor
                        Number              Suboperation
CFC1     0100           01                  00010
CFC2     0100           10                  00010
CFC3     0100           11                  00010

COPz    Coprocessor Operation

Bits     31..26    25    24..0
Field    COPz      CO    cofun
Value    0100xx*   1
Width    6         1     25

Format:
COPz cofun

Description:
A coprocessor operation is performed. The operation may specify and reference internal
coprocessor registers, and may change the state of the coprocessor condition line, but does not
modify state within the processor or the cache/memory system. Details of coprocessor operations
are contained in other appendices.

Operation:
T:    CoprocessorOperation (z, cofun)

Exceptions:
Coprocessor unusable exception
Coprocessor interrupt or Floating-Point Exception (CP1 only for some processors)

*Opcode Bit Encoding:

         Bits 31..28    Bits 27..26         Bit 25
         Opcode         Coprocessor Unit    CO sub-opcode
                        Number              (see end of Appendix A)
COP0     0100           00                  1
COP1     0100           01                  1
COP2     0100           10                  1
COP3     0100           11                  1

CTCz    Move Control to Coprocessor

Bits     31..26    25..21   20..16   15..11   10..0
Field    COPz      CT       rt       rd       0
Value    0100xx*   00110                      000 0000 0000
Width    6         5        5        5        11

Format:
CTCz rt, rd

Description:
The contents of general register rt are loaded into control register rd of coprocessor unit z.
This instruction is not valid for CP0.

Operation:
T:    data ← GPR[rt]
T+1:  CCR[z,rd] ← data

Exceptions:
Coprocessor unusable

DIV    Divide Word

Bits     31..26    25..21   20..16   15..6          5..0
Field    SPECIAL   rs       rt       0              DIV
Value    000000                      00 0000 0000   011010
Width    6         5        5        10             6

Format:
DIV rs, rt

Description:
The contents of general register rs are divided by the contents of general register rt, treating both
operands as 2’s complement values. No overflow exception occurs under any circumstances, and
the result of this operation is undefined when the divisor is zero.
This instruction is typically followed by additional instructions to check for a zero divisor and for
overflow.
When the operation completes, the quotient word of the double result is loaded into special register
LO, and the remainder word of the double result is loaded into special register HI.
If either of the two preceding instructions is MFHI or MFLO, the results of those instructions are
undefined. Correct operation requires separating reads of HI or LO from writes by two or more
instructions.

Operation:
T-2:  LO ← undefined
      HI ← undefined
T-1:  LO ← undefined
      HI ← undefined
T:    LO ← GPR[rs] div GPR[rt]
      HI ← GPR[rs] mod GPR[rt]

Exceptions:
None
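
Because DIV never traps, the checks mentioned in the description have to be made in software. The following C sketch shows the two conditions a caller would test. It assumes that the quotient truncates toward zero and the remainder takes the sign of the dividend (C99 semantics), which should be confirmed against the CPU documentation; `checked_div` and `div_status` are names invented for the example.

```c
#include <assert.h>
#include <stdint.h>

typedef enum { DIV_OK, DIV_BY_ZERO, DIV_OVERFLOW } div_status;

/* Models DIV rs, rt together with the software checks that usually follow.
   On DIV_OK, *lo receives the quotient and *hi the remainder.
   In the two error cases the outputs are left untouched, mirroring the
   fact that the hardware result is not to be relied upon. */
static div_status checked_div(int32_t rs, int32_t rt,
                              int32_t *lo, int32_t *hi) {
    if (rt == 0)
        return DIV_BY_ZERO;                 /* result undefined */
    if (rs == INT32_MIN && rt == -1)
        return DIV_OVERFLOW;                /* quotient does not fit */
    *lo = rs / rt;                          /* LO: quotient */
    *hi = rs % rt;                          /* HI: remainder */
    return DIV_OK;
}
```

The only overflow case for signed 32-bit division is the most negative value divided by -1, which is why a single comparison pair is sufficient.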

DIVU    Divide Unsigned Word

Bits     31..26    25..21   20..16   15..6          5..0
Field    SPECIAL   rs       rt       0              DIVU
Value    000000                      00 0000 0000   011011
Width    6         5        5        10             6

Format:
DIVU rs, rt

Description:
The contents of general register rs are divided by the contents of general register rt, treating both
operands as unsigned values. No integer overflow exception occurs under any circumstances, and
the result of this operation is undefined when the divisor is zero.
On processors with 64-bit registers the operands must be valid sign-extended 32-bit values. If they
are not, the result is undefined.
This instruction is typically followed by additional instructions to check for a zero divisor.
When the operation completes, the quotient word of the double result is loaded into special register
LO, and the remainder word of the double result is loaded into special register HI.
If either of the two preceding instructions is MFHI or MFLO, the results of those instructions are
undefined. Correct operation requires separating reads of HI or LO from writes by two or more
instructions.

Operation:
T-2:  LO ← undefined
      HI ← undefined
T-1:  LO ← undefined
      HI ← undefined
T:    LO ← (0 || GPR[rs]) div (0 || GPR[rt])
      HI ← (0 || GPR[rs]) mod (0 || GPR[rt])

Exceptions:
None

J    Jump

Bits     31..26   25..0
Field    J        target
Value    000010
Width    6        26

Format:
J target

Description:
The 26-bit target address is shifted left two bits and combined with the high-order bits of the
address of the delay slot. The program unconditionally jumps to this calculated address with a
delay of one instruction.

Operation:
T:    temp ← target
T+1:  PC ← PC_31..28 || temp || 0^2

Exceptions:
None
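
The jump address is formed by replacing the low 28 bits of the delay slot address, so the jump stays within the current 256 MB region. A minimal C sketch follows; `jump_target` is a hypothetical helper and `jump_pc` is assumed to be the address of the J instruction itself.

```c
#include <assert.h>
#include <stdint.h>

/* Computes the destination of J (or JAL).
   jump_pc:  address of the jump instruction (assumed word-aligned).
   target26: the 26-bit target field of the instruction. */
static uint32_t jump_target(uint32_t jump_pc, uint32_t target26) {
    uint32_t delay_slot = jump_pc + 4u;
    /* PC_31..28 || target || 0^2 */
    return (delay_slot & 0xF0000000u) | ((target26 & 0x03FFFFFFu) << 2);
}
```

Note that the upper four bits come from the delay slot, not from the jump: a jump placed in the last word of a 256 MB region lands in the following region.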

JAL    Jump And Link

Bits     31..26   25..0
Field    JAL      target
Value    000011
Width    6        26

Format:
JAL target

Operation:
T:    temp ← target
      GPR[31] ← PC + 8
T+1:  PC ← PC_31..28 || temp || 0^2

Exceptions:
None

JALR    Jump And Link Register

Bits     31..26    25..21   20..16   15..11   10..6    5..0
Field    SPECIAL   rs       0        rd       0        JALR
Value    000000             00000             00000    001001
Width    6         5        5        5        5        6

Format:
JALR rs
JALR rd, rs

Description:
The program unconditionally jumps to the address contained in general register rs, with a delay of
one instruction. The address of the instruction after the delay slot is placed in general register rd.
The default value of rd, if omitted in the assembly language instruction, is 31.
Register specifiers rs and rd may not be equal, because such an instruction does not have the same
effect when re-executed. However, an attempt to execute this instruction is not trapped, and the
result of executing such an instruction is undefined.
A Jump and Link Register instruction that uses a register whose low-order 2 bits are non-zero, or
specifies an address outside of the accessible address space, causes an Address Error Exception
when the jump is executed. The Exception PC points to the location of the Jump instruction causing
the error, and the instruction in the delay slot is not executed. If desired, system software can
emulate the delay instruction and advance the PC to the target of the jump before delivering the
exception to the user program.

Operation:
T:    temp ← GPR[rs]
      GPR[rd] ← PC + 8
T+1:  PC ← temp

Exceptions:
Address error exception

JR    Jump Register

Bits     31..26    25..21   20..6                5..0
Field    SPECIAL   rs       0                    JR
Value    000000             000 0000 0000 0000   001000
Width    6         5        15                   6

Format:
JR rs

Description:
The program unconditionally jumps to the address contained in general register rs, with a delay of
one instruction.
Since instructions must be word-aligned, a Jump Register instruction must specify a target register
(rs) whose two low-order bits are zero. If these low-order bits are not zero, an address exception
will occur when the jump target instruction is subsequently fetched.

Operation:
T:    temp ← GPR[rs]
T+1:  PC ← temp

Exceptions:
Address error exception

LB    Load Byte

Bits     31..26   25..21   20..16   15..0
Field    LB       base     rt       offset
Value    100000
Width    6        5        5        16

Format:
LB rt, offset(base)

Description:
The 16-bit offset is sign-extended and added to the contents of general register base to form a virtual
address. The contents of the byte at the memory location specified by the effective address are sign-extended and loaded into general register rt.

Operation:
T:    vAddr ← ((offset_15)^16 || offset_15..0) + GPR[base]
      (pAddr, uncached) ← AddressTranslation (vAddr, DATA)
      pAddr ← pAddr_PSIZE-1..2 || (pAddr_1..0 xor ReverseEndian^2)
      mem ← LoadMemory (uncached, BYTE, pAddr, vAddr, DATA)
      byte ← vAddr_1..0 xor BigEndianCPU^2
T+1:  GPR[rt] ← (mem_7+8*byte)^24 || mem_7+8*byte..8*byte

Exceptions:
TLB refill exception
TLB invalid exception
Bus error exception
Address error exception
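
The byte selection in the Operation section can be illustrated in C. The sketch below models only the last two steps (choosing the byte within the word returned by LoadMemory and extending it) and assumes the word is already available as a 32-bit value; the helper names are invented for the example.

```c
#include <assert.h>
#include <stdint.h>

/* byte <- vAddr_1..0 xor BigEndianCPU^2
   Little-endian: byte 0 is bits 7..0. Big-endian: byte 0 is bits 31..24. */
static uint32_t byte_index(uint32_t vaddr, int big_endian_cpu) {
    return (vaddr & 3u) ^ (big_endian_cpu ? 3u : 0u);
}

/* LB: select the byte and sign-extend it to 32 bits. */
static uint32_t lb_extract(uint32_t mem, uint32_t vaddr, int big_endian_cpu) {
    uint32_t b = (mem >> (8u * byte_index(vaddr, big_endian_cpu))) & 0xFFu;
    return (b & 0x80u) ? (b | 0xFFFFFF00u) : b;
}

/* LBU: select the byte and zero-extend it. */
static uint32_t lbu_extract(uint32_t mem, uint32_t vaddr, int big_endian_cpu) {
    return (mem >> (8u * byte_index(vaddr, big_endian_cpu))) & 0xFFu;
}
```

With the word 0x80FF7F01, address offset 0 selects 0x01 on a little-endian CPU and 0x80 on a big-endian CPU; LB returns the latter as 0xFFFFFF80, LBU as 0x00000080.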

LBU    Load Byte Unsigned

Bits     31..26   25..21   20..16   15..0
Field    LBU      base     rt       offset
Value    100100
Width    6        5        5        16

Format:
LBU rt, offset(base)

Operation:
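T:    vAddr ← ((offset_15)^16 || offset_15..0) + GPR[base]
      (pAddr, uncached) ← AddressTranslation (vAddr, DATA)
      pAddr ← pAddr_PSIZE-1..2 || (pAddr_1..0 xor ReverseEndian^2)
      mem ← LoadMemory (uncached, BYTE, pAddr, vAddr, DATA)
      byte ← vAddr_1..0 xor BigEndianCPU^2
T+1:  GPR[rt] ← 0^24 || mem_7+8*byte..8*byte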

Exceptions:
TLB refill exception
TLB invalid exception
Bus error exception
Address error exception

LH    Load Halfword

Bits     31..26   25..21   20..16   15..0
Field    LH       base     rt       offset
Value    100001
Width    6        5        5        16

Format:
LH rt, offset(base)

Description:
The 16-bit offset is sign-extended and added to the contents of general register base to form a virtual
address. The contents of the halfword at the memory location specified by the effective address are
sign-extended and loaded into general register rt.
If the least-significant bit of the effective address is non-zero, an address error exception occurs.

Operation:
T:

vAddr ← ((offset15)16 || offset15...0) + GPR[base]
(pAddr, uncached) ← AddressTranslation (vAddr, DATA)
pAddr ← pAddrPSIZE – 1...2 || (pAddr1...0 xor (ReverseEndian || 0))
mem ← LoadMemory (uncached, HALFWORD, pAddr, vAddr, DATA)
byte ← vAddr1...0 xor (BigEndianCPU || 0)

T+1: GPR[rt] ← (mem15+8*byte)16 || mem15+8*byte...8* byte

Exceptions:
TLB refill exception
TLB invalid exception
Bus error exception
Address error exception

A–34

LHU    Load Halfword Unsigned

Bits     31..26    25..21    20..16    15..0
Field    LHU       base      rt        offset
Value    100101
Width    6         5         5         16

Format:
LHU rt, offset(base)

Description:
The 16-bit offset is sign-extended and added to the contents of general register base to form a virtual
address. The contents of the halfword at the memory location specified by the effective address are
zero-extended and loaded into general register rt.
If the least-significant bit of the effective address is non-zero, an address error exception occurs.

Exceptions:
TLB refill exception
TLB invalid exception
Bus error exception
Address error exception
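
The byte and halfword loads differ only in alignment rules and in how the loaded value is widened to 32 bits. The following is a minimal sketch of that behavior, not the processor's implementation: it assumes a big-endian byte array standing in for memory, and the helper names load_byte and load_half are hypothetical. The address error exception is modeled with assert.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model: a big-endian byte array stands in for memory. */

/* LB/LBU: any byte address is legal; only the extension differs. */
static uint32_t load_byte(const uint8_t *mem, uint32_t addr, int is_signed)
{
    uint32_t b = mem[addr];
    if (is_signed && (b & 0x80u))
        b |= 0xFFFFFF00u;          /* LB: replicate bit 7 into bits 31..8 */
    return b;                      /* LBU: bits 31..8 stay zero */
}

/* LH/LHU: the address must be halfword aligned; a real CPU would take
 * an address error exception, which this sketch models with assert. */
static uint32_t load_half(const uint8_t *mem, uint32_t addr, int is_signed)
{
    assert((addr & 1u) == 0);
    uint32_t h = ((uint32_t)mem[addr] << 8) | mem[addr + 1];
    if (is_signed && (h & 0x8000u))
        h |= 0xFFFF0000u;          /* LH: replicate bit 15 into bits 31..16 */
    return h;                      /* LHU: bits 31..16 stay zero */
}
```

Passing a non-zero is_signed selects the LB or LH behavior; zero selects LBU or LHU.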

A–35

LUI    Load Upper Immediate

Bits     31..26    25..21    20..16    15..0
Field    LUI       0         rt        immediate
Value    001111    00000
Width    6         5         5         16

Format:
LUI rt, immediate

Description:
The 16-bit immediate is shifted left 16 bits and concatenated with 16 bits of low-order zeros. The
32-bit result is then placed into general register rt. If rt is a 64-bit register, then the result is sign
extended.

Operation:
GPR[rt] ← immediate || 016

Exceptions:
None

A–36

LW    Load Word

Bits     31..26    25..21    20..16    15..0
Field    LW        base      rt        offset
Value    100011
Width    6         5         5         16

Format:
LW rt, offset(base)

Description:
The 16-bit offset is sign-extended and added to the contents of general register base to form a virtual
address. The contents of the word at the memory location specified by the effective address are
loaded into general register rt. In 64-bit mode, the loaded word is sign-extended.
If either of the two least-significant bits of the effective address is non-zero, an address error
exception occurs.

Operation:
T:

vAddr ← ((offset15)16 || offset15...0) + GPR[base]
(pAddr, uncached) ← AddressTranslation (vAddr, DATA)
mem ← LoadMemory (uncached, WORD, pAddr, vAddr, DATA)

T+1: GPR[rt] ← mem

Exceptions:
TLB refill exception
TLB invalid exception
Bus error exception
Address error exception

A–37

LWCz    Load Word To Coprocessor

Bits     31..26     25..21    20..16    15..0
Field    LWCz       base      rt        offset
Value    1100xx*
Width    6          5         5         16

Format:
LWCz rt, offset(base)

Description:
The 16-bit offset is sign-extended and added to the contents of general register base to form a virtual
address. The processor reads a word from the addressed memory location, and makes the data
available to coprocessor unit z.
The manner in which each coprocessor uses the data is defined by the individual coprocessor
specifications.
If either of the two least-significant bits of the effective address is non-zero, an address error
exception occurs.
This instruction is not valid for use with CP0.

A–38


Operation:
T:

vAddr ← ((offset15)16 || offset15...0) + GPR[base]
(pAddr, uncached) ← AddressTranslation (vAddr, DATA)
byte ← vAddr1...0
mem ← LoadMemory (uncached, WORD, pAddr, vAddr, DATA)

T+1: COPzLW (rt, mem)

Exceptions:
TLB refill exception
TLB invalid exception
Bus error exception
Address error exception
Coprocessor unusable exception

*Opcode Bit Encoding:

          Bits 31..26
LWC1      110001
LWC2      110010
LWC3      110011

Bits 31..28 hold the opcode (1100); bits 27..26 hold the coprocessor unit number.

A–39

LWL    Load Word Left

Bits     31..26    25..21    20..16    15..0
Field    LWL       base      rt        offset
Value    100010
Width    6         5         5         16

Format:
LWL rt, offset(base)

Description:
This instruction can be used in combination with the LWR instruction to load a register with four
consecutive bytes from memory, when the bytes cross a word boundary. LWL loads the left
portion of the register with the appropriate part of the high-order word; LWR loads the right
portion of the register with the appropriate part of the low-order word.
The LWL instruction adds its sign-extended 16-bit offset to the contents of general register base to
form a virtual address which can specify an arbitrary byte. It reads bytes only from the word in
memory which contains the specified starting byte. From one to four bytes will be loaded,
depending on the starting byte specified.
Conceptually, it starts at the specified byte in memory and loads that byte into the high-order
(left-most) byte of the register; then it loads bytes from memory into the register until it reaches
the low-order byte of the word in memory. The least-significant (right-most) byte(s) of the register
will not be changed.
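[Figure: big-endian memory, with bytes 0 to 3 in the word at address 0 and bytes 4 to 7 in the word
at address 4, and register $24 before and after LWL $24,1($0). After the instruction, the three
high-order bytes of $24 hold memory bytes 1, 2, and 3; the low-order byte is unchanged.]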

A–40


The contents of general register rt are internally bypassed within the processor so that no NOP is
needed between an immediately preceding load instruction which specifies register rt and a
following LWL (or LWR) instruction which also specifies register rt.
No address exceptions due to alignment are possible.

Operation:
T:

vAddr ← ((offset15)16 || offset15...0) + GPR[base]
(pAddr, uncached) ← AddressTranslation (vAddr, DATA)
pAddr ← pAddrPSIZE–1...2 || (pAddr1...0 xor ReverseEndian2)
if BigEndianMem = 0 then
pAddr ← pAddrPSIZE–1...2 || 02
endif
byte ← vAddr1...0 xor BigEndianCPU2
mem ← LoadMemory (uncached, byte, pAddr, vAddr, DATA)
GPR[rt] ← mem7+8*byte...0 || GPR[rt]23–8*byte...0

A–41

Given a doubleword in a register and a doubleword in memory, the operation of LWL is as follows.
In the table, E to H denote bytes of the register and I to P denote bytes of memory.

             BigEndianCPU = 0                        BigEndianCPU = 1
vAddr2..0    destination   type   offset             destination   type   offset
                                  LEM   BEM                               LEM   BEM
0            S P F G H     0      0     7            S I J K L     3      4     0
1            S O P G H     1      0     6            S J K L H     2      4     1
2            S N O P H     2      0     5            S K L G H     1      4     2
3            S M N O P     3      0     4            S L F G H     0      4     3
4            S L F G H     0      4     3            S M N O P     3      0     4
5            S K L G H     1      4     2            S N O P H     2      0     5
6            S J K L H     2      4     1            S O P G H     1      0     6
7            S I J K L     3      4     0            S P F G H     0      0     7

LEM      Little-endian memory (BigEndianMem = 0)
BEM      BigEndianMem = 1
Type     AccessType sent to memory
Offset   pAddr2...0 sent to memory
S        sign-extend of destination31

Exceptions:
TLB refill exception
TLB invalid exception
Bus error exception
Address error exception

A–42

LWR    Load Word Right

Bits     31..26    25..21    20..16    15..0
Field    LWR       base      rt        offset
Value    100110
Width    6         5         5         16

Format:
LWR rt, offset(base)

Description:
This instruction can be used in combination with the LWL instruction to load a register with four
consecutive bytes from memory, when the bytes cross a word boundary. LWR loads the right
portion of the destination register rt with the appropriate part of the low-order word; LWL loads
the left portion of the register with the appropriate part of the high-order word.
The LWR instruction adds its sign-extended 16-bit offset to the contents of general register base to
form a virtual address which can specify an arbitrary byte. It loads bytes only from the word in
memory which contains the specified starting byte. From one to four bytes will be merged into the
destination register rt, depending on the starting byte specified.
Conceptually, it starts at the specified byte in memory and loads that byte into the low-order
(right-most) byte of the register; then it loads bytes from memory into the register until it reaches
the high-order byte of the word in memory. The most significant (left-most) byte(s) of the register
will not be changed.
The contents of general register rt are internally bypassed within the processor so that no NOP is
needed between an immediately preceding load instruction which specifies register rt and a
following LWR (or LWL) instruction which also specifies register rt.
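[Figure: big-endian memory, with bytes 0 to 3 in the word at address 0 and bytes 4 to 7 in the word
at address 4, and register $24 before and after LWR $24,4($0). After the instruction, the low-order
byte of $24 holds memory byte 4; the three high-order bytes are unchanged.]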
No address exceptions due to alignment are possible.
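
The way LWL and LWR combine to load an unaligned word can be sketched in C. This is an illustrative model only, assuming a big-endian CPU and a big-endian byte array standing in for memory; the names be_word, lwl_be, lwr_be, and load_unaligned_be are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read the aligned big-endian word that contains byte address addr. */
static uint32_t be_word(const uint8_t *mem, uint32_t addr)
{
    assert(mem != NULL);
    const uint8_t *p = mem + (addr & ~3u);
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* LWL, big-endian: bytes from addr to the end of its word fill rt from the left. */
static uint32_t lwl_be(uint32_t rt, const uint8_t *mem, uint32_t addr)
{
    unsigned shift = 8u * (addr & 3u);
    uint32_t keep = (1u << shift) - 1u;      /* low-order bytes left unchanged */
    return (be_word(mem, addr) << shift) | (rt & keep);
}

/* LWR, big-endian: bytes from the start of the word up to addr fill rt from the right. */
static uint32_t lwr_be(uint32_t rt, const uint8_t *mem, uint32_t addr)
{
    unsigned shift = 8u * (3u - (addr & 3u));
    uint32_t loaded = 0xFFFFFFFFu >> shift;  /* register bytes taken from memory */
    return (rt & ~loaded) | (be_word(mem, addr) >> shift);
}

/* Unaligned word load: LWL at the first byte, LWR at the last byte. */
static uint32_t load_unaligned_be(const uint8_t *mem, uint32_t addr)
{
    uint32_t rt = 0;
    rt = lwl_be(rt, mem, addr);
    rt = lwr_be(rt, mem, addr + 3u);
    return rt;
}
```

When addr is already word aligned, both steps read the same word and each loads all four bytes, so the pair is still correct, just redundant.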

A–43


Operation:
T: vAddr ← ((offset15)16 || offset15..0) + GPR[base]
(pAddr, uncached) ← AddressTranslation (vAddr, DATA)
pAddr ← pAddrPSIZE–1..2 || (pAddr1..0 xor ReverseEndian2)
if BigEndianMem = 0 then
pAddr ← pAddrPSIZE–1..2 || 02
endif
byte ← vAddr1..0 xor BigEndianCPU2
mem ← LoadMemory (uncached, byte, pAddr, vAddr, DATA)
GPR[rt] ← GPR[rt]31..32–8*byte || mem31..8*byte

Given a word in a register and a word in memory, the operation of LWR is as follows.
In the table, E to H denote bytes of the register and I to P denote bytes of memory.

             BigEndianCPU = 0                        BigEndianCPU = 1
vAddr2..0    destination   type   offset             destination   type   offset
                                  LEM   BEM                               LEM   BEM
0            S M N O P     0      0     4            X E F G I     0      7     0
1            X E M N O     1      1     4            X E F I J     1      6     0
2            X E F M N     2      2     4            X E I J K     2      5     0
3            X E F G M     3      3     4            S I J K L     3      4     0
4            S I J K L     0      4     0            X E F G M     0      3     4
5            X E I J K     1      5     0            X E F M N     1      2     4
6            X E F I J     2      6     0            X E M N O     2      1     4
7            X E F G I     3      7     0            S M N O P     3      0     4

LEM      Little-endian memory (BigEndianMem = 0)
BEM      BigEndianMem = 1
Type     AccessType sent to memory
Offset   pAddr2...0 sent to memory
S        sign-extend of destination31
X        unchanged or sign-extend of destination31

Exceptions:
TLB refill exception
TLB invalid exception
Bus error exception
Address error exception

A–44

MFCz    Move From Coprocessor

Bits     31..26     25..21    20..16    15..11    10..0
Field    COPz       MF        rt        rd        0
Value    0100xx*    00000                         000 0000 0000
Width    6          5         5         5         11

Format:
MFCz rt, rd

Description:
The contents of coprocessor register rd of coprocessor z are loaded into general register rt.

Operation:
T:

data ← CPR[z,rd]

T+1: GPR[rt] ← data

Exceptions:
Coprocessor unusable exception
Reserved instruction exception (coprocessor 3)

*Opcode Bit Encoding:

          Bits 31..26    Bits 25..21
MFC0      010000         00000
MFC1      010001         00000
MFC2      010010         00000
MFC3      010011         00000

Bits 31..28 hold the opcode (0100), bits 27..26 the coprocessor unit number, and bits 25..21 the
coprocessor suboperation (MF).

A–45

MFHI    Move From HI

Bits     31..26     25..16          15..11    10..6     5..0
Field    SPECIAL    0               rd        0         MFHI
Value    000000     00 0000 0000              00000     010000
Width    6          10              5         5         6

Format:
MFHI rd

Description:
The contents of special register HI are loaded into general register rd.
To ensure proper operation in the event of interruptions, the two instructions which follow a MFHI
instruction may not be any of the instructions which modify the HI register: MULT, MULTU, DIV,
DIVU, MTHI.

Operation:
T:

GPR[rd] ← HI

Exceptions:
None

A–46

MFLO    Move From LO

Bits     31..26     25..16          15..11    10..6     5..0
Field    SPECIAL    0               rd        0         MFLO
Value    000000     00 0000 0000              00000     010010
Width    6          10              5         5         6

Format:
MFLO rd

Description:
The contents of special register LO are loaded into general register rd.
To ensure proper operation in the event of interruptions, the two instructions which follow a MFLO
instruction may not be any of the instructions which modify the LO register: MULT, MULTU, DIV,
DIVU, MTLO.

Operation:
T:

GPR[rd] ← LO

Exceptions:
None

A–47

MTCz    Move To Coprocessor

Bits     31..26     25..21    20..16    15..11    10..0
Field    COPz       MT        rt        rd        0
Value    0100xx*    00100                         000 0000 0000
Width    6          5         5         5         11

Format:
MTCz rt, rd

Description:
The contents of general register rt are loaded into coprocessor register rd of coprocessor z.

Operation:
T:   data ← GPR[rt]
T+1: CPR[z,rd] ← data

Exceptions:
Coprocessor unusable exception

*Opcode Bit Encoding:

          Bits 31..26    Bits 25..21
MTC0      010000         00100
MTC1      010001         00100
MTC2      010010         00100
MTC3      010011         00100

Bits 31..28 hold the opcode (0100), bits 27..26 the coprocessor unit number, and bits 25..21 the
coprocessor suboperation (MT).

A–48

MTHI    Move To HI

Bits     31..26     25..21    20..6                  5..0
Field    SPECIAL    rs        0                      MTHI
Value    000000               000 0000 0000 0000     010001
Width    6          5         15                     6

Format:
MTHI rs

Description:
The contents of general register rs are loaded into special register HI.
Instructions that write to the HI and LO registers are not interlocked and serialized; a result written
to the HI/LO pair must be read before another result is written. If a MTHI operation is executed
following a MULT, MULTU, DIV, or DIVU instruction, but before any MFLO, MFHI, MTLO, or
MTHI instructions, the contents of the companion special register LO are undefined.

Operation:
T–2: HI ← undefined
T–1: HI ← undefined
T:

HI ← GPR[rs]

Exceptions:
None

A–49

MTLO    Move To LO

Bits     31..26     25..21    20..6                  5..0
Field    SPECIAL    rs        0                      MTLO
Value    000000               000 0000 0000 0000     010011
Width    6          5         15                     6

Format:
MTLO rs

Description:
The contents of general register rs are loaded into special register LO.
Instructions that write to the HI and LO registers are not interlocked and serialized; a result written
to the HI/LO pair must be read before another result is written. If a MTLO operation is executed
following a MULT, MULTU, DIV, or DIVU instruction, but before any MFLO, MFHI, MTLO, or
MTHI instructions, the contents of the companion special register HI are undefined.

Operation:
T–2: LO ← undefined
T–1: LO ← undefined
T:

LO ← GPR[rs]

Exceptions:
None

A–50

MULT    Multiply Word

Bits     31..26     25..21    20..16    15..6           5..0
Field    SPECIAL    rs        rt        0               MULT
Value    000000                         00 0000 0000    011000
Width    6          5         5         10              6

Format:
MULT rs, rt

Description:
The contents of general registers rs and rt are multiplied, treating both operands as 32-bit 2’s
complement values. No integer overflow exception occurs under any circumstances.
When the operation completes, the low-order word of the double result is loaded into special
register LO, and the high-order word of the double result is loaded into special register HI.
If either of the two preceding instructions is MFHI or MFLO, the results of these instructions are
undefined. Correct operation requires separating reads of HI or LO from writes by a minimum of
two other instructions.

Operation:
T–2: LO ← undefined
     HI ← undefined
T–1: LO ← undefined
     HI ← undefined
T:   t  ← GPR[rs] * GPR[rt]
     LO ← t31...0
     HI ← t63...32

Exceptions:
None

A–51

MULTU    Multiply Unsigned Word

Bits     31..26     25..21    20..16    15..6           5..0
Field    SPECIAL    rs        rt        0               MULTU
Value    000000                         00 0000 0000    011001
Width    6          5         5         10              6

Format:
MULTU rs, rt

Description:
The contents of general register rs and the contents of general register rt are multiplied, treating
both operands as unsigned values. No overflow exception occurs under any circumstances.
When the operation completes, the low-order word of the double result is loaded into special
register LO, and the high-order word of the double result is loaded into special register HI.
If either of the two preceding instructions is MFHI or MFLO, the results of these instructions are
undefined. Correct operation requires separating reads of HI or LO from writes by a minimum of
two instructions.

Operation:
T–2: LO ← undefined
     HI ← undefined
T–1: LO ← undefined
     HI ← undefined
T:   t  ← (0 || GPR[rs]) * (0 || GPR[rt])
     LO ← t31...0
     HI ← t63...32

Exceptions:
None
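
The difference between MULT and MULTU lies only in how the two 32-bit operands are widened before the 64-bit product is split into HI and LO. A minimal C sketch of that split follows; struct hilo and the function names are hypothetical, and the signed casts assume the usual two's-complement conversion behavior of C compilers.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the HI/LO register pair. */
struct hilo { uint32_t hi, lo; };

/* MULT: both operands are treated as 32-bit two's-complement values. */
static struct hilo mult(uint32_t rs, uint32_t rt)
{
    int64_t t = (int64_t)(int32_t)rs * (int64_t)(int32_t)rt;
    struct hilo r = { (uint32_t)((uint64_t)t >> 32), (uint32_t)t };
    return r;
}

/* MULTU: both operands are zero-extended before multiplying. */
static struct hilo multu(uint32_t rs, uint32_t rt)
{
    uint64_t t = (uint64_t)rs * (uint64_t)rt;
    struct hilo r = { (uint32_t)(t >> 32), (uint32_t)t };
    /* HI:LO together always hold the full product; no overflow is possible. */
    assert((((uint64_t)r.hi << 32) | r.lo) == t);
    return r;
}
```

For the operands 0xFFFFFFFF and 2, the LO word is the same for both instructions and only HI differs, which is why software that needs only the low 32 bits of a product can use either.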

A–52

NOR    Nor

Bits     31..26     25..21    20..16    15..11    10..6     5..0
Field    SPECIAL    rs        rt        rd        0         NOR
Value    000000                                   00000     100111
Width    6          5         5         5         5         6

Format:
NOR rd, rs, rt

Description:
The contents of general register rs are combined with the contents of general register rt in a bit-wise
logical NOR operation. The result is placed into general register rd.

Operation:
T:

GPR[rd] ← GPR[rs] nor GPR[rt]

Exceptions:
None

A–53

OR    Or

Bits     31..26     25..21    20..16    15..11    10..6     5..0
Field    SPECIAL    rs        rt        rd        0         OR
Value    000000                                   00000     100101
Width    6          5         5         5         5         6
6

Format:
OR rd, rs, rt

Description:
The contents of general register rs are combined with the contents of general register rt in
a bit-wise logical OR operation. The result is placed into general register rd.

Operation:
T:

GPR[rd] ← GPR[rs] or GPR[rt]

Exceptions:
None

A–54

ORI    Or Immediate

Bits     31..26    25..21    20..16    15..0
Field    ORI       rs        rt        immediate
Value    001101
Width    6         5         5         16

Format:
ORI rt, rs, immediate

Description:
The 16-bit immediate is zero-extended and combined with the contents of general register rs in a bitwise logical OR operation. The result is placed into general register rt.

Operation:
T:

GPR[rt] ← GPR[rs]31...16 || (immediate or GPR[rs]15...0)

Exceptions:
None
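
Because ORI zero-extends its immediate, it pairs naturally with LUI to build an arbitrary 32-bit constant in two instructions. The sketch below models that pairing in C; the function names are hypothetical, and load_const32 illustrates a common assembler expansion rather than anything this manual specifies.

```c
#include <assert.h>
#include <stdint.h>

/* LUI: the immediate becomes bits 31..16; bits 15..0 are zero. */
static uint32_t lui(uint16_t imm)
{
    return (uint32_t)imm << 16;
}

/* ORI: the immediate is zero-extended, so bits 31..16 of rs pass through. */
static uint32_t ori(uint32_t rs, uint16_t imm)
{
    return rs | (uint32_t)imm;
}

/* Hypothetical helper: build a 32-bit constant as LUI followed by ORI. */
static uint32_t load_const32(uint32_t value)
{
    uint32_t r = lui((uint16_t)(value >> 16));
    r = ori(r, (uint16_t)(value & 0xFFFFu));
    assert(r == value);   /* holds because ORI does not sign-extend */
    return r;
}
```

If ORI sign-extended its immediate the way ADDI does, a low half with bit 15 set would corrupt the upper half, and the upper immediate would need adjusting to compensate.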

A–55

SB    Store Byte

Bits     31..26    25..21    20..16    15..0
Field    SB        base      rt        offset
Value    101000
Width    6         5         5         16

Format:
SB rt, offset(base)

Description:
The 16-bit offset is sign-extended and added to the contents of general register base to form a virtual
address. The least-significant byte of register rt is stored at the effective address.

Operation:
T:

vAddr ← ((offset15)16 || offset15...0) + GPR[base]
(pAddr, uncached) ← AddressTranslation (vAddr, DATA)
pAddr ← pAddrPSIZE-1...2 || (pAddr1...0 xor ReverseEndian2)
byte ← vAddr1...0 xor BigEndianCPU2
data ← GPR[rt]31–8*byte...0 || 08*byte
StoreMemory (uncached, BYTE, data, pAddr, vAddr, DATA)

Exceptions:
TLB refill exception
TLB invalid exception
TLB modification exception
Bus error exception
Address error exception

A–56

SH    Store Halfword

Bits     31..26    25..21    20..16    15..0
Field    SH        base      rt        offset
Value    101001
Width    6         5         5         16

Format:
SH rt, offset(base)

Description:
The 16-bit offset is sign-extended and added to the contents of general register base to form an
unsigned effective address. The least-significant halfword of register rt is stored at the effective
address. If the least-significant bit of the effective address is non-zero, an address error exception
occurs.

Operation:
T:

vAddr ← ((offset15)16 || offset15...0) + GPR[base]
(pAddr, uncached) ← AddressTranslation (vAddr, DATA)
pAddr ← pAddrPSIZE-1...2 || (pAddr1...0 xor (ReverseEndian || 0))
byte ← vAddr1...0 xor (BigEndianCPU || 0)
data ← GPR[rt]31–8*byte...0 || 08*byte
StoreMemory (uncached, HALFWORD, data, pAddr, vAddr, DATA)

Exceptions:
TLB refill exception
TLB invalid exception
TLB modification exception
Bus error exception
Address error exception

A–57

SLL    Shift Word Left Logical

Bits     31..26     25..21    20..16    15..11    10..6     5..0
Field    SPECIAL    0         rt        rd        sa        SLL
Value    000000     00000                                   000000
Width    6          5         5         5         5         6

Format:
SLL rd, rt, sa

Description:
The contents of the low-order word of general register rt are shifted left by sa bits, inserting zeros
into the low-order bits. The word result is placed in register rd.
If rd is a 64-bit register, the result word is sign-extended when it is placed in the register. The result
word is sign-extended even if the shift amount is zero; this instruction with a zero shift amount
can be used to truncate a 64-bit value and sign extend the lower word. Unlike nearly all other word
operations the input operand does not have to be a properly sign-extended word value to produce
a valid result.

Operation:
T:

GPR[rd] ← GPR[rt]31– sa...0 || 0sa

Exceptions:
None

A–58

SLLV    Shift Word Left Logical Variable

Bits     31..26     25..21    20..16    15..11    10..6     5..0
Field    SPECIAL    rs        rt        rd        0         SLLV
Value    000000                                   00000     000100
Width    6          5         5         5         5         6

Format:
SLLV rd, rt, rs

Description:
The contents of the low-order word of general register rt are shifted left the number of bits specified
by the low-order five bits contained in general register rs, inserting zeros into the low-order bits.
The word-value result is placed in register rd.

Operation:
T:

s ← GPR[rs]4...0
GPR[rd]← GPR[rt](31–s)...0 || 0s

Exceptions:
None
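
The practical difference between SLL and SLLV is where the shift amount comes from and how it is bounded. A short C sketch under that reading; the function names are hypothetical, and the assert models the fact that sa is a 5-bit instruction field.

```c
#include <assert.h>
#include <stdint.h>

/* SLL: sa is a 5-bit field in the instruction word, so it is always 0..31. */
static uint32_t sll(uint32_t rt, unsigned sa)
{
    assert(sa < 32u);
    return rt << sa;
}

/* SLLV: only the low-order five bits of rs are used as the shift amount. */
static uint32_t sllv(uint32_t rt, uint32_t rs)
{
    return rt << (rs & 31u);
}
```

Note that a register shift amount of 32 therefore shifts by zero; it does not clear the register.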

A–59

SLT    Set On Less Than

Bits     31..26     25..21    20..16    15..11    10..6     5..0
Field    SPECIAL    rs        rt        rd        0         SLT
Value    000000                                   00000     101010
Width    6          5         5         5         5         6

Format:
SLT rd, rs, rt

Description:
The contents of general register rt are subtracted from the contents of general register rs.
Considering both quantities as signed integers, if the contents of general register rs are less than the
contents of general register rt, the result is set to one; otherwise the result is set to zero.
The result is placed into general register rd.
No integer overflow exception occurs under any circumstances. The comparison is valid even if
the subtraction used during the comparison overflows.

Operation:
T:

if GPR[rs] < GPR[rt] then
GPR[rd] ← 031 || 1
else
GPR[rd] ← 032
endif

Exceptions:
None

A–60

SLTI    Set On Less Than Immediate

Bits     31..26    25..21    20..16    15..0
Field    SLTI      rs        rt        immediate
Value    001010
Width    6         5         5         16

Format:
SLTI rt, rs, immediate

Description:
The 16-bit immediate is sign-extended and subtracted from the contents of general register rs.
Considering both quantities as signed integers, if rs is less than the sign-extended immediate, the
result is set to one; otherwise the result is set to zero.
The result is placed into general register rt.
No integer overflow exception occurs under any circumstances. The comparison is valid even if
the subtraction used during the comparison overflows.

Operation:
T:

if GPR[rs] < (immediate15)16 || immediate15...0 then
GPR[rt] ← 031 || 1
else
GPR[rt] ← 032
endif

Exceptions:
None

A–61

SLTIU    Set On Less Than Immediate Unsigned

Bits     31..26    25..21    20..16    15..0
Field    SLTIU     rs        rt        immediate
Value    001011
Width    6         5         5         16

Format:
SLTIU rt, rs, immediate

Description:
The 16-bit immediate is sign-extended and subtracted from the contents of general register rs.
Considering both quantities as unsigned integers, if rs is less than the sign-extended immediate, the
result is set to one; otherwise the result is set to zero.
The result is placed into general register rt.
No integer overflow exception occurs under any circumstances. The comparison is valid even if
the subtraction used during the comparison overflows.

Operation:
T:

if (0 || GPR[rs]) < (immediate15)16 || immediate15...0 then
GPR[rt] ← 031 || 1
else
GPR[rt] ← 032
endif

Exceptions:
None
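
A point that is easy to miss is that SLTIU sign-extends its immediate exactly as SLTI does; only the comparison itself is unsigned. The C sketch below makes that explicit. The function names are hypothetical, and is_zero shows one commonly used idiom rather than anything this manual prescribes.

```c
#include <assert.h>
#include <stdint.h>

/* Sign-extend a 16-bit immediate to 32 bits, as SLTI and SLTIU both do. */
static uint32_t sext16(uint16_t imm)
{
    return (imm & 0x8000u) ? (0xFFFF0000u | imm) : (uint32_t)imm;
}

/* SLTI: signed comparison against the sign-extended immediate. */
static uint32_t slti(uint32_t rs, uint16_t imm)
{
    return (int32_t)rs < (int32_t)sext16(imm) ? 1u : 0u;
}

/* SLTIU: same sign-extended immediate, but compared as unsigned. */
static uint32_t sltiu(uint32_t rs, uint16_t imm)
{
    return rs < sext16(imm) ? 1u : 0u;
}

/* Idiom: "sltiu rt, rs, 1" sets rt to 1 exactly when rs is zero. */
static uint32_t is_zero(uint32_t rs)
{
    uint32_t r = sltiu(rs, 1u);
    assert(r == (uint32_t)(rs == 0u));
    return r;
}
```

One consequence is that an SLTIU immediate of 0xFFFF compares against 0xFFFFFFFF, the largest unsigned value, not against 65535.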

A–62

SLTU    Set On Less Than Unsigned

Bits     31..26     25..21    20..16    15..11    10..6     5..0
Field    SPECIAL    rs        rt        rd        0         SLTU
Value    000000                                   00000     101011
Width    6          5         5         5         5         6

Format:
SLTU rd, rs, rt

Description:
The contents of general register rt are subtracted from the contents of general register rs.
Considering both quantities as unsigned integers, if the contents of general register rs are less than
the contents of general register rt, the result is set to one; otherwise the result is set to zero.
The result is placed into general register rd.
No integer overflow exception occurs under any circumstances. The comparison is valid even if
the subtraction used during the comparison overflows.

Operation:
T:

if (0 || GPR[rs]) < 0 || GPR[rt] then
GPR[rd] ← 031 || 1
else
GPR[rd] ← 032
endif

Exceptions:
None
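
Since the architecture has no carry flag, SLTU is the usual way to recover the carry out of a 32-bit addition. The following sketch contrasts SLT and SLTU and then uses SLTU in a 64-bit add built from 32-bit halves; add64 and its parameter names are hypothetical and serve only as an illustration.

```c
#include <assert.h>
#include <stdint.h>

/* SLT: operands compared as signed integers. */
static uint32_t slt(uint32_t rs, uint32_t rt)
{
    return (int32_t)rs < (int32_t)rt ? 1u : 0u;
}

/* SLTU: operands compared as unsigned integers. */
static uint32_t sltu(uint32_t rs, uint32_t rt)
{
    return rs < rt ? 1u : 0u;
}

/* Hypothetical use: 64-bit add from 32-bit halves; SLTU recovers the carry. */
static uint64_t add64(uint32_t a_hi, uint32_t a_lo, uint32_t b_hi, uint32_t b_lo)
{
    uint32_t lo = a_lo + b_lo;          /* like ADDU: wraps modulo 2^32 */
    uint32_t carry = sltu(lo, b_lo);    /* 1 exactly when the low add wrapped */
    uint32_t hi = a_hi + b_hi + carry;
    uint64_t result = ((uint64_t)hi << 32) | lo;

    /* Cross-check against a native 64-bit addition. */
    uint64_t expect = (((uint64_t)a_hi << 32) | a_lo) + (((uint64_t)b_hi << 32) | b_lo);
    assert(result == expect);
    return result;
}
```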

A–63

SRA    Shift Word Right Arithmetic

Bits     31..26     25..21    20..16    15..11    10..6     5..0
Field    SPECIAL    0         rt        rd        sa        SRA
Value    000000     00000                                   000011
Width    6          5         5         5         5         6

Format:
SRA rd, rt, sa

Description:
The contents of the low-order word of general register rt are shifted right by sa bits, sign-extending
the high-order bits.
The result is placed in register rd.

Operation:
T:

GPR[rd] ← (GPR[rt]31)sa || GPR[rt] 31...sa

Exceptions:
None

A–64

SRAV    Shift Word Right Arithmetic Variable

Bits     31..26     25..21    20..16    15..11    10..6     5..0
Field    SPECIAL    rs        rt        rd        0         SRAV
Value    000000                                   00000     000111
Width    6          5         5         5         5         6

Format:
SRAV rd, rt, rs

Description:
The contents of general register rt are shifted right by the number of bits specified by the low-order
five bits of general register rs, sign-extending the high-order bits.
The result is placed in register rd.

Operation:
T:

s ← GPR[rs]4...0
GPR[rd] ← (GPR[rt]31)s || GPR[rt]31...s

Exceptions:
None

A–65

SRL    Shift Word Right Logical

Bits     31..26     25..21    20..16    15..11    10..6     5..0
Field    SPECIAL    0         rt        rd        sa        SRL
Value    000000     00000                                   000010
Width    6          5         5         5         5         6

Format:
SRL rd, rt, sa

Description:
The low-order word of general register rt is shifted right by sa bits, inserting zeros into the
high-order bits.
The result is placed in register rd.

Operation:
T:

GPR[rd] ← 0 sa || GPR[rt]31...sa

Exceptions:
None

A–66

SRLV    Shift Word Right Logical Variable

Bits     31..26     25..21    20..16    15..11    10..6     5..0
Field    SPECIAL    rs        rt        rd        0         SRLV
Value    000000                                   00000     000110
Width    6          5         5         5         5         6

Format:
SRLV rd, rt, rs

Description:
The low-order word of general register rt is shifted right by the number of bits specified by the
low-order five bits of general register rs, inserting zeros into the high-order bits.
The result is placed in register rd.

Operation:
T:

s ← GPR[rs]4...0
GPR[rd] ← 0s || GPR[rt]31...s

Exceptions:
None
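
The logical and arithmetic right shifts differ only in what fills the vacated high-order bits. The C sketch below spells that out without relying on the implementation-defined behavior of shifting negative signed values; the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* SRL: zeros are inserted into the high-order bits. */
static uint32_t srl(uint32_t rt, unsigned sa)
{
    assert(sa < 32u);
    return rt >> sa;
}

/* SRA: copies of bit 31 are inserted into the high-order bits. */
static uint32_t sra(uint32_t rt, unsigned sa)
{
    assert(sa < 32u);
    uint32_t r = rt >> sa;
    if ((rt & 0x80000000u) && sa != 0u)
        r |= ~(0xFFFFFFFFu >> sa);   /* set the sa vacated bits */
    return r;
}

/* SRLV and SRAV: the shift amount is the low-order five bits of rs. */
static uint32_t srlv(uint32_t rt, uint32_t rs) { return srl(rt, rs & 31u); }
static uint32_t srav(uint32_t rt, uint32_t rs) { return sra(rt, rs & 31u); }
```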

A–67

SUB    Subtract Word

Bits     31..26     25..21    20..16    15..11    10..6     5..0
Field    SPECIAL    rs        rt        rd        0         SUB
Value    000000                                   00000     100010
Width    6          5         5         5         5         6

Format:
SUB rd, rs, rt

Description:
The contents of general register rt are subtracted from the contents of general register rs to form a
result. The result is placed into general register rd.
The only difference between this instruction and the SUBU instruction is that SUBU never traps on
overflow.
An integer overflow exception takes place if the carries out of bits 30 and 31 differ (2’s complement
overflow). The destination register rd is not modified when an integer overflow exception occurs.

Operation:
T:

GPR[rd] ← GPR[rs] – GPR[rt]

Exceptions:
Integer overflow exception
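
The overflow condition can also be stated in terms of operand signs: a subtraction overflows when the operands have different signs and the sign of the result differs from that of rs. A C sketch of SUB and SUBU under that formulation follows; the function names and the return-value convention standing in for the exception are assumptions of this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if rs - rt overflows as a 32-bit two's-complement subtraction. */
static int sub_overflows(uint32_t rs, uint32_t rt)
{
    uint32_t diff = rs - rt;
    /* Overflow: operand signs differ and the result sign differs from rs. */
    return (int)(((rs ^ rt) & (rs ^ diff)) >> 31);
}

/* SUB model: returns 1 and writes *rd, or returns 0 and leaves *rd
 * untouched where the processor would raise an overflow exception. */
static int sub(uint32_t *rd, uint32_t rs, uint32_t rt)
{
    assert(rd != NULL);
    if (sub_overflows(rs, rt))
        return 0;
    *rd = rs - rt;
    return 1;
}

/* SUBU: same arithmetic, never traps. */
static uint32_t subu(uint32_t rs, uint32_t rt)
{
    return rs - rt;
}
```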

A–68

SUBU    Subtract Unsigned Word

Bits     31..26     25..21    20..16    15..11    10..6     5..0
Field    SPECIAL    rs        rt        rd        0         SUBU
Value    000000                                   00000     100011
Width    6          5         5         5         5         6

Format:
SUBU rd, rs, rt

Description:
The contents of general register rt are subtracted from the contents of general register rs to form a
result.
The result is placed into general register rd.
The only difference between this instruction and the SUB instruction is that SUBU never traps on
overflow. No integer overflow exception occurs under any circumstances.

Operation:
T:

GPR[rd] ← GPR[rs] – GPR[rt]

Exceptions:
None

A–69

SW    Store Word

Bits     31..26    25..21    20..16    15..0
Field    SW        base      rt        offset
Value    101011
Width    6         5         5         16

Format:
SW rt, offset(base)

Description:
The 16-bit offset is sign-extended and added to the contents of general register base to form a virtual
address. The contents of general register rt are stored at the memory location specified by the
effective address.
If either of the two least-significant bits of the effective address is non-zero, an address error
exception occurs.

Operation:
T:

vAddr ← ((offset15)16 || offset15...0) + GPR[base]
(pAddr, uncached) ← AddressTranslation (vAddr, DATA)
data ← GPR[rt]
StoreMemory (uncached, WORD, data, pAddr, vAddr, DATA)

Exceptions:
TLB refill exception
TLB invalid exception
TLB modification exception
Bus error exception
Address error exception

A–70

SWCz    Store Word From Coprocessor

Bits     31..26     25..21    20..16    15..0
Field    SWCz       base      rt        offset
Value    1110xx*
Width    6          5         5         16

Format:
SWCz rt, offset(base)

Description:
The 16-bit offset is sign-extended and added to the contents of general register base to form a virtual
address. Coprocessor unit z sources a word, which the processor writes to the addressed memory
location.
The data to be stored is defined by individual coprocessor specifications.
This instruction is not valid for use with CP0.
If either of the two least-significant bits of the effective address is non-zero, an address error
exception occurs.
Execution of the instruction referencing coprocessor 3 causes a reserved instruction exception, not
a coprocessor unusable exception.

Operation:
vAddr ← ((offset15)16 || offset15...0) + GPR[base]
(pAddr, uncached) ← AddressTranslation (vAddr, DATA)
byte ← vAddr1...0
data ← COPzSW (byte, rt)
StoreMemory (uncached, WORD, data, pAddr, vAddr, DATA)

Exceptions:
TLB refill exception
TLB invalid exception
TLB modification exception
Bus error exception
Address error exception
Coprocessor unusable exception

*Opcode Bit Encoding:

          Bits 31..26
SWC1      111001
SWC2      111010
SWC3      111011

Bits 31..28 hold the SW opcode (1110); bits 27..26 hold the coprocessor unit number.

A–71

SWL    Store Word Left

Bits     31..26    25..21    20..16    15..0
Field    SWL       base      rt        offset
Value    101010
Width    6         5         5         16

Format:
SWL rt, offset(base)

Description:
This instruction can be used with the SWR instruction to store the contents of a register into four
consecutive bytes of memory, when the bytes cross a word boundary. SWL stores the left portion
of the register into the appropriate part of the high-order word of memory; SWR stores the right
portion of the register into the appropriate part of the low-order word.
The SWL instruction adds its sign-extended 16-bit offset to the contents of general register base to
form a virtual address which may specify an arbitrary byte. It alters only the word in memory
which contains that byte. From one to four bytes will be stored, depending on the starting byte
specified.
Conceptually, it starts at the most-significant byte of the register and copies it to the specified byte
in memory; then it copies bytes from register to memory until it reaches the low-order byte of the
word in memory.
No address exceptions due to alignment are possible.

memory (big-endian), with register $24 holding bytes A B C D:

              before       after SWL $24,1($0)
address 4     4 5 6 7      4 5 6 7
address 0     0 1 2 3      0 A B C

Operation:
T:

vAddr ← ((offset15)16 || offset 15...0) + GPR[base]
(pAddr, uncached) ← AddressTranslation (vAddr, DATA)
pAddr ← pAddrPSIZE –1...2 || (pAddr1...0 xor ReverseEndian2)
if BigEndianMem = 0 then
pAddr ← pAddrPSIZE – 1...2 || 02
endif
byte ← vAddr1...0 xor BigEndianCPU2
data ← 024–8*byte || GPR[rt]31...24–8*byte
StoreMemory (uncached, byte, data, pAddr, vAddr, DATA)

A–72

Given a doubleword in a register and a doubleword in memory, the operation of SWL is as follows.
In the table, E to H denote bytes of the register and I to P denote bytes of memory.

             BigEndianCPU = 0                              BigEndianCPU = 1
vAddr2..0    destination       type   offset               destination       type   offset
                                      LEM   BEM                                     LEM   BEM
0            I J K L M N O E   0      0     7              E F G H M N O P   3      4     0
1            I J K L M N E F   1      0     6              I E F G M N O P   2      4     1
2            I J K L M E F G   2      0     5              I J E F M N O P   1      4     2
3            I J K L E F G H   3      0     4              I J K E M N O P   0      4     3
4            I J K E M N O P   0      4     3              I J K L E F G H   3      0     4
5            I J E F M N O P   1      4     2              I J K L M E F G   2      0     5
6            I E F G M N O P   2      4     1              I J K L M N E F   1      0     6
7            E F G H M N O P   3      4     0              I J K L M N O E   0      0     7

LEM      Little-endian memory (BigEndianMem = 0)
BEM      BigEndianMem = 1
Type     AccessType sent to memory
Offset   pAddr2...0 sent to memory

Exceptions:
TLB refill exception
TLB invalid exception
TLB modification exception
Bus error exception
Address error exception

A–73

SWR    Store Word Right

Bits     31..26    25..21    20..16    15..0
Field    SWR       base      rt        offset
Value    101110
Width    6         5         5         16

Format:
SWR rt, offset(base)

Description:
This instruction can be used with the SWL instruction to store the contents of a register into four
consecutive bytes of memory, when the bytes cross a boundary between two words. SWR stores
the right portion of the register into the appropriate part of the low-order word; SWL stores the left
portion of the register into the appropriate part of the high-order word of memory.
The SWR instruction adds its sign-extended 16-bit offset to the contents of general register base to
form a virtual address which may specify an arbitrary byte. It alters only the word in memory
which contains that byte. From one to four bytes will be stored, depending on the starting byte
specified.
Conceptually, it starts at the least-significant (rightmost) byte of the register and copies it to the
specified byte in memory; then copies bytes from register to memory until it reaches the high-order
byte of the word in memory.
No address exceptions due to alignment are possible.
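
The SWL and SWR pair can be modeled in C in the same spirit as the description above. This is a sketch only, assuming a big-endian CPU and a big-endian byte array standing in for memory; swl_be, swr_be, and store_unaligned_be are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* SWL, big-endian: register bytes, most significant first, are written
 * from addr up to the last byte of the word containing addr. */
static void swl_be(uint8_t *mem, uint32_t addr, uint32_t rt)
{
    assert(mem != NULL);
    unsigned n = 4u - (addr & 3u);            /* bytes stored: 1..4 */
    for (unsigned i = 0; i < n; i++)
        mem[addr + i] = (uint8_t)(rt >> (24u - 8u * i));
}

/* SWR, big-endian: register bytes, least significant first, are written
 * from addr down to the first byte of the word containing addr. */
static void swr_be(uint8_t *mem, uint32_t addr, uint32_t rt)
{
    assert(mem != NULL);
    unsigned n = (addr & 3u) + 1u;            /* bytes stored: 1..4 */
    for (unsigned i = 0; i < n; i++)
        mem[addr - i] = (uint8_t)(rt >> (8u * i));
}

/* Unaligned word store: SWL at the first byte, SWR at the last byte. */
static void store_unaligned_be(uint8_t *mem, uint32_t addr, uint32_t rt)
{
    swl_be(mem, addr, rt);
    swr_be(mem, addr + 3u, rt);
}
```

Between them the two stores touch exactly the four bytes addr to addr+3, so neighboring data in either word is preserved.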

memory (big-endian), with register $24 holding bytes A B C D:

              before       after SWR $24,4($0)
address 4     4 5 6 7      D 5 6 7
address 0     0 1 2 3      0 1 2 3

Operation:
T:  vAddr ← ((offset_15)^16 || offset_15...0) + GPR[base]
    (pAddr, uncached) ← AddressTranslation (vAddr, DATA)
    pAddr ← pAddr_{PSIZE–1...2} || (pAddr_1...0 xor ReverseEndian^2)
    if BigEndianMem = 0 then
        pAddr ← pAddr_{PSIZE–1...2} || 0^2
    endif
    byte ← vAddr_1...0 xor BigEndianCPU^2
    data ← GPR[rt]_{31–8*byte...0} || 0^{8*byte}
    StoreMemory (uncached, WORD–byte, data, pAddr, vAddr, DATA)
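The byte selection in the Operation section can be tried out on a host machine. The following is a minimal C sketch, not part of the architecture definition: swr_model and swl_model are hypothetical helper names, memory words are passed as 32-bit values (most-significant byte first) rather than as byte arrays, and the SWL half assumes the mirror-image rule given on the SWL page (the left part of the register goes to the low-order end of the selected word).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of SWR: returns the memory word after the store.
 * big_endian_cpu plays the role of BigEndianCPU. */
uint32_t swr_model(uint32_t mem_word, uint32_t reg, uint32_t vaddr, int big_endian_cpu)
{
    /* byte <- vAddr(1..0) xor BigEndianCPU^2 */
    unsigned byte = (vaddr & 3u) ^ (big_endian_cpu ? 3u : 0u);
    assert(byte < 4u);
    /* data <- GPR[rt](31-8*byte..0) || 0^(8*byte): the top (4 - byte) bytes are written */
    uint32_t mask = 0xFFFFFFFFu << (8u * byte);
    return (mem_word & ~mask) | ((reg << (8u * byte)) & mask);
}

/* Hypothetical model of SWL (assumed mirror image of SWR). */
uint32_t swl_model(uint32_t mem_word, uint32_t reg, uint32_t vaddr, int big_endian_cpu)
{
    unsigned byte = (vaddr & 3u) ^ (big_endian_cpu ? 3u : 0u);
    unsigned shift = 8u * (3u - byte);
    /* the low (byte + 1) bytes of the word receive the left part of the register */
    uint32_t mask = 0xFFFFFFFFu >> shift;
    return (mem_word & ~mask) | (reg >> shift);
}
```

On a big-endian CPU, an unaligned store of a register to address 1 would then be modelled as swl_model on the word at address 0 with vaddr 1, followed by swr_model on the word at address 4 with vaddr 4.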


Given a doubleword in a register and a doubleword in memory, the operation of SWR is as follows:

SWR
Register:  A B C D E F G H
Memory:    I J K L M N O P

             BigEndianCPU = 0                      BigEndianCPU = 1
                                    offset                                offset
vAddr2..0    destination      type  LEM  BEM       destination      type  LEM  BEM
    0        I J K L E F G H   3     0    4        H J K L M N O P   0     7    0
    1        I J K L F G H P   2     1    4        G H K L M N O P   1     6    0
    2        I J K L G H O P   1     2    4        F G H L M N O P   2     5    0
    3        I J K L H N O P   0     3    4        E F G H M N O P   3     4    0
    4        E F G H M N O P   3     4    0        I J K L H N O P   0     3    4
    5        F G H L M N O P   2     5    0        I J K L G H O P   1     2    4
    6        G H K L M N O P   1     6    0        I J K L F G H P   2     1    4
    7        H J K L M N O P   0     7    0        I J K L E F G H   3     0    4

LEM     Little-endian memory (BigEndianMem = 0)
BEM     BigEndianMem = 1
Type    AccessType sent to memory
Offset  pAddr2...0 sent to memory

Exceptions:
TLB refill exception
TLB invalid exception
TLB modification exception
Bus error exception
Address error exception

SYSCALL            System Call

Bits:    31-26      25-6     5-0
Field:   SPECIAL    Code     SYSCALL
Value:   000000              001100
Width:   6          20       6

Format:
SYSCALL

Description:
A system call exception occurs, immediately and unconditionally transferring control to the
exception handler.
The code field is available for use as software parameters, but is retrieved by the exception handler
only by loading the contents of the memory word containing the instruction.
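As a minimal sketch of that retrieval step, the C fragment below decodes an instruction word that a handler has already loaded from memory. The names is_syscall and syscall_code are hypothetical, and how the handler finds the address of the instruction is outside the scope of this page.

```c
#include <stdint.h>

/* Hypothetical helper: true if the word is a SYSCALL instruction,
 * that is, op (bits 31-26) = SPECIAL = 000000 and function (bits 5-0) = 001100. */
int is_syscall(uint32_t insn)
{
    return (insn >> 26) == 0u && (insn & 0x3Fu) == 0x0Cu;
}

/* Hypothetical helper: extract the 20-bit code field (bits 25-6). */
uint32_t syscall_code(uint32_t insn)
{
    return (insn >> 6) & 0xFFFFFu;
}
```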

Operation:
T:

SystemCallException

Exceptions:
System Call exception

XOR                Exclusive Or

Bits:    31-26      25-21    20-16    15-11    10-6     5-0
Field:   SPECIAL    rs       rt       rd       0        XOR
Value:   000000                                00000    100110
Width:   6          5        5        5        5        6

Format:
XOR rd, rs, rt

Description:
The contents of general register rs are combined with the contents of general register rt in a bit-wise
logical exclusive OR operation.
The result is placed into general register rd.

Operation:
T:

GPR[rd] ← GPR[rs] xor GPR[rt]

Exceptions:
None

XORI               Exclusive OR Immediate

Bits:    31-26     25-21    20-16    15-0
Field:   XORI      rs       rt       immediate
Value:   001110
Width:   6         5        5        16

Format:
XORI rt, rs, immediate

Description:
The 16-bit immediate is zero-extended and combined with the contents of general register rs in a bitwise logical exclusive OR operation.
The result is placed into general register rt.

Operation:
T:

GPR[rt] ← GPR[rs] xor (0^16 || immediate)

Exceptions:
None
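
Because the immediate is zero-extended, XORI can only change the low 16 bits of rs. A small C sketch of this behaviour follows; xori_model is a hypothetical name used only for illustration.

```c
#include <stdint.h>

/* Hypothetical model of XORI: the 16-bit immediate is zero-extended
 * (0^16 || immediate) before the exclusive OR. */
uint32_t xori_model(uint32_t rs, uint16_t immediate)
{
    return rs ^ (uint32_t)immediate;
}
```

Even an immediate with bit 15 set, such as 0x8000, leaves the upper halfword of the result equal to the upper halfword of rs.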

Instruction Summary

Asm       Description

“Register” format instructions

sll       Shift left (to higher bits) by a constant, bringing in zeroes to the
          low bits.
srl       Shift right (to smaller bits) by a constant. srl is “logical”, bringing
sra       in zeroes from the top. sra is “arithmetic”, duplicating bit 31, so
          implementing a correct signed division by 2^n (rounding towards
          minus infinity).
sllv      Shift (left logical, right logical, and right arithmetic) by the
srlv      amount stored in another register.
srav
jr        Jump to address in register (no offset).
jalr      Call function at address from register. Can store the return
          address in any register, even though anything but ra is not
          normally useful.
syscall   Cause “syscall” trap, conventionally used for system call from
          user-mode to operating system.
break     Cause “Bp” trap, conventionally used for debugger breakpoint.
mfhi      Access to multiply/divide unit registers “lo” and “hi”. mflo/mfhi
mthi      move data from “lo”/“hi” into an integer register; mtlo/mthi go the
mflo      other way.
mtlo
mult      Multiply two integer registers, put result into “hi”/“lo” when done.
multu     mult treats its operands as signed values, multu as unsigned.
div       (Signed and unsigned versions of) divide two integer registers and
divu      put the result (quotient) and remainder in “lo” and “hi” respectively.

Asm       Description

add       3-operand add. The only difference between them is that add
addu      causes a trap if a result overflows into bit 31, but addu never traps.
sub       3-operand subtract. sub can trap on overflow, subu won’t.
subu
and       3-operand bitwise logical operations.
or
xor
nor
slt       Set destination to 1 if rs1 < rs2, otherwise to 0. slt compares the
sltu      operands as signed values, sltu as unsigned.
bltz      Branch if rs1 < 0 or rs1 >= 0 respectively.
bgez
bltzal    If rs1 < 0 or rs1 >= 0 respectively, branch to function. Set ra to the
bgezal    notional “return” address, even if the branch is not taken.

Long “in-region” jump and call

j         Unconditional jump (26-bit word address). Note that the top 4 bits
          of the program address of the target location come from the
          instruction’s own location. You have to use a jr (jump register)
          instruction to reach outside of your 256Mbyte region.
jal       Function call (26-bit word address).

Compare and branch instructions

beq       Branch if rs1 == rs2.
bne       Branch if rs1 != rs2.
blez      Branch if rs1 <= 0 or rs1 > 0 respectively. Encoded as if they had
bgtz      two source operands with rs2 selecting zero.

“Immediate” arithmetic and logical instructions

Asm       Description

addi      Arithmetic operations with one source register, a 16-bit signed
addiu     constant, and a separate destination. As for 3-operand arithmetic,
subi      the unsigned forms addiu and subiu have identical results but
subiu     never cause an overflow trap.
          Note that “load immediate signed” can be synthesised as an addi
          with register zero.
andi      Logical operation with 16-bit constant, zero-extended for these
ori       instructions. “load immediate unsigned” is synthesised with an ori
xori      with register zero.
lui       Load UPPER immediate (not unsigned). The 16-bit constant is
          loaded into the high-order 16 bits of the register, and the low-order
          bits cleared to zero.
          A “load immediate” with a value which won’t fit into 16 bits is
          synthesised by a lui followed by an ori, which fills in the low 16 bits.

CPU control instructions (“Co-processor zero”)

mfc0      These instructions are described in the chapter “System Software
mtc0      Considerations”.
bc0f
bc0t
tlbr
tlbwi
tlbwr
tlbp
rfe

Floating point instructions (except load/store)

Functions of these are detailed in the chapter “FLOATING POINT COPROCESSOR” above. Their
encodings are detailed in the appendix “FP Instruction encoding”, below.

Load and store instructions

lb        Load byte and sign-extend.
lh        Load halfword and sign-extend.

Asm       Description

lwl       “Load word left”, see section on “Unaligned loads and stores”.
lw        Load word.
lbu       Load byte and zero-extend.
lhu       Load halfword and zero-extend.
lwr       “Load word right”, see section on “Unaligned loads and stores”.
sb        Store byte.
sh        Store halfword.
swl       “Store word left”, see section on “Unaligned loads and stores”.
sw        Store word.
swr       “Store word right”, see section on “Unaligned loads and stores”.
swc1      Store word from FP register fs.
lwc1      Load word into FP register fd.
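
The lui/ori synthesis of a 32-bit “load immediate” mentioned in the summary can be sketched in C as follows. This is an illustration only: lui_model, ori_model and load_immediate are hypothetical names, and the sketch models register contents rather than any particular assembler's macro expansion.

```c
#include <stdint.h>

/* Hypothetical model of lui: constant into the high-order 16 bits, low bits cleared. */
uint32_t lui_model(uint16_t immediate)
{
    return (uint32_t)immediate << 16;
}

/* Hypothetical model of ori: the 16-bit constant is zero-extended before the OR. */
uint32_t ori_model(uint32_t rs, uint16_t immediate)
{
    return rs | (uint32_t)immediate;
}

/* Build an arbitrary 32-bit value the way a lui followed by an ori would. */
uint32_t load_immediate(uint32_t value)
{
    uint16_t upper = (uint16_t)(value >> 16);
    uint16_t lower = (uint16_t)(value & 0xFFFFu);
    return ori_model(lui_model(upper), lower);
}
```

Since ori zero-extends its constant, the upper half needs no adjustment even when bit 15 of the value is set.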

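The 256Mbyte region limit of j and jal follows from how the target is formed. Below is a minimal C sketch with hypothetical names jump_target and in_jump_region. It assumes the top 4 bits are taken from the address of the delay-slot instruction (the jump's own address plus 4), which differs from the jump's own address only when the jump is the last word of a region.

```c
#include <stdint.h>

/* Hypothetical model: target of j/jal at address jump_pc with a 26-bit
 * word-address field. The top 4 bits are assumed to come from jump_pc + 4. */
uint32_t jump_target(uint32_t jump_pc, uint32_t word_address)
{
    return ((jump_pc + 4u) & 0xF0000000u) | ((word_address & 0x03FFFFFFu) << 2);
}

/* Hypothetical check: can a j/jal at jump_pc reach dest directly?
 * If not, a jr (jump register) is needed. */
int in_jump_region(uint32_t jump_pc, uint32_t dest)
{
    return (dest & 3u) == 0u &&
           ((jump_pc + 4u) & 0xF0000000u) == (dest & 0xF0000000u);
}
```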

FPA INSTRUCTION REFERENCE

APPENDIX B

Integrated Device Technology, Inc.

FPU Instruction Set Details
This section documents the instructions for the floating-point unit (FPU)
in MIPS processors. It contains some descriptive material at the beginning,
a detailed description for each instruction in alphabetic order, and an
instruction opcode encoding table at the end of the section.
The descriptive material describes the FPU instruction categories, the
instruction encoding formats, the valid operands for FPU computational
instructions, compare and condition values, FPU use of the coprocessor
registers, and a description of the notation used for the detailed instruction
description.
This section does not describe the operation of floating-point arithmetic,
the exception conditions within FP arithmetic, the exception mechanism of
the FPU, or the handling of these FP exceptions.
FPU Instructions
The floating-point unit (FPU) is implemented as Coprocessor unit 1
(CP1) within the MIPS architecture. A floating-point instruction needs
access to coprocessor 1 to execute; if CP1 is not enabled, an FP instruction
will cause a Coprocessor Unusable exception. The FPU has a load/store
architecture. All computations are done on data held in registers, and data
is transferred between registers and the rest of the system with dedicated
load, store, and move instructions.
The FPU instructions fall into the following categories:
• Data Transfer
• Conversion
• Arithmetic
• Register-to-Register Data Movement
• Branch
Floating-Point Data Transfer
All movement of data between the floating-point coprocessor general
registers and the rest of the system is accomplished by:
• Load memory to CP1 general register
• Store CP1 general register to memory
• Move CPU register to CP1 general register
• Move CP1 general register to CPU register
These operations are unformatted; no format conversions are performed
and, therefore, no floating-point exceptions can occur.
The coprocessor also contains floating-point control registers. The only
data movement operations supported for them are:
• Copy CPU register to FPU control register
• Copy FPU control register to CPU register
Floating-Point Conversions
The floating-point unit has instructions to convert among the operand
types as well as operations which combine conversion with rounding using
a particular rounding mode. The conversion operations are:
• fixed-point to floating-point
• floating to fixed
• floating to floating (of another size)


Floating-Point Arithmetic
The floating-point arithmetic instructions are:
• add
• subtract
• multiply
• divide
• absolute value
• negate
• compare
All operations satisfy the IEEE Standard 754 requirements for accuracy:
the result is identical to an infinite-precision result rounded to the
specified format, using the current rounding mode.
Floating-Point Register-to-Register Move
There are FPU instructions to move formatted operands among
registers:
• FP move
• FP register move-conditional on FP condition code
• FP register move-conditional on CPU register value
• CPU register move-conditional on FP condition code
Floating-Point Branch
The FP compare instruction produces a condition code. The FPU has
instructions to conditionally branch on the FP condition.

FP Computational Instructions and Valid Operands
The Floating-point unit computational instructions operate on
structured data and the operands can have one of several operand formats.
The format of the operands, and perhaps the result, for an instruction is
specified by either the 5-bit fmt field or 3-bit fmt3 field in the instruction
encoding; decoding for these fields is shown in Table B.7.
fmt      fmt3    Mnemonic          Size        Format
0-15     -       Reserved
16       0       S                 single      Binary floating-point
17       1       D                 double      Binary floating-point
18       2       Reserved for E    extended    Reserved for Extended binary floating-point
19       3       Reserved for Q    quad        Reserved for Quad binary floating-point
20       4       W                 single      32-bit binary fixed-point
21       5       Reserved for L    longword    64-bit binary fixed-point
22–31    6-7     Reserved

Table B.7. Format Field Decoding

A particular operation is valid only for operands of certain formats.


FP Compare and Condition values
The coprocessor branch on condition true/false instructions can be
used to logically negate any predicate. Thus, the 32 possible conditions
require only 16 distinct comparisons, as shown below.
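     Condition                    Compare Relations                Invalid Operation
Branch Mnemonic             Greater   Less                         Exception If
True     False     Code     Than      Than     Equal   Unordered   Unordered
F        T         0        F         F        F       F           No
UN       OR        1        F         F        F       T           No
EQ       NEQ       2        F         F        T       F           No
UEQ      OGL       3        F         F        T       T           No
OLT      UGE       4        F         T        F       F           No
ULT      OGE       5        F         T        F       T           No
OLE      UGT       6        F         T        T       F           No
ULE      OGT       7        F         T        T       T           No
SF       ST        8        F         F        F       F           Yes
NGLE     GLE       9        F         F        F       T           Yes
SEQ      SNE       10       F         F        T       F           Yes
NGL      GL        11       F         F        T       T           Yes
LT       NLT       12       F         T        F       F           Yes
NGE      GE        13       F         T        F       T           Yes
LE       NLE       14       F         T        T       F           Yes
NGT      GT        15       F         T        T       T           Yes

Table B.8. Logical Negation of Predicates by Condition True/False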

FPU Register Specifiers
The data transfer instructions and the computational instructions view
the Coprocessor 1 general registers and the data in them in different ways.
This section describes the general registers in the coprocessor, how data
transfer instructions transfer operand data, how the FPU uses registers to
hold the different types and sizes of operands, and how the FPU
computational instructions specify operands.
The CP1 register is the fundamental addressable unit in the
coprocessor. All instructions that refer to the CP1 registers use the 32 CP1
general register numbers as register specifiers in the instruction encoding.
Some register numbers are not valid specifiers for some instructions; this
is discussed below.
The data transfer operations consist of memory load/store and move to/
from CPU register instructions. These instructions, with one exception
noted below, transfer unformatted data to/from a single CP1 register.
Most of the transfer instructions are the generic load/store/move
instructions used with all coprocessors and they do not have any special
operation for CP1.
The FPU operates on operands of different lengths. Some operands
exceed the CP1 general register size, so the FPU computational
instructions use the CP1 general registers in a structured way. If the FPU
operand exceeds the CP1 register size, a set of adjacent CP1 general
registers are used to hold the data for the operand. All multi-register


operands must be in “aligned” sets of registers; an operand that requires
two registers must be in an even register and the next-higher odd register.
When the FPU operand is in a set of CP1 registers, the lowest-numbered
register in the set is used as the FPU operand specifier or FPU register
specifier in the instruction encoding.
The sets of registers are structured in little-endian order for both big-
and little-endian processors. The least-significant portion of the operand
is put into the lowest-numbered CP1 register in the set, and the
most-significant is put into the highest-numbered register.

32-bit CP1 registers
All 32-bit processors have 32-bit CP1 registers. The 64-bit processors
have a 32-bit-CP1-register emulation mode in which CP1 appears to
possess 32-bit registers. The primary FP data type is double floating-point,
which requires 64 bits of register space. For simplicity in implementation,
the minimum FPU operand size is a doubleword in the CP1 register file.
Operands of type word size (W and S), are placed into the low word of the
doubleword.
The MIPS I version of the architecture has only word load/store/move
instructions. To transfer anything but a W or S operand takes multiple
instructions that each reference one of the 32-bit CP1 general registers.
The load/store/move instructions use all the CP1 register numbers as
specifiers because they do not refer to formatted FP operands.
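32-bit CP1 register use and significance by operand type

Operand type    Odd register (fpr+1)    Even register (fpr+0)    Valid specifiers
W, S            unused / undefined      data                     even register numbers
L, D            most                    least                    even register numbers

Table B.9. Valid FP Operand Specifiers with 32-bit Coprocessor 1 Registers.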


FPU Register Access for 32-bit CP1 Registers

value <-- ValueFPR(fpr, fmt)           /* undefined for odd fpr */
    case fmt of
        S, W:
            value <-- FGR[fpr+0]
        D:                             /* undefined for fpr not even */
            value <-- FGR[fpr+1] || FGR[fpr+0]
    end

StoreFPR(fpr, fmt, value):             /* undefined for odd fpr */
    case fmt of
        S, W:
            FGR[fpr+1] <-- undefined
            FGR[fpr+0] <-- value
        D:
            FGR[fpr+1] <-- value_63...32
            FGR[fpr+0] <-- value_31...0
    end

NOTE: The notation “FGR[fpr]” is either the physical
32-bit register or the logical 32-bit register for a 64-bit
processor in 32-bit register emulation mode. It does not
imply a specific mechanism for the 32-bit register
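
The register-pair access above can be modelled on a host machine. The following C sketch is illustrative only: the fp_fmt enumeration and the function names are hypothetical, operands are handled as raw bit patterns, and where the architecture leaves FGR[fpr+1] undefined (S and W stores) this model simply leaves it unchanged.

```c
#include <assert.h>
#include <stdint.h>

enum fp_fmt { FMT_S, FMT_D, FMT_W };   /* hypothetical names for the fmt values */

uint32_t FGR[32];                      /* the 32-bit CP1 general registers */

/* Model of ValueFPR: returns the operand as a raw bit pattern. */
uint64_t value_fpr(unsigned fpr, enum fp_fmt fmt)
{
    assert(fpr < 32u && (fpr & 1u) == 0u);   /* odd specifiers are undefined */
    if (fmt == FMT_D)
        return ((uint64_t)FGR[fpr + 1] << 32) | FGR[fpr];
    return FGR[fpr];                          /* S, W: even register only */
}

/* Model of StoreFPR. */
void store_fpr(unsigned fpr, enum fp_fmt fmt, uint64_t value)
{
    assert(fpr < 32u && (fpr & 1u) == 0u);
    if (fmt == FMT_D)
        FGR[fpr + 1] = (uint32_t)(value >> 32);   /* most-significant word */
    FGR[fpr] = (uint32_t)value;                   /* least-significant word */
}
```

For example, storing the double-precision bit pattern 0x3FF0000000000000 (the IEEE 754 encoding of 1.0) to specifier 2 places 0x3FF00000 in FGR[3] and zero in FGR[2].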
Instruction Notation Conventions
For the FPU instruction detail documentation, all variable subfields in
an instruction format (such as fs, ft, immediate, and so on) are shown in
lower-case. The instruction name (such as ADD, SUB, and so on) is shown
in upper-case.
For the sake of clarity, an alias may be used for a variable subfield in the
formats of specific instructions. For example, rs = base is used in the
format for load and store instructions. Such an alias is always lower case,
since it refers to a variable subfield.
In some instructions, the instruction subfields op and function can have
constant 6-bit values. When reference is made to these instructions,
upper-case mnemonics are used. For instance, in the floating-point ADD
instruction we use op = COP1 and function = ADD. In other cases, a single
field has both fixed and variable subfields, so the name contains both
upper and lower case characters. Bit encodings for mnemonics are shown
at the end of this section, and are also included with each individual
instruction.
The instruction description includes an Operation section that describes
the operation of the instruction in a pseudocode resembling a
programming language.
In the instruction description examples that follow, the Operation
section describes the operation performed by each instruction using a
high-level language notation.


Load and Store Memory
In the load and store operation descriptions, the functions listed below
are used to summarize the handling of virtual addresses and physical
memory.
Function              Meaning

AddressTranslation    Determines the physical address given the virtual address. The
                      function fails and an exception is taken if the required
                      translation is not present in the TLB (“E” versions only).

LoadMemory            Uses the cache and main memory to find the contents of the word
                      containing the specified physical address. The low-order two bits
                      of the address and the Access Type field indicate which of the
                      four bytes within the data word need to be returned. If the cache
                      is enabled for this access, the entire word is returned and loaded
                      into the cache.

StoreMemory           Uses the cache, write buffer, and main memory to store the word
                      or part of word specified as data in the word containing the
                      specified physical address. The low-order two bits of the address
                      and the Access Type field indicate which of the four bytes within
                      the data word should be stored.

Table B.10. Load and Store Common Functions

All coprocessor loads and stores reference aligned-word data items.
Thus, for word loads and stores, the access type field is always WORD, and
the low-order two bits of the address must always be zero.
Regardless of byte-numbering order (endianness), the address specifies
that byte which has the smallest byte-address in the addressed field. For
a big-endian machine, this is the leftmost byte; for a little-endian machine,
this is the rightmost byte.

Instruction Descriptions
The FP instructions are described in detail in alphabetic order. Each
page contains the following information for the instruction:
• Instruction mnemonic and name
• Assembler format
• Description of the instruction
• Operation of the instruction described in pseudocode.
• Exceptions that the instruction can cause
• FP exception conditions that the instruction can cause (as FloatingPoint Exceptions)

ABS.fmt            Floating-Point Absolute Value

Bits:    31-26     25-21    20-16    15-11    10-6     5-0
Field:   COP1      fmt      0        fs       fd       ABS
Value:   010001             00000                      000101
Width:   6         5        5        5        5        6

Format:
ABS.fmt fd, fs

Description:
fd ← |fs|
The contents of the FPU register specified by fs are interpreted in the specified format and the
arithmetic absolute value is taken. The result is placed in the floating-point register specified by fd.
The absolute value operation is arithmetic; a NaN operand signals invalid operation.
This instruction is valid only for single- and double-precision floating-point formats.
The fields fs and fd must specify valid operand registers for the type fmt and the logical size of
coprocessor 1 general registers. If they are not valid specifiers, the result is undefined.

Operation:
T:

StoreFPR(fd, fmt, AbsoluteValue(ValueFPR(fs, fmt)))

Exceptions:
Coprocessor unusable exception
Coprocessor exception trap

Floating-Point Exceptions:
Unimplemented operation exception
Invalid operation exception

ADD.fmt            Floating-Point Add

Bits:    31-26     25-21    20-16    15-11    10-6     5-0
Field:   COP1      fmt      ft       fs       fd       ADD
Value:   010001                                        000000
Width:   6         5        5        5        5        6

Format:
ADD.fmt fd, fs, ft

Description:
fd ← fs + ft
The contents of the FPU registers specified by fs and ft are interpreted in the specified format and
arithmetically added. The result is calculated as if to infinite precision and then rounded
to the specified format (fmt), according to the current rounding mode. The result is placed in the
floating-point register (FPR) specified by fd.
The fields fs, ft, and fd must specify valid operand registers, given the logical size of coprocessor 1
general registers, for the type fmt. If they are not valid specifiers, the result is undefined.

Operation:
T:

StoreFPR (fd, fmt, ValueFPR(fs, fmt) + ValueFPR(ft, fmt))

Exceptions:
Coprocessor unusable exception
Floating-Point exception

Floating-Point Exceptions:
Unimplemented operation exception
Invalid operation exception
Inexact exception
Overflow exception
Underflow exception

BC1F               Branch On Floating-Point False

Bits:    31-26     25-21    20-18    17     16     15-0
Field:   COP1      BC       cc       nd     tf     offset
Value:   010001    01000             0      0
Width:   6         5        3        1      1      16

Format:
BC1F offset        (cc = 0 implied)

Description:
A branch target address is computed from the sum of the address of the instruction in the delay
slot, and the 16-bit offset, shifted left two bits and sign-extended. If the contents of the floating point
condition code specified by cc are zero (equal to the value of the tf field), the target address is
branched to with a delay of one instruction.
The condition codes are set by the floating-point compare instruction.
MIPS I specifies a single floating-point condition that is available as the coprocessor 1 condition
signal (Cp1Cond) and the C bit in the FP Control and Status register. This instruction always tests
the Cp1Cond signal. The first assembler format instruction shown, with an implied cc field of zero,
is the only form allowed for processors that implement the MIPS I instruction set.
This instruction has a scheduling restriction. The condition information is sampled during the
preceding instruction and there must be at least one instruction between this branch instruction
and the compare instruction that changes the condition code. Hardware does not enforce this
restriction.

Operation:
MIPS I has a single condition signal, the Coprocessor Condition signal CpCond(1).
T–1:  condition ← COC[1] = tf
T:    target ← (offset_15)^{GPRlen-(16+2)} || offset || 0^2
T+1:  if condition then
          PC ← PC + target
      endif
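
The target calculation can be sketched in C as below. The function names are hypothetical; branch_pc is taken to be the address of the branch itself, so the delay-slot address used as the base is branch_pc + 4.

```c
#include <stdint.h>

/* Hypothetical model: does BC1F (tf = 0) or BC1T (tf = 1) take the branch? */
int bc1_taken(int cp1cond, int tf)
{
    return (cp1cond != 0) == (tf != 0);
}

/* Hypothetical model: branch target from the 16-bit offset field. */
uint32_t bc1_target(uint32_t branch_pc, uint16_t offset)
{
    /* sign-extend the 16-bit offset */
    int32_t disp = (offset & 0x8000u) ? (int32_t)offset - 0x10000 : (int32_t)offset;
    /* shift left two bits and add to the address of the delay-slot instruction */
    return branch_pc + 4u + ((uint32_t)disp << 2);
}
```

An offset field of 0xFFFF (that is, -1) therefore branches back to the branch instruction itself.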

Exceptions:
Coprocessor unusable exception
Floating-Point exception

Floating-Point Exceptions:
Unimplemented operation exception

BC1T               Branch On Floating-Point True

Bits:    31-26     25-21    20-18    17     16     15-0
Field:   COP1      BC       cc       nd     tf     offset
Value:   010001    01000             0      1
Width:   6         5        3        1      1      16

Format:
BC1T offset        (cc = 0 implied)

Description:
A branch target address is computed from the sum of the address of the instruction in the delay
slot, and the 16-bit offset, shifted left two bits and sign-extended. If the contents of the floating point
condition code specified by cc are one (equal to the value of the tf field), the target address is
branched to with a delay of one instruction.
The condition codes are set by the floating-point compare instruction.
MIPS I specifies a single floating-point condition that is available as the coprocessor 1 condition
signal (Cp1Cond) and the C bit in the FP Control and Status register. This instruction always tests
the Cp1Cond signal. The first assembler format instruction shown, with an implied cc field of zero,
is the only form allowed for processors that implement the MIPS I instruction set.
This instruction has a scheduling restriction. The condition information is sampled during the
preceding instruction and there must be at least one instruction between this branch instruction
and the compare instruction that changes the condition code. Hardware does not enforce this
restriction.

Operation:
MIPS I has a single condition signal, the Coprocessor Condition signal COC.
T–1:  condition ← COC[1] = tf
T:    target ← (offset_15)^{GPRlen-(16+2)} || offset || 0^2
T+1:  if condition then
          PC ← PC + target
      endif

Exceptions:
Coprocessor unusable exception
Floating-Point exception

Floating-Point Exceptions:
Unimplemented operation exception

C.cond.fmt         Floating-Point Compare

Bits:    31-26     25-21    20-16    15-11    10-8    7-6    5-4    3-0
Field:   COP1      fmt      ft       fs       cc      0      1      cond
Value:   010001                                       00     11
Width:   6         5        5        5        3       2      2      4

Format:
C.cond.fmt fs, ft        (cc = 0 implied)

Description:
The contents of the floating-point registers specified by fs and ft are interpreted in the specified
format and arithmetically compared. A result is determined based on the comparison and the
condition, cond, specified in the instruction. The result is stored in the condition code specified by
cc. If one of the values is a “Not a Number,” and the high-order bit of the condition field is set, an
invalid operation trap is taken.
MIPS I specifies a single floating-point condition that is available as the coprocessor 1 condition
signal (Cp1Cond) and as the C bit in the FP Control and Status register. This instruction always sets
the Cp1Cond signal. The first assembler format instruction shown, with an implied cc field of zero,
is the only form allowed for processors that implement the MIPS I instruction set.
Comparisons are exact and neither overflow nor underflow. Four mutually exclusive relations are
possible results: “less than,” “equal,” “greater than,” and “unordered.” The last case arises when
one or both of the operands are NaN; every NaN compares “unordered” with everything,
including itself. Comparisons ignore the sign of zero, so +0 “equals” -0.
This instruction has a timing restriction. The contents of the destination condition code specified
by cc, or the Cp1Cond signal, are immediately available only within the floating-point unit. A
one-instruction delay is provided to propagate the condition code to the remainder of the processor.
The value of the condition code is undefined during this one-instruction delay. No hardware
interlock is provided to detect this hazard.
The implication for compiler code scheduling is that a compare instruction may be immediately
followed by a dependent floating-point conditional move instruction, but may not be immediately
followed by a dependent branch on floating-point coprocessor condition instruction or a
dependent integer conditional move instruction. Note that this restriction applies only to the
particular condition code specified by cc; the other condition codes are unaffected.
The fields fs and ft must specify valid operand registers for the type fmt and the logical size of
coprocessor 1 general registers. If they are not valid specifiers, the result is undefined.


Operation:
if NaN(ValueFPR(fs, fmt)) or NaN(ValueFPR(ft, fmt)) then
    less ← false
    equal ← false
    unordered ← true
    if cond_3 then
        signal InvalidOperationException
    endif
else
    less ← ValueFPR(fs, fmt) < ValueFPR(ft, fmt)
    equal ← ValueFPR(fs, fmt) = ValueFPR(ft, fmt)
    unordered ← false
endif
condition ← (cond_2 and less) or (cond_1 and equal) or
            (cond_0 and unordered)
COC[1] ← condition
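
The same predicate evaluation can be written as a short C sketch. The name fp_compare is hypothetical, the host's double type stands in for the FP operand, and the invalid-operation signal is reported through a flag rather than a trap.

```c
/* Hypothetical model of C.cond.fmt: returns the condition (0 or 1) and sets
 * *invalid when the Invalid operation exception would be signalled. */
int fp_compare(double fs, double ft, unsigned cond, int *invalid)
{
    int less, equal, unordered;

    *invalid = 0;
    if (fs != fs || ft != ft) {        /* a NaN never compares equal to itself */
        less = 0;
        equal = 0;
        unordered = 1;
        if (cond & 8u)                 /* cond bit 3: signal on unordered */
            *invalid = 1;
    } else {
        less = fs < ft;
        equal = fs == ft;
        unordered = 0;
    }
    return ((cond & 4u) && less) || ((cond & 2u) && equal) ||
           ((cond & 1u) && unordered);
}
```

With this numbering, cond = 4 tests “less than” quietly and cond = 12 tests it while signalling on a NaN operand.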

Exceptions:
Coprocessor unusable exception
Floating-Point exception

Floating-Point Exceptions:
Unimplemented operation exception
Invalid operation exception

CFC1               Move Control Word from Floating-Point (CP1)

Bits:    31-26     25-21    20-16    15-11    10-0
Field:   COP1      CF       rt       fs       0
Value:   010001    00010                      000 0000 0000
Width:   6         5        5        5        11
Format:
CFC1 rt, fs

Description:
The contents of the FPU control register fs are loaded into general register rt.
This operation is only defined when fs equals 0 or 31.
The contents of general register rt are undefined for time T of the instruction immediately following
this load instruction.

Operation:
T:    temp ← FCR[fs]
T+1:  GPR[rt] ← (temp_31)^{GPRlen-32} || temp

Exceptions:
Coprocessor unusable exception

CTC1               Move Control Word to Floating-Point (CP1)

Bits:    31-26     25-21    20-16    15-11    10-0
Field:   COP1      CT       rt       fs       0
Value:   010001    00110                      000 0000 0000
Width:   6         5        5        5        11

Format:
CTC1 rt, fs

Description:
The contents of general register rt are loaded into FPU control register fs. This operation is only
defined when fs equals 0 or 31.
Writing to Control Register 31, the floating-point Control/Status register, causes an interrupt or
exception if any cause bit and its corresponding enable bit are both set. The register will be written
before the exception occurs. The contents of floating-point control register fs are undefined for time
T of the instruction immediately following this load instruction.

Operation:
T:    temp ← GPR[rt]_31...0
T+1:  FCR[fs] ← temp
      COC[1] ← FCR[31]_23
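
The cause/enable test described above can be sketched in C. The bit positions are an assumption not stated on this page: cause bits in FCR31 bits 17-12 (E, V, Z, O, U, I), enable bits in bits 11-7 (V, Z, O, U, I), with the unimplemented-operation cause bit E having no enable and always trapping. Only the condition bit, bit 23, is taken from the Operation section. The function names are hypothetical.

```c
#include <stdint.h>

/* Hypothetical check: would writing this value to FCR31 raise an exception?
 * ASSUMED layout: cause = bits 17-12, enables = bits 11-7. */
int fcr31_write_traps(uint32_t fcr31)
{
    uint32_t cause  = (fcr31 >> 12) & 0x3Fu;   /* E V Z O U I */
    uint32_t enable = (fcr31 >> 7) & 0x1Fu;    /*   V Z O U I */
    int unimplemented = (cause & 0x20u) != 0u; /* E has no enable bit (assumption) */
    return unimplemented || ((cause & 0x1Fu) & enable) != 0u;
}

/* Condition bit copied to COC[1], as in the Operation section. */
int fcr31_condition(uint32_t fcr31)
{
    return (int)((fcr31 >> 23) & 1u);
}
```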

Exceptions:
Coprocessor unusable exception
Floating-Point exception

Floating-Point Exceptions:
Unimplemented operation exception
Invalid operation exception
Division by zero exception
Inexact exception
Overflow exception
Underflow exception

CVT.D.fmt          Floating-Point Convert to Double Floating-Point Format

Bits:    31-26     25-21    20-16    15-11    10-6     5-0
Field:   COP1      fmt      0        fs       fd       CVT.D
Value:   010001             00000                      100001
Width:   6         5        5        5        5        6

Format:
CVT.D.fmt fd, fs

Description:
The contents of the floating-point register specified by fs are interpreted in the specified source
format, fmt, and arithmetically converted to the double floating-point format. The result is placed
in the floating-point register specified by fd.
This instruction is valid only for conversions from single floating-point format or 32-bit fixed-point
format.
If fmt specifies the single floating-point or single fixed-point format then the operation is exact.
The fields fs and fd must specify valid operand registers given the logical size of coprocessor 1
general registers; fs for the type fmt and fd for double floating-point. If they are not valid specifiers,
the result is undefined.

Operation:
T:

StoreFPR (fd, D, ConvertFmt(ValueFPR(fs, fmt), fmt, D))

Exceptions:
Coprocessor unusable exception
Floating-Point exception

Floating-Point Exceptions:
Invalid operation exception
Unimplemented operation exception
Inexact exception
Overflow exception
Underflow exception

CVT.S.fmt          Floating-Point Convert to Single Floating-Point Format

Bits:    31-26     25-21    20-16    15-11    10-6     5-0
Field:   COP1      fmt      0        fs       fd       CVT.S
Value:   010001             00000                      100000
Width:   6         5        5        5        5        6

Format:
CVT.S.fmt fd, fs

Description:
The contents of the floating-point register specified by fs are interpreted in the specified source
format, fmt, and arithmetically converted to the single binary floating-point format. The result is
placed in the floating-point register specified by fd. Rounding occurs according to the currently
specified rounding mode.
This instruction is valid only for conversions from double floating-point format, or from 32-bit
fixed-point format.
The fields fs and fd must specify valid operand registers given the logical size of coprocessor 1
general registers: fs for the type fmt and fd for single floating-point. If they are not valid specifiers,
the result is undefined.

Operation:
T:

StoreFPR(fd, S, ConvertFmt(ValueFPR(fs, fmt), fmt, S))

Exceptions:
Coprocessor unusable exception
Floating-Point exception

Floating-Point Exceptions:
Invalid operation exception
Unimplemented operation exception
Inexact exception
Overflow exception
Underflow exception

CVT.W.fmt
Floating-Point Convert to Fixed-Point Format

Bits:   31-26    25-21   20-16   15-11   10-6   5-0
Field:  COP1     fmt     0       fs      fd     CVT.W
Value:  010001           00000                  100100
Width:  6        5       5       5       5      6

Format:
CVT.W.fmt fd, fs

Description:
The contents of the floating-point register specified by fs are interpreted in the specified source
format, fmt, and arithmetically converted to the single-word fixed-point format. The result is
placed in the floating-point register specified by fd.
This instruction is valid only for conversion from the single- or double-precision floating-point
formats.
The fields fs and fd must specify valid operand registers given the logical size of coprocessor 1
general registers: fs for the type fmt and fd for single-word fixed-point. If they are not valid
specifiers, the result is undefined.
When the source operand is an Infinity or NaN, or the correctly rounded integer result is outside
the range of the single-word fixed-point result type (–2^31 to 2^31 – 1), the Invalid operation exception
is raised. If the Invalid operation is not enabled then no exception is taken and the largest positive
result (2^31 – 1) is returned.

Operation:
T:

StoreFPR(fd, W, ConvertFmt(ValueFPR(fs, fmt), fmt, W))

Exceptions:
Coprocessor unusable exception
Floating-Point exception

Floating-Point Exceptions:
Invalid operation exception
Unimplemented operation exception
Inexact exception
Overflow exception
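
The out-of-range rule described for CVT.W.fmt can be illustrated with a small host-side model. This is only a sketch under stated assumptions: the function name cvt_w_d_model is hypothetical, the Invalid operation trap is assumed to be disabled, and the rounding mode is assumed to be round-toward-zero (which is what a C cast performs); it is not a description of the hardware implementation.

```c
#include <stdint.h>

/* Hypothetical model of CVT.W.D with the Invalid operation trap disabled.
 * Assumption: round-toward-zero rounding mode, matching a C cast. */
int32_t cvt_w_d_model(double x)
{
    /* A NaN compares unequal to itself; infinities fail the range test.
     * The truncated result fits only if -2^31 - 1 < x < 2^31. */
    if (x != x || x >= 2147483648.0 || x <= -2147483649.0)
        return INT32_MAX;   /* default result: 2^31 - 1 */
    return (int32_t)x;      /* in range: truncate toward zero */
}
```

With these assumptions, an operand such as 1e10 yields the largest positive word value rather than a wrapped result.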

DIV.fmt
Floating-Point Divide

Bits:   31-26    25-21   20-16   15-11   10-6   5-0
Field:  COP1     fmt     ft      fs      fd     DIV
Value:  010001                                  000011
Width:  6        5       5       5       5      6

Format:
DIV.fmt fd, fs, ft

Description:
fd ← fs / ft
The contents of the floating-point registers specified by fs and ft are interpreted in the specified
format and fs is arithmetically divided by ft. The result is rounded as if calculated to infinite
precision and then rounded to the specified format, according to the current rounding mode. The
result is placed in the floating-point register specified by fd.
This instruction is valid only for single- or double-precision floating-point formats.
The fields fs, ft, and fd must specify valid operand registers, given the logical size of coprocessor 1
general registers, for the type fmt. If they are not valid specifiers, the result is undefined.

Operation:
T:

StoreFPR (fd, fmt, ValueFPR(fs, fmt)/ValueFPR(ft, fmt))

Exceptions:
Coprocessor unusable exception
Floating-Point exception

Floating-Point Exceptions:
Unimplemented operation exception
Division-by-zero exception
Overflow exception

Invalid operation exception
Inexact exception
Underflow exception
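
The field layout of the COP1 arithmetic instructions can be exercised with a short encoder. This is a minimal sketch, not part of any toolchain: encode_cop1 is a hypothetical helper, and the fmt field values (16 for single, 17 for double, 20 for word) are taken from the standard MIPS encoding rather than from this page.

```c
#include <stdint.h>

/* fmt field values: an assumption based on the standard MIPS encoding. */
enum { FMT_S = 16, FMT_D = 17, FMT_W = 20 };

/* Hypothetical helper: pack a COP1 instruction word from its fields.
 * Layout: COP1(31-26) fmt(25-21) ft(20-16) fs(15-11) fd(10-6) func(5-0). */
uint32_t encode_cop1(unsigned fmt, unsigned ft, unsigned fs,
                     unsigned fd, unsigned func)
{
    return (0x11u << 26) | ((fmt & 31u) << 21) | ((ft & 31u) << 16) |
           ((fs & 31u) << 11) | ((fd & 31u) << 6) | (func & 63u);
}
```

For example, encode_cop1(FMT_S, 4, 2, 0, 0x03) builds the word for div.s $f0, $f2, $f4; two-operand instructions such as MOV.fmt pass 0 for ft.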

LWC1
Load Word to Floating-Point (CP1)

Bits:   31-26    25-21   20-16   15-0
Field:  LWC1     base    ft      offset
Value:  110001
Width:  6        5       5       16

Format:
LWC1 ft, offset(base)

Description:
The 16-bit offset is sign-extended and added to the contents of general register base to form an
unsigned effective address. The contents of the word at the memory location specified by the
effective address are loaded into floating-point (coprocessor 1) general register ft.
The effective address must be word-aligned. If either of the two least-significant bits of the effective
address is non-zero, an address error exception occurs.

MFC1
Move Word from Floating-Point (CP1)

Bits:   31-26    25-21   20-16   15-11   10-0
Field:  COP1     MF      rt      fs      0
Value:  010001   00000                   000 0000 0000
Width:  6        5       5       5       11

Format:
MFC1 rt, fs

Description:
The contents of register fs from the floating-point coprocessor are stored into processor register rt.
The contents of register rt are undefined for time T of the instruction immediately following this
load instruction.

Operation:
T:      data ← CPR[1, fs]
T+1:    GPR[rt] ← data

Exceptions:
Coprocessor unusable exception

MOV.fmt
Floating-Point Move

Bits:   31-26    25-21   20-16   15-11   10-6   5-0
Field:  COP1     fmt     0       fs      fd     MOV
Value:  010001           00000                  000110
Width:  6        5       5       5       5      6

Format:
MOV.fmt fd, fs

Description:
fd ← fs
The contents of the FPU register specified by fs are interpreted in the specified format and are copied
into the FPU register specified by fd.
The move is non-arithmetic; it causes no IEEE 754 exceptions.
This instruction is valid only for single- or double-precision floating-point formats.
The fields fs and fd must specify valid operand registers for the type fmt and the logical size of
coprocessor 1 general registers. If they are not valid specifiers, the result is undefined.

Operation:
T:

StoreFPR(fd, fmt, ValueFPR(fs, fmt))

Exceptions:
Coprocessor unusable exception
Floating-Point exception

Floating-Point Exceptions:
Unimplemented operation exception

MTC1
Move Word to Floating-Point (CP1)

Bits:   31-26    25-21   20-16   15-11   10-0
Field:  COP1     MT      rt      fs      0
Value:  010001   00100                   000 0000 0000
Width:  6        5       5       5       11

Format:
MTC1 rt, fs

Description:
The contents of register rt are loaded into the FPU general register at location fs.
The contents of floating-point register fs are undefined for time T of the instruction immediately
following this load instruction.

Operation:
T:      data ← GPR[rt]
T+1:    CPR[1, fs] ← data

Exceptions:
Coprocessor unusable exception
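
MTC1 and MFC1 move a 32-bit pattern unchanged; no numeric conversion takes place (that is the job of the CVT instructions). The following sketch models this on a host, assuming the host float type is the 32-bit IEEE 754 single format; the function names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical model of MTC1: copy 32 bits from an integer register
 * into an FP register without interpreting them. */
float mtc1_model(uint32_t gpr_bits)
{
    float fpr;
    memcpy(&fpr, &gpr_bits, sizeof fpr);      /* bit copy, no conversion */
    return fpr;
}

/* Hypothetical model of MFC1: the same copy in the other direction. */
uint32_t mfc1_model(float fpr)
{
    uint32_t gpr_bits;
    memcpy(&gpr_bits, &fpr, sizeof gpr_bits);
    return gpr_bits;
}
```

Moving the integer 2 with MTC1 therefore does not produce the value 2.0; a CVT.S.W or CVT.D.W must follow to obtain the floating-point equivalent.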

MUL.fmt
Floating-Point Multiply

Bits:   31-26    25-21   20-16   15-11   10-6   5-0
Field:  COP1     fmt     ft      fs      fd     MUL
Value:  010001                                  000010
Width:  6        5       5       5       5      6

Format:
MUL.fmt fd, fs, ft

Description:
fd ← fs × ft
The contents of the floating-point registers specified by fs and ft are interpreted in the specified
format and arithmetically multiplied. The result is rounded as if calculated to infinite precision and
then rounded to the specified format, according to the current rounding mode. The result is placed
in the floating-point register specified by fd.
This instruction is valid only for single- or double-precision floating-point formats.
The fields fs, ft, and fd must specify valid operand registers, given the logical size of coprocessor 1
general registers, for the type fmt. If they are not valid specifiers, the result is undefined.

Operation:
T:

StoreFPR (fd, fmt, ValueFPR(fs, fmt) * ValueFPR(ft, fmt))

Exceptions:
Coprocessor unusable exception
Floating-Point exception

Floating-Point Exceptions:
Unimplemented operation exception
Invalid operation exception
Inexact exception
Overflow exception
Underflow exception

NEG.fmt
Floating-Point Negate

Bits:   31-26    25-21   20-16   15-11   10-6   5-0
Field:  COP1     fmt     0       fs      fd     NEG
Value:  010001           00000                  000111
Width:  6        5       5       5       5      6

Format:
NEG.fmt fd, fs

Description:
fd ← - fs
The contents of the FPU register specified by fs are interpreted in the specified format and the
arithmetic negation is taken (polarity of the sign-bit is changed). The result is placed in the FPU
register specified by fd.
The negate operation is arithmetic; a NaN operand signals invalid operation.
The fields fs and fd must specify valid operand registers for the type fmt and the logical size of
coprocessor 1 general registers. If they are not valid specifiers, the result is undefined.

Operation:
T:

StoreFPR(fd, fmt, Negate(ValueFPR(fs, fmt)))

Exceptions:
Coprocessor unusable exception
Floating-Point exception

Floating-Point Exceptions:
Unimplemented operation exception
Invalid operation exception

SUB.fmt
Floating-Point Subtract

Bits:   31-26    25-21   20-16   15-11   10-6   5-0
Field:  COP1     fmt     ft      fs      fd     SUB
Value:  010001                                  000001
Width:  6        5       5       5       5      6

Format:
SUB.fmt fd, fs, ft

Description:
fd ← fs – ft
The contents of the floating-point registers specified by fs and ft are interpreted in the specified
format and ft is arithmetically subtracted from fs. The result is rounded as if calculated to infinite precision and
then rounded to the specified format, according to the current rounding mode. The result is placed
in the floating-point register specified by fd.
This instruction is valid only for single- or double-precision floating-point formats.
The fields fs, ft, and fd must specify valid operand registers, given the logical size of coprocessor 1
general registers, for the type fmt. If they are not valid specifiers, the result is undefined.

Operation:
T:

StoreFPR (fd, fmt, ValueFPR(fs, fmt) – ValueFPR(ft, fmt))

Exceptions:
Coprocessor unusable exception
Floating-Point exception

Floating-Point Exceptions:
Unimplemented operation exception
Invalid operation exception
Inexact exception
Overflow exception
Underflow exception

SWC1
Store Word from Floating-Point (CP1)

Bits:   31-26    25-21   20-16   15-0
Field:  SWC1     base    ft      offset
Value:  111001
Width:  6        5       5       16

Format:
SWC1 ft, offset(base)

Description:
The 16-bit offset is sign-extended and added to the contents of general register base to form an
unsigned effective address. The word from floating-point (coprocessor 1) general register ft is
stored at the memory location specified by the effective address.
The effective address must be word-aligned. If either of the two least-significant bits of the effective
address is non-zero, an address error exception occurs.

Operation:
T:

vAddr ← ((offset_15)^(GPRlen-16) || offset_15..0) + GPR[base]
(pAddr, uncached) ← AddressTranslation (vAddr, DATA)
data ← CPR[1, ft]
StoreMemory (uncached, WORD, data, pAddr, vAddr, DATA)

Exceptions:
Coprocessor unusable exception
TLB refill exception
TLB invalid exception
TLB modification exception
Bus error exception
Address error exception
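
The address calculation shared by LWC1 and SWC1 (sign-extend the 16-bit offset, add it to the base register, then require word alignment) can be sketched as follows. The function names are hypothetical and a 32-bit general register length is assumed.

```c
#include <stdint.h>

/* Hypothetical helper: form the effective address used by LWC1/SWC1.
 * Assumption: 32-bit general registers. */
uint32_t effective_address(uint32_t base, uint16_t offset)
{
    /* Sign-extend bit 15 of the offset to 32 bits. */
    int32_t sext = (int32_t)(offset ^ 0x8000) - 0x8000;
    return base + (uint32_t)sext;          /* wraps modulo 2^32 */
}

/* An address error exception is taken unless the two low bits are zero. */
int is_word_aligned(uint32_t vaddr)
{
    return (vaddr & 3u) == 0;
}
```

An offset field of 0xFFFC therefore addresses the word four bytes below the base register, not 65532 bytes above it.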

FPA Instruction Set Summary

op (31-26)  Fields     Asm       Description

Single Precision Arithmetic Instructions
17          fs2, fs1   add.s     3-operand single-precision add.
17          fs2, fs1   sub.s     3-operand single-precision subtraction.
17          fs2, fs1   mul.s     3-operand single-precision multiply.
17          fs2, fs1   div.s     3-operand single-precision divide.
17                     abs.s     Single-precision absolute value of fs is placed into fd.
17                     mov.s     Single-precision move of value in fs into fd.

Format Conversion
17                     cvt.s.d   Convert a double-precision value to single precision.
17                     cvt.d.s   Convert a single-precision value to double precision.
17                     cvt.s.w   Convert an integer "word" value to single precision.
17                     cvt.d.w   Convert an integer "word" value to double precision.
17                     cvt.w.s   Convert a single-precision value to a word value.
17                     cvt.w.d   Convert a double-precision value to a word value.

Single Precision Comparison Operations:
No Invalid Operation Exception taken for Unordered Operands
17          fs1, fs2   c.f.s     Result will be false.
17          fs1, fs2   c.un.s    True if fp values are "unordered".
17          fs1, fs2   c.eq.s    True if the two values are equal.
17          fs1, fs2   c.ueq.s   True if equal or unordered.
17          fs1, fs2   c.olt.s   True if ordered and less than.
17          fs1, fs2   c.ult.s   True if unordered or less than.
17          fs1, fs2   c.ole.s   True if ordered and (equal or less than).
17          fs1, fs2   c.ule.s   True if unordered or less than or equal.

Single Precision Comparison Operations:
Invalid Operation Exception Signalled for Unordered Operands
17          fs1, fs2   c.sf.s    Result will be false.
17          fs1, fs2   c.ngle.s  True if fp values are "unordered".
17          fs1, fs2   c.seq.s   True if the two values are equal.
17          fs1, fs2   c.ngl.s   True if equal or unordered.
17          fs1, fs2   c.lt.s    True if less than.
17          fs1, fs2   c.nge.s   True if unordered or less than.
17          fs1, fs2   c.le.s    True if equal or less than.
17          fs1, fs2   c.ngt.s   True if unordered or less than or equal.

Double Precision Comparison Operations:
No Invalid Operation Exception taken for Unordered Operands
17          fs1, fs2   c.f.d     Result will be false.
17          fs1, fs2   c.un.d    True if fp values are "unordered".
17          fs1, fs2   c.eq.d    True if the two values are equal.
17          fs1, fs2   c.ueq.d   True if equal or unordered.
17          fs1, fs2   c.olt.d   True if ordered and less than.
17          fs1, fs2   c.ult.d   True if unordered or less than.
17          fs1, fs2   c.ole.d   True if ordered and (equal or less than).
17          fs1, fs2   c.ule.d   True if unordered or less than or equal.

Double Precision Comparison Operations:
Invalid Operation Exception Signalled for Unordered Operands
17          fs1, fs2   c.sf.d    Result will be false.
17          fs1, fs2   c.ngle.d  True if fp values are "unordered".
17          fs1, fs2   c.seq.d   True if the two values are equal.
17          fs1, fs2   c.ngl.d   True if equal or unordered.
17          fs1, fs2   c.lt.d    True if less than.
17          fs1, fs2   c.nge.d   True if unordered or less than.
17          fs1, fs2   c.le.d    True if equal or less than.
17          fs1, fs2   c.ngt.d   True if unordered or less than or equal.

Double Precision Arithmetic Instructions
17          fs2, fs1   add.d     3-operand double-precision add.
17          fs2, fs1   sub.d     3-operand double-precision subtraction.
17          fs2, fs1   mul.d     3-operand double-precision multiply.
17          fs2, fs1   div.d     3-operand double-precision divide.
17                     abs.d     Double-precision absolute value of fs is placed into fd.
17                     mov.d     Double-precision move of value in fs into fd.

Data Movement Operations
49          offset     lwc1      Load word to FPA.
57          offset     swc1      Store word from FPA.

CP0 OPERATION REFERENCE

APPENDIX C

Integrated Device Technology, Inc.

CP0 Operation Details
This section documents the operations for the on-chip CP0 in R30xx
family processors. It contains a detailed description for each instruction in
alphabetic order.

MMU Operations
Most of the CP0 operations are designed to manage the on-chip TLB of
“E” versions of the family. Instructions are provided to read, write, and
probe the TLB.

Exception Operations
A single instruction is provided to support exception operation: the rfe
instruction restores the proper Interrupt Enable and Kernel/User mode
bits of the status register on return from exception.

Register Movement Operations
The standard mtc0, ctc0, mfc0, and cfc0 operations were described in
Appendix A.

Operation Descriptions
The CP0 instructions are described in detail in alphabetic order. Each
page contains the following information for the instruction:
• Instruction mnemonic and name
• Assembler format
• Description of the instruction
• Operation of the instruction described in pseudocode.
• Exceptions that the instruction can cause

RFE
Restore from Exception

Bits:   31-26    25    24-6    5-0
Field:  COP0     CO    0       RFE
Value:  010000   1             010000
Width:  6        1     19      6

Format:
RFE

Description:
RFE restores the “previous” interrupt enable mask bit and kernel/user mode bit (IEp and KUp) of
the Status Register into the corresponding “current” status bits (IEc and KUc), and restores the
“old” Status bits (IEo and KUo) into the corresponding “previous” status bits (IEp and KUp). The
“old” status bits remain unchanged.
The MIPS architecture does not specify the operation of memory references associated with load/
store instructions immediately prior to an RFE instruction. Normally, the RFE instruction follows
in the delay slot of a JR instruction to restore the PC.

Operation:
T:

SR ← SR_31..4 || SR_5..2

Exceptions:
Coprocessor unusable exception
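
The effect of RFE on the Status Register can be modelled as a two-bit "pop" of the KU/IE stack held in SR bits 5..0. The sketch below is illustrative only; rfe_model is a hypothetical name, and the bit order KUo, IEo, KUp, IEp, KUc, IEc (bit 5 down to bit 0) is assumed from the operation shown above.

```c
#include <stdint.h>

/* Hypothetical model of RFE: SR <- SR[31..4] || SR[5..2].
 * Bits 5..4 (old) are left unchanged, bits 3..2 (previous) receive
 * the old pair, and bits 1..0 (current) receive the previous pair. */
uint32_t rfe_model(uint32_t sr)
{
    return (sr & ~0xFu) | ((sr >> 2) & 0xFu);
}
```

For example, a status value whose low six bits are 101100 becomes 101011 after RFE: the "old" pair is retained while the other two pairs shift down.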

TLBP
TLB Probe

Bits:   31-26    25    24-6    5-0
Field:  COP0     CO    0       TLBP
Value:  010000   1             001000
Width:  6        1     19      6

Format:
TLBP

Description:
The Index register is loaded with the address of the TLB entry whose contents match the contents
of the EntryHi register. If no TLB entry matches, the high-order bit of the Index register is set.
The architecture does not specify the operation of memory references associated with instructions
immediately after a TLBP instruction, nor is the operation specified if more than one TLB entry
matches.
This instruction is only valid for “E” versions of the R30xx family. Its result for members without
an on-chip TLB is undefined.

Operation:
T:

Index ← 1 || 0^31
for i in 0..63
    if (TLB[i]_63..44 = EntryHi_31..12) and (TLB[i]_8 or (TLB[i]_43..38 = EntryHi_11..6)) then
        Index ← 0^18 || i_5..0 || 0^8
    endif
endfor

Exceptions:
Coprocessor unusable exception
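
The probe operation can be written out as a host-side model, which may be useful when testing TLB management code off-target. This is a sketch under stated assumptions: each TLB entry is held as a 64-bit value laid out as in the pseudocode (VPN in bits 63..44, process ID in bits 43..38, global bit in bit 8), and tlbp_model is a hypothetical name. When several entries match, this model returns the last one, whereas the architecture leaves that case unspecified.

```c
#include <stdint.h>

#define TLB_ENTRIES 64

/* Hypothetical model of TLBP. Returns the new Index register value:
 * bit 31 set on a miss, otherwise the entry number in bits 13..8. */
uint32_t tlbp_model(const uint64_t tlb[TLB_ENTRIES], uint32_t entry_hi)
{
    uint32_t index = 0x80000000u;             /* Index <- 1 || 0^31 */
    uint32_t vpn = entry_hi >> 12;            /* EntryHi bits 31..12 */
    uint32_t pid = (entry_hi >> 6) & 0x3Fu;   /* EntryHi bits 11..6  */
    uint32_t i;

    for (i = 0; i < TLB_ENTRIES; i++) {
        uint32_t e_vpn = (uint32_t)(tlb[i] >> 44);          /* bits 63..44 */
        uint32_t e_pid = (uint32_t)(tlb[i] >> 38) & 0x3Fu;  /* bits 43..38 */
        int global = (int)((tlb[i] >> 8) & 1u);             /* bit 8       */

        if (e_vpn == vpn && (global || e_pid == pid))
            index = i << 8;                   /* 0^18 || i || 0^8 */
    }
    return index;
}
```

Software normally tests the sign of the returned Index value to distinguish a hit from a miss.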

TLBR
Read Indexed TLB Entry

Bits:   31-26    25    24-6    5-0
Field:  COP0     CO    0       TLBR
Value:  010000   1             000001
Width:  6        1     19      6

Format:
TLBR

Description:
The EntryHi and EntryLo registers are loaded with the contents of the TLB entry pointed at by the
contents of the TLB Index register.
This operation is only valid for “E” version members of the R30xx family. Its result for members
without an on-chip TLB is undefined.

Operation:
T:

EntryHi ← TLB[Index_13..8]_63..32
EntryLo ← TLB[Index_13..8]_31..0

Exceptions:
Coprocessor unusable exception

TLBWI
Write Indexed TLB Entry

Bits:   31-26    25    24-6    5-0
Field:  COP0     CO    0       TLBWI
Value:  010000   1             000010
Width:  6        1     19      6

Format:
TLBWI

Description:
The TLB entry pointed at by the contents of the Index register is loaded with the contents of the
EntryHi and EntryLo registers.
This operation is only valid for “E” version members of the R30xx family. Its result for members
without an on-chip TLB is undefined.

Operation:
T:

TLB[Index_13..8] ← EntryHi || EntryLo

Exceptions:
Coprocessor unusable exception
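
The indexed write and the indexed read are inverse operations on the same 64-bit entry, selected by bits 13..8 of the Index register. A minimal host-side sketch, using hypothetical function names and an array of 64-bit values to stand in for the TLB:

```c
#include <stdint.h>

#define TLB_ENTRIES 64

/* Entry number held in bits 13..8 of the Index (or Random) register. */
unsigned tlb_slot(uint32_t index_reg)
{
    return (index_reg >> 8) & 0x3Fu;
}

/* Hypothetical model of TLBWI: TLB[Index] <- EntryHi || EntryLo. */
void tlbwi_model(uint64_t tlb[TLB_ENTRIES], uint32_t index_reg,
                 uint32_t entry_hi, uint32_t entry_lo)
{
    tlb[tlb_slot(index_reg)] = ((uint64_t)entry_hi << 32) | entry_lo;
}

/* Hypothetical model of TLBR: EntryHi and EntryLo <- TLB[Index]. */
void tlbr_model(const uint64_t tlb[TLB_ENTRIES], uint32_t index_reg,
                uint32_t *entry_hi, uint32_t *entry_lo)
{
    uint64_t e = tlb[tlb_slot(index_reg)];
    *entry_hi = (uint32_t)(e >> 32);
    *entry_lo = (uint32_t)e;
}
```

TLBWR behaves like tlbwi_model with the Random register supplying the entry number in place of Index.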

TLBWR
Write Random TLB Entry

Bits:   31-26    25    24-6    5-0
Field:  COP0     CO    0       TLBWR
Value:  010000   1             000110
Width:  6        1     19      6

Format:
TLBWR

Description:
The TLB entry pointed at by the contents of the Random register is loaded with the contents of the
EntryHi and EntryLo registers.
This operation is only valid for “E” version members of the R30xx family. Its result for members
without an on-chip TLB is undefined.

Operation:
T:

TLB[Random_13..8] ← EntryHi || EntryLo

Exceptions:
Coprocessor unusable exception


ASSEMBLER LANGUAGE
SYNTAX

APPENDIX D

Integrated Device Technology, Inc.

This appendix describes the assembler syntax valid for most R30xx
assemblers.
The compiler-dir directives in the syntax are for use by compilers only,
and they are not described in this book.
statement-list:
statement
statement statement-list
statement:
stat \n
stat ;
stat:
label
label instruction
label data
instruction
data
symdef
directive
label:
identifier :
decimal :
identifier:
[A-Za-z.$_][A-Za-z0-9.$_]*
instruction:
opcode
opcode operand
opcode operand , operand
opcode operand , operand , operand
opcode:
add
sub
etc.
operand:
register
( register )
addr-immed ( register )
addr-immed
float-register
float-const
register:
$decimal
float-register:
$fdecimal


addr-immed:
label-expr
label-expr + expr
label-expr - expr
expr
label-expr:
label-ref
label-ref - label-ref
label-ref:
numeric-ref
identifier
.
numeric-ref:
decimalf
decimalb
data:
data-mode data-list
.ascii string
.asciiz string
.space expr
data-mode:
.byte
.half
.hword
.word
.int
.long
.short
.float
.single
.double
.quad
.octa
data-list:
data-expr
data-list , data-expr
data-expr:
expr
float-const
expr : repeat
float-const : repeat
repeat:
expr
symdef:
constant-id = expr
constant-id:
identifier
directive:
set-dir
segment-dir

align-dir
symbol-dir
block-dir
compiler-dir
set-dir:
.set [no]volatile
.set [no]reorder
.set [no]at
.set [no]macro
.set [no]bopt
.set [no]move

segment-dir:
.text
.data
.rdata
.sdata
align-dir:
.align expr
symbol-dir:
.globl identifier
.extern identifier , constant
.comm identifier , constant
.lcomm identifier , constant
block-dir:
.ent identifier
.ent identifier , constant
.aent identifier , constant
.mask expr , expr
.fmask expr , expr
.frame register , expr , register
.end identifier
.end
compiler-dir:
.alias register , register
.bgnb expr
.endb expr
.file constant string
.galive
.gjaldef
.gjrlive
.lab identifier
.livereg expr , expr
.noalias register , register
.option flag
.verstamp constant constant
.vreg expr , expr
expr:
expr binary-op expr
term
term:
unary-operator term
primary
primary:

constant
( expr )

binary-op: one of
*
/
+
–
<<
>>
&
^
|

unary-operator: one of
+
–
~

constant:
decimal
hexadecimal
octal
character-const
constant-id
decimal:
[1-9][0-9]+
hexadecimal:
0x[0-9a-fA-F]+
0X[0-9a-fA-F]+

octal:
0[0-7]+

character-const:
’x’
string:
"xxxx"
float-const: for example
1.23  .23  0.23  1.  1.0  1.2e10  1.2e-15


OBJECT CODE FORMATS

APPENDIX E

Integrated Device Technology, Inc.

This appendix describes two object file formats that are often used in
MIPS development systems. Object files are created by the compiler and/
or assembler, and the link editor. An object file is a binary representation
of part or all of a program, and usually has two distinct forms:
• Relocatable object file : holds the code and data resulting from the
compilation of a single module, suitable for linking with other
relocatable object files to create an executable object file. A relocatable
file includes relocation information and symbol tables which allow the
link editor to combine the individual modules, and to patch (relocate)
instructions or data which depend on the program’s final location in
memory. Other parts of the file may encode information to support
symbolic debugging.
• Executable object file : holds a complete program, suitable for direct
execution by a CPU. This file will not include relocation information,
but may add a simple header which tells the operating system or
bootstrap loader where each part of the object file is to be located in
memory.
The software development system should be equipped with tools to allow
the programmer to inspect the contents of an object file, or to convert it
into alternative (possibly ASCII) formats which can be downloaded to a
PROM programmer or evaluation board. Common tools are described
below.

SECTIONS AND SEGMENTS
An object file consists of a number of separate sections: most correspond
to the program’s instructions and data, but some additional sections hold
information for linkers and debuggers. Each section has a name to identify
it (e.g. ".text" and ".rdata"), and a complete list of the standard program
sections recognized by the development toolchain should be included in its
documentation.
The reason for splitting the program up like this is so that the link editor
can then merge the different parts of the program that need to be located
together in memory (e.g. a ROMable program needs all code and read-only
data in ROM, but writable data in RAM). When the link editor produces the
final executable object file it concatenates all sections of the same name
together, and then further merges those sections which are located
together in memory into a smaller number of contiguous segments. An
object file header is prepended to identify the position of each segment in
the file, and its intended location in memory.

ECOFF OBJECT FILE FORMAT (RISC/OS)
The original MIPS Corp. compilers were Unix-based and until fairly
recently used the ECOFF object code format. Development systems from
other vendors often use or at least support inter-linking with this format,
in the interests of compatibility. ECOFF is based on an earlier format
called COFF, which stands for Common Object File Format, and first
appeared in early versions of Unix System V. COFF was a brave (and
largely unsuccessful) attempt to define a flexible object code format that
would be portable to a large number of processor architectures.
The ‘‘E’’ in ‘‘ECOFF’’ stands for Extended. The MIPS engineers wanted
the flexibility of COFF to support gp-relative addressing, which would have
been impossible with the restrictive format used on earlier Unix systems.
However they decided to replace the COFF symbol table and debug data


with a completely different design. The ECOFF symbol table format is
certainly much more powerful and compact than the rather primitive
COFF format, but it is also much more difficult to generate and interpret.
Fortunately, embedded system applications are unlikely to be
concerned with the internal structure of the symbol tables. The
programmer probably only needs to recognize the COFF file header and
"optional" a.out header, which are largely unchanged from the original
COFF definitions.

File header
The COFF file header consists of the following 20 bytes at the start of the
file:
Offset  Type            Name      Purpose
0       unsigned short  f_magic   Magic number (see below)
2       unsigned short  f_nscns   Number of sections
4       long            f_timdat  Time and date stamp (Unix style)
8       long            f_symptr  File offset of symbol table
12      long            f_nsyms   Number of symbols
16      unsigned short  f_opthdr  Size of optional header
18      unsigned short  f_flags   Various flag bits

From this list only the following fields are really important:
• f_magic : must be one of the following values:
Name          Value   Meaning
MIPSEBMAGIC   0x0160  Big-endian MIPS binary
MIPSELMAGIC   0x0162  Little-endian MIPS binary
SMIPSEBMAGIC  0x6001  Big-endian MIPS binary with little-endian headers
SMIPSELMAGIC  0x6201  Little-endian MIPS binary with big-endian headers
Object files with the SMIPS... magic numbers were generated on hosts of the opposite
endianness, and software will have to individually byte-swap each
field required from the file and a.out headers.
• f_opthdr : the size in the file of the a.out header: this value is used to
work out the program’s offset in the file.
• f_nscns : the number of section headers in the file: this is also needed
to work out the program’s offset.
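
The byte-swapping requirement for the SMIPS... magic numbers can be sketched as follows. The magic values are those listed above; the helper names swap16, swap32 and ecoff_headers_need_swap are hypothetical, and the fields are assumed to have been read into host integers without any conversion.

```c
#include <stdint.h>

#define MIPSEBMAGIC  0x0160
#define MIPSELMAGIC  0x0162
#define SMIPSEBMAGIC 0x6001
#define SMIPSELMAGIC 0x6201

/* Reverse the byte order of a 16-bit header field. */
uint16_t swap16(uint16_t v)
{
    return (uint16_t)((v << 8) | (v >> 8));
}

/* Reverse the byte order of a 32-bit header field. */
uint32_t swap32(uint32_t v)
{
    return (v << 24) | ((v & 0xFF00u) << 8) |
           ((v >> 8) & 0xFF00u) | (v >> 24);
}

/* Returns 0 if header fields can be used as read, 1 if each field must
 * be byte-swapped first, and -1 if f_magic is not a MIPS ECOFF magic. */
int ecoff_headers_need_swap(uint16_t f_magic)
{
    if (f_magic == MIPSEBMAGIC || f_magic == MIPSELMAGIC)
        return 0;
    if (f_magic == SMIPSEBMAGIC || f_magic == SMIPSELMAGIC)
        return 1;
    return -1;
}
```

Note that each SMIPS... value is simply the byte-swapped form of the corresponding native magic number, which is why a single 16-bit comparison is enough to detect the mismatch.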

Optional a.out header
The a.out header is a left-over from earlier Unix versions, which has
been shoe-horned into COFF. It follows the COFF file header, and does the
job of coalescing the COFF sections into exactly three contiguous
segments: text (instructions and read-only data); data (initialized, writable
data); and BSS (uninitialized data, set to zero).
Offset  Type   Name          Purpose
0       short  magic         Magic number
2       short  vstamp        Version stamp
4       long   tsize         Text size
8       long   dsize         Data size
12      long   bsize         BSS size
16      long   entry         Entry-point address
20      long   text_start    Text base address
24      long   data_start    Data base address
28      long   bss_start‡    BSS base address
32      long   gprmask‡      General registers "used" mask
36      long   cprmask[4]‡   Coprocessor registers used masks
52      long   gpvalue‡      GP value for this file

Those fields marked ‡ are new to ECOFF, and not found in the original
COFF definition.
The magic number in this structure does not specify the type of CPU,
but describes the layout of the object file, as follows:
Name    Value   Meaning
OMAGIC  0x0107  Text segment is writable
NMAGIC  0x0108  Text segment is read-only
ZMAGIC  0x010b  File is demand-pageable (not for embedded use)
The following macro shows how to calculate the file offset of the text segment. In words, and
ignoring ZMAGIC files, it is found after the COFF file header, a.out header
and COFF section headers, rounded up to the next 8 or 16 byte boundary
(depending on the compiler version).
#define FILHSZ sizeof(struct filehdr)
#define SCNHSZ /*sizeof(struct scnhdr)*/ 40
/* f is the COFF file header, a is the a.out header */
#define N_TXTOFF(f, a) \
    ((a).magic == ZMAGIC ? 0 : ((a).vstamp < 23 ? \
    ((FILHSZ + (f).f_opthdr + (f).f_nscns * SCNHSZ + 7) & ~7) : \
    ((FILHSZ + (f).f_opthdr + (f).f_nscns * SCNHSZ + 15) & ~15)))
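
As a worked example of the same calculation, the sketch below restates the rounding rule as a plain function using the 20-byte file header and 40-byte section header sizes given in this appendix. The function name ecoff_text_offset is hypothetical, and ZMAGIC files are ignored as in the text.

```c
#define COFF_FILHSZ 20ul   /* size of the COFF file header     */
#define COFF_SCNHSZ 40ul   /* size of each COFF section header */

/* Hypothetical helper: file offset of the text segment for OMAGIC and
 * NMAGIC files. Headers are rounded up to 8 bytes when vstamp < 23,
 * otherwise to 16 bytes. */
unsigned long ecoff_text_offset(unsigned long f_opthdr,
                                unsigned long f_nscns, int vstamp)
{
    unsigned long end = COFF_FILHSZ + f_opthdr + f_nscns * COFF_SCNHSZ;
    unsigned long align = (vstamp < 23) ? 8ul : 16ul;

    return (end + align - 1) & ~(align - 1);
}
```

With a 56-byte a.out header and three sections the headers end at byte 196, so the text segment starts at offset 200 for older compilers and 208 for newer ones.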

Example loader
The following code fragment draws together the above information to
implement a very simple-minded ECOFF file loader, as might be found in
a bootstrap PROM which can read files from disk or network. It returns the
entry-point address of the program, or zero on failure.
unsigned long load_ecoff (int fd)
{
    struct filehdr fh;
    struct aouthdr ah;

    /* read file header and check the magic number for this CPU */
    if (read (fd, &fh, sizeof (fh)) != sizeof (fh))
        return 0;
#ifdef MIPSEB
    if (fh.f_magic != MIPSEBMAGIC)
#else
    if (fh.f_magic != MIPSELMAGIC)
#endif
        return 0;

    /* read a.out header and check */
    if (read (fd, &ah, sizeof (ah)) != sizeof (ah))
        return 0;
    if (ah.magic != OMAGIC && ah.magic != NMAGIC)
        return 0;

    /* read text and data segments to their load addresses, and clear bss */
    lseek (fd, N_TXTOFF (fh, ah), SEEK_SET);
    if (read (fd, (void *) ah.text_start, ah.tsize) != ah.tsize ||
        read (fd, (void *) ah.data_start, ah.dsize) != ah.dsize)
        return 0;
    memset ((void *) ah.bss_start, 0, ah.bsize);
    return ah.entry;
}

Further reading
For more detailed information on the original COFF format, consult a
Unix System V.3 Programmer’s Guide. The ECOFF symbol table extensions
are not documented, but the header files which define it (which are
copyright of MIPS Corporation, now MTI) have now been made available for
re-use and redistribution. You’ll find copies with the rights documented in
recent versions of GNU binary utilities.

ELF (MIPS ABI)
ELF, which stands for Executable and Linking Format, is an attempt to
improve on COFF and define an object file format which supports a range
of different processors, while allowing vendor-specific extensions that do
not break compatibility with other tools. It first appeared in Unix System
V Release 4, and is used by recent versions of MIPS Corp compilers, and
some other development systems.
As in the examination of COFF, this manual will look only at the
minimum amount of the structure which is necessary to load an
executable file into memory.

File header
The ELF file header consists of 52 bytes at the start of the file, and
provides the means to determine the location of all the other parts of the
file. The following fields are relevant when loading an ELF file:
Offset  Type            Name         Purpose
0       unsigned char   e_ident[16]  File format identification
16      unsigned short  e_type       Type of object file
18      unsigned short  e_machine    CPU type
20      unsigned long   e_version    File format version
24      unsigned long   e_entry      Entry point address
28      unsigned long   e_phoff      Program header file offset
32      unsigned long   e_shoff      Section header file offset
36      unsigned long   e_flags      CPU-specific flags
40      unsigned short  e_ehsize     File header size
42      unsigned short  e_phentsize  Program header entry size
44      unsigned short  e_phnum      Number of program header entries
46      unsigned short  e_shentsize  Section header entry size
48      unsigned short  e_shnum      Number of section header entries
50      unsigned short  e_shstrndx   Section header string table index

• e_ident : contains machine-independent data to identify this as an
ELF file, and describe its layout. The individual bytes within it are as
follows:
Offset  Name        Expected Value  Purpose
0       EI_MAG0     ELFMAG0=0x7f    Magic number identifying an ELF file
1       EI_MAG1     ELFMAG1='E'
2       EI_MAG2     ELFMAG2='L'
3       EI_MAG3     ELFMAG3='F'
4       EI_CLASS    ELFCLASS32=1    Identifies file’s word size
5       EI_DATA     ELFDATA2LSB=1   Indicates little-endian headers and program
                    ELFDATA2MSB=2   Indicates big-endian headers and program
6       EI_VERSION  EV_CURRENT=1    Gives file format version number

• e_machine : Specifies the CPU type for which this file is intended,
selected from the values in the table below.
Obviously for this discussion the value should be EM_MIPS.
Name      Value  Meaning
EM_M32    1      AT&T WE32100
EM_SPARC  2      SPARC
EM_386    3      Intel 80386
EM_68K    4      Motorola 68000
EM_88K    5      Motorola 88000
EM_860    7      Intel 80860
EM_MIPS   8      MIPS R3000

• e_entry : The entry point address of the program.
• e_phoff : The file offset of the program header, which will be required
to load the program.
• e_phentsize : The size (in bytes) of each program header entry.
• e_phnum : The number of entries in the program header.

Program Header
Having verified the ELF file header, software will require the program
header. This part of the file contains a variable number of entries, each of
which specifies a segment to be loaded into memory. Each entry is at least
32 bytes long and has the following layout:
Offset  Type            Name       Purpose
0       unsigned long   p_type     Type of entry
4       unsigned long   p_offset   File offset of segment
8       unsigned long   p_vaddr    Virtual address of segment
12      unsigned long   p_paddr    Physical address of segment (unused)
16      unsigned long   p_filesz   Size of segment in file
20      unsigned long   p_memsz    Size of segment in memory
24      unsigned long   p_flags    Segment attribute flags
28      unsigned long   p_align    Segment alignment (power of 2)
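When the program header is examined on a machine whose byte order differs from that of the file (for example by a host-side tool), each 32-bit field has to be assembled from its bytes rather than read directly. The following sketch shows one way to do that. The names `sketch_phdr`, `sketch_get32` and `sketch_decode_phdr` are hypothetical, and `uint32_t` stands in for the 32-bit `unsigned long` of the table.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stdint.h>

/* One program header entry, with fields in host byte order. */
struct sketch_phdr {
    uint32_t p_type, p_offset, p_vaddr, p_paddr;
    uint32_t p_filesz, p_memsz, p_flags, p_align;
};

static_assert(sizeof(struct sketch_phdr) == 32, "entry is 32 bytes");

/* Assemble a 32-bit value from four bytes stored in the given order. */
uint32_t sketch_get32(const unsigned char *p, int big_endian)
{
    if (big_endian)
        return ((uint32_t) p[0] << 24) | ((uint32_t) p[1] << 16) |
               ((uint32_t) p[2] << 8) | p[3];
    return ((uint32_t) p[3] << 24) | ((uint32_t) p[2] << 16) |
           ((uint32_t) p[1] << 8) | p[0];
}

/* Decode the 32 raw bytes of one entry; big_endian is non-zero when the
   file's EI_DATA byte is ELFDATA2MSB. */
struct sketch_phdr sketch_decode_phdr(const unsigned char *raw, int big_endian)
{
    struct sketch_phdr ph;

    ph.p_type   = sketch_get32(raw + 0, big_endian);
    ph.p_offset = sketch_get32(raw + 4, big_endian);
    ph.p_vaddr  = sketch_get32(raw + 8, big_endian);
    ph.p_paddr  = sketch_get32(raw + 12, big_endian);
    ph.p_filesz = sketch_get32(raw + 16, big_endian);
    ph.p_memsz  = sketch_get32(raw + 20, big_endian);
    ph.p_flags  = sketch_get32(raw + 24, big_endian);
    ph.p_align  = sketch_get32(raw + 28, big_endian);
    return ph;
}
```

A loader running on the target itself does not need this step, since it only accepts files whose byte order matches its own.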

The relevant fields are as follows:
• p_type : Only entries marked with a type of PT_LOAD (1) should be
loaded; others can be safely ignored.
• p_offset : The absolute offset in the file of the start of this segment.
• p_vaddr : The virtual address in memory at which the segment should
be loaded.
• p_filesz : The size of the segment in the file; this may be zero.
• p_memsz : The size of the segment in memory. If this is greater than
p_filesz, then the extra bytes should be cleared to zero.
• p_flags : A bitmap giving read, write and execute permissions for the
segment. This is largely irrelevant for embedded systems, but does
allow the code segment to be identified.
Name    Value   Meaning
PF_X    0x1     Execute
PF_W    0x2     Write
PF_R    0x4     Read
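The p_flags bitmap can be decoded with simple mask tests. The sketch below is illustrative only: both function names are hypothetical, and treating a segment as code when it is executable and not writable is an assumption made here, not a rule stated by the ELF format.

```c
/* Flag bits from the p_flags table. */
enum { PF_X = 0x1, PF_W = 0x2, PF_R = 0x4 };

/* Writes a three-character permission string such as "r-x" (plus the
   terminating NUL) into out, and returns out. */
const char *sketch_flags_string(unsigned long p_flags, char out[4])
{
    out[0] = (p_flags & PF_R) ? 'r' : '-';
    out[1] = (p_flags & PF_W) ? 'w' : '-';
    out[2] = (p_flags & PF_X) ? 'x' : '-';
    out[3] = '\0';
    return out;
}

/* Assumption: a code segment is one that is executable and read-only. */
int sketch_is_code_segment(unsigned long p_flags)
{
    return (p_flags & PF_X) != 0 && (p_flags & PF_W) == 0;
}
```

With these definitions, a segment whose p_flags is PF_R | PF_X (0x5) is reported as "r-x" and identified as code, while one with PF_R | PF_W (0x6) is reported as "rw-".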

Example loader
The following code fragment draws together the above information to
implement a very simple-minded ELF file loader, as might be found in a
bootstrap PROM which can read files from disk or network. It returns the
entry-point address of the program, or zero on failure.
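/* Elf32_Ehdr, Elf32_Phdr and the constants used below are assumed to come
   from the toolchain's ELF header file, commonly <elf.h>. */
#include <string.h>     /* memset */
#include <unistd.h>     /* read, lseek */
#include <elf.h>

unsigned long load_elf (int fd)
{
    Elf32_Ehdr eh;
    Elf32_Phdr ph[16];
    int phsize;
    int i;

    /* read file header */
    if (read (fd, &eh, sizeof (eh)) != (int) sizeof (eh))
        return 0;

    /* check header validity */
    if (eh.e_ident[EI_MAG0] != ELFMAG0 ||
        eh.e_ident[EI_MAG1] != ELFMAG1 ||
        eh.e_ident[EI_MAG2] != ELFMAG2 ||
        eh.e_ident[EI_MAG3] != ELFMAG3 ||
        eh.e_ident[EI_CLASS] != ELFCLASS32 ||
#ifdef MIPSEB
        eh.e_ident[EI_DATA] != ELFDATA2MSB ||
#else
        eh.e_ident[EI_DATA] != ELFDATA2LSB ||
#endif
        eh.e_ident[EI_VERSION] != EV_CURRENT ||
        eh.e_machine != EM_MIPS)
        return 0;

    /* is there a program header of the right size? */
    if (eh.e_phoff == 0 || eh.e_phnum == 0 || eh.e_phnum > 16 ||
        eh.e_phentsize != sizeof (Elf32_Phdr))
        return 0;

    /* read program header */
    phsize = eh.e_phnum * eh.e_phentsize;
    lseek (fd, eh.e_phoff, SEEK_SET);
    if (read (fd, ph, phsize) != phsize)
        return 0;

    /* load each program segment */
    for (i = 0; i < eh.e_phnum; i++) {
        /* only PT_LOAD entries are loaded; others are ignored */
        if (ph[i].p_type != PT_LOAD)
            continue;
        if (ph[i].p_filesz > 0) {
            lseek (fd, ph[i].p_offset, SEEK_SET);
            if (read (fd, (void *) ph[i].p_vaddr, ph[i].p_filesz)
                != (int) ph[i].p_filesz)
                return 0;
        }
        /* clear any extra bytes, as described for p_memsz above */
        if (ph[i].p_filesz < ph[i].p_memsz)
            memset ((char *) ph[i].p_vaddr + ph[i].p_filesz, 0,
                    ph[i].p_memsz - ph[i].p_filesz);
    }

    /* entry-point address of the loaded program */
    return eh.e_entry;
}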
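The per-segment step of the loader (copy p_filesz bytes, then clear up to p_memsz) can be tried out on a development host by working on ordinary buffers instead of target addresses. The following is a sketch for that purpose only: the function name `sketch_place_segment` is hypothetical, and the caller is assumed to supply a destination buffer of at least memsz bytes.

```c
#include <stddef.h>
#include <string.h>

/* Copy filesz bytes of segment data from src to dst, then clear the
   remaining memsz - filesz bytes. Returns 0 on success, or -1 if memsz
   is smaller than filesz, in which case nothing is written. */
int sketch_place_segment(unsigned char *dst, const unsigned char *src,
                         size_t filesz, size_t memsz)
{
    if (memsz < filesz)
        return -1;
    if (filesz > 0)
        memcpy(dst, src, filesz);
    if (memsz > filesz)
        memset(dst + filesz, 0, memsz - filesz);
    return 0;
}
```

A segment with a p_filesz of zero and a non-zero p_memsz is simply cleared, which is how uninitialised data is typically represented.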
