R3000 Manual
Page Count: 354
IDT R30xx Family Software Reference Manual
Revision 1.0

© 1994 Integrated Device Technology, Inc.
Portions © 1994 Algorithmics, Ltd.
Chapter 16 contains some material that is © 1988 Prentice-Hall.
Appendices A & B contain material that is © 1994 by Mips Technology, Inc.

About IDT

Integrated Device Technology, Inc. has been a MIPS semiconductor partner since 1988, and has led efforts to bring the high performance inherent in the MIPS architecture to embedded systems engineers. These efforts include derivatives of MIPS R3xxx and R4xxx CPUs, development tools, and applications support.

Additional information about IDT’s RISC family can be obtained from your local sales representative. Alternatively, IDT can be reached directly at:

Corporate Marketing: (800) 345-7015
RISC Applications "Hotline": (408) 492-8208
RISC Applications FAX: (408) 492-8469
RISC Applications Internet: rischelp@idtinc.com

About Algorithmics

Much of this manual was written by Dominic Sweetman and Nigel Stephens of Algorithmics Ltd in London, England, under contract to IDT. Algorithmics were early enthusiasts for the MIPS architecture, designing their first MIPS systems and system software in 1986/87. A small engineering company, Algorithmics provide enabling technologies for companies designing in both R30xx family CPUs and the 64-bit R4x00 architecture. This includes training, toolkits, GNU C support, and evaluation boards.

Dominic Sweetman can be reached at the following address:

Dominic Sweetman
Algorithmics Ltd
3 Drayton Park
London N5 1NU
ENGLAND
phone: +44 71 700 3301
fax: +44 71 700 3400
email: dom@algor.co.uk

About This Manual

This manual is targeted to a systems programmer building an R30xx-based system. It contains the architecture-specific operations and programming conventions relevant to such a programmer.
This manual is not intended to be a tutorial on structured programming, real-time operating systems, any particular high-level programming language, or any particular toolchain. Other references are better suited to those topics.

This manual does contain specific code fragments and the most common programming conventions that are specific to the IDT R30xx RISController family. The manual was consciously limited to the R30xx family; information relevant to the R4xxx family of processors may be found, but the device-specific programs (such as cache management, exception handling, etc.) shown as examples are specific to the R30xx family.

This manual contains references to the toolchains most commonly used by the authors (IDT, Inc., and Algorithmics, Ltd.). Code fragments shown are typically from software used by and/or provided by these companies, including development tools such as IDT/c and software utilities (such as IDT/kit, IDT/sim, and Micromonitor). A wide variety of other third-party products is also available to support R30xx development, under the Advantage-IDT program. The reader of this manual is encouraged to look at all the available tools to determine which toolchains and utilities best fit the system development requirements.

Additional information on the IDT family of RISC processors, and their support tools, is available from your local IDT salesman.

Integrated Device Technology, Inc. reserves the right to make changes to its products or specifications at any time, without notice, in order to improve design or performance and to supply the best possible product. IDT does not assume any responsibility for use of any circuitry described other than the circuitry embodied in an IDT product. The Company makes no representations that circuitry described herein is free from patent infringement or other rights of third parties which may result from its use.
No license is granted by implication or otherwise under any patent, patent rights or other rights, of Integrated Device Technology, Inc.

LIFE SUPPORT POLICY

Integrated Device Technology's products are not authorized for use as critical components in life support devices or systems unless a specific written agreement pertaining to such intended use is executed between the manufacturer and an officer of IDT.

1. Life support devices or systems are devices or systems which (a) are intended for surgical implant into the body or (b) support or sustain life and whose failure to perform, when properly used in accordance with instructions for use provided in the labeling, can be reasonably expected to result in a significant injury to the user.
2. A critical component is any component of a life support device or system whose failure to perform can be reasonably expected to cause the failure of the life support device or system, or to affect its safety or effectiveness.

The IDT logo is a registered trademark and BiCameral, BurstRAM, BUSMUX, CacheRAM, DECnet, Double-Density, FASTX, Four-Port, FLEXI-CACHE, Flexi-PAK, Flow-thruEDC, IDT/c, IDTenvY, IDT/sae, IDT/sim, IDT/ux, MacStation, MICROSLICE, Orion, PalatteDAC, REAL8, R3041, R3051, R3052, R3081, R3721, R4600, RISCompiler, RISController, RISCore, RISC Subsystem, RISC Windows, SARAM, SmartLogic, SyncFIFO, SyncBiFIFO, SPC, TargetSystem and WideBus are trademarks of Integrated Device Technology, Inc. MIPS is a registered trademark of MIPS Computer Systems, Inc. All others are trademarks of their respective companies.

IDT R30xx Family Software Reference Manual

Table of Contents

Introduction ... 1
  What is a RISC? ... 1-1
  Pipelines ... 1-2
  The IDT R3xxx Family CPUs ... 1-3
  MIPS Architecture Levels ... 1-4
  MIPS-1 Compared with CISC Architectures ... 1-4
  Unusual Instruction Encoding Features ... 1-5
  Addressing and Memory Accesses ... 1-5
  Operations not Directly Supported ... 1-6
  Multiply and Divide Operations ... 1-7
  Programmer-visible Pipeline Effects ... 1-7
  A Note on Machine and Assembler Language ... 1-8

MIPS-1 (R30xx) Architecture ... 2
  Programmer’s View of the Processor Architecture ... 2-1
  Registers ... 2-1
  Conventional Names and Uses of General-Purpose Registers ... 2-2
  Notes on Conventional Register Names ... 2-2
  Integer Multiply Unit and Registers ... 2-3
  Instruction Types ... 2-4
  Loading and Storing: Addressing Modes ... 2-5
  Data types in Memory and Registers ... 2-6
  Integer Data Types ... 2-6
  Unaligned Loads and Stores ... 2-6
  Floating Point Data in Memory ... 2-7
  Basic Address Space ... 2-8
  Summary of System Addressing ... 2-9
  Kernel vs. User Mode ... 2-9
  Memory map for CPUs without MMU Hardware ... 2-10
  Subsegments in the R3041 – Memory Width Configuration ... 2-10

System Control Coprocessor Architecture ... 3
  CPU Control Summary ... 3-1
  CPU Control and “CO-PROCESSOR 0” ... 3-2
  CPU Control Instructions ... 3-2
  Standard CPU control registers ... 3-3
  PRId Register ... 3-4
  SR Register ... 3-4
  Cause Register ... 3-7
  EPC Register ... 3-8
  BadVaddr Register ... 3-8
  R3041, R3071, and R3081 Specific Registers ... 3-8
  Count and Compare Registers (R3041 only) ... 3-8
  Config Register (R3071 and R3081) ... 3-8
  Config Register (R3041) ... 3-9
  BusCtrl Register (R3041 only) ... 3-10
  PortSize Register (R3041 only) ... 3-11
  What registers are relevant when? ... 3-11

Exception Management ... 4
  Exceptions ... 4-1
  Precise Exceptions ... 4-1
  When Exceptions Happen ... 4-2
  Exception vectors ... 4-2
  Exception Handling – Basics ... 4-3
  Nesting Exceptions ... 4-4
  An Exception Routine ... 4-4
  Interrupts ... 4-12
  Conventions and Examples ... 4-14

Cache Management ... 5
  Caches and Cache Management ... 5-1
  Cache Isolation and Swapping ... 5-3
  Initializing and Sizing the Caches ... 5-4
  Invalidation ... 5-6
  Testing and Probing ... 5-8
  Configuration (R3041/71/81 only) ... 5-8
  Write Buffer ... 5-9
  Implementing wbflush() ... 5-10

Memory Management and the TLB ... 6
  Memory Management and the TLB ... 6-1
  MMU Registers Described ... 6-3
  EntryHi, EntryLo ... 6-3
  Index ... 6-4
  Random ... 6-4
  Context ... 6-4
  MMU Control Instructions ... 6-5
  Programming Interface to the TLB ... 6-5
  How Refill Happens ... 6-5
  Using ASIDs ... 6-6
  The Random Register and Wired Entries ... 6-6
  Memory Translation – Setup ... 6-6
  TLB Exception Sample Code ... 6-7
  Basic Exception Handler ... 6-7
  Fast kuseg Refill from Page Table ... 6-7
  Simulating Dirty Bits ... 6-8
  Use of TLB in Debugging ... 6-8
  TLB Management Utilities ... 6-9

Reset Initialization ... 7
  Starting Up ... 7-1
  Probing and Recognizing the CPU ... 7-4
  Bootstrap Sequences ... 7-5
  Starting Up an Application ... 7-5

Floating Point Coprocessor ... 8
  The IEEE754 Standard and its Background ... 8-1
  What is Floating Point? ... 8-2
  IEEE exponent field and bias ... 8-3
  IEEE mantissa and normalization ... 8-3
  Strange values use reserved exponent values ... 8-3
  MIPS FP Data formats ... 8-4
  MIPS Implementation of IEEE754 ... 8-5
  Floating Point Registers ... 8-6
  Floating Point Exceptions/Interrupts ... 8-6
  The Floating Point Control/Status Register ... 8-6
  Floating Point Implementation/Revision Register ... 8-8
  Guide to FP Instructions ... 8-8
  Load/Store ... 8-8
  Move Between Registers ... 8-9
  3-Operand Arithmetic Operations ... 8-9
  Unary (sign-changing) Operations ... 8-10
  Conversion Operations ... 8-10
  Conditional Branch and Test Instructions ... 8-10
  Instruction Timing Requirements ... 8-12
  Instruction Timing for Speed ... 8-12
  Initialization and Enable On Demand ... 8-12
  Floating Point Emulation ... 8-13

Assembler Language Programming ... 9
  Syntax Overview ... 9-1
  Key Points to Note ... 9-1
  Register-to-Register Instructions ... 9-2
  Immediate (Constant) Operands ... 9-3
  Multiply/Divide ... 9-4
  Load/Store Instructions ... 9-5
  Unaligned Loads and Stores ... 9-5
  Addressing Modes ... 9-6
  Gp-Relative Addressing ... 9-6
  Jumps, Subroutine Calls and Branches ... 9-8
  Conditional Branches ... 9-8
  Co-processor Conditional Branches ... 9-9
  Compare and Set ... 9-9
  Coprocessor Transfers ... 9-9
  Coprocessor Hazards ... 9-10
  Assembler Directives ... 9-10
  Sections ... 9-10
  .text, .rdata, .data ... 9-10
  .lit4, .lit8 ... 9-10
  Program Segments in Memory ... 9-11
  .bss ... 9-12
  .sdata, .sbss ... 9-12
  Stack and Heap ... 9-12
  Special Symbols ... 9-12
  Data Definition and Alignment ... 9-12
  .byte, .half, .word ... 9-13
  .float, .double ... 9-13
  .ascii, .asciiz ... 9-13
  .align ... 9-13
  .comm, .lcomm ... 9-13
  .space ... 9-14
  Symbol Binding Attributes ... 9-14
  .globl ... 9-14
  .extern ... 9-15
  .weakext ... 9-15
  Function Directives ... 9-15
  .ent, .end ... 9-15
  .aent ... 9-16
  .frame, .mask, .fmask ... 9-16
  Assembler Control (.set) ... 9-17
  .set noreorder/reorder ... 9-17
  .set volatile/novolatile ... 9-17
  .set noat/at ... 9-18
  .set nomacro/macro ... 9-18
  .set nobopt/bopt ... 9-18
  The Complete Guide to Assembler Instructions ... 9-18
  Alphabetic List of Assembler Instructions ... 9-30

C Programming ... 10
  The Stack, Subroutine Linkage, Parameter Passing ... 10-1
  Stack Argument Structure ... 10-1
  Which Arguments go in What Registers ... 10-1
  Examples from the C Library ... 10-2
  Exotic Example; Passing Structures ... 10-2
  How Printf() and Varargs Work ... 10-3
  Returning Value from a Function ... 10-4
  Macros for Prologues and Epilogues ... 10-4
  Stack-Frame Allocation ... 10-4
  Leaf Functions ... 10-4
  Non-Leaf Functions ... 10-5
  Functions Needing Run-Time Computed Stack Locations ... 10-7
  Shared and Non-Shared Libraries ... 10-9
  Sharing Code in Single-Address Space Systems ... 10-9
  Sharing Code Across Address Spaces ... 10-10
  An Introduction to Optimization ... 10-11
  Common Optimizations ... 10-11
  How to Prevent Unwanted Effects From Optimization ... 10-14
  Optimizer-Unfriendly Code and How to Avoid It ... 10-15

Portability Considerations ... 11
  Writing Portable C ... 11-1
  C Language Standards ... 11-1
  C Library Functions and POSIX ... 11-2
  Data Representations and Alignment ... 11-3
  Notes on Structure Layout and Padding ... 11-3
  Isolating System Dependencies ... 11-5
  Locating System Dependencies ... 11-5
  Fixing Up Dependencies ... 11-5
  Isolating Non-Portable Code ... 11-6
  Using Assembler ... 11-6
  Endianness ... 11-7
  What It Means to the Programmer ... 11-8
  Bitfield Layout and Endianness ... 11-9
  Changing the Endianness of a MIPS CPU ... 11-10
  Designing and Specifying for Configurable Endianness ... 11-10
  Read-Only Instruction Memory ... 11-10
  Writable (Volatile) Memory ... 11-11
  Byte-Lane Swapping ... 11-11
  Configurable IO Controllers ... 11-12
  Portability and Endianness-Independent Code ... 11-13
  Endianness-Independent Code ... 11-13
  Compatibility Within the R30XX Family ... 11-13
  Porting to MIPS: Frequently Encountered Issues ... 11-15
  Considerations for Portability to Future Devices ... 11-16

Writing Power-On Diagnostics ... 12
  Golden Rules for Diagnostics Programming ... 12-1
  What Should Tests Do? ... 12-2
  How to Test the Diagnostic Tests? ... 12-3
  Overview of Algorithmics’ Power-On Selftest ... 12-3
  Starting Points ... 12-3
  Control and Environment Variables ... 12-4
  Reporting ... 12-4
  Unexpected Exceptions During Test Sequence ... 12-5
  Driving Test Output Devices ... 12-5
  Restarting the System ... 12-5
  Standard Test Sequence ... 12-5
  Notes on the Test Sequence ... 12-6
  Annotated Examples from the Test Code ... 12-9

Instruction Timing and Optimization ... 13
  Notes and Examples ... 13-1
  Additional Hazards ... 13-2
  Early Modification of HI and LO ... 13-2
  Bitfields in CPU Control Registers ... 13-3
  Non-Obvious Hazards ... 13-3

Software Tools for Board Bring-Up ... 14
  Tools Used in Debug ... 14-1
  Initial Debugging ... 14-2
  Porting Micromonitor ... 14-2
  Running Micromonitor ... 14-2
  Initial IDT/SIM Activity ... 14-2
  A Final Note on IDT/KIT ... 14-3

Software Design Examples ... 15
  Application Software ... 15-1
  Memory Map ... 15-1
  Starting Up ... 15-1
  C Library Functions ... 15-2
  Input and Output ... 15-3
  Character Class Tests ... 15-3
  String Functions ... 15-3
  Mathematical Functions ... 15-3
  Utility Functions ... 15-3
  Diagnostics ... 15-4
  Variable Argument Lists ... 15-4
  Non-Local Jumps ... 15-4
  Signals ... 15-4
  Date and Time ... 15-4
  Running the Program ... 15-4
  Debugging the Program ... 15-5
  Embedded System Software ... 15-5
  Memory Map ... 15-6
  Starting Up ... 15-6
  Embedded System Library Functions ... 15-7
  Trap and Interrupt Handling ... 15-8
  Simple Interrupt Routines ... 15-8
  Floating-Point Traps and Interrupts ... 15-9
  Emulating Floating Point Instructions ... 15-10
  Debugging ... 15-10
  Unix-Like System S/W ... 15-11
  Terminology ... 15-11
  Components of a Process ... 15-12
  System Calls and Protection ... 15-13
  What the Kernel Does ... 15-13
  Virtual Memory Implementation for MIPS ... 15-14
  Interrupt Handling for MIPS ... 15-15
  How it Works ... 15-16

Assembly Language Programming Tips ... 16
  32-bit Address or Constant Values ... 16-1
  Use of “Set” Instructions ... 16-1
  Use of “Set” with Complex Branch Operations ... 16-2
  Carry, Borrow, Overflow, and Multi-Precision Math ... 16-2

Machine Instructions Reference (Appendix A) ... A
  CPU Instruction Overview ... A-1
  Instruction Classes ... A-1
  Instruction Formats ... A-2
  Instruction Notation Conventions ...
A-2 Instruction Notation Examples ..................................................................... A-3 Load and Store Instructions ................................................................................ A-4 Jump and Branch Instructions............................................................................. A-5 Coprocessor Instructions..................................................................................... A-5 System Control Coprocessor (CP0) Instructions ................................................ A-6 Instruct Set Details.............................................................................................. A-6 Instruction Summary......................................................................................... A-79 FPA Instruction Reference (Appendix B).......................................................................B FPU Instruction Set Details .................................................................................B-1 i–10 Table of Contents FPU Instructions ...........................................................................................B-1 Floating-Point Data Transfer ........................................................................B-1 Floating-Point Conversions ..........................................................................B-1 Floating-Point Arithmetic .............................................................................B-2 Floating-Point Register-to-Register Move ....................................................B-2 Floating-Point Branch ...................................................................................B-2 FP Computational Instructions and Valid Operands ...........................................B-2 FP Compare and Condition values ......................................................................B-3 FPU Register Specifiers.......................................................................................B-3 32-bit CP1 
registers..............................................................................................B-4 FPU Register Access for 32-bit CP1 Registers..............................................B-5 Instruction Notation Conventions ..................................................................B-5 Load and Store Memory ......................................................................................B-6 Instruction Descriptions .......................................................................................B-6 FPA Instruction Set Summary ...........................................................................B-27 CP0 Operation Reference (Appendix C) ........................................................................C CP0 Operation Details .........................................................................................C-1 MMU Operations .................................................................................................C-1 Exception Operations...........................................................................................C-1 Dand Register Movement Operations............................................................C-1 Operation Descriptions ........................................................................................C-1 Assembler Language Syntax (Appendix D)....................................................................D Object Code Formats (Appendix E)................................................................................E Sections and Segments...............................................................................................E-1 ECOFF Object File Format (RISC/OS).....................................................................E-1 File Header...........................................................................................................E-2 Optional a.out Header ..........................................................................................E-2 
Example Loader ...................................................................................................E-3 Further Reading ...................................................................................................E-4 ELF (MIPS ABI)........................................................................................................E-4 File Header...........................................................................................................E-4 Program Header ...................................................................................................E-5 Example Loader ...................................................................................................E-6 Further Reading ...................................................................................................E-7 Object Code Tools .....................................................................................................E-7 Glossary of Common "MIPS" Terms............................................................................. F DRAWINGS 1.1 MIPS 5-Stage Pipeline..........................................................................................1.2 1.2 The Pipeline and Branch Delays.......................................................................... 1-7 1.3 The Pipeline and Load Delays ............................................................................. 1-8 3.1 PRId Register Fields ............................................................................................ 3-4 3.2 Fields in Status Register....................................................................................... 3-4 3.3 Fields in the Cause Register................................................................................. 3-7 3.4 Fields in the R3071/81 Config Register............................................................... 3-8 3.5 Fields in the R3041 Config (Cache Configuration)Register................................ 
3-9 3.6 Fields in the R3041 Bus Control (BusCtrl) Register ......................................... 3-10 5.1 Direct Mapped Cache .......................................................................................... 5-1 6.1 EntryHi and EntryLo Register Fields .................................................................. 6-3 i–11 Table of Contents 6.2 6.3 6.4 6.5 8.1 8.2 9.1 10.1 11.1 11.2 11.3 11.4 11.5 11.6 11.7 11.8 11.9 15.1 A.1 EntryHi and EntryLo Register Fields .................................................................. 6-3 Fields in the Index Register ................................................................................. 6-4 Fields in the Random Register............................................................................. 6-4 Fields in the Context Register.............................................................................. 6-4 FPA Control/Status Register Fields ..................................................................... 8-6 FPA Implementation/Revision Register .............................................................. 8-8 Program Segments in Memory .......................................................................... 9-11 Stackframe for a Non-Leaf Function ................................................................. 10-5 Structure Layout and Padding in Memory......................................................... 11-3 Data Representation with #pragma Pack(1) ...................................................... 11-4 Data Representation with #pragma Pack(2) ...................................................... 11-5 Typical Big-Endians Picture .............................................................................. 11-8 Little Endians Picture......................................................................................... 11-8 Bitfields and Big-Endian.................................................................................... 
11-9 Bitfields and Little-Endian............................................................................... 11-10 Garbled String Storage when Mixing Modes .................................................. 11-11 Byte-Lane Swapper.......................................................................................... 11-12 Memory Layout of a BSD Process .................................................................. 15-12 CPU Instruction Formats .................................................................................... A-2 TABLES 1.1 R30xx Family Members Compared..................................................................... 1-4 2.1 Conventional Names of Registers with Usage Mnemonics................................. 2-2 3.1 Summary of CPU Control Registers (Not MMU) ............................................... 3-3 3.2 ExcCode Values: Different kinds of Exceptions ................................................. 3-7 4.1 Reset and Exception Entry Points (Vectors) for R30xx Family .......................... 4-3 4.2 Interrupt Bitfields and Interrup Pins .................................................................. 4-13 6.1 CPU Control Registers for Memory Management .............................................. 6-3 8.1 Floating Point Data Formats ................................................................................ 8-4 8.2 Rounding Modes Encoded in FP Control/Status Register................................... 8-7 8.4 FP Move Instructions........................................................................................... 8-9 8.5 FPA 3-Operand Arithmetic................................................................................ 8-10 8.6 FPA Sign-Changing Operators .......................................................................... 8-10 8.7 FPA Data Conversion Operations...................................................................... 
8-10
8.8 FP Test Instructions ..... 8-11
9.1 Assembler Register and Identifier Conventions ..... 9-20
9.2 Assembler Instructions ..... 9-20
12.1 Test Sequence in Brief ..... 12-5
16.1 32-bit Immediate Values ..... 16-1
16.2 Add-With-Carry ..... 16-2
16.3 Subtract-with-Borrow Operation ..... 16-3
A.1 CPU Instruction Operation Notations ..... A-3
A.2 Load and Store Common Function ..... A-4
A.3 Access Type Specifications for Load/Store ..... A-5
B.1 Format Field Decoding ..... B-2
B.2 Logical Negation of Predicates by Condition True/False ..... B-3
B.3 Valid FP Operand Specifiers with 32-bit Coprocessor 1 Registers ..... B-4
B.4 Load and Store Common Functions ..... B-6

CHAPTER 1 INTRODUCTION

IDT’s R30xx family of RISC microcontrollers includes the R3051, R3052, R3071, R3081 and R3041 processors. The different members of the family offer different price/performance trade-offs, but all are essentially integrated versions of the MIPS R3000A CPU.
The R3000A CPU is well known for the high-performance Unix systems implemented around it; less publicized but equally impressive is the performance it has brought to a wide variety of embedded applications.

IDT’s RISController family also includes devices built around MIPS R4000 64-bit microprocessor technology. These devices, such as the IDT R4600 Orion microprocessor, offer even higher levels of performance than the R3000A derivative family. However, these devices also feature slightly different OS models, and allow 64-bit kernels and applications. They are sufficiently different from the R30xx family that this manual focuses exclusively on the R30xx family.

This manual is aimed at the programmer dealing with the IDT R30xx family components. Although most programming is done in a high-level language (usually “C”), with little awareness of the underlying system or processor architecture, certain operations require the programmer to use assembly language and/or be aware of the underlying system or processor structure. This manual is designed to be consulted when addressing these types of issues.

WHAT IS A RISC?

The MIPS CPU is one of the “RISC” CPUs, born out of a particularly fertile period of academic research and development. RISC (“Reduced Instruction Set Computer”) CPUs share a number of architectural attributes that facilitate the implementation of high-performance processors. Most new architectures (as opposed to implementations) since 1986 owe their remarkable performance to features developed a few years earlier by a couple of seminal research projects. Someone commented that “a RISC is any computer architecture defined after 1984”; although meant as a jibe at the industry’s use of the acronym, the comment’s truth also derives from the widespread acceptance of the conclusions of that research.

One of these projects was the “MIPS” project at Stanford University.
The project name MIPS puns on the familiar “millions of instructions per second” by taking its name from the key phrase “Microcomputer without Interlocked Pipeline Stages”. The Stanford group’s work showed that pipelining, a well-known technique for speeding up computers, had been under-exploited by earlier architectures.

PIPELINES

[Figure 1.1. MIPS 5-stage pipeline. Instruction sequence (instr 1, instr 2, instr 3) against time; each instruction passes through the IF, RD, ALU, MEM and WB stages, which use the I-cache, register file, ALU, D-cache and register file respectively.]

Pipelined processors operate by breaking instruction execution into multiple small independent “stages”; since the stages are independent, multiple instructions can be in varying states of completion at any one time. This organization also tends to facilitate higher frequencies of operation, since very complex activities can be broken down into “bite-sized” chunks. The result is that multiple instructions are executing at any one time, and that instructions are initiated (and completed) at very high frequency. MIPS has consistently been among the most aggressive in the utilization of these techniques.

Pipelining depends for its success on another technique: using caches to reduce the amount of time spent waiting for memory. The MIPS R3000A architecture uses separate instruction and data caches, so it can fetch an instruction and read or write a memory variable in the same clock phase. By mating high-frequency operation to high memory bandwidth, very high performance is achieved.

In CISC architectures, caches are often seen as part of memory. A RISC architecture makes more sense if the dual caches are regarded as very much part of the CPU; in fact, the pipelines of virtually all RISC processors require caches to maintain execution. The CPU normally runs from cache, and a cache miss (where data or instructions have to be fetched from memory) is seen as an exceptional event.
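
The overlap shown in Figure 1.1 can be sketched in a few lines of C. This is only an idealized model, assuming that one instruction starts on every clock and that nothing ever stalls or misses the cache; the names pipestage, stage_on_clock and clocks_to_complete are hypothetical and chosen for this illustration.

```c
/* Pipestage names as labelled in Figure 1.1 (identifiers are hypothetical). */
enum pipestage { PS_NONE = -1, PS_IF, PS_RD, PS_ALU, PS_MEM, PS_WB, PS_COUNT };

/*
 * Stage occupied on clock `clock` by the instruction that entered the
 * pipeline on clock `issue` (both counted from 0). Assumes an idealized
 * pipeline that never stalls. Returns PS_NONE when the instruction has
 * not started yet or has already completed.
 */
enum pipestage stage_on_clock(unsigned issue, unsigned clock)
{
    if (clock < issue || clock - issue >= (unsigned)PS_COUNT)
        return PS_NONE;
    return (enum pipestage)(clock - issue);
}

/*
 * Clocks needed to complete n back-to-back instructions: the first one
 * needs PS_COUNT clocks, and each later one finishes one clock after
 * its predecessor.
 */
unsigned clocks_to_complete(unsigned n)
{
    return n == 0 ? 0 : (unsigned)PS_COUNT + (n - 1);
}
```

For the three instructions of Figure 1.1, clocks_to_complete(3) gives 7 clocks, against 15 if each instruction ran through all five stages on its own. On clock 2, instr 1 is in ALU, instr 2 is in RD and instr 3 is in IF.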
For the R3000A and its derivatives, instruction execution is divided into five phases (called pipestages), with each pipestage taking a fixed amount of time (see Figure 1.1). Again, note that this model assumes that instruction fetches and data accesses can be satisfied from the processor caches at the processor operation frequency. All instructions are rigidly defined to follow the same sequence of pipestages, even where the instruction does nothing at some stage. The net result is that, so long as it keeps hitting the cache, the CPU starts an instruction every clock. Figure 1.1 illustrates this operation.

Instruction execution activity can be described as occurring in the individual pipestages:

• IF : (“instruction fetch”) gets the next instruction from the instruction cache (I-cache).
• RD : (“read registers”) decodes the instruction and fetches the contents of any CPU registers it uses.
• ALU : (“arithmetic/logic unit”) performs an arithmetic or logical operation in one clock (floating point math and integer multiply/divide can’t be done in one clock and are done differently; this is described later).
• MEM : the stage where the instruction can read/write memory variables in the data cache (D-cache). Note that for typical programs, three out of four instructions do nothing in this stage; but allocating the stage to each instruction ensures that the processor never has two instructions wanting the data cache at the same time.
• WB : (“write back”) stores the value obtained from an operation back to the register file.

A rigid pipeline does limit the kinds of things instructions can do; in particular:

• Instruction length : all instructions are 32 bits (exactly one machine “word”) long, so that they can be fetched in a constant time. This itself discourages complexity; there are not enough bits in the instruction to encode really complicated addressing modes, for example.
• No arithmetic on memory variables : data from cache or memory is obtained only in stage 4 (MEM), which is much too late to be available to the ALU. Memory accesses occur only as simple load or store instructions which move the data to or from registers (this is described as a “load/store architecture”).

However, the MIPS project architects also attended to the best thinking of the time about what makes a CPU an easy target for efficient optimizing compilers. So MIPS CPUs have 32 general-purpose registers and 3-operand arithmetical/logical instructions, and eschew complex special-purpose instructions which compilers can’t usually generate.

THE IDT R3xxx FAMILY CPUS

MIPS Corporation was formed in 1984 to make a commercial version of the Stanford MIPS CPU. The commercial CPU was enhanced with memory management hardware, first appearing late in 1985 as the R2000. An ambitious external floating point math co-processor (the R2010 FPA) first shipped in mid-1987. The R3000, shipped in 1988, is almost identical from the programmer’s viewpoint (although small hardware enhancements combined to give a substantial boost to performance). The R3000A followed in 1989, improving the frequency of operation over the original R3000 (other minor enhancements were added, such as the ability for user tasks to operate with the opposite “endianness” from the kernel).

The R2000/R3000 chips include a cache controller; implementing external caches merely required a few industry-standard SRAMs and some address latches. The math co-processor shares the cache buses to interpret instructions (in parallel with the integer CPU) and transfer operands and results between the FPA and memory or the integer CPU.

The division of function was ingenious, practical and workable, allowing the R2000/3000 generation to be built without extravagant ultra-high pin-count packages.
However, as clock speeds increased, the very high-speed signals in the cache interface increased design complexity and limited operational frequency. In addition, overall chip count for the basic execution core proved to be a limitation for area- and power-sensitive embedded systems.

The R3051, R3052, R3071, R3081 and R3041 are the members (so far) of a family of products defined, designed, and manufactured by IDT. The chips integrate the functions of the R3000A CPU, cache memory and (R3081 only) math co-processor. This means that all the fastest logic is on chip; so the integrated chips are not only cheaper and smaller than the original implementation, but also much easier to use.

The parts differ in their cache sizes, whether they include on-chip MMU and/or FPA, clock rates and packaging options. In addition, although all parts can be used pin-compatibly, certain products feature optional enhancements in their bus interface that may serve to reduce system cost or complexity, and other subtle enhancements for cost or performance. The major differences are summarized in Table 1.1.

Part    Cache I+D        MMU  FPA  Clock (MHz)  Package     System Interface Options
3051    4K + 1K          –    –    20-40        PLCC        32-bit MUX’ed A/D
3051E   4K + 1K          ×    –    20-40        PLCC        32-bit MUX’ed A/D
3052    8K + 2K          –    –    20-40        PLCC        32-bit MUX’ed A/D
3052E   8K + 2K          ×    –    20-40        PLCC        32-bit MUX’ed A/D
3081    16K+4K / 8K+8K   –    ×    20-50        PLCC        Optional 1/2 frequency bus operation; optional 1x clock input
3081E   16K+4K / 8K+8K   ×    ×    20-50        PLCC        Optional 1/2 frequency bus operation; optional 1x clock input
3071    16K+4K / 8K+8K   –    –    33-50        PLCC        1/2 frequency bus operation; 1x clock input
3071E   16K+4K / 8K+8K   ×    –    33-50        PLCC        1/2 frequency bus operation; 1x clock input
3041    2K + 0.5K        –    –    16-25        PLCC, TQFP  Variable port width interface

Table 1.1. R30xx family members compared

MIPS ARCHITECTURE LEVELS

There are multiple generations of the MIPS architecture. The most commonly discussed are the MIPS-1, MIPS-2, and MIPS-3 architectures.

MIPS-1 is the ISA found in the R2000 and R3000 generation CPUs. It is a 32-bit ISA, and defines the basic instruction set.
Any application written with the MIPS-1 instruction set will operate correctly on all generations of the architecture.

The MIPS-2 ISA is also 32-bit. It adds some instructions to speed up floating point data movement, branch-likely instructions, and other minor enhancements. This was first implemented in the MIPS R6000 ECL microprocessor.

The MIPS-3 ISA is a 64-bit ISA. In addition to supporting all MIPS-1 and MIPS-2 instructions, the MIPS-3 ISA contains 64-bit equivalents of certain earlier instructions that are sensitive to operand size (e.g. load double and load word are both supported), including doubleword (64-bit) data movement and arithmetic. This ISA was first implemented in the R4000 as a clean (“seamless”) transition from the existing 32-bit architecture.

Note that these ISA levels do not necessarily imply a particular structure for the MMU, caches, exception model, or other kernel-specific resources. Thus, different implementations of ISA-compatible chips may require different kernels.

In the case of the R30xx family, all devices implement the MIPS-1 ISA. Many devices are also kernel compatible with the R3000A, but some devices (most notably those without an MMU) may require small kernel changes or different boot modules†.

† Historically, many embedded MIPS applications have run exclusively out of the “kseg0 and kseg1” memory regions (described later in the book). For these applications, the presence or absence of the MMU is largely irrelevant.

MIPS-1 COMPARED WITH CISC ARCHITECTURES

Although the MIPS architecture is fairly straightforward, there are a few features, visible only to assembly programmers, which may at first appear surprising. In addition, operations familiar from CISC architectures are irrelevant to the MIPS architecture. For example, the MIPS architecture does not mandate a stack pointer or stack usage; thus, programmers may be surprised to find that push/pop instructions do not exist directly.
The most notable of these features are summarized here.

Unusual instruction encoding features

• All instructions are 32 bits long : as mentioned above. This means, for example, that it is impossible to incorporate a 32-bit constant into a single instruction (there would be no instruction bits left to encode the operation and the registers!). A “load immediate” instruction is limited to a 16-bit value; a special “load upper immediate” must be followed by an “or immediate” to put a 32-bit constant value into a register.
• Instruction actions must fit the pipeline : actions can only be carried out in the designated pipeline phase, and must be complete in one clock. For example, the register writeback phase provides for just one value to be stored in the register file, so instructions can only change one register.
• 3-operand instructions : arithmetic/logical operations don’t have to specify memory locations, so there are plenty of instruction bits to define two independent source registers and one destination register. Compilers love 3-operand instructions, which give optimizers more scope to improve the code which handles complex expressions.
• 32 registers : the choice of 32 has become universal; compilers like a large (but not necessarily too large) number of registers, but there is a cost in context-saving and in encoding the registers to be used by an instruction. Register $0 always returns zero, to give a compact encoding of that useful constant.
• No condition codes : the MIPS architecture does not provide condition code flags implicitly set by arithmetical operations. The motivation is to make sure that execution state is stored in one place: the register file. Conditional branches (in MIPS) test a single register for sign/zero, or a pair of registers for equality.

Addressing and memory accesses

• Memory references are always register loads and stores : arithmetic on memory variables upsets the pipeline, so is not done.
Memory references only occur due to explicit load or store instructions. The large register file allows multiple variables to be “on-chip” simultaneously.
• Only one data addressing mode : all loads and stores define the memory location with a single base register value modified by a 16-bit signed displacement. Note that the assembler/compiler tools can use the $0 register, along with the immediate value, to synthesize additional addressing modes from this one directly supported mode.
• Byte-addressed : the instruction set includes load/store operations for 8- and 16-bit variables (referred to as byte and halfword). Partial-word load instructions come in two flavors: sign-extend and zero-extend.
• Loads/stores must be address-aligned : memory word operations can only load or store data from a single 4-byte aligned word; halfword operations must be aligned on halfword addresses. Many CISC microprocessors will load/store a multi-byte item from any byte address (although unaligned transfers always take longer). Techniques to generate code which will handle unaligned data efficiently will be explained later.
• Jump instructions : the smallest op-code field in a MIPS instruction is 6 bits, leaving 26 bits to define the target of a jump. Since all instructions are 4-byte aligned in memory, the two least-significant address bits need not be stored, allowing an address range of 2^28 bytes = 256Mbytes. Rather than make this branch PC-relative, this is interpreted as an absolute address within a 256Mbyte “segment”. In theory, this could impose a limit on the size of a single program; in reality, it hasn’t been a problem. Branches out of segment can be achieved by using a jr instruction, which uses the contents of a register as the target. Conditional branches have only a 16-bit displacement field (a 2^18 byte range, since instructions are 4-byte aligned), which is interpreted as a signed PC-relative displacement.
Compilers can only code a simple conditional branch instruction if they know that the target will be within 128Kbytes of the instruction following the branch.

Operations not directly supported

• No byte or halfword arithmetic : all arithmetical and logical operations are performed on 32-bit quantities. Byte and/or halfword arithmetic would require significant extra resources and many more op-codes, so its omission is understandable. Most C programmers will use the int data type for most arithmetic, and for MIPS an int is 32 bits and such arithmetic will be efficient. C’s rules are to perform arithmetic in int whenever any source or destination variable is as long as int. However, where a program explicitly does arithmetic as short, the compiler must insert extra code to make sure that wraparound and overflows have the appropriate effect.
• No special stack support : conventional MIPS assembler usage does define a sp register, but the hardware treats sp just like any other register. There is a recommended format for the stack frame layout of subroutines, so that programs can mix modules from different languages and compilers; it is recommended that programmers stick to these conventions, but they have no relationship to the hardware.
• Minimal subroutine overhead : there is one special feature; jump instructions have a “jump and link” option which stores the return address into a register. $31 is the default, so for convenience and by convention $31 becomes the “return address” register.
• Minimal interrupt overhead : the MIPS architecture makes very few presumptions about system exception handling, allowing fast response and a wide variety of software models. In the R30xx family, the CPU stashes away the restart location in the special register EPC, modifies the machine state just enough to signal why the trap happened and to disallow further interrupts; then it jumps to a single predefined location† in low memory. Everything else is up to the software.
Just to emphasize this: on an interrupt or trap a MIPS CPU does not store anything on a stack, or write memory, or preserve any registers by itself. By convention, two registers ($k0, $k1; register conventions are explained in chapter 2) are reserved so that interrupt/trap routines can ‘‘bootstrap’’ themselves – it is impossible to do anything on a MIPS CPU without using some registers. For a program running in any system which takes interrupts or traps, the values of these registers may change at any time, and thus they should not be used.

† One particular kind of trap (a TLB miss on an address in the user-privilege address space) has a different dedicated entry point.

Multiply and divide operations
The MIPS CPU does have an integer multiply/divide unit; this is worth mentioning because many RISC machines don’t have multiply hardware. The multiply unit is relatively independent of the rest of the CPU, with its own special output registers.

Programmer-visible pipeline effects
In addition to the discussion above, programmers of R3xxx architecture CPUs also must be aware of certain effects of the MIPS pipeline. Specifically, the results of certain operations may not be available in the immediately subsequent instruction; the programmer may need to be explicitly aware of such cases.

Figure 1.2. The pipeline and branch delays (rows: branch, branch delay, branch target; stages: IF, RF, ALU, MEM, WB)

• Delayed branches : the pipeline structure of the MIPS CPU (see "Figure 1.2. The pipeline and branch delays”) means that when a jump instruction reaches the ‘‘execute’’ phase and a new program counter is generated, the instruction after the jump will already have been decoded. Rather than discard this potentially useful work, the architecture rules state that the instruction after a branch is always executed before the instruction at the target of the branch. "Figure 1.2. The pipeline and branch delays” shows that a special path is provided through the ALU to make the branch address available a half-clock early, ensuring that there is only a one cycle delay before the outcome of the branch is determined and the appropriate instruction flow (branch taken or not taken) is initiated.
It is the responsibility of the compiler system or the assembler programmer to allow for and even to exploit this “branch delay slot”; it turns out that it is usually possible to arrange code such that the instruction in the ‘‘delay slot’’ does useful work. Quite often, the instruction which would otherwise have been placed before the branch can be moved into the delay slot.
This can be a bit tricky on a conditional branch, where the branch delay instruction must be (at least) harmless on the path where it isn’t wanted. Where nothing useful can be done the delay slot is filled with a ‘‘nop’’ (no-op, or no-operation) instruction. Many MIPS assemblers will hide this feature from the programmer unless explicitly told not to, as described later.
• Load data not available to next instruction : another consequence of the pipeline is that a load instruction’s data arrives from the cache/memory system AFTER the next instruction’s ALU phase starts – so it is not possible to use the data from a load in the following instruction. See "Figure 1.3. The pipeline and load delays” for how this works. On the MIPS-1 architecture, the programmer must ensure that this rule is not violated.

Figure 1.3. The pipeline and load delays (rows: load, load delay, use data; stages: IF, RD, ALU, MEM, WB)

Again, most assemblers will hide this if they can. Frequently, the assembler can move an instruction which is independent of the load into the load delay slot; in the worst case, it can insert a NOP to ensure proper program execution.
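The load-delay rule lends itself to a mechanical check. The following C fragment is a minimal sketch of the fallback case only, and is not the algorithm of any particular assembler: it scans a simplified instruction list and inserts a nop after any load whose destination register is read by the very next instruction. The insn_t record, its fields, and the function names are hypothetical, invented for this illustration. A real assembler would first try to move an independent instruction into the slot, which this sketch does not attempt.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified instruction record (not a MIPS encoding). */
typedef struct {
    int is_load;  /* 1 if the instruction is a load (lw, lh, lb, ...) */
    int dest;     /* register number written, or -1 if none */
    int src1;     /* first register read, or -1 if none */
    int src2;     /* second register read, or -1 if none */
} insn_t;

static const insn_t NOP_INSN = {0, -1, -1, -1};

/* Returns 1 if 'next' reads the register that 'load' has not yet delivered. */
int load_delay_hazard(const insn_t *load, const insn_t *next)
{
    /* A load to $0 (or with no destination) can never be observed. */
    if (!load->is_load || load->dest <= 0)
        return 0;
    return next->src1 == load->dest || next->src2 == load->dest;
}

/* Copies in[0..n-1] to out[], inserting a nop into every violated load
 * delay slot. Returns the number of instructions written; out[] must
 * have room for 2*n entries. */
size_t fill_load_delays(const insn_t *in, size_t n, insn_t *out)
{
    size_t k = 0;

    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        out[k++] = in[i];
        if (i + 1 < n && load_delay_hazard(&in[i], &in[i + 1]))
            out[k++] = NOP_INSN;
    }
    return k;
}
```

Given a load into register 9 followed immediately by an add that reads register 9, fill_load_delays returns one more instruction than it was given, with the nop sitting between the two.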
A NOTE ON MACHINE AND ASSEMBLER LANGUAGE
To simplify assembly level programming, the MIPS Corp’s assembler (and many other MIPS assemblers) provides a set of “synthetic” instructions. Typically, a synthetic instruction is a common assembly level operation that the assembler will map into one or more true instructions. This mapping can be more intelligent than a mere macro expansion. For example, the programmer just writes a ‘‘load immediate’’ instruction, and the assembler will figure out whether it can get by with a single machine instruction (if the immediate datum is small enough) or needs to generate several (if the datum is larger). Synthetic instructions can dramatically simplify assembly level programming. This is obviously useful, but can be confusing. This manual will try to use synthetic instructions sparingly, and indicate when it happens. Moreover, the instruction tables below will consistently distinguish between synthetic and machine instructions.
These features are there to help human programmers; most compilers generate instructions which are one-for-one with machine code. However, some compilers will in fact generate synthetic instructions.
Helpful things the assembler does:
• 32-bit load immediates : The programmer can code a load with any value (including a memory location which will be computed at link time), and the assembler will break it down into two instructions to load the high and low half of the value.
• Load from memory location : The programmer can code a load from a memory-resident variable. The assembler will normally replace this by loading a temporary register with the high-order half of the variable’s address, followed by a load whose displacement is the low-order half of the address.
Of course, this does not apply to variables defined inside C functions, which are implemented either in registers or on the stack.
• Efficient access to memory variables : some C programs contain many references to static or extern variables, and a two-instruction sequence to load/store any of them is expensive. Some compilation systems, with run-time support, get around this. Certain variables are selected at compile/assemble time (by default MIPS Corp’s assembler selects variables which occupy 8 bytes of storage or fewer) and kept together in a single section of memory which must end up smaller than 64Kbytes. The run-time system then initializes one register ($28, or gp (global pointer), by convention) to point to the middle of this section. Loads and stores to these variables can now be coded as a single gp-relative load or store.
• More types of branch condition : the assembler synthesizes a full set of branches conditional on an arithmetic test between two registers.
• Simple or different forms of instructions : unary operations such as not and neg are produced as a nor or sub with the zero-valued register $0. Two-operand forms of 3-operand instructions can be written; the assembler will put the result back into the first-specified register.
• Hiding the branch delay slot: in normal coding most assemblers will not allow access to the branch delay slot. MIPS Corp.’s assembler, in particular, is exceptionally ingenious and may re-organize the instruction sequence substantially in search of something useful to do in the delay slot. An assembler directive ‘‘.noreorder’’ is available where this must not happen.
• Hiding the load delay: many assemblers will detect an attempt to use the result of a load in the next instruction, and will either move code around or insert a nop.
• Unaligned transfers: the ‘‘unaligned’’ load/store instructions will fetch halfword and word quantities correctly, even if the target address turns out to be unaligned.
• Other pipeline corrections: some instructions (such as those which use the integer multiply unit) have additional constraints that are implementation specific (see the Appendix on hazards). Many assemblers will just “handle” these cases automatically, or at least warn the programmer about possible hazard violations.
• Other optimizations: some MIPS instructions (particularly floating point) take multiple clocks to produce results. However, the hardware is ‘‘interlocked’’, so the programmer does not need to be aware of these delays to write correct programs. But MIPS Corp.’s assembler is particularly aggressive in these circumstances, and will perform substantial code movement to try to make the code run faster. This may need to be considered when debugging.
In general, it is best to use a disassembler utility to disassemble the resulting binary during debug. This will show the system designers the true code sequence being executed, and thus “uncover” the modifications made by the assembler or compiler.

CHAPTER 2 MIPS-1 (R30xx) ARCHITECTURE

PROGRAMMER’S VIEW OF THE PROCESSOR ARCHITECTURE
This chapter describes the assembly programmer’s view of the CPU architecture, in terms of registers, instructions, and computational resources. This viewpoint corresponds, for example, to an assembly programmer writing user applications (although more typically, such a programmer would use a high-level language).
Information about kernel software development (such as handling interrupts, traps, and cache and memory management) is described in later chapters.

Registers
There are 32 general purpose registers: $0 to $31. Two, and only two, are special to the hardware:
• $0 always returns zero, no matter what software attempts to store to it.
• $31 is used by the normal subroutine-calling instruction (jal) for the return address.
Note that the call-by-register version (jalr) can use ANY register for the return address, though practice is to use only $31.
In all other respects all registers are identical and can be used in any instruction ($0 can be used as the destination of instructions; the value of $0 will remain unchanged, however, so the instruction would be effectively a NOP).
In the MIPS architecture the ‘‘program counter’’ is not a register, and it is probably better to not think of it that way. The return address of a jal is the address of the instruction two places after the jal itself (that is, the instruction following the delay slot); the instruction immediately after the call is the call’s ‘‘delay slot’’ and is typically used to set up the last parameter.
There are no condition codes and nothing in the ‘‘status register’’ or other CPU internals is of any consequence to the user-level programmer.
There are two registers associated with the integer multiplier. These registers, referred to as “HI” and “LO”, contain the 64-bit product result of a multiply operation, or the quotient and remainder of a divide.
The floating point math co-processor (called FPA for floating point accelerator), if available, adds 32 floating point registers†; in simple assembler language they are just called $0 to $31 again – the fact that these are floating point registers is implicitly defined by the instruction. Actually, only the 16 even-numbered registers are usable for math; but they can be used for either single-precision (32-bit) or double-precision (64-bit) numbers. When performing double-precision arithmetic, odd-numbered register $N+1 holds the remaining bits of the even-numbered register identified as $N. Only moves between integer and FPA, or FPA load/store instructions, ever refer to odd-numbered registers (and even then the assembler helps the programmer forget...)

† The FPA also has a different set of registers called ‘‘co-processor 1 registers’’ for control purposes.
These are typically used to manage the actions/state of the FPA, and should not be confused with the FPA data registers.

Conventional names and uses of general-purpose registers
Although the hardware makes few rules about the use of registers, their practical use is governed by a number of conventions. These conventions allow inter-changeability of tools, operating systems, and library modules. It is strongly recommended that these conventions be followed.

Reg No   Name     Used for
0        zero     Always returns 0
1        at       (assembler temporary) Reserved for use by assembler
2-3      v0-v1    Value (except FP) returned by subroutine
4-7      a0-a3    (arguments) First four parameters for a subroutine
8-15     t0-t7    (temporaries) Subroutines may use without saving
24-25    t8-t9    (temporaries) Same use as t0-t7
16-23    s0-s7    Subroutine ‘‘register variables’’; a subroutine which will write one of these must save the old value and restore it before it exits, so the calling routine sees their values preserved.
26-27    k0-k1    Reserved for use by interrupt/trap handler - may change under your feet
28       gp       Global pointer - some runtime systems maintain this to give easy access to (some) ‘‘static’’ or ‘‘extern’’ variables.
29       sp       Stack pointer
30       s8/fp    9th register variable. Subroutines which need one can use this as a ‘‘frame pointer’’.
31       ra       Return address for subroutine
Table 2.1. Conventional names of registers with usage mnemonics

With the conventional uses of the registers go a set of conventional names. Given the need to fit in with the conventions, use of the conventional names is pretty much mandatory. The common names are described in Table 2.1, “Conventional names of registers with usage mnemonics”.

Notes on conventional register names
• at : this register is reserved for use inside the synthetic instructions generated by the assembler. If the programmer must use it explicitly the directive .noat stops the assembler from using it, but then there are some things the assembler won’t be able to do.
• v0-v1 : used when returning non-floating-point values from a subroutine. To return anything bigger than 2×32 bits, memory must be used (described in a later chapter).
• a0-a3 : used to pass the first four non-FP parameters to a subroutine. That’s an occasionally-false oversimplification; the actual convention is fully described in a later chapter.
• t0-t9 : by convention, subroutines may use these registers without preserving their values. This makes them easy to use as ‘‘temporaries’’ when evaluating expressions – but a caller must remember that they may be destroyed by a subroutine call.
• s0-s8 : by convention, subroutines must guarantee that the values of these registers on exit are the same as they were on entry – either by not using them, or by saving them on the stack and restoring before exit. This makes them eminently suitable for use as ‘‘register variables’’ or for storing any value which must be preserved over a subroutine call.
• k0-k1 : reserved for use by the trap/interrupt routines, which will not restore their original value; so they are of little use to anyone else.
• gp : (global pointer). If present, it will point to a load-time-determined location in the midst of your static data. This means that loads and stores to data lying within 32Kbytes either side of the gp value can be performed in a single instruction using gp as the base register.
Without the global pointer, loading data from a static memory area takes two instructions: one to load the most significant bits of the 32-bit constant address computed by the compiler and loader, and one to do the data load.
To use gp a compiler must know at compile time that a datum will end up linked within a 64Kbyte range of memory locations. In practice it can’t know, only guess. The usual practice is to put ‘‘small’’ global data items in the area pointed to by gp, and to get the linker to complain if it still gets too big.
The definition of what is “small” can typically be specified with a compiler switch (most compilers use ‘‘-G’’). The most common default size is 8 bytes or less.
Not all compilation systems or OS loaders support gp.
• sp : (stack pointer). Since it takes explicit instructions to raise and lower the stack pointer, it is generally done only on subroutine entry and exit; and it is the responsibility of the subroutine being called to do this. sp is normally adjusted, on entry, to the lowest point that the stack will need to reach at any point in the subroutine. Now the compiler can access stack variables by a constant offset from sp. Stack usage conventions are explained in a later chapter.
• fp : (also known as s8). A subroutine will use a ‘‘frame pointer’’ to keep track of the stack if it wants to use operations which involve extending the stack by an amount which is determined at run-time. Some languages may do this explicitly; assembler programmers are always welcome to experiment; and (for many toolchains) C programs which use the ‘‘alloca’’ library routine will find themselves doing so.
In this case it is not possible to access stack variables from sp, so fp is initialized by the function prologue to a constant position relative to the function’s stack frame. Note that a ‘‘frame pointer’’ subroutine may call or be called by subroutines which do not use the frame pointer; so long as the functions it calls preserve the value of fp (as they should) this is OK.
• ra : (return address). On entry to any subroutine, ra holds the address to which control should be returned – so a subroutine typically ends with the instruction ‘‘jr ra’’. Subroutines which themselves call subroutines must first save ra, usually on the stack.

Integer multiply unit and registers
MIPS’ architects decided that integer multiplication was important enough to deserve a hard-wired instruction.
This is not so common in RISCs, which might instead:
• implement a ‘‘multiply step’’ which fits in the standard integer execution pipeline, and require software routines for every multiplication (e.g. Sparc or AM29000); or
• perform integer multiplication in the floating point unit – a good solution, but one which compromises the optional nature of the MIPS floating point ‘‘co-processor’’.
The multiply unit consumes a small amount of die area, but dramatically improves performance (and cache performance) over “multiply step” operations. Its basic operation is to multiply two 32-bit values together to produce a 64-bit result, which is stored in two 32-bit registers (called ‘‘hi’’ and ‘‘lo’’) which are private to the multiply unit. Instructions mfhi, mflo are defined to copy the result out into general registers.
Unlike results for integer operations, the multiply result registers are interlocked. An attempt to read out the results before the multiplication is complete results in the CPU being stopped until the operation completes.
The integer multiply unit will also perform an integer division between values in two general-purpose registers; in this case the ‘‘lo’’ register stores the quotient, and the ‘‘hi’’ register the remainder.
In the R30xx family, multiply operations take 12 clocks and division takes 35. The assembler has a synthetic multiply operation which starts the multiply and then retrieves the result into an ordinary register. Note that MIPS Corp.’s assembler may even substitute a series of shifts and adds for multiplication by a constant, to improve execution speed.
Multiply/divide results are written into ‘‘hi’’ and ‘‘lo’’ as soon as they are available; the effect is not deferred until the writeback pipeline stage, as with writes to general purpose (GP) registers.
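The register-level effect of the multiply unit described above can be modeled in a few lines of C. This is only an illustrative sketch: the hilo_t type and the model_* function names are hypothetical, timing and interlocks are not modeled, and the signed divide assumes C99 truncating division. Because the hardware result of a divide by zero is undefined, the model returns an arbitrary placeholder for that case (and for the one overflowing quotient that C cannot express).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the multiply unit's private result registers. */
typedef struct {
    uint32_t hi;
    uint32_t lo;
} hilo_t;

/* Signed multiply: the 64-bit product is split across hi (upper 32 bits)
 * and lo (lower 32 bits). */
hilo_t model_mult(int32_t a, int32_t b)
{
    int64_t p = (int64_t)a * (int64_t)b;
    hilo_t r = { (uint32_t)((uint64_t)p >> 32), (uint32_t)p };
    return r;
}

/* Unsigned multiply: same split, unsigned arithmetic. */
hilo_t model_multu(uint32_t a, uint32_t b)
{
    uint64_t p = (uint64_t)a * (uint64_t)b;
    hilo_t r = { (uint32_t)(p >> 32), (uint32_t)p };
    return r;
}

/* Signed divide: lo receives the quotient, hi the remainder. */
hilo_t model_div(int32_t a, int32_t b)
{
    hilo_t r = { 0, 0 };  /* arbitrary placeholder for unmodeled cases */

    if (b == 0 || (a == INT32_MIN && b == -1))
        return r;
    r.lo = (uint32_t)(a / b);
    r.hi = (uint32_t)(a % b);
    return r;
}

/* Usage example: reading the results back as mflo/mfhi would. */
void hilo_example(void)
{
    hilo_t q = model_div(7, 2);
    assert(q.lo == 3u && q.hi == 1u);
}
```

The point of the split is visible in model_mult(-2, 3): the product -6 does not fit the picture of "a 32-bit answer", so hi carries the sign extension (all ones) while lo carries the low word.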
If a mfhi or mflo instruction is interrupted by some kind of exception before it reaches the writeback stage of the pipeline, it will be aborted with the intention of restarting it. However, a subsequent multiply instruction which has passed the ALU stage will continue (in parallel with exception processing) and would overwrite the ‘‘hi’’ and ‘‘lo’’ register values, so that the re-execution of the mfhi would get wrong (i.e. new) data. For this reason it is recommended that a multiply should not be started within two instructions of an mfhi/mflo. The assembler will avoid doing this where it can.
Integer multiply and divide operations never produce an exception, though divide by zero produces an undefined result. Compilers will often generate code to trap on errors, particularly on divide by zero. Frequently, this instruction sequence is placed after the divide is initiated, to allow it to execute concurrently with the divide (and avoid a performance loss).
Instructions mthi, mtlo are defined to set up the internal registers from general-purpose registers. They are essential to restore the values of ‘‘hi’’ and ‘‘lo’’ when returning from an exception, but probably not for anything else.

Instruction types
A full list of R30xx family integer instructions is presented in Appendix A. Floating point instructions are listed in Appendix B of this manual. Currently, floating point instructions are only available in the R3081, and are described in the R3081 User’s Manual.
The MIPS-1 ISA uses only three basic instruction encoding formats; this is one of the keys to the high frequencies attained by RISC architectures.
Instructions are mostly in numerical order; to simplify reading, the list is occasionally re-ordered for clarity.
Throughout this manual, the description of various instructions will also refer to various subfields of the instruction. In general, the following typical nomenclature is used:
op The basic op-code, which is 6 bits long.
Instructions which have large sub-fields (for example, large immediate values, such as required for the ‘‘long’’ j/jal instructions, or arithmetic with a 16-bit constant) have a unique ‘‘op’’ field. Other instructions are classified in groups sharing an ‘‘op’’ value, distinguished by other fields (‘‘op2’’ etc.).
rs, rs1, rs2 One or two fields identifying source registers.
rd The register to be changed by this instruction.
sa Shift-amount: How far to shift, used in shift-by-constant instructions.
op2 Sub-code field used for the 3-register arithmetic/logical group of instructions (op value of zero).
offset 16-bit signed word offset defining the destination of a ‘‘PC-relative’’ branch. The branch target will be the instruction ‘‘offset’’ words away from the ‘‘delay slot’’ instruction after the branch; so a branch-to-self has an offset of -1.
target 26-bit word address to be jumped to (it corresponds to a 28-bit byte address, which is always word-aligned). The long j instruction is rarely used, so this format is pretty much exclusively for function calls (jal). The high-order 4 bits of the target address can’t be specified by this instruction, and are taken from the address of the jump instruction. This means that these instructions can reach anywhere in the 256Mbyte region around the instruction’s location. To jump further use a jr (jump register) instruction.
constant 16-bit integer constant for ‘‘immediate’’ arithmetic or logic operations.
mf Yet another extended opcode field, this time used by ‘‘co-processor’’ type instructions.
rg Field which may hold a source or destination register.
crg Field to hold the number of a CPU control register (different from the integer register file). Called ‘‘crs’’/‘‘crd’’ in contexts where it must be a source/destination respectively.
The instruction encodings have been chosen to facilitate the design of a high-frequency CPU. Specifically:
• The instruction encodings do reveal portions of the internal CPU design. Although there are variable encodings, those fields which are required very early in the pipeline are encoded in a very regular way:
• Source registers are always in the same place : so that the CPU can fetch two operands from the integer register file without any conditional decoding. Some instructions may not need both registers – but since the register file is designed to provide two source values on every clock nothing has been lost.
• 16-bit constant is always in the same place : permitting the appropriate instruction bits to be fed directly into the ALU’s input multiplexer, without conditional shifts.

Loading and storing: addressing modes
As mentioned above, there is only one basic ‘‘addressing mode’’. Any load or store machine instruction can be written as:
    operation dest-reg, offset(src-reg)
e.g.:
    lw $1, offset($2)
    sw $3, offset($4)
Any of the GP registers can be used for the destination and source. The offset is a signed, 16-bit number (so can be anywhere between -32768 and 32767); the program address used for the load is the sum of src-reg and the offset. This address mode is normally enough to pick out a particular member of a C structure (‘‘offset’’ being the distance between the start of the structure and the member required); it implements an array indexed by a constant; it is enough to reference function variables from the stack or frame pointer; and it provides a reasonable sized global area around the gp value for static and extern variables.
The assembler provides the semblance of a simple direct addressing mode, to load the values of memory variables whose address can be computed at link time.
More complex modes such as double-register or scaled index must be implemented with sequences of instructions.

Data types in Memory and registers
The R30xx family CPUs can load or store between 1 and 4 bytes in a single operation.
Naming conventions are used in the documentation and to build instruction mnemonics:

‘‘C’’ name   MIPS name   Size (bytes)   Assembler mnemonic
int          word        4              ‘‘w’’ as in lw
long         word        4              ‘‘w’’ as in lw
short        halfword    2              ‘‘h’’ as in lh
char         byte        1              ‘‘b’’ as in lb
Integer data types

Byte and halfword loads come in two flavors:
• Sign-extend : lb and lh load the value into the least significant bits of the 32-bit register, but fill the high order bits by copying the ‘‘sign bit’’ (bit 7 of a byte, bit 15 of a halfword). This correctly converts a signed value to a 32-bit signed integer.
• Zero-extend : instructions lbu and lhu load the value into the least significant bits of a 32-bit register, with the high order bits filled with zero. This correctly converts an unsigned value in memory to the corresponding 32-bit unsigned integer value; so byte value 254 becomes 32-bit value 254.
If the byte-wide memory location whose address is in t1 contains the value 0xFE (-2, or 254 if interpreted as unsigned), then:
    lb   t2, 0(t1)
    lbu  t3, 0(t1)
will leave t2 holding the value 0xFFFF FFFE (-2 as signed 32-bit) and t3 holding the value 0x0000 00FE (254 as signed or unsigned 32-bit).
Subtle differences in the way shorter integers are extended to longer ones are a historical cause of C portability problems, and the modern C standards have elaborate rules. On machines like the MIPS, which do not perform 8- or 16-bit precision arithmetic directly, expressions involving short or char variables are less efficient than word operations.

Unaligned loads and stores
Normal loads and stores in the MIPS architecture must be aligned; halfwords may be loaded only from 2-byte boundaries, and words only from 4-byte boundaries. A load instruction with an unaligned address will produce a trap. Because CISC architectures such as the MC680x0 and iAPXx86 do handle unaligned loads and stores, this could complicate porting software from one of these architectures.
The MIPS architecture does provide mechanisms to support this type of operation; in extremity, software can provide a trap handler which will emulate the desired load operation and hide this feature from the application.
All data items declared by C code will be correctly aligned.
But when it is known in advance that the program will transfer a word from an address whose alignment is unknown and will be computed at run time, the architecture does allow for a special 2-instruction sequence (much more efficient than a series of byte loads, shifts and assembly). This sequence is normally generated by the macro-instruction ulw (unaligned load word).
(A macro-instruction ulh, unaligned load half, is also provided, and is synthesized by two loads, a shift, and a bitwise ‘‘or’’ operation.)
The special machine instructions are lwl and lwr (load word left, load word right). ‘‘Left’’ and ‘‘right’’ are arithmetical directions, as in ‘‘shift left’’; ‘‘left’’ is movement towards more significant bits, ‘‘right’’ is towards less significant bits.
These instructions do three things:
• load 1, 2, 3 or 4 bytes from within one aligned 4-byte (word) location;
• shift that data to move the byte selected by the address to either the most-significant (lwl) or least-significant (lwr) end of a 32-bit field;
• merge the bytes fetched from memory with the data already in the destination.
This breaks most of the rules the architecture usually sticks by; it does a logical operation on a memory variable, for example. Special hardware allows the lwl, lwr pair to be used in consecutive instructions, even though the second instruction uses the value generated by the first.
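The three steps listed above (load, shift, merge) can be sketched in C. This is a model for illustration only, under stated assumptions: the CPU is taken to be big-endian, memory is represented by a plain byte array, and the function names (aligned_word_be, model_lwl, model_lwr, model_ulw) are hypothetical. Pipeline timing, traps, and little-endian operation are not modeled.

```c
#include <assert.h>
#include <stdint.h>

/* Step 1: fetch the aligned 32-bit word containing 'addr', big-endian. */
static uint32_t aligned_word_be(const uint8_t *mem, uint32_t addr)
{
    const uint8_t *p = mem + (addr & ~(uint32_t)3);
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* lwl: the addressed byte moves to the most-significant end of the
 * register; register bytes that are not loaded keep their old value. */
uint32_t model_lwl(const uint8_t *mem, uint32_t addr, uint32_t reg)
{
    unsigned shift = 8u * (addr & 3u);               /* step 2: shift left */
    uint32_t keep = shift ? (0xFFFFFFFFu >> (32u - shift)) : 0u;
    return (aligned_word_be(mem, addr) << shift) | (reg & keep); /* step 3 */
}

/* lwr: the addressed byte moves to the least-significant end of the
 * register; the high-order bytes that are not loaded are preserved. */
uint32_t model_lwr(const uint8_t *mem, uint32_t addr, uint32_t reg)
{
    unsigned shift = 8u * (3u - (addr & 3u));        /* step 2: shift right */
    uint32_t keep = shift ? (0xFFFFFFFFu << (32u - shift)) : 0u;
    return (aligned_word_be(mem, addr) >> shift) | (reg & keep); /* step 3 */
}

/* The pair used together, as in "lwl r, 0(base); lwr r, 3(base)". */
uint32_t model_ulw(const uint8_t *mem, uint32_t addr)
{
    uint32_t r = model_lwl(mem, addr, 0u);
    return model_lwr(mem, addr + 3u, r);
}

/* Usage example: an unaligned word starting at byte address 1. */
void ulw_example(void)
{
    const uint8_t mem[8] = {0x00, 0x11, 0x22, 0x33, 0x44, 0x55, 0x66, 0x77};
    assert(model_ulw(mem, 1u) == 0x11223344u);
}
```

In this model, when the address is already 4-byte aligned both calls fetch the same whole word, so the second merely overwrites the first with an identical value.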
For example, on a CPU configured as big-endian the assembler instruction:
    ulw  t1, 0(t2)
    add  t4, t3, t1
is implemented as:
    lwl  t1, 0(t2)
    lwr  t1, 3(t2)
    nop
    add  t4, t3, t1
Where:
• the lwl picks up the lowest-addressed byte of the unaligned 4-byte region, together with however many more bytes fit into an aligned word. It then shifts them left, to form the most-significant bytes of the register value.
• the lwr is aimed at the highest-addressed byte in the unaligned 4-byte region. It loads it, together with any bytes which precede it in the same memory word, and shifts it right to get the least significant bits of the register value. The merge leaves the high-order bits unchanged.
• Although special hardware ensures that a nop is not required between the lwl and lwr, there is still a load delay between the second of them and a normal instruction.
Note that if t2 was in fact 4-byte aligned, then both instructions load the entire word; duplicating effort, but achieving the desired effect.
CPU behavior when operating with little-endian byte order is described in a later chapter.

Floating point data in memory
Loads into floating point registers from 4-byte aligned memory move data without any interpretation – a program can load an invalid floating point number and no FP error will result until an arithmetic operation is requested with it as an operand.
This allows a programmer to load single-precision values by a load into an even-numbered floating point register; but the programmer can also load a double-precision value by a macro instruction, so that:
    ldc1 $f2, 24(t1)
is expanded to two loads to consecutive registers:
    lwc1 $f2, 24(t1)
    lwc1 $f3, 28(t1)
The C compiler aligns 8-byte long double-precision floating point variables to 8-byte boundaries.
R30xx family hardware does not require this alignment; but it is done to avoid compatibility problems with implementations of MIPS-2 or MIPS-3 CPUs such as the IDT R4600 (Orion), where the ldc1 instruction is part of the machine code, and the alignment is necessary.

BASIC ADDRESS SPACE
The way in which MIPS processors use and handle addresses is subtly different from that of traditional CISC CPUs, and may appear confusing. Read the first part of this section carefully. Here are some guidelines:
• The addresses put into programs are rarely the same as the physical addresses which come out of the chip (sometimes they’re close, but not the same). This manual will refer to them as program addresses and physical addresses respectively. A more common name for program addresses is “virtual addresses”; note that the use of the term “virtual address” does not necessarily imply that an operating system must perform virtual memory management (e.g. demand paging from disks...), but rather that the address undergoes some transformation before being presented to physical memory. Although virtual address is a proper term, this manual will typically use the term “program address” to avoid confusing virtual addresses with virtual memory management requirements.
• A MIPS-1 CPU has two operating modes: user and kernel. In user mode, any address above 2Gbytes (most-significant bit of the address set) is illegal and causes a trap. Also, some instructions cause a trap in user mode.
• The 32-bit program address space is divided into four big areas with traditional names; and different things happen according to the area an address lies in:
kuseg 0x0000 0000 – 0x7FFF FFFF (low 2Gbytes): these are the addresses permitted in user mode. In machines with an MMU (“E” versions of the R30xx family), they will always be translated (more about the R30xx MMU in a later chapter). Software should not attempt to use these addresses unless the MMU is set up.
For machines without an MMU (“base” versions of the R30xx family), the kuseg “program address” is transformed to a physical address by adding a 1GB offset; the address transformations for “base versions” of the R30xx family are described later in this chapter. Note, however, that many embedded applications do not use this address segment (those applications which do not require that the kernel and its resources be protected from user tasks).

kseg0 0x8000 0000 – 0x9FFF FFFF (512 Mbytes): these addresses are ‘‘translated’’ into physical addresses by merely stripping off the top bit, mapping them contiguously into the low 512 Mbytes of physical memory. This transformation operates the same for both “base” and “E” family members. This segment is referred to as “unmapped” because “E” version devices cannot redirect this translation to a different area of physical memory. Addresses in this region are always accessed through the cache, so may not be used until the caches are properly initialized. They will be used for most programs and data in systems using “base” family members; and will be used for the OS kernel for systems which do use the MMU (“E” version devices).

kseg1 0xA000 0000 – 0xBFFF FFFF (512 Mbytes): these addresses are mapped into physical addresses by stripping off the leading three bits, giving a duplicate mapping of the low 512 Mbytes of physical memory. However, kseg1 program address accesses will not use the cache. The kseg1 region is the only chunk of the memory map which is guaranteed to behave properly from system reset; that’s why the after-reset starting point (0xBFC0 0000, commonly called the “reset exception vector”) lies within it. The physical address of the starting point is 0x1FC0 0000 – which means that the hardware should place the boot ROM at this physical address. Software will therefore use this region for the initial program ROM, and most systems also use it for I/O registers.
In general, IO devices should always be mapped to addresses that are accessible from kseg1, and system ROM is always mapped to contain the reset exception vector. Note that code in the ROM can then be accessed uncacheably (during boot up) using kseg1 program addresses, and also can be accessed cacheably (for normal operation) using kseg0 program addresses.

kseg2 0xC000 0000 – 0xFFFF FFFF (1 Gbyte): this area is only accessible in kernel mode. As for kuseg, in “E” devices program addresses are translated by the MMU into physical addresses; thus, these addresses must not be referenced prior to MMU initialization. For “base versions”, physical addresses are generated to be the same as program addresses for kseg2. Note that many systems will not need this region. In “E” versions, it frequently contains OS structures such as page tables; simpler OS’es probably will have little need for kseg2.

SUMMARY OF SYSTEM ADDRESSING

MIPS program addresses are rarely simply the same as physical addresses, but simple embedded software will probably use addresses in kseg0 and kseg1, where the program address is related in an obvious and unchangeable way to physical addresses.

Physical memory locations from 0x2000 0000 (512Mbyte) upward may be difficult to access. In “E” versions of the R30xx family, the only way to reach these addresses is through the MMU. In “base” family members, certain of these physical addresses can be reached using kseg2 or kuseg addresses: the address transformations for base R30xx family members are described later in this chapter.

Kernel vs. user mode

In kernel mode (the CPU resets into this state), all program addresses are accessible. In user mode:

• Program addresses above 2Gbytes (top bit set) are illegal and will cause a trap. Note that if the CPU has an MMU, this means all valid user mode addresses must be translated by the MMU; thus, User mode for “E” devices typically requires the use of a memory-mapped OS.
For “base” CPUs, kuseg addresses are mapped to a distinct area of physical memory. Thus, kernel memory resources (including IO devices) can be made inaccessible to User mode software, without requiring a memory-mapping function from the OS. Alternately, the hardware can choose to “ignore” high-order address bits when performing address decoding, thus “condensing” kuseg, kseg2, kseg1, and kseg0 into the same physical memory.
• Instructions beyond the standard user set become illegal. Specifically, the kernel can prevent User mode software from accessing the on-chip CP0 (system control coprocessor, which controls exception and machine state and performs the memory management functions of the CPU).

Thus, the primary differences between User and Kernel modes are:

• User mode tasks can be inhibited from accessing kernel memory resources, including OS data structures and IO devices. This also means that various user tasks can be protected from each other.
• User mode tasks can be inhibited from modifying the basic machine state, by prohibiting accesses to CP0.

Note that the kernel/user mode bit does not change the interpretation of anything – just some things cease to be allowed in user mode. In kernel mode the CPU can access low addresses just as if it was in user mode, and they will be translated in the same way.

Memory map for CPUs without MMU hardware

The treatment of kseg0 and kseg1 addresses is the same for all IDT R30xx CPUs. If the system can be implemented using only physical addresses in the low 512Mbytes, and system software can be written to use only kseg0 and kseg1, then the choice of “base” vs. “E” versions of the R30xx family is not relevant.

For versions without the MMU (“base versions”), addresses in kuseg and kseg2 will undergo a fixed address translation, and provide the system designer the option to provide additional memory.
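The segment rules given in this chapter (kseg0 strips the top bit, kseg1 strips the top three bits, and on "base" versions kuseg adds a 1GB offset while kseg2 passes through unchanged) can be sketched in C as follows. This is only an illustrative model for host-side tools or simulators; the type and function names are hypothetical and do not come from any IDT toolchain.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names; the segment boundaries are those given in this chapter. */
typedef enum { SEG_KUSEG, SEG_KSEG0, SEG_KSEG1, SEG_KSEG2 } r30xx_segment;

/* Classify a 32-bit program address by the segment it falls in. */
static inline r30xx_segment r30xx_segment_of(uint32_t prog)
{
    if (prog < 0x80000000u) return SEG_KUSEG;  /* low 2Gbytes */
    if (prog < 0xA0000000u) return SEG_KSEG0;  /* unmapped, cached */
    if (prog < 0xC0000000u) return SEG_KSEG1;  /* unmapped, uncached */
    return SEG_KSEG2;                          /* top 1Gbyte, kernel only */
}

/* Program-to-physical translation as described for "base" (no MMU) parts. */
static inline uint32_t r30xx_base_prog_to_phys(uint32_t prog)
{
    switch (r30xx_segment_of(prog)) {
    case SEG_KUSEG: return prog + 0x40000000u;  /* fixed 1GB offset */
    case SEG_KSEG0: return prog & 0x7FFFFFFFu;  /* strip the top bit */
    case SEG_KSEG1: return prog & 0x1FFFFFFFu;  /* strip the top three bits */
    default:        return prog;                /* kseg2: untranslated */
    }
}

/* Build the uncached (kseg1) program address for a physical address. */
static inline uint32_t r30xx_phys_to_kseg1(uint32_t phys)
{
    assert(phys < 0x20000000u);  /* only the low 512Mbytes are reachable */
    return phys | 0xA0000000u;
}
```

For instance, passing the reset vector 0xBFC0 0000 to r30xx_base_prog_to_phys() yields the boot ROM physical address 0x1FC0 0000 quoted earlier. For "E" versions only the kseg0 and kseg1 cases apply, since kuseg and kseg2 go through the MMU.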
The base members of the R30xx family provide the following address translations for kuseg and kseg2 program addresses:

• kuseg: this region (the low 2Gbytes of program addresses) is translated to a contiguous 2Gbyte physical region between 1 and 3 Gbytes. In effect, a 1GB offset is added to each kuseg program address. In hex:

    Program address                   Physical address
    0x0000 0000 – 0x7FFF FFFF    →    0x4000 0000 – 0xBFFF FFFF

• kseg2: these program addresses are genuinely untranslated. So program addresses from 0xC000 0000 – 0xFFFF FFFF emerge as identical physical addresses.

This means that “base” versions can generate most physical addresses (without the use of an MMU), except for a gap between 512Mbyte and 1Gbyte (0x2000 0000 through 0x3FFF FFFF). As noted above, many systems may ignore high-order address bits when performing address decoding, thus condensing all physical memory into the lowest 512MB addresses.

Subsegments in the R3041 – memory width configuration

The R3041 CPU can be configured to access different regions of memory as either 32-, 16- or 8-bits wide. Where the program requests a 32-bit operation to a narrow memory (either with an uncached access, or a cache miss, or a store), the CPU may break a transaction into multiple data phases, to match the datum size to the memory port width.

The width configuration is applied independently to subsegments of the normal kseg regions, as follows:

• kseg0 and kseg1: as usual, these are both mapped onto the low 512Mbytes. This common region is split into 8 subsegments (64Mbytes each), each of which can be programmed as 8-, 16- or 32-bits wide. The width assignment affects both kseg0 and kseg1 accesses (that is, one can view these as subsegments of the corresponding “physical” addresses).
• kuseg: is divided into four 512Mbyte subsegments, each independently programmable for width. Thus, kuseg can be broken into multiple portions, which may have varying widths.
An example of this may be a 32-bit main memory with some 16-bit PCMCIA font cards and an 8-bit NVRAM.
• kseg2: is divided into two 512Mbyte subsegments, independently programmable for width. Again, this means that kseg2 can support multiple memory subsystems, of varying port width.

Note that once the various memory port widths have been configured (typically at boot time), software does not have to be aware of the actual width of any memory system. It can choose to treat all memory as 32-bit wide, and the CPU will automatically adjust when an access is made to a narrower memory region. This simplifies software development, and also facilitates porting to various system implementations (which may or may not choose the same memory port widths).

MIPS-1 (R30xx) ARCHITECTURE
CHAPTER 2

PROGRAMMER’S VIEW OF THE PROCESSOR ARCHITECTURE

This chapter describes the assembly programmer’s view of the CPU architecture, in terms of registers, instructions, and computational resources. This viewpoint corresponds, for example, to an assembly programmer writing user applications (although more typically, such a programmer would use a high-level language).

Information about kernel software development (such as handling interrupts, traps, and cache and memory management) is described in later chapters.

Registers

There are 32 general purpose registers: $0 to $31. Two, and only two, are special to the hardware:

• $0 always returns zero, no matter what software attempts to store to it.
• $31 is used by the normal subroutine-calling instruction (jal) for the return address. Note that the call-by-register version (jalr) can use ANY register for the return address, though practice is to use only $31.

In all other respects all registers are identical and can be used in any instruction ($0 can be used as the destination of instructions; the value of $0 will remain unchanged, however, so the instruction would be effectively a NOP).
In the MIPS architecture the ‘‘program counter’’ is not a register, and it is probably better to not think of it that way. The return address of a jal is two instructions later in sequence (the instruction after the jump delay slot instruction); the instruction after the call is the call’s ‘‘delay slot’’ and is typically used to set up the last parameter.

There are no condition codes and nothing in the ‘‘status register’’ or other CPU internals is of any consequence to the user-level programmer.

There are two registers associated with the integer multiplier. These registers, referred to as “HI” and “LO”, contain the 64-bit product result of a multiply operation, or the quotient and remainder of a divide.

The floating point math co-processor (called FPA for floating point accelerator), if available, adds 32 floating point registers†; in simple assembler language they are just called $0 to $31 again – the fact that these are floating point registers is implicitly defined by the instruction. Actually, only the 16 even-numbered registers are usable for math; but they can be used for either single-precision (32-bit) or double-precision (64-bit) numbers. When performing double-precision arithmetic, odd-numbered register $N+1 holds the remaining bits of the even-numbered register identified as $N. Only moves between integer and FPA, or FPA load/store instructions, ever refer to odd-numbered registers (and even then the assembler helps the programmer forget...)

† The FPA also has a different set of registers called ‘‘co-processor 1 registers’’ for control purposes. These are typically used to manage the actions/state of the FPA, and should not be confused with the FPA data registers.

SYSTEM CONTROL COPROCESSOR ARCHITECTURE
CHAPTER 3

This chapter concentrates on the aspects of the R30xx family architecture that must be managed by the OS programmer.
Note that most of these features are transparent to the user program author; however, the nature of embedded systems is such that most embedded systems programmers will have a view of the underlying CPU and system architecture, and thus will find this material important.

Co-processors

MIPS uses the term “co-processor” both in a traditional fashion, and also in a non-traditional fashion. Specifically, the FPA device is a traditional microprocessor co-processor: it is an optional part of the architecture, with its own particular instruction set. Opcodes are reserved and instruction fields defined for up to four ‘‘co-processors’’. Architecturally, the co-processors can be tightly coupled to the base integer CPU; for example, the ISA defines instructions to move data directly between memory and the co-processor, rather than requiring it to be moved into the integer processor first.

However, MIPS also uses the term “co-processor” for the functions required to manage the CPU environment, including exception management, cache control, and memory management. This segmentation ensures that the chip architecture can be varied (e.g. cache architecture, interrupt controller, etc.), without impacting user mode software compatibility.

These functions are grouped by MIPS into the on-chip “co-processor 0”, or ‘‘system control co-processor’’, and these instructions implement the whole CPU control system. Note that co-processor 0 has no independent existence, and is certainly not optional. It provides a standard way of encoding the instructions which access the CPU status register; so that, although the definition of the status register changes among implementations, programmers can use the same assembler for both CPUs. Similarly, the exception and memory management strategies can be varied among implementations, and these effects isolated to particular portions of the OS kernel.
CPU CONTROL SUMMARY

This chapter, coupled with chapters on cache management, memory management, and exception processing, provides details on managing the machine and OS state. The areas of interest include:

• CPU control and co-processor : how privileged instructions are organized, with shortform descriptions. There are relatively few privileged instructions; most of the low-level control over the CPU is exercised by reading and writing bit-fields within special registers.
• Exceptions : external interrupts, invalid operations, arithmetic errors – all result in ‘‘exceptions’’, where control is transferred to an exception handler routine. MIPS exceptions are extremely simple – the hardware does the absolute minimum, allowing the programmer to tailor the exception mechanism to the needs of the particular system. A later chapter describes MIPS exceptions, why they are ‘‘precise’’, exception vectors, and conventions about how to code exception handling routines. Special problems can arise with nested exceptions: exceptions occurring while the CPU is still handling an earlier exception. Hardware interrupts have their own style and rules. The Exception Management chapter includes an annotated example of a moderately-complicated exception handler.
• Caches and cache management : all R30xx implementations have dual caches (the I-cache for instructions, the D-cache for data). On-chip hardware is provided to manage the caches, and the programmer working with I/O devices, particularly with DMA devices, may need to explicitly manage the caches in particular situations. To manipulate the caches, the CPU allows software to isolate them, inhibiting cache/memory traffic and allowing the processor to access cache as if it were simple memory; and the CPU can swap the roles of the I-cache and D-cache (the only way to make the I-cache writable). Caches must sometimes be cleared of stale or invalid/uninitialized data.
Even following power-up, the R30xx caches are in a random state and must be cleaned up before they can be used. A later chapter will discuss the techniques used by software to manage the on-chip cache resources. In addition, techniques to determine the on-chip cache sizes will be shown (greatest flexibility is achieved if software can be written to be independent of cache sizes). For the diagnostics programmer, techniques to test the cache memory and probe for particular entries will be discussed. On some CPU implementations the system designer may make configuration choices about the cache (e.g. the R3081 and R3071 allow the cache organization to be selected between 16kB of I-cache/4kB of D-cache and 8kB each of I- and D-cache). The cache management chapter will also discuss some of the considerations to apply to make a proper selection.
• Write buffer : on R30xx family CPUs the D-cache is always write through; all writes go to main memory as well as the cache. This simplifies the caches, but main memory won’t be able to accept data as fast as the CPU can write it. Much of the performance loss can be made up by using a FIFO store which holds a number of ‘‘write cycles’’ (it stores both address and data). In the R30xx family, this FIFO, called the write buffer, is integrated on-chip. System programmers may need to know that writes happen later than the code sequence suggests. The chapter on cache management discusses this.
• Starting up : at reset almost nothing is defined, so the software must build carefully. In MIPS CPUs, reset is implemented in almost exactly the same way as the exceptions. A later chapter on reset initialization discusses ways of finding out which CPU is executing the software, and how to get a ROM program to run. An example of a C runtime environment, attending to the stack and special registers, is provided.
• Memory management and the TLB : A later chapter will discuss address translation and managing the translation hardware (the TLB).
This section is mostly for OS programmers.

CPU CONTROL AND ‘‘CO-PROCESSOR 0’’

CPU control instructions

Most control functions are implemented with registers (most of which consist of multiple bitfields). The MIPS architecture has an escape mechanism to define instructions for ‘‘co-processors’’ – and the CPU control instructions are coded for ‘‘co-processor 0’’.

There are several CPU control instructions used in the memory management implementation, which are described in a later chapter. But leaving aside the MMU, CPU control defines just one instruction beyond the necessary move to and from the control registers.

mtc0 rs, <nn> – Move to co-processor zero
Loads ‘‘co-processor 0’’ register number nn from CPU general register rs. It is unusual, and not good practice, to refer to CPU control registers by their number in assembler sources; normal practice is to use the names listed in Table 3.1, “Summary of CPU control registers (not MMU)”. In some toolchains the names are defined by a C-style ‘‘include’’ file, and the C preprocessor run as a front-end to the assembler; the assembler manual should provide guidance on how to do this. This is the only way of setting bits in a CPU control register.

mfc0 rd, <nn> – Move from co-processor zero
General register rd is loaded with the values from CPU control register number nn. Once again, it is common to use a symbolic name and a macro-processor to save remembering the numbers. This is the only way of inspecting bits in a control register.

rfe – Restore from exception
Note that this is not ‘‘return from exception’’. This instruction restores the status register to go back to the state prior to the trap. To understand what it does, refer to the status register SR defined later in this chapter. The only secure way of returning to user mode from an exception is to return with a jr instruction which has the rfe in its delay slot.
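The effect of rfe on the status register can be modelled in a few lines of C. This sketch assumes the SR layout shown later in this chapter, where bits 5..0 hold KUo, IEo, KUp, IEp, KUc and IEc; the helper names are hypothetical, and the code is a host-side model of the register update only, not something that executes rfe itself.

```c
#include <assert.h>
#include <stdint.h>

/* Bits 3..0 of SR hold KUp IEp KUc IEc; bits 5..4 hold KUo IEo. */
#define SR_KUIE_CUR_PREV 0x0Fu

/* Model of what rfe does to an SR value (hypothetical helper name):
 * the "previous" pair is copied into "current" and the "old" pair is
 * copied into "previous". The "old" pair itself is left as it was,
 * and no other SR field is touched. */
static inline uint32_t sr_after_rfe(uint32_t sr)
{
    uint32_t popped = (sr >> 2) & SR_KUIE_CUR_PREV;
    return (sr & ~(uint32_t)SR_KUIE_CUR_PREV) | popped;
}

/* Small usage check: low six bits 10 01 00 become 10 10 01. */
static inline void sr_after_rfe_example(void)
{
    assert(sr_after_rfe(0x0000FF24u) == 0x0000FF29u);
}
```

Note that rfe changes nothing except these low bits, which is why the return itself still needs the separate jr described above.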
Standard CPU control registers

This table describes the general CPU control registers (ignoring the MMU control registers). Also note that typical convention is to reserve k0 and k1 for exception processing, although they are proper GP registers of the integer CPU unit.

    Register Mnemonic      CP0 reg no.   Description
    PRId                   15            CP0 type and rev level
    SR (status register)   12            CPU mode flags
    Cause                  13            Describes the most recently recognized exception
    EPC                    14            Return address from trap
    BadVaddr               8             Contains the last invalid program address which caused a trap. It is set by address errors of all kinds, even if there is no MMU
    Config                 3             CPU configuration (R3081 and R3041 only)
    BusCtrl                2             (R3041 only) configure bus interface signals. Needs to be set up to match the hardware implementation.
    PortSize               10            (R3041 only) used to flag some program address regions as 8- or 16-bits wide. Must be programmed to match the hardware implementation.
    Count                  9             (R3041 only, read/write) a 24-bit counter incrementing with the CPU clock.
    Compare                11            (R3041 only, read/write) a 24-bit value used to wraparound the Count value and set an output signal.

Table 3.1. Summary of CPU control registers (not MMU)

Encoding of control registers

The next section describes the format of the control registers, with a sketch of the function of each field. In most cases, more information about how things work is to be found in separate sections or chapters later.

A note about reserved fields is in order here. Many unused control register fields are marked ‘‘0’’. Bits in such fields are guaranteed to read zero, and should be written as zero. Other reserved fields are marked ‘‘reserved’’ or ‘‘×’’; software must always write them as zero, and should not assume that it will get back zero or any other particular value.

Registers specific to the memory management system are described in a later chapter.

PRId Register

    Bits     Field
    31:16    reserved
    15:8     Imp
    7:0      Rev

Figure 3.1. PRId Register fields

Figure 3.1, “PRId Register fields” shows the layout of the PRId register, a read-only register to be consulted to identify the CPU type (more properly, this register describes CP0, allowing the kernel to dynamically configure itself for various CPU implementations). ‘‘Imp’’ should be related to the CPU control register set. The encoding of Imp is described below:

    CPU type                                             ‘‘Imp’’ value
    R3000A (including R3051, R3052, R3071, and R3081)    3
    IDT unique (R3041)                                   7

Note that when the Imp field indicates IDT unique, the revision number can be used to distinguish among various CP0 implementations. Refer to the R3041 User’s manual for the revision level appropriate for that device. Since the R3051, 52, 71, and 81 are kernel compatible with the R3000A, they share the same Imp value.

When printing the value of this register, it is conventional to print it out as ‘‘x.y’’ where ‘‘x’’ and ‘‘y’’ are the decimal values of Imp and Rev respectively. Try not to use this register and the CPU manuals to size things, or to establish the presence or absence of particular features; software will be more portable and robust if it is designed to include code sequences to probe for the existence of individual features. This manual will provide numerous examples designed to determine cache sizes, presence or absence of TLB, FPA, etc.

SR Register

    Bits     Field
    31       CU3
    30       CU2
    29       CU1
    28       CU0
    27:26    0
    25       RE
    24:23    0
    22       BEV
    21       TS
    20       PE
    19       CM
    18       PZ
    17       SwC
    16       IsC
    15:8     IM
    7:6      0
    5        KUo
    4        IEo
    3        KUp
    2        IEp
    1        KUc
    0        IEc

Figure 3.2. Fields in status register (SR)

The MIPS CPU has remarkably few mode bits; those that exist are defined by fields in the CPU status register SR, as shown in Figure 3.2, “Fields in status register (SR)”. Note that there are no modes such as non-translated or non-cached in MIPS CPUs; all translation and caching decisions are made on the basis of the program address.
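The field positions in Figure 3.1 and Figure 3.2 can be written down as C masks, in the style of the ‘‘include’’ file mentioned earlier. The following is a minimal sketch: the macro and function names are invented for illustration (a real toolchain header will have its own), and the Imp constants are simply the two values from the table in this section.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* PRId fields (Figure 3.1): Imp in bits 15:8, Rev in bits 7:0. */
#define PRID_IMP(prid)      (((uint32_t)(prid) >> 8) & 0xFFu)
#define PRID_REV(prid)      ((uint32_t)(prid) & 0xFFu)
#define PRID_IMP_R3000A     3u   /* R3000A, R3051, R3052, R3071, R3081 */
#define PRID_IMP_IDT_UNIQUE 7u   /* R3041 */

/* SR fields (Figure 3.2). */
#define SR_CU3  (1u << 31)
#define SR_CU2  (1u << 30)
#define SR_CU1  (1u << 29)
#define SR_CU0  (1u << 28)
#define SR_RE   (1u << 25)
#define SR_BEV  (1u << 22)
#define SR_TS   (1u << 21)
#define SR_PE   (1u << 20)
#define SR_CM   (1u << 19)
#define SR_PZ   (1u << 18)
#define SR_SwC  (1u << 17)
#define SR_IsC  (1u << 16)
#define SR_IM_SHIFT 8
#define SR_IM_MASK  (0xFFu << SR_IM_SHIFT)
#define SR_KUo  (1u << 5)
#define SR_IEo  (1u << 4)
#define SR_KUp  (1u << 3)
#define SR_IEp  (1u << 2)
#define SR_KUc  (1u << 1)
#define SR_IEc  (1u << 0)

/* Format a PRId value in the conventional "x.y" (Imp.Rev, decimal) form. */
static inline int prid_format(uint32_t prid, char *buf, size_t len)
{
    return snprintf(buf, len, "%u.%u",
                    (unsigned)PRID_IMP(prid), (unsigned)PRID_REV(prid));
}

/* Usage check with a made-up PRId value (Imp = 3, Rev = 2). */
static inline void cp0_fields_example(void)
{
    char text[16];
    prid_format(0x00000302u, text, sizeof text);
    assert(strcmp(text, "3.2") == 0);
    assert(PRID_IMP(0x00000302u) == PRID_IMP_R3000A);
}
```

On the target, the raw values would come from mfc0 via whatever accessor the toolchain provides; the sketch deliberately leaves that part out.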
Fields are:

CU3, CU2 Bits (31:30) control the usability of ‘‘co-processors’’ 3 and 2 respectively. In the R30xx family, these might be enabled if software wishes to use the BrCond(3:2) input pins for polling, or to speed exception decoding.

CU1 ‘‘co-processor 1 usable’’: 1 to use FPA if present, 0 to disable. When 0, all FPA instructions cause an exception, even for the kernel. It can be useful to turn off an FPA even when one is available; it may also be enabled in devices which do not include an FPA, if the intent is to use the BrCond(1) pin as a polled input.

CU0 ‘‘co-processor 0 usable’’: set 1 to be able to use some nominally-privileged instructions in user mode (this is rarely if ever done). The CPU control instructions encoded as ‘‘co-processor 0’’ type are always usable in kernel mode, regardless of the setting of this bit.

RE ‘‘reverse endianness in user mode’’. The MIPS processors can be configured, at reset time, with either ‘‘endianness’’ (byte ordering convention, discussed in the various CPU’s User’s Manuals and later in this manual). The RE bit allows binaries intended to be run with one byte ordering convention to be run in systems with the opposite convention, presuming OS software provided the necessary support. When RE is active, user-privilege software runs as if the CPU had been configured with the opposite endianness. However, achieving cross-universe running would require a large software effort as well, and should not be necessary in embedded systems.

BEV ‘‘boot exception vectors’’: when BEV == 1, the CPU uses the ROM (kseg1) space exception entry point (described in a later chapter). BEV is usually set to zero in running systems; this relocates the exception vectors to RAM addresses, speeding accesses and allowing the use of “user supplied” exception service routines.

TS ‘‘TLB shutdown’’: In devices which implement the full R3000A MMU, TS gets set if a program address simultaneously matches two TLB entries.
Prolonged operation in this state, in some implementations, could cause internal contention and damage to the chip. TLB shutdown is terminal, and can be cleared only by a hardware reset. In base family members, which do not include the TLB, this bit is set by reset; software can rely on this feature to determine the presence or absence of TLB support hardware.

PE set if a cache parity error has occurred. No exception is generated by this condition, which is really only useful for diagnostics. The MIPS architecture has cache diagnostic facilities because earlier versions of the CPU used external caches, and this provided a way to verify the timing of a particular system. For those implementations the cache parity error bit was an essential design debug tool. For CPUs with on-chip caches this feature is rarely needed; only the R3071 and R3081 implement parity over the on-chip caches.

CM shows the result of the last load operation performed with the D-cache isolated (described in the chapter on cache management). CM is set if the cache really contained data for the addressed memory location (i.e. if the load would have hit in the cache even if the cache had not been isolated).

PZ When set, cache parity bits are written as zero and not checked. This was useful in old R3000A systems which required external cache RAMs, but is of little relevance to the R30xx family.

SwC, IsC ‘‘swap caches’’ and ‘‘isolate (data) cache’’. Cache mode bits for cache management and diagnostics; their use is described in detail in a later chapter on cache management. In simple terms:
• IsC set 1: makes all loads and stores access only the data cache, and never memory; and in this mode a partial-word store invalidates the cache entry. Note that when this bit is set, even uncached data accesses will not be seen on the bus; further, this bit is not initialized by reset. Boot-up software must ensure this bit is properly initialized before relying on external data references.
• SwC set 1: reverses the roles of the I-cache and D-cache, so that software can access and invalidate I-cache entries.

IM ‘‘interrupt mask’’: an 8 bit field defining which interrupt sources, when active, will be allowed to cause an exception. Six of the interrupt sources are external pins (one may be used by the FPA, which although it lives on the same chip is logically external); the other two are the software-writable interrupt bits in the Cause register. No interrupt prioritization is provided by the CPU: the hardware treats all interrupt bits the same. This is described in greater detail in the chapter dealing with exceptions.

KUc, IEc The two basic CPU protection bits. KUc is 0 when running with kernel privileges, 1 for user mode. In kernel mode, software can get at the whole program address space, and use privileged (‘‘co-processor 0’’) instructions. User mode restricts software to program addresses between 0x0000 0000 and 0x7FFF FFFF, and can be denied permission to run privileged instructions; attempts to break the rules result in an exception. IEc is set 0 to prevent the CPU taking any interrupt, 1 to enable.

KUp, IEp ‘‘KU previous, IE previous’’: on an exception, the hardware takes the values of KUc and IEc and saves them here; at the same time as changing the values of KUc, IEc to [0, 0] (kernel mode, interrupts disabled). The instruction rfe can be used to copy KUp, IEp back into KUc, IEc.

KUo, IEo ‘‘KU old, IE old’’: on an exception the KUp, IEp bits are saved here. Effectively, the six KU/IE bits are operated as a 3-deep, 2-bit wide stack which is pushed on an exception and popped by an rfe. This provides a chance of recovering cleanly from an exception occurring so early in an exception handling routine that the first exception has not yet saved SR.
The circumstances in which this can be done are limited, and it is probably only really of use in allowing the user TLB refill code to be made a little shorter, as described in the chapter on memory management.

Cause Register

    Bits     Field
    31       BD
    30       0
    29:28    CE
    27:16    0
    15:8     IP
    7        0
    6:2      ExcCode
    1:0      0

Figure 3.3. Fields in the Cause register

Figure 3.3, “Fields in the Cause register” shows the fields in the Cause register, which are consulted to determine the kind of exception which happened and will be used to decide which exception routine to call.

BD ‘‘branch delay’’: if set, this bit indicates that the EPC does not point to the actual “exception” instruction, but rather to the branch instruction which immediately precedes it. When the exception restart point is an instruction which is in the ‘‘delay slot’’ following a branch, EPC has to point to the branch instruction; it is harmless to re-execute the branch, but if the CPU returned from the exception to the branch delay instruction itself the branch would not be taken and the exception would have broken the interrupted program. The only time software might be sensitive to this bit is if it must analyze the ‘‘offending’’ instruction (if BD == 1 then the instruction is at EPC + 4). This would occur if the instruction needs to be emulated (e.g. a floating point instruction in a device with no hardware FPA; or a breakpoint placed in a branch delay slot).

CE ‘‘co-processor error’’: if the exception is taken because a ‘‘co-processor’’ format instruction was for a ‘‘co-processor’’ which is not enabled by the CUx bit in SR, then this field has the co-processor number from that instruction.

IP ‘‘Interrupt Pending’’: shows the interrupts which are currently asserted (but may be “masked” from actually signalling an exception). These bits follow the CPU inputs for the six hardware levels. Bits 9 and 8 are read/writable, and contain the value last written to them.
However, any of the 8 bits active when enabled by the appropriate IM bit and the global interrupt enable flag IEc in SR, will cause an interrupt. IP is subtly different from the rest of the Cause register fields; it doesn’t indicate what happened when the exception took place, but rather shows what is happening now.

ExcCode A 5-bit code which indicates what kind of exception happened, as detailed in Table 3.2, “ExcCode values: different kinds of exceptions”.

    ExcCode Value   Mnemonic   Description
    0               Int        Interrupt
    1               Mod        ‘‘TLB modification’’
    2               TLBL       ‘‘TLB load/TLB store’’
    3               TLBS
    4               AdEL       Address error (on load/I-fetch or store respectively). Either an attempt to access outside kuseg when in user mode, or an attempt to read a word or half-word at a misaligned address.
    5               AdES
    6               IBE        Bus error (instruction fetch or data load, respectively). External hardware has signalled an error of some kind; proper exception handling is system-dependent. The R30xx family CPUs can’t take a bus error on a store; the write buffer would make such an exception “imprecise”.
    7               DBE
    8               Syscall    Generated unconditionally by a syscall instruction.
    9               Bp         Breakpoint - a break instruction.
    10              RI         ‘‘reserved instruction’’
    11              CpU        ‘‘Co-Processor unusable’’
    12              Ov         ‘‘arithmetic overflow’’. Note that ‘‘unsigned’’ versions of instructions (e.g. addu) never cause this exception.
    13-31           -          reserved. Some are already defined for MIPS CPUs such as the R6000 and R4xxx

Table 3.2. ExcCode values: different kinds of exceptions

EPC Register

This is a 32-bit register containing the 32-bit address of the return point for this exception. The instruction causing (or suffering) the exception is at EPC, unless BD is set in Cause, in which case EPC points to the previous (branch) instruction.
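The Cause and EPC rules above can be collected into a small decoding sketch in C. It assumes the bit positions of Figure 3.3 and the ExcCode numbering of Table 3.2, and that the handler has already copied Cause, EPC and SR into ordinary variables; all macro, enum and function names here are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Cause fields (Figure 3.3). */
#define CAUSE_BD          (1u << 31)
#define CAUSE_CE(c)       (((uint32_t)(c) >> 28) & 0x3u)
#define CAUSE_IP(c)       (((uint32_t)(c) >> 8) & 0xFFu)
#define CAUSE_EXCCODE(c)  (((uint32_t)(c) >> 2) & 0x1Fu)

/* ExcCode values (Table 3.2). */
enum exccode {
    EXC_INT = 0, EXC_MOD = 1, EXC_TLBL = 2, EXC_TLBS = 3,
    EXC_ADEL = 4, EXC_ADES = 5, EXC_IBE = 6, EXC_DBE = 7,
    EXC_SYSCALL = 8, EXC_BP = 9, EXC_RI = 10, EXC_CPU = 11, EXC_OV = 12
};

/* Address of the instruction that caused (or suffered) the exception:
 * EPC itself, or EPC + 4 when it sits in a branch delay slot. */
static inline uint32_t exception_instruction_address(uint32_t cause, uint32_t epc)
{
    return (cause & CAUSE_BD) ? epc + 4u : epc;
}

/* Interrupt sources that are both pending (Cause.IP) and enabled
 * (SR.IM, bits 15:8), or 0 if interrupts are globally off (SR.IEc, bit 0). */
static inline uint32_t interrupts_to_service(uint32_t cause, uint32_t sr)
{
    uint32_t im = (sr >> 8) & 0xFFu;
    if ((sr & 1u) == 0)
        return 0;
    return CAUSE_IP(cause) & im;
}

/* Usage check: a breakpoint in a branch delay slot. */
static inline void cause_decode_example(void)
{
    uint32_t cause = 0x80000024u;  /* BD = 1, ExcCode = 9 */
    assert(CAUSE_EXCCODE(cause) == EXC_BP);
    assert(exception_instruction_address(cause, 0x80001000u) == 0x80001004u);
}
```

Because the CPU provides no interrupt prioritization, the order in which a handler services the bits returned by interrupts_to_service() is a software policy decision.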
BadVaddr Register

A 32-bit register containing the address whose reference led to an exception; set on any MMU-related exception, on an attempt by a user program to access addresses outside kuseg, or if an address is wrongly aligned for the datum size referenced.
After any other exception this register is undefined. Note in particular that it is not set after a bus error.

R3041, R3071, and R3081 specific registers

Count and Compare Registers (R3041 only)

Only present in the R3041, these provide a simple 24-bit counter/timer running at CPU cycle rate. Count counts up, and then wraps around to zero once it has reached the value in the Compare register. As it wraps around the Tc* CPU output is asserted. According to CPU configuration (bit TC of the BusCtrl register), Tc* will either remain active until reset by software (re-write Compare), or will pulse. In either case the counter just keeps counting.
To generate an interrupt Tc* must be connected to one of the interrupt inputs.
From reset Compare is set up to its maximum value (0xFF FFFF), so the counter runs up to 2^24 - 1 before wrapping around.

Config Register (R3071 and R3081)

Bits    31     30         29          28-26    25     24    23    22-0
Field   Lock   Slow Bus   DB Refill   FPInt    Halt   RF    AC    reserved

Figure 3.4. Fields in the R3071/81 Config Register

• Lock : set this bit to write to the register for the last time; all future writes to Config will be ignored. The intention is that initialization software will set the register and can then lock it in case some ill-behaved piece of software developed on some earlier version of the MIPS architecture tries to stomp on Config; this would have had no effect on earlier CPUs.
• Slow Bus : hardware may require that this bit be set. It only matters when the CPU performs a store while running from a cached location.
The system hardware design determines the proper setting for this bit; setting it to ‘1’ should be permissible for any system, but loses some performance in memory systems able to support more aggressive bus performance. If set to 1, an idle bus cycle is guaranteed between any read and write transfer. This allows additional time for bus tri-stating, control logic generation, etc.
• DB : ‘‘data cache block refill’’, set 1 to reload 4 words into the data cache on any miss, set 0 to reload just one word. Can be initialized either way on the R3081, by a reset-time hardware input.
• FPInt : controls the CPU interrupt level on which FPA interrupts are reported. On original R3000 CPUs the FPA was external and this was determined by wiring; but the R3081’s FPA is on the chip and it would be inefficient (and jeopardize pin-compatibility) to send the interrupt off chip and on again.
Set FPInt to the binary value of the CPU interrupt pin number which is dedicated to FPA interrupts. By default the field is initialized to “011’’ to select the pin Int3†; MIPS convention put the FPA on external interrupt pin 3. For whichever pin is dedicated to the FPA, the CPU will then ignore the value on the external pin; the IP field of the Cause register will simply follow the FPA.
On the R3071, this field is “reserved”, and must be written as “000”.
• Halt : set to bring the CPU to a standstill. It will start again as soon as any interrupt input is asserted (regardless of the state of the interrupt mask). This is useful for power reduction, and can also be used to emulate old MC68000 “Halt” operation.
• RF : slows the CPU to 1/16th of the normal clock rate, to reduce power consumption. Illegal unless the CPU is running at 33MHz or higher. Note that the CPU’s output clock (which is normally used to synchronize all the interface logic) slows down too; the hardware design should also accommodate this feature if software desires to use it.
• AC : ‘‘alternate cache’’.
0 for 16K I-cache/4K D-cache, but set 1 for 8K I-cache/8K D-cache.
• Reserved : must only be written as zero. It will probably read as zero, but software should not rely on this.

† Take care: the external pin Int3 corresponds to the bit numbered ‘‘5’’ in IP of the Cause register or IM of the SR register. That’s because both the Cause and SR fields support two ‘‘software interrupts’’ numbered as bits 0 and 1.

Config Register (R3041)

Bits    31     30    29     28-20    19     18-0
Field   Lock   1     DBR    0        FDM    0

Figure 3.5. Fields in the R3041 Config (Cache Configuration) Register

• Lock: set 1 to finally configure the register (additional writes will not have any effect until the CPU is reset).
• 1 and 0 : set fields to exactly the value shown.
• DBR: ‘‘DBlockRefill’’, set 1 to read 4 words into the cache on a miss, 0 to refill just the word missed on. The proper setting for a given system is dependent on a number of factors, and may best be determined by measuring performance in each mode and selecting the best one. Note that it is possible for software to dynamically reconfigure the refill algorithm depending on the current code executing, presuming the register has not been “locked”.
• FDM: “Force D-Cache Miss”, set 1 for an R3041-specific cache mode, where all loads result in data being fetched from memory (missing in the data cache), but the incoming data is still used to refill the cache. Stores continue to write the cache. This is useful when software desires to obtain the high bandwidth of the cache and cache refills, but the corresponding main memory is “volatile” (e.g. a FIFO, or updated by DMA).

BusCtrl Register (R3041 only)

The R3041 CPU has many hardware interface options not available on other members of the R30xx family, which are intended to allow the use of simpler and cheaper interface and memory components. The BusCtrl register does most of the configuration work.
It needs to be set strictly in accordance with the needs of the hardware implementation. Note also that its default settings (from reset) leave the interface compatible with other R30xx family members.
Figure 3.6, “Fields in the R3041 Bus Control (BusCtrl) Register” shows the layout of the fields, and their uses are provided for completeness.

Figure 3.6. Fields in the R3041 Bus Control (BusCtrl) Register (fields from bit 31 downward: Lock, 10, Mem, ED, IO, BE16, 1, BE, 11, BTA, DMA, TC, BR, 0x300)

• Lock: when software has initialized BusCtrl to its desired state it may write this bit to prevent its contents being changed again until the system is reset.
• 10 and other numbers : write exactly the specified bit pattern to this field (hex used for big ones, but others are given as binary). Improper values may cause test modes and other unexpected side effects.
• Mem : ‘‘MemStrobe* control’’. Set this field to xy binary, where x set means the strobe activates on reads, and y set makes it active on writes.
• ED: ‘‘ExtDataEn* control’’. Encoded as for ‘‘Mem’’. Note that the BR bit must be zero for this pin to function as an output.
• IO: ‘‘IOStrobe* control’’. Encoded as for ‘‘Mem’’. Note that the BR bit must be zero for this pin to function as an output.
• BE16: ‘‘BE16(1:0)* read control’’; 0 to make these pins active on write cycles only.
• BE: ‘‘BE(3:0)* read control’’; 0 to make these pins active on write cycles only.
• BTA: ‘‘Bus turn around time’’. Program with a binary number between 0 and 3, for 0-3 cycles of guaranteed delay between the end of a read cycle and the start of the address phase of the next cycle. This field enables the use of devices with slow tri-state time, and enables the system designer to save cost by omitting data transceivers.
• DMA: ‘‘DMA Protocol Control’’, enables ‘‘DMA pulse protocol’’.
When set, the CPU uses its DMA control pins to communicate its desire for the bus even while a DMA is in progress.
• TC: ‘‘TC* negation control’’. TC* is the output pin which is activated when the internal timer register Count reaches the value stored in Compare. Set TC zero to make the TC* pin just pulse for a couple of clock periods; leave TC as 1, and TC* will be asserted on a compare and remain asserted until software explicitly clears it (by re-writing Compare with any value).
If TC* is used to generate a timer interrupt, then use the default (TC == 1), so that the request stays asserted until software services it. The pulse is more useful when the output is being used by external logic (e.g. to signal a DRAM refresh).
• BR: ‘‘SBrCond(3:2) control’’. Set zero to recycle the SBrCond(3:2) pins as IOStrobe and ExtDataEn respectively.

PortSize Register (R3041 only)

The PortSize register is used to flag different parts of the program address space for accesses to 8-, 16- or 32-bit wide memory.
Settings of this register have to be made at a time and to values which will be mandated by the hardware design. See ‘‘IDT79R3041 Hardware User’s Manual’’ for details.

What registers are relevant when?

The various CP0 registers and their fields provide support at specific times during system operation.
• After hardware reset: software must initialize SR to get the CPU into the right state to bootstrap itself.
• Hardware configuration at start-up: an R3041, R3071, or R3081 requires initialization of Config, BusCtrl, and/or PortSize before very much will work. The system hardware implementation will dictate the proper configuration of these registers.
• After any exception: any MIPS exception (apart from one particular MMU event) invokes a single common ‘‘general exception handler’’ routine, at a fixed address.
On entry, no program registers are saved, only the return address in EPC. The MIPS hardware knows nothing about stacks.
In any case the exception routine cannot use the user-mode stack for any purpose; the exception might have been a TLB miss on stack memory.
Exception software will need to use at least one of k0 and k1 to point to some ‘‘safe’’ (exception-proof) memory space. Key information can be saved, using the other k0 or k1 register to stage data from control registers where necessary.
Consult the Cause register to find out what kind of exception it was and dispatch accordingly.
• Returning from exception: control must eventually be returned to the value stored in EPC on entry.
Whatever kind of exception it was, software will have to adjust SR back upon return from exception. The special instruction rfe does the job; but note that it does not transfer control. To make the jump back software must load the original EPC value back into a general-purpose register and use a jr operation.
• Interrupts: SR is used to adjust the interrupt masks, to determine which (if any) interrupts will be allowed ‘‘higher priority’’ than the current one. The hardware offers no interrupt prioritization, but the software can do whatever it likes.
• Instructions which always cause exceptions: are often used (for system calls, breakpoints, and to emulate some kinds of instruction). This sometimes requires partial decoding of the offending instruction, which can usually be found at the location EPC. But there is a complication; suppose that an exception occurs just after a branch but in time to prevent the branch delay slot instruction from running. Then EPC will point to the branch instruction (resuming execution starting at the delay slot would cause the branch to be ignored), and the BD bit will be set.
This Cause register bit flags this event; to find the instruction at which the exception occurred, add 4 to the EPC value when the BD bit is set.
• Cache management routines: SR contains bits defining special modes for cache management.
In particular they allow software to isolate the data cache, and to swap the roles of the instruction and data caches.

The subsequent chapters will describe appropriate treatment of these registers, and provide software examples of their use.

CHAPTER 2 MIPS-1 (R30xx) ARCHITECTURE

Conventional names and uses of general-purpose registers

Although the hardware makes few rules about the use of registers, their practical use is governed by a number of conventions. These conventions allow inter-changeability of tools, operating systems, and library modules. It is strongly recommended that these conventions be followed.

Reg No   Name    Used for
0        zero    Always returns 0
1        at      (assembler temporary) Reserved for use by assembler
2-3      v0-v1   Value (except FP) returned by subroutine
4-7      a0-a3   (arguments) First four parameters for a subroutine
8-15     t0-t7   (temporaries) Subroutines may use without saving
24-25    t8-t9   (temporaries) As for t0-t7
16-23    s0-s7   Subroutine ‘‘register variables’’; a subroutine which will write one of these must save the old value and restore it before it exits, so the calling routine sees their values preserved.
26-27    k0-k1   Reserved for use by interrupt/trap handler; may change under your feet
28       gp      Global pointer; some runtime systems maintain this to give easy access to (some) ‘‘static’’ or ‘‘extern’’ variables.
29       sp      Stack pointer
30       s8/fp   9th register variable. Subroutines which need one can use this as a ‘‘frame pointer’’.
31       ra      Return address for subroutine

Table 2.1. Conventional names of registers with usage mnemonics

With the conventional uses of the registers go a set of conventional names. Given the need to fit in with the conventions, use of the conventional names is pretty much mandatory. The common names are described in Table 2.1, “Conventional names of registers with usage mnemonics”.

Notes on conventional register names

• at : this register is reserved for use inside the synthetic instructions generated by the assembler.
If the programmer must use it explicitly the directive .set noat stops the assembler from using it, but then there are some things the assembler won’t be able to do.
• v0-v1 : used when returning non-floating-point values from a subroutine. To return anything bigger than 2×32 bits, memory must be used (described in a later chapter).
• a0-a3 : used to pass the first four non-FP parameters to a subroutine. That’s an occasionally-false oversimplification; the actual convention is fully described in a later chapter.
• t0-t9 : by convention, subroutines may use these registers without preserving them. This makes them easy to use as ‘‘temporaries’’ when evaluating expressions, but a caller must remember that they may be destroyed by a subroutine call.
• s0-s8 : by convention, subroutines must guarantee that the values of these registers on exit are the same as they were on entry, either by not using them, or by saving them on the stack and restoring them before exit.
This makes them eminently suitable for use as ‘‘register variables’’ or for storing any value which must be preserved over a subroutine call.
• k0-k1 : reserved for use by the trap/interrupt routines, which will not restore their original value; so they are of little use to anyone else.
• gp : (global pointer). If present, it will point to a load-time-determined location in the midst of your static data. This means that loads and stores to data lying within 32Kbytes either side of the gp value can be performed in a single instruction using gp as the base register.
Without the global pointer, loading data from a static memory area takes two instructions: one to load the most significant bits of the 32-bit constant address computed by the compiler and loader, and one to do the data load.
To use gp a compiler must know at compile time that a datum will end up linked within a 64Kbyte range of memory locations. In practice it can’t know, only guess.
The usual practice is to put ‘‘small’’ global data items in the area pointed to by gp, and to get the linker to complain if it still gets too big. The definition of what is “small” can typically be specified with a compiler switch (most compilers use “-G”). The most common default size is 8 bytes or less.
Not all compilation systems or OS loaders support gp.
• sp : (stack pointer). Since it takes explicit instructions to raise and lower the stack pointer, it is generally done only on subroutine entry and exit; and it is the responsibility of the subroutine being called to do this. sp is normally adjusted, on entry, to the lowest point that the stack will need to reach at any point in the subroutine. Now the compiler can access stack variables by a constant offset from sp. Stack usage conventions are explained in a later chapter.
• fp : (also known as s8). A subroutine will use a ‘‘frame pointer’’ to keep track of the stack if it wants to use operations which involve extending the stack by an amount which is determined at run-time. Some languages may do this explicitly; assembler programmers are always welcome to experiment; and (for many toolchains) C programs which use the ‘‘alloca’’ library routine will find themselves doing so.
In this case it is not possible to access stack variables from sp, so fp is initialized by the function prologue to a constant position relative to the function’s stack frame. Note that a ‘‘frame pointer’’ subroutine may call or be called by subroutines which do not use the frame pointer; so long as the functions it calls preserve the value of fp (as they should) this is OK.
• ra : (return address). On entry to any subroutine, ra holds the address to which control should be returned, so a subroutine typically ends with the instruction ‘‘jr ra’’. Subroutines which themselves call subroutines must first save ra, usually on the stack.
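The 64Kbyte window mentioned in the gp note above follows directly from the signed 16-bit offset of a load or store. A minimal sketch of that reachability test, assuming a hypothetical helper name gp_reachable and plain 32-bit addresses held in C variables:

```c
#include <stdint.h>

/* Returns 1 if addr can be written as offset(gp) with a signed
 * 16-bit offset, i.e. it lies in [gp - 32768, gp + 32767].
 * gp_reachable is an illustrative name, not a toolchain function. */
static int gp_reachable(uint32_t gp, uint32_t addr)
{
    /* Widen before subtracting so the difference cannot wrap. */
    int64_t diff = (int64_t)addr - (int64_t)gp;
    return diff >= -32768 && diff <= 32767;
}
```

A linker performs essentially this check when it decides whether a gp-relative reference can be resolved; data outside the window needs the two-instruction sequence instead.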
Integer multiply unit and registers

MIPS’ architects decided that integer multiplication was important enough to deserve a hard-wired instruction. This is not so common in RISCs, which might instead:
• implement a ‘‘multiply step’’ which fits in the standard integer execution pipeline, and require software routines for every multiplication (e.g. Sparc or AM29000); or
• perform integer multiplication in the floating point unit; a good solution, but one which compromises the optional nature of the MIPS floating point ‘‘co-processor’’.

The multiply unit consumes a small amount of die area, but dramatically improves performance (and cache performance) over “multiply step” operations. Its basic operation is to multiply two 32-bit values together to produce a 64-bit result, which is stored in two 32-bit registers (called ‘‘hi’’ and ‘‘lo’’) which are private to the multiply unit. Instructions mfhi, mflo are defined to copy the result out into general registers.
Unlike results for integer operations, the multiply result registers are interlocked. An attempt to read out the results before the multiplication is complete results in the CPU being stopped until the operation completes.
The integer multiply unit will also perform an integer division between values in two general-purpose registers; in this case the ‘‘lo’’ register stores the quotient, and the ‘‘hi’’ register the remainder.
In the R30xx family, multiply operations take 12 clocks and division takes 35. The assembler has a synthetic multiply operation which starts the multiply and then retrieves the result into an ordinary register. Note that MIPS Corp.’s assembler may even substitute a series of shifts and adds for multiplication by a constant, to improve execution speed.
Multiply/divide results are written into ‘‘hi’’ and ‘‘lo’’ as soon as they are available; the effect is not deferred until the writeback pipeline stage, as with writes to general purpose (GP) registers.
If a mfhi or mflo instruction is interrupted by some kind of exception before it reaches the writeback stage of the pipeline, it will be aborted with the intention of restarting it. However, a subsequent multiply instruction which has passed the ALU stage will continue (in parallel with exception processing) and would overwrite the ‘‘hi’’ and ‘‘lo’’ register values, so that the re-execution of the mfhi would get wrong (i.e. new) data. For this reason it is recommended that a multiply should not be started within two instructions of an mfhi/mflo. The assembler will avoid doing this where it can.
Integer multiply and divide operations never produce an exception, though divide by zero produces an undefined result. Compilers will often generate code to trap on errors, particularly on divide by zero. Frequently, this instruction sequence is placed after the divide is initiated, to allow it to execute concurrently with the divide (and avoid a performance loss).
Instructions mthi, mtlo are defined to set up the internal registers from general-purpose registers. They are essential to restore the values of ‘‘hi’’ and ‘‘lo’’ when returning from an exception, but probably not for anything else.

Instruction types

A full list of R30xx family integer instructions is presented in Appendix A. Floating point instructions are listed in Appendix B of this manual. Currently, floating point instructions are only available in the R3081, and are described in the R3081 User’s Manual.
The MIPS-1 ISA uses only three basic instruction encoding formats; this is one of the keys to the high frequencies attained by RISC architectures.
Instructions are mostly in numerical order; to simplify reading, the list is occasionally re-ordered for clarity.
Throughout this manual, the description of various instructions will also refer to various subfields of the instruction. In general, the following typical nomenclature is used:

op The basic op-code, which is 6 bits long.
Instructions which have large sub-fields (for example, large immediate values, such as required for the ‘‘long’’ j/jal instructions, or arithmetic with a 16-bit constant) have a unique ‘‘op’’ field. Other instructions are classified in groups sharing an ‘‘op’’ value, distinguished by other fields (‘‘op2’’ etc.).

rs, rs1, rs2 One or two fields identifying source registers.

rd The register to be changed by this instruction.

sa Shift-amount: how far to shift, used in shift-by-constant instructions.

op2 Sub-code field used for the 3-register arithmetic/logical group of instructions (op value of zero).

offset 16-bit signed word offset defining the destination of a ‘‘PC-relative’’ branch. The branch target will be the instruction ‘‘offset’’ words away from the ‘‘delay slot’’ instruction after the branch; so a branch-to-self has an offset of -1.

target 26-bit word address to be jumped to (it corresponds to a 28-bit byte address, which is always word-aligned). The long j instruction is rarely used, so this format is pretty much exclusively for function calls (jal).
The high-order 4 bits of the target address can’t be specified by this instruction, and are taken from the address of the jump instruction. This means that these instructions can reach anywhere in the 256Mbyte region around the instruction’s location. To jump further use a jr (jump register) instruction.

constant 16-bit integer constant for ‘‘immediate’’ arithmetic or logic operations.

mf Yet another extended opcode field, this time used by ‘‘co-processor’’ type instructions.

rg Field which may hold a source or destination register.

crg Field to hold the number of a CPU control register (different from the integer register file). Called ‘‘crs’’/‘‘crd’’ in contexts where it must be a source/destination respectively.

The instruction encodings have been chosen to facilitate the design of a high-frequency CPU. Specifically:
• The instruction encodings do reveal portions of the internal CPU design. Although there are variable encodings, those fields which are required very early in the pipeline are encoded in a very regular way:
• Source registers are always in the same place : so that the CPU can fetch two operands from the integer register file without any conditional decoding. Some instructions may not need both registers, but since the register file is designed to provide two source values on every clock nothing has been lost.
• 16-bit constant is always in the same place : permitting the appropriate instruction bits to be fed directly into the ALU’s input multiplexer, without conditional shifts.

Loading and storing: addressing modes

As mentioned above, there is only one basic ‘‘addressing mode’’. Any load or store machine instruction can be written as:

    operation dest-reg, offset(src-reg)

e.g.:

    lw $1, offset($2)
    sw $3, offset($4)

Any of the GP registers can be used for the destination and source. The offset is a signed, 16-bit number (so can be anywhere between -32768 and 32767); the program address used for the load is the sum of src-reg and the offset. This address mode is normally enough to pick out a particular member of a C structure (‘‘offset’’ being the distance between the start of the structure and the member required); it implements an array indexed by a constant; it is enough to reference function variables from the stack or frame pointer; and it provides a reasonable sized global area around the gp value for static and extern variables.
The assembler provides the semblance of a simple direct addressing mode, to load the values of memory variables whose address can be computed at link time.
More complex modes such as double-register or scaled index must be implemented with sequences of instructions.

Data types in Memory and registers

The R30xx family CPUs can load or store between 1 and 4 bytes in a single operation.
Naming conventions are used in the documentation and to build instruction mnemonics:

‘‘C’’ name   MIPS name   Size (bytes)   Assembler mnemonic
int          word        4              ‘‘w’’ as in lw
long         word        4              ‘‘w’’ as in lw
short        halfword    2              ‘‘h’’ as in lh
char         byte        1              ‘‘b’’ as in lb

Integer data types

Byte and halfword loads come in two flavors:
• Sign-extend : lb and lh load the value into the least significant bits of the 32-bit register, but fill the high order bits by copying the ‘‘sign bit’’ (bit 7 of a byte, bit 15 of a half-word). This correctly converts a signed value to a 32-bit signed integer.
• Zero-extend : instructions lbu and lhu load the value into the least significant bits of a 32-bit register, with the high order bits filled with zero. This correctly converts an unsigned value in memory to the corresponding 32-bit unsigned integer value; so byte value 254 becomes 32-bit value 254.

If the byte-wide memory location whose address is in t1 contains the value 0xFE (-2, or 254 if interpreted as unsigned), then:

    lb   t2, 0(t1)
    lbu  t3, 0(t1)

will leave t2 holding the value 0xFFFF FFFE (-2 as signed 32-bit) and t3 holding the value 0x0000 00FE (254 as signed or unsigned 32-bit).
Subtle differences in the way shorter integers are extended to longer ones are a historical cause of C portability problems, and the modern C standards have elaborate rules. On machines like the MIPS, which does not perform 8- or 16-bit precision arithmetic directly, expressions involving short or char variables are less efficient than word operations.

Unaligned loads and stores

Normal loads and stores in the MIPS architecture must be aligned; halfwords may be loaded only from 2-byte boundaries, and words only from 4-byte boundaries. A load instruction with an unaligned address will produce a trap. Because CISC architectures such as the MC680x0 and iAPXx86 do handle unaligned loads and stores, this could complicate porting software from one of these architectures.
The MIPS architecture does provide mechanisms to support this type of operation; in extremity, software can provide a trap handler which will emulate the desired load operation and hide this feature from the application.
All data items declared by C code will be correctly aligned.
But when it is known in advance that the program will transfer a word from an address whose alignment is unknown and will be computed at run time, the architecture does allow for a special 2-instruction sequence (much more efficient than a series of byte loads, shifts and assembly). This sequence is normally generated by the macro-instruction ulw (unaligned load word).
(A macro-instruction ulh, unaligned load half, is also provided, and is synthesized by two loads, a shift, and a bitwise ‘‘or’’ operation.)
The special machine instructions are lwl and lwr (load word left, load word right). ‘‘Left’’ and ‘‘right’’ are arithmetical directions, as in ‘‘shift left’’; ‘‘left’’ is movement towards more significant bits, ‘‘right’’ is towards less significant bits. These instructions do three things:
• load 1, 2, 3 or 4 bytes from within one aligned 4-byte (word) location;
• shift that data to move the byte selected by the address to either the most-significant (lwl) or least-significant (lwr) end of a 32-bit field;
• merge the bytes fetched from memory with the data already in the destination.

This breaks most of the rules the architecture usually sticks by; it does a logical operation on a memory variable, for example. Special hardware allows the lwl, lwr pair to be used in consecutive instructions, even though the second instruction uses the value generated by the first.
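The load, shift and merge steps listed above can be modelled in C. The sketch below is only a software model under stated assumptions: it assumes big-endian byte order, treats memory as a byte array indexed from zero, and uses hypothetical function names (model_lwl, model_lwr, model_ulw); it is not how the hardware is implemented.

```c
#include <stddef.h>
#include <stdint.h>

/* Fetch the aligned word containing addr, big-endian byte order. */
static uint32_t load_aligned_be(const uint8_t *mem, size_t addr)
{
    size_t base = addr & ~(size_t)3;
    return ((uint32_t)mem[base] << 24) | ((uint32_t)mem[base + 1] << 16) |
           ((uint32_t)mem[base + 2] << 8) | (uint32_t)mem[base + 3];
}

/* Model of lwl: bytes from addr to the end of its aligned word are
 * moved to the most-significant end; the low-order bytes of reg
 * are kept unchanged. */
static uint32_t model_lwl(uint32_t reg, const uint8_t *mem, size_t addr)
{
    unsigned shift = 8u * (unsigned)(addr & 3);
    uint32_t keep = (1u << shift) - 1u;      /* low bytes preserved */
    return (load_aligned_be(mem, addr) << shift) | (reg & keep);
}

/* Model of lwr: bytes from the start of the aligned word up to addr
 * are moved to the least-significant end; the high-order bytes of
 * reg are kept unchanged. */
static uint32_t model_lwr(uint32_t reg, const uint8_t *mem, size_t addr)
{
    unsigned shift = 8u * (3u - (unsigned)(addr & 3));
    uint32_t keep = ~(0xFFFFFFFFu >> shift); /* high bytes preserved */
    return (load_aligned_be(mem, addr) >> shift) | (reg & keep);
}

/* Model of the ulw macro: lwl at addr, then lwr at addr + 3. */
static uint32_t model_ulw(const uint8_t *mem, size_t addr)
{
    uint32_t reg = 0;
    reg = model_lwl(reg, mem, addr);
    reg = model_lwr(reg, mem, addr + 3);
    return reg;
}
```

When addr is already 4-byte aligned, both model_lwl and model_lwr fetch the same complete word, so the pair still produces the right value at the cost of one redundant load.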
For example, on a CPU configured as big-endian the assembler instruction:

    ulw  t1, 0(t2)
    add  t4, t3, t1

is implemented as:

    lwl  t1, 0(t2)
    lwr  t1, 3(t2)
    nop
    add  t4, t3, t1

Where:
• the lwl picks up the lowest-addressed byte of the unaligned 4-byte region, together with however many more bytes fit into an aligned word. It then shifts them left, to form the most-significant bytes of the register value.
• the lwr is aimed at the highest-addressed byte in the unaligned 4-byte region. It loads it, together with any bytes which precede it in the same memory word, and shifts it right to get the least significant bits of the register value. The merge leaves the high-order bits unchanged.
• Although special hardware ensures that a nop is not required between the lwl and lwr, there is still a load delay between the second of them and a normal instruction.

Note that if t2 was in fact 4-byte aligned, then both instructions load the entire word; duplicating effort, but achieving the desired effect.
CPU behavior when operating with little-endian byte order is described in a later chapter.

Floating point data in memory

Loads into floating point registers from 4-byte aligned memory move data without any interpretation; a program can load an invalid floating point number and no FP error will result until an arithmetic operation is requested with it as an operand.
This allows a programmer to load single-precision values by a load into an even-numbered floating point register; but the programmer can also load a double-precision value by a macro instruction, so that:

    ldc1 $f2, 24(t1)

is expanded to two loads to consecutive registers:

    lwc1 $f2, 24(t1)
    lwc1 $f3, 28(t1)

The C compiler aligns 8-byte long double-precision floating point variables to 8-byte boundaries.
R30xx family hardware does not require this alignment; but it is done to avoid compatibility problems with implementations of MIPS-2 or MIPS-3 CPUs such as the IDT R4600 (Orion), where the ldc1 instruction is part of the machine code, and the alignment is necessary.

BASIC ADDRESS SPACE

The way in which MIPS processors use and handle addresses is subtly different from that of traditional CISC CPUs, and may appear confusing. Read the first part of this section carefully. Here are some guidelines:
• The addresses put into programs are rarely the same as the physical addresses which come out of the chip (sometimes they’re close, but not the same). This manual will refer to them as program addresses and physical addresses respectively. A more common name for program addresses is “virtual addresses”; note that the use of the term “virtual address” does not necessarily imply that an operating system must perform virtual memory management (e.g. demand paging from disks...), but rather that the address undergoes some transformation before being presented to physical memory. Although virtual address is a proper term, this manual will typically use the term “program address” to avoid confusing virtual addresses with virtual memory management requirements.
• A MIPS-1 CPU has two operating modes: user and kernel. In user mode, any address above 2Gbytes (most-significant bit of the address set) is illegal and causes a trap. Also, some instructions cause a trap in user mode.
• The 32-bit program address space is divided into four big areas with traditional names; and different things happen according to the area an address lies in:

kuseg 0x0000 0000 – 0x7FFF FFFF (low 2Gbytes): these are the addresses permitted in user mode. In machines with an MMU (“E” versions of the R30xx family), they will always be translated (more about the R30xx MMU in a later chapter). Software should not attempt to use these addresses unless the MMU is set up.
For machines without an MMU (“base” versions of the R30xx family), the kuseg “program address” is transformed to a physical address by adding a 1GB offset; the address transformations for “base versions” of the R30xx family are described later in this chapter. Note, however, that many embedded applications do not use this address segment (those applications which do not require that the kernel and its resources be protected from user tasks).

kseg0 0x8000 0000 – 0x9FFF FFFF (512 Mbytes): these addresses are “translated” into physical addresses by merely stripping off the top bit, mapping them contiguously into the low 512 Mbytes of physical memory. This transformation operates the same for both “base” and “E” family members. This segment is referred to as “unmapped” because “E” version devices cannot redirect this translation to a different area of physical memory. Addresses in this region are always accessed through the cache, so may not be used until the caches are properly initialized. They will be used for most programs and data in systems using “base” family members; and will be used for the OS kernel for systems which do use the MMU (“E” version devices).

kseg1 0xA000 0000 – 0xBFFF FFFF (512 Mbytes): these addresses are mapped into physical addresses by stripping off the leading three bits, giving a duplicate mapping of the low 512 Mbytes of physical memory. However, kseg1 program address accesses will not use the cache. The kseg1 region is the only chunk of the memory map which is guaranteed to behave properly from system reset; that’s why the after-reset starting point (0xBFC0 0000, commonly called the “reset exception vector”) lies within it. The physical address of the starting point is 0x1FC0 0000, which means that the hardware should place the boot ROM at this physical address. Software will therefore use this region for the initial program ROM, and most systems also use it for I/O registers.
In general, IO devices should always be mapped to addresses that are accessible from kseg1, and system ROM is always mapped to contain the reset exception vector. Note that code in the ROM can then be accessed uncacheably (during boot up) using kseg1 program addresses, and also can be accessed cacheably (for normal operation) using kseg0 program addresses.

kseg2 0xC000 0000 – 0xFFFF FFFF (1 Gbyte): this area is only accessible in kernel mode. As for kuseg, in “E” devices program addresses are translated by the MMU into physical addresses; thus, these addresses must not be referenced prior to MMU initialization. For “base versions”, physical addresses are generated to be the same as program addresses for kseg2. Note that many systems will not need this region. In “E” versions, it frequently contains OS structures such as page tables; simpler OSes probably will have little need for kseg2.

SUMMARY OF SYSTEM ADDRESSING

MIPS program addresses are rarely simply the same as physical addresses, but simple embedded software will probably use addresses in kseg0 and kseg1, where the program address is related in an obvious and unchangeable way to physical addresses.

Physical memory locations from 0x2000 0000 (512Mbyte) upward may be difficult to access. In “E” versions of the R30xx family, the only way to reach these addresses is through the MMU. In “base” family members, certain of these physical addresses can be reached using kseg2 or kuseg addresses: the address transformations for base R30xx family members are described later in this chapter.

Kernel vs. user mode

In kernel mode (the CPU resets into this state), all program addresses are accessible. In user mode:
• Program addresses above 2Gbytes (top bit set) are illegal and will cause a trap. Note that if the CPU has an MMU, this means all valid user mode addresses must be translated by the MMU; thus, User mode for “E” devices typically requires the use of a memory-mapped OS.
For “base” CPUs, kuseg addresses are mapped to a distinct area of physical memory. Thus, kernel memory resources (including IO devices) can be made inaccessible to User mode software, without requiring a memory-mapping function from the OS. Alternatively, the hardware can choose to “ignore” high-order address bits when performing address decoding, thus “condensing” kuseg, kseg2, kseg1, and kseg0 into the same physical memory.
• Instructions beyond the standard user set become illegal. Specifically, the kernel can prevent User mode software from accessing the on-chip CP0 (system control coprocessor, which controls exception and machine state and performs the memory management functions of the CPU).

Thus, the primary differences between User and Kernel modes are:
• User mode tasks can be inhibited from accessing kernel memory resources, including OS data structures and IO devices. This also means that various user tasks can be protected from each other.
• User mode tasks can be inhibited from modifying the basic machine state, by prohibiting accesses to CP0.

Note that the kernel/user mode bit does not change the interpretation of anything; some things simply cease to be allowed in user mode. In kernel mode the CPU can access low addresses just as if it was in user mode, and they will be translated in the same way.

Memory map for CPUs without MMU hardware

The treatment of kseg0 and kseg1 addresses is the same for all IDT R30xx CPUs. If the system can be implemented using only physical addresses in the low 512Mbytes, and system software can be written to use only kseg0 and kseg1, then the choice of “base” vs. “E” versions of the R30xx family is not relevant.

For versions without the MMU (“base versions”), addresses in kuseg and kseg2 will undergo a fixed address translation, and provide the system designer the option to provide additional memory.
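The segment boundaries and the fixed kseg0/kseg1 mappings described earlier in this chapter can be sketched in C. This is only an illustration: the names seg_of, kseg01_to_phys, phys_to_kseg0 and phys_to_kseg1 are hypothetical helpers invented here and are not part of any IDT toolchain.

```c
#include <assert.h>
#include <stdint.h>

/* The four program-address regions named in the text. */
enum mips_seg { SEG_KUSEG, SEG_KSEG0, SEG_KSEG1, SEG_KSEG2 };

/* Classify a 32-bit program address by its region. */
enum mips_seg seg_of(uint32_t va)
{
    if (va < 0x80000000u) return SEG_KUSEG;   /* low 2 Gbytes            */
    if (va < 0xA0000000u) return SEG_KSEG0;   /* 512 Mbytes, cached      */
    if (va < 0xC0000000u) return SEG_KSEG1;   /* 512 Mbytes, uncached    */
    return SEG_KSEG2;                         /* top 1 Gbyte             */
}

/* kseg0 and kseg1 both map onto the low 512 Mbytes of physical memory. */
uint32_t kseg01_to_phys(uint32_t va)
{
    enum mips_seg s = seg_of(va);
    assert(s == SEG_KSEG0 || s == SEG_KSEG1);
    return va & 0x1FFFFFFFu;                  /* drop the top three bits */
}

/* Cached (kseg0) view of a physical address in the low 512 Mbytes. */
uint32_t phys_to_kseg0(uint32_t pa)
{
    assert(pa < 0x20000000u);
    return pa | 0x80000000u;
}

/* Uncached (kseg1) view of the same physical address. */
uint32_t phys_to_kseg1(uint32_t pa)
{
    assert(pa < 0x20000000u);
    return pa | 0xA0000000u;
}
```

For instance, the boot ROM at physical address 0x1FC0 0000 is seen uncached at phys_to_kseg1(0x1FC00000) and cached at phys_to_kseg0(0x1FC00000), which is how the same ROM code can be run either way.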
The base members of the R30xx family provide the following address translations for kuseg and kseg2 program addresses:
• kuseg: this region (the low 2Gbytes of program addresses) is translated to a contiguous 2Gbyte physical region between 1 and 3Gbytes. In effect, a 1GB offset is added to each kuseg program address. In hex:

    Program address                 Physical address
    0x0000 0000 – 0x7FFF FFFF   →   0x4000 0000 – 0xBFFF FFFF

• kseg2: these program addresses are genuinely untranslated. So program addresses from 0xC000 0000 – 0xFFFF FFFF emerge as identical physical addresses.

This means that “base” versions can generate most physical addresses (without the use of an MMU), except for a gap between 512Mbyte and 1Gbyte (0x2000 0000 through 0x3FFF FFFF). As noted above, many systems may ignore high-order address bits when performing address decoding, thus condensing all physical memory into the lowest 512MB addresses.

Subsegments in the R3041 – memory width configuration

The R3041 CPU can be configured to access different regions of memory as either 32-, 16- or 8-bits wide. Where the program requests a 32-bit operation to a narrow memory (either with an uncached access, or a cache miss, or a store), the CPU may break a transaction into multiple data phases, to match the datum size to the memory port width.

The width configuration is applied independently to subsegments of the normal kseg regions, as follows:
• kseg0 and kseg1: as usual, these are both mapped onto the low 512Mbytes. This common region is split into 8 subsegments (64Mbytes each), each of which can be programmed as 8-, 16- or 32-bits wide. The width assignment affects both kseg0 and kseg1 accesses (that is, one can view these as subsegments of the corresponding “physical” addresses).
• kuseg: is divided into four 512Mbyte subsegments, each independently programmable for width. Thus, kuseg can be broken into multiple portions, which may have varying widths.
An example of this may be a 32-bit main memory with some 16-bit PCMCIA font cards and an 8-bit NVRAM.
• kseg2: is divided into two 512Mbyte subsegments, independently programmable for width. Again, this means that kseg2 can support multiple memory subsystems, of varying port width.

Note that once the various memory port widths have been configured (typically at boot time), software does not have to be aware of the actual width of any memory system. It can choose to treat all memory as 32-bit wide, and the CPU will automatically adjust when an access is made to a narrower memory region. This simplifies software development, and also facilitates porting to various system implementations (which may or may not choose the same memory port widths).

EXCEPTION MANAGEMENT CHAPTER 4

This chapter describes the software techniques used to recognize and decode exceptions, save state, dispatch exception service routines, and return from exception. Various code examples are provided.

EXCEPTIONS

In the MIPS architecture interrupts, traps, system calls and everything else which disrupts the normal flow of execution are called “exceptions” and handled by a single mechanism. These kinds of events include:
• External events: interrupts, or a bus error on a read. Note that for the R30xx floating point exceptions are reported as interrupts, since when the R3000A was originally implemented the FPA was indeed external. Interrupts are the only exception conditions which can be disabled under software control.
• Program errors and unusual conditions: non-existent instructions (including “co-processor” instructions executed with the corresponding coprocessor enable bit in SR disabled), integer overflow, address alignment errors, accesses outside kuseg in user mode.
• Memory translation exceptions: using an invalid translation, or a write to a write-protected page; and access to a page for which there is no translation in the TLB.
• System calls and traps: exceptions deliberately generated by software to access kernel facilities in a secure way (syscalls, conditional traps planted by careful code, and breakpoints).

Some things do not cause exceptions, although other CPU architectures may handle them that way. Software must use other mechanisms to detect:
• bus errors on write cycles (R30xx CPUs don’t detect these as exceptions at all; the use of a write buffer would make such an exception “imprecise”, in that the instruction which generated the store data is not guaranteed to be the one which recognizes the exception).
• parity errors detected in the cache (the PE bit in SR is set, but no exception is signalled).

Precise exceptions

The MIPS architecture implements precise exceptions. This is quite a useful feature, as it provides:
• Unambiguous proof of cause: after an exception caused by any internal error, the EPC points to the instruction which caused the error (it might point to the preceding branch for an instruction which is in a branch delay slot, but will signal occurrence of this using the BD bit).
• Exceptions are seen in instruction sequence: exceptions can arise at several different stages of execution, creating a potential hazard. For example, if a load instruction suffers a TLB miss the exception won’t be signalled until the “MEM” pipestage; if the next instruction suffers an instruction TLB miss (at the “IF” pipestage) the logically second exception will be signalled first (since the IF occurs earlier in the pipe than MEM).
To avoid this problem, early-detected exceptions are not activated until it is known that all previous instructions will complete successfully; in this case, the instruction TLB miss is suppressed and the exception caused by the earlier instruction handled. The second exception will likely happen again upon return from handling the data fault.
• Subsequent instructions nullified: because of the pipelining, instructions lying in sequence after the EPC may well have been started. But the architecture guarantees that no effects produced by these instructions will be visible in the registers or CPU state; and no effect at all will occur which will prevent execution being restarted at the EPC.
Note that this isn’t quite true of, for example, the result registers in the integer multiply unit (logically, the architecture considers these changed by the initiation of a multiply or divide). But provided that the instruction arrangement rules required by the assembler are followed, no problems will arise.

The implementation of precise exceptions requires a number of clever techniques. For example, the FPA cannot update the register file until it knows that the operation will not generate an exception. However, the R30xx family contains logic to allow multi-cycle FPA operations to occur concurrently with integer operations, yet maintain precise exceptions.

When exceptions happen

Since exceptions are precise, the architecture determines that an exception seems to have happened just before the execution of the instruction which caused it. The first fetch from the exception routine will be made within 1 clock of the time when the faulting instruction would have finished; in practice it is often faster.

On an interrupt, the last instruction to be completed before interrupt processing starts will be the one which has just finished its MEM stage when the interrupt is detected. The EPC target will be the one which has just finished its ALU stage.

However, take care; some of the interrupt inputs to R30xx family CPUs are resynchronised internally (to support interrupt signalling from asynchronous sources) and the interrupt will be detected only on the rising edge of the second clock after the interrupt becomes active.
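The EPC and BD rule described under “Precise exceptions” can be sketched in C as follows. This is an illustrative fragment only: the function name faulting_insn_addr and the macro CAUSE_BD are hypothetical names chosen here, and the sketch assumes BD is the most-significant bit of the Cause register. Note that execution is always restarted at EPC itself; the adjustment below is needed only when software wants to examine the instruction which actually took the exception.

```c
#include <stdint.h>

/* Assumed position of the BD flag: Cause register bit 31. */
#define CAUSE_BD 0x80000000u

/*
 * Return the address of the instruction that took the exception.
 * When BD is set, EPC points at the preceding branch and the
 * offending instruction sits in its delay slot, 4 bytes further on.
 */
uint32_t faulting_insn_addr(uint32_t epc, uint32_t cause)
{
    return (cause & CAUSE_BD) ? epc + 4u : epc;
}
```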
Exception vectors

Unlike most CISC processors, the MIPS CPU does no part of the job of dispatching exceptions to specialist routines to deal with individual conditions. The rationale for this is twofold:
• on CISC CPUs this feature is not so useful in practice as one might hope. For example, most interrupts are likely to share code for saving registers and it is common for CISC microcode to spend time dispatching to different interrupt entry points, where system software loads a code number and jumps back to a common handler.
• on a RISC CPU ordinary code is fast enough to be used in preference to microcode.

Only one exception is handled differently; a TLB miss on an address in kuseg. Although the architecture uses software to handle this condition (which occurs very frequently in a heavily-used multi-tasking, virtual memory OS), there is significant architectural support for a “preferred” scheme for TLB refill. The preferred refill scheme can be completed in about 13 clocks.

It is also useful to have two alternate pairs of entry points. It is essential for high performance to locate the vectors in cached memory for OS use, but this is highly undesirable at start-up; the need for a robust and self-diagnosing start-up sequence mandates the use of uncached read-only memory for vectors.

So the exception system adds four more “magic” addresses to the one used for system start-up. The reset mechanism on the MIPS CPU is remarkably like the exception mechanism, and is sometimes referred to as the reset exception. The complete list of exception vector addresses is shown in Table 4.1, “Reset and exception entry points (vectors) for R30xx family”:

    Program address   “segment”   Physical address   Description
    0x8000 0000       kseg0       0x0000 0000        TLB miss on kuseg reference only.
    0x8000 0080       kseg0       0x0000 0080        All other exceptions.
    0xbfc0 0100       kseg1       0x1fc0 0100        Uncached alternative kuseg TLB miss entry point (used if SR bit BEV set).
    0xbfc0 0180       kseg1       0x1fc0 0180        Uncached alternative for all other exceptions (used if SR bit BEV set).
    0xbfc0 0000       kseg1       0x1fc0 0000        The “reset exception”.

Table 4.1. Reset and exception entry points (vectors) for R30xx family

The 128 byte (0x80) gap between the two exception vectors is because the MIPS architects felt that 32 instructions would be enough to code the user-space TLB miss routine, saving a branch instruction without wasting too much memory.

So on an exception, the CPU:
1) sets up EPC to point to the restart location.
2) saves the pre-existing user-mode and interrupt-enable flags in SR by pushing the 3-entry stack inside SR, and changes to kernel mode with interrupts disabled.
3) sets up Cause so that software can see the reason for the exception. On address exceptions BadVaddr is also set. Memory management system exceptions set up some of the MMU registers too; see the chapter on memory management for more detail.
4) transfers control to the exception entry point.

Exception handling – basics

Any MIPS exception handler has to go through the same stages:
• Bootstrapping: on entry to the exception handler very little of the state of the interrupted program has been saved, so the first job is to provide room to preserve relevant state information. Almost inevitably, this is done by using the k0 and k1 registers (which are reserved for “kernel mode” use, and therefore should contain no application program state), to reference a piece of memory which can be used for other register saves.
• Dispatching different exceptions: consult the Cause register. The initial decision is likely to be made on the “ExcCode” field, which is thoughtfully aligned so that its code value (between 0 and 31) can be used to index an array of words without a shift.
The code will be something like this:

    mfc0  t1, C0_CAUSE
    and   t2, t1, 0x3f
    lw    t2, tablebase(t2)
    jr    t2

• Constructing the exception processing environment: complex exception handling routines may be written in a high level language; in addition, software may wish to be able to use standard library routines. To do this, software will have to switch to a suitable stack, and save the values of all registers which “called subroutines” may use.
• Processing the exception: this is system and cause dependent.
• Returning from an exception: the return address is contained in the EPC register on exception entry; the value must be placed into a general purpose register for return from exception (note that the EPC value may have been placed on the stack at exception entry). Returning control is now done with a jr instruction, and the change of state back from kernel to the previous mode is done by an rfe instruction after the jr, in the delay slot.

Nesting exceptions

In many cases the system may wish to permit (or will not be able to avoid) further exceptions occurring within the exception processing routine: nested exceptions.

If improperly handled, this could cause chaos; vital state for the interrupted program is held in EPC and SR, and another exception would overwrite them. To permit nested exceptions, these values must be saved elsewhere. Moreover, once exceptions are re-enabled, software can no longer rely on the values of k0 and k1, since a subsequent (nested) exception may alter their values.

The normal approach to this is to define an exception frame; a memory-resident data structure with fields to store incoming register values, so that they can be retrieved on return. Exception frames are usually arranged logically as a stack.

Stack resources are consumed by each exception, so arbitrarily nested exceptions cannot be tolerated.
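As an illustration of the exception frame idea, the following C fragment sketches a small, fixed pool of frames used as a stack. Everything here is an assumption made for the example: the structure layout, the names xcp_frame, xcp_push and xcp_pop, and the depth limit XCP_LEVELS are hypothetical and are not taken from IDT/sim. In a real handler the frame would be filled in by assembler code at exception entry, before exceptions are re-enabled.

```c
#include <stddef.h>
#include <stdint.h>

#define XCP_LEVELS 4            /* hypothetical maximum nesting depth */

/* One saved context; the field choice is illustrative only. */
struct xcp_frame {
    uint32_t regs[32];          /* general purpose registers */
    uint32_t epc;               /* restart address           */
    uint32_t sr;                /* status register           */
    uint32_t cause;
    uint32_t badvaddr;
    uint32_t hi, lo;            /* multiply unit results     */
};

static struct xcp_frame frames[XCP_LEVELS];
static unsigned depth;          /* number of frames in use   */

/* Claim the next frame; NULL means nesting is too deep. */
struct xcp_frame *xcp_push(void)
{
    return depth < XCP_LEVELS ? &frames[depth++] : NULL;
}

/* Release the most recently claimed frame; NULL if none is in use. */
struct xcp_frame *xcp_pop(void)
{
    return depth > 0 ? &frames[--depth] : NULL;
}
```

A fixed pool makes the resource limit explicit: when xcp_push returns NULL the system has nested more deeply than it was designed for, and must treat that as a fatal condition rather than overwrite a live frame.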
Most systems sort exceptions into a priority order, and arrange that while an exception is being processed only higher-priority exceptions are permitted. Such systems need have only as many exception frames as there are priority levels.

Software can inhibit certain exceptions, as follows:
• Interrupts: can be individually masked by software to conform to system priority rules;
• Privilege Violations: can’t happen in kernel mode; virtually all exception service routines will execute in kernel mode;
• Addressing errors and TLB misses: software must be written to ensure that these never happen when processing higher priority exceptions.

Typical system priorities are (lowest first): non-exception code, TLB miss on kuseg address, TLB miss on kseg2 address, interrupt (lowest)... interrupt (highest), illegal instructions and traps, bus errors.

An exception routine

The following is an exception routine from IDT/sim. It receives exceptions, saves all state, and calls the appropriate service routine. It also shows the code used to install the exception handler in memory.

/*
** exception.s - contains functions for setting up and handling exceptions
**
** Copyright 1989 Integrated Device Technology, Inc.
** All Rights Reserved
**
*/
#include "iregdef.h"
#include "idtcpu.h"
#include "idtmon.h"
#include "setjmp.h"
#include "excepthdr.h"

/*
** move_exc_code() - moves the exception code to the utlb and gen
** exception vectors
*/
FRAME(move_exc_code,sp,0,ra)
    .set  noreorder
    la    t1,exc_utlb_code
    la    t2,exc_norm_code
    li    t3,UT_VEC
    li    t4,E_VEC
    li    t5,VEC_CODE_LENGTH
1:
    lw    t6,0(t1)
    lw    t7,0(t2)
    sw    t6,0(t3)
    sw    t7,0(t4)
    addiu t1,4
    addiu t3,4
    addiu t4,4
    subu  t5,4
    bne   t5,zero,1b
    addiu t2,4
    move  t5,ra               # assumes clear_cache doesn't use t5
    li    a0,UT_VEC
    jal   clear_cache
    li    a1,VEC_CODE_LENGTH
    nop
    li    a0,E_VEC
    jal   clear_cache
    li    a1,VEC_CODE_LENGTH
    move  ra,t5               # restore ra
    j     ra
    nop
    .set  reorder
ENDFRAME(move_exc_code)

/*
** enable_int(mask) - enables interrupts - mask is positioned so it only
** needs to be or'ed into the status reg. This
** also does some other things !!!! caution should
** be used if invoking this while in the middle
** of a debugging session where the client may have
** nested interrupts.
**
*/
FRAME(enable_int,sp,0,ra)
    .set  noreorder
    la    t0,client_regs
    lw    t1,R_SR*4(t0)
    nop
    or    t1,0x4
    or    t1,a0
    sw    t1,R_SR*4(t0)
    mfc0  t0,C0_SR
    or    a0,1
    or    t0,a0
    mtc0  t0,C0_SR
    j     ra
    nop
    .set  reorder
ENDFRAME(enable_int)

/*
** disable_int(mask) - disable the interrupt - mask is the complement
** of the bits to be cleared - i.e. to clear ext int
** 5 the mask would be - 0xffff7fff
*/
FRAME(disable_int,sp,0,ra)
    .set  noreorder
    la    t0,client_regs
    lw    t1,R_SR*4(t0)
    nop
    and   t1,a0
    sw    t1,R_SR*4(t0)
    mfc0  t0,C0_SR
    nop
    and   t0,a0
    mtc0  t0,C0_SR
    j     ra
    nop
    .set  reorder
ENDFRAME(disable_int)

/*
** the following sections of code are copied to the vector area
** at location 0x80000000 (utlb miss) and location 0x80000080
** (general exception).
**
*/
    .set  noreorder
    .set  noat                # must be set so la does not use at
FRAME(exc_norm_code,sp,0,ra)
    la    k0,except_regs
    sw    AT,R_AT*4(k0)
    sw    gp,R_GP*4(k0)
    sw    v0,R_V0*4(k0)
    li    v0,NORM_EXCEPT
    la    AT,exception
    j     AT
    nop
ENDFRAME(exc_norm_code)

FRAME(exc_utlb_code,sp,0,ra)
    la    k0,except_regs
    sw    AT,R_AT*4(k0)
    sw    gp,R_GP*4(k0)
    sw    v0,R_V0*4(k0)
    li    v0,UTLB_EXCEPT
    la    AT,exception
    j     AT
    nop
    .set  reorder

/*
** common exception handling code
** Save various registers so we can print informative messages
** for faults (whether in monitor or client mode)
** Reg.(k0) points to the exception register save area.
** If we are in client mode then some of these values will
** have to be copied to the client register save area.
*/
    .set  noreorder
exception:
    sw    v0,R_EXCTYPE*4(k0)  # save exception type (gen or utlb)
    sw    v1,R_V1*4(k0)
    mfc0  v0,C0_EPC
    mfc0  v1,C0_SR
    sw    v0,R_EPC*4(k0)      # save the pc at the time of the exception
    sw    v1,R_SR*4(k0)
    .set  noat
    la    AT,client_regs      # get address of client reg save area
    mfc0  v0,C0_BADVADDR
    mfc0  v1,C0_CAUSE
    sw    v0,R_BADVADDR*4(k0)
    sw    v0,R_BADVADDR*4(AT)
    sw    v1,R_CAUSE*4(k0)
    sw    v1,R_CAUSE*4(AT)
    sw    sp,R_SP*4(k0)
    sw    sp,R_SP*4(AT)
    lw    v0,user_int_fast    # see if a client wants a shot at it
    sw    a0,R_A0*4(k0)
    sw    a0,R_A0*4(AT)
    sw    ra,R_RA*4(k0)
    sw    ra,R_RA*4(AT)
    lw    sp,fault_stack      # use "fault" stack
    beq   v0,zero,1f          # skip the following if no client
    nop
    move  a0,AT
    jal   v0
    nop
    la    k0,except_regs
    la    AT,client_regs
    beq   v0,zero,1f          # returns false if user did not handle
    nop
    la    v1,except_regs
    lw    ra,R_RA*4(v1)
    lw    AT,R_AT*4(v1)
    lw    gp,R_GP*4(v1)
    lw    v0,R_V0*4(v1)
    lw    sp,R_SP*4(v1)
    lw    a0,R_A0*4(v1)
    lw    k0,R_EPC*4(v1)
    lw    v1,R_V1*4(v1)
    j     k0
    rfe

/*
** Save registers if in client mode
** then change mode to prom mode currently k0 is pointing
** exception reg. save area - v0, v1, AT, gp, sp regs were saved
** epc, sr, badvaddr and cause were also saved.
*/
1:
    lw    v0,R_MODE*4(AT)     # get the current op. mode
    lw    v1,R_EXCTYPE*4(k0)
    sw    v0,R_MODE*4(k0)     # save the current prom mode
    sw    v1,R_EXCTYPE*4(AT)
    li    v1,MODE_MONITOR     # see if it
    beq   v0,v1,nosave        # was in prom mode
    nop
    li    v0,MODE_MONITOR
    sw    v0,R_MODE*4(AT)     # now in prom mode
    lw    v0,R_GP*4(k0)
    lw    v1,R_EPC*4(k0)
    sw    v0,R_GP*4(AT)
    sw    v1,R_EPC*4(AT)
    lw    v0,R_SR*4(k0)
    lw    v1,R_AT*4(k0)
    sw    v0,R_SR*4(AT)
    sw    v1,R_AT*4(AT)
    lw    v0,R_V0*4(k0)
    lw    v1,R_V1*4(k0)
    sw    v0,R_V0*4(AT)
    sw    v1,R_V1*4(AT)
    sw    a1,R_A1*4(AT)
    sw    a2,R_A2*4(AT)
    sw    a3,R_A3*4(AT)
    sw    t0,R_T0*4(AT)
    sw    t1,R_T1*4(AT)
    sw    t2,R_T2*4(AT)
    sw    t3,R_T3*4(AT)
    sw    t4,R_T4*4(AT)
    sw    t5,R_T5*4(AT)
    sw    t6,R_T6*4(AT)
    sw    t7,R_T7*4(AT)
    sw    s0,R_S0*4(AT)
    sw    s1,R_S1*4(AT)
    sw    s2,R_S2*4(AT)
    sw    s3,R_S3*4(AT)
    sw    s4,R_S4*4(AT)
    sw    s5,R_S5*4(AT)
    sw    s6,R_S6*4(AT)
    sw    s7,R_S7*4(AT)
    sw    t8,R_T8*4(AT)
    li    v0,0xbababadd       # This reg (k0) is invalid
    sw    t9,R_T9*4(AT)
    sw    v0,R_K0*4(AT)       # should be obvious
    sw    k1,R_K1*4(AT)
    sw    fp,R_FP*4(AT)
    lw    v0,status_base
    move  v1,AT
    and   v0,SR_CU1
    beq   v0,zero,1f          # only save fpu regs if present
    move  AT,v1
    lw    v1,R_SR*4(AT)
    and   v0,v1
    mtc0  v0,C0_SR
    nop
    cfc1  v0,$30
    cfc1  v1,$31
    sw    v0,R_FEIR*4(AT)
    sw    v1,R_FCSR*4(AT)
    swc1  fp0,R_F0*4(AT)
    swc1  fp1,R_F1*4(AT)
    swc1  fp2,R_F2*4(AT)
    swc1  fp3,R_F3*4(AT)
    swc1  fp4,R_F4*4(AT)
    swc1  fp5,R_F5*4(AT)
    swc1  fp6,R_F6*4(AT)
    swc1  fp7,R_F7*4(AT)
    swc1  fp8,R_F8*4(AT)
    swc1  fp9,R_F9*4(AT)
    swc1  fp10,R_F10*4(AT)
    swc1  fp11,R_F11*4(AT)
    swc1  fp12,R_F12*4(AT)
    swc1  fp13,R_F13*4(AT)
    swc1  fp14,R_F14*4(AT)
    swc1  fp15,R_F15*4(AT)
    swc1  fp16,R_F16*4(AT)
    swc1  fp17,R_F17*4(AT)
    swc1  fp18,R_F18*4(AT)
    swc1  fp19,R_F19*4(AT)
    swc1  fp20,R_F20*4(AT)
    swc1  fp21,R_F21*4(AT)
    swc1  fp22,R_F22*4(AT)
    swc1  fp23,R_F23*4(AT)
    swc1  fp24,R_F24*4(AT)
    swc1  fp25,R_F25*4(AT)
    swc1  fp26,R_F26*4(AT)
    swc1  fp27,R_F27*4(AT)
    swc1  fp28,R_F28*4(AT)
    swc1  fp29,R_F29*4(AT)
    swc1  fp30,R_F30*4(AT)
    swc1  fp31,R_F31*4(AT)
    mflo  v0
    mfhi  v1
    sw    v0,R_MDLO*4(AT)
    sw    v1,R_MDHI*4(AT)
    mfc0  v0,C0_INX
    mfc0  v1,C0_RAND
    sw    v0,R_INX*4(AT)
    sw    v1,R_RAND*4(AT)
    mfc0  v0,C0_TLBLO
    mfc0  v1,C0_TLBHI
    sw    v0,R_TLBLO*4(AT)
    mfc0  v0,C0_CTXT
    sw    v1,R_TLBHI*4(AT)
    sw    v0,R_CTXT*4(AT)
    .set  at
1:
nosave:
    .set  reorder
    j     exception_handler
ENDFRAME(exc_utlb_code)

/*
** resume -- resume execution of client code
*/
FRAME(resume,sp,0,ra)
    jal   install_sticky
    jal   clr_extern_brk
    jal   clear_remote_int
    .set  noat
    .set  noreorder
    la    AT,client_regs
    lw    v0,status_base
    move  v1,AT
    and   v0,SR_CU1
    beq   v0,zero,1f          # only restore fpu regs if present
    move  AT,v1
    lw    v1,R_SR*4(AT)
    nop
    or    v0,v1
    mtc0  v0,C0_SR
    lw    v1,R_FCSR*4(AT)
    lwc1  fp0,R_F0*4(AT)
    ctc1  v1,$31
    lwc1  fp1,R_F1*4(AT)
    lwc1  fp2,R_F2*4(AT)
    lwc1  fp3,R_F3*4(AT)
    lwc1  fp4,R_F4*4(AT)
    lwc1  fp5,R_F5*4(AT)
    lwc1  fp6,R_F6*4(AT)
    lwc1  fp7,R_F7*4(AT)
    lwc1  fp8,R_F8*4(AT)
    lwc1  fp9,R_F9*4(AT)
    lwc1  fp10,R_F10*4(AT)
    lwc1  fp11,R_F11*4(AT)
    lwc1  fp12,R_F12*4(AT)
    lwc1  fp13,R_F13*4(AT)
    lwc1  fp14,R_F14*4(AT)
    lwc1  fp15,R_F15*4(AT)
    lwc1  fp16,R_F16*4(AT)
    lwc1  fp17,R_F17*4(AT)
    lwc1  fp18,R_F18*4(AT)
    lwc1  fp19,R_F19*4(AT)
    lwc1  fp20,R_F20*4(AT)
    lwc1  fp21,R_F21*4(AT)
    lwc1  fp22,R_F22*4(AT)
    lwc1  fp23,R_F23*4(AT)
    lwc1  fp24,R_F24*4(AT)
    lwc1  fp25,R_F25*4(AT)
    lwc1  fp26,R_F26*4(AT)
    lwc1  fp27,R_F27*4(AT)
    lwc1  fp28,R_F28*4(AT)
    lwc1  fp29,R_F29*4(AT)
    lwc1  fp30,R_F30*4(AT)
    lwc1  fp31,R_F31*4(AT)
1:
    lw    a0,R_A0*4(AT)
    lw    a1,R_A1*4(AT)
    lw    a2,R_A2*4(AT)
    lw    a3,R_A3*4(AT)
    lw    t0,R_T0*4(AT)
    lw    t1,R_T1*4(AT)
    lw    t2,R_T2*4(AT)
    lw    t3,R_T3*4(AT)
    lw    t4,R_T4*4(AT)
    lw    t5,R_T5*4(AT)
    lw    t6,R_T6*4(AT)
    lw    t7,R_T7*4(AT)
    lw    s0,R_S0*4(AT)
    lw    s1,R_S1*4(AT)
    lw    s2,R_S2*4(AT)
    lw    s3,R_S3*4(AT)
    lw    s4,R_S4*4(AT)
    lw    s5,R_S5*4(AT)
    lw    s6,R_S6*4(AT)
    lw    s7,R_S7*4(AT)
    lw    t8,R_T8*4(AT)
    lw    t9,R_T9*4(AT)
    lw    k1,R_K1*4(AT)
    lw    gp,R_GP*4(AT)
    lw    fp,R_FP*4(AT)
    lw    ra,R_RA*4(AT)
    lw    v0,R_MDLO*4(AT)
    lw    v1,R_MDHI*4(AT)
    mtlo  v0
    mthi  v1
    lw    v0,R_INX*4(AT)
    lw    v1,R_TLBLO*4(AT)
    mtc0  v0,C0_INX
    mtc0  v1,C0_TLBLO
    lw    v0,R_TLBHI*4(AT)
    lw    v1,R_CTXT*4(AT)
    mtc0  v0,C0_TLBHI
    mtc0  v1,C0_CTXT
    lw    v0,R_CAUSE*4(AT)
    lw    v1,R_SR*4(AT)
    mtc0  v0,C0_CAUSE         /* only sw0 and 1 writable */
    move  v0,AT
    and   v1,~(SR_KUC|SR_IEC|SR_PE) /* make sure we aren't intr */
    mtc0  v1,C0_SR
    li    k0,MODE_USER
    move  AT,v0
    sw    k0,R_MODE*4(AT)     /* reset mode */
    lw    v1,R_V1*4(AT)
    lw    sp,R_SP*4(AT)
    lw    k0,R_EPC*4(AT)
    lw    v0,R_V0*4(AT)
    lw    AT,R_AT*4(AT)
    j     k0
    rfe
    .set  reorder
    .set  at
ENDFRAME(resume)

/*
** do_call(procedure, arg1, arg2, arg3, arg4, arg5, arg6, arg7, arg8)
** interface for call command to client code
** copies arguments to new frame and sets up gp for client
*/
#define CALLFRM ((8*4)+4+4)
FRAME(do_call,sp,CALLFRM,ra)
    subu  sp,CALLFRM
    sw    ra,CALLFRM-4(sp)
    sw    gp,CALLFRM-8(sp)
    move  v0,a0
    move  a0,a1
    move  a1,a2
    move  a2,a3
    lw    a3,CALLFRM+(4*4)(sp)
    lw    v1,CALLFRM+(5*4)(sp)
    sw    v1,4*4(sp)
    lw    v1,CALLFRM+(6*4)(sp)
    sw    v1,5*4(sp)
    lw    v1,CALLFRM+(7*4)(sp)
    sw    v1,6*4(sp)
    lw    v1,CALLFRM+(8*4)(sp)
    sw    v1,7*4(sp)
    la    t1,client_regs
    lw    gp,R_GP*4(t1)
    jal   v0
    lw    gp,CALLFRM-8(sp)
    lw    ra,CALLFRM-4(sp)
    addu  sp,CALLFRM
    j     ra
ENDFRAME(do_call)

/*
** clear_stat() -- clear status register
** returns current sr
*/
FRAME(clear_stat,sp,0,ra)
    .set  noreorder
    lw    v1,status_base
    mfc0  v0,C0_SR
    mtc0  v1,C0_SR
    j     ra
    nop
ENDFRAME(clear_stat)
    .set  reorder

/*
** setjmp(jmp_buf) -- save current context for non-local goto's
** return 0
*/
FRAME(setjmp,sp,0,ra)
    sw    ra,JB_PC*4(a0)
    sw    sp,JB_SP*4(a0)
    sw    fp,JB_FP*4(a0)
    sw    s0,JB_S0*4(a0)
    sw    s1,JB_S1*4(a0)
    sw    s2,JB_S2*4(a0)
    sw    s3,JB_S3*4(a0)
    sw    s4,JB_S4*4(a0)
    sw    s5,JB_S5*4(a0)
    sw    s6,JB_S6*4(a0)
    sw    s7,JB_S7*4(a0)
    move  v0,zero
    j     ra
ENDFRAME(setjmp)

/*
** longjmp(jmp_buf, rval)
*/
FRAME(longjmp,sp,0,ra)
    lw    ra,JB_PC*4(a0)
    lw    sp,JB_SP*4(a0)
    lw    fp,JB_FP*4(a0)
    lw    s0,JB_S0*4(a0)
    lw    s1,JB_S1*4(a0)
    lw    s2,JB_S2*4(a0)
    lw    s3,JB_S3*4(a0)
    lw    s4,JB_S4*4(a0)
    lw    s5,JB_S5*4(a0)
    lw    s6,JB_S6*4(a0)
    lw    s7,JB_S7*4(a0)
    move  v0,a1
    j     ra
ENDFRAME(longjmp)

/*
** wbflush() flush the write buffer - this is specific for each hardware
** configuration.
*/
FRAME(wbflush,sp,0,ra)
    .set  noreorder
    lw    t0,wbflush          # read an uncached memory location
    j     ra
    nop
    .set  reorder
ENDFRAME(wbflush)

INTERRUPTS

The MIPS CPUs are provided with 6 individual hardware interrupt bits, activated by CPU input pins (in the case of the R3081, one pin is used internally by the FPA), and 2 additional software-settable interrupt bits. An active level on any pin is sensed in each cycle, and will cause an exception if enabled.

The interrupt enable comes in two parts:
• The global interrupt enable bit (IEc) in the status register: when zero no interrupt exception will occur. Simple, fast and comprehensive, this is what prevents interrupts occurring during the early and vulnerable stages of processing exceptions. Also, the global interrupt enable is usually switched back on by an rfe instruction at the end of an exception routine; this means that the interrupt cannot take effect until the CPU has returned from the exception and finished with the EPC register, avoiding undesirable recursion in the interrupt routine.
• The individual interrupt mask bits IM in the status register, one for each interrupt. Set the bit to 1 to enable the corresponding interrupt. These are manipulated by software to allow whichever interrupts are appropriate to the system.
Changes to the individual bits are usually made “under cover”, with the global interrupt enable off.

What are the software interrupt bits for?

One commonly asked question is: “Why does the CPU provide two bits in the Cause register which, when set, immediately cause an interrupt unless masked?”

The clue is in “unless masked”. Typically this is used as a mechanism for high-priority interrupt routines to flag actions which will be performed by lower-priority interrupt routines, once the system has dealt with all high priority business. As the high-priority processing completes, the software will open up the interrupt mask, and the pending software interrupt will occur.
There is no definitive reason why the same effect should not be simulated by system software (using flags in memory, for example) but the soft interrupt bits are convenient because they fit in with the already provided interrupt handling mechanism.

Pin      SR/Cause bit no   Notes
-        8                 software interrupt
-        9                 software interrupt
Int0*    10                Cause bit reads 1 when pin low (active)
Int1*    11
Int2*    12
Int3*    13
Int4*    14
Int5*    15
                           Usual choice for FPA. The pin corresponding to the interrupt selected for FPA interrupts on an R3081 is effectively a no-connect.

Table 4.2. Interrupt bitfields and interrupt pins

Interrupt processing proper begins after an exception is received and the Type field in Cause signals that it was caused by an interrupt. Table 4.2, “Interrupt bitfields and interrupt pins” describes the relationship between Cause bits and input pins.

Once the interrupt exception is “recognized” by the CPU, the stages are:
• Consult the Cause register IP field, logically “and” it with the current interrupt masks in the SR IM field to obtain a bit-map of active, enabled interrupt requests. There may be more than one, and any of them would have caused the interrupt.
• Select one active, enabled interrupt for attention. The selection can be done simply by using fixed priorities; however, software is free to implement whatever priority mechanism is appropriate for the system.
• Software needs to save the old interrupt mask bits of the SR register, but it is quite likely that the whole SR register was saved in the main exception routine.
• Change IM in SR to ensure that the current interrupt and all interrupts of equal or lesser priority are inhibited.
• If not already performed by the main exception routine, save the state required for nested exception processing.
• Set the global interrupt enable bit IEc in SR to allow higher-priority interrupts to be processed.
• Call the particular interrupt service routine for the selected, current interrupt.
• On return, disable interrupts again by clearing IEc in SR, before returning to the normal exception stream.

Conventions and Examples

The following is as simple as an exception routine can be. It does nothing except increment a counter on each exception:

        .set    noreorder
        .set    noat
xcptgen:
        la      k0,xcptcount            # get address of counter
        lw      k1,0(k0)                # load counter
        nop                             # (load delay)
        addu    k1,1                    # increment counter
        sw      k1,0(k0)                # store counter
        mfc0    k0,C0_EPC               # get EPC
        nop                             # (load delay, mfc0 slow)
        j       k0                      # return to program
        rfe                             # branch delay slot
        .set    at
        .set    reorder

Note that this routine cannot survive a nested exception (the original return address in EPC would be lost, for example). It doesn’t re-enable interrupts, so no interrupt can nest; but the counter xcptcount must be at an address which can’t possibly suffer a TLB miss.

CHAPTER 5 CACHE MANAGEMENT

CACHES AND CACHE MANAGEMENT

R30xx family CPUs implement separate on-chip caches for instructions (I-cache) and data (D-cache). Following RISC principles, hardware functions are provided only for normal operation of the caches; software routines must be provided to initialize the cache following system start-up, and to invalidate cache data when required†.

Figure 5.1. Direct mapped cache (diagram labels: memory address split into higher bits and low bits; index; tag store; cache data store; match?; hit?; data)

The cache’s job is to hold a copy of memory data which has been recently read or written, so it can be returned quickly to the CPU; in the R30xx architecture data accesses in the cache take just one clock, and an I-cache and a D-cache operation can occur together.

When a cacheable location is read (a data load):
• It will be returned from the D-cache if the cache contains the corresponding physical address and the cache line is valid there (called a cache “hit”). In this case nothing happens at the CPU’s memory interface, so the read is invisible to the outside world.
• If the data is not found in the D-cache (called a cache “miss”), the data will be read from external memory. According to the CPU type and how it is set up, it may read one or more words from memory. The data is loaded into the cache, and normal operation then resumes.

In normal operation, the refill after a cache miss overwrites (in effect “invalidates”) whatever valid data was already present in the targeted cache line. In the R30xx caches, cache data is never more up-to-date than memory (because the cache is write-through, described below), so the previously cached data can be discarded without any trouble.

† Note that the R3071 and R3081 do implement a DMA protocol that allows automatic, hardware-based data cache invalidation.

When data is loaded from an uncacheable location, it is always obtained from external memory (or a memory-mapped IO location). Most systems never access the same data locations both cached and uncached; however, the results of doing so are predictable: on an uncacheable load, cache data is neither used nor updated.

When software writes a cached location:
• If the CPU is doing a 32-bit store, the cache is always updated (possibly discarding data from a previously cached location).
• For byte or half-word stores, the cache will only be updated if the reference hits in the cache; then data will be extracted from the cache, merged with the store data, and written back†.
• If the partial-word store misses in the cache, then the cache is left alone.
• In all cases, the write is also made to main memory.

When the store target is an uncached location the cache is not consulted or modified.

Figure 5.1, “Direct mapped cache” is a diagrammatic representation of the way the MIPS cache works. Both caches are:
• Physically indexed, physically tagged: the CPU’s program address (virtual address) is translated to a physical address, just as is used to address real memory, before being used for the cache lookup.
The TAG comparison (checking for a hit) is also based on physical addresses. On certain other CPU families the cache index is based on program addresses (which are available a bit earlier); some CPUs even use virtual TAGs, which then require that the cache be flushed at context switch. But physical caches are easier to manage.
• Direct mapped : Each physical address has only one location in each cache where it may reside. At each cache index there is only one data item stored – this will be just one word in the D-cache but is usually a 4-word line for the I-cache (see Figure 5.1, “Direct mapped cache”). Next to the data is kept the tag, which stores the memory address for which this data is a copy. If the tag matches the high-order (higher number) address bits then the cache line contains the data the CPU is looking for; the data is returned and execution continues. For an I-cache access, the CPU must select one of the four words based on the lowest address bits.
This is a direct mapped cache because there is only one tag/data pair at each cache index. More complex caches may have more than one tag field, and compare them simultaneously with the physical address.
A direct-mapped cache is very simple, but can suffer from cache thrashing; so the CPU can run slowly if a program loop is regularly accessing a pair of locations whose low-order addresses happen to be equal. To avoid this situation, the R30xx family implements relatively large caches, which minimize the probability of reasonable program loops causing cache thrashing.
• Cache lines : the line size is the number of data elements stored with each tag. For R30xx family CPUs the I-cache implements a 4-word line size; the D-cache always has 1-word lines.
When a cache miss occurs the whole line must be filled from memory. But it is quite possible to fetch more than a line’s worth of data; and R30xx family CPUs can be configured to fetch 4 words of data on a D-cache miss, refilling four 1-word “lines”.
• Write through : the D-cache is write-through, meaning that all store operations result in a store to main memory. This means that all data in the cache is duplicated in main memory, and can therefore be discarded at any time. In particular, when data is being read following a cache miss it can always be stored in the cache without regard for the data which was previously stored at the same index.
• Partial word write implementations : when the CPU writes only part of a word, it is essential that any valid cache data should still end up as a duplicate of main memory. One simple approach is to invalidate the cache line and to write only to main memory (the main memory must be byte-addressable). But the R30xx family uses a more efficient strategy:
a) if the location being written is present in the cache (cache hit) the cache data is read into the CPU, the partial-word data merged with it, the whole word written back to the cache, and the partial-word written to memory.
b) where the write misses in the cache the partial-word write is performed to memory only, and the cache left alone.
Note that the read-merge-write sequence of case (a) takes an extra clock, so a partial-word write which hits in the cache is slower than a whole-word write.

† In the R30xx family, the data will be merged in the D-cache. However, the CPU bus will perform the store only to the bytes which were actually changed (i.e. the store datum size), facilitating debugging.

Cache isolation and swapping

No special instructions are provided to explicitly access the caches; everything has to be done with load and store instructions.
To distinguish operations for cache management from regular memory references, without having to dedicate a special address region for this purpose, the R30xx architecture provides bits in the SR to support cache management:
• The SR mode bit “IsC” will isolate the D-cache; in this mode loads and stores affect only the cache, and loads also “hit” regardless of whether the tag matches. As a special mechanism, with the D-cache isolated a partial-word write will invalidate the appropriate cache line.
Caution: when the D-cache is isolated, not even loads/stores marked by their address or TLB entry as “uncached” will operate normally. One consequence of this is that the cache management routines must not make any data accesses; they are typically written in assembler, using only register variables.
• The CPU provides a mode where the caches are swapped (SR SwC bit), to allow the I-cache to be targeted by store instructions; then the D-cache acts as an I-cache, and the I-cache acts as the D-cache. Once the caches are swapped and isolated I-cache entries may be read, written and invalidated (invalidation uses the same partial word write mechanism described above).
Note that cache isolation does not stop instruction fetches from referencing main memory.
The D-cache behaves “perfectly” as an I-cache (provided it was sufficiently initialized to work as a D-cache) but the I-cache does not behave properly as a D-cache. It is unlikely that it will ever be useful to have the caches swapped but not isolated.
If software does use a swapped I-cache for word stores (a partial-word store invalidates the line, as before) it must make sure those locations are invalidated before returning to normal operation.

Initializing and sizing the caches

At machine start-up the caches are in a random state, so the result of a cached read is unpredictable.
In addition, following a reset the status register SwC and IsC bits are also in a random state, so start-up software must set them to a known state before attempting any load or store (even uncached).

Different members of the R3051 family have different cache sizes. Software will be more portable if it dynamically determines the size of the I-cache and D-cache at initialization time, rather than hard-wiring a particular value.

A number of algorithms are possible. Shown below is the code contained in IDT/sim for cache sizing. The basic algorithm works as follows:
• Isolate the D-cache.
• Swap the caches when sizing the I-cache.
• Write a marker into the initial cache entry.
• Start with the smallest permissible cache size.
• Read memory at the location for the current cache size. If it contains the marker, that is the correct size. Otherwise, double the size and repeat this step until the marker is found.

/*
** Config_cache() -- determine sizes of i and d caches
** Sizes stored in globals dcache_size and icache_size
*/
#define CONFIGFRM ((4*4)+4+4)
FRAME(config_cache,sp, CONFIGFRM, ra)
        .set    noreorder
        subu    sp,CONFIGFRM
        sw      ra,CONFIGFRM-4(sp)      # save return address
        sw      s0,4*4(sp)              # save s0 in first regsave slot
        mfc0    s0,C0_SR                # save SR
        mtc0    zero,C0_SR              # disable interrupts
        .set    reorder
        jal     _size_cache
        sw      v0,dcache_size
        li      v0,SR_SWC               # swap caches
        .set    noreorder
        mtc0    v0,C0_SR
        jal     _size_cache
        nop
        sw      v0,icache_size
        mtc0    zero,C0_SR              # swap back caches
        and     s0,~SR_PE               # do not inadvertently clear PE
        mtc0    s0,C0_SR                # restore SR
        .set    reorder
        lw      s0,4*4(sp)              # restore s0
        lw      ra,CONFIGFRM-4(sp)      # restore ra
        addu    sp,CONFIGFRM            # pop stack
        j       ra
ENDFRAME(config_cache)

/*
** _size_cache()
** return size of current data cache
*/
FRAME(_size_cache,sp,0,ra)
        .set    noreorder
        mfc0    t0,C0_SR                # save current sr
        and     t0,~SR_PE               # do not inadvertently clear PE
        or      v0,t0,SR_ISC            # isolate cache
        mtc0    v0,C0_SR
        /*
         * First check if there is a cache there at all
         */
        move    v0,zero
        li      v1,0xa5a5a5a5           # distinctive pattern
        sw      v1,K0BASE               # try to write into cache
        lw      t1,K0BASE               # try to read from cache
        nop
        mfc0    t2,C0_SR
        nop
        .set    reorder
        and     t2,SR_CM
        bne     t2,zero,3f              # cache miss, must be no cache
        bne     v1,t1,3f                # data not equal -> no cache
        /*
         * Clear cache size boundaries to known state.
         */
        li      v0,MINCACHE
1:      sw      zero,K0BASE(v0)
        sll     v0,1
        ble     v0,MAXCACHE,1b

        li      v0,-1
        sw      v0,K0BASE(zero)         # store marker in cache
        li      v0,MINCACHE             # MIN cache size
2:      lw      v1,K0BASE(v0)           # Look for marker
        bne     v1,zero,3f              # found marker
        sll     v0,1                    # cache size * 2
        ble     v0,MAXCACHE,2b          # keep looking
        move    v0,zero                 # must be no cache
        .set    noreorder
3:      mtc0    t0,C0_SR                # restore sr
        j       ra
        nop
ENDFRAME(_size_cache)
        .set    reorder

In a properly initialized cache, every cache entry is either invalid or correctly corresponds to a memory location, and also contains correct parity. Again, the sample code shown is from IDT/sim. The code works as follows:
• Check that SR bit PZ is cleared to zero (1 disables parity; the R3071 and R3081 contain parity bits, and thus PZ=1 could cause the caches to be initialized improperly).
• Isolate the D-cache, swap to access the I-cache.
• For each word of the cache: first write a word value (writing correct tag, data and parity), then write a byte (invalidating the line).

Note that for an I-cache with 4 words per line this is inefficient; it would be enough to write just one byte in the line to invalidate the entry. Unless the system uses the invalidate routine often it doesn’t seem worth the trouble.
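The word-then-byte step in the list above can be pictured with a small C model. This is only a sketch of the behaviour described in the prose, not IDT/sim code: the `cache_line` structure and the `isolated_store_word`, `isolated_store_byte` and `init_cache_model` names are hypothetical, and the model assumes 1-word lines as in the D-cache.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one 1-word line of an isolated cache. */
typedef struct {
    uint32_t tag;    /* address bits above the cache index */
    uint32_t data;   /* the cached word */
    int      valid;  /* non-zero when the line holds a usable copy */
} cache_line;

/* Isolated word store: writes tag and data and leaves the line valid. */
static inline void isolated_store_word(cache_line *cache, size_t nlines,
                                       uint32_t addr, uint32_t value)
{
    cache_line *line = &cache[(addr / 4) % nlines];
    line->tag = addr / (4 * (uint32_t)nlines);
    line->data = value;
    line->valid = 1;
}

/* Isolated partial-word store: the special case that invalidates the line. */
static inline void isolated_store_byte(cache_line *cache, size_t nlines,
                                       uint32_t addr)
{
    cache[(addr / 4) % nlines].valid = 0;
}

/* For each word: write a word (known tag and data), then a byte (invalidate). */
static inline void init_cache_model(cache_line *cache, size_t nlines)
{
    assert(cache != NULL && nlines > 0);
    for (size_t i = 0; i < nlines; i++) {
        uint32_t addr = (uint32_t)(i * 4);
        isolated_store_word(cache, nlines, addr, 0);
        isolated_store_byte(cache, nlines, addr);
    }
}
```

After `init_cache_model` returns, every modelled line holds known tag and data values and is marked invalid, whatever state the array started in. The model does not represent parity, swapping, or the 4-word I-cache lines.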
FRAME(flush_cache,sp,0,ra)
        lw      t1,icache_size
        lw      t2,dcache_size
        .set    noreorder
        mfc0    t3,C0_SR                # save SR
        nop
        and     t3,~SR_PE               # dont inadvertently clear PE
        beq     t1,zero,_check_dcache   # if no i-cache check d-cache
        nop
        li      v0,SR_ISC|SR_SWC        # disable intr, isolate and swap
        mtc0    v0,C0_SR
        li      t0,K0BASE
        .set    reorder
        or      t1,t0,t1
1:      sb      zero,0(t0)
        sb      zero,4(t0)
        sb      zero,8(t0)
        sb      zero,12(t0)
        sb      zero,16(t0)
        sb      zero,20(t0)
        sb      zero,24(t0)
        addu    t0,32
        sb      zero,-4(t0)
        bne     t0,t1,1b
        /*
         * flush data cache
         */
_check_dcache:
        li      v0,SR_ISC               # isolate and swap back caches
        .set    noreorder
        mtc0    v0,C0_SR
        nop
        beq     t2,zero,_flush_done
        .set    reorder
        li      t0,K0BASE
        or      t1,t0,t2
1:      sb      zero,0(t0)
        sb      zero,4(t0)
        sb      zero,8(t0)
        sb      zero,12(t0)
        sb      zero,16(t0)
        sb      zero,20(t0)
        sb      zero,24(t0)
        addu    t0,32
        sb      zero,-4(t0)
        bne     t0,t1,1b

        .set    noreorder
_flush_done:
        mtc0    t3,C0_SR                # un-isolate, enable interrupts
        .set    reorder
        j       ra
ENDFRAME(flush_cache)

Invalidation

Invalidation refers to the act of setting specified cache lines to contain no valid references to main memory, but to otherwise be consistent (e.g. valid parity). Software needs to invalidate:
• the D-cache when memory contents have been changed by something other than store operations from the CPU. Typically this is done when some DMA device is reading into memory.
• the I-cache when instructions have been either written by the CPU or obtained by DMA. The hardware does nothing to prevent the same locations being used in the I- and D-cache; and an update by the processor will not change the I-cache contents.

Note that the system could be constructed to use uncached accesses to those variables shared with a DMA device; the only difference is in performance. In general small areas where DMA is frequent compared to CPU activity should be mapped uncached; and larger areas where CPU activity predominates should be invalidated by the driver at appropriate points.
Bear in mind that invalidating a word of data in the cache is faster (probably 4-7 times faster) than an uncached load.

To invalidate the cache:
• Figure out the address range to invalidate. Invalidating a region larger than the cache size is a waste of time.
• Isolate the D-cache. Once it is isolated, the system must ensure at all costs against an exception (since the memory interface will be temporarily disabled). Disable interrupts and ensure that software which follows cannot cause a memory access exception.
• To work on the I-cache, swap the caches.
• Write a byte value to each cache line in the range.
• (Unswap and) unisolate.

The invalidate routine is normally executed with its instructions cacheable. This sounds like a lot of trouble; but in fact shouldn’t require any extra steps to run cached. An invalidation routine in uncached space will run 4-10 times slower.

Again, the example code fragment shown is taken from IDT/sim:

/*
** clear_cache(base_addr, byte_count)
** flush portion of cache
*/
FRAME(clear_cache,sp,0,ra)
        /*
         * flush instruction cache
         */
        lw      t1,icache_size
        lw      t2,dcache_size
        .set    noreorder
        mfc0    t3,C0_SR                # save SR
        and     t3,~SR_PE               # dont inadvertently clear PE
        nop
        nop
        li      v0,SR_ISC|SR_SWC        # disable intr, isolate and swap
        mtc0    v0,C0_SR
        .set    reorder
        bltu    t1,a1,1f                # cache is smaller than region
        move    t1,a1
1:      addu    t1,a0                   # ending address + 1
        move    t0,a0
1:      sb      zero,0(t0)
        sb      zero,4(t0)
        sb      zero,8(t0)
        sb      zero,12(t0)
        sb      zero,16(t0)
        sb      zero,20(t0)
        sb      zero,24(t0)
        addu    t0,32
        sb      zero,-4(t0)
        bltu    t0,t1,1b
        /*
         * flush data cache
         */
        .set    noreorder
        nop
        li      v0,SR_ISC               # isolate and swap back caches
        mtc0    v0,C0_SR
        nop
        .set    reorder
        bltu    t2,a1,1f                # cache is smaller than region
        move    t2,a1
1:      addu    t2,a0                   # ending address + 1
        move    t0,a0
1:      sb      zero,0(t0)
        sb      zero,4(t0)
        sb      zero,8(t0)
        sb      zero,12(t0)
        sb      zero,16(t0)
        sb      zero,20(t0)
        sb      zero,24(t0)
        addu    t0,32
        sb      zero,-4(t0)
        bltu    t0,t2,1b

        .set    noreorder
        mtc0    t3,C0_SR                # un-isolate, enable interrupts
        .set    reorder
        j       ra
ENDFRAME(clear_cache)

Testing and probing

During test, debug or when profiling, it may be useful to build up a picture of the cache contents. Software cannot read the tag value directly, but, for a valid line, can determine the tag value by exhaustive search:
• isolate the cache;
• load from the cache line at each possible line start address (low order bits fixed, high order bits ranging over physical memory which exists in the system). After each load consult the CM bit in SR, which will be “0” only when the tag value matches.

This takes a long time in computer terms; but to fully search a 1K D-cache with 4Mbytes of cacheable physical memory on a 20MHz processor will take only a couple of seconds, and will provide very valuable debugging information. IDT/sim provides this capability.

Configuration (R3041/71/81 only)

The R3041, R3071, and R3081 processors allow the programmer to make choices about the cache by setting fields in the Config register:
• Cache refill burst size (R3041/71/81) : by default the R3041 refills only 1 word in the D-cache on a cache miss; but software can program it to use 4-word burst reads instead, by setting the Config DBR bit. The bit can be changed at any time, without needing to invalidate the cache. The refill of R3071 and R3081 processors can be configured by hardware at reset-time, but software can override that choice.
This support is provided in the hope of enhancing performance. The proper selection for a given system will depend on both the hardware and the application. Some systems may find an advantage in “toggling” the bit for various portions of the software. In general, the proper burst size selection can be determined as follows:
Burst reads make most sense when the memory is capable of returning a burst of data significantly faster than it can return 4 individual words. Many DRAM systems are like this; most ROM and static RAM memories are not.
Similarly, data accessed from narrow memory ports should rarely be configured for a multi-word burst.
If programs tend to access memory sequentially (working up or down a large array, for example) then the burst refill will offer a very useful degree of data prefetch, and performance will be enhanced. If cache access is more random, the burst refill may actually reduce performance (since it involves overwriting cached data with memory data the program may never use).
As a general rule, the bigger the D-cache, the smaller the penalty for burst refills.
• Bigger I-cache in exchange for smaller D-cache (R3071/81) : the R3081 cache can be organized either with both I-cache and D-cache 8Kbytes in size, or with a 16Kbyte I-cache and 4Kbyte D-cache. The configuration is programmed using the AC bit in the Config register.
After changing the cache configuration both caches should be reinitialized, while running uncached. This means that most systems will not dynamically reconfigure the caches.
Which configuration is best for a given system is mainly dependent on the software. Cache effects are extremely hard to predict, and it is recommended that both configurations be tried and measured, while running as much of the real system as possible.
As a general rule: with large applications (like in a big OS) the big I-cache will probably be best. If the system spends most of its time manipulating lots of data from tight program loops, the big D-cache may be better.

WRITE BUFFER

The write-through cache common to all R30xx family CPUs can be a big performance bottleneck. In the average C program only about 10% of instructions are stores, but these accesses tend to come in bursts; for example, when a function prologue saves a few registers.
DRAM memory frequently has the characteristic that the first write of a group takes quite a long time (5-10 clocks typical on these CPUs), and subsequent ones are relatively fast so long as they follow quickly.
If the CPU simply waits for all writes to complete, the performance hit will be significant. So the R30xx provides a write buffer, a FIFO store which keeps a number of entries each containing both data to be written, and the address at which to write it. The 4-entry queue provided by R30xx family CPUs is efficient for well-tuned DRAM.

In general, the operation of the write buffer is completely transparent to software. Occasionally, the programmer needs to be aware of what is happening:
• Timing relations for IO register accesses : When software performs a store to write an IO register, the store reaches memory after a small, but indeterminate, delay. Some consequences are:
- other communication with the IO system (e.g. interrupts) may happen more quickly – for example, the CPU may get an interrupt from a device “after” it has been programmed to generate no interrupts.
- if the IO device needs some time to recover after a write the program must ensure that the write buffer FIFO is empty before counting out that time period.
- at the end of interrupt service, when writing to an IO device to clear the interrupt it is asserting, software must ensure that the command is actually written to the device, and that the device has had time to respond, before re-enabling that interrupt; otherwise, spurious interrupts may be signalled.
In these cases the programmer must ensure that the CPU waits while the write buffer empties. It is good practice to define a subroutine which does this job; it is traditionally called wbflush(). Hints on implementing this function are provided later in this chapter.

On CPUs outside the R30xx family, even stranger things can happen:
• Reads overtaking writes : a load instruction (uncached or missing in the cache) executed while the write buffer FIFO is not empty gives the CPU a choice: should it finish off the write, or use the memory interface to fetch data for the load?
The R3041, R3051, R3052 and R3081 all have the same rule, which avoids potential problems: the write buffer is emptied before the load occurs.
Although it seems tempting to instead implement a scheme which checks for conflicts, and allows the read to progress if no write buffer entry matches the read target address, such a scheme does not avoid the possible system problems. Specifically, writes to locations which may have side effects (e.g. semaphores, IO registers, etc.), are not detected under such a scheme, and can cause great headaches to the programmer.
• Byte gathering : some write buffers watch for partial-word writes within the same memory word, and will combine those partial writes into a single operation. This is not done by any current R30xx family CPU, because such operation would pose problems with IO register writes.

Implementing wbflush()

IDT R30xx family CPUs enforce strict write priority (all pending writes retired to memory before main memory is read). Thus, implementing wbflush() is as simple as implementing an uncached load (e.g. from the boot PROM vector). This will stall the CPU until the writes have finished, and the load finished too. Alternatively, the overhead can be minimized by performing an uncached load from the fastest memory available in the system.

The code fragment below shows an implementation of wbflush(), taken from IDT/sim:

/*
** wbflush() flush the write buffer - this is specific for each hardware
** configuration.
*/
FRAME(wbflush,sp,0,ra)
        .set    noreorder
        lw      t0,wbflush              # read an uncached memory location
        j       ra
        nop
        .set    reorder
ENDFRAME(wbflush)

CHAPTER 6 MEMORY MANAGEMENT AND THE TLB

MEMORY MANAGEMENT AND THE TLB

Some R30xx family processors (“E” versions) have on-chip memory management hardware. This provides a mechanism for dynamically translating program addresses in the kuseg and kseg2 regions. The key piece of hardware is the “TLB†”.
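As a reminder of which program addresses the TLB sees at all, the sketch below classifies an address by region. It assumes the standard MIPS address map (kuseg below 0x80000000, kseg0 from 0x80000000, kseg1 from 0xA0000000, kseg2 from 0xC0000000); the type and function names are hypothetical and not taken from any IDT toolchain.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names for the four program address regions. */
typedef enum { SEG_KUSEG, SEG_KSEG0, SEG_KSEG1, SEG_KSEG2 } mips_segment;

/* Classify a 32-bit program address by region. */
static inline mips_segment segment_of(uint32_t vaddr)
{
    if (vaddr < 0x80000000u) return SEG_KUSEG;
    if (vaddr < 0xa0000000u) return SEG_KSEG0;
    if (vaddr < 0xc0000000u) return SEG_KSEG1;
    return SEG_KSEG2;
}

/* Only kuseg and kseg2 addresses are translated through the TLB. */
static inline int is_tlb_translated(uint32_t vaddr)
{
    mips_segment seg = segment_of(vaddr);
    return seg == SEG_KUSEG || seg == SEG_KSEG2;
}

/* kseg0/kseg1 addresses map to physical memory by dropping the top 3 bits. */
static inline uint32_t kseg_to_phys(uint32_t vaddr)
{
    mips_segment seg = segment_of(vaddr);
    assert(seg == SEG_KSEG0 || seg == SEG_KSEG1);
    return vaddr & 0x1fffffffu;
}
```

For example, `is_tlb_translated(0xc0001000u)` is true, while a kseg1 address such as 0xbfc00000 bypasses the TLB and `kseg_to_phys` maps it to physical 0x1fc00000.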
The memory management is paged, with a fixed page size of 4Kbytes. The low-order 12 bits of the program address are used directly as the low order bits of the physical address, so address translation operates in 4K chunks.

The TLB is a 64-entry associative memory. Each entry in an associative memory consists of a key field and a data field; when presented with a key, the memory returns the data of any entry where the key matches. In the R30xx family, the TLB is referred to as “fully-associative”; this emphasizes that all keys are really compared with the input value in parallel.

The TLB’s key field contains two sections:
• Virtual page number (VPN) : this is just a program address with the low 12 bits cut off, since the low-order bits don’t participate in the translation process.
• Address Space Identifier (ASID) : this is a magic number used to stamp translations, and (optionally) is compared with an extended part of the key.
Why? In multi-tasking systems it is common to have all user-level tasks executing at the same sort of program addresses (though of course they are using different physical addresses); they are said to be using different address spaces. So translation records for different tasks will often share the same value of “VPN”. If the TLB mechanism was not supported with an ASID, when the OS switches from one task to another, it would have to find and invalidate all TLB translations relating to the old task’s address space, to prevent them from being erroneously used for the new one. This would be desperately inefficient.
Instead, the OS assigns a 6-bit unique code to each task’s distinct address space. During normal running this code is kept in the ASID field of the EntryHi register, and is used together with the program address to form the lookup key; so a translation with an ASID code which doesn’t match is quietly ignored.
Since the ASID is only 6 bits long, OS software does have to lend a hand if there are ever more than 64 address spaces in concurrent use; but it probably won’t happen too often. In such a system, new tasks are assigned new ASIDs until all 64 are assigned; at that time, all tasks have their ASIDs “de-assigned” and the TLB is flushed; as each task is re-entered, a new ASID is given. Thus, ASID flushing is relatively infrequent.

† This is an acronym for “translation lookaside buffer”, which is a look-up table of virtual to physical address translations.

The TLB data field includes:
• Physical frame number (PFN) : the physical address with the low 12 bits cut off. In an address translation, the VPN bits are replaced by the corresponding PFN bits to form the true physical address.
• Cache control bit (N) : set 1 to make the page uncacheable.
• Write control bit (D) : set 1 to allow stores to this page to happen. The “D” comes from this being called the “dirty bit”; a later section on “Simulating dirty bits” describes a typical use for these bits.
• Valid bit (V) : set 0 to make this entry unusable. This seems pretty pointless; why have a record loaded into the TLB if the translation is not usable? But an access to an invalid page produces a different trap from a TLB refill exception, so making a page invalid means that some strange conditions can be made to take a different trap, which does not have to be handled by the superfast refill code.
• Global bit (G) : set to disable the ASID-matching scheme, allowing an OS to map some program addresses to the same physical address for all tasks; it can be useful to have some corner of each address space mapped to the same physical locations. Sharp-eyed or experienced readers will notice that this means that the global bit is really more like part of the key than part of the data; the distinction tends to get blurred in associative memories.
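The key and data fields listed above can be packed into register images with a few lines of C. This is a minimal sketch that takes its bit positions from the EntryHi and EntryLo layouts (Figures 6.1 and 6.2: VPN and PFN in bits 31-12, ASID in bits 11-6, N, D, V, G in bits 11-8); the macro and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Field masks, following the EntryHi/EntryLo register layouts. */
#define TLB_PAGE_MASK  0xfffff000u  /* VPN in EntryHi, PFN in EntryLo */
#define TLBHI_ASID_SHIFT 6
#define TLBLO_N 0x00000800u         /* 1 = uncacheable */
#define TLBLO_D 0x00000400u         /* 1 = stores allowed */
#define TLBLO_V 0x00000200u         /* 1 = entry usable */
#define TLBLO_G 0x00000100u         /* 1 = ignore ASID when matching */

/* Build the key: program address with the low 12 bits cut off, plus ASID. */
static inline uint32_t make_entryhi(uint32_t vaddr, unsigned asid)
{
    assert(asid < 64);  /* the ASID field is 6 bits wide */
    return (vaddr & TLB_PAGE_MASK) | ((uint32_t)asid << TLBHI_ASID_SHIFT);
}

/* Build the data: physical address with the low 12 bits cut off, plus flags. */
static inline uint32_t make_entrylo(uint32_t paddr, uint32_t flags)
{
    assert((flags & ~(TLBLO_N | TLBLO_D | TLBLO_V | TLBLO_G)) == 0);
    return (paddr & TLB_PAGE_MASK) | flags;
}
```

For instance, `make_entryhi(0x00403abc, 5)` keeps the page number 0x00403 and stamps it with ASID 5, and `make_entrylo(0x01234567, TLBLO_D | TLBLO_V)` describes a valid, writable, cacheable page at physical frame 0x01234.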
Translating an address is now simple, and goes like this:
• CPU generates a program address : either for an instruction fetch, a load or a store, in one of the translated address regions. The low 12 bits are separated off, and the resulting VPN together with the current value of the ASID field in EntryHi used as the key to the TLB.
• TLB matches key : selecting the matching entry. The PFN is glued to the low-order bits of the program address to form a complete physical address.
• Valid? : the V and D bits are consulted. If it isn’t valid, or a store is being attempted with D cleared, the CPU takes a trap. As with all translation traps, the BadVaddr register will be filled with the offending program address and TLB registers Context and EntryHi pre-filled with relevant information. The system software can use these registers to obtain data for exception service.
• Cached? : if the N bit is clear the CPU looks in the cache for a copy of the physical location’s data; if it isn’t there it will be fetched from memory and a copy left in the cache. Where the N bit is set the CPU neither looks in nor refills the cache.

Of course, there are only 64 entries in the TLB, which can hold translations for a maximum of 256 Kbytes of program addresses. This is far short of enough for most systems. The TLB is almost always going to be used as a software-maintained “cache” for a much larger set of translations.

When a program address lookup in the TLB fails, a TLB refill trap is taken. System software has the job of:
• figuring out whether there is a correct translation; if not the trap will be dispatched to the software which handles address errors.
• if there is a correct translation, constructing a TLB entry which will implement it;
• if the TLB is already full (and it almost always is full in running systems), selecting an entry which can be discarded;
• writing the new entry into the TLB.
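The translation steps above can be summarized as a software model. The sketch below is illustrative only: the hardware compares all 64 entries in parallel, whereas this model scans them in a loop, and the `tlb_entry`, `xlate_status`, `xlate_result` and `tlb_translate` names are hypothetical. Entries are held as EntryHi/EntryLo images using the bit positions of Figures 6.1 and 6.2.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TLB_ENTRIES 64
#define VPN_MASK    0xfffff000u  /* VPN in hi, PFN in lo */
#define ASID_MASK   0x00000fc0u
#define LO_N        0x00000800u
#define LO_D        0x00000400u
#define LO_V        0x00000200u
#define LO_G        0x00000100u

typedef struct { uint32_t hi, lo; } tlb_entry;

typedef enum {
    XLATE_OK,        /* translation succeeded */
    XLATE_REFILL,    /* no entry matched: TLB refill trap */
    XLATE_INVALID,   /* entry matched but V is clear */
    XLATE_MODIFIED   /* store attempted with D clear */
} xlate_status;

typedef struct {
    xlate_status status;
    uint32_t     paddr;      /* meaningful only when status == XLATE_OK */
    int          cacheable;  /* 1 when the N bit is clear */
} xlate_result;

static inline xlate_result tlb_translate(const tlb_entry tlb[TLB_ENTRIES],
                                         uint32_t entryhi, uint32_t vaddr,
                                         int is_store)
{
    xlate_result r = { XLATE_REFILL, 0, 0 };
    assert(tlb != NULL);
    for (int i = 0; i < TLB_ENTRIES; i++) {
        const tlb_entry *e = &tlb[i];
        if ((e->hi & VPN_MASK) != (vaddr & VPN_MASK))
            continue;                       /* VPN does not match */
        if (!(e->lo & LO_G) &&
            (e->hi & ASID_MASK) != (entryhi & ASID_MASK))
            continue;                       /* wrong address space */
        if (!(e->lo & LO_V)) { r.status = XLATE_INVALID; return r; }
        if (is_store && !(e->lo & LO_D)) { r.status = XLATE_MODIFIED; return r; }
        r.status = XLATE_OK;
        r.paddr = (e->lo & VPN_MASK) | (vaddr & ~VPN_MASK);  /* PFN + offset */
        r.cacheable = !(e->lo & LO_N);
        return r;
    }
    return r;                               /* nothing matched: refill */
}
```

A caller would pass the current EntryHi value (for its ASID field), the program address, and whether the access is a store. Unused entries in the model should hold VPNs that can never match, in the same way that real TLB initialization code avoids leaving duplicate keys.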
See below for how this can be tackled; but note here that although special CPU features help out with one particular class of implementations, the software can refill the TLB any way it likes.

Register    CP0
Mnemonic    reg no   Description
EntryHi     10       Together these registers hold a TLB entry. All reads and
EntryLo     2        writes to the TLB must be staged through them. EntryHi
                     also remembers the current ASID.
Index       0        Determines which TLB entry will be read/written by
                     appropriate instructions.
Random      1        Pseudo-random value (actually a free-running counter)
                     used by a tlbwr to write a new TLB entry into a
                     ‘‘randomly’’ selected location.
Context     4        Convenience register provided to speed up the processing
                     of TLB refill traps. The high-order bits are read/write;
                     the low-order 21 bits reflect the BadVaddr value. (The
                     register is designed so that, if the system uses the
                     ‘‘favored’’ arrangement of memory-held copies of memory
                     translation records, it will be set up by a TLB refill
                     trap to point to the memory location of the record needed
                     to map the offending address. This speeds up the process
                     of finding the current memory mapping, and arranging
                     EntryHi/Lo properly.)

Table 6.1. CPU control registers for memory management

MMU registers described

EntryHi, EntryLo

    Bits     31-12   11-6    5-0
    Field    VPN     ASID    0

EntryHi Register (TLB key fields)
Figure 6.1. EntryHi and EntryLo register fields

    Bits     31-12   11   10   9    8    7-0
    Field    PFN     N    D    V    G    0

EntryLo Register (TLB data fields)
Figure 6.2. EntryHi and EntryLo register fields

These two registers represent a TLB entry, and are best considered as a pair. Fields in EntryHi are:

• VPN : ‘‘virtual page number’’, the high-order bits of a program address. On a refill exception this field is set up automatically to match the program address which could not be translated. To write a different TLB entry, or attempt a TLB probe, software must set it up “manually”.
• ASID : ‘‘address space identifier’’, normally left holding the OS’ value for the current address space. This is not changed by exceptions. Most software systems will deliberately write this field only to set up the current address space. However, software must be careful when using tlbr to inspect TLB entries; the operation overwrites the whole of EntryHi, so software needs to restore the correct current ASID value afterwards.

Fields in EntryLo are:

• PFN : the high-order bits of the physical address to which values matching EntryHi’s VPN will be translated.
• N : ‘‘noncacheable’’; 0 to make the access cacheable, 1 for uncacheable.
• D : ‘‘dirty’’, but really a write-enable bit. 1 to allow writes; if 0, any store using this translation will be trapped.
• V : ‘‘valid’’, if 0 any address matching this entry will cause an exception.
• G : ‘‘global’’. When the G bit in a TLB entry is set, that TLB entry will match solely on the VPN field, regardless of whether the TLB entry’s ASID field matches the value in EntryHi.
• Fields called ‘‘0’’ : these fields always return zero; but unlike many reserved fields, they do not need to be written as zero (nothing happens regardless of the data written). This is important; it means that the memory-resident data which is used to generate EntryLo when refilling the TLB can contain some software-interpreted data in these fields, which the TLB hardware will ignore without the need to spend precious CPU cycles masking it.

Index

    Bits     31   30-14   13-8    7-0
    Field    P    ×       Index   ×

Figure 6.3. Fields in the Index register

The ‘‘P’’ field is set when a tlbp instruction (tlb probe, used to see if the TLB can translate a particular VPN) failed to find a valid translation; since it is the top bit it appears to make the 32-bit value negative, which is easy to test for.

Random

    Bits     31-14   13-8     7-0
    Field    ×       Random   ×

Figure 6.4. Fields in the Random register

Most systems never have to read or write the Random register in normal use; but it may be useful for diagnostics. The hardware initializes the Random field to its maximum value (63) on reset, and it decrements every clock period until it reaches 8, when it wraps back to 63 and starts again.

Context

    Bits     31-21     20-2      1-0
    Field    PTEBase   Bad VPN   0

Figure 6.5. Fields in the Context Register

• PTEBase : a location which just stores what is put in it. In the ‘‘standard’’ refill handler, this will be the high-order bits of the starting address of a memory-resident page table (2Mbyte aligned, since the low-order 21 bits of Context are not part of PTEBase).
• Bad VPN : following an addressing exception this holds the high-order bits of the address; exactly the same as the high-order bits of BadVaddr. However, if the system uses the ‘‘standard’’ TLB refill exception handling code the 32-bit value formed by Context is directly usable as a pointer to the memory-resident page table, considerably shortening the refill exception code.
• Fields marked 0 : can be written with any value, but they will always read zero.

MMU control instructions

tlbr – Read TLB entry at index
tlbwi – Write TLB entry at index

The above two instructions move MMU data between the TLB entry selected by the Index register and the EntryHi and EntryLo registers.

tlbwr – Write TLB entry selected by Random

copies the contents of EntryHi & EntryLo into the TLB entry indexed by the Random register. This saves time when using the recommended random replacement policy. In practice, tlbwr will be used to write a new TLB entry in a TLB refill exception handler; tlbwi will be used anywhere else.

tlbp – TLB lookup

searches (probes) the TLB for an entry whose virtual page number and ASID match those currently in EntryHi, and stores the index of that entry in the Index register (Index is set to a negative value if nothing matches).
If more than one entry matches, anything might happen. Note that tlbp does not fetch data from the TLB. The instruction following a tlbp must not be a load or store.

Programming interface to the TLB

TLB entries are set up by writing the required fields into EntryHi and EntryLo and using a tlbwr or tlbwi instruction to copy that entry into the TLB proper. When handling a TLB refill exception, EntryHi has been set up automatically, with the current ASID and the required VPN.

Be very careful not to create two entries which will match the same program address/ASID pair. If the TLB contains duplicate entries an attempt to translate such an address, or probe for it, produces a fatal ‘‘TLB shutdown’’ condition (indicated by the TS bit in SR being set). It can be cleared only by a hardware reset.

System software often won’t need to read TLB entries at all. But if necessary, software can find the TLB entry matching some particular program address using tlbp to set up the Index register. Don’t forget to save EntryHi and restore it afterwards because its ASID field is likely to be important. Use a tlbr to read the TLB entry into EntryHi and EntryLo.

How refill happens

When a program makes an access in kuseg or kseg2 to a page for which no translation record is present, the CPU takes a TLB refill exception. The assumption is that system software is maintaining a large number of page translations and is using the TLB as a cache of recently-used translations; so the refill exception will normally be handled by finding a correct translation, installing it, and returning to user code.

In ‘‘CISC’’ CPUs the TLB is a cache refilled by the hardware (usually implemented by microcode): the CPU automatically reads memory-resident ‘‘page tables’’ whose structure is part of the CPU architecture. In the MIPS architecture software is fast enough to do this job, and offers greater flexibility.
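To see why the Context register shortens the ‘‘standard’’ refill handler, it may help to work through the arithmetic. The C sketch below computes the value Context would present after a refill trap and shows that it equals the address of a 4-byte page table entry indexed by virtual page number. It assumes that the Bad VPN field carries program address bits 30-12 (the 19 bits that fit in Context bits 20-2) and that each page table entry is one 32-bit word; the function names are illustrative only.

```c
#include <stdint.h>

/* Value presented by Context after a refill trap on badvaddr,
 * given the PTEBase previously written by software. */
uint32_t context_value(uint32_t ptebase, uint32_t badvaddr)
{
    uint32_t base   = ptebase & 0xFFE00000u;          /* bits 31-21 */
    uint32_t badvpn = (badvaddr >> 10) & 0x001FFFFCu; /* VA 30-12 -> bits 20-2 */
    return base | badvpn;
}

/* The same address computed "by hand": base of a table of 4-byte
 * entries plus the virtual page number times the entry size. */
uint32_t pte_address(uint32_t table_base, uint32_t badvaddr)
{
    uint32_t vpn = (badvaddr & 0x7FFFFFFFu) >> 12;    /* 19-bit VPN */
    return table_base + vpn * 4u;
}
```

For a table at 0xC0000000 and a faulting address of 0x00403ABC (virtual page 0x403), both functions give 0xC000100C, so the handler can load the translation record through Context without any shifting or masking of its own.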
To save time on user-program TLB refill exceptions (which will happen frequently in a ‘‘big’’ OS):

• refill exceptions on kuseg program addresses are vectored through a low-memory address used for no other exception;
• special exception rules permit the kuseg refill handler to risk a nested TLB refill exception on a kseg2 address.

The problem is that before an exception routine can itself suffer an exception it must first save the previous program state, represented by the EPC return address and some SR bits. This is helped out by a hardware feature and a software convention:

a) The KUo, IEo bits in the status register act as a third level of the processor-state stack, so that the CPU state already saved as a result of the kuseg refill exception can be preserved during the nested exception.
b) The kuseg refill handler copies EPC into the k1 register; the general exception code and kseg2 refill handler are then careful to preserve its value, enabling a clean return.

Refill exceptions on kseg2 addresses are expected to be rare enough that it will not matter if they share in the overhead of the ‘‘all other exceptions’’ entry point. However, once software determines the type of exception the handling is similar.

Using ASIDs

By setting up TLB entries with a particular ASID setting and with the EntryLo G bit zero, those entries will only ever match a program address when the current ASID (held in EntryHi) is set the same. This allows software to map up to 64 different address spaces simultaneously, without requiring that the OS clear out the TLB on a context change.

In typical usage, new tasks are assigned an “un-initialized” ASID. The first time the task is invoked, it will presumably miss in the TLB, allowing the assignment of an ASID. If the system does run out of new ASIDs, it will flush the TLB and mark all tasks as “new”. Thus, as each task is reentered, it will be assigned a new ASID.
This sequence is expected to happen infrequently if ever.

The Random register and wired entries

The hardware offers no way of finding out which TLB entries have been used most recently. When the system needs to replace a mapping dynamically (using the TLB as a cache) the only practicable strategy is to replace an entry at random. The CPU makes this easy by maintaining the Random register, which counts (down) with every processor cycle.

However, it is often useful to have some TLB entries which are guaranteed to stay there unless explicitly removed. These may be useful to map pages which are known to be required very often; they are critical because they allow the system to map pages and guarantee that no refill exception will be generated on them.

The stable TLB entries are described as ‘‘wired’’ and on R30xx family CPUs consist of TLB entries 0 through 7. There is nothing special about these entries; the magic is in the Random register, which never takes values 0-7; it cycles directly from 63 down to 8 before reloading with 63. So conventional random replacement leaves TLB entries 0 through 7 unaffected, and entries written there will stay until explicitly removed.

Memory translation – setup

The following code fragment initializes the TLB to ensure no match on any kuseg or kseg2 address.
This is important, and is preferable to initializing with all “0”’s (which is a kuseg address, and which would cause multiple matches if referenced):

    LEAF(mips_init_tlb)
        mfc0    t0,C0_ENTRYHI       # save asid
        mtc0    zero,C0_ENTRYLO     # tlblo = !valid
        li      a1,NTLBID< [...]

    [...] vaddr) >> VMPGSHIFT;
    unsigned vpn = xcp->vaddr >> VMPGSHIFT;
    unsigned asid = 0;
    /* write a random tlb (entryhi, entrylo) pair */
    /* mark it valid, global, uncached, and not writable/dirty */
    r3k_tlbwr ((vpn < [...]

    [...] double */
    cvt.d.w fd,fs    fd = (double) fs;   /* int -> double */
    cvt.s.d fd,fs    fd = (float) fs;    /* double -> float */
    cvt.s.w fd,fs    fd = (float) fs;    /* int -> float */
    cvt.w.s fd,fs    fd = (int) fs;      /* float -> int */
    cvt.w.d fd,fs    fd = (int) fs;      /* double -> int */

Table 8.7. FPA data conversion operations

When converting from FP formats to 32-bit integer, the result produced depends on the current rounding mode.

Conditional branch and test instructions

The FP test and branch instructions are separate. A test instruction compares two FP values and sets the FPA condition bit accordingly (C in the FP status register); the branch instructions branch on whether the bit is set or unset.

The branch instructions are:

    bc1f disp    Branch if C bit ‘‘false’’ (zero)
    bc1t disp    Branch if C bit ‘‘true’’ (one)

Like the CPU’s other conditional branch instructions disp is PC-relative, with a signed 16-bit field as a word displacement. disp is usually coded as the name of a label, which is unlikely to end up more than 128Kbytes away.

But before executing the branch, the condition bit must be set appropriately. The comparison operators are:

    c.cond.d fs1,fs2    Compare fs1 and fs2 and set C
    c.cond.s fs1,fs2

Where cond is any of 16 conditions called: eq, f, le, lt, nge, ngl, ngle, ngt, ole, olt, seq, sf, ueq, ule, ult, un. Why so many?
These test for any ‘‘OR’’ combination of three mutually incompatible conditions: fs1 [...] f2) goto foo; /* and trap if unordered */

    c.ole.d $f0, $f2
    nop                 # the assembler will do this...
    bc1f    foo

Fortunately, many assemblers recognize and manage this delay slot properly.

INSTRUCTION TIMING REQUIREMENTS

FP arithmetic instructions are interlocked (the instruction flow “stalls” automatically until results are available; the programmer does not need to be explicitly aware of execution times), and there is no need to interpose ‘‘nops’’ or to reorganize code for correctness. However, optimal performance will be achieved by code which lays out FP instructions to make the best use of overlapped execution of integer instructions, and the FP pipeline.

Even so, the compiler, assembler or (in the end) the programmer must take care about the timing of:

• Operations on the FP control and status register: moves between FP and integer registers complete late, and the resulting value cannot be used in the following instruction.
• FP register loads: like integer loads, take effect late. The value can’t be used in the following instruction.
• Test condition and branch: the test of the FP condition bit using the bc1t, bc1f instructions must be carefully coded, because the condition bit is tested a clock earlier than might be expected. So the conditional branch cannot immediately follow a test instruction.

INSTRUCTION TIMING FOR SPEED

The R30xx family FPA takes more than one clock for most arithmetic instructions, and so the pipelining becomes visible.
The pipeline can show up in three ways:

• Hazards: where the software must ensure the separation of instructions to work correctly;
• Interlocks: where the hardware will protect the software by delaying use of an operand until it is ready, but knowledgeable re-arrangement of the code will improve performance;
• Overlapping: where the hardware is prepared to start one operation before another has completed, provided there are no data dependencies. This is discussed later.

Hazards and interlocks arise when instructions fail to stick to the general MIPS rule of taking exactly one clock period between needing operands and making results ready. Some instructions either need operands earlier (branches, particularly, do this), or produce results late (e.g. loads). All R30xx family instructions which can cause trouble are tabulated in an appendix of this manual.

INITIALIZATION AND ENABLE ON DEMAND

Reset processing will normally initialize the CPU’s SR register to disable all optional co-processors, which includes the FPA (alias coprocessor 1). The SR bit CU1 has to be set for the FPA to work.

To determine availability of a hardware FPA, software should read the FPA implementation register; if it reads zero, no FP is fitted and software should run the system with CU1 off†.

Once CU1 is enabled, software should set up the control/status register FCR31 with the system choice of rounding modes and trap enables.

Once the FPA is operating, the FP registers should be saved and restored during interrupts and context switches. Since this is (relatively) time-consuming, software can optimize this:

• Leave the FPA disabled by default when running a new task. Since the task cannot now access the FPA, the OS doesn’t have to save and restore registers.
• On a FP instruction trap, mark the task as an FP user and enable the FP before returning to it.
• Disable FP operations while in the kernel, or in any software called directly or indirectly from an interrupt routine. This avoids saving FP registers on an interrupt; instead FP registers need be saved only when context-switching to or from an FP-using task.

FLOATING POINT EMULATION

The low-cost members of the R30xx family do not have a hardware FPA. Floating point functions for these processors are provided by software, and are slower than the hardware. Software FP is useful for systems where floating point is employed in some rarely-used routines.

There are two approaches:

• Soft-float: Some compilers can be requested to implement floating point operations with software. In such a system, the instruction stream does not contain actual floating point operations; instead, when the software requests floating point from the compiler, the compiler inserts a call to a dedicated floating point library. This eliminates the overhead of emulating a floating point register file, and also the overhead of decoding the requested operation.
• Run-time emulation: The compiler can produce the regular FP instruction set. The CPU will then take a trap on each FP instruction, which is caught by the FP emulator. The emulator decodes the instruction and performs the requested operation in software. Part of the emulator’s job will be emulating the FP register set in memory. This technique is much slower than the soft-float technique; however, the binaries generated will automatically gain significant performance when executed by an R3081, simplifying system upgrades.

As described above, a run-time emulator may also be required to back up FP hardware for very small operands or obscure operations; and, for maximal flexibility that emulator is usually complete. However, it will be written to ensure exact IEEE compatibility and is only expected to be called occasionally, so it will probably be coded for correctness rather than speed.
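As an illustration of the decoding work a run-time emulator must do on every trapped instruction, the following C sketch splits a coprocessor 1 arithmetic instruction word into its fields. The field positions follow the standard MIPS coprocessor 1 instruction encoding; the structure and function names are hypothetical and are not taken from any particular emulator.

```c
#include <stdint.h>

struct fp_op {              /* hypothetical decoded form */
    unsigned fmt;           /* 16 = single, 17 = double, 20 = 32-bit integer */
    unsigned ft, fs, fd;    /* source, source and destination FP registers */
    unsigned funct;         /* 0 add, 1 sub, 2 mul, 3 div, ... */
};

/* Fill *op and return 1 if insn is a coprocessor 1 arithmetic-format
 * instruction; return 0 for anything else (including the moves and
 * branches, which use values below 16 in the format field). */
int fp_decode(uint32_t insn, struct fp_op *op)
{
    if ((insn >> 26) != 0x11u)          /* major opcode: COP1 */
        return 0;
    op->fmt   = (insn >> 21) & 0x1Fu;
    op->ft    = (insn >> 16) & 0x1Fu;
    op->fs    = (insn >> 11) & 0x1Fu;
    op->fd    = (insn >> 6)  & 0x1Fu;
    op->funct = insn & 0x3Fu;
    return op->fmt >= 16u;
}
```

An emulator would follow this step with a dispatch on fmt and funct, and would read and write its in-memory copy of the FP register set using ft, fs and fd; that per-instruction work is part of the overhead that soft-float avoids.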
Compiled-in floating point (soft-float) is much more efficient on integer only chips; the emulator has a high overhead on each instruction from the trap handler, instruction decoder, and emulated register file.

† Some systems may still enable CP1, to use the BrCond(1) input pin as an input port. The software must then ensure that no FPA operations are actually required, since the CPU will presume that they are actually executed.

CHAPTER 9 ASSEMBLER LANGUAGE PROGRAMMING

This chapter details the techniques and conventions associated with writing and reading MIPS assembler code. This is different from just looking at the list of machine instructions because:

1) MIPS assemblers provide a large number of extra ‘‘macro’’ instructions which provide a richer instruction set than in fact exists at the machine level.
2) Programmers need to know the exact syntax of directives to start and end functions, define data, control instruction ordering and optimization, etc.

Before reading much further, it may be a good idea to go back and review Chapter 2 (MIPS Architecture). It describes the low-level machine instruction set, data types, addressing modes, and conventional register usage.

SYNTAX OVERVIEW

Appendix C of this manual contains the formal syntax for the original MIPS Corp. assembler; most assemblers from other vendors follow this closely, although they may differ in their support of certain directives. These directives and conventions are similar to those found in other assemblers, especially UNIX† assemblers.

Key points to note

• The assembler allows more than one statement on each line, as long as they are separated by semi-colons.
• "White space" (tabs and spaces) is permitted between any symbols.
• All text from a ‘#’ to the end of the line is a comment and is ignored, but do not put a ‘#’ in column 1.
• Identifiers for labels, variables, etc.
can be any combination of alphanumeric characters plus ‘$’, ‘_’ and ‘.’, except for the first character which cannot be numeric:

  Good labels:
    AVeryLongIdentifier   # lower case is different from upper case
    frog$spawn            # dollars allowed in names
    frog.spawn            # ’.’ is also valid
    __peculiar2           # leading underscores often used to
                          # avoid name clashes in C
  Bad labels:
    7down                 # leading decimal
    frog-spawn            # "-" not allowed

• The assembler allows the use of numbers (decimal between 1-99) as a label. These are treated as ‘‘temporary’’, and are “re-usable”. In a branch instruction ‘‘1f’’ (forward) refers to the next ‘‘1:’’ label in the code, and ‘‘1b’’ (back) refers to the last-met ‘‘1:’’ label. This eliminates the need for inventing unique but meaningless names for little branches and loops. Many programmers reserve named labels for subroutine entry points.
• The MIPS Corp. assembler, among others, provides the conventional register names (a0, t5, etc.) as C pre-processor macros; thus, the programmer must pass the source through the C preprocessor and include the file †.
• If the C preprocessor is indeed used, then typically it is permitted to also use C-style /* comments */ and macros.
• Hexadecimal constants are numbers preceded by ‘‘0x’’ or ‘‘0X’’; octal constants must be preceded by ‘‘0’’; be careful not to put a redundant zero on the front of a decimal constant. Constants are:

    0             # strictly octal zero, but who cares?
    0x80000000    # the biggest negative integer
    0377          # 255 decimal, probably what was meant
    08            # illegal (0 implies octal)
    01024         # octal for 532, probably not what was meant

• Pointer values can be used; in a word context, a label or relocatable symbol stands for its address as a 32-bit integer. The identifier ‘.’ (dot) represents the current location counter. Many assemblers even allow some limited arithmetic.

† UNIX is a trademark of Univel Inc.
• Character constants and strings can contain the following special characters, introduced by the backslash ‘\’ escape character:

    character   generated code
    \a          alert (bell)
    \b          backspace
    \e          escape
    \f          formfeed
    \n          newline
    \r          carriage return
    \t          horizontal tab
    \v          vertical tab
    \\          backslash
    \’          single quote
    \"          double quote
    \0          null (integer 0)

  A character can be represented as a one-, two-, or three-digit octal number (\ followed by octal digits), or as a one-, two-, or three-digit hexadecimal number (\x followed by hexadecimal digits).
• The precedence of binary and unary operations in constant expressions follows the C definition.

† In IDT/c version 5.0 and later, the header files exist in the directory “/idtc”. The pre-processor is automatically invoked if the extension of the filename is anything other than “.s”. To force the pre-processor to be used with “.s” files, use the switch “xassemble-with-cpp” in the command line.

REGISTER-TO-REGISTER INSTRUCTIONS

Most MIPS machine instructions are three-register operations, i.e. they are arithmetic or logical functions with two inputs and one output, for example:

    rd = rs + rt

• rd : is the destination register, which receives the result of the operation;
• rs : is a source register (operand);
• rt : is a second source register.

In MIPS assembly language this type of instruction is written:

    opcode rd, rs, rt

For example:

    addu $2, $4, $5             # $2 = $4 + $5

Of course any or all of the register operands may be identical. To produce a CISC-style, two-operand instruction just use the destination register as a source operand; the assembler will do this automatically if rs is omitted.

    addu $4, $5   →   addu $4, $4, $5     # $4 = $4 + $5

Unary operations (e.g. neg, not) are always synthesized from one or more of the three-register instructions.
The assembler expects a maximum of two operands for these instructions (dst and src):

    neg $2, $4    →   sub $2, $0, $4      # $2 = -$4
    not $3        →   nor $3, $0, $3      # $3 = ~$3

Probably the most common register-to-register operation is move. This ubiquitous instruction is in fact implemented by an addu with the always zero-valued register $0:

    move $3, $5   →   addu $3, $5, $0     # $3 = $5

IMMEDIATE (CONSTANT) OPERANDS

An immediate operand is the traditional term for a constant value found in a field of the instruction. Many of the MIPS arithmetic and logical operations have an alternative form which uses a 16-bit immediate in place of rt. The immediate value is first sign-extended or zero-extended to 32 bits, for arithmetic or logical operations respectively.

Although an immediate operand implies a different low-level machine instruction from its three-register version (e.g. addi instead of add), there is no need for the programmer to write this explicitly. The assembler will spot the case when the final operand is an immediate, and use the correct machine instruction. For example:

    add $2, $4, 64   →   addi $2, $4, 64

If an immediate value is too large to fit into the 16-bit field in the machine instruction, then the assembler helps out again. It automatically loads the constant into the assembler temporary register $at/$1 and then performs the operation using that.
    add $4, 0x12345   →   li  $at, 0x12345
                          add $4, $4, $at

Note the li (load immediate) instruction, which again isn’t found in the machine’s instruction set; li is a heavily-used macro instruction which loads a 32-bit integer value into a register, without the programmer having to worry about how it gets there:

• When the 32-bit value lies between ±32K it can use a single addiu with $0; when bits 31-16 are all zero it can use ori; when bits 15-0 are all zero it will be lui; and when none of these is possible it will be an lui/ori pair:

    li $3, -5          →   addiu $3, $0, -5
    li $4, 0x8000      →   ori   $4, $0, 0x8000
    li $5, 0x120000    →   lui   $5, 0x12
    li $6, 0x12345     →   lui   $6, 0x1
                           ori   $6, $6, 0x2345

MULTIPLY/DIVIDE

The multiply and divide machine instructions are unusual:

• they do not accept immediate operands;
• they do not perform overflow or divide-by-zero tests;
• they operate asynchronously – so other instructions can be executed while they do their work;
• they store their results in two separate result registers (hi and lo), which can only be read with the two special instructions mfhi and mflo;
• the result registers are interlocked – they can be read at any time after the operation is started, and the processor will stall until the result is ready.

However the conventional assembler multiply/divide instructions will hide this: they are complex macro instructions which simulate a three-operand instruction and perform overflow checking. A signed divide may generate about 13 instructions, but they execute in parallel with the hardware divider so that no time is wasted (the divide itself takes 35 cycles).

    Instruction   Description
    mul           simple unsigned multiply, no checking
    mulo          signed multiply, checks for overflow above 32 bits
    mulou         unsigned multiply, checks for overflow above 32 bits
    div           signed divide, checks for zero divisor or divisor of -1
                  with most negative dividend
    divu          unsigned divide, checks for zero divisor
    rem           signed remainder, checks for zero divisor or divisor of -1
                  with most negative dividend
    remu          unsigned remainder, checks for zero divisor

Some MIPS assemblers will convert constant multiplication, and division/remainder by constant powers of two, into the appropriate shifts, masks, etc. Don’t rely on this though, as most toolchains expect the compiler or assembly-language programmer to spot this sort of optimization.

To explicitly control the multiplication, specify a dst of $0. The assembler will issue the raw machine instruction to start the operation; it is then up to the programmer to fetch the result from hi and/or lo and, if required, perform overflow checking.

LOAD/STORE INSTRUCTIONS

The following table lists all the assembler’s load/store instructions. The signed load instructions sign-extend the memory data to 32 bits; the unsigned instructions zero-extend.

    Load        Load
    Signed      Unsigned   Store       Description
    lw                     sw          word
    lh          lhu        sh          halfword
    lb          lbu        sb          byte
    ulw                    usw         unaligned word
    ulh         ulhu       ush         unaligned halfword
    lwl                    swl         word left
    lwr                    swr         word right
    l.d                    s.d         double precision floating-point
    l.s, lwc1              s.s, swc1   single precision floating-point
                                       (i.e., coprocessor 1 register)

Don’t forget the architectural constraints of load/store instructions:

• Strict alignment: addresses must be aligned correctly (i.e. a multiple of 4 for words, and 2 for halfwords), except for the special left, right and unaligned variants (described below), or else they will cause an exception.
• Load delay: all load instructions require at least one other instruction between them and the instruction which uses their result – but most assemblers should guarantee this by inserting a nop if necessary.
There is a special exception to this rule for lwl followed immediately by lwr to the same register, or vice versa (the last instruction of the pair will still have the delay slot, but no delay slot is required between the instructions in the pair).

Unaligned loads and stores

As noted above, normal load and store instructions must have a correctly aligned address. This can occasionally cause problems when porting software from CISC architectures which allow unaligned addresses. All data structures that are declared as part of a standard C program will be aligned correctly. But addresses computed at run-time, or data structures declared using a non-standard language extension, may require that software copes with unaligned addresses.

While this can be done by a combination of byte loads, shifts and adds, the MIPS architecture provides the special purpose lwl, lwr, swl and swr instructions. An unaligned word can be accessed using just two of these special instructions as a pair; however they are not usually used directly, but are generated by the ulw (unaligned load word) and usw (unaligned store word) macro instructions.

The ulh, ulhu, and ush unaligned halfword macro instructions do not use the special instructions. Unaligned halfword loads generate two lb’s, a shl and an or (4 instructions); stores generate two sb’s and a shr (3 instructions).

ADDRESSING MODES

As discussed above, the hardware supports only one addressing mode: base_reg+offset, where offset is in the range –32768 to 32767. However the assembler simulates direct and direct+index-reg addressing modes by using two or three machine instructions, and the assembler-temporary register.
    lw $2, ($3)        →   lw   $2, 0($3)
    lw $2, 8+4($3)     →   lw   $2, 12($3)
    lw $2, addr        →   lui  $at, %hi_addr
                           lw   $2, %lo_addr($at)
    sw $2, addr($3)    →   lui  $at, %hi_addr
                           addu $at, $at, $3
                           sw   $2, %lo_addr($at)

The store instruction is written with the source register first and the address second, to look like a load; for other operations the destination is first.

The symbol addr in the above examples can be any of these things:

• a relocatable symbol – the name of a label or variable (whether in this module or elsewhere);
• a relocatable symbol ± a constant expression;
• a 32-bit constant expression (e.g. the absolute address of a device register).

The constructs ‘‘%hi_’’ and ‘‘%lo_’’ do not actually exist in the assembler, but represent the high and low 16-bits of the address. This is not quite the straightforward division into low and high words that it looks, because the 16-bit offset field of a lw is treated as signed. So if the ‘‘addr’’ value is such that bit 15 is a ‘‘1’’, then the %lo_addr value will act as negative, and the assembler needs to increment %hi_addr to compensate:

    addr          %hi_addr   %lo_addr
    0x12345678    0x1234     0x5678
    0x10008000    0x1001     0x8000

The la (load address) macro instruction provides a similar service for addresses as the li instruction provides for integer constants:

    la $2, 4($3)       →   addiu $2, $3, 4
    la $2, addr        →   lui   $at, %hi_addr
                           addiu $2, $at, %lo_addr
    la $2, addr($3)    →   lui   $at, %hi_addr
                           addiu $2, $at, %lo_addr
                           addu  $2, $2, $3

In principle, la could avoid apparently-negative ‘‘%lo_’’ values by using an ori instruction. But the linker has to be able to fix up addresses in the signed ‘‘%lo_’’ format found for load/store instructions – so la uses the add instruction so as to use the same kind of address fixup.
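The adjustment of %hi_addr described above can be checked with a few lines of C. This is only a sketch of the arithmetic, not of any assembler’s or linker’s actual code; the function names hi16, lo16 and rebuild are illustrative.

```c
#include <stdint.h>

/* Low half: the 16 bits placed in the signed offset field. */
uint16_t lo16(uint32_t addr)
{
    return (uint16_t)(addr & 0xFFFFu);
}

/* High half: adding 0x8000 first bumps the result by one whenever
 * bit 15 of addr is set, compensating for the negative low half. */
uint16_t hi16(uint32_t addr)
{
    return (uint16_t)((addr + 0x8000u) >> 16);
}

/* What "lui reg, hi" followed by a load/store (or addiu) with a
 * sign-extended 16-bit offset "lo" actually computes. */
uint32_t rebuild(uint16_t hi, uint16_t lo)
{
    int32_t off = (lo & 0x8000u) ? (int32_t)lo - 0x10000 : (int32_t)lo;
    return ((uint32_t)hi << 16) + (uint32_t)off;
}
```

For the two rows of the table, hi16 and lo16 give 0x1234/0x5678 and 0x1001/0x8000 respectively, and rebuild returns the original address in both cases.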
Gp-relative addressing

Loads and stores to global variables or constants usually require at least two instructions, e.g.:

    lw $2, addr        →   lui  $at, %hi_addr
                           lw   $2, %lo_addr($at)
    sw $2, addr($3)    →   lui  $at, %hi_addr
                           addu $at, $at, $3
                           sw   $2, %lo_addr($at)

A common low-level optimization supported by many toolchains is to use gp-relative addressing. This technique requires the cooperation of the compiler, assembler, linker and run-time start-up code to pool all of the “small” variables and constants into a single region of maximum size 64 Kbytes, and then set register $28 (known as the global pointer or gp register) to point to the middle of this region†. With this knowledge the assembler can reduce the number of instructions used to access any of these small variables, e.g.:

    lw $2, addr        →   lw   $2, addr - _gp($gp)
    sw $2, addr($3)    →   addu $at, $gp, $3
                           sw   $2, addr - _gp($at)

By default most toolchains consider objects less than or equal to 8 bytes in size to be “small”. This limit can usually be controlled by the “-G n” compiler/assembler option; specifying “-G 0” will switch this optimization off altogether.

While it is a useful optimization, there are some pitfalls to beware of:
• The programmer must take special care when writing assembler code to declare global data items correctly:
  a) Writable, initialized data of 8 bytes or less must be put explicitly into the .sdata section.
  b) Global common data must be declared with the correct size, e.g.:
         .comm smallobj, 4
         .comm bigobj, 100
  c) Small external variables should also be explicitly declared, e.g.:
         .extern smallext, 4
  d) Most assemblers are effectively one-pass, so make sure that the program declares data before using it in the code, to get the most out of the optimization.
• In C, global variables must be declared correctly in all modules which use them. For external arrays either omit the size (e.g. extern int extarray[]), or give the correct size (e.g. int cmnarray[NARRAY]).
  Don't just give a dummy size of 1.
• A very large number of small data items or constants may cause the 64 Kbyte limit to be exceeded, causing strange relocation errors when linking. The simplest solution here is to completely disable gp-relative addressing (i.e. use -G 0).
• Some real-time operating systems, and many PROM monitors, can be entered by direct subroutine calls, rather than via a single “system call” interface. This makes it impossible (or at least very difficult) to switch back and forth between the two different values of gp that will be used by the application, and by the o/s or monitor. In this case either the applications or the o/s (but not necessarily both) must be built with -G 0.
• When the -G 0 option has been used for compilation of any set of modules, then it is usually essential that all libraries should also be compiled that way, to avoid relocation errors.

† The actual handling may be toolchain dependent; this is the most common technique.

JUMPS, SUBROUTINE CALLS AND BRANCHES

The MIPS architecture follows Motorola nomenclature:
• PC-relative instructions are called “branch”, and absolute-addressed instructions “jump”; the operation mnemonics begin with a b or j.
• A subroutine call is “jump and link” or “branch and link”, and the mnemonics end ..al.
• All the branch instructions, even branch-and-link, are conditional, testing one or two registers. They are therefore described in the next section. However, unconditional versions can be readily synthesized, e.g.: beq $0, $0, label.

Jump instructions are:
• j: this instruction (jump) transfers control unconditionally to an absolute address. Actually, j doesn't quite manage a 32-bit address; the top 4 address bits of the target are not defined by the instruction and the top 4 bits of the current “PC” value are used instead. Most of the time this doesn't matter: 28 bits still gives a maximum code size of 256 Mbytes.
  It can be argued that it is useful in system software, because it avoids changing the top 3 address bits which select the address segment (described earlier in this manual). To reach a really long way away, use the jr (jump to register) instruction, which is also used for computed jumps.
• jal, jalr: these instructions implement a direct and indirect subroutine call. As well as jumping to the specified address, they store the current pc + 8 in register $31 (ra). Why add 8 to the program counter? Remember that jump instructions, like branches, always execute the following instruction (at pc + 4), so the return address is the instruction after the branch delay slot. Subroutine return is normally done with jr $31.

Position independent subroutine calls can use the bal, bgezal and bltzal instructions.

CONDITIONAL BRANCHES

The MIPS architecture does not include a condition code register. Conditional branch machine instructions test one or two registers; and, together with a small group of compare-and-set instructions, are used to synthesize a complete set of arithmetic conditional branches. Conditional branches are always PC-relative.

Branch instructions are listed below. Again there are architectural considerations:
• Limited branch offset for PC-relative branches: the maximum branch displacement is ±32768 instructions (±128 Kbytes), because a 16-bit field is used for the offset.
• Branch delay slot: the instruction immediately after a branch (or a jump) is always executed, whether or not the branch is taken. Many assemblers will normally hide this from the programmer, and will try to fill the branch delay slot with a useful instruction, or a nop if this is not possible.
• No carry flag: due to the lack of condition codes; if software needs to check for carry, then compare the operands and results to work out when it occurs (typically, this requires only one slt instruction).
• No overflow flag: though the add and subtract instructions are available in an optional form which causes a trap if the result overflows into the sign bit. C compilers typically won't generate those instructions, but Fortran might.

Co-processor conditional branches

There are four pairs of branches, testing true/false on four “coprocessor condition” values CPCOND0-3. In the R3081, CPCOND1 is an internal flag which tests the floating point condition set by the FP compare instructions. Note that the coprocessor must be enabled for the branch instruction to be executed.

COMPARE AND SET

The compare-and-set instructions conform to the C standard; they set their destination to 1 if the condition is true, and zero otherwise. Their mnemonics start with an “s”: so seq rd, rs, rt sets rd to a 1 or zero depending on whether rs is equal to rt. These instructions operate just like any 3-operand MIPS instruction.

Floating point comparisons are done quite differently, and are described in the Floating-Point Accelerator chapter.

COPROCESSOR TRANSFERS

CPU control functions are provided by a set of registers, which the instruction set accesses as “co-processor 0” data registers. These registers deal with catching exceptions and interrupts, and accessing the memory management unit and caches. An R3051 family CPU has at least 12 registers; some have more. There's much more about this in earlier chapters.

The floating point accelerator is “co-processor 1”, and is described in an earlier chapter. It has 16 64-bit registers to hold single- or double-precision FP values, which come apart into 32 32-bit registers when doing loads, stores and transfers to/from the integer registers. There are also two floating point control registers accessed with ctc1, cfc1 instructions.

“Co-processor” instructions are encoded in a standard way, and the assembler doesn't have to know much about what they do.
There is a range of instructions for moving data to and from the coprocessor data and control registers. The assembler expects register numbers written with “$” in front (except for floating point registers, which are called $f0 to $f31); but most toolchains provide a header file for the C preprocessor which provides meaningful names for the CPU control and FP control registers.

The assembler syntax makes no special provisions for “co-processor” registers; so if the program contains “obvious” mistakes (like reversing the CPU and special register names) the assembler will just silently do the wrong thing.

    Instruction             Description
    mfc0 dst, dr            move from CPU control register (to integer register)
    mtc0 src, dr            move to CPU control register (from integer register)
    cfc1 dst, cr            move from fpa control register (to integer register)
    ctc1 src, cr            move to fpa control register (from integer register)
    mfc1 dst, dr            move from FP register to integer register
    mtc1 src, dr            move to FP register from integer register
    swc1 dr, offs(base)     store FP register (to memory)
    lwc1 dr, offs(base)     load FP register (from memory)

Like conventional load instructions, there must always be one instruction after the move before the result can be used (the load-delay slot), whichever direction data is being moved.

Coprocessor Hazards

A pipeline hazard occurs when the architecture definition allows the internal pipelining to “show through” and affect the software: examples being the load and branch delay slots. Most MIPS assemblers will usually shield the programmer from hazards by moving instructions around or inserting NOPs, to ensure that the code executes as written.

However, some CPU control register writes have side-effects which require pipeline-aware programming; since most assemblers don't understand anything about what these instructions are doing, they may not help.
One outstanding example is the use of interrupt control fields in the Status and Cause registers. In these cases the programmer must account for any side-effects, and the fact that they are delayed for up to three instructions. For example, after an mtc0 to the Status register which changes an interrupt mask bit, it will be two further instructions before the interrupt is actually enabled or disabled. The same is also true when enabling or disabling floating-point coprocessor instructions (i.e. changing the CU1 bit).

Coping with these situations usually requires the programmer to take explicit action to prevent the assembler from scheduling inappropriate instructions after a dangerous mtc0. This is done by using the .set noreorder directive, discussed below.

A comprehensive summary of pipeline hazards can be found later in this chapter.

ASSEMBLER DIRECTIVES

Sections

The names of, and support for, different code and data sections are likely to differ from one toolchain to another. Most will at least support the original MIPS conventions, which are illustrated (for ROMable programs) by Figure 9.1, “Program segments in memory”.

Within an assembler program the sections are selected as shown in Figure 9.1, “Program segments in memory”.

.text, .rdata, .data

Simply put the appropriate section name before the data or instructions, for example:

            .rdata
    msg:    .asciiz "Hello world!\n"

            .data
    table:  .word 1
            .word 2
            .word 3

            .text
    func:   sub sp, 64
            ...

.lit4, .lit8

These sections cannot be selected explicitly by the programmer. They are read-only data sections used implicitly by the assembler to hold floating-point constants which are given as arguments to the li.s or li.d macro instructions. Some assemblers and linkers will save space by combining identical constants.

[Figure 9.1 memory map, highest address first; boundary symbols and addresses are flush left, sections are indented]

    ROM
    etext
        .rdata      read-only data
        .text       program code
    _ftext (1fc0000)

    RAM
    ????????
        stack       grows down from top of memory
        heap        grows up towards stack
    end
        .bss        uninitialized writable data
        .sbss       uninitialized writable small data
    _fbss
    edata
        .lit8       64-bit floating point constants
        .lit4       32-bit floating point constants
        .sdata      writable small data
        .data       writable data
    _fdata (00000200)
        exception vectors
    00000000

Figure 9.1: Program segments in memory

.bss

This section is used to collect uninitialized data, the equivalent of C and Fortran's common data. An uninitialized object is declared, together with its size. The linker then allocates space for it in the .bss section, using the maximum size from all those modules which declare it. If any module declares it in a real, initialized data section, then all the sizes are ignored and that definition is used.

    .comm  dbgflag, 4       # global common variable, 4 bytes
    .lcomm sum, 4           # local common variable, 4 bytes
    .lcomm array, 100       # local common variable, 100 bytes

“Uninitialized” is actually a misnomer: although these sections occupy no space in the object file, the run-time start-up code or operating system must clear the .bss area to zero before entering the program; most C programs will rely on this behavior. Many toolchains will accommodate this need through the start-up file provided with the tool, to be linked with the user program†.

.sdata, .sbss

These sections are equivalent to the .data and .bss sections above, but are used in some toolchains to hold small‡ data objects. This was described earlier in this chapter, when the use of the gp was discussed.

Stack and heap

The stack and heap are not real sections that are recognized by the assembler or linker. Typically they are initialized and maintained by the run-time system by setting the sp register to the top of physical memory (aligned to an 8-byte boundary), and setting the initial heap pointer (used by the malloc functions) to the address of the end symbol.
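Two of the start-up duties described above, clearing the .bss area and aligning the initial stack pointer to an 8-byte boundary, can be modeled in a few lines of C. This is only a sketch under stated assumptions: real start-up code is normally written in assembler and would use the linker-defined symbols (such as _fbss and end) rather than pointer arguments, and the function names align_down8 and clear_region are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Round a 32-bit top-of-memory address down to an 8-byte boundary,
 * as is typically done before loading it into sp. */
uint32_t align_down8(uint32_t top) {
    return top & ~(uint32_t)7;
}

/* Zero every byte in [start, end); a real start-up file would pass
 * the bounds of the .sbss/.bss area here (hypothetical interface). */
void clear_region(unsigned char *start, unsigned char *end) {
    assert(start <= end);
    while (start < end) {
        *start++ = 0;
    }
}
```

A start-up routine built along these lines would clear the region before any C code runs, since most C programs rely on uninitialized globals reading as zero.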
Special symbols

Figure 9.1, “Program segments in memory” also shows a number of special symbols which are automatically defined by the linker to allow programs to discover the start and end of their various sections. Some of these are part of the normal UNIX†† environment expected by many programs; others are specific to the MIPS environment.

    Symbol    Standard?    Value
    _ftext                 start of text (code) segment
    etext     ✓            end of text (code) segment
    _fdata                 start of initialized data segment
    edata     ✓            end of initialized data segment
    _fbss                  start of uninitialized data segment
    end       ✓            end of uninitialized data segment

Data definition and alignment

Having selected the correct section, the data objects themselves are specified using the directives described in this section.

† IDT/c provides this code in the file “/idtc/idt_csu.S”.
‡ The default for “small” is 8 bytes. This number can be changed with the “-G” compiler/assembler switch.
†† UNIX is a trademark of Univel Inc.

.byte, .half, .word

These directives output integers which are 1, 2, or 4 bytes long, respectively. A list of values may be given, separated by commas. Each value may be repeated a number of times by following it with a colon and a repeat count. For example:

    .byte 3                 # 1 byte: 3
    .half 1, 2, 3           # 3 halfwords: 1 2 3
    .word 5 : 3, 6, 7       # 5 words: 5 5 5 6 7

Note that the section's location counter is automatically aligned to the appropriate boundary before the data is emitted. To actually emit unaligned data, explicit action must be taken using the .align directive described below.

.float, .double

These output single or double precision floating-point values, respectively. Multiple values and repeat counts may be used in the same way as the integer directives.

    .float  1.4142175       # 1 single-precision value
    .double 1e+10, 3.1415   # 2 double-precision values

.ascii, .asciiz

These directives output ASCII strings, either without or with a terminating null character respectively.
The following example outputs two identical strings:

    .ascii  "Hello\0"
    .asciiz "Hello"

.align

This directive allows the programmer to specify an alignment greater than that which would normally be required for the next data directive. The alignment is specified as a power of two, for example:

            .align 4        # align to 16-byte boundary (2^4)
    var:    .word 0

If a label (var in this case) comes immediately before the .align, then the label will still be aligned correctly. For example, the following is exactly equivalent to the above:

    var:    .align 4        # align to 16-byte boundary (2^4)
            .word 0

For “packed” data structures this directive allows the programmer to override the automatic alignment feature of .half, .word, etc., by specifying a zero alignment. This will stay in effect until the next section change. For example:

    .half 3                 # correctly aligned halfword
    .align 0                # switch off auto-alignment
    .word 100               # word aligned on halfword boundary

.comm, .lcomm

These directives declare a common, or uninitialized, data object by specifying the object's name and size.

An object declared with .comm is shared between all modules which declare it: it is allocated space by the linker, which uses the largest declared size. If any module declares it in one of the initialized .data, .sdata or .rdata sections, then all the sizes are ignored and the initialized definition is used instead†.

An object declared with .lcomm is local to the current module, and is allocated space in the “uninitialized” .bss (or .sbss) section by the assembler.
    .comm  dbgflag, 4       # global common variable, 4 bytes
    .lcomm array, 100       # local uninitialized object, 100 bytes

.space

The .space directive increments the current section's location counter by a number of bytes, for example:

    struc:  .word 3
            .space 120      # 120 byte gap
            .word -1

For normal data and text sections it just emits that many zero bytes, but in assemblers which allow the programmer to declare new sections with labels but no real content (like .bss), it will just increment the location counter without emitting any data.

Symbol binding attributes

Symbols (i.e. labels in one of the code or data segments) can be made visible and used by the linker which joins separate modules into a single program. The linker binds a symbol to an address and substitutes the address for assembler-language references to the symbol.

Symbols can have three levels of visibility:
• Local: invisible outside the module they are declared in, and unused by the linker. The programmer does not need to worry about whether the same local symbol name is used in another module.
• Global: made public for use by the linker. Programs can refer to a global symbol in another module without defining any local space for it, using the .extern directive.
• Weak global: obscure feature provided by some toolchains. This allows the programmer to arrange that a symbol nominally referring to a locally-defined space will actually refer to a global symbol, if the linker finds one. If the linked program has no global symbol with that name, the local version is used instead.

The preferred programming practice is to use the .comm directive whenever possible.

.globl

Unlike C, where module-level data and functions are automatically global unless declared with the static keyword, all assembler labels have local binding unless explicitly modified by the .globl directive.
To define a label as having global binding that is visible to other modules, use the directive as follows:

            .data
            .globl status           # global variable
    status: .word 0

            .text
            .globl set_status       # global function
    set_status:
            subu sp,24
            ...

Note that .globl is not required for objects declared with the .comm directive; these automatically have global binding.

† The actual handling may be toolchain dependent; this is the most common technique.

.extern

All references to labels which are not defined within the current module are automatically assumed to be references to globally-bound symbols in another module (i.e. external symbols). In some cases the assembler can generate better code if it knows how big the referenced object is (e.g. the global pointer, described earlier). An external object's size is specified using the .extern directive, as follows:

    .extern index, 4
    .extern array, 100
    lw $3, index            # load a 4 byte (1 word) external
    lw $2, array($3)        # load part of a 100 byte external
    sw $2, value            # store in an unknown size external

.weakext

Some assemblers and toolchains support the concept of weak global binding. This allows the program to specify a provisional binding for a symbol, which may be overridden if a normal, or strong, global definition is encountered. For example:

            .data
            .weakext errno
    errno:  .word 0

            .text
            lw $2,errno             # may use local or external definition

This module, and others which access errno, will use this local definition of errno, unless some other module also defines it with a .globl.

It is also possible to declare a local variable with one name, but make it weakly global with a different name:

            .data
    myerrno: .word 0
            .weakext errno, myerrno

            .text
            lw $2,myerrno           # always use local definition
            lw $2,errno             # may use local definition, or other

Function directives

Some MIPS assemblers expect the programmer to mark the start and end of each function, and describe the stack frame which it uses.
In some toolchains this information is used by the debugger to perform stack backtraces and the like.

.ent, .end

These directives mark the start and end of a function. A trivial leaf function might look like this:

            .text
            .ent localfunc
    localfunc:
            addu v0,a0,a1           # return (arg1 + arg2)
            j ra
            .end localfunc

The label name may be omitted from the .end directive, which then defaults to the name used in the last .ent. Specifying the name explicitly allows the assembler to check that the programmer did not miss earlier .ent or .end directives.

.aent

Some functions may provide multiple, alternative entry-points. The .aent directive identifies labels as such. For example:

            .text
            .globl memcpy
            .ent memcpy
    memcpy: move t0,a0              # swap first two arguments
            move a0,a1
            move a1,t0

            .globl bcopy
            .aent bcopy
    bcopy:  lb t0,0(a0)             # very slow byte copy
            sb t0,0(a1)
            addu a0,1
            addu a1,1
            subu a2,1
            bne a2,zero,bcopy
            j ra
            .end memcpy

.frame, .mask, .fmask

Most functions need to allocate a stack frame in which to:
• save the return address register ($31);
• save any of the registers s0 - s8 and $f20 - $f31 which they modify (known as the callee-saves registers);
• store local variables and temporaries;
• pass arguments to other functions.

In some CISC architectures the stack frame allocation, and possibly register saving, is done by special purpose enter and leave instructions, but in the MIPS architecture it is coded by the compiler or assembly-language programmer. However, debuggers need to know the layout of each stack frame to do stack backtraces and the like, and in the original MIPS Corp. toolchain these directives provided this information; in other toolchains they may be quietly ignored, and the stack layout determined at run-time by disassembling the function prologue. Putting them in the code is therefore not always essential, but does no harm and may make the code more portable.
Many toolchains supply a header file, which provides C-style macros to generate the appropriate directives, as required (the procedure call protocol, and stack usage, is described in a later chapter).

The .frame directive takes 3 operands:
• framereg: the register used to access the local stack frame – usually $sp.
• returnreg: the register which holds the return address. Usually this is $0, which indicates that the return address is stored in the stack frame, or $31 if this is a leaf function (i.e. it doesn't call any other functions) and the return address is not saved.
• framesize: the total size of stack frame allocated by this function; it should always be the case that $sp + framesize = previous $sp.

    .frame framereg, framesize, returnreg

The .mask directive indicates where the function saves general registers in the stack frame; .fmask does the same for floating-point registers. Their first argument is regmask, a bitmap of which registers are being saved (i.e. bit 1 set = $1, bit 2 set = $2, etc.); the second argument is regoffset, the distance from framereg + framesize to the start of the register save area.

    .mask  regmask, regoffset
    .fmask fregmask, fregoffs

How these directives relate to the stack frame layout, and examples of their use, can be found in the next chapter. Remember that the directives do not create the stack frame, they just describe its layout; that code still has to be written explicitly by the compiler or assembly-language programmer.

Assembler control (.set)

The original MIPS Corp. assembler is an ambitious program which performs intelligent macro expansion of synthetic instructions, delay-slot filling, peephole optimization, and sophisticated instruction reordering, or scheduling, to minimize pipeline stalls. Many assemblers will be less complex: modern optimizing compilers usually prefer to do this sort of optimization themselves.
However, in the interests of source code compatibility, and to make the programmer's life easier, most MIPS assemblers perform macro expansion, insert extra nops as required to hide branch and load delay-slots, and prevent pipeline hazards in normal code (pipeline hazards are described in detail later).

With a reordering assembler it is sometimes necessary to restrict the reordering, to guarantee correct timing, or to account for side-effects of instructions which the assembler cannot know about (e.g. enabling and disabling interrupts). The .set directives provide this control.

.set noreorder/reorder

By default most assemblers are in reorder mode, which allows them to reorder instructions to avoid pipeline hazards and (perhaps) to achieve better performance; in this mode it will not allow the programmer to insert nops. Conversely, code that is in a noreorder region will not be optimized or changed in any way. This means that the programmer can completely control the instruction order, but the downside is that the code must now be scheduled manually, and delay slots filled with useful instructions or nops. For example:

    .set noreorder
    lw t0, 0(a0)
    nop                     # LDSLOT
    subu t0, 1
    bne t0, zero, loop
    nop                     # BDSLOT
    .set reorder

.set volatile/novolatile

Any load or store instruction within a volatile region will not be moved with respect to other loads and stores. This can be important for accesses to memory mapped device registers, where the order of reads and writes is important. For example, if the following code fragment did not use .set volatile, then the assembler might decide to move the second lw before the sw, to fill the first load delay-slot. Hazard avoidance and other optimizations are not affected by this option.
    .set volatile
    lw t0,0(a0)
    sw t0,0(a1)
    lw t1,4(a0)
    .set novolatile

.set noat/at

The assembler reserves register $1 (known as the assembler temporary, or $at register) to hold intermediate values when performing macro expansions; if code attempts to use the register, a warning or error message will be issued. It is not always obvious when the assembler will use $at, and there are certain circumstances when the programmer may need to ensure that it does not (for example in exception handlers before $1 has been saved). Switching on noat will make the assembler generate an error message if it needs to use $1 in a macro instruction, and allows the programmer to use it explicitly without receiving warnings. For example:

    xcptgen:
            .set noat
            subu k0,sp,XCP_SIZE
            sw $at,XCP_AT(k0)
            .set at

.set nomacro/macro

Most of the time the programmer will not care whether an assembler statement generates more than one real machine instruction, but of course there are exceptions. For instance, when manually filling a branch delay-slot in a noreorder region, it would almost certainly be wrong to use a complex macro instruction; if the branch was taken, only the first instruction of the macro would be executed. Switching on nomacro will cause a warning if any statement expands to more than one machine instruction. For example, compare the following two code fragments:

    .set noreorder
    blt a1,a2,loop
    .set nomacro
    li a0,0x1234            # BDSLOT
    .set macro
    .set reorder

    .set noreorder
    blt a1,a2,loop
    .set nomacro
    li a0,0x12345           # BDSLOT
    .set macro
    .set reorder

The first will assemble successfully, but the second will generate an assembler error message, because its li is expanded into two machine instructions (lui and ori). Some assemblers will catch this mistake automatically.

.set nobopt/bopt

Setting the nobopt control prevents the assembler from carrying out certain types of branch optimization. It is usually used only by compilers.
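Returning to the nomacro fragments above: whether an li is safe in a branch delay slot depends only on the constant being loaded. The sketch below models, in C, the rule that such a constant needs one machine instruction if it can be built by a single ori, addiu or lui, and two (lui plus ori) otherwise. This is an assumption about typical assembler behavior rather than a statement about any particular toolchain, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Number of machine instructions a typical assembler is assumed to
 * need for "li rd, value" (hypothetical helper, for illustration). */
int li_instruction_count(uint32_t value) {
    uint32_t upper = value >> 16;
    uint32_t lower = value & 0xFFFFu;

    if (upper == 0) {
        return 1;           /* ori rd, $zero, lower */
    }
    if (upper == 0xFFFFu && (lower & 0x8000u) != 0) {
        return 1;           /* addiu rd, $zero, negative 16-bit value */
    }
    if (lower == 0) {
        return 1;           /* lui rd, upper */
    }
    return 2;               /* lui rd, upper ; ori rd, rd, lower */
}

/* The two constants used in the nomacro fragments above. */
void check_fragments(void) {
    assert(li_instruction_count(0x1234u) == 1);
    assert(li_instruction_count(0x12345u) == 2);
}
```

Under these assumptions the first fragment's constant fits a single instruction, while the second needs two and is therefore rejected inside the nomacro region.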
THE COMPLETE GUIDE TO ASSEMBLER INSTRUCTIONS

Table 9.2, “Assembler instructions” below shows, for every mnemonic defined by the MIPS assemblers for the R3000 (MIPS 1) instruction set, how it is likely to be implemented, and what it does.

Some naming conventions in the assembler may appear confusing:
• Unsigned versions: a “u” suffix on the assembler mnemonic is usually to be read as “unsigned”. Usually this follows the conventional meaning; but the most common u-suffix instructions are addu and subu: and here the u means that overflow into the sign bit will not cause a trap. Regular add is never generated by C compilers. Many compilers, not expecting there to be a run-time system to handle overflow traps, will always use the “u” variant. However, because the integer multiply instructions mult and multu generate 64-bit results, the signed and unsigned versions are really different – and neither of the machine instructions produces a trap under any circumstances.
• Immediate operands: as mentioned above, the programmer can use immediate operands with most instructions (e.g. add rd, rs, 1); quite a few arithmetic/logic instructions really do have “immediate” versions (called addi etc.). Most assemblers do not require the programmer to explicitly know which machine instructions support immediate variants.
• Building addresses, %lo_ and %hi_: synthesis of addressing modes was described earlier. The table typically will list only one address-mode variant for each instruction.
• What it does: the function of each instruction is described using “C” expression syntax; it is easy to get a rough idea, but a thorough knowledge of C allows the exact behavior to be understood.
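The remark above that mult and multu “are really different” can be illustrated in the same C notation the table uses. The following is a minimal host-side sketch, assuming a model in which the 64-bit product is split into the hi and lo words; all function names are hypothetical. It also shows why a multiply that keeps only the low word can use either instruction: the low 32 bits of the signed and unsigned products always agree.

```c
#include <assert.h>
#include <stdint.h>

/* Sign-extend a 32-bit register value to 64 bits using only
 * well-defined arithmetic (no implementation-defined casts). */
int64_t sext32(uint32_t v) {
    return (int64_t)(v ^ 0x80000000u) - (int64_t)0x80000000u;
}

/* 64-bit result of a signed multiply (models mult). */
uint64_t mult_result(uint32_t rs, uint32_t rt) {
    return (uint64_t)(sext32(rs) * sext32(rt));
}

/* 64-bit result of an unsigned multiply (models multu). */
uint64_t multu_result(uint32_t rs, uint32_t rt) {
    return (uint64_t)rs * (uint64_t)rt;
}

uint32_t hi_word(uint64_t product) { return (uint32_t)(product >> 32); }
uint32_t lo_word(uint64_t product) { return (uint32_t)product; }

/* The low words of the signed and unsigned products always match. */
void check_low_words_agree(uint32_t rs, uint32_t rt) {
    assert(lo_word(mult_result(rs, rt)) == lo_word(multu_result(rs, rt)));
}
```

For example, with rs = 0xFFFFFFFF and rt = 2 the signed product is -2 (high word 0xFFFFFFFF), while the unsigned product is 0x1FFFFFFFE (high word 1); the low word is 0xFFFFFFFE in both cases.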
The assembler descriptions use the following conventions:

    Word                 Used for
    rs,rt                CPU registers used as operands
    rd                   CPU register which receives the result
    fs,ft                floating point register operands
    fd                   floating point register which receives the result
    imm                  16-bit “immediate” constant
    label                the name of an entry point in the instruction stream
    addr                 one of a number of different address expressions
    %hi_addr, %lo_addr   where addr is a symbol defined in the data segment, “%hi_addr” and
                         “%lo_addr” are as described above; that is, they are the high and low
                         parts of the value which can be used in an lui/addiu sequence.
    %gpoff_addr          the offset in the “small data” segment of an address
    $at                  register $1, the “assembler temporary” register
    $zero                register $0, which always contains a zero value
    $ra                  the “return address” register $31
    RETURN               the point to which control returns after a subroutine call; this is the
                         next instruction but one after the branch/jump to subroutine, and is
                         normally loaded into $ra by the “.. and link” instructions.
    trap(CAUSE, code)    Take a CPU trap; “CAUSE” determines the setting of the Cause register,
                         and “code” is a value not interpreted by the hardware, but which system
                         software can obtain by looking at the trap instruction. CAUSE values can
                         be BREAK; FPINT (for floating point exception); SYSCALL.
    unordered(fs,ft)     some exceptional floating point values cannot be sensibly compared; it is
                         not sensible to ask whether one NaN is bigger than another (NaN, “not a
                         number”, is produced when the result of an operation is not defined). The
                         IEEE754 standard requires that for such a pair “fs < ft”, “fs == ft” and
                         “fs > ft” shall all be false. “unordered(fs,ft)” returns true for an
                         unordered pair, false otherwise.
    fpcond               the floating point “condition bit” found in the FP control/status
                         register, and tested by the bc1f and bc1t instructions.

Table 9.1: Assembler register and identifier conventions
Table 9.1: Assembler register and identifier conventions

Table 9.2: Assembler instructions

Assembler           Expands To              What it does

move rd,rs          addu rd,rs,$zero        rd = rs;

Branch (PC-relative, all conditional)

b label             beq $zero,$zero,label   goto label;
beq rs,rt,label                             if (rs == rt) goto label;
bge rs,rt,label     slt $at,rs,rt           if ((signed) rs >= (signed) rt) goto label;
                    beq $at,$zero,label
bgeu rs,rt,label    sltu $at,rs,rt          if ((unsigned) rs >= (unsigned) rt) goto label;
                    beq $at,$zero,label
bgt rs,rt,label     slt $at,rt,rs           if ((signed) rs > (signed) rt) goto label;
                    bne $at,$zero,label
bgtu rs,rt,label    sltu $at,rt,rs          if ((unsigned) rs > (unsigned) rt) goto label;
                    bne $at,$zero,label
ble rs,rt,label     slt $at,rt,rs           if ((signed) rs <= (signed) rt) goto label;
                    beq $at,$zero,label
bleu rs,rt,label    sltu $at,rt,rs          if ((unsigned) rs <= (unsigned) rt) goto label;
                    beq $at,$zero,label
blt rs,rt,label     slt $at,rs,rt           if ((signed) rs < (signed) rt) goto label;
                    bne $at,$zero,label
bltu rs,rt,label    sltu $at,rs,rt          if ((unsigned) rs < (unsigned) rt) goto label;
                    bne $at,$zero,label
bne rs,rt,label                             if (rs != rt) goto label;
beqz rs,label       beq rs,$zero,label      if (rs == 0) goto label;
bgez rs,label                               if ((signed) rs >= 0) goto label;
bgtz rs,label                               if ((signed) rs > 0) goto label;
blez rs,label                               if ((signed) rs <= 0) goto label;
bltz rs,label                               if ((signed) rs < 0) goto label;
bnez rs,label       bne rs,$zero,label      if (rs != 0) goto label;
bal label           bgezal $zero,label      ra = RETURN; goto label;
bgezal rs,label                             if ((signed) rs >= 0) { ra = RETURN; goto label; }
bltzal rs,label                             if ((signed) rs < 0) { ra = RETURN; goto label; }

Unary arithmetic/logic instructions

abs rd,rs           sra $at,rs,31           rd = rs < 0 ? -rs : rs;
                    xor rd,rs,$at
                    sub rd,rd,$at
abs rd              sra $at,rd,31           rd = rd < 0 ? -rd : rd;
                    xor rd,rd,$at
                    sub rd,rd,$at
neg rd,rs           sub rd,$zero,rs         rd = -rs; /* trap on overflow */
neg rd              sub rd,$zero,rd         rd = -rd; /* trap on overflow */
negu rd,rs          subu rd,$zero,rs        rd = -rs; /* no trap */
negu rd             subu rd,$zero,rd        rd = -rd; /* no trap */
not rd,rs           nor rd,rs,$zero         rd = ~rs;
not rd              nor rd,rd,$zero         rd = ~rd;

Binary arithmetic/logical operations

add rd,rs,rt                                rd = rs + rt; /* trap on overflow */
add rd,rs           add rd,rd,rs            rd += rs; /* trap on overflow */
addu rd,rs,rt                               rd = rs + rt; /* no trap on overflow */
addu rd,rs          addu rd,rd,rs           rd += rs; /* no trap on overflow */
and rd,rs,rt                                rd = rs & rt;
and rd,rs           and rd,rd,rs            rd &= rs;
div rd,rs,rt        div rs,rt               rd = rs/rt;
                    bne rt,$zero,1f         /* trap divide by zero */
                    nop                     /* trap overflow conditions */
                    break 7
                    1: li $at,-1
                    bne rt,$at,2f
                    nop
                    lui $at,0x8000
                    bne rs,$at,2f
                    nop
                    break 6
                    2: mflo rd
div rd,rs           as above                rd = rd/rs; /* trap on errors */
divu rd,rs,rt       divu rs,rt              rd = rs/rt;
                    bne rt,$zero,1f         /* trap on divide by zero */
                    nop                     /* no check for overflow */
                    break 7
                    1: mflo rd
or rd,rs,rt                                 rd = rs | rt;
mul rd,rs,rt        multu rs,rt             rd = rs*rt; /* no checks */
                    mflo rd
mulo rd,rs,rt       mult rs,rt              rd = rs * rt; /* signed */
                    mflo rd                 /* trap on overflow */
                    sra rd,rd,31
                    mfhi $at
                    beq rd,$at,1f
                    nop
                    break 6
                    1: mflo rd
mulou rd,rs,rt      multu rs,rt             rd = (unsigned) rs * rt;
                    mfhi $at                /* trap on overflow */
                    mflo rd
                    beq $at,$zero,1f
                    nop
                    break 6
                    1:
nor rd,rs,rt                                rd = ~(rs | rt);
rem rd,rs,rt        div rs,rt               rd = rs%rt;
                    bne rt,$zero,1f         /* trap if rt == 0 */
                    nop                     /* trap if it will overflow */
                    break 7
                    1: li $at,-1
                    bne rt,$at,2f
                    nop
                    lui $at,0x8000
                    bne rs,$at,2f
                    nop
                    break 6
                    2: mfhi rd
remu rd,rs,rt       divu rs,rt              /* unsigned operation, ignore overflow */
                    bne rt,$zero,1f         rd = rs%rt;
                    nop                     /* trap if rt == 0 */
                    break 7
                    1: mfhi rd
rol rd,rs,rt        negu $at,rt             /* rd = rs rotated left by rt */
                    srlv $at,rs,$at
                    sllv rd,rs,rt
                    or rd,rd,$at
ror rd,rs,rt        negu $at,rt             /* rd = rs rotated right by rt */
                    sllv $at,rs,$at
                    srlv rd,rs,rt
                    or rd,rd,$at
seq rd,rs,rt        xor rd,rs,rt            rd = (rs == rt) ? 1 : 0;
                    sltiu rd,rd,1
sge rd,rs,rt        slt rd,rs,rt            rd = ((signed) rs >= (signed) rt) ? 1 : 0;
                    xori rd,rd,1
sgeu rd,rs,rt       sltu rd,rs,rt           rd = ((unsigned) rs >= (unsigned) rt) ? 1 : 0;
                    xori rd,rd,1
sgt rd,rs,rt        slt rd,rt,rs            rd = ((signed) rs > (signed) rt) ? 1 : 0;
sgtu rd,rs,rt       sltu rd,rt,rs           rd = ((unsigned) rs > (unsigned) rt) ? 1 : 0;
sle rd,rs,rt        slt rd,rt,rs            rd = ((signed) rs <= (signed) rt) ? 1 : 0;
                    xori rd,rd,1
sleu rd,rs,rt       sltu rd,rt,rs           rd = ((unsigned) rs <= (unsigned) rt) ? 1 : 0;
                    xori rd,rd,1
slt rd,rs,rt                                rd = ((signed) rs < (signed) rt) ? 1 : 0;
sltu rd,rs,rt                               rd = ((unsigned) rs < (unsigned) rt) ? 1 : 0;
sne rd,rs,rt        xor rd,rs,rt            rd = (rs != rt) ? 1 : 0;
                    sltu rd,$zero,rd
sll rd,rs,rt        sllv rd,rs,rt           rd = rs <