Intel® XScale™ Microarchitecture
for the PXA255 Processor
User's Manual
March, 2003
Order Number: 278796
ii Intel® XScale™ Microarchitecture User’s Manual
Information in this document is provided in connection with Intel® products. No license, express or implied, by estoppel or otherwise, to any
intellectual property rights is granted by this document. Except as provided in Intel's Terms and Conditions of Sale for such products, Intel assumes no
liability whatsoever, and Intel disclaims any express or implied warranty, relating to sale and/or use of Intel® products including liability or warranties
relating to fitness for a particular purpose, merchantability, or infringement of any patent, copyright or other intellectual property right. Intel products are
not intended for use in medical, life saving, or life sustaining applications. Intel may make changes to specifications and product descriptions at any
time, without notice.
Designers must not rely on the absence or characteristics of any features or instructions marked “reserved” or “undefined.” Intel reserves these for
future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them.
The Intel® XScale™ Microarchitecture Users Manual for the PXA255 processor may contain design defects or errors known as errata
which may cause the product to deviate from published specifications. Current characterized errata are available on request.
Contact your local Intel sales office or your distributor to obtain the latest specifications before placing your product order.
Copies of documents which have an ordering number and are referenced in this document, or other Intel literature, may be obtained by calling
1-800-548-4725 or by visiting Intel's website at http://www.intel.com.
Copyright © Intel Corporation, 2003
* Other names and brands may be claimed as the property of others.
ARM and StrongARM are registered trademarks of ARM, Ltd.
Contents
1 Introduction...................................................................................................................................1-1
1.1 About This Document ........................................................................................................1-1
1.1.1 How to Read This Document................................................................................1-1
1.1.2 Other Relevant Documents ..................................................................................1-1
1.2 High-Level Overview of the Intel® XScale™ Core as Implemented in the Application Processors ......................................................1-2
1.2.1 ARM* Compatibility...............................................................................................1-3
1.2.2 Features................................................................................................................1-3
1.2.2.1 Multiply/Accumulate (MAC)...................................................................1-3
1.2.2.2 Memory Management ...........................................................................1-4
1.2.2.3 Instruction Cache ..................................................................................1-4
1.2.2.4 Branch Target Buffer.............................................................................1-4
1.2.2.5 Data Cache ...........................................................................................1-4
1.2.2.6 Fill Buffer & Write Buffer .......................................................................1-5
1.2.2.7 Performance Monitoring........................................................................1-5
1.2.2.8 Power Management..............................................................................1-5
1.2.2.9 Debug ...................................................................................................1-5
1.3 Terminology and Conventions ...........................................................................................1-6
1.3.1 Number Representation........................................................................................1-6
1.3.2 Terminology and Acronyms ..................................................................................1-6
2 Programming Model .....................................................................................................................2-1
2.1 ARM* Architecture Compatibility........................................................................................2-1
2.2 ARM* Architecture Implementation Options ......................................................................2-1
2.2.1 Big Endian versus Little Endian ............................................................................2-1
2.2.2 Thumb...................................................................................................................2-1
2.2.3 ARM* DSP-Enhanced Instruction Set...................................................................2-2
2.2.4 Base Register Update...........................................................................................2-2
2.3 Extensions to ARM* Architecture.......................................................................................2-2
2.3.1 DSP Coprocessor 0 (CP0)....................................................................................2-3
2.3.1.1 Multiply With Internal Accumulate Format ............................................2-3
2.3.1.2 Internal Accumulator Access Format ....................................................2-6
2.3.2 New Page Attributes .............................................................................................2-9
2.3.3 Additions to CP15 Functionality..........................................................................2-10
2.3.4 Event Architecture ..............................................................................................2-11
2.3.4.1 Exception Summary............................................................................2-11
2.3.4.2 Event Priority.......................................................................................2-11
2.3.4.3 Prefetch Aborts ...................................................................................2-12
2.3.4.4 Data Aborts .........................................................................................2-12
2.3.4.5 Events from Preload Instructions ........................................................2-14
2.3.4.6 Debug Events .....................................................................................2-15
3 Memory Management...................................................................................................................3-1
3.1 Overview............................................................................................................................3-1
3.2 Architecture Model.............................................................................................................3-1
3.2.1 Version 4 vs. Version 5.........................................................................................3-2
3.2.2 Instruction Cache..................................................................................................3-2
3.2.3 Data Cache and Write Buffer................................................................................3-2
3.2.4 Details on Data Cache and Write Buffer Behavior................................................3-3
3.2.5 Memory Operation Ordering .................................................................................3-3
3.2.6 Exceptions ............................................................................................................3-4
3.3 Interaction of the MMU, Instruction Cache, and Data Cache ............................................3-4
3.4 Control ...............................................................................................................................3-4
3.4.1 Invalidate (Flush) Operation .................................................................................3-4
3.4.2 Enabling/Disabling ................................................................................................3-5
3.4.3 Locking Entries .....................................................................................................3-5
3.4.4 Round-Robin Replacement Algorithm ..................................................................3-7
4 Instruction Cache..........................................................................................................................4-1
4.1 Overview............................................................................................................................4-1
4.2 Operation...........................................................................................................................4-2
4.2.1 Instruction Cache is Enabled ................................................................................4-2
4.2.2 The Instruction Cache Is Disabled........................................................................4-2
4.2.3 Fetch Policy ..........................................................................................................4-2
4.2.4 Round-Robin Replacement Algorithm ..................................................................4-3
4.2.5 Parity Protection ...................................................................................................4-3
4.2.6 Instruction Fetch Latency......................................................................................4-4
4.2.7 Instruction Cache Coherency ...............................................................................4-4
4.3 Instruction Cache Control ..................................................................................................4-5
4.3.1 Instruction Cache State at RESET .......................................................................4-5
4.3.2 Enabling/Disabling ................................................................................................4-5
4.3.3 Invalidating the Instruction Cache.........................................................................4-5
4.3.4 Locking Instructions in the Instruction Cache .......................................................4-6
4.3.5 Unlocking Instructions in the Instruction Cache....................................................4-7
5 Branch Target Buffer ....................................................................................................................5-1
5.1 Branch Target Buffer (BTB) Operation ..............................................................................5-1
5.1.1 Reset ....................................................................................................................5-2
5.1.2 Update Policy........................................................................................................5-2
5.2 BTB Control .......................................................................................................................5-2
5.2.1 Disabling/Enabling ................................................................................................5-2
5.2.2 Invalidation............................................................................................................5-3
6 Data Cache...................................................................................................................................6-1
6.1 Overviews ..........................................................................................................................6-1
6.1.1 Data Cache Overview...........................................................................................6-1
6.1.2 Mini-Data Cache Overview...................................................................................6-2
6.1.3 Write Buffer and Fill Buffer Overview....................................................................6-3
6.2 Data Cache and Mini-Data Cache Operation ....................................................................6-4
6.2.1 Operation When Caching is Enabled....................................................................6-4
6.2.2 Operation When Data Caching is Disabled ..........................................................6-4
6.2.3 Cache Policies ......................................................................................................6-4
6.2.3.1 Cacheability ..........................................................................................6-4
6.2.3.2 Read Miss Policy ..................................................................................6-4
6.2.3.3 Write Miss Policy...................................................................................6-5
6.2.3.4 Write-Back Versus Write-Through ........................................................6-6
6.2.4 Round-Robin Replacement Algorithm ..................................................................6-6
6.2.5 Parity Protection ...................................................................................................6-6
6.2.6 Atomic Accesses ..................................................................................................6-7
6.3 Data Cache and Mini-Data Cache Control ........................................................................6-7
6.3.1 Data Memory State After Reset............................................................................6-7
6.3.2 Enabling/Disabling ................................................................................................6-7
6.3.3 Invalidate & Clean Operations ..............................................................................6-8
6.3.3.1 Global Clean and Invalidate Operation .................................................6-8
6.4 Re-configuring the Data Cache as Data RAM .................................................................6-10
6.5 Write Buffer/Fill Buffer Operation and Control .................................................................6-13
7 Configuration ................................................................................................................................7-1
7.1 Overview............................................................................................................................7-1
7.2 CP15 Registers..................................................................................................................7-3
7.2.1 Register 0: ID & Cache Type Registers................................................................7-4
7.2.2 Register 1: Control & Auxiliary Control Registers .................................................7-5
7.2.3 Register 2: Translation Table Base Register ........................................................7-7
7.2.4 Register 3: Domain Access Control Register........................................................7-8
7.2.5 Register 5: Fault Status Register..........................................................................7-8
7.2.6 Register 6: Fault Address Register.......................................................................7-9
7.2.7 Register 7: Cache Functions ................................................................................7-9
7.2.8 Register 8: TLB Operations ................................................................................7-10
7.2.9 Register 9: Cache Lock Down ............................................................................7-11
7.2.10 Register 10: TLB Lock Down ..............................................................................7-12
7.2.11 Register 13: Process ID......................................................................................7-12
7.2.11.1 The PID Register Affect On Addresses ..............................................7-13
7.2.12 Register 14: Breakpoint Registers ......................................................................7-13
7.2.13 Register 15: Coprocessor Access Register ........................................................7-14
7.3 CP14 Registers................................................................................................................7-15
7.3.1 Registers 0-3: Performance Monitoring ..............................................................7-16
7.3.2 Registers 6-7: Clock and Power Management ...................................................7-16
7.3.3 Registers 8-15: Software Debug.........................................................................7-17
8 Performance Monitoring ...............................................................................................................8-1
8.1 Overview............................................................................................................................8-1
8.2 Clock Counter (CCNT; CP14 - Register 1) ........................................................................8-1
8.3 Performance Count Registers (PMN0 - PMN1; CP14 - Register 2 and 3, Respectively)..8-2
8.3.1 Extending Count Duration Beyond 32 Bits ...........................................................8-2
8.4 Performance Monitor Control Register (PMNC) ................................................................8-2
8.4.1 Managing the PMNC ............................................................................................8-4
8.5 Performance Monitoring Events ........................................................................................8-4
8.5.1 Instruction Cache Efficiency Mode .......................................................................8-5
8.5.2 Data Cache Efficiency Mode ................................................................................8-6
8.5.3 Instruction Fetch Latency Mode............................................................................8-6
8.5.4 Data/Bus Request Buffer Full Mode .....................................................................8-6
8.5.5 Stall/Writeback Statistics Mode.............................................................................8-7
8.5.6 Instruction TLB Efficiency Mode ...........................................................................8-8
8.5.7 Data TLB Efficiency Mode ....................................................................................8-8
8.6 Multiple Performance Monitoring Run Statistics ................................................................8-8
8.7 Examples ...........................................................................................................................8-8
9 Test...............................................................................................................................................9-1
9.1 Boundary-Scan Architecture and Overview.......................................................................9-1
9.2 Reset .................................................................................................................................9-3
9.3 Instruction Register............................................................................................................9-3
9.3.1 Boundary-Scan Instruction Set .............................................................................9-3
9.4 Test Data Registers ...........................................................................................................9-5
9.4.1 Bypass Register....................................................................................................9-5
9.4.2 Boundary-Scan Register.......................................................................................9-6
9.4.3 Device Identification (ID) Code Register...............................................................9-8
9.4.4 Data Specific Registers ........................................................................................9-8
9.5 TAP Controller ...................................................................................................................9-8
9.5.1 Test Logic Reset State .........................................................................................9-9
9.5.2 Run-Test/Idle State.............................................................................................9-10
9.5.3 Select-DR-Scan State.........................................................................................9-10
9.5.4 Capture-DR State ...............................................................................................9-10
9.5.5 Shift-DR State.....................................................................................................9-10
9.5.6 Exit1-DR State ....................................................................................................9-11
9.5.7 Pause-DR State..................................................................................................9-11
9.5.8 Exit2-DR State ....................................................................................................9-11
9.5.9 Update-DR State ................................................................................................9-11
9.5.10 Select-IR Scan State ..........................................................................................9-12
9.5.11 Capture-IR State.................................................................................................9-12
9.5.12 Shift-IR State ......................................................................................................9-12
9.5.13 Exit1-IR State......................................................................................................9-12
9.5.14 Pause-IR State ...................................................................................................9-12
9.5.15 Exit2-IR State......................................................................................................9-13
9.5.16 Update-IR State..................................................................................................9-13
10 Software Debug..........................................................................................................................10-1
10.1 Introduction ......................................................................................................................10-1
10.1.1 Halt Mode ...........................................................................................................10-1
10.1.2 Monitor Mode......................................................................................................10-2
10.2 Debug Registers..............................................................................................................10-2
10.3 Debug Control and Status Register (DCSR) ...................................................................10-3
10.3.1 Global Enable Bit (GE) .......................................................................................10-4
10.3.2 Halt Mode Bit (H) ................................................................................................10-4
10.3.3 Vector Trap Bits (TF,TI,TD,TA,TS,TU,TR) .........................................................10-4
10.3.4 Sticky Abort Bit (SA) ...........................................................................................10-5
10.3.5 Method of Entry Bits (MOE)................................................................................10-5
10.3.6 Trace Buffer Mode Bit (M) ..................................................................................10-5
10.3.7 Trace Buffer Enable Bit (E).................................................................................10-5
10.4 Debug Exceptions............................................................................................................10-5
10.4.1 Halt Mode ...........................................................................................................10-6
10.4.2 Monitor Mode......................................................................................................10-7
10.5 HW Breakpoint Resources ..............................................................................................10-8
10.5.1 Instruction Breakpoints .......................................................................................10-9
10.5.2 Data Breakpoints ................................................................................................10-9
10.6 Software Breakpoints.....................................................................................................10-11
10.7 Transmit/Receive Control Register (TXRXCTRL) .........................................................10-11
10.7.1 RX Register Ready Bit (RR) .............................................................................10-12
10.7.2 Overflow Flag (OV)...........................................................................................10-13
10.7.3 Download Flag (D)............................................................................................10-13
10.7.4 TX Register Ready Bit (TR) ..............................................................................10-14
10.7.5 Conditional Execution Using TXRXCTRL.........................................................10-14
10.8 Transmit Register (TX) ..................................................................................................10-15
10.9 Receive Register (RX) ...................................................................................................10-15
10.10 Debug JTAG Access .....................................................................................................10-16
10.10.1 SELDCSR JTAG Command .............................................................................10-16
10.10.2 SELDCSR JTAG Register ................................................................................10-17
10.10.2.1 DBG.HLD_RST.................................................................................10-18
10.10.2.2 DBG.BRK..........................................................................................10-18
10.10.2.3 DBG.DCSR .......................................................................................10-18
10.10.3 DBGTX JTAG Command..................................................................................10-19
10.10.4 DBGTX JTAG Register.....................................................................................10-19
10.10.5 DBGRX JTAG Command .................................................................................10-20
10.10.6 DBGRX JTAG Register ....................................................................................10-20
10.10.6.1 RX Write Logic ..................................................................................10-21
10.10.6.2 DBGRX Data Register ......................................................................10-21
10.10.6.3 DBG.RR ............................................................................................10-22
10.10.6.4 DBG.V...............................................................................................10-22
10.10.6.5 DBG.RX ............................................................................................10-22
10.10.6.6 DBG.D...............................................................................................10-23
10.10.6.7 DBG.FLUSH .....................................................................................10-23
10.10.7 Debug JTAG Data Register Reset Values........................................................10-23
10.11 Trace Buffer ...................................................................................................................10-23
10.11.1 Trace Buffer CP Registers................................................................................10-23
10.11.1.1 Checkpoint Registers........................................................................10-24
10.11.1.2 Trace Buffer Register (TBREG) ........................................................10-25
10.11.2 Trace Buffer Usage...........................................................................................10-25
10.12 Trace Buffer Entries.......................................................................................................10-27
10.12.1 Message Byte ...................................................................................................10-27
10.12.1.1 Exception Message Byte ..................................................................10-28
10.12.1.2 Non-exception Message Byte ...........................................................10-28
10.12.1.3 Address Bytes...................................................................................10-29
10.13 Downloading Code into the Instruction Cache...............................................................10-30
10.13.1 LDIC JTAG Command ......................................................................................10-30
10.13.2 LDIC JTAG Data Register ................................................................................10-31
10.13.3 LDIC Cache Functions......................................................................................10-32
10.13.4 Loading IC During Reset ..................................................................................10-33
10.13.4.1 Loading IC During Cold Reset for Debug .........................................10-34
10.13.4.2 Loading IC During a Warm Reset for Debug ....................................10-36
10.13.5 Dynamically Loading IC After Reset .................................................................10-38
10.13.5.1 Dynamic Code Download Synchronization.......................................10-39
10.13.6 Mini Instruction Cache Overview ......................................................................10-40
10.14 Halt Mode Software Protocol .........................................................................................10-40
10.14.1 Starting a Debug Session .................................................................................10-40
10.14.1.1 Setting up Override Vector Tables....................................................10-41
10.14.1.2 Placing the Handler in Memory.........................................................10-41
10.14.2 Implementing a Debug Handler ........................................................................10-42
10.14.2.1 Debug Handler Entry ........................................................................10-42
10.14.2.2 Debug Handler Restrictions ..............................................................10-42
10.14.2.3 Dynamic Debug Handler...................................................................10-43
10.14.2.4 High-Speed Download ......................................................................10-44
10.14.3 Ending a Debug Session ..................................................................................10-45
10.15 Software Debug Notes...................................................................................................10-46
11 Performance Considerations ......................................................................................................11-1
11.1 Branch Prediction ............................................................................................................11-1
11.2 Instruction Latencies........................................................................................................11-2
11.2.1 Performance Terms............................................................................................11-2
11.2.2 Branch Instruction Timings .................................................................................11-3
11.2.3 Data Processing Instruction Timings ..................................................................11-4
11.2.4 Multiply Instruction Timings ................................................................................11-5
11.2.5 Saturated Arithmetic Instructions........................................................................11-6
11.2.6 Status Register Access Instructions ...................................................................11-7
11.2.7 Load/Store Instructions.......................................................................................11-7
11.2.8 Semaphore Instructions......................................................................................11-8
11.2.9 Coprocessor Instructions ....................................................................................11-8
11.2.10 Miscellaneous Instruction Timing........................................................................11-8
11.2.11 Thumb Instructions .............................................................................................11-9
11.3 Interrupt Latency..............................................................................................................11-9
A Optimization Guide...................................................................................................................... A-1
A.1 Introduction....................................................................................................................... A-1
A.1.1 About This Guide ................................................................................................. A-1
A.2 Intel® XScale™ Core Pipeline.......................................................................................... A-1
A.2.1 General Pipeline Characteristics ......................................................................... A-2
A.2.1.1. Number of Pipeline Stages .................................................................. A-2
A.2.1.2. Intel® XScale™ Core Pipeline Organization ....................................... A-2
A.2.1.3. Out Of Order Completion..................................................................... A-3
A.2.1.4. Register Dependencies........................................................................ A-3
A.2.1.5. Use of Bypassing................................................................................. A-3
A.2.2 Instruction Flow Through the Pipeline ................................................................. A-4
A.2.2.1. ARM* v5 Instruction Execution ............................................................ A-4
A.2.2.2. Pipeline Stalls ...................................................................................... A-4
A.2.3 Main Execution Pipeline ...................................................................................... A-4
A.2.3.1. F1 / F2 (Instruction Fetch) Pipestages................................................. A-4
A.2.3.2. ID (Instruction Decode) Pipestage ....................................................... A-5
A.2.3.3. RF (Register File / Shifter) Pipestage .................................................. A-5
A.2.3.4. X1 (Execute) Pipestages ..................................................................... A-5
A.2.3.5. X2 (Execute 2) Pipestage .................................................................... A-6
A.2.3.6. XWB (write-back)................................................................................. A-6
A.2.4 Memory Pipeline .................................................................................................. A-6
A.2.4.1. D1 and D2 Pipestage........................................................................... A-6
A.2.5 Multiply/Multiply Accumulate (MAC) Pipeline ...................................................... A-6
A.2.5.1. Behavioral Description......................................................................... A-7
A.3 Basic Optimizations ..........................................................................................................A-7
A.3.1 Conditional Instructions ....................................................................................... A-7
A.3.1.1. Optimizing Condition Checks............................................................... A-7
A.3.1.2. Optimizing Branches............................................................................ A-8
A.3.1.3. Optimizing Complex Expressions ...................................................... A-10
A.3.2 Bit Field Manipulation ........................................................................................ A-11
A.3.3 Optimizing the Use of Immediate Values........................................................... A-11
A.3.4 Optimizing Integer Multiply and Divide .............................................................. A-11
A.3.5 Effective Use of Addressing Modes................................................................... A-12
A.4 Cache and Prefetch Optimizations ................................................................................. A-12
A.4.1 Instruction Cache............................................................................................... A-13
A.4.1.1. Cache Miss Cost................................................................................ A-13
A.4.1.2. Round Robin Replacement Cache Policy.......................................... A-13
A.4.1.3. Code Placement to Reduce Cache Misses ....................................... A-13
A.4.1.4. Locking Code into the Instruction Cache ........................................... A-13
A.4.2 Data and Mini Cache ......................................................................................... A-14
A.4.2.1. Non Cacheable Regions .................................................................... A-14
A.4.2.2. Write-through and Write-back Cached Memory Regions .................. A-14
A.4.2.3. Read Allocate and Read-write Allocate Memory Regions ................. A-15
A.4.2.4. Creating On-chip RAM....................................................................... A-15
A.4.2.5. Mini-data Cache................................................................................. A-15
A.4.2.6. Data Alignment .................................................................................. A-16
A.4.2.7. Literal Pools ....................................................................................... A-17
A.4.3 Cache Considerations ....................................................................................... A-17
A.4.3.1. Cache Conflicts, Pollution and Pressure............................................ A-17
A.4.3.2. Memory Page Thrashing.................................................................... A-18
A.4.4 Prefetch Considerations .................................................................................... A-18
A.4.4.1. Prefetch Distances............................................................................. A-18
A.4.4.2. Prefetch Loop Scheduling.................................................................. A-18
A.4.4.3. Compute vs. Data Bus Bound............................................................ A-19
A.4.4.4. Low Number of Iterations................................................................... A-19
A.4.4.5. Bandwidth Limitations ........................................................................ A-19
A.4.4.6. Cache Memory Considerations.......................................................... A-20
A.4.4.7. Cache Blocking .................................................................................. A-21
A.4.4.8. Prefetch Unrolling .............................................................................. A-21
A.4.4.9. Pointer Prefetch ................................................................................. A-22
A.4.4.10. Loop Interchange ............................................................................... A-23
A.4.4.11. Loop Fusion ....................................................................................... A-23
A.4.4.12. Prefetch to Reduce Register Pressure .............................................. A-23
A.5 Instruction Scheduling .................................................................................................... A-24
A.5.1 Scheduling Loads .............................................................................................. A-24
A.5.1.1. Scheduling Load and Store Double (LDRD/STRD) ........................... A-26
A.5.1.2. Scheduling Load and Store Multiple (LDM/STM)............................... A-27
A.5.2 Scheduling Data Processing Instructions .......................................................... A-28
A.5.3 Scheduling Multiply Instructions ........................................................................ A-28
A.5.4 Scheduling SWP and SWPB Instructions.......................................................... A-29
A.5.5 Scheduling the MRA and MAR Instructions (MRRC/MCRR)............................. A-29
A.5.6 Scheduling the MIA and MIAPH Instructions..................................................... A-30
A.5.7 Scheduling MRS and MSR Instructions............................................................. A-30
A.5.8 Scheduling Coprocessor Instructions ................................................................ A-31
A.6 Optimizations for Size..................................................................................................... A-31
A.6.1 Multiple Word Load and Store ........................................................................... A-31
A.6.2 Use of Conditional Instructions .......................................................................... A-31
A.6.3 Use of PLD Instructions ..................................................................................... A-32
A.6.4 Thumb Instructions ............................................................................................ A-32
Figures
1-1 Intel® XScale™ Microarchitecture Architecture Features .........................................................1-3
3-1 Example of Locked Entries in TLB.............................................................................................3-8
4-1 Instruction Cache Organization .................................................................................................4-1
4-2 Locked Line Effect on Round Robin Replacement ....................................................................4-6
5-1 BTB Entry ..................................................................................................................................5-1
5-2 Branch History ...........................................................................................................................5-2
6-1 Data Cache Organization ..........................................................................................................6-2
6-2 Mini-Data Cache Organization ..................................................................................................6-3
6-3 Locked Line Effect on Round Robin Replacement..................................................................6-13
9-1 Test Access Port (TAP) Block Diagram.....................................................................................9-2
9-2 BSDL code for 256-MBGA package..........................................................................................9-7
9-3 TAP Controller State Diagram ...................................................................................................9-9
10-1 SELDCSR Hardware .............................................................................................................10-17
10-2 DBGTX Hardware..................................................................................................................10-19
10-3 DBGRX Hardware .................................................................................................................10-20
10-4 RX Write Logic.......................................................................................................................10-21
10-5 DBGRX Data Register...........................................................................................................10-22
10-6 High Level View of Trace Buffer ............................................................................................10-26
10-7 Message Byte Formats..........................................................................................................10-27
10-8 Indirect Branch Entry Address Byte Organization .................................................................10-30
10-9 LDIC JTAG Data Register Hardware.....................................................................................10-31
10-10 Format of LDIC Cache Functions ........................................................................................10-33
10-11 Code Download During a Cold Reset For Debug ................................................................10-35
10-12 Code Download During a Warm Reset For Debug..............................................................10-37
10-13 Downloading Code in IC During Program Execution ...........................................................10-38
A-1 Intel® XScale™ Core RISC Superpipeline...................................................A-2
Tables
2-1 Multiply with Internal Accumulate Format..................................................................................2-4
2-2 MIA{<cond>} acc0, Rm, Rs .......................................................................................................2-4
2-3 MIAPH{<cond>} acc0, Rm, Rs ..................................................................................................2-5
2-4 MIAxy{<cond>} acc0, Rm, Rs....................................................................................................2-6
2-5 Internal Accumulator Access Format.........................................................................................2-7
2-6 MAR{<cond>} acc0, RdLo, RdHi ...............................................................................................2-8
2-7 MRA{<cond>} RdLo, RdHi, acc0 ...............................................................................................2-8
2-8 First-level Descriptors................................................................................................................2-9
2-9 Second-level Descriptors for Coarse Page Table .....................................................................2-9
2-10 Second-level Descriptors for Fine Page Table ........................................................................2-10
2-11 Exception Summary ................................................................................................................2-11
2-12 Event Priority ...........................................................................................................................2-11
2-13 Intel® XScale™ Core Encoding of Fault Status for Prefetch Aborts .......................................2-12
2-14 Intel® XScale™ Core Encoding of Fault Status for Data Aborts .............................................2-13
3-1 Data Cache and Buffer Behavior when X = 0............................................................................3-2
3-2 Data Cache and Buffer Behavior when X = 1............................................................................3-3
3-3 Memory Operations that Impose a Fence .................................................................................3-4
3-4 Valid MMU & Data/mini-data Cache Combinations...................................................................3-4
7-1 MRC/MCR Format.....................................................................................................................7-2
7-2 LDC/STC Format when Accessing CP14..................................................................................7-2
7-3 CP15 Registers .........................................................................................................................7-3
7-4 ID Register.................................................................................................................................7-4
7-5 Cache Type Register.................................................................................................................7-5
7-6 ARM* Control Register ..............................................................................................................7-6
7-7 Auxiliary Control Register ..........................................................................................................7-7
7-8 Translation Table Base Register ...............................................................................................7-7
7-9 Domain Access Control Register...............................................................................................7-8
7-10 Fault Status Register .................................................................................................................7-8
7-11 Fault Address Register ..............................................................................................................7-9
7-12 Cache Functions........................................................................................................................7-9
7-13 TLB Functions..........................................................................................................................7-11
7-14 Cache Lockdown Functions.....................................................................................................7-11
7-15 Data Cache Lock Register.......................................................................................................7-11
7-16 TLB Lockdown Functions ........................................................................................................7-12
7-17 Accessing Process ID..............................................................................................................7-12
7-18 Process ID Register.................................................................................................................7-13
7-19 Accessing the Debug Registers...............................................................................................7-13
7-20 Coprocessor Access Register .................................................................................................7-14
7-21 CP14 Registers........................................................................................................................7-16
7-22 Accessing the Performance Monitoring Registers...................................................................7-16
7-23 PWRMODE Register 7 ............................................................................................................7-17
7-24 CCLKCFG Register 6 ..............................................................................................................7-17
7-25 Clock and Power Management valid operations .....................................................................7-17
7-26 Accessing the Debug Registers...............................................................................................7-18
8-1 Clock Count Register (CCNT) ...................................................................................................8-2
8-2 Performance Monitor Count Register (PMN0 and PMN1).........................................................8-2
8-3 Performance Monitor Control Register (CP14, register 0).........................................................8-3
8-4 Performance Monitoring Events ................................................................................................8-4
8-5 Some Common Uses of the PMU..............................................................................................8-5
9-1 TAP Controller Pin Definitions ...................................................................................................9-2
9-2 JTAG Instruction Codes.............................................................................................................9-4
9-3 JTAG Instruction Descriptions ...................................................................................................9-4
10-1 Coprocessor 15 Debug Registers............................................................................................10-2
10-2 Coprocessor 14 Debug Registers............................................................................................10-2
10-3 Debug Control and Status Register (DCSR) ...........................................................................10-3
10-4 Event Priority ...........................................................................................................................10-6
10-5 Instruction Breakpoint Address and Control Register (IBCRx)................................................10-9
10-6 Data Breakpoint Register (DBRx)............................................................................................10-9
10-7 Data Breakpoint Controls Register (DBCON)........................................................................10-10
10-8 TX RX Control Register (TXRXCTRL)...................................................................................10-12
10-9 Normal RX Handshaking .......................................................................................................10-12
10-10 High-Speed Download Handshaking States........................................................................10-13
10-11 TX Handshaking...................................................................................................................10-14
10-12 TXRXCTRL Mnemonic Extensions ......................................................................................10-14
10-13 TX Register ..........................................................................................................................10-15
10-14 RX Register..........................................................................................................................10-15
10-15 DEBUG Data Register Reset Values...................................................................................10-23
10-16 CP 14 Trace Buffer Register Summary................................................................................10-24
10-17 Checkpoint Register (CHKPTx) ...........................................................................................10-24
10-18 TBREG Format ....................................................................................................................10-25
10-19 Message Byte Formats ........................................................................................................10-28
10-20 LDIC Cache Functions .........................................................................................................10-32
11-1 Branch Latency Penalty...........................................................................................................11-1
11-2 Latency Example .....................................................................................................................11-3
11-3 Branch Instruction Timings (Those predicted by the BTB) ......................................................11-3
11-4 Branch Instruction Timings (Those not predicted by the BTB) ................................................11-4
11-5 Data Processing Instruction Timings .......................................................................................11-4
11-6 Multiply Instruction Timings .....................................................................................................11-5
11-7 Multiply Implicit Accumulate Instruction Timings .....................................................................11-6
11-8 Implicit Accumulator Access Instruction Timings.....................................................................11-6
11-9 Saturated Data Processing Instruction Timings ......................................................................11-7
11-10 Status Register Access Instruction Timings...........................................................................11-7
11-11 Load and Store Instruction Timings .......................................................................................11-7
11-12 Load and Store Multiple Instruction Timings..........................................................................11-8
11-13 Semaphore Instruction Timings .............................................................................................11-8
11-14 CP15 Register Access Instruction Timings............................................................................11-8
11-15 CP14 Register Access Instruction Timings............................................................................11-8
11-16 SWI Instruction Timings .........................................................................................................11-8
11-17 Count Leading Zeros Instruction Timings ..............................................................................11-9
A-1 Pipelines and Pipe stages ............................................................................A-3
1 Introduction
1.1 About This Document
This document describes the Intel® XScale™ core as implemented in the PXA255 processor.
Intel Corporation assumes no responsibility for any errors which may appear in this document nor
does it make a commitment to update the information contained herein.
Intel retains the right to make changes to these specifications at any time, without notice. In
particular, descriptions of features, timings, and pin-outs do not imply a commitment to
implement them.
1.1.1 How to Read This Document
It is necessary to be familiar with the ARM* Version 5TE Architecture in order to understand some
aspects of this document.
Each chapter in this document focuses on a specific architectural feature of the Intel® XScale™
core.
Chapter 2, “Programming Model”
Chapter 3, “Memory Management”
Chapter 4, “Instruction Cache”
Chapter 5, “Branch Target Buffer”
Chapter 6, “Data Cache”
Chapter 7, “Configuration”
Chapter 8, “Performance Monitoring”
Chapter 10, “Software Debug”
Chapter 11, “Performance Considerations”
Appendix A, “Optimization Guide” covers instruction scheduling techniques.
Note: Most of the “buzz words” and acronyms found throughout this document are captured in
Section 1.3.2, “Terminology and Acronyms” on page 1-6, located at the end of this chapter.
1.1.2 Other Relevant Documents
ARM* Architecture Reference Manual Document Number: ARM DDI 0100E
This document describes the ARM* Architecture and is publicly available.
See http://www.arm.com/ARMARM for details. Sold as:
ARM* Architecture Reference Manual
Second Edition, edited by David Seal: Addison-Wesley: ISBN 0-201-73719-1
Intel® PXA255 Processor Developers Manual, Intel Order # 278693
Intel® PXA255 Processor Design Guide, Intel Order # 278694
Intel® 80200 Processor Development Manual, Intel Order # 273411
This document describes the first implementation of the Intel® XScale™ Microarchitecture
in a microprocessor targeted at I/O applications.
Available from http://developer.intel.com
1.2 High-Level Overview of the Intel® XScale™ core as Implemented in the Application Processors
The Intel® XScale™ core is an ARM* V5TE compliant microprocessor. It is a high performance
and low-power device that leads the industry in MIPS/mW. The core is not intended to be delivered
as a stand-alone product but as a building block for an ASSP (Application Specific Standard
Product) targeting embedded markets such as handheld devices, networking, storage, and remote
access servers. The PXA255 processor is an example of an ASSP designed primarily for handheld
devices. This document limits itself to describing the Intel® XScale™ core as it is implemented
in the PXA255 processor. In almost every attribute, the Intel® XScale™ core used in the
application processor is identical to the Intel® XScale™ core implemented in the Intel® 80200
processor.
The Intel® XScale™ core incorporates an extensive list of microarchitecture features that allow it
to achieve high performance. This rich feature set lets you select the features that
obtain the best performance for your application. Many of the microarchitectural features added to
the Intel® XScale™ core help hide memory latency, which is often a serious impediment to
high-performance processors. These include:
The ability to continue instruction execution even while the data cache is retrieving data from
external memory
A write buffer
Write-back caching
Various data cache allocation policies which can be configured differently for each application
Cache locking
All these features improve the efficiency of the memory bus external to the core.
The Intel® XScale™ core efficiently handles audio processing through the support of 16-bit data
types and enhanced 16-bit operations. These audio coding enhancements center around multiply
and accumulate operations which accelerate many of the audio filtering and multimedia CODEC
algorithms.
1.2.1 ARM* Compatibility
ARM* Version 5 (V5) Architecture added new features to ARM* Version 4, including, among
others, floating point instructions. The Intel® XScale™ core implements the integer
instruction set of ARM* V5 but does not provide hardware support for any of the floating point
instructions.
The Intel® XScale™ core provides the ARM* V5T Thumb instruction set and the ARM* V5E
DSP extensions. To further enhance multimedia applications, the Intel® XScale™ core includes
additional Multiply-Accumulate functionality as the first instantiation of Intel® Media Processing
Technology. These new operations from Intel are mapped into ARM* coprocessor space.
Backward compatibility with StrongARM* products is maintained for user-mode applications.
Operating systems may require modifications to match the specific hardware features of the Intel®
XScale™ core and to take advantage of the performance enhancements added.
1.2.2 Features
Figure 1-1 shows the major functional blocks of the Intel® XScale™ core. The following sections
give a brief, high-level overview of these blocks.
1.2.2.1 Multiply/Accumulate (MAC)
The MAC unit supports early termination of multiplies/accumulates in two cycles and can sustain a
throughput of one MAC operation every cycle. Several architectural enhancements were made to the
MAC to support audio coding algorithms, including a 40-bit accumulator and support for
16-bit packed data.
Refer to Section 2.3, “Extensions to ARM* Architecture” on page 2-2 for more information.
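The effect of a 40-bit accumulator fed by 16-bit packed operands can be illustrated with a small software model. The sketch below is not the hardware definition: the helper names acc40_wrap and mac_packed16 are hypothetical, and the wrap-around behavior on overflow is an assumption made for illustration. Section 2.3 remains the reference for the actual instruction semantics.

```c
#include <assert.h>
#include <stdint.h>

/* Mask selecting the low 40 bits of a 64-bit container. */
#define ACC40_MASK ((UINT64_C(1) << 40) - 1)

/* Reduce a value to 40 bits and sign-extend it from bit 39.
 * Assumption: overflow wraps around rather than saturating.
 * The final conversion to int64_t relies on two's complement
 * behavior, which mainstream compilers provide. */
static int64_t acc40_wrap(int64_t value)
{
    uint64_t low = (uint64_t)value & ACC40_MASK;
    if (low & (UINT64_C(1) << 39)) {
        low |= ~ACC40_MASK;  /* replicate the sign bit upward */
    }
    return (int64_t)low;
}

/* Hypothetical helper: treat a and b as two packed signed 16-bit
 * values each, multiply the halves pairwise, and add both products
 * to the 40-bit accumulator value acc. */
static int64_t mac_packed16(int64_t acc, uint32_t a, uint32_t b)
{
    /* The accumulator passed in must already fit in 40 bits. */
    assert(acc == acc40_wrap(acc));

    int32_t a_lo = (int16_t)(a & 0xFFFFu);
    int32_t a_hi = (int16_t)(a >> 16);
    int32_t b_lo = (int16_t)(b & 0xFFFFu);
    int32_t b_hi = (int16_t)(b >> 16);

    int64_t sum = acc + (int64_t)a_hi * b_hi + (int64_t)a_lo * b_lo;
    return acc40_wrap(sum);
}
```

In this model the eight bits above bit 31 act as guard bits, so a long run of 16-bit products can be summed before the accumulated value leaves the representable range.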
1.2.2.2 Memory Management
The Intel® XScale™ core implements the Memory Management Unit (MMU) Architecture
specified in the ARM* Architecture Reference Manual. The MMU provides access protection and
virtual to physical address translation.
Figure 1-1. Intel® XScale™ Microarchitecture Architecture Features
[Block diagram. Blocks shown: microprocessor (7-stage pipeline), instruction cache, data cache, mini-data cache, data RAM, branch target buffer, IMMU, DMMU, fill buffer, write buffer, MAC, performance monitoring, debug, JTAG, and power management control.]
The MMU Architecture also specifies the caching policies for the instruction cache and data cache.
These policies are specified as page attributes and include:
identifying code as cacheable or non-cacheable
selecting between the mini-data cache or data cache
write-back or write-through data caching
enabling data write allocation policy
enabling the write buffer to coalesce stores to external memory
Refer to Chapter 3, “Memory Management” for more information.
1.2.2.3 Instruction Cache
The Intel® XScale™ core implements a 32-Kbyte, 32-way set associative instruction cache with a
line size of 32 bytes. All requests that “miss” the instruction cache generate a 32-byte read request
to external memory. A mechanism to lock critical code within the cache is also provided.
Refer to Chapter 4, “Instruction Cache” for more information.
In addition to the main instruction cache there is a 2-Kbyte mini-instruction cache dedicated to
advanced debugging features. Refer to Chapter 10, “Software Debug” for more information.
1.2.2.4 Branch Target Buffer
The Intel® XScale™ core provides a Branch Target Buffer (BTB) to predict the outcome of branch
type instructions. It provides storage for the target address of branch type instructions and predicts
the next address to present to the instruction cache when the current instruction address is that of a
branch.
The BTB holds 128 entries. Refer to Chapter 5, “Branch Target Buffer” for more information.
1.2.2.5 Data Cache
The Intel® XScale™ core implements a 32-Kbyte, 32-way set associative data cache and a 2-
Kbyte, 2-way set associative mini-data cache. Each cache has a line size of 32 bytes, supporting
write-through or write-back caching.
The data/mini-data cache is controlled by page attributes defined in the MMU Architecture and by
coprocessor 15.
Refer to Chapter 6, “Data Cache” for more information.
The Intel® XScale™ core allows applications to re-configure a portion of the data cache as data
RAM. Software may place special tables or frequently used variables in this RAM. Refer to
Section 6.4, “Re-configuring the Data Cache as Data RAM” on page 6-10 for more information.
1.2.2.6 Fill Buffer & Write Buffer
The Fill Buffer and Write Buffer enable the loading and storing of data to memory beyond the
Intel® XScale™ core. The Write Buffer carries all write traffic beyond the core and coalesces
data when coalescing is both globally enabled and permitted by the memory page type being
written. The Fill Buffer assists the loading of data from memory and, together with an associated
Pend Buffer, allows multiple memory reads to be outstanding. Another key function of the Fill
Buffer (along with the Instruction Fetch Buffers) is to allow the application processor's external
SDRAM to be read as 4-word bursts, rather than single word accesses, improving overall memory
bandwidth.
The Fill, Pend and Write buffers all help to decouple core speed from any limitations to accessing
external memory. Further details on these buffers can be found in Section 6.5, “Write Buffer/Fill
Buffer Operation and Control” on page 6-13.
1.2.2.7 Performance Monitoring
Two performance monitoring counters have been added to the Intel® XScale™ core that can be
configured to monitor various events in the Intel® XScale™ core. These events allow a software
developer to measure cache efficiency, detect system bottlenecks and reduce the overall latency of
programs.
Refer to Chapter 8, “Performance Monitoring” for more information.
1.2.2.8 Power Management
The Intel® XScale™ core incorporates a power and clock management unit that can assist ASSPs
in controlling their clocking and managing their power. These features are described in Section 7.3,
“CP14 Registers” on page 7-15.
1.2.2.9 Debug
The Intel® XScale™ core supports software debugging through two instruction address breakpoint
registers, one data-address breakpoint register, one data-address/mask breakpoint register, a
mini-instruction cache and a trace buffer.
Testability and hardware debugging are supported on the Intel® XScale™ core through the Test
Access Port (TAP) Controller implementation, which is based on the IEEE 1149.1 (JTAG) Standard
Test Access Port and Boundary-Scan Architecture. The purpose of the TAP controller is to support
test logic internal and external to the Intel® XScale™ core such as built-in self-test and
boundary-scan.
The JTAG port can also be used as a hardware interface for debugger control of software. Refer to
Chapter 10, “Software Debug” for more information.
1.3 Terminology and Conventions
1.3.1 Number Representation
All numbers in this document can be assumed to be base 10 unless designated otherwise. In text
and pseudo code descriptions, hexadecimal numbers have a prefix of 0x and binary numbers have a
prefix of 0b. For example, 107 would be represented as 0x6B in hexadecimal and 0b1101011 in
binary.
Bitfields are expressed with a colon within square brackets; for example, a * b [63:0] denotes a
64-bit arithmetic partial result.
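The notation can be checked with a short Python sketch. Python literals happen to use the same 0x and 0b prefixes; the helper name `bitfield` is an assumption introduced here for illustration and does not appear elsewhere in this manual.

```python
def bitfield(value, high, low):
    """Return bits [high:low] of value (inclusive) as an unsigned integer."""
    width = high - low + 1
    return (value >> low) & ((1 << width) - 1)

# The three spellings used in the text denote the same number.
assert 107 == 0x6B == 0b1101011

# a * b [63:0] keeps only the low 64 bits of a product that may be wider.
a, b = 0xFFFFFFFFFFFFFFFF, 0x10
partial = bitfield(a * b, 63, 0)
```

Here the full product needs 68 bits, so `partial` differs from `a * b`; only the low 64 bits survive.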
1.3.2 Terminology and Acronyms
ASSP Application Specific Standard Product. Defined for a specific purpose
but not exclusively available to a single customer.
API Application Programming Interface, typically a defined set of function
calls and passed parameters defining how layers of software interact.
Assert This term refers to the logically active value of a signal or bit.
BTB Branch Target Buffer, a predictor of instructions that follow branches.
Clean A ‘clean’ operation with regard to a data cache is the writing back of
modified data to the external memory system, resulting in no ‘dirty’ lines
remaining in the cache.
Coalescing Coalescing means bringing together a new store operation with an
existing store operation already issued. This includes, in Peripheral
Component Interconnect [PCI] terminology, write merging, write
collapsing, and write combining.
Deassert This term refers to the logically inactive value of a signal or bit.
Flush A ‘flush’ operation invalidates the location(s) in the cache by deasserting
the valid bit. This now invalid cacheline will no longer be searched on
cache accesses. A ‘flush’ operation on a write-back data cache does not
imply a ‘clean’ operation.
NOP Shortening of No OPeration, meaning an instruction with no state
changing effect. A typical example might be Add-constant-zero without
condition flag update.
Privilege Mode Any chip mode of operation that is not User Mode, which is the mode
typically used for applications software. A Privileged Mode gains access to shared
system resources.
Reserved A reserved field is a register field that may be used by an implementation
but not intended to be programmed. If the initial value of a reserved field
is supplied by software, this value must be zero. Software should not
modify reserved fields or depend on any values in reserved fields.
TLB Translation Look-aside Buffer, a cache of Page Table descriptors loaded
from memory to minimize page-table walking overhead.
2 Programming Model
This chapter describes the programming model of the Intel® XScale™ core, namely the
implementation options and extensions to the ARM* Version 5 architecture chosen for the
PXA255 processor.
2.1 ARM* Architecture Compatibility
The Intel® XScale™ core implements the integer instruction set architecture specified in ARM*
Version 5TE. T refers to the Thumb instruction set and E refers to the DSP-Enhanced instruction
set.
ARM* Version 5 introduces a few more architecture features over Version 4:
tiny pages of 1 Kbyte each
a new instruction (CLZ) that counts the leading zeroes in a data value
enhanced ARM-Thumb transfer instructions
new breakpoint instructions (BKPT)
a modification of the system control coprocessor, CP15.
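As an illustration of the CLZ instruction listed above, the following Python sketch models its result for a 32-bit operand. The function name `clz32` is an assumption for this example; the model relies on the ARM* definition that CLZ returns 32 when the operand is zero.

```python
def clz32(value):
    """Count the leading zero bits of a 32-bit value; returns 32 for zero."""
    value &= 0xFFFFFFFF              # keep only the 32-bit register contents
    return 32 - value.bit_length()   # bit_length() is 0 for a zero value
```

A common use is normalisation: shifting a non-zero value left by `clz32(value)` places its most significant set bit in bit 31.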
2.2 ARM* Architecture Implementation Options
2.2.1 Big Endian versus Little Endian
The Intel® XScale™ core supports both big and little endian data representation. The B-bit of the
Control Register (Coprocessor 15, register 1, bit 7) selects between big and little endian mode.
The default behavior of the application processor at reset is little endian. To run in big endian
mode, the B bit must be set before attempting any sub-word accesses to memory. Note that the
endian bit takes effect even if the MMU is disabled.
The B-bit affects the data path, fill and write buffers and subword data location in memory. The
LCD controller and DMA controller on the application processor can also switch endianism for
data movement independent of the B-bit.
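The effect of the B-bit on sub-word accesses can be pictured with the Python sketch below. It is a programmer's-view model only, not a description of the hardware data path; the names `CONTROL_B_BIT` and `load_byte` are assumptions for illustration.

```python
import struct

CONTROL_B_BIT = 1 << 7   # bit 7 of the Control Register, per the text above

word = 0x11223344

# Apparent byte order of the same 32-bit word in each mode.
little = struct.pack("<I", word)   # B-bit clear (the reset default)
big = struct.pack(">I", word)      # B-bit set

def load_byte(memory, offset):
    """Model a byte load at the given byte offset within the word."""
    return memory[offset]
```

A byte load from offset 0 yields the least significant byte (0x44) in little endian mode and the most significant byte (0x11) in big endian mode, which is why the B-bit has to be set before any sub-word access is attempted.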
In concurrence with the changes introduced in ARM* V5, the Intel® XScale™ core does not
support legacy code requiring the 26-bit address space.
2.2.2 Thumb
The Intel® XScale™ core supports the Thumb instruction set. These are 16-bit ARM* instructions
that implement similar functions to the ARM* 32-bit instruction set, but offer advantages in
reducing code size.
2.2.3 ARM* DSP-Enhanced Instruction Set
The Intel® XScale™ core implements ARM*’s DSP-enhanced instruction set. There are new
multiply instructions that operate on 16-bit data values and new saturation instructions. Some of
the new instructions are:
SMLAxy 32<=16x16+32
SMLAWy 32<=32x16+32
SMLALxy 64<=16x16+64
SMULxy 32<=16x16
SMULWy 32<=32x16
QADD adds two registers and saturates the result if an overflow occurred
QDADD doubles and saturates one of the input registers then adds and saturates
QSUB subtracts two registers and saturates the result if an overflow occurred
QDSUB doubles and saturates one of the input registers then subtracts and saturates
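The saturating instructions in this list can be modelled in a few lines of Python. This is a behavioural sketch under stated assumptions: operands are passed as signed Python integers, the function names simply mirror the mnemonics, and the sticky Q flag that ARM* sets on saturation is not modelled.

```python
INT32_MIN, INT32_MAX = -(1 << 31), (1 << 31) - 1

def saturate32(value):
    """Clamp a result to the signed 32-bit range."""
    return max(INT32_MIN, min(INT32_MAX, value))

def qadd(a, b):
    return saturate32(a + b)

def qsub(a, b):
    return saturate32(a - b)

def qdadd(a, b):
    """Double b with saturation, then add to a with saturation."""
    return saturate32(a + saturate32(2 * b))

def qdsub(a, b):
    """Double b with saturation, then subtract from a with saturation."""
    return saturate32(a - saturate32(2 * b))
```

Note that QDADD and QDSUB saturate twice: once after the doubling and once after the add or subtract.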
The Intel® XScale™ core also implements LDRD, STRD and PLD instructions with the following
implementation notes:
PLD is interpreted as a read operation by the MMU and is ignored by the data breakpoint unit,
i.e., PLD will never generate data breakpoint events.
PLD to a non-cacheable page performs no action. Also, if the targeted cache line is already
resident, this instruction has no effect.
Both LDRD and STRD instructions will generate an alignment exception when the address is
not on a 64-bit boundary.
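The implementation notes above reduce to two simple predicates, sketched here in Python. The function names and parameters are hypothetical and exist only to restate the rules; they are not part of any API.

```python
def ldrd_strd_alignment_fault(address):
    """True when LDRD/STRD at this address is not on a 64-bit (8-byte) boundary."""
    return (address & 0x7) != 0

def pld_has_effect(page_cacheable, line_resident):
    """PLD does nothing for a non-cacheable page or an already resident line."""
    return page_cacheable and not line_resident
```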
MCRR and MRRC are only supported on the Intel® XScale™ core when directed to coprocessor 0
and are used to access the internal accumulator. See Section 2.3.1.2 for more information. Access
to other coprocessors on the application processor generates the Undefined instruction exception.
2.2.4 Base Register Update
If a data abort is signalled on a memory instruction that specifies writeback, the contents of the
base register will not be updated. This holds for all load and store instructions. This behavior
matches that of the first generation StrongARM* processor and is referred to in the ARM* V5
architecture as the Base Restored Abort Model.
2.3 Extensions to ARM* Architecture
The Intel® XScale™ core made a few extensions to the ARM* Version 5 architecture to meet the
needs of various markets and design requirements. The following is a list of the extensions which
are discussed in the next sections.
A DSP coprocessor (CP0) has been added that contains a 40-bit accumulator and 8 new
operations in coprocessor space, hereafter referred to as new instructions.
New page attributes were added to the page table descriptors. The C and B page attribute
encoding was extended by one more bit to allow for more encodings: write allocate and mini-
data cache.
Additional functionality has been added to coprocessor 15. Coprocessor 14 was created.
Enhancements were made to the Event Architecture, instruction cache and data cache parity
error exceptions, breakpoint events, and imprecise external data aborts.
2.3.1 DSP Coprocessor 0 (CP0)
The Intel® XScale™ core adds a DSP coprocessor to the architecture for the purpose of increasing
the performance and the precision of audio processing algorithms. This coprocessor contains a 40-
bit accumulator and 8 new instructions.
The 40-bit accumulator is referenced by several new instructions that were added to the
architecture; MIA, MIAPH and MIAxy are multiply/accumulate instructions that reference the
40-bit accumulator instead of a register specified accumulator. MRA and MAR provide the ability
to read and write the 40-bit accumulator.
Access to CP0 is always allowed in all processor modes when bit 0 of the Coprocessor Access
Register is set. Any access to CP0 when this bit is clear will cause an undefined exception. (See
Section 7.2.13, “Register 15: Coprocessor Access Register” on page 7-14 for more details). Only
privileged software can set this bit in the Coprocessor Access Register.
The 40-bit accumulator will need to be saved on a context switch if multiple processes are using it.
Two new instruction formats were added for coprocessor 0: Multiply with Internal Accumulate
Format and Internal Accumulate Access Format. The formats and instructions are described next.
2.3.1.1 Multiply With Internal Accumulate Format
A new multiply format has been created to define operations on 40-bit accumulators. Table 2-1,
“Multiply with Internal Accumulate Format” on page 2-4 shows the layout of the new format. The
opcodes selected for this new format lie within the ARM* coprocessor register transfer instruction
type. These instructions have their own syntax.
Two new fields were created for this format, acc and opcode_3. The acc field specifies 1 of 8
internal accumulators to operate on and opcode_3 defines the operation for this format. The Intel®
XScale™ core defines a single 40-bit accumulator referred to as acc0; future implementations may
define multiple internal accumulators. The Intel® XScale™ core uses opcode_3 to define six
instructions, MIA, MIAPH, MIABB, MIABT, MIATB and MIATT.
The MIA instruction operates similarly to MLA except that the 40-bit accumulator is used. MIA
multiplies the signed value in register Rs (multiplier) by the signed value in register Rm
(multiplicand) and then adds the result to the 40-bit accumulator (acc0).
Table 2-1. Multiply with Internal Accumulate Format
Bits    31:28  27:20     19:16     15:12  11:8  7:5  4  3:0
Field   cond   11100010  opcode_3  Rs     0000  acc  1  Rm
Bits 31:28: cond - ARM* condition codes
Bits 19:16: opcode_3 - specifies the type of multiply with internal accumulate
    Notes: The Intel® XScale™ core defines the following:
    0b0000 = MIA
    0b1000 = MIAPH
    0b1100 = MIABB
    0b1101 = MIABT
    0b1110 = MIATB
    0b1111 = MIATT
    The effect of all other encodings is unpredictable.
Bits 15:12: Rs - Multiplier
Bits 7:5: acc - select 1 of 8 accumulators
    Notes: The Intel® XScale™ core only implements acc0; access to any other acc has
    unpredictable effect.
Bits 3:0: Rm - Multiplicand
Table 2-2. MIA{<cond>} acc0, Rm, Rs
Bits      31:28  27:16         15:12  11:4      3:0
Encoding  cond   111000100000  Rs     00000001  Rm
Operation: if ConditionPassed(<cond>) then
    acc0 = (Rm[31:0] * Rs[31:0])[39:0] + acc0[39:0]
Exceptions: none
Qualifiers: Condition Code
    No condition code flags are updated
Notes: Early termination is supported. Instruction timings can be found
    in Section 11.2.4, “Multiply Instruction Timings” on page 11-5.
    Specifying R15 for register Rs or Rm has unpredictable results.
    acc0 is defined to be 0b000 on the Intel® XScale™ core.
MIA does not support unsigned multiplication; all values in Rs and Rm will be interpreted as
signed data values. MIA is useful for operating on signed 16-bit data that was loaded into a general
purpose register by LDRSH.
The instruction is only executed if the condition specified in the instruction matches the condition
code status.
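The MIA operation can be modelled in Python as a sketch of the arithmetic only. Registers are passed as raw 32-bit values and interpreted as signed, as the text requires. The helper `to_signed`, the constant `MASK40`, and the assumption that the sum wraps modulo 2^40 (because the accumulator is 40 bits wide) are introduced here for illustration; condition codes and early termination are not modelled.

```python
MASK40 = (1 << 40) - 1

def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value

def mia(acc0, rm, rs):
    """acc0 = (Rm * Rs)[39:0] + acc0[39:0], with Rm and Rs treated as signed."""
    product = to_signed(rm, 32) * to_signed(rs, 32)
    return (acc0 + product) & MASK40
```

For example, `mia(0, 3, 0xFFFFFFFE)` multiplies 3 by -2 and leaves the 40-bit two's complement encoding of -6 in the accumulator.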
The MIAPH instruction performs two 16-bit signed multiplies on packed half word data and
accumulates these to a single 40-bit accumulator. The first signed multiplication is performed on
the lower 16 bits of the value in register Rs with the lower 16 bits of the value in register Rm. The
second signed multiplication is performed on the upper 16 bits of the value in register Rs with the
upper 16 bits of the value in register Rm. Both signed 32-bit products are sign extended and then
added to the value in the 40-bit accumulator (acc0).
The instruction is only executed if the condition specified in the instruction matches the condition
code status.
Table 2-3. MIAPH{<cond>} acc0, Rm, Rs
Bits      31:28  27:16         15:12  11:4      3:0
Encoding  cond   111000101000  Rs     00000001  Rm
Operation: if ConditionPassed(<cond>) then
    acc0 = sign_extend(Rm[15:0] * Rs[15:0]) +
           sign_extend(Rm[31:16] * Rs[31:16]) +
           acc0[39:0]
Exceptions: none
Qualifiers: Condition Code
    No condition code flags are updated
Notes: Instruction timings can be found
    in Section 11.2.4, “Multiply Instruction Timings” on page 11-5.
    Specifying R15 for register Rs or Rm has unpredictable results.
    acc0 is defined to be 0b000 on the Intel® XScale™ core.
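The MIAPH operation in Table 2-3 can be sketched in Python as follows. As before this is only an arithmetic model: `to_signed` and `MASK40` are helper names assumed for the example, and the accumulator is assumed to wrap modulo 2^40.

```python
MASK40 = (1 << 40) - 1

def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value

def miaph(acc0, rm, rs):
    """Accumulate the products of the low halves and of the high halves."""
    low = to_signed(rm, 16) * to_signed(rs, 16)
    high = to_signed(rm >> 16, 16) * to_signed(rs >> 16, 16)
    return (acc0 + low + high) & MASK40
```

With both registers holding 0x80008000, each product is 2^30 and their sum is 2^31, which no longer fits a signed 32-bit register. This is the kind of case the 40-bit accumulator is there to absorb.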
The MIAxy instruction performs one 16-bit signed multiply and accumulates the product into a single
40-bit accumulator. x refers to either the upper half or lower half of register Rm (multiplicand) and y
refers to the upper or lower half of Rs (multiplier). A value of 0x1 will select bits [31:16] of the
register which is specified in the mnemonic as T (for top). A value of 0x0 will select bits [15:0] of
the register which is specified in the mnemonic as B (for bottom).
MIAxy does not support unsigned multiplication; all values in Rs and Rm will be interpreted as
signed data values.
The instruction is only executed if the condition specified in the instruction matches the condition
code status.
2.3.1.2 Internal Accumulator Access Format
The Intel® XScale™ core defines a new instruction format for accessing internal accumulators in
CP0. Table 2-5, “Internal Accumulator Access Format” on page 2-7 shows that the opcode falls
into the coprocessor register transfer space.
The RdHi and RdLo fields allow up to 64 bits of data transfer between standard ARM* registers
and an internal accumulator. The acc field specifies 1 of 8 internal accumulators to transfer data to/
from. The Intel® XScale™ core implements a single 40-bit accumulator referred to as acc0; future
implementations can specify multiple internal accumulators of varying sizes.
Table 2-4. MIAxy{<cond>} acc0, Rm, Rs
Bits      31:28  27:16         15:12  11:4      3:0
Encoding  cond   1110001011xy  Rs     00000001  Rm
Operation: if ConditionPassed(<cond>) then
    if (bit[17] == 0)
        <operand1> = Rm[15:0]
    else
        <operand1> = Rm[31:16]
    if (bit[16] == 0)
        <operand2> = Rs[15:0]
    else
        <operand2> = Rs[31:16]
    acc0[39:0] = sign_extend(<operand1> * <operand2>) + acc0[39:0]
Exceptions: none
Qualifiers: Condition Code
    No condition code flags are updated
Notes: Instruction timings can be found
    in Section 11.2.4, “Multiply Instruction Timings” on page 11-5.
    Specifying R15 for register Rs or Rm has unpredictable results.
    acc0 is defined to be 0b000 on the Intel® XScale™ core.
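The half-word selection in Table 2-4 can be sketched in Python. The x and y selectors are passed as the letters "B" or "T" used in the mnemonics; the function names, `to_signed`, `MASK40` and the modulo 2^40 wrap are assumptions made for this example. `opcode_3` rebuilds the 0b11xy field from Table 2-1.

```python
MASK40 = (1 << 40) - 1

def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value

def half(register, selector):
    """Return the signed top ('T') or bottom ('B') half word of a register."""
    return to_signed(register >> 16 if selector == "T" else register, 16)

def miaxy(acc0, rm, rs, x, y):
    """x selects the half of Rm (bit 17), y selects the half of Rs (bit 16)."""
    return (acc0 + half(rm, x) * half(rs, y)) & MASK40

def opcode_3(x, y):
    """Encode the opcode_3 field as 0b11xy."""
    return 0b1100 | (int(x == "T") << 1) | int(y == "T")
```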
Access to the internal accumulator is allowed in all processor modes (user and privileged) as long
as bit 0 of the Coprocessor Access Register is set. (See Section 7.2.13, “Register 15: Coprocessor
Access Register” on page 7-14 for more details).
The Intel® XScale™ core implements two instructions MAR and MRA that move two ARM*
registers to acc0 and move acc0 to two ARM* registers, respectively.
Note: MAR has the same encoding as MCRR (to coprocessor 0) and MRA has the same encoding as
MRRC (to coprocessor 0). These instructions move 64 bits of data to/from ARM* registers from/to
coprocessor registers. MCRR and MRRC are defined in ARM*’s DSP instruction set.
Disassemblers not aware of MAR and MRA will produce the following syntax:
MCRR{<cond>} p0, 0x0, RdLo, RdHi, c0
MRRC{<cond>} p0, 0x0, RdLo, RdHi, c0
Table 2-5. Internal Accumulator Access Format
Bits    31:28  27:21    20  19:16  15:12  11:3       2:0
Field   cond   1100010  L   RdHi   RdLo   000000000  acc
Bits 31:28: cond - ARM* condition codes
Bit 20: L - move to/from internal accumulator
    0 = move to internal accumulator (MAR)
    1 = move from internal accumulator (MRA)
Bits 19:16: RdHi - specifies the high order eight (39:32) bits of the internal accumulator.
    Notes: On a read of the acc, this 8-bit high order field will be sign extended.
    On a write to the acc, the lower 8 bits of this register will be written to acc[39:32].
Bits 15:12: RdLo - specifies the low order 32 bits of the internal accumulator
Bits 7:4: Should be zero
    Notes: This field could be used in future implementations to specify the type of
    saturation to perform on the read of an internal accumulator. (e.g., a signed saturation
    to 16-bits may be useful for some filter algorithms.)
Bit 3: Should be zero
Bits 2:0: acc - specifies 1 of 8 internal accumulators
    Notes: The Intel® XScale™ core only implements acc0; access to any other acc is
    unpredictable.
The MAR instruction moves the value in register RdLo to bits[31:0] of the 40-bit accumulator
(acc0) and moves bits[7:0] of the value in register RdHi into bits[39:32] of acc0.
The instruction is only executed if the condition specified in the instruction matches the condition
code status.
This instruction executes in any processor mode.
The MRA instruction moves the 40-bit accumulator value (acc0) into two registers. Bits[31:0] of
the value in acc0 are moved into the register RdLo. Bits[39:32] of the value in acc0 are sign
extended to 32 bits and moved into the register RdHi.
The instruction is only executed if the condition specified in the instruction matches the condition
code status.
This instruction executes in any processor mode.
Table 2-6. MAR{<cond>} acc0, RdLo, RdHi
Bits      31:28  27:20     19:16  15:12  11:0
Encoding  cond   11000100  RdHi   RdLo   000000000000
Operation: if ConditionPassed(<cond>) then
    acc0[39:32] = RdHi[7:0]
    acc0[31:0] = RdLo[31:0]
Exceptions: none
Qualifiers: Condition Code
    No condition code flags are updated
Notes: Instruction timings can be found in
    Section 11.2.4, “Multiply Instruction Timings” on page 11-5.
    Specifying R15 as either RdHi or RdLo has unpredictable results.
Table 2-7. MRA{<cond>} RdLo, RdHi, acc0
Bits      31:28  27:20     19:16  15:12  11:0
Encoding  cond   11000101  RdHi   RdLo   000000000000
Operation: if ConditionPassed(<cond>) then
    RdHi[31:0] = sign_extend(acc0[39:32])
    RdLo[31:0] = acc0[31:0]
Exceptions: none
Qualifiers: Condition Code
    No condition code flags are updated
Notes: Instruction timings can be found in
    Section 11.2.4, “Multiply Instruction Timings” on page 11-5.
    Specifying the same register for RdHi and RdLo has unpredictable
    results.
    Specifying R15 as either RdHi or RdLo has unpredictable results.
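The MAR and MRA operations in Tables 2-6 and 2-7 can be sketched in Python as a pair of pure functions. The names `mar` and `mra` mirror the mnemonics, but the calling convention (accumulator passed and returned as an integer) is an assumption made for the example.

```python
def mar(rdlo, rdhi):
    """Build the 40-bit accumulator from RdLo[31:0] and RdHi[7:0]."""
    return ((rdhi & 0xFF) << 32) | (rdlo & 0xFFFFFFFF)

def mra(acc0):
    """Split the accumulator into (RdLo, RdHi); RdHi is sign extended to 32 bits."""
    rdlo = acc0 & 0xFFFFFFFF
    high = (acc0 >> 32) & 0xFF
    rdhi = (high | 0xFFFFFF00) if high & 0x80 else high
    return rdlo, rdhi
```

Because MAR ignores all but the low 8 bits of RdHi, an MRA followed later by a MAR with the same two values restores the accumulator exactly. This is the round trip that a context switch saving acc0 would rely on.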
2.3.2 New Page Attributes
The Intel® XScale™ core extends the ARM* page attributes defined by the C & B bits in the page
descriptors with an additional X bit. This bit allows four more attributes to be encoded when X=1.
These new encodings include allocating data for the mini-data cache and write-allocate caching. A
full description of the encodings can be found in Section 3.2.3, “Data Cache and Write Buffer” on
page 3-2.
The Intel® XScale™ core retains ARM* definitions of the C & B encoding when X = 0, which is
different than the StrongARM* products. The memory attribute for the mini-data cache has been
moved and replaced with the write-through caching attribute.
When write-allocate is enabled, a store operation that misses the data cache (cacheable data only)
will generate a line fill. If disabled, a line fill only occurs when a load operation misses the data
cache (cacheable data only).
Write-through caching causes all store operations to be written to memory, whether they are first
written to the cache or not. This feature is useful for maintaining data cache coherency.
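The write-allocate rule described above can be restated as a small decision function. This Python sketch only encodes the behaviour given in the text for an access that misses the data cache; the function and parameter names are hypothetical.

```python
def miss_causes_line_fill(is_store, cacheable, write_allocate):
    """Return True when a data cache miss triggers a line fill."""
    if not cacheable:
        return False           # only cacheable data can be filled
    if is_store:
        return write_allocate  # store misses fill only with write-allocate
    return True                # load misses always fill
```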
The Intel® XScale™ core also adds a P bit in the first level descriptors to allow an ASSP to
identify a new memory attribute. The application processor doesn’t implement a function for this
bit. All instances of the P bit in first-level descriptors must be written as zero. Bit 1 in the Control
Register (coprocessor 15, register 1, opcode=1) that is used to interact with the P bit must also be
written as zero.
The X, C, B & P attributes are programmed in the translation table descriptors, which are
highlighted in Table 2-8, “First-level Descriptors” on page 2-9, Table 2-9, “Second-level
Descriptors for Coarse Page Table” on page 2-9 and Table 2-10, “Second-level Descriptors for
Fine Page Table” on page 2-10. Two second-level descriptor formats have been defined for the
Intel® XScale™ core, one is used for the coarse page table and the other is used for the fine page
table.
AP bits are ARM* Access Permission controls.
Table 2-8. First-level Descriptors
Each row lists the fields of one descriptor format from bit 31 down to bit 0; the final entry is bits 1:0.
SBZ | 0 0
Coarse page table base address | P | Domain | SBZ | 0 1
Section base address | SBZ | TEX | AP | P | Domain | 0 | C | B | 1 0
Fine page table base address | SBZ | P | Domain | SBZ | 1 1
Table 2-9. Second-level Descriptors for Coarse Page Table
Each row lists the fields of one descriptor format from bit 31 down to bit 0; the final entry is bits 1:0.
SBZ | 0 0
Large page base address | TEX | AP3 | AP2 | AP1 | AP0 | C | B | 0 1
Small page base address | AP3 | AP2 | AP1 | AP0 | C | B | 1 0
Extended small page base address | SBZ | TEX | AP | C | B | 1 1
The TEX (Type Extension) field is present in several of the descriptor types. In the Intel®
XScale™ core, only the LSB of this field is used; this is called the X bit.
A Small Page descriptor does not have a TEX field. For these descriptors, TEX is implicitly zero;
that is, they operate as if the X bit had a ‘0’ value.
The X bit, when set, modifies the meaning of the C and B bits. Description of page attributes and
their encoding can be found in Chapter 3, “Data Cache and Write Buffer”.
2.3.3 Additions to CP15 Functionality
To accommodate the functionality in the Intel® XScale™ core, registers in CP15 and CP14 have
been added or augmented. See Chapter 7, “Configuration” for details.
At times it is necessary to be able to guarantee exactly when a CP15 update takes effect. For
example, when enabling memory address translation (turning on the MMU), it is vital to know
when the MMU is actually guaranteed to be in operation. To address this need, a processor-specific
code sequence is defined for the Intel® XScale™ core. The sequence -- called CPWAIT -- is
shown in Example 2-1.
When setting multiple CP15 registers, system software may opt to delay the assurance of their
update. This is accomplished by executing CPWAIT only after a sequence of MCR instructions.
Table 2-10. Second-level Descriptors for Fine Page Table
Each row lists the fields of one descriptor format from bit 31 down to bit 0; the final entry is bits 1:0.
SBZ | 0 0
Large page base address | TEX | AP3 | AP2 | AP1 | AP0 | C | B | 0 1
Small page base address | AP3 | AP2 | AP1 | AP0 | C | B | 1 0
Tiny Page Base Address | TEX | AP | C | B | 1 1
Example 2-1. CPWAIT: Canonical method to wait for CP15 update
;; The following macro should be used when software needs to be
;; assured that a CP15 update has taken effect.
;; It may only be used while in a privileged mode, because it
;; accesses CP15.
MACRO CPWAIT
    MRC P15, 0, R0, C2, C0, 0   ; arbitrary read of CP15
    MOV R0, R0                  ; wait for it
    SUB PC, PC, #4              ; branch to next instruction
    ; At this point, any previous CP15 writes are
    ; guaranteed to have taken effect.
ENDM
The CPWAIT sequence guarantees that CP15 side-effects are complete by the time the CPWAIT is
complete. It is possible, however, that the CP15 side-effect will take place before CPWAIT
completes or is issued. Programmers should take care that this does not affect the correctness of
their code.
2.3.4 Event Architecture
2.3.4.1 Exception Summary
Table 2-11 shows all the exceptions that the Intel® XScale™ core may generate, and the attributes
of each. Subsequent sections give details on each exception. A precise exception is defined as one
where R14_mode always contains a pointer to locate the instruction that caused the exception.
Imprecise exceptions know the last instruction, but must look beyond the ARM* registers for the
cause of the exception.
2.3.4.2 Event Priority
The Intel® XScale™ core follows the exception priority specified in the ARM Architecture
Reference Manual. The processor has additional exceptions that might be generated while
debugging. For information on these debug exceptions, see Chapter 10, “Software Debug”.
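The priorities listed in Table 2-12 can be read as a simple selection rule: when several exceptions are pending, the one with the lowest priority number is taken first. The Python sketch below illustrates that reading only. The dictionary keys and the function name are chosen for the example, and debug exceptions are left out.

```python
# Priority numbers as listed in Table 2-12 (1 is highest).
PRIORITY = {
    "Reset": 1,
    "Data Abort": 2,   # precise and imprecise
    "FIQ": 3,
    "IRQ": 4,
    "Prefetch Abort": 5,
    "Undefined Instruction": 6,
    "SWI": 6,
}

def highest_priority(pending):
    """Return the pending exception that would be taken first."""
    return min(pending, key=PRIORITY.__getitem__)
```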
Table 2-11. Exception Summary
Exception Description       Exception Type (a)       Precise?
Reset                       Reset                    N
FIQ                         FIQ                      N
IRQ                         IRQ                      N
External Instruction        Prefetch                 Y
Instruction MMU             Prefetch                 Y
Instruction Cache Parity    Prefetch                 Y
Lock Abort                  Data                     Y
MMU Data (b)                Data                     Y
External Data               Data                     N
Data Cache Parity           Data                     N
Software Interrupt          Software Interrupt       Y
Undefined Instruction       Undefined Instruction    Y
Debug Events (c)            varies                   varies
a. Exception types are those described in the ARM* Architecture Reference Manual, section 2.5.
b. Only exception that updates Fault Address Register, CP15 Register 6
c. Refer to Chapter 10, “Software Debug” for more details
Table 2-12. Event Priority
Exception                           Priority
Reset                               1 (Highest)
Data Abort (Precise & Imprecise)    2
FIQ                                 3
IRQ                                 4
Prefetch Abort                      5
Undefined Instruction, SWI          6 (Lowest)
2.3.4.3 Prefetch Aborts
The Intel® XScale™ core detects three types of prefetch aborts: Instruction MMU abort, external
abort on an instruction access, and an instruction cache parity error. These aborts are described in
Table 2-13. Domains are defined in the ARM* Architecture Reference Manual. The Fault Address
Register (FAR) in coprocessor 15 holds the address causing an abort under certain conditions.
When a prefetch abort occurs, hardware reports the highest priority one in the extended Status field
of the Fault Status Register (FSR). The value placed in R14_ABORT (the link register in abort
mode) is the address of the aborted instruction + 4.
Table 2-13. Intel® XScale™ Core Encoding of Fault Status for Prefetch Aborts
Priority   Sources                                     FSR[10,3:0] (a)   Domain    FAR
Highest    Instruction MMU Exception                   0b10000           invalid   invalid
           External Instruction Error Exception        0b10110           invalid   invalid
Lowest     Instruction Cache Parity Error Exception    0b11000           invalid   invalid
a. All other encodings not listed in the table are reserved.
Instruction MMU Exception: several exceptions can generate this encoding: translation faults,
domain faults, and permission faults. It is up to software to figure out which one occurred.
External Instruction Error Exception: this exception occurs when the memory system beyond
the core reports an error on an instruction cache fetch. For the application processor this is an
internal bus error as there is no signal pin to generate an error from external memory.
2.3.4.4 Data Aborts
Two types of data aborts exist in the Intel® XScale™ core: precise and imprecise. A precise data
abort is defined as one where R14_ABORT always contains the PC (+8) of the instruction that
caused the exception. An imprecise abort is one where R14_ABORT contains the PC (+4) of the
next instruction to execute and not the address of the instruction that caused the abort. In other
words, instruction execution will have advanced beyond the instruction that caused the data abort.
On the Intel® XScale™ core precise data aborts are recoverable and imprecise data aborts are not
recoverable.
Precise Data Aborts
A lock abort is a precise data abort; the extended Status field of the Fault Status Register (FSR)
is set to 0b10100. This abort occurs when a lock operation directed to the MMU (instruction
or data) or instruction cache causes an exception, due to either a translation fault, access
permission fault or external bus fault.
The Fault Address Register (FAR) is undefined and R14_ABORT is the address of the aborted
instruction + 8.
A data MMU abort is precise. These are due to an alignment fault, translation fault, domain
fault, permission fault or external data abort on an MMU translation. The status field is set to a
predetermined ARM* definition which is shown in Table 2-14, “Intel® XScale™ Core
Encoding of Fault Status for Data Aborts” on page 2-13.
The Fault Address Register is set to the effective data address of the instruction and
R14_ABORT is the address of the aborted instruction + 8.
First-level and Second-level refer to page descriptor fetches. A section is a 1Mbyte memory
area. For details see the ARM* Architecture Reference Manual.
Imprecise Data Aborts
A data cache parity error is imprecise; the extended Status field of the Fault Status Register is
set to 0b11000.
All external data aborts except for those generated on a data MMU translation are imprecise.
The Fault Address Register for all imprecise data aborts is undefined and R14_ABORT is the
address of the next instruction to execute + 4, which is the same for both ARM* and Thumb mode.
Although the Intel® XScale™ core guarantees the Base Restored Abort Model for precise aborts, it
cannot do so in the case of imprecise aborts. A Data Abort handler may encounter an updated base
register if it is invoked because of an imprecise abort.
Imprecise data aborts may create scenarios that are difficult for an abort handler to recover from. Both
external data aborts and data cache parity errors may result in corrupted data in the targeted
registers. Because these faults are imprecise, it is possible that the corrupted data will have been
used before the Data Abort fault handler is invoked. Because of this, software should treat
imprecise data aborts as unrecoverable.
Table 2-14. Intel® XScale™ Core Encoding of Fault Status for Data Aborts

Priority  Sources                                       FSR[10,3:0] (a)  Domain   FAR
Highest   Alignment                                     0b000x1          invalid  valid
          External Abort on Translation  First level    0b01100          invalid  valid
                                         Second level   0b01110          valid    valid
          Translation                    Section        0b00101          invalid  valid
                                         Page           0b00111          valid    valid
          Domain                         Section        0b01001          valid    valid
                                         Page           0b01011          valid    valid
          Permission                     Section        0b01101          valid    valid
                                         Page           0b01111          valid    valid
          Lock Abort                                    0b10100          invalid  invalid
            This data abort occurs on an MMU lock
            operation (data or instruction TLB) or on
            an Instruction Cache lock operation.
          Imprecise External Data Abort                 0b10110          invalid  invalid
Lowest    Data Cache Parity Error Exception             0b11000          invalid  invalid

a. All other encodings not listed in the table are reserved.
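The encodings in Table 2-14 lend themselves to a small decode routine in a data abort handler. The following C sketch is illustrative only. The names data_abort_kind, fsr_status, classify_data_abort and data_abort_is_imprecise are hypothetical, and the sketch assumes the handler has already read the Fault Status Register from coprocessor 15 into a 32-bit value.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical classification of a data abort, following Table 2-14. */
typedef enum {
    ABORT_ALIGNMENT,
    ABORT_EXTERNAL_ON_TRANSLATION,
    ABORT_TRANSLATION,
    ABORT_DOMAIN,
    ABORT_PERMISSION,
    ABORT_LOCK,
    ABORT_IMPRECISE_EXTERNAL,
    ABORT_DCACHE_PARITY,
    ABORT_RESERVED
} data_abort_kind;

/* Build the 5-bit status FSR[10,3:0]; FSR bit 10 becomes the top bit. */
static uint32_t fsr_status(uint32_t fsr)
{
    return (((fsr >> 10) & 1u) << 4) | (fsr & 0xFu);
}

static data_abort_kind classify_data_abort(uint32_t fsr)
{
    uint32_t s = fsr_status(fsr);

    if ((s & 0x1Du) == 0x01u)            /* 0b000x1, bit 1 is don't-care */
        return ABORT_ALIGNMENT;

    switch (s) {
    case 0x0C: case 0x0E: return ABORT_EXTERNAL_ON_TRANSLATION;
    case 0x05: case 0x07: return ABORT_TRANSLATION;
    case 0x09: case 0x0B: return ABORT_DOMAIN;
    case 0x0D: case 0x0F: return ABORT_PERMISSION;
    case 0x14:            return ABORT_LOCK;
    case 0x16:            return ABORT_IMPRECISE_EXTERNAL;
    case 0x18:            return ABORT_DCACHE_PARITY;
    default:              return ABORT_RESERVED;
    }
}

/* Only the two lowest-priority sources in Table 2-14 are imprecise. */
static bool data_abort_is_imprecise(data_abort_kind k)
{
    return k == ABORT_IMPRECISE_EXTERNAL || k == ABORT_DCACHE_PARITY;
}
```

A handler built along these lines would classify the abort first and treat it as unrecoverable whenever data_abort_is_imprecise() returns true, in line with the guidance above.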
Memory accesses marked as “stall until complete” (see Section 3.2.3) can also result in imprecise
data aborts. For these types of accesses, the fault is easier to manage than the general case as it is
guaranteed to be raised within three instructions of the instruction that caused it. In other words, if
a “stall until complete” LD or ST instruction triggers an imprecise fault, then that fault will be seen
by the program within three instructions.
With this knowledge, it is possible to write code that can reliably access faulting memory. Place
three NOP instructions after such an access. If an imprecise fault occurs, it will do so during the
NOPs; the data abort handler will see identical register and memory state as it would with a precise
exception, and so would be able to recover. An example of this is shown in Example 2-2 on
page 2-14.
Of course, if a system design precludes events that could cause external aborts, then such
precautions are not necessary.
Multiple Data Aborts
Multiple data aborts may be detected by hardware but only the highest priority one will be
reported. If the reported data abort is precise, software can correct the cause of the abort and re-
execute the aborted instruction. If the lower priority abort still exists, it will be reported. Software
can handle each abort separately until the instruction successfully executes.
If the reported data abort is imprecise, software needs to check the Saved Program Status Register
(SPSR) to see if the previous context was executing in abort mode. If this is the case, the link back
to the current process has been lost and the data abort is unrecoverable.
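The SPSR check described above can be sketched in C as follows. This is a minimal illustration, not a complete handler: the function name is hypothetical, the SPSR value is assumed to have been read by the handler beforehand, and the abort-mode encoding 0b10111 of the mode field M[4:0] is taken from the ARM* Architecture Reference Manual rather than from this section.

```c
#include <stdbool.h>
#include <stdint.h>

#define PSR_MODE_MASK  0x1Fu  /* mode field M[4:0] of a program status register */
#define PSR_MODE_ABORT 0x17u  /* 0b10111: abort mode (per the ARM architecture) */

/* Returns true when the interrupted context was itself running in abort
 * mode, in which case an imprecise data abort has overwritten the link
 * back to the original process and cannot be recovered. */
static bool imprecise_abort_lost_link(uint32_t spsr)
{
    return (spsr & PSR_MODE_MASK) == PSR_MODE_ABORT;
}
```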
2.3.4.5 Events from Preload Instructions
A PLD instruction will never cause the Data MMU to fault for any of the following reasons:
Domain Fault
Permission Fault
Translation Fault
If execution of the PLD would cause one of the above faults, then the PLD does not cause the load
to occur.
This feature allows software to issue PLDs speculatively. Example 2-3 on page 2-15 places a PLD
instruction early in the loop. This PLD is used to fetch data for the next loop iteration. In this
example, the list is terminated with a node that has a null pointer. When execution reaches the end
of the list, the PLD on address 0x0 will not cause a fault. Rather, it will be ignored and the loop
will terminate normally.
Example 2-2. Shielding Code from Potential Imprecise Aborts
;; Example of code that maintains architectural state through the
;; window where an imprecise fault might occur.
LDR R0, [R1] ; R1 points to stall-until-complete
; region of memory
NOP
NOP
NOP
; Code beyond this point is guaranteed not to see any aborts
; from the LDR.
2.3.4.6 Debug Events
Debug events are covered in Section 10.4, “Debug Exceptions” on page 10-5.
Example 2-3. Speculatively issuing PLD
;; R0 points to a node in a linked list. A node has the following layout:
;; Offset Contents
;;----------------------------------
;; 0 data
;; 4 pointer to next node
;; This code computes the sum of all nodes in a list. The sum is placed into R9.
;;
MOV R9, #0 ; Clear accumulator
sumList:
LDR R1, [R0, #4] ; R1 gets pointer to next node
LDR R3, [R0] ; R3 gets data from current node
PLD [R1] ; Speculatively start load of next node
ADD R9, R9, R3 ; Add into accumulator
MOVS R0, R1 ; Advance to next node. At end of list?
BNE sumList ; If not then loop
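For readers working in C, the loop in Example 2-3 can be approximated as shown below. This is a sketch under stated assumptions: it relies on the __builtin_prefetch builtin of GCC- and Clang-compatible compilers, which is documented not to fault on an invalid address, and whether the compiler actually emits a PLD instruction depends on the target options in use. The names node and sum_list are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Node layout from Example 2-3: data first, then the next pointer
 * (offset 4 on a 32-bit target). */
struct node {
    int32_t data;
    struct node *next;
};

/* Sum all nodes, prefetching the next node before using the current one. */
static int32_t sum_list(const struct node *n)
{
    int32_t sum = 0;

    while (n != NULL) {
        const struct node *next = n->next;
        __builtin_prefetch(next);  /* speculative; harmless when next is NULL */
        sum += n->data;
        n = next;
    }
    return sum;
}
```

Unlike the assembly version, which assumes R0 points to a valid first node, this sketch also accepts an empty list.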
3 Memory Management
This chapter describes the memory management unit implemented in the Intel® XScale™ core.
3.1 Overview
The Intel® XScale™ core implements the Memory Management Unit (MMU) Architecture
specified in the ARM Architecture Reference Manual. To accelerate virtual to physical address
translation, the Intel® XScale™ core uses both an instruction Translation Look-aside Buffer (TLB)
and a data TLB to cache the latest translations. Each TLB holds 32 entries and is fully-associative.
Not only do the TLBs contain the translated addresses, but also the access rights for memory
references.
If an instruction or data TLB miss occurs, a hardware translation-table-walking mechanism is
invoked to translate the virtual address to a physical address. Once translated, the physical address
is placed in the TLB along with the access rights and attributes of the page or section. These
translations can also be locked down in either TLB to guarantee the performance of critical
routines.
The Intel® XScale™ core allows system software to associate various attributes with regions of
memory:
Cacheable
Bufferable
Line allocate policy
Write policy
I/O
Mini Data Cache
Coalescing
See Section 3.2.3, “Data Cache and Write Buffer” on page 3-2 for a description of page attributes
and Section 2.3.2, “New Page Attributes” on page 2-9 to find out where these attributes have been
mapped in the MMU descriptors.
Note: The virtual address with which the TLBs are accessed may be remapped by the PID register. See
Section 7.2.11, “Register 13: Process ID” on page 7-12 for a description of the PID register.
3.2 Architecture Model
The following sub-sections describe the Architecture Model.
3.2.1 Version 4 vs. Version 5
ARM* MMU Version 5 Architecture introduces the support of tiny pages, which are 1 KByte in
size. The reserved field in the first-level descriptor (encoding 0b11) is used as the fine page table
base address. The exact bit fields and the format of the first and second-level descriptors are found
in Section 2.3.2, “New Page Attributes” on page 2-9.
The attributes associated with a particular region of memory are configured in the memory
management page table and control the behavior of accesses to the instruction cache, data cache,
mini-data cache, and the write buffer. These attributes are ignored when the MMU is disabled.
To allow compatibility with older system software, the new Intel® XScale™ core attributes take
advantage of encoding space in the descriptors that was formerly reserved and defaulted to zero.
3.2.2 Instruction Cache
When examining the X, C, and B bits in a descriptor, the Instruction Cache only utilizes the C bit.
If the C bit is clear, the Instruction Cache considers a code fetch from that memory to be non-
cacheable, and will not fill a cache entry. If the C bit is set, then fetches from the associated
memory region will be cached.
3.2.3 Data Cache and Write Buffer
All of the X, C & B descriptor bits affect the behavior of the Data Cache and the Write Buffer.
If the X bit for a descriptor is zero, the C and B bits operate as mandated by the ARM* architecture.
This behavior is detailed in Table 3-1.
If the X bit for a descriptor is one, the C and B bits’ meaning is extended, as detailed in Table 3-2.
Table 3-1. Data Cache and Buffer Behavior when X = 0

C  B      Cacheable?  Bufferable?  Write Policy   Line Allocation Policy  Notes
0  0      N           N            -              -                       Stall until complete (a)
0  1 (b)  N           Y            -              -
1  0 (c)  Y           Y            Write Through  Read Allocate
1  1      Y           Y            Write Back     Read Allocate

a. Normally, the processor will continue executing after a data access if no dependency on that
access is encountered. With this setting, the processor will stall execution until the data access
completes. This guarantees to software that the data access has taken effect by the time execution
of the data access instruction completes. External data aborts from such accesses will be imprecise
(but see Section 2.3.4.4 for a method to shield code from this imprecision).
b. Different from StrongARM, which for these attributes did not coalesce in the Write Buffer.
c. Different from StrongARM, which for these attributes selected the mini data cache (see the
X = 1 case).
3.2.4 Details on Data Cache and Write Buffer Behavior
If the MMU is disabled all data accesses will be non-cacheable and non-bufferable. This is the
same behavior as when the MMU is enabled, and a data access uses a descriptor with X, C, and B
all set to 0.
The X, C, and B bits determine when the processor should place new data into the Data Cache. The
cache places data into the cache in lines (also called blocks). Thus, the basis for making a decision
about placing new data into the cache is called a “Line Allocation Policy”.
If the Line Allocation Policy is read-allocate, all load operations that miss the cache request a 32-
byte cache line from external memory and allocate it into either the data cache or mini-data cache
(this is assuming the cache is enabled). Store operations that miss the cache will not cause a line to
be allocated.
If read/write-allocate is in effect, load or store operations that miss the cache will request a 32-byte
cache line from external memory if the cache is enabled.
The other policy determined by the X, C, and B bits is the Write Policy. A write-through policy
instructs the Data Cache to keep external memory coherent by performing stores to both external
memory and the cache. A write-back policy only updates external memory when a line in the cache
is cleaned or needs to be replaced with a new line. Generally, write-back provides higher
performance because it generates less data traffic to external memory.
More details on cache policies can be found in Section 6.2.3, “Cache Policies” on page 6-4.
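The combined effect of the X, C, and B bits (Table 3-1 and Table 3-2) can be captured in an eight-entry lookup. The C sketch below is illustrative only; the type and function names are hypothetical. For the mini data cache row the manual defers the policy to the MD field of the Auxiliary Control register, so the sketch only flags that case and leaves the remaining fields unset.

```c
#include <stdbool.h>

typedef enum { WP_NONE, WP_WRITE_THROUGH, WP_WRITE_BACK } write_policy;
typedef enum { LA_NONE, LA_READ, LA_READ_WRITE } line_alloc;

typedef struct {
    bool valid;         /* false for X=1, C=0, B=0 (unpredictable, do not use) */
    bool mini_dcache;   /* true for X=1, C=1, B=0; policy comes from the MD field */
    bool cacheable;
    bool bufferable;
    write_policy wp;
    line_alloc la;
} mem_attr;

/* Decode the X, C, B descriptor bits into the data cache and write buffer
 * behavior listed in Tables 3-1 and 3-2. Each argument is 0 or 1. */
static mem_attr decode_xcb(unsigned x, unsigned c, unsigned b)
{
    static const mem_attr table[8] = {
        /* X C B      valid  mini   cache  buffer write policy      allocation   */
        /* 0 0 0 */ { true,  false, false, false, WP_NONE,          LA_NONE },
        /* 0 0 1 */ { true,  false, false, true,  WP_NONE,          LA_NONE },
        /* 0 1 0 */ { true,  false, true,  true,  WP_WRITE_THROUGH, LA_READ },
        /* 0 1 1 */ { true,  false, true,  true,  WP_WRITE_BACK,    LA_READ },
        /* 1 0 0 */ { false, false, false, false, WP_NONE,          LA_NONE },
        /* 1 0 1 */ { true,  false, false, true,  WP_NONE,          LA_NONE },
        /* 1 1 0 */ { true,  true,  false, false, WP_NONE,          LA_NONE },
        /* 1 1 1 */ { true,  false, true,  true,  WP_WRITE_BACK,    LA_READ_WRITE },
    };
    unsigned idx = ((x & 1u) << 2) | ((c & 1u) << 1) | (b & 1u);
    return table[idx];
}
```

The X=0, C=0, B=1 and X=1, C=0, B=1 rows differ only in write coalescing, which this sketch does not model.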
3.2.5 Memory Operation Ordering
A fence memory operation (memop) is one that guarantees all memops issued prior to the fence
will execute before any memop issued after the fence. Thus software may issue a fence to impose a
partial ordering on memory accesses.
Table 3-3 on page 3-4 shows the circumstances in which memops act as fences.
Table 3-2. Data Cache and Buffer Behavior when X = 1

C  B  Cacheable?         Bufferable?  Write Policy  Line Allocation Policy  Notes
0  0  -                  -            -             -                       Unpredictable -- do not use
0  1  N                  Y            -             -                       Writes will not coalesce into buffers (a)
1  0  (Mini Data Cache)  -            -             -                       Cache policy is determined by MD field
                                                                            of Auxiliary Control register (b)
1  1  Y                  Y            Write Back    Read/Write Allocate

a. Normally, bufferable writes can coalesce with previously buffered data in the same address range.
b. See Section 7.2.2 for a description of this register.
Any swap (SWP or SWPB), to a page that would create a fence, is also a fence.
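The fence conditions of Table 3-3 can be expressed as a predicate on the operation type and the page attributes. The C sketch below uses hypothetical names and makes one interpretive assumption that the manual does not spell out: a swap is treated as a fence when either its load half or its store half would be a fence on that page.

```c
#include <stdbool.h>

typedef enum { MEMOP_LOAD, MEMOP_STORE, MEMOP_SWAP } memop;

/* Returns true when the memory operation acts as a fence according to
 * Table 3-3. x, c and b are the page attribute bits (each 0 or 1). */
static bool memop_is_fence(memop op, unsigned x, unsigned c, unsigned b)
{
    /* Row "load": X = don't care, C = 0, B = don't care. */
    bool load_fence = (c == 0);

    /* Row "store": X = 1, C = 0, B = 1.
     * Row "load or store": X = 0, C = 0, B = 0. */
    bool store_fence = (x == 1 && c == 0 && b == 1) ||
                       (x == 0 && c == 0 && b == 0);

    switch (op) {
    case MEMOP_LOAD:  return load_fence;
    case MEMOP_STORE: return store_fence;
    case MEMOP_SWAP:  return load_fence || store_fence;  /* assumption, see above */
    }
    return false;
}
```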
3.2.6 Exceptions
The MMU can generate prefetch aborts for instruction accesses and data aborts for data memory
accesses. The types and priorities of these exceptions are described in Section 2.3.4, “Event
Architecture” on page 2-11.
Data address alignment checking is enabled by setting bit 1 of the Control Register (CP15,
register 1). Alignment faults are still reported even if the MMU is disabled. All other MMU
exceptions are disabled when the MMU is disabled.
3.3 Interaction of the MMU, Instruction Cache, and Data Cache
The MMU, instruction cache, and data/mini-data cache may be enabled/disabled independently.
The instruction cache can be enabled with the MMU enabled or disabled. However, the data cache
can only be enabled when the MMU is enabled. Therefore only three of the four combinations of
the MMU and data/mini-data cache enables are valid. The invalid combination will cause
undefined results.
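System software may want to reject the invalid combination before writing the Control Register. The C sketch below shows one way to check a candidate value; the macro and function names are hypothetical. Bits 0, 1 and 12 are as described in this manual, while the data cache enable at bit 2 is an assumption taken from the standard ARM* control register layout and should be confirmed against the register description in Chapter 7.

```c
#include <stdbool.h>
#include <stdint.h>

#define CR_MMU_ENABLE     (1u << 0)   /* bit 0: MMU enable */
#define CR_ALIGN_CHECK    (1u << 1)   /* bit 1: data address alignment checking */
#define CR_DCACHE_ENABLE  (1u << 2)   /* bit 2: data cache enable (assumed) */
#define CR_ICACHE_ENABLE  (1u << 12)  /* bit 12: instruction cache enable */

/* Returns false only for the one invalid combination:
 * data/mini-data cache enabled while the MMU is disabled. */
static bool control_value_is_valid(uint32_t cr)
{
    bool mmu    = (cr & CR_MMU_ENABLE) != 0;
    bool dcache = (cr & CR_DCACHE_ENABLE) != 0;

    return mmu || !dcache;
}
```

The instruction cache bit does not take part in the check, since the instruction cache may be enabled with the MMU on or off.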
3.4 Control
3.4.1 Invalidate (Flush) Operation
The entire instruction and data TLBs can be invalidated at the same time with one command or
they can be invalidated separately. An individual entry in the data or instruction TLB can also be
invalidated. See Table 7-13, “TLB Functions” on page 7-11 for a listing of commands supported by
the Intel® XScale™ core.
Table 3-3. Memory Operations that Impose a Fence

operation      X  C  B
load           -  0  -
store          1  0  1
load or store  0  0  0
Table 3-4. Valid MMU & Data/mini-data Cache Combinations
MMU Data/mini-data Cache
Off Off
On Off
On On
Globally invalidating a TLB will not affect locked TLB entries. However, the invalidate-entry
operations can invalidate individual locked entries. In this case, the locked entry remains in the
TLB, but will never “hit” on an address translation. Effectively, a hole is in the TLB. This situation
can be rectified by unlocking the TLB.
3.4.2 Enabling/Disabling
The MMU is enabled by setting bit 0 in coprocessor 15, register 1 (Control Register).
When the MMU is disabled, accesses to the instruction cache default to cacheable and all accesses
to data memory are made non-cacheable.
A recommended code sequence for enabling the MMU is shown in Example 3-1.
3.4.3 Locking Entries
Individual entries can be explicitly loaded & locked into the instruction and data TLBs. See
Table 7-16, “TLB Lockdown Functions” on page 7-12 for the exact commands.
Note: If a lock operation finds the virtual address translation already resident in the TLB, the results are
unpredictable. An invalidate entry command, see Table 7-13, “TLB Functions” on page 7-11,
before the lock command will ensure proper operation. Software can also accomplish this by
invalidating all entries, as shown in Example 3-2 on page 3-6.
Locking entries into either the instruction TLB or data TLB reduces the available number of entries
(by the number that was locked down) for hardware to cache other virtual to physical address
translations.
A procedure for locking entries into the instruction TLB is shown in Example 3-2 on page 3-6.
Example 3-1. Enabling the MMU
; This routine provides software with a predictable way of enabling the MMU.
; After the CPWAIT, the MMU is guaranteed to be enabled. Be aware
; that the MMU will be enabled sometime after MCR and before the instruction
; that executes after the CPWAIT.
; Programming Note: This code sequence requires a one-to-one virtual to
; physical address mapping on this code since
; the MMU may be enabled part way through. This would allow the instructions
; after MCR to execute properly regardless of the state of the MMU.
MRC P15,0,R0,C1,C0,0 ; Read CP15, register 1
ORR R0, R0, #0x1     ; Turn on the MMU
MCR P15,0,R0,C1,C0,0 ; Write to CP15, register 1
; For a description of CPWAIT, see
; Section 2.3.3, “Additions to CP15 Functionality” on page 2-10
CPWAIT
; The MMU is guaranteed to be enabled at this point; the next instruction or
; data address will be translated.
If an MMU abort is generated during an instruction or data TLB lock operation, the Fault Status
Register is updated to indicate a Lock Abort (see Section 2.3.4.4, “Data Aborts” on page 2-12), and
the exception is reported as a data abort.
Note: If exceptions are allowed to occur in the middle of this routine, the TLB may end up caching a
translation that is about to be locked. For example, if R1 is the virtual address of an interrupt
service routine and that interrupt occurs immediately after the TLB has been invalidated, the lock
operation will be ignored when the interrupt service routine returns back to this code sequence.
Software should disable interrupts (FIQ or IRQ) in this case.
As a general rule, software should avoid locking in anything other than Supervisor mode.
The proper procedure for locking entries into the data TLB is shown in Example 3-3 on page 3-7.
Example 3-2. Locking Entries into the Instruction TLB
; R1, R2 and R3 contain the virtual addresses to translate and lock into
; the instruction TLB.
; The value in R0 is ignored in the following instruction.
; Hardware guarantees that accesses to CP15 occur in program order
MCR P15,0,R0,C8,C5,0 ; Invalidate the entire instruction TLB
MCR P15,0,R1,C10,C4,0 ; Translate virtual address (R1) and lock into
; instruction TLB
MCR P15,0,R2,C10,C4,0 ; Translate virtual address (R2) and lock into
; instruction TLB
MCR P15,0,R3,C10,C4,0 ; Translate virtual address (R3) and lock into
; instruction TLB
CPWAIT
; For a description of CPWAIT, see
; Section 2.3.3, “Additions to CP15 Functionality” on page 2-10
; The MMU is guaranteed to be updated at this point; the next instruction will
; see the locked instruction TLB entries.
Note: If exceptions are allowed to occur in the middle of this routine, the TLB may end up caching a
translation into a data TLB entry that was about to be locked. In general exceptions should be
avoided while locking TLB entries. Software should preferably lock TLBs in Supervisor mode.
3.4.4 Round-Robin Replacement Algorithm
The line replacement algorithm for the TLBs is round-robin; there is a round-robin pointer that
keeps track of the next entry to replace. The next entry to replace is the one sequentially after the
last entry that was written. For example, if the last virtual to physical address translation was
written into entry 5, the next entry to replace is entry 6.
At reset, the round-robin pointer is set to entry 31. Once a translation is written into entry 31, the
round-robin pointer gets set to the next available entry, beginning with entry 0 if no entries have
been locked down. Subsequent translations move the round-robin pointer to the next sequential
entry until entry 31 is reached, where it will wrap back to entry 0 upon the next translation.
A lock pointer is used for locking entries into the TLB and is set to entry 0 at reset. A TLB lock
operation places the specified translation at the entry designated by the lock pointer, moves the
lock pointer to the next sequential entry, and resets the round-robin pointer to entry 31. Locking
entries into either TLB effectively reduces the available entries for updating. For example, if the
first three entries were locked down, the round-robin pointer would be entry 3 after it rolled over
from entry 31.
Only entries 0 through 30 can be locked in either TLB; entry 31 can never be locked. If the lock
pointer is at entry 31, a lock operation will update the TLB entry with the translation and ignore the
lock. In this case, the round-robin pointer will stay at entry 31.
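The interaction between the round-robin pointer and the lock pointer can be modeled in a few lines of C. The sketch below is a behavioral illustration of the rules in this section only. All names are hypothetical, each entry is reduced to a single 32-bit stand-in for a translation, and invalidate and unlock operations are not modeled.

```c
#include <stdint.h>

#define TLB_ENTRIES 32u

typedef struct {
    uint32_t entry[TLB_ENTRIES];  /* stand-in for a cached translation */
    unsigned rr;                  /* round-robin pointer: next entry to replace */
    unsigned lock;                /* lock pointer: next entry to lock */
} tlb_model;

/* Reset state: round-robin pointer at entry 31, lock pointer at entry 0. */
static void tlb_reset(tlb_model *t)
{
    for (unsigned i = 0; i < TLB_ENTRIES; i++)
        t->entry[i] = 0;
    t->rr = TLB_ENTRIES - 1;
    t->lock = 0;
}

/* Hardware fill after a miss. Returns the index that was written. After
 * entry 31 the pointer wraps to the first unlocked entry. */
static unsigned tlb_fill(tlb_model *t, uint32_t translation)
{
    unsigned idx = t->rr;

    t->entry[idx] = translation;
    t->rr = (idx == TLB_ENTRIES - 1) ? t->lock : idx + 1;
    return idx;
}

/* Lock operation. Returns 1 if the entry was locked, or 0 if the lock
 * pointer was already at entry 31 and the lock was ignored. */
static int tlb_lock(tlb_model *t, uint32_t translation)
{
    if (t->lock == TLB_ENTRIES - 1) {
        t->entry[TLB_ENTRIES - 1] = translation;  /* written, not locked */
        t->rr = TLB_ENTRIES - 1;                  /* pointer stays at 31 */
        return 0;
    }
    t->entry[t->lock++] = translation;
    t->rr = TLB_ENTRIES - 1;                      /* lock resets the pointer */
    return 1;
}
```

Locking three entries and then taking two misses reproduces the example in the text: the first fill lands in entry 31 and the second in entry 3.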
Example 3-3. Locking Entries into the Data TLB
; R1 and R2 contain the virtual addresses to translate and lock into the data TLB
MCR P15,0,R1,C8,C6,1 ; Invalidate the data TLB entry specified by the
; virtual address in R1
MCR P15,0,R1,C10,C8,0 ; Translate virtual address (R1) and lock into
; data TLB
; Repeat sequence for virtual address in R2
MCR P15,0,R2,C8,C6,1 ; Invalidate the data TLB entry specified by the
; virtual address in R2
MCR P15,0,R2,C10,C8,0 ; Translate virtual address (R2) and lock into
; data TLB
CPWAIT ; wait for locks to complete
; For a description of CPWAIT, see
; Section 2.3.3, “Additions to CP15 Functionality” on page 2-10
; The MMU is guaranteed to be updated at this point; the next instruction will
; see the locked data TLB entries.
Figure 3-1. Example of Locked Entries in TLB
[Figure: a 32-entry TLB, entry 0 through entry 31, with entries 0 through 7 marked Locked.
Eight entries locked, 24 entries available for round robin replacement.]
4 Instruction Cache
The Intel® XScale™ core instruction cache enhances performance by reducing the number of
instruction fetches from external memory. The cache provides fast execution of cached code. Code
can also be locked down when guaranteed or fast access time is required. An additional 2Kbyte
mini instruction cache is used exclusively during debugging, see Section 10.13.6 for details.
4.1 Overview
Figure 4-1 shows the cache organization and how the instruction address is used to access the
cache.
The instruction cache is a 32-Kbyte, 32-way set associative cache; this means there are 32 sets with
each set containing 32 ways. Each way of a set contains eight 32-bit words and one valid bit, which
is referred to as a line. The replacement policy is a round-robin algorithm. The cache also supports
the ability to lock code in at a line granularity.
The instruction cache is virtually addressed and virtually tagged. Tag bits[1:0] are ignored as far as
the cache lookup is concerned. Bit 1 may be used subsequently to select a Thumb instruction.
Figure 4-1. Instruction Cache Organization
[Figure: Set 0 through Set 31, each containing way 0 through way 31. Each way pairs a CAM
entry (Tag) with a cache line of 8 words of instructions. This example shows Set 0 being
selected by the set index; Word Select then picks the instruction word (4 bytes) from the line.]

Instruction Address (Virtual)
Bits     Field
[31:10]  Tag
[9:5]    Set Index
[4:2]    Word (Word Select)
[1:0]    (not labeled)

CAM: Content Addressable Memory
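The address fields shown in Figure 4-1 can be extracted with simple shifts and masks. The following C sketch is illustrative; the type and function names are hypothetical, and the address passed in is assumed to be the virtual address after any remapping by the PID register.

```c
#include <stdint.h>

typedef struct {
    uint32_t tag;   /* address bits [31:10], compared against the CAM */
    uint32_t set;   /* address bits [9:5], selects one of 32 sets */
    uint32_t word;  /* address bits [4:2], selects one of 8 words in the line */
} icache_addr;

/* Split a virtual instruction address into the fields of Figure 4-1.
 * Bits [1:0] are ignored for the lookup. */
static icache_addr icache_split(uint32_t va)
{
    icache_addr a;

    a.tag  = va >> 10;
    a.set  = (va >> 5) & 0x1Fu;
    a.word = (va >> 2) & 0x7u;
    return a;
}
```

The field widths are consistent with the cache geometry: 32 sets of 32 ways, each way holding a 32-byte line, gives 32 Kbytes.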
Note: The virtual address presented to the instruction cache may be remapped by the PID register. See
Section 7.2.11, “Register 13: Process ID” on page 7-12 for a description of the PID register.
4.2 Operation
4.2.1 Instruction Cache is Enabled
When the cache is enabled, it compares every instruction request address against the addresses of
instructions that it is currently holding. If the cache contains the requested instruction, the access
“hits” the cache, and the cache returns the requested instruction. If the cache does not contain the
requested instruction, the access “misses” the cache, and the cache requests a fetch from external
memory of the 8-word line (32 bytes) that contains the requested instruction using the fetch policy
described in Section 4.2.3. As the fetch returns instructions to the cache, they are placed in one of
two fetch buffers and the requested instruction is delivered to the instruction decoder.
A fetched line will be written from the fetch buffer into the cache if it is cacheable. Code is
designated as eligible for caching when the Memory Management Unit (MMU) is disabled
(see Section 4.2.2) or when the MMU is enabled and the cacheable (C) bit is set to 1 in its
corresponding page. See Section 2.3.2, “New Page Attributes” for details on page attributes.
Disabling the MMU requires invalidation of the instruction cache first to prevent accidental
matching of physical addresses with the existing virtual addresses in the cache.
Note: An instruction fetch may miss the cache but hit one of the fetch buffers. When this happens, the
requested instruction will be delivered to the instruction decoder in the same manner as a cache hit.
4.2.2 The Instruction Cache Is Disabled
Disabling the cache prevents any lines from being loaded into the instruction cache from memory.
While the cache is disabled, it is still accessed and may generate a hit if the instruction is already in
the cache.
Disabling the instruction cache does not disable instruction buffering that may occur within the
instruction fetch buffers. The two 32-byte instruction fetch buffers will always be enabled while
the cache is disabled. So long as instruction fetches continue to hit within either buffer (even in the
presence of forward and backward branches), no external fetches for instructions are generated. A
miss causes one or the other buffer to be filled from external memory using the fetch policy
described in Section 4.2.3.
To flush the Instruction Fetch Buffers see Section 4.3.3.
4.2.3 Fetch Policy
An instruction-cache miss occurs when the requested instruction is not found in the instruction
fetch buffers or instruction cache; a fetch request is then made to external memory. The instruction
cache can handle up to two misses. Each external fetch request uses a fetch buffer that holds 32-
bytes and eight valid bits, one for each word.
A miss causes the following:
1. A fetch buffer is allocated.
2. The instruction cache sends a fetch request to the external bus. This request is for a complete
32-byte cache line that will fill in sequence.
3. Instruction words are returned back from the external bus, at a maximum rate of 1 word per
core cycle but typically much slower for Flash or SDRAM. As each word returns, the
corresponding valid bit is set for the word in the currently selected fetch buffer.
4. As soon as the fetch buffer receives the requested instruction, it forwards the instruction to the
instruction decoder for execution. Note at this point, with the pipeline restarting, another
instruction cache miss may occur necessitating the use of the other Instruction Fetch Buffer.
5. When all words to an Instruction Fetch Buffer have returned, the fetched line will be written
into the instruction cache if it’s cacheable and if the instruction cache is enabled. The line
chosen for update in the cache is controlled by the round-robin replacement algorithm. This
update may evict a valid line at that location.
6. Once the cache is updated, the eight valid bits of the fetch buffer are invalidated and the
cacheline valid bit is set.
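The per-word valid bits that drive steps 3 through 6 can be modeled as a small state record. The C sketch below is a behavioral illustration only; the names are hypothetical, and bus timing, the second fetch buffer, and the write into the cache itself are not modeled.

```c
#include <stdbool.h>
#include <stdint.h>

#define WORDS_PER_LINE 8u

typedef struct {
    uint32_t word[WORDS_PER_LINE];
    uint8_t  valid;   /* one valid bit per word, bit i covers word[i] */
} fetch_buffer;

/* Steps 1 and 6: a freshly allocated or retired buffer has no valid words. */
static void fb_clear(fetch_buffer *fb)
{
    fb->valid = 0;
}

/* Step 3: a word returns from the bus. idx must be in the range 0..7. */
static void fb_word_returned(fetch_buffer *fb, unsigned idx, uint32_t value)
{
    fb->word[idx] = value;
    fb->valid |= (uint8_t)(1u << idx);
}

/* Step 4: the requested instruction can be forwarded to the decoder as
 * soon as its own word is valid, before the rest of the line arrives. */
static bool fb_can_forward(const fetch_buffer *fb, unsigned idx)
{
    return ((fb->valid >> idx) & 1u) != 0;
}

/* Step 5: the line may be written into the cache once all words are valid. */
static bool fb_line_complete(const fetch_buffer *fb)
{
    return fb->valid == 0xFFu;
}
```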
4.2.4 Round-Robin Replacement Algorithm
The line replacement algorithm for the instruction cache is round-robin. Each set in the instruction
cache has a round-robin pointer that keeps track of the next line (in that set) to replace. The next
line to replace in a set is the one after the last line that was written from a fetch buffer. For example,
if the line for the last external instruction fetch was written into way 5-set 2, the next line to replace
for that set would be way 6. The other round-robin pointers for the other sets will not be affected.
After reset, way 31 is pointed to by the round-robin pointer for all the sets. Once a line is written
into way 31, the round-robin pointer points to the first available way of a set, beginning with way 0
if no lines have been locked into that particular set. Locking lines into the instruction cache reduces
the available lines for cache updating. For example, if the first three lines of a set were locked
down, the round-robin pointer would point to the line at way 3 after it rolled over from way 31.
Refer to Section 4.3.4, “Locking Instructions in the Instruction Cache” on page 4-6 for more details
on cache locking.
4.2.5 Parity Protection
The instruction cache is protected by parity to ensure data integrity. Each instruction cache word
has 1 parity bit. The instruction cache tag is not parity protected. When a parity error is detected on
an instruction cache access, a prefetch abort exception occurs if the Intel® XScale™ core attempts
to execute the instruction. Before servicing the exception, hardware places a notification of the
error in the Fault Status Register (Coprocessor 15, register 5).
A software exception handler can recover from an instruction cache parity error. This can be
accomplished by invalidating the instruction cache and the branch target buffer and then returning
to the instruction that caused the prefetch abort exception. A simplified code example is shown in
Example 4-1, Recovering from an Instruction Cache Parity Error. A more complex handler might
choose to invalidate the specific line that caused the exception and then invalidate the BTB.
If a parity error occurs on an instruction that is locked in the cache, the software exception handler
needs to unlock the instruction cache, invalidate the cache and then re-lock the code in before it
returns to the faulting instruction.
4.2.6 Instruction Fetch Latency
The instruction fetch latency is dependent on the core to memory frequency ratio, system bus
bandwidth, system memory, etc. The outstanding external memory bus activity on the PXA255
processor will have the highest impact on instruction fetch latency.
4.2.7 Instruction Cache Coherency
The instruction cache does not detect modification to program memory by loads, stores or actions
of other bus masters. Several situations may require program memory modification, such as
uploading code from disk.
The application program is responsible for synchronizing code modification and invalidating the
cache. In general, software must ensure that modified code space is not accessed until both
memory modification and invalidation of the instruction cache are completed.
To achieve cache coherence, instruction cache contents can be invalidated after code modification
in external memory is complete. Refer to Section 4.3.3, “Invalidating the Instruction Cache” on
page 4-5 for the proper procedure for invalidating the instruction cache.
If the instruction cache is not enabled, or code is being written to a non-cacheable region, software
must still invalidate the instruction cache before using the newly-written code. This precaution
ensures that state associated with the new code is not buffered elsewhere in the processor, such as
the fetch buffers or the BTB.
When code is written into memory, the writes go through the data cache, not the instruction cache.
Care must be taken to force the writes completely out of the processor data path into external
memory before attempting to execute the code. If writing into a non-cacheable region, flushing the
write buffers is a sufficient precaution (see Section 7.2.7 for a description of this operation). If
writing to a cacheable region, the data cache should be subjected to a Clean/Invalidate operation
(see Section 6.3.3.1) to ensure coherency. Any data cache cleaning must be done before the
previously mentioned instruction cache invalidation. This applies to any code that is created,
modified, or simply written into memory before it is executed. Typical examples are copying ROM
code to DRAM and loading dynamic libraries.
Example 4-1. Recovering from an Instruction Cache Parity Error
; Prefetch abort handler
MCR P15,0,R0,C7,C5,0 ; Invalidate the instruction cache and branch target
; buffer
CPWAIT ; wait for effect (see Section 2.3.3 for a
; description of CPWAIT)
SUBS PC,R14,#4 ; Returns to the instruction that generated the
; parity error
4.3 Instruction Cache Control
4.3.1 Instruction Cache State at RESET
After reset, the instruction cache is always disabled, unlocked, and invalidated (flushed).
4.3.2 Enabling/Disabling
The instruction cache is enabled by setting bit 12 in coprocessor 15, register 1 (Control Register).
This process is illustrated in Example 4-2, Enabling the Instruction Cache.
4.3.3 Invalidating the Instruction Cache
The entire instruction cache, along with the fetch buffers, is invalidated by writing to
coprocessor 15, register 7. (See Table 7-12, “Cache Functions” on page 7-9 for the exact
command.) This command does not unlock any lines that were locked in the instruction cache or
invalidate those locked lines. To invalidate the entire cache including locked lines, the unlock
instruction cache command needs to be executed before the invalidate command. This unlock
command can also be found in Table 7-14, “Cache Lockdown Functions” on page 7-11.
There is an inherent delay between execution of the instruction cache invalidate command and the
point at which a subsequent instruction sees the result of the invalidate. The routine in
Example 4-3 can be used to guarantee proper synchronization.
The Intel® XScale™ core also supports invalidating an individual line from the instruction cache.
See Table 7-12, “Cache Functions” on page 7-9 for the exact command.
Example 4-2. Enabling the Instruction Cache
; Enable the Instruction Cache
MRC P15, 0, R0, C1, C0, 0 ; Read the control register
ORR R0, R0, #0x1000 ; set bit 12 -- the I bit
MCR P15, 0, R0, C1, C0, 0 ; Write the control register
CPWAIT ; wait for effect, see Section 2.3.3
Example 4-3. Invalidating the Instruction Cache
MCR P15,0,R1,C7,C5,0 ; Invalidate the instruction cache and branch
; target buffer
CPWAIT ; wait for effect, see Section 2.3.3
; The instruction cache is guaranteed to be invalidated at this point; the next
; instruction sees the result of the invalidate command.
4.3.4 Locking Instructions in the Instruction Cache
Software has the ability to lock performance-critical code routines into the instruction cache. Up to
28 lines in each set can be locked; hardware ignores the lock command if software tries to lock all
the lines in a particular set, i.e., ways 28-31 can never be locked. When this happens, the line is
still allocated into the cache but the lock is ignored. The round-robin pointer stays between way 28
and way 31 for that set.
Lines can be locked into the instruction cache by initiating a write to coprocessor 15. See
Table 7-14, “Cache Lockdown Functions” on page 7-11 for the exact command.
Lines are locked into a set starting at way 0 and may progress up to way 27; which set a line gets
locked into depends on the set index of the virtual address. Figure 4-2 is an example of where lines
of code may be locked into the cache along with how the round-robin pointer is affected.
There are several requirements for locking down code. Failure to follow these requirements will
produce unpredictable results when accessing the instruction cache.
• The routine used to lock lines down in the cache must be placed in non-cacheable memory,
which means the MMU is enabled. Hence fetches of cacheable code must not occur while
locking instructions into the cache.
• The code being locked into the cache must be cacheable.
• The instruction cache must be enabled and invalidated prior to locking down lines.
Note: Exceptions must be disabled during cache locking, to prevent the wrong contents from being
locked down.
System programmers must ensure that the code to lock instructions into the cache is not closer than
128 bytes to a non-cacheable/cacheable page boundary. If the processor fetches ahead into a
cacheable page, then the first requirement noted above could be violated.
Figure 4-2. Locked Line Effect on Round Robin Replacement
[Figure: sets 0, 1, 2, ... 31, each with ways 0 through 31; the locked ways in each set are marked,
starting at way 0.]
set 0: 8 ways locked, 24 ways available for round robin replacement
set 1: 23 ways locked, 9 ways available for round robin replacement
set 2: 28 ways locked, only ways 28-31 available for replacement
set 31: all 32 ways available for round robin replacement
Software can lock down several different routines located at different memory locations. This may
cause some sets to have more locked lines than others as shown in Figure 4-2.
Example 4-4, Locking Code into the Instruction Cache shows how a routine, called “lockMe” in
this example, might be locked into the instruction cache. The example is incomplete because it
does not by itself satisfy the requirements and notes given earlier in this section. However, if the
code runs with interrupts disabled and the assembler places the locking code above the code to be
locked in memory, which avoids the prefetch issue, then those requirements can be satisfied.
4.3.5 Unlocking Instructions in the Instruction Cache
The Intel® XScale™ core provides a global unlock command for the instruction cache. There is no
unlock function for individual lines in the cache.
Writing to coprocessor 15, register 9 unlocks all the locked lines in the instruction cache and leaves
them valid. These lines then become available for the round-robin replacement algorithm. (See
Table 7-14, “Cache Lockdown Functions” on page 7-11 for the exact command.)
Example 4-4. Locking Code into the Instruction Cache
lockMe: ; This is the code that will be locked into the cache
mov r0, #5
add r5, r1, r2
. . .
lockMeEnd:
. . .
codeLock: ; here is the code to lock the “lockMe” routine
ldr r0, =(lockMe AND NOT 31) ; r0 gets a pointer to the first line to lock
ldr r1, =(lockMeEnd AND NOT 31) ; r1 gets a pointer to the last line to lock
lockLoop:
mcr p15, 0, r0, c9, c1, 0 ; lock next line of code into I-Cache
cmp r0, r1 ; are we done yet?
add r0, r0, #32 ; advance pointer to next line
bne lockLoop ; if not done, do the next line
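As a cross-check on the loop bounds in Example 4-4, the number of lock commands the loop issues can be modeled on a host machine. The following C sketch is illustrative only and is not XScale code: the function name `lines_locked` is hypothetical, and the only facts it takes from the example are the 32-byte line size and the fact that both the first and the last line pointer are locked.

```c
#include <stdint.h>

#define ICACHE_LINE_BYTES 32u  /* line stride used by Example 4-4 (add r0, r0, #32) */

/* Hypothetical helper: number of lock commands that lockLoop in Example 4-4
 * issues for a routine that starts at `start` and whose lockMeEnd label is
 * at `end`. Both addresses are rounded down to a line boundary (the
 * "AND NOT 31" in the example) and the last line is included. */
uint32_t lines_locked(uint32_t start, uint32_t end)
{
    uint32_t first = start & ~(ICACHE_LINE_BYTES - 1u);
    uint32_t last  = end   & ~(ICACHE_LINE_BYTES - 1u);
    return (last - first) / ICACHE_LINE_BYTES + 1u;
}
```

Each locked line consumes one way in the set selected by its address, so a count like this may help when checking a routine against the limit of 28 locked lines per set.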
5 Branch Target Buffer
The Intel® XScale™ core uses dynamic branch prediction to reduce the penalties associated with
changing the flow of program execution. The Intel® XScale™ core features a branch target buffer
that provides the instruction cache with the target address of branch type instructions. The branch
target buffer is implemented as a 128-entry, direct mapped cache.
This chapter is primarily for those optimizing their code for performance. An understanding of the
branch target buffer is needed in this case so that code can be scheduled to best utilize the
performance benefits of the branch target buffer.
5.1 Branch Target Buffer (BTB) Operation
The BTB stores the history of branches that have executed along with their targets. Figure 5-1
shows an entry in the BTB, where the tag is the instruction address of a previously executed branch
and the data contains the target address of the previously executed branch along with two bits of
history information.
The BTB takes the current instruction address and checks whether this address is a branch that was
previously seen. It uses bits [8:2] of the current address to select an entry in the BTB and then
compares that entry’s tag to bits [31:9,1] of the current instruction address. If the current
instruction address matches the tag and the history bits indicate that this branch has usually been
taken in the past, the BTB uses the data (target address) as the next instruction address to send to
the instruction cache.
Bit[1] of the branch address is included in the tag comparison in order to support Thumb execution.
This organization means that two consecutive Thumb branch (B) instructions, with instruction
address bits[8:2] the same, will contend for the same BTB entry. Thumb also requires 31 bits for
the branch target address. In ARM* mode, bit[1] is zero.
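The index and tag selection described above can be sketched as a small host-side C model. This is an illustration, not a description of the hardware implementation: the function names are hypothetical, and packing bit [1] into the least significant bit of the tag is a choice made for this sketch only.

```c
#include <stdint.h>

/* Entry index: bits [8:2] of the instruction address (128 entries). */
uint32_t btb_index(uint32_t addr)
{
    return (addr >> 2) & 0x7Fu;
}

/* Tag: address bits [31:9] together with bit [1]. Placing bit [1] in the
 * least significant position is an assumption of this sketch. */
uint32_t btb_tag(uint32_t addr)
{
    return ((addr >> 9) << 1) | ((addr >> 1) & 1u);
}

/* Two branch addresses contend for one entry when they select the same
 * index but carry different tags. */
int btb_conflict(uint32_t a, uint32_t b)
{
    return btb_index(a) == btb_index(b) && btb_tag(a) != btb_tag(b);
}
```

For example, Thumb branches at 0x8000 and 0x8002 share bits [8:2] and differ only in bit [1], so `btb_conflict(0x8000, 0x8002)` reports contention, which matches the consecutive Thumb branch case described above.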
The history bits represent four possible prediction states for a branch entry in the BTB. Figure 5-2,
“Branch History” shows these states along with the possible transitions. The initial state for
branches stored in the BTB is Weakly-Taken (WT). Every time a branch that exists in the BTB is
executed, the history bits are updated to reflect the latest outcome of the branch, either taken or
not-taken.
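The four history states can be modeled as a two-bit counter. The C sketch below rests on two assumptions that should be checked against Figure 5-2: that each taken branch moves the state one step toward ST and each not-taken branch one step toward SN, saturating at both ends, and that a branch is predicted taken in the WT and ST states. The enum values and function names are hypothetical.

```c
typedef enum { SN = 0, WN = 1, WT = 2, ST = 3 } btb_history_t;

/* State assigned to a branch when it is first stored in the BTB. */
#define BTB_INITIAL_STATE WT

/* Assumed transition rule: step one state toward ST on a taken branch and
 * one state toward SN on a not-taken branch, saturating at both ends. */
btb_history_t btb_update(btb_history_t h, int taken)
{
    if (taken)
        return (h == ST) ? ST : (btb_history_t)(h + 1);
    return (h == SN) ? SN : (btb_history_t)(h - 1);
}

/* Assumed prediction rule: predict taken in either of the two taken states. */
int btb_predict_taken(btb_history_t h)
{
    return h == WT || h == ST;
}
```

Under these assumptions a newly stored branch is predicted taken, and a single not-taken outcome is enough to switch its prediction to not-taken.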
Chapter 11, “Performance Considerations” describes which instructions are dynamically predicted
by the BTB and the performance penalty for mispredicting a branch.
The BTB does not have to be managed explicitly by software; it is disabled by default after reset
and is invalidated when the instruction cache is invalidated.
Figure 5-1. BTB Entry
TAG: Branch Address[31:9,1]
DATA: Target Address[31:1], History Bits[1:0]
5.1.1 Reset
After Processor Reset, the BTB is disabled and all entries are invalidated.
5.1.2 Update Policy
A new entry is stored into the BTB when the following conditions are met:
• the BTB is enabled
• the branch instruction has executed
• the branch was taken
• the branch is not currently in the BTB
The entry is then marked valid and the history bits are set to WT. If another valid branch exists at
the same entry in the BTB, it will be evicted by the new branch.
Once a branch is stored in the BTB, the history bits are updated upon every execution of the branch
as shown in Figure 5-2.
5.2 BTB Control
5.2.1 Disabling/Enabling
The BTB is always disabled at reset. Software enables the BTB through bit[11] of the Control
Register in coprocessor 15 (see Section 7.2.2).
Figure 5-2. Branch History
[Figure: state diagram of the four history states SN, WN, WT, and ST, with transitions labeled
Taken and Not Taken.]
SN: Strongly Not Taken
WN: Weakly Not Taken
WT: Weakly Taken
ST: Strongly Taken
Before enabling or disabling the BTB, software must invalidate it (described in the following
section). This action will ensure correct operation in case stale data is in the BTB. Software must
not place any branch instruction between the code that invalidates the BTB and the code that
enables/disables it.
5.2.2 Invalidation
There are four ways the contents of the BTB can be invalidated.
1. Reset
2. Software can directly invalidate the BTB via a CP15, register 7 function. Refer to
Section 7.2.7, “Register 7: Cache Functions” on page 7-9.
3. The BTB is invalidated when the Process ID Register is written.
4. The BTB is invalidated when the instruction cache is invalidated via CP15, register 7
functions.
6 Data Cache
The Intel® XScale™ core data cache enhances performance by reducing the number of data
accesses to and from external memory. There are two data cache structures in the Intel® XScale™
core, a 32 Kbyte data cache and a 2 Kbyte mini-data cache. An eight-entry write buffer and a
four-entry fill buffer are also implemented to decouple the Intel® XScale™ core instruction
execution from external memory accesses, which increases overall system performance.
6.1 Overviews
6.1.1 Data Cache Overview
The data cache is a 32-Kbyte, 32-way set associative cache; this means there are 32 sets with each
set containing 32 ways. Each way of a set contains 32 bytes (one cache line) and one valid bit.
There also exist two dirty bits for every line, one for the lower 16 bytes and the other one for the
upper 16 bytes. When a store hits the cache the dirty bit associated with that half of the cache line is
set. The replacement policy is a round-robin algorithm and the cache also supports the ability to
reconfigure each line as data RAM.
Figure 6-1 shows the cache organization and how the data address is used to access the cache.
Cache policies may be adjusted for particular regions of memory by altering page attribute bits in
the MMU descriptor that controls that memory. See Section 3.2.3 for a description of these bits.
The data cache is virtually addressed and virtually tagged. It supports write-back and write-through
caching policies. The data cache always allocates a line in the cache when a cacheable read miss
occurs and will allocate a line into the cache on a cacheable write miss when write allocate is
specified by its page attribute. Page attribute bits determine whether a line gets allocated into the
data cache or mini-data cache.
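The way a virtual data address selects a location in the data cache can be sketched in C. The field boundaries are taken from the address breakdown in Figure 6-1 (tag [31:10], set index [9:5], word [4:2], byte [1:0]). The struct and function names are hypothetical, and deriving the dirty-bit half from address bit [4] is an inference from the description of the lower and upper 16 bytes rather than a documented signal.

```c
#include <stdint.h>

typedef struct {
    uint32_t tag;        /* bits [31:10] */
    uint32_t set_index;  /* bits [9:5], selects one of 32 sets */
    uint32_t word;       /* bits [4:2], selects one of 8 words in the line */
    uint32_t byte;       /* bits [1:0], selects a byte within the word */
    uint32_t dirty_half; /* bit [4]: 0 = lower 16 bytes, 1 = upper 16 bytes
                            (inferred, see lead-in) */
} dcache_addr_t;

/* Split a virtual data address into the fields shown in Figure 6-1. */
dcache_addr_t dcache_split(uint32_t va)
{
    dcache_addr_t f;
    f.tag        = va >> 10;
    f.set_index  = (va >> 5) & 0x1Fu;
    f.word       = (va >> 2) & 0x7u;
    f.byte       = va & 0x3u;
    f.dirty_half = (va >> 4) & 0x1u;
    return f;
}
```

For address 0x435, the byte offset within the line is 21, so the access falls in set 1, word 5, byte 1, in the upper half of the line.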
6.1.2 Mini-Data Cache Overview
The mini-data cache is a 2-Kbyte, 2-way set associative cache; this means there are 32 sets with
each set containing 2 ways. Each way of a set contains 32 bytes (one cache line) and one valid bit.
There are also 2 dirty bits for every line, one for the lower 16 bytes and the other one for the upper
16 bytes. When a store hits the cache the dirty bit associated with that half of the cache line is set.
The replacement policy is a round-robin algorithm.
Figure 6-2 shows the cache organization and how the data address is used to access the cache.
The mini-data cache is virtually addressed and virtually tagged and supports the same caching
policies as the data cache. However, lines cannot be locked into the mini-data cache.
Figure 6-1. Data Cache Organization
[Figure: Set 0 through Set 31, each holding way 0 through way 31; each way is one 32-byte cache
line with a CAM entry and a DATA entry. The set index selects a set (Set 0 in the example shown),
the tag is compared in the CAM, and word select, byte select, byte alignment, and sign extension
deliver a data word (4 bytes) to the destination register.]
Data Address (Virtual): Tag = bits [31:10], Set Index = bits [9:5], Word = bits [4:2],
Byte = bits [1:0]
CAM: Content Addressable Memory
6.1.3 Write Buffer and Fill Buffer Overview
The Intel® XScale™ core employs an eight entry write buffer, each entry containing 16 bytes.
Stores to external memory are first placed in the write buffer and subsequently taken out when the
bus is available.
The write buffer supports the coalescing of multiple store requests to external memory where those
stores are to a common 16-byte aligned address location. A store to memory may coalesce with any
of the preceding eight entries so long as the store is to a bufferable memory page.
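The coalescing condition can be expressed as a simple address comparison. The C sketch below is a minimal model of that condition only; it does not model buffer occupancy or ordering. The function name is hypothetical, and the `bufferable` flag stands in for the page attribute held in the MMU descriptor.

```c
#include <stdint.h>

/* Two stores are candidates for coalescing when they fall in the same
 * 16-byte aligned block and the target page is bufferable. */
int wb_may_coalesce(uint32_t addr_a, uint32_t addr_b, int bufferable)
{
    return bufferable && ((addr_a & ~0xFu) == (addr_b & ~0xFu));
}
```

Stores to 0x1000 and 0x100C share a 16-byte block and may coalesce, while a store to 0x1010 starts a new block and needs its own entry.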
The fill buffer holds the external memory request information for a data cache or mini-data cache
fill or non-cacheable read request. Up to four 32-byte read request operations can be outstanding in
the fill buffer before the Intel® XScale™ core needs to stall.
The fill buffer has been augmented with a four entry pend buffer that captures data memory
requests to outstanding fill operations. Each entry in the pend buffer contains enough data storage
to hold one 32-bit word, specifically for store operations. Cacheable load or store operations that
hit an entry in the fill buffer get placed in the pend buffer and are completed when the associated
fill completes. Any entry in the pend buffer can be pended against any of the entries in the fill
buffer; multiple entries in the pend buffer can be pended, or postponed, awaiting a particular entry
in the fill buffer to complete.
Pended operations complete in program order.
Figure 6-2. Mini-Data Cache Organization
[Figure: Set 0 through Set 31, each holding way 0 and way 1; each way is one 32-byte cache line.
The set index selects a set, and word select, byte select, byte alignment, and sign extension
deliver a data word (4 bytes) to the destination register.]
Data Address (Virtual): Tag = bits [31:10], Set Index = bits [9:5], Word = bits [4:2],
Byte = bits [1:0]
6.2 Data Cache and Mini-Data Cache Operation
The following refers to the data cache and mini-data cache as one cache (data/mini-data) since their
behavior is the same when accessed.
6.2.1 Operation When Caching is Enabled
When the data/mini-data cache is enabled for an access, the data/mini-data cache compares the
address of the request against the addresses of data that it is currently holding. If the line containing
the address of the request is resident in the cache, the access hits the cache. For a load operation the
cache returns the requested data to the destination register and for a store operation the data is
stored into the cache. The data associated with the store may also be written to external memory if
write-through caching is specified for that area of memory. If the cache does not contain the
requested data, the access misses the cache, and the sequence of events that follows depends on the
configuration of the cache, the configuration of the MMU and the page attributes. These are
described in Section 6.2.3.2, “Read Miss Policy” and Section 6.2.3.3, “Write Miss Policy”.
6.2.2 Operation When Data Caching is Disabled
The data/mini-data cache is still accessed even when it is disabled. If a load hits the cache it will
return the requested data to the destination register. If a store hits the cache, the data is written into
the cache.
Any access that misses the cache will not allocate a line in the cache when it’s disabled, even if the
MMU is enabled and the memory region’s cacheability attribute is set. Any data reads or writes
that miss in the cache will be directed to memory as controlled by the MMU. If both the data cache
and MMU are disabled then cache misses will be issued as cycles on the application processor
internal memory bus.
Disabling the cache prevents cache line refilling, but not the cache lookup from the processor.
6.2.3 Cache Policies
6.2.3.1 Cacheability
Data at a specified address is cacheable when all of the following are true:
• the MMU is enabled
• the cacheable attribute is set in the descriptor for the accessed address
• the data/mini-data cache is enabled
6.2.3.2 Read Miss Policy
The following sequence of events occurs when a cacheable (see Section 6.2.3.1, “Cacheability”)
load operation misses the data cache:
1. The fill buffer is checked to see if an outstanding fill request already exists for that line.
If so, the current request is placed in the pending buffer and waits until the previously
requested fill completes, after which it accesses the cache again to obtain the requested data
and return it to the destination register.
If there is no outstanding fill request for that line, the current load request is placed in the fill
buffer and a 32-byte external memory read request is made. If the pending buffer or fill buffer
is full, the Intel® XScale™ core will stall until an entry is available.
2. A line is allocated in the cache to receive the 32 bytes of fill data. The line selected is
determined by the round-robin pointer (see Section 6.2.4, “Round-Robin Replacement
Algorithm”). The line chosen may contain a valid line previously allocated in the cache. In this
case both dirty bits are examined. Any half cache line whose dirty bit is set is written back to
external memory as a four-word burst operation.
3. When the data requested by the load is returned from external memory, it is immediately sent
to the destination register specified by the load.
4. As data returns from external memory it is written to the cache into the previously allocated
line.
A load operation that misses the cache and is NOT cacheable makes a request from external
memory for the exact data size of the original load request. For example, LDRH requests exactly
two bytes from external memory, LDR requests 4 bytes from external memory, etc. This request is
placed in the fill buffer until the data is returned from external memory, which is then forwarded
back to the destination register(s).
For the PXA255 processor, the size of a data load also depends on the memory bank addressed in
the access. For example, all 32-bit wide SDRAM reads are bursts of 4 words. All loads from this
SDRAM generate a read of 4 words, even though, for uncacheable loads, only the data the core
requested is used.
6.2.3.3 Write Miss Policy
A write operation that misses the cache will request a 32-byte cache line from external memory if
the access is cacheable and write allocation is specified in the page. In this case the following
sequence of events occurs:
1. The fill buffer is checked to see if an outstanding fill request already exists for that line.
If so, the current request is placed in the pending buffer and waits until the previously
requested fill completes, after which it writes its data into the recently allocated cache line.
If there is no outstanding fill request for that line, the current store request is placed in the fill
buffer and a 32-byte external memory read request is made. If the pending buffer or fill buffer
is full, the Intel® XScale™ core will stall until an entry is available.
2. The 32 bytes of data are returned to the Intel® XScale™ core in the PXA255 processor
in sequential order. Note that the order in which the data is returned to the Intel® XScale™ core
does not affect performance, since the store operation has to wait until the entire line
is written into the cache before it can complete.
3. When the entire 32-byte line has returned from external memory, a line is allocated in the
cache, selected by the round-robin pointer (see Section 6.2.4, “Round-Robin Replacement
Algorithm”). The line to be written into the cache may replace a valid line previously allocated
in the cache. In this case both dirty bits are examined and if any are set, the four words
associated with a dirty bit that’s asserted will be written back to external memory as a 4 word
burst operation. This write operation will be placed in the write buffer.
4. The line is written into the cache along with the data associated with the store operation.
If the above condition for requesting a 32-byte cache line is not met, a write miss will cause a write
request to external memory for the exact data size specified by the store operation, assuming the
write request doesn’t coalesce with another write operation in the write buffer.
6.2.3.4 Write-Back Versus Write-Through
The Intel® XScale™ core supports write-back caching or write-through caching, controlled
through the MMU page attributes. When write-through caching is specified, all store operations are
written to external memory even if the access hits the cache. This feature keeps the external
memory coherent with the cache, i.e., no dirty bits are set for this region of memory in the
data/mini-data cache. This, however, does not guarantee that the data/mini-data cache is coherent
with external memory, which is dependent on the system level configuration, specifically if the
external memory is shared by another master.
When write-back caching is specified, a store operation that hits the cache will not generate an
immediate write to external memory, waiting instead for round-robin eviction or cache cleaning to
write the data into external memory. Thus write-back caching is typically preferred for
performance as it reduces external memory write traffic.
6.2.4 Round-Robin Replacement Algorithm
The line replacement algorithm for the data cache is round-robin. Each set in the data cache has a
round-robin pointer that keeps track of the next line (in that set) to replace. The next line to replace
in a set is the next sequential line after the last one that was just filled. For example, if the line for
the last fill was written into way 5-set 2, the next line to replace for that set would be way 6. None
of the other round-robin pointers for the other sets are affected when they are not the sets being
accessed.
After reset, way 31 is pointed to by the round-robin pointer in each of the sets. Once a line is
written into way 31, the round-robin pointer points to the first available way of a set, beginning
with way 0 if no lines have been re-configured as data RAM in that particular set. Re-configuring
lines as data RAM effectively reduces the available lines for cache updating. For example, if the
first three lines of a set were re-configured, the round-robin pointer would point to the line at way 3
after it rolled over from way 31. Refer to Section 6.4, “Re-configuring the Data Cache as Data
RAM” for more details on data RAM.
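The pointer behavior described above can be modeled for a single set in a few lines of C. This is a sketch under stated assumptions, not a description of the hardware: the function names are hypothetical, and lines re-configured as data RAM are assumed to occupy the lowest-numbered ways of the set, as in the three-line example above.

```c
#define DCACHE_WAYS 32u

/* Value of a set's round-robin pointer after reset. */
unsigned rr_reset(void)
{
    return DCACHE_WAYS - 1u;
}

/* Hypothetical model of one set's pointer. `ptr` is the way that was just
 * filled; `locked` is the number of lines in this set re-configured as data
 * RAM (0 to 28), assumed to occupy ways 0 .. locked-1. Returns the next way
 * to replace. */
unsigned rr_next(unsigned ptr, unsigned locked)
{
    if (ptr + 1u >= DCACHE_WAYS)
        return locked;       /* roll over to the first available way */
    return ptr + 1u;
}
```

With no data RAM lines, a fill into way 31 moves the pointer to way 0; with three lines re-configured, the same fill moves it to way 3. Each set keeps its own pointer, so the model would be instantiated once per set.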
The mini-data cache follows the same round-robin replacement algorithm as the data cache except
that there are only two lines the round-robin pointer can point to, so the round-robin pointer
always points to the least recently filled line. A least recently used replacement algorithm is not
supported because the purpose of the mini-data cache is to cache data that exhibits low temporal
locality, i.e., data that is placed into the mini-data cache is typically modified once and then
written back out to external memory. Examples of data that should be streamed through the
mini-data cache for improved system performance are audio or video bit streams such as MP3 or
MPEG-4 data.
6.2.5 Parity Protection
The data cache and mini-data cache are protected by parity to ensure data integrity; there is one
parity bit per byte of data. (The tags are NOT parity protected.) When a parity error is detected on a
data/mini-data cache access, a data abort exception occurs. Before servicing the exception,
hardware will set bit 10 of the Fault Status Register.
A data/mini-data cache parity error is an imprecise data abort, meaning R14_ABORT (+8) may not
point to the instruction that caused the parity error. If the parity error occurred during a load, the
targeted register may be updated with incorrect data.
A data abort due to a data/mini-data cache parity error may not be recoverable if the data address
that caused the abort is on a line in the cache that has a write-back caching policy. Prior
updates to this line may be lost; in this case the software exception handler should perform a “clean
and clear” operation on the data cache, ignoring subsequent parity errors, and restart the offending
process. The “clean and clear” operation is shown in Section 6.3.3.1.
6.2.6 Atomic Accesses
The SWP and SWPB instructions generate an atomic load and store operation to a common
location, allowing a memory semaphore to be loaded and altered without interruption. These
accesses may hit or miss the data/mini-data cache depending on the configuration of the cache,
configuration of the MMU, and the page attributes.
The application processor guarantees that no other on-chip master (or process) divides a SWP or
SWPB instruction. Note that there is no external bus lock pin, so software must maintain coherency
if companion chips need to access semaphores in external memory.
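The way a semaphore is typically built on an atomic swap can be illustrated with a host-side C analogue. The sketch below is not XScale code: it uses the C11 `atomic_exchange` function to stand in for the indivisible load and store that SWP performs, and the type and function names are hypothetical.

```c
#include <stdatomic.h>

/* 0 = free, 1 = held. atomic_exchange models the read-and-write that SWP
 * performs on the semaphore location as a single indivisible step. */
typedef atomic_int swp_lock_t;

/* Swap 1 into the semaphore. Returns 1 if the lock was acquired (the old
 * value was 0), or 0 if it was already held. */
int swp_try_lock(swp_lock_t *lock)
{
    return atomic_exchange(lock, 1) == 0;
}

/* Release the semaphore by storing 0. */
void swp_unlock(swp_lock_t *lock)
{
    atomic_store(lock, 0);
}
```

A caller that fails to acquire the lock would normally retry or yield. As noted above, this pattern protects the semaphore only against masters that honor the atomic access.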
6.3 Data Cache and Mini-Data Cache Control
6.3.1 Data Memory State After Reset
After processor reset, both the data cache and mini-data cache are disabled, all valid bits are set to
zero (invalid), and the round-robin pointer in each set points to way 31. Any lines in the data cache
that were configured as data RAM before reset are changed back to cacheable lines after reset, i.e.,
there are 32 Kbytes of data cache and zero bytes of data RAM.
6.3.2 Enabling/Disabling
The data cache and mini-data cache are enabled by setting bit 2 in coprocessor 15, register 1
(Control Register). See Chapter 7, “Configuration”, for a description of this register and others.
Example 6-1 shows code that enables the data and mini-data caches. Note that the MMU must be
enabled to use the data cache.
Example 6-1. Enabling the Data Cache
enableDCache:
MCR p15, 0, r0, c7, c10, 4 ; Drain pending data operations...
; (see Section 7.2.7, Register 7: Cache Functions)
MRC p15, 0, r0, c1, c0, 0 ; Get current control register
ORR r0, r0, #4 ; Enable D-Cache by setting ‘C’ (bit 2)
MCR p15, 0, r0, c1, c0, 0 ; And update the Control register
6.3.3 Invalidate & Clean Operations
Individual entries can be cleaned and invalidated in the data cache and mini-data cache via
coprocessor 15, register 7. Note that a line locked into the data cache remains locked even after it
has been subjected to an invalidate-entry operation. This will leave an unusable line in the cache
until a global unlock or reset has occurred. For this reason, do not use these commands on locked
lines.
This same register also provides the command to invalidate the entire data cache and mini-data
cache. Refer to Table 7-12, “Cache Functions” on page 7-9 for a listing of the commands. These
global invalidate commands have no effect on lines locked in the data cache. Locked lines must be
unlocked before they can be invalidated. This is accomplished by the Unlock Data Cache
command found in Table 7-14, “Cache Lockdown Functions” on page 7-11.
6.3.3.1 Global Clean and Invalidate Operation
A simple software routine is used to globally clean the data cache. It takes advantage of the
line-allocate data cache operation, which allocates a line into the data cache. This allocation evicts
any dirty cache data back to external memory. Example 6-2 on page 6-9 shows how the data cache
can be cleaned.
The line-allocate operation does not require physical memory to exist at the virtual address
specified by the instruction, since it does not generate a load/fill request to external memory. Also,
the line-allocate operation does not set the 32 bytes of data associated with the line to any known
value. Reading this data will produce unpredictable results.
The line-allocate command does not operate on the mini-data cache, so system software must clean
this cache by reading 2 Kbytes of contiguous unused data into it. This data must be unused and
reserved for this purpose so that it will not already be in the cache. It must reside in a page that is
marked as mini-data cache cacheable (see Section 2.3.2).
The time it takes to execute a global clean operation depends on the number of dirty lines in the
cache.
Example 6-2. Global Clean Operation
; Global Clean/Invalidate THE DATA CACHE
; R1 contains the virtual address of a region of cacheable memory reserved for
; this clean operation
; R0 is the loop count; Iterate 1024 times which is the number of lines in the
; data cache
;; Macro ALLOCATE performs the line-allocation cache operation on the
;; address specified in register Rx.
;;
        MACRO ALLOCATE Rx
        MCR P15, 0, Rx, C7, C2, 5
        ENDM
        MOV R0, #1024
LOOP1:
        ALLOCATE R1             ; Allocate a line at the virtual address
                                ; specified by R1.
        ADD R1, R1, #32         ; Increment the address in R1 to the next cache line
        SUBS R0, R0, #1         ; Decrement loop count
        BNE LOOP1
;
; Clean the Mini-data Cache
; Can’t use line-allocate command, so cycle 2KB of unused data through.
; R2 contains the virtual address of a region of cacheable memory reserved for
; cleaning the Mini-data Cache
; R0 is the loop count; Iterate 64 times which is the number of lines in the
; Mini-data Cache.
        MOV R0, #64
LOOP2:
        LDR R3,[R2],#32         ; Load and increment to next cache line
        SUBS R0, R0, #1         ; Decrement loop count
        BNE LOOP2
;
; Invalidate the data cache and mini-data cache
        MCR P15, 0, R0, C7, C6, 0
;
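The two loop counts in Example 6-2 follow directly from the cache geometry given in this chapter. The C fragment below is only a sketch of that arithmetic; the constant and function names are hypothetical and are not part of any Intel software.

```c
#include <assert.h>
#include <stdint.h>

/* Geometry taken from this chapter: 32-byte lines, a 32-Kbyte data cache,
 * and a 2-Kbyte mini-data cache. The names are illustrative only. */
enum {
    LINE_BYTES        = 32,
    DCACHE_BYTES      = 32 * 1024,
    MINI_DCACHE_BYTES = 2 * 1024
};

/* Number of line-sized steps needed to sweep a cache of cache_bytes bytes. */
static uint32_t lines_in_cache(uint32_t cache_bytes)
{
    assert(cache_bytes % LINE_BYTES == 0);
    return cache_bytes / LINE_BYTES;
}
```

With these values, lines_in_cache(DCACHE_BYTES) gives the 1024 iterations of LOOP1 and lines_in_cache(MINI_DCACHE_BYTES) gives the 64 iterations of LOOP2.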
6.4 Re-configuring the Data Cache as Data RAM
Software has the ability to lock tags associated with 32-byte lines in the data cache, thus creating
the appearance of data RAM. Any subsequent access to this line will always hit the cache unless it
is invalidated. Once a line is locked into the data cache it is no longer available for cache allocation
on a line fill. Up to 28 lines in each set can be reconfigured as data RAM, such that the maximum
data RAM size is 28 Kbytes.
Hardware does not support locking lines into the mini-data cache; any attempt to do this will
produce unpredictable results.
There are two methods for locking tags into the data cache; the method of choice depends on the
application. One method is used to lock data that resides in external memory into the data cache
and the other method is used to re-configure lines in the data cache as data RAM. Locking data
from external memory into the data cache is useful for lookup tables, constants, and any other data
that is frequently accessed. Re-configuring a portion of the data cache as data RAM is useful when
an application needs scratch memory (bigger than the register file can provide) for frequently used
variables. These variables may be strewn across memory, making it advantageous for software to
pack them into data RAM memory.
Code examples for these two applications are shown in Example 6-3 on page 6-11 and Example
6-4 on page 6-12. The difference between these two routines is that Example 6-3 on page 6-11
actually requests the entire line of data from external memory and Example 6-4 on page 6-12 uses
the line-allocate operation to lock the tag into the cache. No external memory request is made,
which means software can map any unallocated area of memory as data RAM. However, the line-
allocate operation does validate the target address with the MMU, so system software must ensure
that the memory has a valid descriptor in the page table.
Another item to note in Example 6-4 on page 6-12 is that the 32 bytes of data located in a newly
allocated line in the cache must be initialized by software before it can be read. The line allocate
operation does not initialize the 32 bytes and therefore reading from that line without first writing
to it will produce unpredictable results.
Any line locked in the data cache could be written to by software. The locking is more a function of
the address tag and not the data values. If the cache values represent constants shared by software
then the memory area should be protected by the MMU. Using the MMU data abort exception,
software can decide whether to prevent the write or allow the cache value to be overwritten. The
exception handler would determine whether a write to memory is also required to overcome any
coherency issues.
In both examples, the code drains the pending loads before and after locking data. This step ensures
that outstanding loads do not end up in the wrong place -- either unintentionally locked into the
cache or mistakenly left out. A drain operation has been placed after the operation that locks the tag
into the cache. This drain ensures predictable results if a programmer tries to lock more than 28
lines in a set; the tag will get allocated in this case but not locked into the cache.
The data cache can only be unlocked by using the global unlock command (see Table 7-14, “Cache
Lockdown Functions” on page 7-11). The invalidate-entry command should not be issued to a
locked line as this will render the line useless until a global unlock is issued.
Example 6-3. Locking Data into the Data Cache
; R1 contains the virtual address of a region of memory to lock,
; configured with C=1 and B=1
; R0 is the number of 32-byte lines to lock into the data cache. In this
; example 16 lines of data are locked into the cache.
; MMU and data cache are enabled prior to this code.
        MACRO DRAIN
        MCR P15, 0, R0, C7, C10, 4  ; drain pending loads and stores
        ENDM
        DRAIN
        MOV R2, #0x1
        MCR P15,0,R2,C9,C2,0        ; Put the data cache in lock mode
        CPWAIT                      ; wait for effect (see Section 2.3.3)
        MOV R0, #16
LOOP1:
        MCR P15,0,R1,C7,C10,1       ; Write back the line if it’s dirty in the cache
        MCR P15,0,R1,C7,C6,1        ; Flush/Invalidate the line from the cache
        PLD [R1]                    ; Load and lock 32 bytes of data located at [R1]
                                    ; into the data cache.
        ADD R1, R1, #32             ; Advance the address in R1 to the next cache
                                    ; line (PLD has no post-indexed form).
        DRAIN
        SUBS R0, R0, #1             ; Decrement loop count
        BNE LOOP1
; Turn off data cache locking
        DRAIN
        MOV R2, #0x0
        MCR P15,0,R2,C9,C2,0        ; Take the data cache out of lock mode.
        CPWAIT
Tags can be locked into the data cache by enabling the data cache lock mode bit located in
coprocessor 15, register 9. (See Table 7-14, “Cache Lockdown Functions” on page 7-11 for the
exact command.) Once enabled, any new lines allocated into the data cache will be locked down.
Note that the PLD instruction will not affect the cache contents if it encounters an error while
executing. For this reason, system software should ensure the memory address used in the PLD is
correct. If this cannot be ascertained, replace the PLD with a LDR instruction that targets a scratch
register.
Example 6-4. Creating Data RAM
; R1 contains the virtual address of a region of memory to configure as data RAM,
; which is aligned on a 32-byte boundary.
; MMU is configured so that the memory region is cacheable.
; R0 is the number of 32-byte lines to designate as data RAM. In this example 16
; lines of the data cache are re-configured as data RAM.
; The inner loop is used to initialize the newly allocated lines
; MMU and data cache are enabled prior to this code.
        MACRO ALLOCATE Rx
        MCR P15, 0, Rx, C7, C2, 5
        ENDM
        MACRO DRAIN
        MCR P15, 0, R0, C7, C10, 4  ; drain pending loads and stores
        ENDM
        DRAIN
        MOV R4, #0x0
        MOV R5, #0x0
        MOV R2, #0x1
        MCR P15,0,R2,C9,C2,0        ; Put the data cache in lock mode
        CPWAIT                      ; wait for effect (see Section 2.3.3)
        MOV R0, #16
LOOP1:
        ALLOCATE R1                 ; Allocate and lock a tag into the data cache at
                                    ; address [R1].
        ; initialize 32 bytes of newly allocated line
        DRAIN
        STRD R4, [R1],#8            ; Each STRD stores R4 and R5 (8 bytes) and
        STRD R4, [R1],#8            ; post-increments R1 by 8, so four stores
        STRD R4, [R1],#8            ; initialize the whole 32-byte line and leave
        STRD R4, [R1],#8            ; R1 at the next cache line.
        SUBS R0, R0, #1             ; Decrement loop count
        BNE LOOP1
; Turn off data cache locking
        DRAIN                       ; Finish all pending operations
        MOV R2, #0x0
        MCR P15,0,R2,C9,C2,0        ; Take the data cache out of lock mode.
        CPWAIT
Lines are locked into a set starting at way 0 and may progress up to way 27; which set a line gets
locked into depends on the set index of the virtual address of the request. Figure 6-3, “Locked Line
Effect on Round Robin Replacement” is an example of where lines of data may be locked into the
cache along with how the round-robin pointer is affected.
Software can lock down data located at different memory locations. This may cause some sets to
have more locked lines than others as shown in Figure 6-3.
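The way locked lines distribute across sets can be modeled on a host machine before committing to a lock plan. The sketch below is a model only, not hardware behavior: it assumes that, with 32-byte lines and 32 sets, the set index is address bits [9:5] (this section does not spell out the bit positions), and the names set_index and plan_lock are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

enum { LINE_BYTES = 32, NUM_SETS = 32, MAX_LOCKED_WAYS = 28 };

/* Assumed mapping: with 32-byte lines and 32 sets, the set index is
 * address bits [9:5]. */
static unsigned set_index(uint32_t addr)
{
    return (addr / LINE_BYTES) % NUM_SETS;
}

/* Add num_lines consecutive lines starting at base to the per-set lock
 * counts in per_set (NUM_SETS entries, holding lines already locked).
 * Returns 1 if every set stays within the 28-way limit, 0 otherwise. */
static int plan_lock(uint32_t base, uint32_t num_lines,
                     unsigned per_set[NUM_SETS])
{
    int fits = 1;
    assert(base % LINE_BYTES == 0);
    for (uint32_t i = 0; i < num_lines; i++) {
        unsigned s = set_index(base + i * LINE_BYTES);
        per_set[s]++;
        if (per_set[s] > MAX_LOCKED_WAYS)
            fits = 0;
    }
    return fits;
}
```

Under this model a contiguous region spreads evenly, one line per set, while regions at scattered addresses can pile up in a few sets and reach the 28-way limit early.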
Lines are unlocked in the data cache by performing an unlock operation. See Section 7.2.9,
“Register 9: Cache Lock Down” on page 7-11 for more information about locking and unlocking
the data cache.
Before locking, the programmer must ensure that no part of the target data range is already resident
in the cache. The Intel® XScale™ core will not refetch such data, which will result in it not being
locked into the cache. If there is any doubt as to the location of the targeted memory data, the cache
should be cleaned and invalidated to prevent this scenario. If the cache contains a locked region
which the programmer wishes to lock again, then the cache must be unlocked before being cleaned
and invalidated.
6.5 Write Buffer/Fill Buffer Operation and Control
The write buffer is always enabled which means stores to external memory will be buffered. The K
bit in the Auxiliary Control Register (CP15, register 1) is a global enable/disable for allowing
coalescing in the write buffer. When this bit disables coalescing, no coalescing will occur
regardless of the value of the page attributes. If this bit enables coalescing, the page attributes X, C,
and B are examined to see if coalescing is enabled for each region of memory.
Coalescing means that memory writes which occur within the same 16-byte aligned memory region
can become one burst to external memory rather than distinct bus cycles. Merges can match with
any write buffer entry, but they need to form contiguous data areas to coalesce. Byte writes may
coalesce into halfwords, halfwords into words, or words into a multi-word burst to external
memory. The Write Buffer attempts to replace distinct writes with burst writes to memory, greatly
improving write performance to burst memory devices such as SDRAM.
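The two conditions named above, same 16-byte aligned region and contiguous data, can be expressed as a small predicate. This is a simplified model for reasoning about store patterns, not a description of the actual write buffer logic; the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

enum { COALESCE_REGION_BYTES = 16 };

/* True if the byte range [addr, addr + len) lies inside one 16-byte
 * aligned region. */
static int in_one_region(uint32_t addr, uint32_t len)
{
    assert(len > 0 && len <= COALESCE_REGION_BYTES);
    return (addr / COALESCE_REGION_BYTES) ==
           ((addr + len - 1) / COALESCE_REGION_BYTES);
}

/* Simplified eligibility test: two writes may merge when both fall in the
 * same 16-byte aligned region and their byte ranges are adjacent. */
static int may_coalesce(uint32_t a_addr, uint32_t a_len,
                        uint32_t b_addr, uint32_t b_len)
{
    if (!in_one_region(a_addr, a_len) || !in_one_region(b_addr, b_len))
        return 0;
    if (a_addr / COALESCE_REGION_BYTES != b_addr / COALESCE_REGION_BYTES)
        return 0;
    return a_addr + a_len == b_addr || b_addr + b_len == a_addr;
}
```

In this model two word stores to 0x1000 and 0x1004 are candidates for one burst, while stores to 0x100C and 0x1010 are not, because they straddle a 16-byte boundary.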
Figure 6-3. Locked Line Effect on Round Robin Replacement
[Figure: ways 0 through 31 of sets 0, 1, 2, and 31, with the locked ways in sets 0, 1, and 2 marked.]
set 0: 8 ways locked, 24 ways available for round robin replacement
set 1: 23 ways locked, 9 ways available for round robin replacement
set 2: 28 ways locked, only ways 28-31 available for replacement
set 31: all 32 ways available for round robin replacement
All reads and writes to external memory occur in program order when coalescing is disabled in the
write buffer. If coalescing is enabled in the write buffer, writes may occur out of program order to
external memory. Program correctness is maintained in this case by comparing all store requests
with all the valid entries in the fill buffer.
The write buffer and fill buffer support a drain operation, such that before the next instruction
executes, all the Intel® XScale™ core data requests to external memory have completed. See
Table 7-12, “Cache Functions” on page 7-9 for the exact command. Using this command, software
running in a privileged mode can explicitly drain all buffered writes.
Writes to a region marked non-cacheable/non-bufferable (page attributes C, B, and X all 0) will
cause execution to stall until the write completes.
7 Configuration
This chapter describes the System Control Coprocessor (CP15) and coprocessor 14 (CP14). CP15
configures the MMU, caches, buffers and other system attributes. Where possible, the definition of
CP15 follows the definition of ARM* v5 products; see the ARM* Architecture Reference Manual
for details. The PXA255 processor also includes various extra device-specific configuration
capabilities, which are described here. CP14 contains the performance monitor registers and the
trace buffer registers.
7.1 Overview
CP15 is accessed through MRC and MCR coprocessor instructions whose format is shown in
Table 7-1, “MRC/MCR Format” on page 7-2. These instructions transfer data between ARM*
registers and coprocessor registers. Coprocessor access is only allowed in a privileged mode. Any
access to CP15 in user mode or with LDC or STC coprocessor instructions will cause an
Undefined Instruction exception.
CP14 registers can be accessed through MRC, MCR, LDC, and STC coprocessor instructions, and
access is allowed only in privileged mode. LDC and STC transfer data between memory and coprocessor
registers. See Table 7-2, “LDC/STC Format when Accessing CP14” on page 7-2. Any access to
CP14 in user mode will cause an Undefined Instruction exception.
Coprocessors CP15 and CP14 on the Intel® XScale™ core do not support access via CDP,
MRRC, or MCRR instructions. An attempt to access these coprocessors with these instructions
will result in an Undefined Instruction exception.
Many of the MCR commands available in CP15 modify hardware state sometime after execution.
A software sequence is available for those wishing to determine when this update occurs and can
be found in Section 2.3.3, “Additions to CP15 Functionality” on page 2-10.
As in the Intel® SA-1110 product, and ARM* v5 Architecture specification, the Intel® XScale™
core includes an extra level of virtual address translation in the form of a PID (Process ID) register
and associated logic. For a detailed description of this facility, see Section 7.2.11, “Register 13:
Process ID” on page 7-12. Privileged code needs to be aware of this facility because when
interacting with CP15 some addresses are modified by the PID and others are not.
An address that has yet to be modified by the PID is known as a virtual address (VA). An address
that has been through the PID logic, but not translated into a physical address, is a modified virtual
address (MVA). Non-privileged code always deals with VAs, while privileged code that programs
CP15 occasionally needs to use MVAs.
The format of MRC and MCR, which move data to and from coprocessor registers, is shown in
Table 7-1.
cp_num is defined for CP15, CP14 and CP0 on the Intel® XScale™ core. CP0 supports
instructions specific for DSP and is described in Chapter 2, “Programming Model.”
Unless otherwise noted, unused bits in coprocessor registers have unpredictable values when read.
For compatibility with future implementations, software must program these bits as zero.
The format of LDC and STC for CP14 is shown in Table 7-2. LDC and STC follow the
programming notes in the ARM* Architecture Reference Manual for loading and storing coprocessor
registers from/to memory. Access to CP15 with LDC and STC will cause an Undefined Instruction
exception, as will access to coprocessors CP1 through CP13.
LDC and STC transfer a single 32-bit word between a coprocessor register and memory. These
instructions do not allow the programmer to specify values for opcode_1, opcode_2, or CRm; those
fields implicitly contain zero.
Table 7-1. MRC/MCR Format
Bit layout (31 down to 0):
  31:28 cond | 27:24 1110 | 23:21 opcode_1 | 20 n | 19:16 CRn | 15:12 Rd | 11:8 cp_num | 7:5 opcode_2 | 4 1 | 3:0 CRm
Bits    Description                                  Notes
31:28   cond - ARM* condition codes                  -
23:21   opcode_1 - Reserved                          Must be programmed to zero for future
                                                     compatibility
20      n - Read or write coprocessor register       -
        0 = MCR
        1 = MRC
19:16   CRn - specifies which coprocessor register   -
15:12   Rd - General Purpose Register, R0..R15       -
11:8    cp_num - coprocessor number                  The Intel® XScale™ core defines three
                                                     coprocessors:
                                                     0b1111 = CP15
                                                     0b1110 = CP14
                                                     0b0000 = CP0
7:5     opcode_2 - Function bits                     This field should be programmed to zero for
                                                     future compatibility unless a value has been
                                                     specified in the command.
3:0     CRm - Function bits                          This field should be programmed to zero for
                                                     future compatibility unless a value has been
                                                     specified in the command.
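The field layout in Table 7-1 can be checked by assembling an instruction word by hand. The helper below is a sketch for host-side tooling, with a hypothetical name (encode_mcr_mrc); it assumes the fixed bits 27:24 = 0b1110 and bit 4 = 1 from the table, and that condition code 0xE means "always", which comes from the ARM architecture rather than from this section.

```c
#include <assert.h>
#include <stdint.h>

/* Pack the Table 7-1 fields into a 32-bit MCR (is_mrc == 0) or
 * MRC (is_mrc != 0) instruction word. */
static uint32_t encode_mcr_mrc(unsigned cond, int is_mrc, unsigned opcode_1,
                               unsigned crn, unsigned rd, unsigned cp_num,
                               unsigned opcode_2, unsigned crm)
{
    assert(cond < 16 && opcode_1 < 8 && crn < 16 && rd < 16);
    assert(cp_num < 16 && opcode_2 < 8 && crm < 16);
    return ((uint32_t)cond << 28) | (0xEu << 24) |
           ((uint32_t)opcode_1 << 21) | ((uint32_t)(is_mrc != 0) << 20) |
           ((uint32_t)crn << 16) | ((uint32_t)rd << 12) |
           ((uint32_t)cp_num << 8) | ((uint32_t)opcode_2 << 5) |
           (1u << 4) | (uint32_t)crm;
}
```

For example, encode_mcr_mrc(0xE, 0, 0, 7, 1, 15, 5, 2) packs the line-allocate command MCR p15, 0, R1, c7, c2, 5 used in Chapter 6.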
Table 7-2. LDC/STC Format when Accessing CP14
Bit layout (31 down to 0):
  31:28 cond | 27:25 110 | 24 P | 23 U | 22 N | 21 W | 20 L | 19:16 Rn | 15:12 CRd | 11:8 cp_num | 7:0 8_bit_word_offset
Bits        Description                                     Notes
31:28       cond - ARM* condition codes                     -
24:23,21    P, U, W - specifies 1 of 3 addressing modes     -
            identified by addressing mode 5 in the ARM
            Architecture Reference Manual.
22          N - must be 0 for CP14 accesses. Setting this   -
            bit to 1 will have an undefined effect.
20          L - Load or Store                               -
            0 = STC
            1 = LDC
19:16       Rn - specifies the base register                -
15:12       CRd - specifies the coprocessor register        -
11:8        cp_num - coprocessor number                     The Intel® XScale™ core defines the
                                                            following:
                                                            0b1110 = CP14
                                                            CP0-13 & CP15 = Undefined Exception
7:0         8-bit word offset                               -
7.2 CP15 Registers
Table 7-3 lists the CP15 registers implemented in the Intel® XScale™ core.
Table 7-3. CP15 Registers
Register (CRn) Opcode_2 Access Description
0 0 Read / Write-Ignored ID
0 1 Read / Write-Ignored Cache Type
1 0 Read / Write Control
1 1 Read / Write Auxiliary Control
2 0 Read / Write Translation Table Base
3 0 Read / Write Domain Access Control
4 0 Unpredictable Reserved
5 0 Read / Write Fault Status
6 0 Read / Write Fault Address
7 0 Read-unpredictable / Write Cache Operations
8 0 Read-unpredictable / Write TLB Operations
9 0 Read / Write Cache Lock Down
10 0 Read / Write TLB Lock Down
11 - 12 0 Unpredictable Reserved
13 0 Read / Write Process ID (PID)
14 0 Read / Write Breakpoint Registers
15 0 Read / Write (CRm = 1) CP Access
7.2.1 Register 0: ID & Cache Type Registers
Register 0 houses two read-only registers that are used for part identification: an ID register and a
cache type register.
The ID Register is selected when opcode_2=0. This register returns the code for the application
processor. Register 0 conforms with the values provided in the ARM* Architecture Reference
Manual which should be consulted for alternate values representing other ARM* devices.
The Cache Type Register is selected when opcode_2=1 and describes the cache configuration of
the Intel® XScale™ core. These values are device specific to the PXA255 processor; for the full
set of potential values, consult the ARM* Architecture Reference Manual.
Table 7-4. ID Register
Bit layout (31 down to 0):
  31:24 01101001 | 23:16 00000101 | 15:13 Core Gen | 12:10 Core Revision | 9:4 Product Number | 3:0 Product Revision
reset value: As Shown
Bits    Access                  Description
31:24   Read / Write Ignored    Implementation trademark
                                (0x69 = ‘i’ = Intel Corporation)
23:16   Read / Write Ignored    Architecture version = ARM* Version 5TE = 0b00000101
15:13   Read / Write Ignored    Core Generation
                                0b001 = Intel® XScale™ core
                                This field reflects a specific set of architecture features
                                supported by the core. If new features are added/deleted/
                                modified this field will change.
12:10   Read / Write Ignored    Core Revision:
                                0b000
                                This field reflects revisions of core generations.
                                Differences may include errata that dictate different
                                operating conditions, software work-arounds, etc.
9:4     Read / Write Ignored    Product Number
                                0b010000
3:0     Read / Write Ignored    Product Revision
                                0b0000 for first stepping
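Software that has read the ID Register (for instance with an MRC to CP15 register 0, opcode_2 = 0) can split the value into the fields of Table 7-4 as sketched below. The struct and function names are hypothetical, and the sample value in the note is simply the table's field values packed together, not a value read from silicon.

```c
#include <assert.h>
#include <stdint.h>

/* Fields of the ID Register as laid out in Table 7-4. */
struct xscale_id {
    unsigned implementer;   /* bits 31:24, 0x69 = Intel Corporation */
    unsigned arch;          /* bits 23:16, 0x05 = ARM* Version 5TE  */
    unsigned core_gen;      /* bits 15:13 */
    unsigned core_rev;      /* bits 12:10 */
    unsigned product_num;   /* bits 9:4   */
    unsigned product_rev;   /* bits 3:0   */
};

static struct xscale_id decode_id(uint32_t id)
{
    struct xscale_id f;
    f.implementer = (id >> 24) & 0xFFu;
    f.arch        = (id >> 16) & 0xFFu;
    f.core_gen    = (id >> 13) & 0x7u;
    f.core_rev    = (id >> 10) & 0x7u;
    f.product_num = (id >> 4)  & 0x3Fu;
    f.product_rev = id & 0xFu;
    return f;
}
```

Packing the values shown in Table 7-4 gives 0x69052100, which decode_id splits back into implementer 0x69, core generation 1, and product number 0b010000.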
7.2.2 Register 1: Control & Auxiliary Control Registers
Register 1 is made up of two registers: one that is compliant with ARM* Version 5 and is selected by
opcode_2 = 0x0, and one that is specific to the Intel® XScale™ core and is selected by
opcode_2 = 0x1. The latter is known as the Auxiliary Control Register.
The Exception Vector Relocation bit (bit 13 of the ARM* control register) allows the vectors to be
mapped into high memory rather than their default location at address 0. This bit is readable and
writable by software. If the MMU is enabled, the exception vectors will be accessed via the usual
translation method involving the PID register (see Section 7.2.11, “Register 13: Process ID” on
page 7-12) and the TLBs. To avoid automatic application of the PID to exception vector accesses,
software may relocate the exceptions to high memory.
Table 7-5. Cache Type Register
Bit pattern (31 down to 0): 00001011000110101010000110101010
reset value: As Shown
Bits    Access                          Description
31:29   Read-as-Zero / Write Ignored    Reserved
28:25   Read / Write Ignored            Cache class = 0b0101
                                        The caches support locking, write back and round-robin
                                        replacement. They do not support address by index.
24      Read / Write Ignored            Harvard Cache = 1
23:21   Read-as-Zero / Write Ignored    Reserved
20:18   Read / Write Ignored            Data Cache Size
                                        0b110 = 32 kB
17:15   Read / Write Ignored            Data cache associativity = 0b101 = 32
14      Read-as-Zero / Write Ignored    Reserved
13:12   Read / Write Ignored            Data cache line length = 0b10 = 8 words/line
11:9    Read-as-Zero / Write Ignored    Reserved
8:6     Read / Write Ignored            Instruction cache size
                                        0b110 = 32 kB
5:3     Read / Write Ignored            Instruction cache associativity = 0b101 = 32
2       Read-as-Zero / Write Ignored    Reserved
1:0     Read / Write Ignored            Instruction cache line length = 0b10 = 8 words/line
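The Cache Type Register fields can be turned into byte counts as sketched below. Table 7-5 only confirms the PXA255 encodings (0b110 = 32 kB, 0b101 = 32 ways, 0b10 = 8 words); the general shift formulas used here are an assumption based on the ARM* Architecture Reference Manual encoding with the reserved bit read as zero, and the names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

struct cache_geom {
    uint32_t size_bytes;
    uint32_t ways;
    uint32_t line_bytes;
};

/* Decode one 9-bit size/associativity/length group. For the data cache the
 * group is register bits 20:12; for the instruction cache it is bits 8:0. */
static struct cache_geom decode_group(uint32_t group)
{
    struct cache_geom g;
    g.size_bytes = 512u << ((group >> 6) & 0x7u);  /* 0b110 -> 32 Kbytes */
    g.ways       = 1u   << ((group >> 3) & 0x7u);  /* 0b101 -> 32 ways   */
    g.line_bytes = 8u   << (group & 0x3u);         /* 0b10  -> 32 bytes  */
    return g;
}

static struct cache_geom dcache_geom(uint32_t cache_type)
{
    return decode_group((cache_type >> 12) & 0x1FFu);
}

static struct cache_geom icache_geom(uint32_t cache_type)
{
    return decode_group(cache_type & 0x1FFu);
}
```

The bit pattern shown in Table 7-5 is 0x0B1AA1AA; both helpers decode it to a 32-Kbyte, 32-way cache with 32-byte (8-word) lines.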
The mini-data cache attribute bits, in the Auxiliary Control Register, are used to control the
allocation policy for the mini-data cache and whether it will use write-back caching or write-
through caching.
The configuration of the mini-data cache must be setup before any data access is made that may be
cached in the mini-data cache. Once data is cached, software must ensure that the mini-data cache
has been cleaned and invalidated before the mini-data cache attributes can be changed.
Table 7-6. ARM* Control Register
Bit layout (31 down to 0):
  31:14 reserved | 13 V | 12 I | 11 Z | 10 0 | 9 R | 8 S | 7 B | 6:3 1111 | 2 C | 1 A | 0 M
reset value: writable bits set to 0
Bits    Access                           Description
31:14   Read-Unpredictable /             Reserved
        Write-as-Zero
13      Read / Write                     Exception Vector Relocation (V).
                                         0 = Base address of exception vectors is 0x0000_0000
                                         1 = Base address of exception vectors is 0xFFFF_0000
12      Read / Write                     Instruction Cache Enable/Disable (I)
                                         0 = Disabled
                                         1 = Enabled
11      Read / Write                     Branch Target Buffer Enable (Z)
                                         0 = Disabled
                                         1 = Enabled
10      Read-as-Zero / Write-as-Zero     Reserved
9       Read / Write                     ROM Protection (R bit)
                                         This selects the access checks performed by the memory
                                         management unit. See the ARM Architecture Reference
                                         Manual for more information.
8       Read / Write                     System Protection (S bit)
                                         This selects the access checks performed by the memory
                                         management unit. See the ARM Architecture Reference
                                         Manual for more information.
7       Read / Write                     Big/Little Endian (B)
                                         0 = Core Little-endian data operations
                                         1 = Core Big-endian data operations
6:3     Read-as-One / Write-as-One       = 0b1111
2       Read / Write                     Data cache enable/disable (C)
                                         0 = Disabled
                                         1 = Enabled
1       Read / Write                     Alignment fault enable/disable (A)
                                         0 = Disabled
                                         1 = Enabled
0       Read / Write                     Memory management unit enable/disable (M)
                                         0 = Disabled
                                         1 = Enabled
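When building a value to write to the ARM* Control Register, the write-as-one and write-as-zero rules of Table 7-6 are easy to get wrong. The sketch below shows one way to encode them in C; the CTRL_ names and make_control are hypothetical and are not taken from any Intel header.

```c
#include <assert.h>
#include <stdint.h>

/* Bit positions from Table 7-6. */
enum {
    CTRL_M   = 1u << 0,    /* MMU enable                    */
    CTRL_A   = 1u << 1,    /* alignment fault enable        */
    CTRL_C   = 1u << 2,    /* data cache enable             */
    CTRL_SBO = 0xFu << 3,  /* bits 6:3, written as ones     */
    CTRL_B   = 1u << 7,    /* big-endian data operations    */
    CTRL_S   = 1u << 8,    /* system protection             */
    CTRL_R   = 1u << 9,    /* ROM protection                */
    CTRL_Z   = 1u << 11,   /* branch target buffer enable   */
    CTRL_I   = 1u << 12,   /* instruction cache enable      */
    CTRL_V   = 1u << 13    /* exception vectors at 0xFFFF_0000 */
};

/* Build a control register value from the writable enable bits, forcing
 * bits 6:3 to one and leaving every reserved bit zero. */
static uint32_t make_control(uint32_t enables)
{
    const uint32_t writable = CTRL_M | CTRL_A | CTRL_C | CTRL_B | CTRL_S |
                              CTRL_R | CTRL_Z | CTRL_I | CTRL_V;
    assert((enables & ~writable) == 0);
    return enables | CTRL_SBO;
}
```

For example, make_control(CTRL_M | CTRL_C | CTRL_I | CTRL_Z) yields 0x187D, which enables the MMU, both caches, and the BTB.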
Table 7-7. Auxiliary Control Register
Bit layout (31 down to 0):
  31:6 reserved | 5:4 MD | 3:2 reserved | 1 P | 0 K
reset value: writable bits set to 0
Bits    Access                   Description
31:6    Read-Unpredictable /     Reserved
        Write-as-Zero
5:4     Read / Write             Mini Data Cache Attributes (MD)
                                 All configuration options selected by these bits apply to
                                 mini-data cacheable accesses. Stores are buffered in the
                                 write buffer and will coalesce there as long as coalescing
                                 is globally enabled (K = 0 in this register).
                                 0b00 = Write back, Read allocate
                                 0b01 = Write back, Read/Write allocate
                                 0b10 = Write through, Read allocate
                                 0b11 = Unpredictable
3:2     Read-Unpredictable /     Reserved
        Write-as-Zero
1       Read / Write             Page Table Memory Attribute (P)
                                 This field is undefined in the PXA255 processor
                                 implementation and must always be programmed as
                                 zero.
0       Read / Write             Write Buffer Coalescing Disable (K)
                                 This bit globally disables the coalescing of all stores in the
                                 write buffer no matter what the value of the Cacheable
                                 and Bufferable bits are in the page table descriptors.
                                 0 = Coalescing Enabled as Page Descriptors
                                 1 = Coalescing Disabled
7.2.3 Register 2: Translation Table Base Register
Table 7-8. Translation Table Base Register
Bit layout (31 down to 0):
  31:14 Translation Table Base | 13:0 reserved
reset value: unpredictable
Bits    Access                                Description
31:14   Read / Write                          Translation Table Base - Physical address of the base of
                                              the first-level descriptor table
13:0    Read-unpredictable / Write-as-Zero    Reserved
7.2.4 Register 3: Domain Access Control Register
Table 7-9. Domain Access Control Register
Bit layout: sixteen 2-bit fields, D15 in bits 31:30 down to D0 in bits 1:0
reset value: unpredictable
Bits    Access          Description
31:0    Read / Write    Access permissions for all 16 domains - The meaning
                        of each field can be found in the ARM Architecture
                        Reference Manual.
7.2.5 Register 5: Fault Status Register
The Fault Status Register (FSR) indicates which fault has occurred, which could be either a
prefetch abort or a data abort. Bit 10 extends the encoding of the status field for prefetch aborts and
data aborts. The definition of the extended status field is found in Section 2.3.4, “Event
Architecture” on page 2-11. Bit 9 indicates that a debug event occurred and the exact source of the
event is found in the debug control and status register (CP14, register 10). When bit 9 is set, the
domain and extended status field are undefined.
Upon entry into the prefetch abort or data abort handler, hardware will update this register with the
source of the exception. Software is not required to clear these fields.
Table 7-10. Fault Status Register
Bit layout (31 down to 0):
  31:11 reserved | 10 X | 9 D | 8 0 | 7:4 Domain | 3:0 Status
reset value: unpredictable
Bits    Access                                Description
31:11   Read-unpredictable / Write-as-Zero    Reserved
10      Read / Write                          Status Field Extension (X)
                                              This bit is used to extend the encoding of the Status field,
                                              when there is a prefetch abort [See Table 2-13 on
                                              page 2-12] and when there is a data abort [See
                                              Table 2-14 on page 2-13].
9       Read / Write                          Debug Event (D)
                                              This flag indicates a debug event has occurred and that
                                              the cause of the debug event is found in the MOE field of
                                              the debug control register (CP14, register 10)
8       Read-as-zero / Write-as-Zero          = 0
7:4     Read / Write                          Domain - Specifies which of the 16 domains was being
                                              accessed when a data abort occurred
3:0     Read / Write                          Status - Used along with the X-bit above to determine the
                                              type of cycle that generated the exception. See
                                              “Event Architecture” on page 2-11
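An abort handler typically begins by splitting the Fault Status Register into the fields of Table 7-10. The sketch below shows that split in C; the struct and function names are hypothetical, and interpreting the status codes still requires the tables in Section 2.3.4.

```c
#include <assert.h>
#include <stdint.h>

/* Fields of the Fault Status Register as laid out in Table 7-10. */
struct fsr_fields {
    unsigned x;            /* bit 10: status field extension            */
    unsigned debug_event;  /* bit 9: debug event occurred               */
    unsigned domain;       /* bits 7:4, undefined when debug_event is 1 */
    unsigned status;       /* bits 3:0, read together with x            */
};

static struct fsr_fields decode_fsr(uint32_t fsr)
{
    struct fsr_fields f;
    f.x           = (fsr >> 10) & 0x1u;
    f.debug_event = (fsr >> 9) & 0x1u;
    f.domain      = (fsr >> 4) & 0xFu;
    f.status      = fsr & 0xFu;
    return f;
}
```

A handler would check debug_event first, since the domain and extended status fields are undefined when bit 9 is set.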
7.2.6 Register 6: Fault Address Register
Table 7-11. Fault Address Register
Bit layout (31 down to 0):
  31:0 Fault Virtual Address
reset value: unpredictable
Bits    Access          Description
31:0    Read / Write    Fault Virtual Address - Contains the MVA of the data
                        access that caused the memory abort
7.2.7 Register 7: Cache Functions
All the cache functions defined in existing StrongARM* products appear here. The Intel®
XScale™ core adds other functions as well. This register is write-only. Reads from this register, as
with an MRC, have an undefined effect.
Disabling/enabling a cache has no effect on contents of the cache: valid data stays valid, locked
items remain locked and accesses that hit in the cache will hit. To prevent cache hits after disabling
the cache it is necessary to invalidate it. The way to prevent hits on the fill buffer is to drain it. All
operations defined in Table 7-12 work regardless of whether the cache is enabled or disabled.
The Drain Write Buffer function not only drains the write buffer but also drains the fill buffer. The
Intel® XScale™ core does not check permissions on addresses supplied for cache or TLB
functions. Because only privileged software may execute these functions, full accessibility is
assumed. Cache functions will not generate any of the following:
• translation faults
• domain faults
• permission faults
Since the Clean D Cache Line function reads from the data cache, it is capable of generating a
parity fault. The other operations will not generate parity faults.
The invalidate instruction cache line command does not invalidate the BTB. If software invalidates
a line from the instruction cache and modifies the same location in external memory, it needs to
invalidate the BTB also. Not invalidating the BTB in this case will cause unpredictable results.
Table 7-12. Cache Functions (Sheet 1 of 2)
Function                      opcode_2  CRm     Data     Instruction
Invalidate I&D cache & BTB    0b000     0b0111  Ignored  MCR p15, 0, Rd, c7, c7, 0
Invalidate I cache & BTB      0b000     0b0101  Ignored  MCR p15, 0, Rd, c7, c5, 0
Invalidate I cache line       0b001     0b0101  MVA      MCR p15, 0, Rd, c7, c5, 1
Invalidate D cache            0b000     0b0110  Ignored  MCR p15, 0, Rd, c7, c6, 0
Invalidate D cache line       0b001     0b0110  MVA      MCR p15, 0, Rd, c7, c6, 1
Clean D cache line            0b001     0b1010  MVA      MCR p15, 0, Rd, c7, c10, 1
The line-allocate command allocates a tag into the data cache specified by bits [31:5] of Rd. If a
valid dirty line (with a different MVA) already exists at this location it will be evicted. The 32 bytes
of data associated with the newly allocated line are not initialized and therefore will generate
unpredictable results if read.
This command may be used for cleaning the entire data cache on a context switch and also when
re-configuring portions of the data cache as data RAM. In both cases, Rd is a virtual address that
maps to some non-existent physical memory. When creating data RAM, software must initialize
the data RAM before read accesses can occur. Specific uses of these commands can be found in
Chapter 6, “Data Cache.”
Other items to note about the line-allocate command are:
• It forces all pending memory operations to complete.
• If the targeted cache line is already resident, this command has no effect.
• This command cannot be used to allocate a line in the mini Data Cache.
• The newly allocated line is not marked as “dirty”. However, if a valid store is made to that line
  it will be marked as “dirty” and will get written back to external memory if another line is
  allocated to the same cache location. This eviction will produce unpredictable results if the
  line-allocate command used a virtual address that mapped to non-existent memory.
  To avoid this situation, the line-allocate operation should only be used if one of the following
  can be guaranteed:
  - The virtual address associated with this command is not one that will be generated during
    normal program execution. This is the case when line-allocate is used to clean/invalidate
    the entire cache.
  - The line-allocate operation is used only on a cache region destined to be locked. When the
    region is unlocked, it must be invalidated before making another data access.
7.2.8 Register 8: TLB Operations
Disabling/enabling the MMU has no effect on the contents of either TLB: valid entries stay valid,
locked items remain locked. To invalidate the TLBs the commands below are required. All
operations defined in Table 7-13 work regardless of whether the cache is enabled or disabled.
This register is write-only. Reads from this register, as with an MRC, have an undefined effect.
Table 7-12. Cache Functions (Sheet 2 of 2)
Function                           opcode_2  CRm     Data     Instruction
Drain Write (& Fill) Buffer        0b100     0b1010  Ignored  MCR p15, 0, Rd, c7, c10, 4
Invalidate Branch Target Buffer    0b110     0b0101  Ignored  MCR p15, 0, Rd, c7, c5, 6
Allocate Line in the Data Cache    0b101     0b0010  MVA      MCR p15, 0, Rd, c7, c2, 5
7.2.9 Register 9: Cache Lock Down
Register 9 is used for locking down entries into the instruction cache and data cache. (The protocol
for locking down entries can be found in Chapter 6, “Data Cache”.) Data can not be locked into the
mini-data cache.
Table 7-14 shows the command for locking down entries in the instruction cache, instruction TLB,
and data TLB. The cache entry to lock is specified by the virtual address in Rd. The data cache
locking mechanism follows a different procedure than the instruction cache. The data cache is
placed in lock down mode such that all subsequent fills to the data cache result in that line being
locked in, as controlled by Table 7-15.
Lock/unlock operations on a disabled cache have an undefined effect. This register is write-only.
Reads from this register, as with an MRC, have an undefined effect.
Table 7-13. TLB Functions
Function opcode_2 CRm Data Instruction
Invalidate I&D TLB 0b000 0b01<