Intel® IXP2800 Network Processor
Hardware Reference Manual

August 2004
Order Number: 278882-010

Revision History

Date             Revision   Description
March 2002       001        First release for IXP2800 Customer Information Book V 0.4
May 2002         002        Update for the IXA SDK 3.0 release.
August 2002      003        Update for the IXA SDK 3.0 Pre-Release 4.
November 2002    004        Update for the IXA SDK 3.0 Pre-Release 5.
May 2003         005        Update for the IXA SDK 3.1 Alpha Release
September 2003   006        Update for the IXA SDK 3.5 Pre-Release 1
October 2003     007        Added information about Receiver and Transmitter Interoperation with Framers and Switch Fabrics.
January 2004     008        Updated for new trademark usage: Intel XScale® technology. Updated Sections 6.5.2, 8.5.2.2, 9.2.2.1, 9.3.1, 9.3.3.2, 9.5.1.4, 9.5.3.4, and 10.3.1.
May 2004         009        Updated Figure 123 and Timing Diagrams in Figures 43, 44, 46, 47, 50, 51, 54, and 55.
August 2004      010        Preparation for web posting. Added Chapter 11, “Performance Monitor Unit”.

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL® PRODUCTS. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY RELATING TO SALE AND/OR USE OF INTEL PRODUCTS, INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT, OR OTHER INTELLECTUAL PROPERTY RIGHT.

Intel Corporation may have patents or pending patent applications, trademarks, copyrights, or other intellectual property rights that relate to the presented subject matter. The furnishing of documents and other materials and information does not provide any license, express or implied, by estoppel or otherwise, to any such patents, trademarks, copyrights, or other intellectual property rights.

Intel products are not intended for use in medical, life saving, life sustaining, critical control or safety systems, or in nuclear facility applications.

Intel may make changes to specifications and product descriptions at any time, without notice.

Designers must not rely on the absence or characteristics of any features or instructions marked “reserved” or “undefined.” Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them.

The IXP2800 Network Processor may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Except as permitted by such license, no part of this document may be reproduced, stored in a retrieval system, or transmitted in any form or by any means without the express written consent of Intel Corporation.

Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order. Copies of documents which have an ordering number and are referenced in this document, or other Intel literature may be obtained by calling 1-800-548-4725 or by visiting Intel's website at http://www.intel.com.

Intel and XScale are registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.

*Other names and brands may be claimed as the property of others.

Copyright © 2004, Intel Corporation.

Contents
1 Introduction ..... 25
    1.1 About This Document ..... 25
    1.2 Related Documentation ..... 25
    1.3 Terminology ..... 26
2 Technical Description ..... 27
    2.1 Overview ..... 27
    2.2 Intel XScale® Core Microarchitecture ..... 30
        2.2.1 ARM* Compatibility ..... 30
        2.2.2 Features ..... 30
            2.2.2.1 Multiply/Accumulate (MAC) ..... 30
            2.2.2.2 Memory Management ..... 30
            2.2.2.3 Instruction Cache ..... 30
            2.2.2.4 Branch Target Buffer ..... 31
            2.2.2.5 Data Cache ..... 31
            2.2.2.6 Interrupt Controller ..... 31
            2.2.2.7 Address Map ..... 32
    2.3 Microengines ..... 33
        2.3.1 Microengine Bus Arrangement ..... 35
        2.3.2 Control Store ..... 35
        2.3.3 Contexts ..... 35
        2.3.4 Datapath Registers ..... 37
            2.3.4.1 General-Purpose Registers (GPRs) ..... 37
            2.3.4.2 Transfer Registers ..... 37
            2.3.4.3 Next Neighbor Registers ..... 38
            2.3.4.4 Local Memory ..... 39
        2.3.5 Addressing Modes ..... 41
            2.3.5.1 Context-Relative Addressing Mode ..... 41
            2.3.5.2 Absolute Addressing Mode ..... 42
            2.3.5.3 Indexed Addressing Mode ..... 42
        2.3.6 Local CSRs ..... 43
        2.3.7 Execution Datapath ..... 43
            2.3.7.1 Byte Align ..... 43
            2.3.7.2 CAM ..... 45
        2.3.8 CRC Unit ..... 48
        2.3.9 Event Signals ..... 49
    2.4 DRAM ..... 50
        2.4.1 Size Configuration ..... 50
        2.4.2 Read and Write Access ..... 51
    2.5 SRAM ..... 51
        2.5.1 QDR Clocking Scheme ..... 52
        2.5.2 SRAM Controller Configurations ..... 52
        2.5.3 SRAM Atomic Operations ..... 53
        2.5.4 Queue Data Structure Commands ..... 54
        2.5.5 Reference Ordering ..... 54
            2.5.5.1 Reference Order Tables ..... 54
            2.5.5.2 Microengine Software Restrictions to Maintain Ordering ..... 56
    2.6 Scratchpad Memory ..... 56
        2.6.1 Scratchpad Atomic Operations ..... 57
        2.6.2 Ring Commands ..... 57
    2.7 Media and Switch Fabric Interface ..... 59
        2.7.1 SPI-4 ..... 60
        2.7.2 CSIX ..... 61
        2.7.3 Receive ..... 61
            2.7.3.1 RBUF ..... 62
                2.7.3.1.1 SPI-4 and the RBUF ..... 62
                2.7.3.1.2 CSIX and RBUF ..... 63
            2.7.3.2 Full Element List ..... 63
            2.7.3.3 RX_THREAD_FREELIST ..... 63
            2.7.3.4 Receive Operation Summary ..... 64
        2.7.4 Transmit ..... 65
            2.7.4.1 TBUF ..... 65
                2.7.4.1.1 SPI-4 and TBUF ..... 66
                2.7.4.1.2 CSIX and TBUF ..... 67
            2.7.4.2 Transmit Operation Summary ..... 67
        2.7.5 The Flow Control Interface ..... 68
            2.7.5.1 SPI-4 ..... 68
            2.7.5.2 CSIX ..... 68
    2.8 Hash Unit ..... 69
    2.9 PCI Controller ..... 71
        2.9.1 Target Access ..... 71
        2.9.2 Master Access ..... 71
        2.9.3 DMA Channels ..... 71
            2.9.3.1 DMA Descriptor ..... 72
            2.9.3.2 DMA Channel Operation ..... 73
            2.9.3.3 DMA Channel End Operation ..... 74
            2.9.3.4 Adding Descriptors to an Unterminated Chain ..... 74
        2.9.4 Mailbox and Message Registers ..... 74
        2.9.5 PCI Arbiter ..... 75
    2.10 Control and Status Register Access Proxy ..... 76
    2.11 Intel XScale® Core Peripherals ..... 76
        2.11.1 Interrupt Controller ..... 76
        2.11.2 Timers ..... 77
        2.11.3 General Purpose I/O ..... 77
        2.11.4 Universal Asynchronous Receiver/Transmitter ..... 77
        2.11.5 Slowport ..... 77
    2.12 I/O Latency ..... 78
    2.13 Performance Monitor ..... 78
3 Intel XScale® Core ..... 79
    3.1 Introduction ..... 79
    3.2 Features ..... 80
        3.2.1 Multiply/Accumulate (MAC) ..... 80
        3.2.2 Memory Management ..... 80
        3.2.3 Instruction Cache ..... 81
        3.2.4 Branch Target Buffer (BTB) ..... 81
        3.2.5 Data Cache ..... 81
        3.2.6 Performance Monitoring ..... 81
        3.2.7 Power Management ..... 81
        3.2.8 Debugging ..... 81
        3.2.9 JTAG ..... 81
    3.3 Memory Management ..... 82
        3.3.1 Architecture Model ..... 82
            3.3.1.1 Version 4 versus Version 5 ..... 82
            3.3.1.2 Memory Attributes ..... 82
                3.3.1.2.1 Page (P) Attribute Bit ..... 82
                3.3.1.2.2 Instruction Cache ..... 83
                3.3.1.2.3 Data Cache and Write Buffer ..... 83
                3.3.1.2.4 Details on Data Cache and Write Buffer Behavior ..... 83
                3.3.1.2.5 Memory Operation Ordering ..... 84
        3.3.2 Exceptions ..... 84
        3.3.3 Interaction of the MMU, Instruction Cache, and Data Cache ..... 85
        3.3.4 Control ..... 85
            3.3.4.1 Invalidate (Flush) Operation ..... 85
            3.3.4.2 Enabling/Disabling ..... 85
            3.3.4.3 Locking Entries ..... 86
            3.3.4.4 Round-Robin Replacement Algorithm ..... 87
    3.4 Instruction Cache ..... 88
        3.4.1 Instruction Cache Operation ..... 89
            3.4.1.1 Operation when Instruction Cache is Enabled ..... 89
            3.4.1.2 Operation when Instruction Cache is Disabled ..... 90
            3.4.1.3 Fetch Policy ..... 90
            3.4.1.4 Round-Robin Replacement Algorithm ..... 90
            3.4.1.5 Parity Protection ..... 91
            3.4.1.6 Instruction Cache Coherency ..... 91
        3.4.2 Instruction Cache Control ..... 92
            3.4.2.1 Instruction Cache State at Reset ..... 92
            3.4.2.2 Enabling/Disabling ..... 92
            3.4.2.3 Invalidating the Instruction Cache ..... 92
            3.4.2.4 Locking Instructions in the Instruction Cache ..... 92
            3.4.2.5 Unlocking Instructions in the Instruction Cache ..... 94
    3.5 Branch Target Buffer (BTB) ..... 94
        3.5.1 Branch Target Buffer Operation ..... 94
            3.5.1.1 Reset ..... 95
        3.5.2 Update Policy ..... 96
        3.5.3 BTB Control ..... 96
            3.5.3.1 Disabling/Enabling ..... 96
            3.5.3.2 Invalidation ..... 96
    3.6 Data Cache ..... 96
        3.6.1 Overviews ..... 97
            3.6.1.1 Data Cache Overview ..... 97
            3.6.1.2 Mini-Data Cache Overview ..... 98
            3.6.1.3 Write Buffer and Fill Buffer Overview ..... 99
        3.6.2 Data Cache and Mini-Data Cache Operation ..... 99
            3.6.2.1 Operation when Caching is Enabled ..... 99
            3.6.2.2 Operation when Data Caching is Disabled ..... 99
            3.6.2.3 Cache Policies ..... 100
                3.6.2.3.1 Cacheability ..... 100
                3.6.2.3.2 Read Miss Policy ..... 100
                3.6.2.3.3 Write Miss Policy ..... 101
                3.6.2.3.4 Write-Back versus Write-Through ..... 101
            3.6.2.4 Round-Robin Replacement Algorithm ..... 102
            3.6.2.5 Parity Protection ..... 102
            3.6.2.6 Atomic Accesses ..... 102
        3.6.3 Data Cache and Mini-Data Cache Control ..... 103
            3.6.3.1 Data Memory State After Reset ..... 103
            3.6.3.2 Enabling/Disabling ..... 103
            3.6.3.3 Invalidate and Clean Operations ..... 103
                3.6.3.3.1 Global Clean and Invalidate Operation ..... 104
        3.6.4 Reconfiguring the Data Cache as Data RAM ..... 105
        3.6.5 Write Buffer/Fill Buffer Operation and Control ..... 106
    3.7 Configuration ..... 106
    3.8 Performance Monitoring ..... 107
        3.8.1 Performance Monitoring Events ..... 107
            3.8.1.1 Instruction Cache Efficiency Mode ..... 108
            3.8.1.2 Data Cache Efficiency Mode ..... 109
            3.8.1.3 Instruction Fetch Latency Mode ..... 109
            3.8.1.4 Data/Bus Request Buffer Full Mode ..... 109
            3.8.1.5 Stall/Writeback Statistics ..... 110
            3.8.1.6 Instruction TLB Efficiency Mode ..... 111
            3.8.1.7 Data TLB Efficiency Mode ..... 111
        3.8.2 Multiple Performance Monitoring Run Statistics ..... 111
    3.9 Performance Considerations ..... 111
        3.9.1 Interrupt Latency ..... 112
        3.9.2 Branch Prediction ..... 112
        3.9.3 Addressing Modes ..... 113
        3.9.4 Instruction Latencies ..... 113
            3.9.4.1 Performance Terms ..... 113
            3.9.4.2 Branch Instruction Timings ..... 115
            3.9.4.3 Data Processing Instruction Timings ..... 115
            3.9.4.4 Multiply Instruction Timings ..... 116
            3.9.4.5 Saturated Arithmetic Instructions ..... 117
            3.9.4.6 Status Register Access Instructions ..... 118
            3.9.4.7 Load/Store Instructions ..... 118
            3.9.4.8 Semaphore Instructions ..... 118
            3.9.4.9 Coprocessor Instructions ..... 119
            3.9.4.10 Miscellaneous Instruction Timing ..... 119
            3.9.4.11 Thumb Instructions ..... 119
    3.10 Test Features ..... 119
        3.10.1 IXP2800 Network Processor Endianness ..... 120
            3.10.1.1 Read and Write Transactions Initiated by the Intel XScale® Core ..... 121
                3.10.1.1.1 Reads Initiated by the Intel XScale® Core ..... 121
                3.10.1.1.2 The Intel XScale® Core Writing to the IXP2800 Network Processor ..... 123
    3.11 Intel XScale® Core Gasket Unit ..... 125
        3.11.1 Overview ..... 125
        3.11.2 Intel XScale® Core Gasket Functional Description ..... 127
            3.11.2.1 Command Memory Bus to Command Push/Pull Conversion ..... 127
        3.11.3 CAM Operation ..... 127
        3.11.4 Atomic Operations ..... 128
            3.11.4.1 Summary of Rules for the Atomic Command Regarding I/O ..... 129
            3.11.4.2 Intel XScale® Core Access to SRAM Q-Array ..... 129
        3.11.5 I/O Transaction ..... 130
        3.11.6 Hash Access ..... 130
        3.11.7 Gasket Local CSR ..... 131
        3.11.8 Interrupt ..... 132
    3.12 Intel XScale® Core Peripheral Interface ..... 134
        3.12.1 XPI Overview ..... 134
            3.12.1.1 Data Transfers ..... 135
            3.12.1.2 Data Alignment ..... 135
            3.12.1.3 Address Spaces for XPI Internal Devices ..... 136
        3.12.2 UART Overview ..... 137
        3.12.3 UART Operation ..... 138
            3.12.3.1 UART FIFO Operation ..... 138
                3.12.3.1.1 UART FIFO Interrupt Mode Operation – Receiver Interrupt ..... 138
                3.12.3.1.2 FIFO Polled Mode Operation ..... 139
        3.12.4 Baud Rate Generator ..... 139
        3.12.5 General Purpose I/O (GPIO) ..... 140
        3.12.6 Timers ..... 141
            3.12.6.1 Timer Operation ..... 141
        3.12.7 Slowport Unit ..... 142
            3.12.7.1 PROM Device Support ..... 143
            3.12.7.2 Microprocessor Interface Support for the Framer ..... 143
            3.12.7.3 Slowport Unit Interfaces ..... 144
            3.12.7.4 Address Space ..... 145
            3.12.7.5 Slowport Interfacing Topology ..... 145
            3.12.7.6 Slowport 8-Bit Device Bus Protocols ..... 146
                3.12.7.6.1 Mode 0 Single Write Transfer for Fixed-Timed Device ..... 147
                3.12.7.6.2 Mode 0 Single Write Transfer for Self-Timing Device ..... 148
                3.12.7.6.3 Mode 0 Single Read Transfer for Fixed-Timed Device ..... 149
                3.12.7.6.4 Single Read Transfer for a Self-Timing Device ..... 150
            3.12.7.7 SONET/SDH Microprocessor Access Support ..... 150
                3.12.7.7.1 Mode 1: 16-Bit Microprocessor Interface Support with 16-Bit Address Lines ..... 151
                3.12.7.7.2 Mode 2: Interface with 8 Data Bits and 11 Address Bits ..... 155
                3.12.7.7.3 Mode 3: Support for the Intel and AMCC* 2488 Mbps SONET/SDH Microprocessor Interface ..... 157
4 Microengines ..... 167
    4.1 Overview ..... 167
        4.1.1 Control Store ..... 169
        4.1.2 Contexts ..... 169
        4.1.3 Datapath Registers ..... 171
            4.1.3.1 General-Purpose Registers (GPRs) ..... 171
            4.1.3.2 Transfer Registers ..... 171
            4.1.3.3 Next Neighbor Registers ..... 172
            4.1.3.4 Local Memory ..... 172
        4.1.4 Addressing Modes ..... 173
            4.1.4.1 Context-Relative Addressing Mode ..... 173
            4.1.4.2 Absolute Addressing Mode ..... 174
            4.1.4.3 Indexed Addressing Mode ..... 174
    4.2 Local CSRs ..... 174
    4.3 Execution Datapath ..... 174
        4.3.1 Byte Align ..... 174
        4.3.2 CAM ..... 176
    4.4 CRC Unit ..... 179
    4.5 Event Signals ..... 180
        4.5.1 Microengine Endianness ..... 181
            4.5.1.1 Read from RBUF (64 Bits) ..... 181
            4.5.1.2 Write to TBUF ..... 182
            4.5.1.3 Read/Write from/to SRAM ..... 182
            4.5.1.4 Read/Write from/to DRAM ..... 182
            4.5.1.5 Read/Write from/to SHaC and Other CSRs ..... 182
            4.5.1.6 Write to Hash Unit ..... 183
        4.5.2 Media Access ..... 183
            4.5.2.1 Read from RBUF ..... 184
            4.5.2.2 Write to TBUF ..... 185
            4.5.2.3 TBUF to SPI-4 Transfer ..... 186
5 DRAM ..... 187
    5.1 Overview ..... 187
    5.2 Size Configuration ..... 188
    5.3 DRAM Clocking ..... 189
    5.4 Bank Policy ..... 190
    5.5 Interleaving ..... 191
        5.5.1 Three Channels Active (3-Way Interleave) ..... 191
        5.5.2 Two Channels Active (2-Way Interleave) ..... 193
        5.5.3 One Channel Active (No Interleave) ..... 193
        5.5.4 Interleaving Across RDRAMs and Banks ..... 194
    5.6 Parity and ECC ..... 194
        5.6.1 Parity and ECC Disabled ..... 194
        5.6.2 Parity Enabled ..... 195
        5.6.3 ECC Enabled ..... 195
        5.6.4 ECC Calculation and Syndrome ..... 196
    5.7 Timing Configuration ..... 196
    5.8 Microengine Signals ..... 197
    5.9 Serial Port ..... 197
    5.10 RDRAM Controller Block Diagram ..... 198
        5.10.1 Commands ..... 199
        5.10.2 DRAM Write ..... 199
            5.10.2.1 Masked Write ..... 199
        5.10.3 DRAM Read ..... 200
        5.10.4 CSR Write ..... 200
        5.10.5 CSR Read ..... 200
        5.10.6 Arbitration ..... 201
        5.10.7 Reference Ordering ..... 201
    5.11 DRAM Push/Pull Arbiter ..... 201
        5.11.1 Arbiter Push/Pull Operation ..... 202
        5.11.2 DRAM Push Arbiter Description ..... 203
    5.12 DRAM Pull Arbiter Description ..... 204
6 SRAM Interface ..... 207
    6.1 Overview ..... 207
    6.2 SRAM Interface Configurations ..... 208
        6.2.1 Internal Interface ..... 209
        6.2.2 Number of Channels ..... 209
        6.2.3 Coprocessor and/or SRAMs Attached to a Channel ..... 209
    6.3 SRAM Controller Configurations ..... 209
    6.4 Command Overview ..... 211
        6.4.1 Basic Read/Write Commands ..... 211
        6.4.2 Atomic Operations ..... 211
        6.4.3 Queue Data Structure Commands ..... 213
            6.4.3.1 Read_Q_Descriptor Commands ..... 216
            6.4.3.2 Write_Q_Descriptor Commands ..... 216
            6.4.3.3 ENQ and DEQ Commands ..... 217
        6.4.4 Ring Data Structure Commands ..... 217
        6.4.5 Journaling Commands ..... 217
        6.4.6 CSR Accesses ..... 217
    6.5 Parity ..... 217
    6.6 Address Map ..... 218
    6.7 Reference Ordering ..... 219
        6.7.1 Reference Order Tables ..... 219
        6.7.2 Microcode Restrictions to Maintain Ordering ..... 220
    6.8 Coprocessor Mode ..... 221
7 SHaC — Unit Expansion ..... 225
    7.1 Overview ..... 225
        7.1.1 SHaC Unit Block Diagram ..... 225
        7.1.2 Scratchpad ..... 227
            7.1.2.1 Scratchpad Description ..... 227
            7.1.2.2 Scratchpad Interface ..... 229
                7.1.2.2.1 Command Interface ..... 229
                7.1.2.2.2 Push/Pull Interface ..... 229
                7.1.2.2.3 CSR Bus Interface ..... 229
                7.1.2.2.4 Advanced Peripherals Bus Interface (APB) ..... 229
            7.1.2.3 Scratchpad Block Level Diagram ..... 229
                7.1.2.3.1 Scratchpad Commands ..... 230
                7.1.2.3.2 Ring Commands ..... 231
                7.1.2.3.3 Clocks and Reset ..... 235
                7.1.2.3.4 Reset Registers ..... 235
        7.1.3 Hash Unit ..... 236
            7.1.3.1 Hashing Operation ..... 237
            7.1.3.2 Hash Algorithm ..... 239
8 Media and Switch Fabric Interface ..... 241
    8.1 Overview ..... 241
        8.1.1 SPI-4 ..... 243
        8.1.2 CSIX ..... 246
        8.1.3 CSIX/SPI-4 Interleave Mode ..... 246
    8.2 Receive ..... 247
        8.2.1 Receive Pins ..... 248
        8.2.2 RBUF ..... 248
            8.2.2.1 SPI-4 ..... 250
            8.2.2.2 CSIX ..... 253
        8.2.3 Full Element List ..... 255
        8.2.4 Rx_Thread_Freelist_# ..... 255
        8.2.5 Rx_Thread_Freelist_Timeout_# ..... 256
        8.2.6 Receive Operation Summary ..... 256
        8.2.7 Receive Flow Control Status ..... 258
            8.2.7.1 SPI-4 ..... 258
            8.2.7.2 CSIX ..... 259
                8.2.7.2.1 Link-Level ..... 259
                8.2.7.2.2 Virtual Output Queue ..... 260
        8.2.8 Parity ..... 260
            8.2.8.1 SPI-4 ..... 260
            8.2.8.2 CSIX ..... 261
                8.2.8.2.1 Horizontal Parity ..... 261
                8.2.8.2.2 Vertical Parity ..... 261
        8.2.9 Error Cases ..... 261
    8.3 Transmit ..... 262
        8.3.1 Transmit Pins ..... 262
        8.3.2 TBUF ..... 263
            8.3.2.1 SPI-4 ..... 266
            8.3.2.2 CSIX ..... 267
        8.3.3 Transmit Operation Summary ..... 268
            8.3.3.1 SPI-4 ..... 268
            8.3.3.2 CSIX ..... 269
            8.3.3.3 Transmit Summary ..... 270
        8.3.4 Transmit Flow Control Status ..... 270
            8.3.4.1 SPI-4 ..... 271
            8.3.4.2 CSIX ..... 273
                8.3.4.2.1 Link-Level ..... 273
                8.3.4.2.2 Virtual Output Queue ..... 273
        8.3.5 Parity ..... 273
            8.3.5.1 SPI-4 ..... 273
            8.3.5.2 CSIX ..... 274
                8.3.5.2.1 Horizontal Parity ..... 274
                8.3.5.2.2 Vertical Parity ..... 274
    8.4 RBUF and TBUF Summary ..... 274
    8.5 CSIX Flow Control Interface ..... 275
        8.5.1 TXCSRB and RXCSRB Signals ..... 275
        8.5.2 FCIFIFO and FCEFIFO Buffers ..... 276
            8.5.2.1 Full Duplex CSIX ..... 277
            8.5.2.2 Simplex CSIX ..... 278
        8.5.3 TXCDAT/RXCDAT, TXCSOF/RXCSOF, TXCPAR/RXCPAR, and TXCFC/RXCFC Signals ..... 280
    8.6 Deskew and Training ..... 280
        8.6.1 Data Training Pattern ..... 282
        8.6.2 Flow Control Training Pattern ..... 282
        8.6.3 Use of Dynamic Training ..... 283
    8.7 CSIX Startup Sequence ..... 287
        8.7.1 CSIX Full Duplex ..... 287
            8.7.1.1 Ingress IXP2800 Network Processor ..... 287
            8.7.1.2 Egress IXP2800 Network Processor ..... 287
            8.7.1.3 Single IXP2800 Network Processor ..... 288
        8.7.2 CSIX Simplex ..... 288
            8.7.2.1 Ingress IXP2800 Network Processor ..... 288
            8.7.2.2 Egress IXP2800 Network Processor ..... 289
            8.7.2.3 Single IXP2800 Network Processor ..... 289
    8.8 Interface to Command and Push and Pull Buses ..... 290
        8.8.1 RBUF or MSF CSR to Microengine S_TRANSFER_IN Register for Instruction ..... 291
        8.8.2 Microengine S_TRANSFER_OUT Register to TBUF or MSF CSR for Instruction ..... 291
        8.8.3 Microengine to MSF CSR for Instruction ..... 291
        8.8.4 From RBUF to DRAM for Instruction ..... 291
        8.8.5 From DRAM to TBUF for Instruction ..... 292
    8.9 Receiver and Transmitter Interoperation with Framers and Switch Fabrics ..... 292
        8.9.1 Receiver and Transmitter Configurations ..... 293
            8.9.1.1 Simplex Configuration ..... 293
            8.9.1.2 Hybrid Simplex Configuration ..... 294
            8.9.1.3 Dual Network Processor Full Duplex Configuration ..... 295
            8.9.1.4 Single Network Processor Full Duplex Configuration (SPI-4.2) ..... 296
            8.9.1.5 Single Network Processor, Full Duplex Configuration (SPI-4.2 and CSIX-L1) ..... 297
        8.9.2 System Configurations ..... 297
            8.9.2.1 Framer, Single Network Processor Ingress and Egress, and Fabric Interface Chip ..... 298
            8.9.2.2 Framer, Dual Network Processor Ingress, Single Network Processor Egress, and Fabric Interface Chip ..... 298
            8.9.2.3 Framer, Single Network Processor Ingress and Egress, and CSIX-L1 Chips for Translation and Fabric Interface ..... 299
            8.9.2.4 CPU Complex, Network Processor, and Fabric Interface Chip ..... 299
            8.9.2.5 Framer, Single Network Processor, Co-Processor, and Fabric Interface Chip ..... 300
        8.9.3 SPI-4.2 Support ..... 301
            8.9.3.1 SPI-4.2 Receiver ..... 301
            8.9.3.2 SPI-4.2 Transmitter ..... 302
        8.9.4 CSIX-L1 Protocol Support ..... 303
            8.9.4.1 CSIX-L1 Interface Reference Model: Traffic Manager and Fabric Interface Chip ..... 303
            8.9.4.2 Intel® IXP2800 Support of the CSIX-L1 Protocol ..... 304
                8.9.4.2.1 Mapping to 16-Bit Wide DDR LVDS ..... 304
                8.9.4.2.2 Support for Dual Chip, Full-Duplex Operation ..... 305
                8.9.4.2.3 Support for Simplex Operation ..... 306
                8.9.4.2.4 Support for Hybrid Simplex Operation ..... 307
                8.9.4.2.5 Support for Dynamic De-Skew Training ..... 308
            8.9.4.3 CSIX-L1 Protocol Receiver Support ..... 309
            8.9.4.4 CSIX-L1 Protocol Transmitter Support ..... 310
            8.9.4.5 Implementation of a Bridge Chip to CSIX-L1 ..... 311
        8.9.5 Dual Protocol (SPI and CSIX-L1) Support ..... 312
            8.9.5.1 Dual Protocol Receiver Support ..... 312
            8.9.5.2 Dual Protocol Transmitter Support ..... 312
            8.9.5.3 Implementation of a Bridge Chip to CSIX-L1 and SPI-4.2 ..... 313
        8.9.6 Transmit State Machine ..... 314
            8.9.6.1 SPI-4.2 Transmitter State Machine ..... 314
            8.9.6.2 Training Transmitter State Machine ..... 315
            8.9.6.3 CSIX-L1 Transmitter State Machine ..... 315
        8.9.7 Dynamic De-Skew ..... 316
        8.9.8 Summary of Receiver and Transmitter Signals ..... 317
9 PCI Unit ..... 319
    9.1 Overview ..... 319
    9.2 PCI Pin Protocol Interface Block ..... 321
        9.2.1 PCI Commands ..... 322
        9.2.2 IXP2800 Network Processor Initialization ..... 323
            9.2.2.1 Initialization by the Intel XScale® Core ..... 324
            9.2.2.2 Initialization by a PCI Host ..... 324
        9.2.3 PCI Type 0 Configuration Cycles ..... 325
            9.2.3.1 Configuration Write ..... 325
            9.2.3.2 Configuration Read ..... 325
        9.2.4 PCI 64-Bit Bus Extension ..... 325
        9.2.5 PCI Target Cycles ..... 326
            9.2.5.1 PCI Accesses to CSR ..... 326
            9.2.5.2 PCI Accesses to DRAM ..... 326
            9.2.5.3 PCI Accesses to SRAM ..... 326
            9.2.5.4 Target Write Accesses from the PCI Bus ..... 326
            9.2.5.5 Target Read Accesses from the PCI Bus ..... 327
        9.2.6 PCI Initiator Transactions ..... 327
            9.2.6.1 PCI Request Operation ..... 327
            9.2.6.2 PCI Commands ..... 328
            9.2.6.3 Initiator Write Transactions ..... 328
            9.2.6.4 Initiator Read Transactions ..... 328
            9.2.6.5 Initiator Latency Timer ..... 328
            9.2.6.6 Special Cycle ..... 329
        9.2.7 PCI Fast Back-to-Back Cycles ..... 329
        9.2.8 PCI Retry ..... 329
        9.2.9 PCI Disconnect ..... 329
        9.2.10 PCI Built-In System Test ..... 329
        9.2.11 PCI Central Functions ..... 330
            9.2.11.1 PCI Interrupt Inputs ..... 330
            9.2.11.2 PCI Reset Output ..... 330
            9.2.11.3 PCI Internal Arbiter ..... 331
    9.3 Slave Interface Block ..... 332
        9.3.1 CSR Interface ..... 332
        9.3.2 SRAM Interface ..... 333
            9.3.2.1 SRAM Slave Writes ..... 333
            9.3.2.2 SRAM Slave Reads ..... 334
        9.3.3 DRAM Interface ..... 334
            9.3.3.1 DRAM Slave Writes ..... 334
            9.3.3.2 DRAM Slave Reads ..... 335
        9.3.4 Mailbox and Doorbell Registers ..... 336
        9.3.5 PCI Interrupt Pin ..... 339
    9.4 Master Interface Block ..... 340
        9.4.1 DMA Interface ..... 340
            9.4.1.1 Allocation of the DMA Channels ..... 341
            9.4.1.2 Special Registers for Microengine Channels ..... 341
            9.4.1.3 DMA Descriptor ..... 342
            9.4.1.4 DMA Channel Operation ..... 343
            9.4.1.5 DMA Channel End Operation ..... 344
            9.4.1.6 Adding Descriptor to an Unterminated Chain ..... 344
            9.4.1.7 DRAM to PCI Transfer ..... 344
            9.4.1.8 PCI to DRAM Transfer ..... 345
        9.4.2 Push/Pull Command Bus Target Interface ..... 345
            9.4.2.1 Command Bus Master Access to Local Configuration Registers ..... 345
            9.4.2.2 Command Bus Master Access to Local Control and Status Registers ..... 346
            9.4.2.3 Command Bus Master Direct Access to PCI Bus ..... 346
                9.4.2.3.1 PCI Address Generation for IO and MEM Cycles ..... 346
                9.4.2.3.2 PCI Address Generation for Configuration Cycles ..... 347
                9.4.2.3.3 PCI Address Generation for Special and IACK Cycles ..... 347
                9.4.2.3.4 PCI Enables ..... 347
                9.4.2.3.5 PCI Command ..... 347
    9.5 PCI Unit Error Behavior ..... 348
        9.5.1 PCI Target Error Behavior ..... 348
            9.5.1.1 Target Access Has an Address Parity Error ..... 348
            9.5.1.2 Initiator Asserts PCI_PERR_L in Response to One of Our Data Phases ..... 348
            9.5.1.3 Discard Timer Expires on a Target Read ..... 348
            9.5.1.4 Target Access to the PCI_CSR_BAR Space Has Illegal Byte Enables ..... 348
            9.5.1.5 Target Write Access Receives Bad Parity PCI_PAR with the Data ..... 349
            9.5.1.6 SRAM Responds with a Memory Error on One or More Data Phases on a Target Read ..... 349
            9.5.1.7 DRAM Responds with a Memory Error on One or More Data Phases on a Target Read ..... 349
        9.5.2 As a PCI Initiator During a DMA Transfer ..... 349
            9.5.2.1 DMA Read from DRAM (Memory-to-PCI Transaction) Gets a Memory Error ..... 349
            9.5.2.2 DMA Read from SRAM (Descriptor Read) Gets a Memory Error ..... 350
            9.5.2.3 DMA from DRAM Transfer (Write to PCI) Receives PCI_PERR_L on PCI Bus ..... 350
            9.5.2.4 DMA to DRAM (Read from PCI) Has Bad Data Parity ..... 350
            9.5.2.5 DMA Transfer Experiences a Master Abort (Time-Out) on PCI ..... 351
            9.5.2.6 DMA Transfer Receives a Target Abort Response During a Data Phase ..... 351
            9.5.2.7 DMA Descriptor Has a 0x0 Word Count (Not an Error) ..... 351
        9.5.3 As a PCI Initiator During a Direct Access from the Intel XScale® Core or Microengine ..... 351
            9.5.3.1 Master Transfer Experiences a Master Abort (Time-Out) on PCI ..... 351
            9.5.3.2 Master Transfer Receives a Target
Abort Response During a Data Phase .......................................................................................351 9.5.3.3 Master from the Intel XScale® Core or Microengine Transfer (Write to PCI) Receives PCI_PERR_L on PCI Bus .............................352 9.5.3.4 Master Read from PCI (Read from PCI) Has Bad Data Parity ............352 9.5.3.5 Master Transfer Receives PCI_SERR_L from the PCI Bus ................352 9.5.3.6 Intel XScale® Core Microengine Requests Direct Transfer when the PCI Bus is in Reset ........................................................................352 PCI Data Byte Lane Alignment .........................................................................................352 9.6.1 Endian for Byte Enable ........................................................................................355 Clocks and Reset.......................................................................................................................359 10.1 10.2 10.3 Clocks ...............................................................................................................................359 Synchronization Between Frequency Domains ................................................................363 Reset ................................................................................................................................364 10.3.1 Hardware Reset Using nRESET or PCI_RST_L .................................................364 Hardware Reference Manual 13 Contents 10.4 10.5 11 Performance Monitor Unit ........................................................................................................ 375 11.1 11.2 11.3 11.4 14 10.3.2 PCI-Initiated Reset............................................................................................... 366 10.3.3 Watchdog Timer-Initiated Reset .......................................................................... 366 10.3.3.1 Slave Network Processor (Non-Central Function) ............................... 367 10.3.3.2 Master Network Processor (PCI Host, Central Function) .................... 367 10.3.3.3 Master Network Processor (Central Function)..................................... 367 10.3.4 Software-Initiated Reset ...................................................................................... 367 10.3.5 Reset Removal Operation Based on CFG_PROM_BOOT.................................. 368 10.3.5.1 When CFG_PROM_BOOT is 1 (BOOT_PROM is Present) ................ 368 10.3.5.2 When CFG_PROM_BOOT is 0 (BOOT_PROM is Not Present) ......... 368 10.3.6 Strap Pins ............................................................................................................ 368 10.3.7 Powerup Reset Sequence ................................................................................... 370 Boot Mode ........................................................................................................................ 370 10.4.1 Flash ROM........................................................................................................... 372 10.4.2 PCI Host Download ............................................................................................. 372 Initialization ....................................................................................................................... 373 Introduction ....................................................................................................................... 
375 11.1.1 Motivation for Performance Monitors................................................................... 375 11.1.2 Motivation for Choosing CHAP Counters ............................................................ 376 11.1.3 Functional Overview of CHAP Counters.............................................................. 377 11.1.4 Basic Operation of the Performance Monitor Unit ............................................... 378 11.1.5 Definition of CHAP Terminology .......................................................................... 379 11.1.6 Definition of Clock Domains................................................................................. 380 Interface and CSR Description ......................................................................................... 380 11.2.1 APB Peripheral .................................................................................................... 381 11.2.2 CAP Description .................................................................................................. 381 11.2.2.1 Selecting the Access Mode.................................................................. 381 11.2.2.2 PMU CSR ............................................................................................ 381 11.2.2.3 CAP Writes .......................................................................................... 381 11.2.2.4 CAP Reads .......................................................................................... 381 11.2.3 Configuration Registers ....................................................................................... 382 Performance Measurements ............................................................................................ 382 Events Monitored in Hardware ......................................................................................... 385 11.4.1 Queue Statistics Events....................................................................................... 385 11.4.1.1 Queue Latency..................................................................................... 385 11.4.1.2 Queue Utilization.................................................................................. 385 11.4.2 Count Events ....................................................................................................... 385 11.4.2.1 Hardware Block Execution Count ........................................................ 385 11.4.3 Design Block Select Definitions ........................................................................... 386 11.4.4 Null Event ............................................................................................................ 387 11.4.5 Threshold Events................................................................................................. 388 11.4.6 External Input Events........................................................................................... 389 11.4.6.1 XPI Events Target ID(000001) / Design Block #(0100) ....................... 389 11.4.6.2 SHaC Events Target ID(000010) / Design Block #(0101).................... 393 11.4.6.3 IXP2800 Network Processor MSF Events Target ID(000011) / Design Block #(0110)........................................................................... 396 11.4.6.4 Intel XScale® Core Events Target ID(000100) / Design Block #(0111)........................................................................... 402 11.4.6.5 PCI Events Target ID(000101) / Design Block #(1000) ....................... 
Figures
[List of Figures 1 through 140, by number and title]

Tables
[List of Tables 1 through 186, by number and title]
181 SRAM CH0 PMU Event List ...................................................................422
182 IXP2800 Network Processor Dram DPLA PMU Event List ........................................423
183 IXP2800 Network Processor Dram DPSA PMU Event List ........................................424
184 IXP2800 Network Processor Dram CH2 PMU Event List .........................................425
185 IXP2800 Network Processor Dram CH1 PMU Event List .........................................429
186 IXP2800 Network Processor Dram CH0 PMU Event List .........................................429

1 Introduction

1.1 About This Document

This document is the hardware reference manual for the Intel® IXP2800 Network Processor. This information is intended for use by developers and is organized as follows:

• Section 2, “Technical Description” contains a hardware overview.
• Section 3, “Intel XScale® Core” describes the embedded core.
• Section 4, “Microengines” describes Microengine operation.
• Section 5, “DRAM” describes the DRAM Unit.
• Section 6, “SRAM Interface” describes the SRAM Unit.
• Section 7, “SHaC — Unit Expansion” describes the Scratchpad, Hash Unit, and CSRs (SHaC).
• Section 8, “Media and Switch Fabric Interface” describes the Media and Switch Fabric (MSF) Interface used to connect the network processor to a physical layer device.
• Section 9, “PCI Unit” describes the PCI Unit.
• Section 10, “Clocks and Reset” describes the clocks, reset, and initialization sequence.
• Section 11, “Performance Monitor Unit” describes the PMU.

1.2 Related Documentation

Further information on the IXP2800 is available in the following documents:

• IXP2800 Network Processor Datasheet – Contains summary information on the IXP2800 Network Processor, including a functional description, signal descriptions, electrical specifications, and mechanical specifications.
• IXP2400 and IXP2800 Network Processor Programmer’s Reference Manual – Contains detailed programming information for designers.
• IXP2400/IXP2800 Network Processor Development Tools User’s Guide – Describes the Developer Workbench and the development tools you can access through the use of the Workbench GUI.

1.3 Terminology

Table 1 and Table 2 list the terminology used in this manual.

Table 1. Data Terminology

  Term       Words   Bytes   Bits
  Byte       ½       1       8
  Word       1       2       16
  Longword   2       4       32
  Quadword   4       8       64

Table 2. Longword Formats

  Endian Type     32-Bit                                    64-Bit
  Little-Endian   (0x12345678) arranged as {12 34 56 78}    64-bit data 0x12345678 9ABCDE56 arranged as {12 34 56 78 9A BC DE 56}
  Big-Endian      (0x12345678) arranged as {78 56 34 12}    64-bit data 0x12345678 9ABCDE56 arranged as {78 56 34 12, 56 DE BC 9A}

2 Technical Description

2.1 Overview

This section provides a brief overview of the IXP2800 Network Processor internal hardware. It is intended as an overall hardware introduction to the network processor. The major blocks are:

• Intel XScale® core — General-purpose 32-bit RISC processor (ARM* Version 5 Architecture compliant) used to initialize and manage the network processor; it can also be used for higher-layer network processing tasks.
• Intel XScale® technology Peripherals (XPI) — Interrupt Controller, Timers, UART, General Purpose I/O (GPIO), and interface to low-speed off-chip peripherals (such as the maintenance port of network devices) and Flash ROM.
• Microengines (MEs) — Sixteen 32-bit programmable engines specialized for network processing. Microengines do the main data plane processing per packet.
• DRAM Controllers — Three independent controllers for Rambus* DRAM. Typically DRAM is used for data buffer storage.
• SRAM Controllers — Four independent controllers for QDR SRAM. Typically SRAM is used for control information storage.
• Scratchpad Memory — 16 Kbytes of storage for general purpose use.
• Hash Unit — Polynomial hash accelerator. The Intel XScale® core and Microengines can use it to offload hash calculations.
• Control and Status Register Access Proxy (CAP) — Provides special inter-processor communication features to allow flexible and efficient inter-Microengine and Microengine to Intel XScale® core communication.
• Media and Switch Fabric Interface (MSF) — Interface for network framers and/or Switch Fabric. Contains receive and transmit buffers.
• PCI Controller — PCI Local Bus Specification, Version 2.2* interface for 64-bit, 66-MHz I/O. PCI can be used either to connect to a host processor or to attach PCI-compliant peripheral devices.
• Performance Monitor — Counters that can be programmed to count selected internal chip hardware events, which can be used to analyze and tune performance.

Figure 1 is a simple block diagram of the network processor showing the major internal hardware blocks. Figure 2 is a detailed diagram of the network processor units and buses.

Figure 1. IXP2800 Network Processor Functional Block Diagram
(Block diagram showing the Media and Switch Fabric Interface, Hash Unit, Scratchpad Memory, PCI Controller, CAP, SRAM Controllers 0–3, DRAM Controllers 0–2, the Intel XScale® core with its peripherals (XPI), the Performance Monitor, and the two Microengine clusters: ME 0x0–0x7 in Cluster 0 and ME 0x10–0x17 in Cluster 1.)

Figure 2. IXP2800 Network Processor Detailed Diagram
(Detailed diagram of the chassis: command buses Cmd_0/Cmd_1 with Command Bus Arbiters 0 and 1, the D_Push/D_Pull buses and arbiters, the S_Push/S_Pull bus pairs 0 and 1 with their arbiters, the four SRAM controllers, the three DRAM controllers, the SHaC unit (Scratchpad, Hash, CAP), the MSF with RBUF/TBUF connected to an SPI-4/CSIX device, the PCI controller (master/target CSRs, PCI space, DMA transfers), the Intel XScale® core and gasket, and Microengine Clusters 0 (ME 0x0–0x7) and 1 (ME 0x10–0x17) with their transfer registers, command FIFOs, and CSRs.)

2.2 Intel XScale® Core Microarchitecture

The Intel XScale® microarchitecture consists of a 32-bit general-purpose RISC processor that incorporates an extensive list of architecture features that allow it to achieve high performance.
2.2.1 ARM* Compatibility

The Intel XScale® microarchitecture is ARM* Version 5 (V5) Architecture compliant. It implements the integer instruction set of ARM* V5, but does not provide hardware support for the floating-point instructions.

The Intel XScale® microarchitecture provides the Thumb instruction set (ARM V5T) and the ARM V5E DSP extensions.

Backward compatibility with the first generation of StrongARM* products is maintained for user-mode applications. Operating systems may require modifications to match the specific hardware features of the Intel XScale® microarchitecture and to take advantage of the performance enhancements added to the Intel XScale® core.

2.2.2 Features

2.2.2.1 Multiply/Accumulate (MAC)

The MAC unit supports early termination of multiplies/accumulates in two cycles and can sustain a throughput of a MAC operation every cycle. Several architectural enhancements were made to the MAC to support audio coding algorithms; these include a 40-bit accumulator and support for 16-bit packed values.

2.2.2.2 Memory Management

The Intel XScale® microarchitecture implements the Memory Management Unit (MMU) Architecture specified in the ARM Architecture Reference Manual. The MMU provides access protection and virtual-to-physical address translation. The MMU Architecture also specifies the caching policies for the instruction cache and data memory. These policies are specified as page attributes and include:

• identifying code as cacheable or non-cacheable
• selecting between the mini-data cache or data cache
• write-back or write-through data caching
• enabling data write allocation policy
• enabling the write buffer to coalesce stores to external memory

2.2.2.3 Instruction Cache

The Intel XScale® microarchitecture implements a 32-Kbyte, 32-way set associative instruction cache with a line size of 32 bytes. All requests that “miss” the instruction cache generate a 32-byte read request to external memory. A mechanism to lock critical code within the cache is also provided.

2.2.2.4 Branch Target Buffer

The Intel XScale® microarchitecture provides a Branch Target Buffer (BTB) to predict the outcome of branch-type instructions. It provides storage for the target address of branch-type instructions and predicts the next address to present to the instruction cache when the current instruction address is that of a branch. The BTB holds 128 entries.

2.2.2.5 Data Cache

The Intel XScale® microarchitecture implements a 32-Kbyte, 32-way set associative data cache and a 2-Kbyte, 2-way set associative mini-data cache. Each cache has a line size of 32 bytes and supports write-through or write-back caching.

The data/mini-data cache is controlled by page attributes defined in the MMU Architecture and by coprocessor 15.

The Intel XScale® microarchitecture allows applications to reconfigure a portion of the data cache as data RAM. Software may place special tables or frequently used variables in this RAM.

2.2.2.6 Interrupt Controller

The Intel XScale® microarchitecture provides two levels of interrupt, IRQ and FIQ. They can be masked via coprocessor 13. Note that there is also a memory-mapped interrupt controller, described with the Intel XScale® technology peripherals (see Section 3.12), which is used to mask and steer many chip-wide interrupt sources.
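As a quick check of the cache parameters given above, the number of sets in each cache follows from size / (ways × line size). The small C program below is purely illustrative and is not part of the manual's material:

#include <stdio.h>

/* Generic set-associative geometry: sets = size / (ways * line size). */
int main(void)
{
    unsigned icache_sets   = (32 * 1024) / (32 * 32); /* 32-KB, 32-way, 32-B line -> 32 sets */
    unsigned dcache_sets   = (32 * 1024) / (32 * 32); /* same organization as the I-cache    */
    unsigned minidata_sets = (2 * 1024)  / (2 * 32);  /* 2-KB, 2-way, 32-B line   -> 32 sets */
    printf("I-cache sets=%u, D-cache sets=%u, mini-data sets=%u\n",
           icache_sets, dcache_sets, minidata_sets);
    return 0;
}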
2.2.2.7 Address Map

Figure 3 shows the partitioning of the Intel XScale® core microarchitecture 4-Gbyte address space.

Figure 3. Intel XScale® Core 4-GB (32-Bit) Address Space
(Memory map showing, from the bottom of the 32-bit space upward: DRAM and Intel XScale® core Flash ROM in the lower 2 GB (0x0000 0000 – 0x7FFF FFFF); 1 GB of SRAM at 0x8000 0000 – 0xBFFF FFFF; CSR and I/O regions in 0xC000 0000 – 0xDFFF FFFF, including Flash ROM (64 MB), MSF, Scratch, SRAM CSR and Queue, SRAM Ring, DRAM CSR, CAP CSRs, reserved space, Intel XScale® core CSRs, PCI I/O, PCI CFG, PCI Special/IACK, PCI Configuration Registers, and PCI Local CSRs; and PCI MEM in the top half gigabyte at 0xE000 0000 – 0xFFFF FFFF.)

2.3 Microengines

The Microengines do most of the programmable per-packet processing in the IXP2800 Network Processor. There are 16 Microengines, connected as shown in Figure 1. The Microengines have access to all shared resources (SRAM, DRAM, MSF, etc.) as well as private connections between adjacent Microengines (referred to as “next neighbors”).

The block diagram in Figure 4 is used in the Microengine description. Note that this block diagram is simplified for clarity; some blocks and connectivity have been omitted to make the diagram more readable. Also, this block diagram does not show any pipeline stages; rather, it shows the logical flow of information.

Microengines provide support for software-controlled multi-threaded operation. Given the disparity in processor cycle times versus external memory times, a single thread of execution often blocks, waiting for external memory operations to complete. Multiple threads allow for thread-interleaved operation, as there is often at least one thread ready to run while others are blocked.

Figure 4. Microengine Block Diagram
(Block diagram showing the Control Store; the 128-entry A and B GPR banks; 128 Next Neighbor registers; 640 words of Local Memory; the 128-entry S and D transfer-in and transfer-out register sets, fed by the S_Push/D_Push buses and drained by the S_Pull/D_Pull buses; the operand selection logic (A_Src, B_Src, T_Index, NN_Get, LM_Addr_0/1, CRC_Remainder, Immed); the Execution Datapath (shift, add, subtract, multiply, logicals, find first bit, CAM); the CRC Unit; Local CSRs; the four-entry command FIFO; and the NN_Data_In/NN_Data_Out connections to the previous and next Microengines.)

2.3.1 Microengine Bus Arrangement

The IXP2800 Network Processor supports a single D_Push/D_Pull bus, and both Microengine clusters interface to the same bus. It also supports two command buses and two sets of S_Push/S_Pull buses, connected as shown in Table 3, which also shows the next-neighbor relationship between the Microengines.
Table 3. IXP2800 Network Processor Microengine Bus Arrangement

  Microengine   Microengine   Next       Previous   Command   S_Push and
  Cluster       Number        Neighbor   Neighbor   Bus       S_Pull Bus
  0             0x00          0x01       NA         0         0
                0x01          0x02       0x00
                0x02          0x03       0x01
                0x03          0x04       0x02
                0x04          0x05       0x03
                0x05          0x06       0x04
                0x06          0x07       0x05
                0x07          0x10       0x06
  1             0x10          0x11       0x07       1         1
                0x11          0x12       0x10
                0x12          0x13       0x11
                0x13          0x14       0x12
                0x14          0x15       0x13
                0x15          0x16       0x14
                0x16          0x17       0x15
                0x17          NA         0x16

2.3.2 Control Store

The Control Store is a RAM that holds the program that is executed by the Microengine. It holds 8192 instructions, each of which is 40 bits wide. It is initialized by the Intel XScale® core, which writes to the USTORE_ADDR and USTORE_DATA Local CSRs.

The Control Store is protected by parity against soft errors. Parity checking is enabled by CTX_ENABLE[CONTROL STORE PARITY ENABLE]. A parity error on an instruction read will halt the Microengine and assert an interrupt to the Intel XScale® core.

2.3.3 Contexts

There are eight hardware Contexts available in the Microengine. To allow for efficient context swapping, each Context has its own register set, Program Counter, and Context-specific Local registers. Having a copy per Context eliminates the need to move Context-specific information to/from shared memory and Microengine registers for each Context swap. Fast context swapping allows a Context to do computation while other Contexts wait for I/O (typically external memory accesses) to complete or for a signal from another Context or hardware unit. (A context swap is similar to a taken branch in timing.)

Each of the eight Contexts is in one of four states.

1. Inactive — Some applications may not require all eight contexts. A Context is in the Inactive state when its CTX_ENABLE CSR enable bit is a 0.
2. Executing — A Context is in the Executing state when its context number is in the ACTIVE_CTX_STS CSR. The executing Context’s PC is used to fetch instructions from the Control Store. A Context will stay in this state until it executes an instruction that causes it to go to the Sleep state (there is no hardware interrupt or preemption; Context swapping is completely under software control). At most one Context can be in the Executing state at any time.
3. Ready — In this state, a Context is ready to execute, but is not executing because a different Context is executing. When the Executing Context goes to the Sleep state, the Microengine’s context arbiter selects the next Context to go to the Executing state from among all the Contexts in the Ready state. The arbitration is round robin.
4. Sleep — The Context is waiting for external event(s) specified in the INDIRECT_WAKEUP_EVENTS CSR to occur (typically, but not limited to, an I/O access). In this state the Context does not arbitrate to enter the Executing state.

The state diagram in Figure 5 illustrates the Context state transitions. Each of the eight Contexts will be in one of these states. At most one Context can be in the Executing state at a time; any number of Contexts can be in any of the other states.

Figure 5. Context State Transition Diagram
(State diagram with the states Inactive, Ready, Executing, and Sleep: Reset places a Context in Inactive; the Intel XScale® core setting its CTX_ENABLE bit moves it to Ready; the Context moves from Ready to Executing when the Executing Context goes to the Sleep state and this Context is the highest round-robin priority; executing a CTX Arbitration instruction moves it from Executing to Sleep; arrival of the external event signal moves it from Sleep back to Ready; clearing the CTX_ENABLE bit returns it to Inactive.)
Note: After reset, the Intel XScale® core processor must load the starting address into CTX_PC, load CTX_WAKEUP_EVENTS with 0x1 (voluntary), and then set the appropriate CTX_ENABLE bits to begin executing Context(s).

The Microengine is in the Idle state whenever no Context is running (all Contexts are in either the Inactive or Sleep states). This state is entered:

1. After reset (the CTX_ENABLE Local CSR is clear, putting all Contexts into the Inactive state).
2. When a context swap is executed, but no context is ready to wake up.
3. When a ctx_arb[bpt] instruction is executed by the Microengine (this is a special case of condition 2 above, since ctx_arb[bpt] clears CTX_ENABLE, putting all Contexts into the Inactive state).

The Microengine provides the following functionality during the Idle state:

1. The Microengine continuously checks if a Context is in the Ready state. If so, a new Context begins to execute. If no Context is Ready, the Microengine remains in the Idle state.
2. Only the ALU instructions are supported. They are used for debug via the special hardware defined in number 3 below.
3. A write to the USTORE_ADDR Local CSR with the USTORE_ADDR[ECS] bit set causes the Microengine to repeatedly execute the instruction pointed to by the address specified in the USTORE_ADDR CSR. Only the ALU instructions are supported in this mode. Also, the result of the execution is written to the ALU_OUT Local CSR rather than to a destination register.
4. A write to the USTORE_ADDR Local CSR with the USTORE_ADDR[ECS] bit set, followed by a write to the USTORE_DATA Local CSR, loads an instruction into the Control Store. After the Control Store is loaded, execution proceeds as described in number 3 above.

2.3.4 Datapath Registers

As shown in the block diagram in Figure 4, each Microengine contains four types of 32-bit datapath registers:

1. 256 General Purpose registers
2. 512 Transfer registers
3. 128 Next Neighbor registers
4. 640 32-bit words of Local Memory

2.3.4.1 General-Purpose Registers (GPRs)

GPRs are used for general programming purposes. They are read and written exclusively under program control. GPRs, when used as a source in an instruction, supply operands to the execution datapath. When used as a destination in an instruction, they are written with the result of the execution datapath. The specific GPRs selected are encoded in the instruction. The GPRs are physically and logically contained in two banks, GPR A and GPR B, defined in Table 5.

2.3.4.2 Transfer Registers

Transfer (abbreviated as Xfer) registers are used for transferring data to and from the Microengine and locations external to the Microengine (for example, DRAMs, SRAMs, etc.). There are four types of transfer registers:

• S_TRANSFER_IN
• S_TRANSFER_OUT
• D_TRANSFER_IN
• D_TRANSFER_OUT

TRANSFER_IN registers, when used as a source in an instruction, supply operands to the execution datapath. The specific register selected is either encoded in the instruction, or selected indirectly via T_INDEX. TRANSFER_IN registers are written by external units (a typical case is when the external unit returns data in response to read instructions; however, there are other methods to write TRANSFER_IN registers, for example a read instruction executed by one Microengine may cause the data to be returned to a different Microengine.
Details are covered in the instruction set descriptions). TRANSFER_OUT registers, when used as a destination in an instruction, are written with the result from the execution datapath. The specific register selected is encoded in the instruction, or selected indirectly via T_INDEX. TRANSFER_OUT registers supply data to external units (for example, write data for an SRAM write). The S_TRANSFER_IN and S_TRANSFER_OUT registers connect to the S_PUSH and S_PULL buses, respectively. The D_TRANSFER_IN and D_TRANSFER_OUT Transfer registers connect to the D_PUSH and D_PULL buses, respectively. Typically, the external units access the Transfer registers in response to instructions executed by the Microengines. However, it is possible for an external unit to access a given Microengine’s Transfer registers either autonomously, or under control of a different Microengine, or the Intel XScale® core, etc. The Microengine interface signals controlling writing/reading of the TRANSFER_IN and TRANSFER_OUT registers are independent of the operation of the rest of the Microengine, therefore the data movement does not stall or impact other instruction processing (it is the responsibility of software to synchronize usage of read data). 2.3.4.3 Next Neighbor Registers Next Neighbor registers, when used as a source in an instruction, supply operands to the execution datapath. They are written in two different ways: 1. By an adjacent Microengine (the “Previous Neighbor”). 2. By the same Microengine they are in, as controlled by CTX_ENABLE[NN_MODE]. The specific register is selected in one of two ways: 1. Context-relative, the register number is encoded in the instruction. 2. As a Ring, selected via NN_GET and NN_PUT CSR registers. The usage is configured in CTX_ENABLE[NN_MODE]. • When CTX_ENABLE[NN_MODE] is ‘0’ — when Next Neighbor is a destination in an instruction, the result is sent out of the Microengine, to the Next Neighbor Microengine. • When CTX_ENABLE[NN_MODE] is ‘1’ — when Next Neighbor is used as a destination in an instruction, the instruction result data is written to the selected Next Neighbor register in the same Microengine. Note that there is a 5-instruction latency until the newly written data may be read. The data is not sent out of the Microengine as it would be when CTX_ENABLE[NN_MODE] is ‘0’. Table 4. Next Neighbor Write as a Function of CTX_ENABLE[NN_MODE] Where the Write Goes NN_MODE 38 External? NN Register in this Microengine? 0 Yes No 1 No Yes Hardware Reference Manual Intel® IXP2800 Network Processor Technical Description 2.3.4.4 Local Memory Local Memory is addressable storage within the Microengine. Local Memory is read and written exclusively under program control. Local Memory supplies operands to the execution datapath as a source, and receives results as a destination. The specific Local Memory location selected is based on the value in one of the LM_ADDR registers, which are written by local_csr_wr instructions. There are two LM_ADDR registers per Context and a working copy of each. When a Context goes to the Sleep state, the value of the working copies is put into the Context’s copy of LM_ADDR. When the Context goes to the Executing state, the value in its copy of LM_ADDR are put into the working copies. The choice of LM_ADDR_0 or LM_ADDR_1 is selected in the instruction. It is also possible to make use of both or one LM_ADDRs as global by setting CTX_ENABLE[LM_ADDR_0_GLOBAL] and/or CTX_ENABLE[LM_ADDR_1_GLOBAL]. 
When used globally, all Contexts use the working copy of LM_ADDR in place of their own Context specific one; the Context specific ones are unused. There is a three-instruction latency when writing a new value to the LM_ADDR, as shown in Example 1. Example 1. Three-Cycle Latency when Writing a New Value to LM_ADDR ;some instruction to compute the address into gpr_m local_csr_wr[INDIRECT_LM_ADDR_0, gpr_m]; put gpr_m into lm_addr ;unrelated instruction 1 ;unrelated instruction 2 ;unrelated instruction 3 alu[dest_reg, *l$index0, op, src_reg] ;dest_reg can be used as a source in next instruction LM_ADDR can also be incremented or decremented in parallel with use as a source and/or destination (using the notation *l$index#++ and *l$index#--), as shown in Example 2, where three consecutive Local Memory locations are used in three consecutive instructions. Example 2. Using LM_ADDR in Consecutive Instructions alu[dest_reg1, src_reg1, op, *l$index0++] alu[dest_reg2, src_reg2, op, *l$index0++] alu[dest_reg3, src_reg3, op, *l$index0++] Local Memory is written by selecting it as a destination. Example 3 shows copying a section of Local Memory to another section. Each instruction accesses the next sequential Local Memory location from the previous instruction. Example 3. Copying One Section of Local Memory to Another Section alu[*l$index1++, --, B, *l$index0++] alu[*l$index1++, --, B, *l$index0++] alu[*l$index1++, --, B, *l$index0++] Example 4 shows loading and using both Local Memory addresses. Example 4. Loading and Using Both Local Memory Addresses local_csr_wr[INDIRECT_LM_ADDR_0, gpr_m] local_csr_wr[INDIRECT_LM_ADDR_1, gpr_n] ;unrelated instruction 1 ;unrelated instruction 2 alu[dest_reg1, *l$index0, op, src_reg1] alu[dest_reg2, *l$index1, op, src_reg2] Hardware Reference Manual 39 Intel® IXP2800 Network Processor Technical Description As shown in Example 1, there is a latency in loading LM_ADDR. Until the new value is loaded, the old value is still usable. Example 5 shows the maximum pipelined usage of LM_ADDR. Example 5. Maximum Pipelined Usage of LM_ADDR local_csr_wr[INDIRECT_LM_ADDR_0, gpr_m] local_csr_wr[INDIRECT_LM_ADDR_0, gpr_n] local_csr_wr[INDIRECT_LM_ADDR_0, gpr_o] local_csr_wr[INDIRECT_LM_ADDR_0, gpr_p] alu[dest_reg1, *l$index0, op, src_reg1] alu[dest_reg2, *l$index0, op, src_reg2] alu[dest_reg3, *l$index0, op, src_reg3] alu[dest_reg4, *l$index0, op, src_reg4] ; ; ; ; uses uses uses uses address address address address from from from from gpr_m gpr_n gpr_o gpr_p LM_ADDR can also be used as the base of a 16 32-bit word region of memory, with the instruction specifying the offset from that base, as shown in Example 6. The source and destination can use different offsets. Example 6. LM_ADDR Used as Base of a 16 32-Bit Word Region of Local Memory alu[*l$index0[3], *l$index0[4], +, 1] Note: Local Memory has 640 32-bit words. The local memory pointers (LM_ADDR) have an addressing range of up to 1K longwords. However, only 640 longwords are currently populated with RAM. Therefore: 0 – 639 (0x0 – 0x27F) are addressable as local memory. 640 – 1023 (0x280 – 0x3FF) are addressable, but not populated with RAM. To the programmer, all instructions using Local Memory act as follows, including read/modify/write instructions like immed_w0, ld_field, etc. 1. Read LM_ADDR location (if LM_ADDR is specified as source). 2. Execute logic function. 3. Write LM_ADDR location (if LM_ADDR is specified as destination). 4. If specified, increment or decrement LM_ADDR. 5. Proceed to next instruction. 
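The five-step sequence above can be summarized with a small software model. The C sketch below is only a conceptual illustration of one Local-Memory-referencing instruction with optional post-increment or post-decrement; the type and function names are hypothetical, not part of any Intel API, and the model ignores pipelining and the write-to-use latencies described earlier:

#include <stdint.h>

#define LM_WORDS 640                  /* populated Local Memory, in 32-bit words */

typedef enum { LM_NONE, LM_POST_INC, LM_POST_DEC } lm_mod_t;

typedef struct {
    uint32_t lm[LM_WORDS];            /* Local Memory */
    uint32_t lm_addr;                 /* working copy of LM_ADDR_0 or LM_ADDR_1 */
} me_model_t;

/* Model of "dest = op(*l$index, other)" where Local Memory may be the
 * source, the destination, or both, with optional post-modify of LM_ADDR. */
static uint32_t lm_instruction(me_model_t *me,
                               uint32_t (*op)(uint32_t, uint32_t),
                               uint32_t other_operand,
                               int lm_is_dest, lm_mod_t mod)
{
    uint32_t lm_val = me->lm[me->lm_addr];           /* 1. read the LM location   */
    uint32_t result = op(lm_val, other_operand);     /* 2. execute logic function */
    if (lm_is_dest)
        me->lm[me->lm_addr] = result;                /* 3. write the LM location  */
    if (mod == LM_POST_INC)      me->lm_addr++;      /* 4. optional post-modify   */
    else if (mod == LM_POST_DEC) me->lm_addr--;
    return result;                                   /* 5. proceed to next instr. */
}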
Example 7 is legal because lm_addr_0[2] does not post-modify LM_ADDR. Example 7. LM_ADDR Use as Source and Destination alu[*l$index0[2], --, ~B, *l$index0] In Example 7, the programmer sees: 1. Read Local Memory memory location pointed to by LM_ADDR. 2. Invert the data. 3. Write the data into the address pointed to by LM_ADDR with the value of 2 that is OR’ed into the lower bits. 4. Increment LM_ADDR. 5. Proceed to next instruction. 40 Hardware Reference Manual Intel® IXP2800 Network Processor Technical Description In Example 8, the second instruction will access the Local Memory location one past the source/ destination of the first. Example 8. LM_ADDR Post-Increment alu[*l$index0++, --, ~B, gpr_n] alu[gpr_m, --, ~B, *l$index0] 2.3.5 Addressing Modes GPRs can be accessed in either a context-relative or an absolute addressing mode. Some instructions can specify either mode; other instructions can specify only Context-Relative mode. Transfer and Next Neighbor registers can be accessed in Context-Relative and Indexed modes, and Local Memory is accessed in Indexed mode. The addressing mode in use is encoded directly into each instruction, for each source and destination specifier. 2.3.5.1 Context-Relative Addressing Mode The GPRs are logically subdivided into equal regions such that each Context has relative access to one of the regions. The number of regions is configured in the CTX_ENABLE CSR, and can be either 4 or 8. Thus a Context-Relative register number is actually associated with multiple different physical registers. The actual register to be accessed is determined by the Context making the access request (the Context number is concatenated with the register number specified in the instruction). Context-Relative addressing is a powerful feature that enables eight (or four) different contexts to share the same code image, yet maintain separate data. Table 5 shows how the Context number is used in selecting the register number in relative mode. The register number in Table 5 is the Absolute GPR address, or Transfer or Next Neighbor Index number to use to access the specific Context-Relative register. For example, with eight active Contexts, Context-Relative Register 0 for Context 2 is Absolute Register Number 32. Table 5. Registers Used By Contexts in Context-Relative Addressing Mode Number of Active Contexts Active Context Number GPR Absolute Register Numbers S_Transfer or Neighbor Index Number D_Transfer Index Number 0 – 15 0 – 15 A Port B Port 0 0 – 15 0 – 15 1 16 – 31 16 – 31 16 – 31 16 – 31 8 2 32 – 47 32 – 47 32 – 47 32 – 47 (Instruction always specifies registers in range 0 – 15) 3 48 – 63 48 – 63 48 – 63 48 – 63 4 64 – 79 64 – 79 64 – 79 64 – 79 5 80 – 95 80 – 95 80 – 95 80 – 95 6 96 – 111 96 – 111 96 – 111 96 – 111 7 112 – 127 112 – 127 112 – 127 112 – 127 4 0 0 – 31 0 – 31 0 – 31 0 – 31 (Instruction always specifies registers in range 0 – 31) 2 32 – 63 32 – 63 32 – 63 32 – 63 4 64 – 95 64 – 95 64 – 95 64 – 95 6 96 – 127 96 – 127 96 – 127 96 – 127 Hardware Reference Manual 41 Intel® IXP2800 Network Processor Technical Description 2.3.5.2 Absolute Addressing Mode With Absolute addressing, any GPR can be read or written by any of the eight Contexts in a Microengine. Absolute addressing enables register data to be shared among all of the Contexts, e.g., for global variables or for parameter passing. All 256 GPRs can be read by Absolute address. 
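The register selection described above (the context number concatenated with the register number encoded in the instruction) can be expressed as a short calculation. The following C helpers are an illustrative rendering of Table 5, not production code; the four-context variant is derived from the second half of the table:

/* Eight active contexts: the instruction encodes registers 0-15 and the
 * context number is concatenated above the 4-bit register number.        */
static unsigned gpr_absolute_8ctx(unsigned ctx, unsigned rel_reg)
{
    return ((ctx & 0x7u) << 4) | (rel_reg & 0xFu);
}

/* Four active contexts (0, 2, 4, 6 per Table 5): the instruction encodes
 * registers 0-31 and each active context owns a 32-register block.       */
static unsigned gpr_absolute_4ctx(unsigned ctx, unsigned rel_reg)
{
    return ((ctx >> 1) << 5) | (rel_reg & 0x1Fu);
}

/* gpr_absolute_8ctx(2, 0) == 32, matching the worked example in the text. */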
2.3.5.3 Indexed Addressing Mode With Indexed addressing, any Transfer or Next Neighbor register can be read or written by any one of the eight Contexts in a Microengine. Indexed addressing enables register data to be shared among all of the Contexts. For indexed addressing the register number comes from the T_INDEX register for Transfer registers or NN_PUT and NN_GET registers (for Next Neighbor registers). Example 9 shows the Index Mode usage. Assume that the numbered bytes have been moved into the S_TRANSFER_IN registers as shown. Example 9. Use of Indexed Addressing Mode Data Transfer Register Number 31:24 23:16 15:8 7:0 0 0x00 0x01 0x02 0x03 1 0x04 0x05 0x06 0x07 2 0x08 0x09 0x0a 0x0b 3 0x0c 0x0d 0x0e 0x0f 4 0x10 0x11 0x12 0x013 5 0x14 0x15 0x16 0x17 6 0x18 0x19 0x1a 0x1b 7 0x1c 0x1d 0x1e 0x1f If the software wants to access a specific byte that is known at compile-time, it will normally use context-relative addressing. For example to access the word in transfer register 3: alu[dest, --, B, $xfer3] ; move the data from s_transfer 3 to gpr dest If the location of the data is found at run-time, indexed mode can be used, e.g., if the start of an encapsulated header depends on an outer header value (the outer header byte is in a fixed location). ; Check byte 2 of transfer 0 ; If value==5 header starts on byte 0x9, else byte 0x14 br=byte[$0, 2, 0x5, L1#], defer_[1] local_csr_wr[t_index_byte_index, 0x09] local_csr_wr[t_index_byte_index, 0x14] nop ; wait for index registers to be loaded L1#: ; Move bytes right justified into destination registers nop ; wait for index registers to be loaded nop ; byte_align_be[dest1, *$index++] byte_align_be[dest2, *$index++] ;etc. ; The t_index and byte_index registers are loaded by the same instruction. 42 Hardware Reference Manual Intel® IXP2800 Network Processor Technical Description 2.3.6 Local CSRs Local Control and Status registers (CSRs) are external to the Execution Datapath, and hold specific data. They can be read and written by special instructions (local_csr_rd and local_csr_wr) and are accessed less frequently than datapath registers. Because Local CSRs are not built in the datapath, there is a write-to-use delay of three instructions, and a read-to-consume penalty of two instructions. 2.3.7 Execution Datapath The Execution Datapath can take one or two operands, perform an operation, and optionally write back a result. The sources and destinations can be GPRs, Transfer registers, Next Neighbor registers, and Local Memory. The operations are shifts, add/subtract, logicals, multiply, byte align, and find first one bit. 2.3.7.1 Byte Align The datapath provides a mechanism to move data from source register(s) to any destination register(s) with byte aligning. Byte aligning takes four consecutive bytes from two concatenated values (8 bytes), starting at any of four byte boundaries (0, 1, 2, 3), and based on the endian-type (which is defined in the instruction opcode), as shown in Example 5. The four bytes are taken from two concatenated values. Four bytes are always supplied from a temporary register that always holds the A or B operand from the previous cycle, and the other four bytes from the B or A operand of the Byte Align instruction. The operation is described below, using the block diagram in Figure 6. The alignment is controlled by the two LSBs of the BYTE_INDEX Local CSR. Table 6. 
Align Value and Shift Amount Align Value (in Byte_Index[1:0]) Right Shift Amount (Number of Bits) (Decimal) Little-Endian Big-Endian 0 0 32 1 8 24 2 16 16 3 24 8 Hardware Reference Manual 43 Intel® IXP2800 Network Processor Technical Description Figure 6. Byte-Align Block Diagram Prev_B Prev_A . . . . . . A_Operand B_Operand Shift Byte_Index Result A9353-01 Example 10 shows a big-endian align sequence of instructions and the value of the various operands. Table 7 shows the data in the registers for this example. The value in BYTE_INDEX[1:0] CSR (which controls the shift amount) for this example is 2. Table 7. Register Contents for Example 10 Register Byte 3 [31:24] Byte 2 [23:16] Byte 1 [15:8] Byte 0 [7:0] 0 0 1 2 3 1 4 5 6 7 2 8 9 A B 3 C D E F Example 10. Big-Endian Align Instruction Prev B A Operand B Operand Result -- -- 0123 -- Byte_align_be[dest1, r1] 0123 0123 4567 2345 Byte_align_be[dest2, r2] 4567 4567 89AB 6789 Byte_align_be[dest3, r3] 89AB 89AB CDEF ABCD Byte_align_be[--, r0] NOTE: A Operand comes from Prev_B register during byte_align_be instructions. 44 Hardware Reference Manual Intel® IXP2800 Network Processor Technical Description Example 11 shows a little-endian sequence of instructions and the value of the various operands. Table 8 shows the data in the registers for this example. The value in BYTE_INDEX[1:0] CSR (which controls the shift amount) for this example is 2. Table 8. Register Contents for Example 11 Register Byte 3 [31:24] Byte 2 [23:16] Byte 1 [15:8] Byte 0 [7:0] 0 3 2 1 0 1 7 6 5 4 2 B A 9 8 3 F E D C Example 11. Little-Endian Align Instruction A Operand B Operand Prev A Result Byte_align_le[--, r0] 3210 -- -- -- Byte_align_le[dest1, r1] 7654 3210 3210 5432 Byte_align_le[dest2, r2] BA98 7654 7654 9876 Byte_align_le[dest3, r3] FEDC BA98 BA98 DCBA NOTE: B Operand comes from Prev_A register during byte_align_le instructions. As the examples show, byte aligning “n” words takes “n+1” cycles due to the first instruction needed to start the operation. Another mode of operation is to use the T_INDEX register with post-increment, to select the source registers. T_INDEX operation is described later in this chapter. 2.3.7.2 CAM The block diagram in Figure 7 is used to explain the CAM operation. The CAM has 16 entries. Each entry stores a 32-bit value, which can be compared against a source operand by instruction: CAM_Lookup[dest_reg, source_reg] All entries are compared in parallel, and the result of the lookup is a 9-bit value that is written into the specified destination register in bits 11:3, with all other bits of the register 0 (the choice of bits 11:3 is explained below). The result can also optionally be written into either of the LM_Addr registers (see below in this section for details). The 9-bit result consists of four State bits (dest_reg[11:8]), concatenated with a 1-bit Hit/Miss indication (dest_reg[7]), concatenated with 4-bit entry number (dest_reg[6:3]). All other bits of dest_reg are written with 0. Possible results of the lookup are: • miss (0) — lookup value is not in CAM, entry number is Least Recently Used entry (which can be used as a suggested entry to replace), and State bits are 0000. • hit (1) — lookup value is in CAM, entry number is entry that has matched; State bits are the value from the entry that has matched. Hardware Reference Manual 45 Intel® IXP2800 Network Processor Technical Description Figure 7. 
CAM Block Diagram Lookup Value (from A port) Tag State Tag State Tag State Tag State Match Match Match Status and LRU Logic Match Lookup Status (to Dest Req) State Status Entry Number 0000 Miss 0 LRU Entry State Hit 1 Hit Entry A9354-01 Note: The State bits are data associated with the entry. The use is only by software. There is no implication of ownership of the entry by any Context. The State bits hardware function is: • the value is set by software (at the time the entry is loaded, or changed in an already loaded entry). • its value is read out on a lookup that hits, and used as part of the status written into the destination register. • its value can be read out separately (normally only used for diagnostic or debug). The LRU (Least Recently Used) Logic maintains a time-ordered list of CAM entry usage. When an entry is loaded, or matches on a lookup, it is marked as MRU (Most Recently Used). Note that a lookup that misses does not modify the LRU list. The CAM is loaded by instruction: CAM_Write[entry_reg, source_reg, state_value] The value in the register specified by source_reg is put into the Tag field of the entry specified by entry_reg. The value for the State bits of the entry is specified in the instruction as state_value. 46 Hardware Reference Manual Intel® IXP2800 Network Processor Technical Description The value in the State bits for an entry can be written, without modifying the Tag, by instruction: CAM_Write_State[entry_reg, state_value] Note: CAM_Write_State does not modify the LRU list. One possible way to use the result of a lookup is to dispatch to the proper code using instruction: jump[register, label#],defer [3] where the register holds the result of the lookup. The State bits can be used to differentiate cases where the data associated with the CAM entry is in flight, or is pending a change, etc. Because the lookup result was loaded into bits[11:3] of the destination register, the jump destinations are spaced eight instructions apart. This is a balance between giving enough space for many applications to complete their task without having to jump to another region, versus consuming too much Control Store. Another way to use the lookup result is to branch on just the hit miss bit, and use the entry number as a base pointer into a block of Local Memory. When enabled, the CAM lookup result is loaded into Local_Addr as follows: LM_Addr[5:0] = 0 ([1:0] are read-only bits) LM_Addr[9:6] = lookup result [6:3] (entry number) LM_Addr[11:10] = constant specified in instruction This function is useful when the CAM is used as a cache, and each entry is associated with a block of data in Local Memory. Note that the latency from when CAM_Lookup executes until the LM_Addr is loaded is the same as when LM_Addr is written by a Local_CSR_Wr instruction. The Tag and State bits for a given entry can be read by instructions: CAM_Read_Tag[dest_reg, entry_reg] CAM_Read_State[dest_reg, entry_reg] The Tag value and State bits value for the specified entry is written into the destination register, respectively for the two instructions (the State bits are placed into bits [11:8] of dest_reg, with all other bits 0). Reading the tag is useful in the case where an entry needs to be evicted to make room for a new value—the lookup of the new value results in a miss, with the LRU entry number returned as a result of the miss. The CAM_Read_Tag instruction can then be used to find the value that was stored in that entry. An alternative would be to keep the tag value in a GPR. 
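The lookup-result layout and the eviction flow discussed above can be sketched as follows. In this illustrative C fragment, cam_lookup, cam_read_tag, and cam_write are hypothetical stand-ins for the CAM_Lookup, CAM_Read_Tag, and CAM_Write instructions; only the bit-field positions (dest_reg[11:3]) are taken from the text:

#include <stdint.h>

/* Hypothetical stand-ins for the CAM instructions (illustrative only). */
extern uint32_t cam_lookup(uint32_t tag);        /* returns the dest_reg result   */
extern uint32_t cam_read_tag(unsigned entry);
extern void     cam_write(unsigned entry, uint32_t tag, unsigned state);

/* Field positions of the 9-bit lookup result written to dest_reg[11:3]. */
#define CAM_STATE(r)  (((r) >> 8) & 0xFu)        /* dest_reg[11:8]: state bits    */
#define CAM_HIT(r)    (((r) >> 7) & 0x1u)        /* dest_reg[7]:    1=hit, 0=miss */
#define CAM_ENTRY(r)  (((r) >> 3) & 0xFu)        /* dest_reg[6:3]:  entry number  */

/* CAM-as-cache pattern: on a miss the hardware suggests the LRU entry, whose
 * tag is read back so its associated data can be written back (evicted)
 * before the entry is reloaded with the new tag.                             */
static unsigned lookup_or_evict(uint32_t tag, uint32_t *evicted_tag, int *hit)
{
    uint32_t result = cam_lookup(tag);
    *hit = (int)CAM_HIT(result);
    if (!*hit) {
        unsigned victim = CAM_ENTRY(result);     /* LRU entry on a miss          */
        *evicted_tag = cam_read_tag(victim);     /* or keep a tag shadow in GPRs */
        cam_write(victim, tag, 0);               /* install new tag, state = 0   */
        return victim;
    }
    return CAM_ENTRY(result);                    /* matching entry on a hit      */
}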
These two instructions can also be used by debug and diagnostic software. Neither of these modify the state of the LRU pointer. Note: The following rules must be adhered to when using the CAM. • CAM is not reset by Microengine reset. Software must either do a CAM_clear prior to using the CAM to initialize the LRU and clear the tags to 0, or explicitly write all entries with CAM_write. • No two tags can be written to have same value. If this rule is violated, the result of a lookup that matches that value will be unpredictable, and LRU state is unpredictable. The value 0x00000000 can be used as a valid lookup value. However, note that CAM_clear instruction puts 0x00000000 into all tags. To avoid violating rule 2 after doing CAM_clear, it is necessary to write all entries to unique values prior to doing a lookup of 0x00000000. Hardware Reference Manual 47 Intel® IXP2800 Network Processor Technical Description An algorithm for debug software to find out the contents of the CAM is shown in Example 12. Example 12. Algorithm for Debug Software to Find out the Contents of the CAM ; First read each of the tag entries. Note that these reads ; don’t modify the LRU list or any other CAM state. tag[0] = CAM_Read_Tag(entry_0); ...... tag[15] = CAM_Read_Tag(entry_15); ; Now read each of the state bits state[0] = CAM_Read_State(entry_0); ... state[15] = CAM_Read_State(entry_15); ; Knowing what tags are in the CAM makes it possible to ; create a value that is not in any tag, and will therefore ; miss on a lookup. ; Next loop through a sequence of 16 lookups, each of which will ; miss, to obtain the LRU values of the CAM. for (i = 0; i < 16; i++) BEGIN_LOOP ; Do a lookup with a tag not present in the CAM. On a ; miss, the LRU entry will be returned. Since this lookup ; missed the LRU state is not modified. LRU[i] = CAM_Lookup(some_tag_not_in_cam); ; Now do a lookup using the tag of the LRU entry. This ; lookup will hit, which makes that entry MRU. ; This is necessary to allow the next lookup miss to ; see the next LRU entry. junk = CAM_Lookup(tag[LRU[i]]); END_LOOP ; Because all entries were hit in the same order as they were ; LRU, the LRU list is now back to where it started before the ; loop executed. ; LRU[0] through LRU[15] holds the LRU list. The CAM can be cleared with CAM_Clear instruction. This instruction writes 0x00000000 simultaneously to all entries tag, clears all the state bits, and puts the LRU into an initial state (where entry 0 is LRU, ..., entry 15 is MRU). 2.3.8 CRC Unit The CRC Unit operates in parallel with the Execution Datapath. It takes two operands, performs a CRC operation, and writes back a result. CRC-CCITT, CRC-32, CRC-10, CRC-5, and iSCSI polynomials are supported. One of the operands is the CRC_Remainder Local CSR, and the other is a GPR, Transfer_In register, Next Neighbor, or Local Memory, specified in the instruction and passed through the Execution Datapath to the CRC Unit. The instruction specifies the CRC operation type, whether to swap bytes and or bits, and which bytes of the operand to include in the operation. The result of the CRC operation is written back into CRC_Remainder. The source operand can also be written into a destination register (however the byte/bit swapping and masking do not affect the destination register; they only affect the CRC computation). This allows moving data, for example, from S_TRANSFER_IN registers to S_TRANSFER_OUT registers at the same time as computing the CRC. 
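To make the CRC operation concrete, the sketch below shows a bit-reflected CRC-32 (polynomial 0xEDB88320) accumulated over one 32-bit operand, least-significant byte first, with the running remainder playing the role of CRC_Remainder. This is a software model of just one possible configuration, written as an assumption: the instruction's byte/bit swap and byte-mask options change how the operand is fed in, and the other polynomials (CRC-CCITT, CRC-10, CRC-5, iSCSI) follow the same structure with different widths:

#include <stdint.h>

/* One CRC-32 update over a 32-bit operand, least-significant byte first,
 * using the reflected polynomial 0xEDB88320. The "crc" argument and return
 * value play the role of CRC_Remainder.                                    */
static uint32_t crc32_update_word(uint32_t crc, uint32_t word)
{
    for (int byte = 0; byte < 4; byte++) {
        crc ^= (word >> (8 * byte)) & 0xFFu;
        for (int bit = 0; bit < 8; bit++)
            crc = (crc & 1u) ? (crc >> 1) ^ 0xEDB88320u : (crc >> 1);
    }
    return crc;
}

/* Typical framing: start with crc = 0xFFFFFFFF, feed each payload word,
 * then invert the final remainder.                                         */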
48 Hardware Reference Manual Intel® IXP2800 Network Processor Technical Description 2.3.9 Event Signals Event Signals are used to coordinate a program with completion of external events. For example, when a Microengine executes an instruction to an external unit to read data (which will be written into a Transfer_In register), the program must insure that it does not try to use the data until the external unit has written it. This time is not deterministic due to queuing delays and other uncertainty in the external units (for example, DRAM refresh). There is no hardware mechanism to flag that a register write is pending, and then prevent the program from using it. Instead the coordination is under software control, with hardware support. In the instructions that use external units (i.e., SRAM, DRAM, etc.) there are fields that direct the external unit to supply an indication (called an Event Signal) that the command has been completed. There are 15 Event Signals per Context that can be used, and Local CSRs per Context to track which Event Signals are pending and which have been returned. The Event Signals can be used to move a Context from Sleep state to Ready state, or alternatively, the program can test and branch on the status of Event Signals. Event Signals can be set in nine different ways. 1. When data is written into S_TRANSFER_IN registers 2. When data is written into D_TRANSFER_IN registers 3. When data is taken from S_TRANSFER_OUT registers 4. When data is taken from D_TRANSFER_OUT registers 5. By a write to INTERTHREAD_SIGNAL register 6. By a write from Previous Neighbor Microengine to NEXT_NEIGHBOR_SIGNAL 7. By a write from Next Neighbor Microengine to PREVIOUS_NEIGHBOR_SIGNAL 8. By a write to SAME_ME_SIGNAL Local CSR 9. By Internal Timer Any or all Event Signals can be set by any of the above sources. When a Context goes to the Sleep state (executes a ctx_arb instruction, or an instruction with ctx_swap token), it specifies which Event Signal(s) it requires to be put in Ready state. The ctx_arb instruction also specifies if the logical AND or logical OR of the Event Signal(s) is needed to put the Context into Ready state. When all of the Context’s Event Signals arrive, the Context goes to Ready state, and then eventually to Executing state. In the case where the Event Signal is linked to moving data into or out of Transfer registers (numbers 1 through 4 in the list above), the code can safely use the Transfer register as the first instruction (for example, using a Transfer_In register as a source operand will get the new read data). The same is true when the Event Signal is tested for branches (br_=signal or br_!signal instructions). The ctx_arb instruction, CTX_SIG_EVENTS, and ACTIVE_CTX_WAKEUP_#_EVENTS Local CSR descriptions provide details. Hardware Reference Manual 49 Intel® IXP2800 Network Processor Technical Description 2.4 DRAM The IXP2800 Network Processor has controllers for three Rambus* DRAM (RDRAM) channels. Each of the controllers independently accesses its own RDRAMs, and can operate concurrently with the other controllers (i.e., they are not operating as a single, wider memory). DRAM provides high-density, high-bandwidth storage and is typically used for data buffers. • RDRAM sizes of 64, 128, 256, or 512 Mbytes, and 1 Gbyte are supported; however, each of the channels must have the same number, size, and speed of RDRAMs populated. Refer to Section 5.2 for supported size and loading configurations. • Up to two Gbytes of DRAM is supported. 
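The 128-byte interleaving mentioned in the list above can be pictured with a small address-mapping model. This C sketch is conceptual only; the network processor's actual channel-selection logic is internal to the hardware and is not specified here:

#include <stdint.h>

typedef struct {
    unsigned channel;            /* which populated RDRAM channel   */
    uint32_t channel_offset;     /* byte offset within that channel */
} dram_loc_t;

/* Striping model: consecutive 128-byte blocks rotate across the populated
 * channels, so software still sees one contiguous address space.          */
static dram_loc_t dram_interleave(uint32_t addr, unsigned num_channels)
{
    uint32_t block = addr / 128;                     /* 128-byte interleave unit */
    dram_loc_t loc;
    loc.channel        = block % num_channels;
    loc.channel_offset = (block / num_channels) * 128 + (addr % 128);
    return loc;
}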
If less than two Gbytes of memory is present, the upper part of the address space is not used. It is also possible, for system cost and area savings, to have Channels 0 and 1 populated with Channel 2 empty, or Channel 0 populated with Channels 1and 2 empty. • Reads and writes to RDRAM are generated by Microengines, The Intel XScale® core, and PCI (external Bus Masters and DMA Channels). The controllers also do refresh and calibration cycles to the RDRAMs, transparently to software. • RDRAM Powerdown and Nap modes are not supported. • Hardware interleaving (also known as striping) of addresses is done to provide balanced access to all populated channels. The interleave size is 128 bytes. Interleaving helps to maintain utilization of available bandwidth by spreading consecutive accesses to multiple channels. The interleaving is done in the hardware in such a way that the three channels appear to software as a single contiguous memory space. • ECC (Error Correcting Code) is supported, but can be disabled. Enabling ECC requires that x18 RDRAMs be used. If ECC is disabled x16 RDRAMs can be used. ECC can detect and correct all single-bit errors, and detect all double-bit errors. When ECC is enabled, partial writes (writes of less than 8 bytes) must be done as read-modify-writes. 2.4.1 Size Configuration Each channel can be populated with anywhere from one-to-four RDRAMs (Short Channel Mode). Refer to Section 5.2 for supported size and loading configurations. The RAM technology used will determine the increment size and maximum memory per channel as shown in Table 9. Table 9. RDRAM Sizes RDRAM Technology1 Increment Size Maximum per Channel 64/72 MB 8 MB 256 MB 128/144 MB 16 MB 512 MB 256/288 MB 32 MB 1 GB2 512/576 MB 64 MB 2 GB2 NOTES: 1. The two numbers shown for each technology indicate x16 parts and x18 parts. 2. The maximum memory that can be addressed across all channels is 2 GB. This limitation is based on the partitioning of the 4-GB address space (32-bit addresses). Therefore, if all three channels are used, each can be populated up to a maximum of 768 MB. Two channels can be populated to a maximum of 1 GB each. A single channel can be populated to a maximum of 2 GB. RDRAMs with 1 x 16 or 2 x 16 dependent banks, and 4 independent banks are supported. 50 Hardware Reference Manual Intel® IXP2800 Network Processor Technical Description 2.4.2 Read and Write Access The minimum DRAM physical access length is 16 bytes. Software (and PCI) can read or write as little as a single byte, however the time (and bandwidth) taken at the DRAMs is the same as for an access of 16 bytes. Therefore, the best utilization of DRAM bandwidth will be for accesses that are multiples of 16 bytes. If ECC is enabled, writes of less than 8 bytes must do read-modify-writes, which take two 16-byte time accesses (one for the read and one for the write). 2.5 SRAM The IXP2800 Network Processor has four independent SRAM controllers, which each support pipelined QDR synchronous static RAM (SRAM) and/or a coprocessor that adheres to QDR signaling. Any or all controllers can be left unpopulated if the application does not need to use them. SRAM are accessible by the Microengines, the Intel XScale® core, and the PCI Unit (external bus masters and DMA). The memory is logically four bytes (32-bits) wide; physically the data pins are two bytes wide and are double clocked. Byte parity is supported. Each of the four bytes has a parity bit, which is written when the byte is written and checked when the data is read. 
There are byte-enables that select which bytes to write for writes of less than 32 bits. Each of the 4 QDR ports are QDR and QDRII compatible. Each port implements the “_K” and “_C” output clocks and “_CQ” as an input and their inversions. (Note: the “_C” and “_CQ” clocks are optional). Extensive work has been performed providing impedance controls within the IXP2800 Network Processor for processor-initiated signals driving to QDR parts. Providing a clean signaling environment is critical to achieving 200 – 250 MHz QDRII data transfers. The configuration assumptions for the IXP2800 Network Processor I/O driver/receiver development includes four QDR loads and the IXP2800 Network Processor. The IXP2800 Network Processor supports bursts of two SRAMs, but does not support bursts of four SRAMs. The SRAM controller can also be configured to interface to an external coprocessor that adheres to the QDR electricals and protocol. Each SRAM controller may also interface to an external coprocessor through its standard QDR interface. This interface enables the cohabitation of both SRAM devices and coprocessors to operate on the same bus. The coprocessor behaves as a memory-mapped device on the SRAM bus. Hardware Reference Manual 51 Intel® IXP2800 Network Processor Technical Description 2.5.1 QDR Clocking Scheme The controller drives out two pairs of K clock (K and K#). It also drives out two pairs of C clock (C and C#). Both C/C# clocks externally return to the controller for reading data. Figure 8 shows the clock diagram if the clocking scheme for QDR interface driving four SRAM chips. Figure 8. Echo Clock Configuration Clam-shelled SRAMS Termination CQ/CQ# Package Balls QDRn_CIN[0] K/K# C/C# Intel® QDRn_K[0] IXP2800 QDRn_C[0] Network Processor QDRn_C[1] QDRn_K[1] C/C# *QDRn_CIN[1] K/K# Package Balls Termination CQ/CQ# *The CIN[1] pin is not used internally to capture the READ data; however, the I/O Pad can be used to terminate the signal. B3664-01 2.5.2 SRAM Controller Configurations Each channel has enough address pins (24) to support up to 64 Mbytes of SRAM. The SRAM controllers can directly generate multiple port enables (up to four pairs) to allow for depth expansion. Two pairs of pins are dedicated for port enables. Smaller RAMs use fewer address signals than the number provided to accommodate the largest RAMs, so some address pins (23:20) are configurable as either address or port enable based on CSR setting as shown in Table 10. Note that all of the SRAMs on a given channel must be the same size. Table 10. SRAM Controller Configurations (Sheet 1 of 2) 52 SRAM Configuration SRAM Size Addresses Needed to Index SRAM Addresses Used as Port Enables Total Number of Port Select Pairs Available 512K x 18 1 MB 17:0 23:22, 21:20 4 1M x 18 2 MB 18:0 23:22, 21:20 4 2M x 18 4 MB 19:0 23:22, 21:20 4 4M x 18 8 MB 20:0 23:22 3 Hardware Reference Manual Intel® IXP2800 Network Processor Technical Description Table 10. SRAM Controller Configurations (Sheet 2 of 2) SRAM Configuration SRAM Size Addresses Needed to Index SRAM Addresses Used as Port Enables Total Number of Port Select Pairs Available 8M x 18 16 MB 21:0 23:22 3 16M x 18 32 MB 22:0 None 2 32M x 18 64 MB 23:0 None 2 Each channel can be expanded by depth according to the number of port enables available. If external decoding is used, then the number of SRAMs used is not limited by the number of port enables generated by the SRAM controller. 
Note: Doing external decoding may require external pipeline registers to account for the decode time, depending on the desired frequency. Maximum SRAM system sizes are shown in Table 11. Shaded entries require external decoding, because they use more port enables than the SRAM controller can supply directly. Table 11. Total Memory per Channel Number of SRAMs on Channel SRAM Size 512K x 18 2.5.3 1 2 3 4 5 6 7 8 1 MB 2 MB 3 MB 4 MB 5 MB 6 MB 7 MB 8 MB 1M x 18 2 MB 4 MB 6 MB 8 MB 10 MB 12 MB 14 MB 16 MB 2M x 18 4 MB 8 MB 12 MB 16 MB 20 MB 24 MB 28 MB 32 MB 4M x 18 8 MB 16 MB 24 MB 32 MB 64 MB NA NA NA 8M x 18 16 MB 32 MB 48 MB 64 MB NA NA NA NA 16M x 18 32 MB 64 MB NA NA NA NA NA NA 32M x 18 64 MB NA NA NA NA NA NA NA SRAM Atomic Operations In addition to normal reads and writes, SRAM supports the following atomic operations. Microengines have specific instructions to do each atomic operation; Intel XScale® microarchitecture uses aliased address regions to do atomic operations. • • • • • • bit set bit clear increment decrement add swap The SRAM does read-modify-writes for the atomic operations, the pre-modified data can also be returned if desired. The atomic operations operate on a single 32-bit word. Hardware Reference Manual 53 Intel® IXP2800 Network Processor Technical Description 2.5.4 Queue Data Structure Commands The ability to enqueue and dequeue data buffers at a fast rate is key to meeting line-rate performance. This is a difficult problem as it involves dependent memory references that must be turned around very quickly. The SRAM controller includes a data structure (called the Q_array) and associated control logic to perform efficient enqueue and dequeue operations. The Q_array has 64 entries, each of which can be used in one of four ways. • • • • Linked-list queue descriptor (resident queues) Cache of recently used linked-list queue descriptors (backing store for the cache is in SRAM) Ring descriptor Journal The commands provided are: For Linked-list queues or Cache of recently used linked-list queue descriptors • • • • • • • • Read_Q_Descriptor_Head(address, length, entry, xfer_addr) Read_Q_Descriptor_Tail(address, length, entry) Read_Q_Descriptor_Other(address, entry) Write_Q_Descriptor(address, entry) Write_Q_Descriptor_Count(address, entry) ENQ(buff_desc_adr, cell_count, EOP, entry) ENQ_tail(buff_desc_adr, entry) DEQ(entry, xfer_addr) For Rings • • Get(entry, length, xfer_addr) Put(entry, length, xfer_addr) For Journals • • Note: 2.5.5 Journal(entry, length, xfer_addr) Fast_journal(entry) The Read_Q_Descriptor_Head, Read_Q_Descriptor_Tail, etc.) are used to initialize the rings and journals but not used to perform the ring and journal function. Reference Ordering This section covers the ordering between accesses to any one SRAM controller. 2.5.5.1 Reference Order Tables Table 12 shows the architectural guarantees of order to access to the SAME SRAM address between a reference of any given type (shown in the column labels) and a subsequent reference of any given type (shown in the row labels). The definition of first and second is defined by the order they are received by the SRAM controller. Note: 54 A given Network Processor version may implement a superset of these order guarantees. However, that superset may not be supported in future implementations. Hardware Reference Manual Intel® IXP2800 Network Processor Technical Description Verification is required to test only the order rules shown in Table 12 and Table 13). 
Note: A blank entry in Table 12 means that no order is enforced.

Table 12. Address Reference Order
1st ref 2nd ref Memory Read CSR Read Memory Read Memory Write CSR Write Memory RMW Queue / Ring / Q_Descr Commands Order CSR Read Order Memory Write Order CSR Write Order Memory RMW Order Queue / Ring / Q_Descr Commands See Table 13.

Table 13 shows the architectural guarantees of order of accesses to the SAME SRAM Q_array entry between a reference of any given type (shown in the column labels) and a subsequent reference of any given type (shown in the row labels). First and second are defined by the order in which the references are received by the SRAM controller. The same caveats apply as for Table 12.

Table 13. Q_array Entry Reference Order
1st ref 2nd ref Read_Q_Descr head, tail Read_Q_Descr other Read_Q_Descr head, tail Read_Q_Descr other Write_Q_Descr Enqueue Dequeue Put Get Journal Order Order Write_Q_Descr Enqueue Order Order Dequeue Order Order Put Get Journal Order Order Order Order Order

2.5.5.2 Microengine Software Restrictions to Maintain Ordering

It is the Microengine programmer's job to ensure order where the program flow requires it and where the architecture does not guarantee it. The signaling mechanism can be used to do this. For example, say that microcode needs to update several locations in a table. A location in SRAM is used to "lock" access to the table. Example 13 is the code for the table update.

Example 13. Table Update Code
IMMED [$xfer0, 1]
SRAM [write, $xfer0, flag_address, 0, 1], ctx_swap [SIG_DONE_2]
; At this point, the write to flag_address has passed the point of coherency. Do
; the table updates.
SRAM [write, $xfer1, table_base, offset1, 2], sig_done [SIG_DONE_3]
SRAM [write, $xfer3, table_base, offset2, 2], sig_done [SIG_DONE_4]
CTX_ARB [SIG_DONE_3, SIG_DONE_4]
; At this point, the table writes have passed the point of coherency. Clear the
; flag to allow access by other threads.
IMMED [$xfer0, 0]
SRAM [write, $xfer0, flag_address, 0, 1], ctx_swap [SIG_DONE_2]

Other rules:
• All accesses to atomic variables should be via read-modify-write instructions.
• If the flow must know that a write is completed (actually in the SRAM itself), follow the write with a read to the same address. The write is guaranteed to be complete when the read data has been returned to the Microengine.
• With the exception of initialization, never do WRITE commands to the first three longwords of a queue_descriptor data structure (these are the longwords that hold head, tail, and count, etc.). All accesses to this data must be via the Q commands.
• To initialize the Q_array registers, perform a memory write of at least three longwords, followed by a memory read to the same address (to guarantee that the write completed). Then, for each entry in the Q_array, perform a read_q_descriptor_head followed by a read_q_descriptor_other using the address of the same three longwords.

2.6 Scratchpad Memory

The IXP2800 Network Processor contains 16 Kbytes of Scratchpad Memory, organized as 4K 32-bit words, that is accessible by Microengines and the Intel XScale® core. The Scratchpad Memory provides the following operations:
• Normal reads and writes. 1–16 32-bit words can be read/written with a single Microengine instruction. Note that Scratchpad is not byte-writable (each write must write all four bytes).
• Atomic read-modify-write operations: bit-set, bit-clear, increment, decrement, add, subtract, and swap. The RMW operations can also optionally return the pre-modified data.
• Sixteen Hardware Assisted Rings for interprocess communication. (A ring is a FIFO that uses a head and tail pointer to store/read information in Scratchpad Memory.)

Scratchpad Memory is provided as a third memory resource (in addition to SRAM and DRAM) that is shared by the Microengines and the Intel XScale® core. The Microengines and the Intel XScale® core can distribute memory accesses between these three types of memory resources so that a greater number of memory accesses occur in parallel.

2.6.1 Scratchpad Atomic Operations

In addition to normal reads and writes, the Scratchpad Memory supports the following atomic operations. Microengines have specific instructions to do each atomic operation; the Intel XScale® microarchitecture uses aliased address regions to do atomic operations.
• bit set
• bit clear
• increment
• decrement
• add
• subtract
• swap

The Scratchpad Memory does read-modify-writes for the atomic operations; the pre-modified data can also be returned if desired. The atomic operations operate on a single 32-bit word.

2.6.2 Ring Commands

The Scratchpad Memory provides sixteen Rings used for interprocess communication. The rings provide two operations:
• Get(ring, length)
• Put(ring, length)

Ring is the number of the ring (0 through 15) to get from or put to, and length specifies the number of 32-bit words to transfer. A logical view of one of the rings is shown in Figure 9.

Figure 9. Logical View of Rings
(The figure shows the Scratchpad RAM behind an address decoder that handles read, write, and atomic addresses, with Head, Tail, Count, Size, and Full state maintained for each of the 16 rings.)

Head, Tail, and Size are registers in the Scratchpad Unit. Head and Tail point to the actual ring data, which is stored in the Scratchpad RAM. The count of how many entries are on the Ring is determined by hardware using the Head and Tail. For each Ring in use, a region of Scratchpad RAM must be reserved for the ring data.

Note: The reservation is by software convention. The hardware does not prevent other accesses to the region of Scratchpad Memory used by the Ring. Also, the regions of Scratchpad Memory allocated to different Rings must not overlap.

Head points to the next address to be read on a get, and Tail points to the next address to be written on a put. The size of each Ring is selectable from the following choices: 128, 256, 512, or 1024 32-bit words.

Note: The region of Scratchpad used for a Ring is naturally aligned to its size.

When the Ring is near full, it asserts an output signal, which is used as a state input to the Microengines. They must use that signal to test (by doing Branch on Input State) for room on the Ring before putting data onto it. There is a lag in time from a put instruction executing to the Full signal being updated to reflect that put. To guarantee that a put will not overfill the ring, there is a bound on the number of Contexts and the number of 32-bit words per write based on the size of the ring, as shown in Table 14. Each Context should test the Full signal, then do the put if not Full, and then wait until the Context has been signaled that the data has been pulled before testing the Full signal again.
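The test-put-wait protocol just described can be modeled in C. This is a minimal sketch under stated assumptions: ring_near_full(), scratch_ring_put(), and wait_for_pull_signal() are hypothetical helpers standing in for the Branch on Input State test of the ring Full signal, the Scratchpad ring put, and the Event Signal that indicates the put data has been pulled; they are not Intel APIs.

#include <stdint.h>
#include <stdbool.h>

/* Hypothetical platform helpers; assumptions, not Intel APIs. */
bool ring_near_full(unsigned ring);            /* Branch on Input State on the Full signal */
void scratch_ring_put(unsigned ring, const uint32_t *words, unsigned length);
void wait_for_pull_signal(void);               /* put data has been pulled from this Context */

/* One Context putting 'length' 32-bit words onto ring 'ring'.  'length' must
 * respect the per-put bound for this ring size and number of Contexts
 * (Table 14); 16 is the most a single put instruction can generate. */
void post_to_ring(unsigned ring, const uint32_t *words, unsigned length)
{
    while (ring_near_full(ring))
        ;                              /* no room; retry (or swap out) until Full deasserts */

    scratch_ring_put(ring, words, length);

    /* The Full signal lags the put; do not test it again for another put
     * until the data for this one has been pulled. */
    wait_for_pull_signal();
}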
An alternate usage method is to have Contexts allocate and deallocate entries from a shared count variable, using the atomic subtract to allocate and atomic add to deallocate. In this case the Full signal is not used. Table 14. Ring Full Signal Use – Number of Contexts and Length versus Ring Size Number of Contexts Ring Size 128 256 512 1024 1 16 16 16 16 2 16 16 16 16 4 8 16 16 16 8 4 12 16 16 16 2 6 14 16 24 1 4 9 16 32 1 3 7 15 40 Illegal 2 5 12 48 Illegal 2 4 10 64 Illegal 1 3 7 128 Illegal Illegal 1 3 NOTES: 1. Number in each table entry is the largest length that should be put. 16 is the largest length that a single put instruction can generate. 2. Illegal -- With that number of Contexts, even a length of one could cause the Ring to overfill. 58 Hardware Reference Manual Intel® IXP2800 Network Processor Technical Description 2.7 Media and Switch Fabric Interface The Media and Switch Fabric (MSF) Interface is used to connect the IXP2800 Network Processor to a physical layer device (PHY) and/or to a Switch Fabric. the MSF consists of separate receive and transmit interfaces. Each of the receive and transmit interfaces can be separately configured for either SPI-4 Phase 2 (System Packet Interface) for PHY devices or CSIX-L1 protocol for Switch Fabric Interfaces. The receive and transmit ports are unidirectional and independent of each other. Each port has 16 data signals, a clock, a control signal, and a parity signal, all of which use LVDS (differential) signaling, and are sampled on both edges of the clock. There is also a flow control port consisting of a clock, data, and ready status bits, and used to communicate between two IXP2800 Network Processors, or the IXP2800 Network Processor chip and a Switch Fabric Interface. These are also LVDS, dual-edge data transfer. All of the high speed LVDS interfaces support dynamic deskew training. The block diagram in Figure 10 shows a typical configuration. Figure 10. Example System Block Diagram Receive protocol is SPI-4 Transmit mode is CSIX Ingress Intel® IXP2800 Network Processor RDAT TDAT Framing/MAC Device (PHY) RSTAT Flow Control SPI-4 Protocol Egress Intel® IXP2800 Network Processor Optional Gasket (Note 1 ) Switch Fabric CSIX Protocol TSTAT RDAT TDAT Receive protocol is CSIX Transmit mode is SPI-4 Notes: 1. Gasket is used to convert 16-bit, dual-data IXP2800 signals to wider single edge CWord signals used by Switch Fabric, if required. 2. Per the CSIX specification, the terms "egress" and ingress" are with respect to the Switch Fabric. So the egress processor handles traffic received from the Switch Fabric and the ingress processor handles traffic sent to the Switch Fabric. A9356-03 Hardware Reference Manual 59 Intel® IXP2800 Network Processor Technical Description An alternate system configuration is shown in the block diagram in Figure 11. In this case, a single IXP2800 Network Processor is used for both Ingress and Egress. The bit rate supported would be less than in Figure 10. A hypothetical Bus Converter chip, external to the IXP2800 Network Processor is used. The block diagram in Figure 11 is only an illustrative example. Figure 11. Full-Duplex Block Diagram Receive and transmit protocol is SPI-4 and CSIX on transferby-transfer basis. Intel® IXP2800 Network Processor RDAT Framing/MAC Device (PHY) TDAT Tx Rx Switch Fabric Bus Converter UTOPIA-3 or IXBUS Protocol Tx Rx CSIX Protocol Notes: The Bus Converter chip receives and transmits both SPI-4 and CSIX protocols from/to Intel IXP2800 Network Processor. 
It steers the data, based on protocol, to either PHY device or Switch Fabric. PHY interface can be UTOPIA-3, IXBUS, or any other required protocol. A9357-02 2.7.1 SPI-4 SPI-4 is an interface for packet and cell transfer between a physical layer (PHY) device and a link layer device (the IXP2800 Network Processor), for aggregate bandwidths of OC-192 ATM and Packet over SONET/SDH (POS), as well as 10 Gb/s Ethernet applications. The Optical Internetworking Forum (OIF), www.oiforum.com, controls the SPI-4 Implementation Agreement document. SPI-4 protocol transfers data in variable length bursts. Associated with each burst is information such as Port number (for a multi-port device such as a 10 x 1 GbE), SOP, and EOP. This information is collected by the MSF and passed to the Microengines. 60 Hardware Reference Manual Intel® IXP2800 Network Processor Technical Description 2.7.2 CSIX CSIX-L1 (Common Switch Interface) defines an interface between a Traffic Manager (TM) and a Switch Fabric (SF) for ATM, IP, MPLS, Ethernet, and similar data communications applications. The Network Processor Forum (NPF) www.npforum.org, controls the CSIX-L1 specification. The basic unit of information transferred between Traffic Managers and Switch Fabrics is called a CFrame. There are three categories of CFrames: • Data • Control • Flow Control Associated with each CFrame is information such as length, type, address. This information is collected by MSF and passed to Microengines. MSF also contains a number of hardware features related to flow control. 2.7.3 Receive Figure 12 is a simplified block diagram of the MSF receive section. Figure 12. Simplified MSF Receive Section Block Diagram Checksum RDAT RCTL RPAR CSIX Protocol Logic - SPI-4 Protocol Logic RCLK RCLK REF Full Element List SPI-4 Flow Control Clock for Receive Functions - 128 - Buffers 32 (to MEs) 64 (to DRAM) Full Indication to Flow Control RPROT RSTAT RBUF - - - - - - - - - - - - - Control Receive Thread Freelists CSR Write CSIX CFrames mapped by RX_Port_Map CSR (normally Flow Control CFrames are mapped here) FCEFIFO - - - - - - - - - - - - - - - - - - - - TXCFC (FCIFIFO full) TXCDAT A9365-01 Hardware Reference Manual 61 Intel® IXP2800 Network Processor Technical Description 2.7.3.1 RBUF RBUF is a RAM that holds received data. It stores received data in sub-blocks (referred to as elements), and is accessed by Microengines or the Intel XScale® core reading the received information. Details of how RBUF elements are allocated and filled is based on the receive data protocol. When data is received, the associated status is put into the FULL_ELEMENT_LIST FIFO and subsequently sent to Microengines to process. FULL_ELEMENT_LIST insures that received elements are sent to Microengines in the order that the data was received. RBUF contains a total of 8 KB of data. The element size is programmable as either 64 bytes, 128 bytes, or 256 bytes per element. In addition, RBUF can be programmed to be split into one, two, or three partitions depending on application. For receiving SPI-4, one partition would be used. For receiving CSIX, two partitions are used (Control CFrames and Data CFrames). When both protocols are being used, the RBUF can be split into three partitions. For both SPI-4 and CSIX, three partitions are used. Microengines can read data from the RBUF to Microengine S_TRANSFER_IN registers using the instruction where they specify the starting byte number (which must be aligned to 4 bytes), and number of 32-bit words to read. 
The number in the msf[read] instruction can be either the number of 32-bit words or the number of 32-bit word pairs, using the single and double instruction modifiers, respectively.

Microengines can move data from RBUF to DRAM using the dram instruction, where they specify the starting byte number (which must be aligned to 4 bytes), the number of 32-bit words to read, and the address in DRAM to write the data.

For both types of RBUF read, reading an element does not modify any RBUF data and does not free the element, so buffered data can be read as many times as desired. This allows, for example, a processing pipeline to have different Microengines handle different protocol layers, with each Microengine reading only the specific header information it requires.

2.7.3.1.1 SPI-4 and the RBUF

SPI-4 data is placed into RBUF with each SPI-4 burst allocating an element. If a SPI-4 burst is larger than the element size, another element is allocated. The 64-bit status information for the element contains the following fields: SOP, EOP, Err, Len Err, Abort Err, Par Err, Type, Null, RPROT, Element, Byte Count, ADR, and Checksum, plus reserved bits. The definitions of the fields are shown in Table 90, "RBUF SPI-4 Status Definition" on page 252.

2.7.3.1.2 CSIX and RBUF

CSIX CFrames are placed into RBUF with each CFrame allocating an element. Unlike SPI-4, a single CFrame must not spill over into another element. Since the CSIX specification defines a maximum CFrame size of 256 bytes, this can be guaranteed by programming the element size to 256 bytes. However, if the Switch Fabric uses a smaller CFrame size, then a smaller RBUF element size can be used.

Flow Control CFrames are put into the FCEFIFO, to be sent to the Ingress IXP2800 Network Processor, where a Microengine will read them to manage flow control information to the Switch Fabric.

The 64-bit status information for the element contains the following fields: Len Err, HP Err, VP Err, Err, CR, P, Type, Null, RPROT, Element, Payload Length, and Extension Header, plus reserved bits. The definitions of the fields are shown in Table 91, "RBUF CSIX Status Definition" on page 254.

2.7.3.2 Full Element List

Receive control hardware maintains the FULL_ELEMENT_LIST to hold the status of valid RBUF elements, in the order in which they were received. When an RBUF element is filled, its status is added to the tail of the FULL_ELEMENT_LIST. When a Microengine is notified of element arrival (by having the status written to its S_Transfer register), the status is removed from the head of the FULL_ELEMENT_LIST.

2.7.3.3 RX_THREAD_FREELIST

RX_THREAD_FREELIST is a FIFO that indicates which Microengine Contexts are awaiting an RBUF element to process. This allows the Contexts to indicate their ready status prior to the reception of data, as a way to eliminate latency. Each entry added to a Freelist also has an associated S_TRANSFER register and signal number. There are three RX_THREAD_FREELISTs, one corresponding to each RBUF partition.
To be added as ready to receive an element, a Microengine does an msf[write] or an msf[fast_write] to the RX_THREAD_FREELIST address; the write data is the Microengine/ CONTEXT/S_TRANSFER register number to add to the Freelist. When there is valid status at the head of the Full Element List, it will be pushed to a Microengine. The receive control logic pushes the status information (which includes the element number) to the Microengine in the head entry of RX_THREAD_FREELIST, and sends an Event Signal to the Microengine. It then removes that entry from the RX_THREAD_FREELIST, and removes the status from Full Element List. Hardware Reference Manual 63 Intel® IXP2800 Network Processor Technical Description Each RX_THREAD_FREELIST has an associated countdown timer. If the timer expires and no new receive data is available yet, the receive logic will autopush a Null Receive Status Word to the next thread on the RX_THREAD_FREELIST. A Null Receive Status Word has the “Null” bit set, and does not have any data or RBUF entry associated with it. The RX_THREAD_FREELIST timer is useful for certain applications. Its primary purpose is to keep the receive processing pipeline (implemented as code running on the Microengines) moving even when the line has gone idle. It is especially useful if the pipeline is structured to handle mpackets in groups, i.e., eight mpackets at a time. If seven mpackets are received, then the line goes idle, then the timeout will trigger the autopush of a null Receive Status Word, filling the eighth slot and allowing the pipeline to advance. Another example is if one valid mpacket is received before the line goes idle for a long period; seven null Receive Status Words will be autopushed, allowing the pipeline to proceed. Typically the timeout interval is programmed to be slightly larger than the minimum arrival time of the incoming cells or packets. The timer is controlled using the RX_THREAD_FREELIST_TIMEOUT_# CSR. The timer may be enabled or disabled, and the timeout value specified using this CSR. 2.7.3.4 Receive Operation Summary During receive processing, received CFrames, and SPI-4 cells and packets (which in this context are all called mpackets) are placed into the RBUF, and then handed off to a Microengine to process. Normally, by application design, some number of Microengine Contexts will be assigned to receive processing. Those Contexts will have their number added to the proper RX_THREAD_FREELIST (via msf[write]or msf[fast_write]), and then will go to sleep to wait for arrival of an mpacket (or alternatively poll waiting for arrival of an mpacket). When an mpacket arrives, MSF receive control logic will autopush eight bytes of information for the element to the Microengine/CONTEXT/S_TRANSFER registers at the head of RX_THREAD_FREELIST. The information pushed is: • Status Word (SPI-4) or Header Status (CSIX) — see Table 90, “RBUF SPIF-4 Status Definition” on page 252 for more information. • Checksum (SPI-4) or Extension Header (CSIX) — see Table 91, “RBUF CSIX Status Definition” on page 254 for more information. To handle the case where the receive Contexts temporarily fall behind and RX_THREAD_FREELIST is empty, all received element numbers are held in the FULL_ELEMENT_LIST. In that case, as soon as an RX_THREAD_FREELIST entry is entered, the status of the head element of FULL_ELEMENT_LIST will be pushed to it. 
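Putting the pieces of this section together, a per-Context receive loop can be sketched in C. This is a schematic model only; the helper functions are hypothetical stand-ins (not Intel APIs) for the msf[write]/msf[fast_write] to RX_THREAD_FREELIST, the autopushed status words, and the RBUF_ELEMENT_DONE write described below.

#include <stdint.h>
#include <stdbool.h>

struct rx_status {
    uint32_t status_word;    /* SPI-4 Status Word or CSIX Header Status */
    uint32_t second_word;    /* SPI-4 Checksum or CSIX Extension Header */
};

/* Hypothetical platform helpers; assumptions, not Intel APIs. */
void     freelist_post_self(unsigned freelist);      /* msf write of ME/Context/xfer number    */
struct rx_status wait_for_autopush(void);            /* sleep until the 8 status bytes arrive  */
bool     status_is_null(const struct rx_status *s);  /* Null bit set: timer fired, no element  */
unsigned status_element(const struct rx_status *s);  /* RBUF element number from the status    */
void     process_element(const struct rx_status *s); /* msf[read] headers, dram[rbuf_rd] data  */
void     rbuf_element_done(unsigned element);        /* free the element for re-use            */

void rx_context_loop(unsigned freelist)
{
    for (;;) {
        freelist_post_self(freelist);               /* declare readiness before data arrives  */
        struct rx_status s = wait_for_autopush();

        if (status_is_null(&s))
            continue;                               /* keep the pipeline moving on an idle line */

        process_element(&s);
        rbuf_element_done(status_element(&s));
    }
}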
The Microengines may read part of (or the entire) RBUF element into their S_TRANSFER registers (via an msf[read] instruction) for header processing, etc., and may also move the element data to DRAM (via a dram[rbuf_rd] instruction). When a Context is done with an element, it does an msf[write] or msf[fast_write] to the RBUF_ELEMENT_DONE address; the write data is the element number. This marks the element as free and available to be re-used. There is no restriction on the order in which elements are freed; Contexts can do different amounts of processing per element based on the contents of the element, so elements can be returned in a different order than they were handed to Contexts.

2.7.4 Transmit

Figure 13 is a simplified block diagram of the MSF transmit section.

Figure 13. Simplified Transmit Section Block Diagram
(The figure shows the TBUF, filled from the Microengines and from DRAM, feeding the SPI-4 and CSIX protocol logic, byte-alignment, and valid-element logic that drive TDAT, TCTL, and TPAR under TCLK, plus the FCIFIFO and the flow control pins RXCFC, RXCSRB, and RXCDAT.)

2.7.4.1 TBUF

TBUF is a RAM that holds data and status to be transmitted. The data is written into sub-blocks, referred to as elements, by Microengines or the Intel XScale® core.

TBUF contains a total of 8 Kbytes of data. The element size is programmable as either 64 bytes, 128 bytes, or 256 bytes per element. In addition, TBUF can be programmed to be split into one, two, or three partitions depending on the application. For transmitting SPI-4, one partition would be used. For transmitting CSIX, two partitions are used (Control CFrames and Data CFrames). When both SPI-4 and CSIX are used, three partitions are used.

Microengines can write data from Microengine S_TRANSFER_OUT registers to the TBUF using the msf[write] instruction, where they specify the starting byte number (which must be aligned to 4 bytes) and the number of 32-bit words to write. The number in the instruction can be either the number of 32-bit words or the number of 32-bit word pairs, using the single and double instruction modifiers, respectively.

Microengines can move data from DRAM to TBUF using the dram instruction, where they specify the starting byte number (which must be aligned to 4 bytes), the number of 32-bit words to write, and the address in DRAM of the data.

All elements within a TBUF partition are transmitted in order. Control information associated with the element defines which bytes are valid. The data from the TBUF will be shifted and byte-aligned as required to be transmitted.

2.7.4.1.1 SPI-4 and TBUF

For SPI-4, data is put into the data portion of the element, and information for the SPI-4 Control Word that will precede the data is put into the Element Control Word. When the Element Control Word is written, it carries the following fields: Skip, SOP, EOP, Prepend Offset, Prepend Length, Payload Offset, Payload Length, and ADR, plus reserved bits. The definitions of the fields are shown in Table 15.

Table 15.
TBUF SPI-4 Control Definition Field Payload Length Definition Indicates the number of Payload bytes, from 1 to 256, in the element. The value of 0x00 means 256 bytes. The sum of Prepend Length and Payload Length will be sent. That value will also control the EOPS field (1 or 2 bytes valid indicated) of the Control Word that will succeed the data transfer. Note 1. Prepend Offset Indicates the first valid byte of Prepend, from 0 to 7 Prepend Length Indicates the number of bytes in Prepend, from 0 to 31. Payload Offset Indicates the first valid byte of Payload, from 0 to 7. Skip Allows software to allocate a TBUF element and then not transmit any data from it. 0—transmit data according to other fields of Control Word. 1—free the element without transmitting any data. SOP Indicates if the element is the start of a packet. This field will be sent in the SOPC field of the Control Word that will precede the data transfer. EOP Indicates if the element is the end of a packet. This field will be sent in the EOPS field of the Control Word that will succeed the data transfer. Note 1. ADR The port number to which the data is directed. This field will be sent in the ADR field of the Control Word that will precede the data transfer. NOTE: 1. Normally EOPS is sent on the next Control Word (along with ADR and SOP) to start the next element. If there is no valid element pending at the end of sending the data, the transmit logic will insert an Idle Control Word with the EOPS information. 66 Hardware Reference Manual Intel® IXP2800 Network Processor Technical Description 2.7.4.1.2 CSIX and TBUF For CSIX, payload information is put into the data area of the element, and Base and Extension Header information is put into the Element Control Word. When the Element Control Word is written, the information is: 2 6 2 5 2 4 2 3 6 2 6 1 6 0 5 9 5 8 2 1 Prepend Offset Payload Length 6 3 2 2 5 7 5 6 5 5 5 4 5 3 2 0 1 9 1 8 1 7 1 6 Prepend Length 5 2 5 1 5 0 4 9 4 8 1 2 1 1 1 0 9 Payload Offset P 2 7 CR 2 8 Res 2 9 Skip 3 0 Res 3 1 1 5 1 4 4 7 4 4 4 3 4 2 4 1 4 0 4 6 1 3 4 5 8 7 6 5 4 3 Res 3 9 3 8 3 7 2 1 0 Type 3 6 3 5 3 4 3 3 3 2 Extension Header The definitions of the fields are shown in Table 16. Table 16. TBUF CSIX Control Definition Field 2.7.4.2 Definition Payload Length Indicates the number of Payload bytes, from 1 to 256, in the element. The value of 0x00 means 256 bytes. The sum of Prepend Length and Payload Length will be sent, and also put into the CSIX Base Header Payload Length field. Note that this length does not include any padding that may be required. Padding is inserted by transmit hardware as needed. Prepend Offset Indicates the first valid byte of Prepend, from 0 to 7. Prepend Length Indicates the number of bytes in Prepend, from 0 to 31. Payload Offset Indicates the first valid byte of Payload, from 0 to 7. Skip Allows software to allocate a TBUF element and then not transmit any data from it. 0—transmit data according to other fields of Control Word. 1—free the element without transmitting any data. CR CR (CSIX Reserved) bit to put into the CSIX Base Header. P P (Private) bit to put into the CSIX Base Header. Type Type Field to put into the CSIX Base Header. Idle type is not legal here. Extension Header The Extension Header to be sent with the CFrame. The bytes are sent in big-endian order; byte 0 is in bits 63:56, byte 1 is in bits 55:48, byte 2 is in bits 47:40, and byte 3 is in bits 39:32. 
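When a CSIX element is prepared for transmit, software supplies the Table 16 fields in the Element Control Word. The structure below is only an illustration of that bookkeeping: the struct and its names are hypothetical, the value ranges come from Table 16, and the actual packing into the 64-bit control word follows the hardware layout given earlier in this section, not this C layout.

#include <stdint.h>

/* Illustrative only: a hypothetical host-side collection of the Table 16
 * fields.  The hardware packs these into the two 32-bit words of the
 * Element Control Word at the bit positions defined by the MSF. */
struct tbuf_csix_control {
    uint16_t payload_length;    /* 1..256 payload bytes; 0x00 encodes 256     */
    uint8_t  prepend_offset;    /* first valid prepend byte, 0..7             */
    uint8_t  prepend_length;    /* number of prepend bytes, 0..31             */
    uint8_t  payload_offset;    /* first valid payload byte, 0..7             */
    uint8_t  skip;              /* 1 = free the element without transmitting  */
    uint8_t  cr;                /* CR bit for the CSIX Base Header            */
    uint8_t  p;                 /* Private bit for the CSIX Base Header       */
    uint8_t  type;              /* CFrame Type for the Base Header (not Idle) */
    uint32_t extension_header;  /* sent big-endian, byte 0 first              */
};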
Transmit Operation Summary During transmit processing data to be transmitted is placed into the TBUF under Microengine control. The Microengine allocates an element in software; the transmit hardware processes TBUF elements within a partition in strict sequential order so the software can track which element to allocate next. Microengines may write directly into an element by an msf[write] instruction, or have data from DRAM written into the element by a dram[tbuf_wr] instruction. Data can be merged into the element by doing both. Hardware Reference Manual 67 Intel® IXP2800 Network Processor Technical Description There is a Transmit Valid bit per element, that marks the element as ready to be transmitted. Microengines move all data into the element, by either or both of msf[write] and dram[tbuf_wr] instructions to the TBUF. Microengines also write the element Transmit Control Word with information about the element. When all of the data movement is complete, the Microengine sets the element valid bit. 1. Move data into TBUF by either or both of msf[write] and dram[tbuf_wr] instructions to the TBUF. 2. Wait for 1 to complete. 3. Write Transmit Control Word at TBUF_ELEMENT_CONTROL_# address. Using this address sets the Transmit Valid bit. 2.7.5 The Flow Control Interface The MSF provides flow control support for SPI-4 and CSIX. 2.7.5.1 SPI-4 SPI-4 uses a FIFO Status Channel to provide flow control information. MSF receives the information from the PHY device and stores it so that Microengines can read the information on a per-port basis. It can then use that information to determine when to transmit data to a given port. The MSF also sends status to the PHY based on the amount of available space in the RBUF — i.e., done by hardware without Microengines. 2.7.5.2 CSIX CSIX provides two types of flow control — link level and per queue. • The link level control is handled by hardware. MSF will stop transmission is response to link level flow control received from the Switch Fabric. MSF will assert link level flow control based on the amount of available space in the RBUF. • Per queue flow control information is put into the FCIFIFO and handled by Microengine software. Also, if required, Microengines can send Flow Control CFrames to the Switch Fabric under software control. In both cases, for a full-duplex configuration, information is passed from the Switch Fabric to the Egress IXP2800 Network Processor, which then passes it to the Ingress IXP2800 Network Processor over a proprietary flow control interface. 68 Hardware Reference Manual Intel® IXP2800 Network Processor Technical Description 2.8 Hash Unit The IXP2800 Network Processor contains a Hash Unit that can take 48-, 64-, or 128-bit data and produce a 48-, 64-, or a 128-bit hash index, respectively. The Hash Unit is accessible by the Microengines and the Intel XScale® core, and is useful in doing table searches with large keys, for example L2 addresses. Figure 14 is a block diagram of the Hash Unit. Up to three hash indexes can be created using a single Microengine instruction. This helps to minimize command overhead. The Intel XScale® core can only do a single hash at a time. A Microengine initiates a hash operation by writing the hash operands into a contiguous set of S_TRANSFER_OUT registers and then executing the hash instruction. The Intel XScale® core initiates a hash operation by writing a set of memory-mapped HASH_OP registers, which are built in the Intel XScale® core gasket, with the data to be used to generate the hash index. 
There are separate registers for 48-, 64-, and 128-bit hashes. The data is written from MSB to LSB, with the write to LSB triggering the Hash Operation. In both cases, the Hash Unit reads the operand into an input buffer, performs the hash operation, and returns the result. The Hash Unit uses a hard-wired polynomial algorithm and a programmable hash multiplier to create hash indexes. Three separate multipliers are supported, one for 48-bit hash operations, one for 64-bit hash operations and one for 128-bit hash operations. The multiplier is programmed through Control registers in the Hash Unit. The multiplicand is shifted into the hash array, 16 bits at a time. The hash array performs a 1’s-complement multiply and polynomial divide, using the multiplier and 16 bits of the multiplicand. The result is placed into an output buffer register and also feeds back into the array. This process is repeated three times for a 48-bit hash (16 bits x 3 = 48), four times for a 64-bit hash (16 bits x 4 = 64), and eight times for a 128-bit hash (16 x 8 = 128). After the multiplicand has been passed through the hash array, the resulting hash index is placed into a two-stage output buffer. After each hash index is completed, the Hash Unit returns the hash index to the Microengines’ S_TRANSFER_IN registers, or the Intel XScale® core HASH_OP registers. For Microengine initiated hash operations, the Microengine is signaled after all the hashes specified in the instruction have been completed. For the Intel XScale® core initiated hash operations, the Intel XScale® core reads the results from the memory-mapped HASH_OP registers. The addresses of Hash Results are the same as the HASH_OP registers. Because of queuing delays at the Hash Unit, the time to complete an operation is not fixed. The Intel XScale® core can do one of two operations to get the hash results. • Poll the HASH_DONE register. This register is cleared when the HASH_OP registers are written. Bit [0] of HASH_DONE register is set when the HASH_OP registers get the return result from the Hash Unit (when the last word of the result is returned). The Intel XScale® core software can poll on HASH_DONE, and read HASH_OP when HASH_DONE is equal to 0x00000001. • Read HASH_OP directly. The interface hardware will acknowledge the read only when the result is valid. This method will result in the Intel XScale® core stalling if the result is not valid when the read happens. The number of clock cycles required to perform a single hash operation equals: two or four cycles through the input buffers, three, four or eight cycles through the hash array, and two or four cycles through the output buffers. Because of the pipeline characteristics of the Hash Unit, performance is improved if multiple hash operations are initiated with a single instruction rather than separate hash instructions for each hash operation. Hardware Reference Manual 69 Intel® IXP2800 Network Processor Technical Description Figure 14. 
Hash Unit Block Diagram Data Used to Create Hash Index from S_Transfer_Out Multiplicand 3 2-Stage Input Buffer Multiplicand 2 128 Multiplicand 1 16 shift Hash_Multiplier_48 Hash Array Hash_Multiplier_64 128 Hashed Multiplicand 3 Hash_Multiplier_128 48-bit, 64-bit or 128-bit Hash Select 128 Hashed Multiplicand 2 2-Stage Output Buffer Hashed Multiplicand 1 Hash Indexes to S_Transfer_In Registers A9367-02 70 Hardware Reference Manual Intel® IXP2800 Network Processor Technical Description 2.9 PCI Controller The PCI Controller provides a 64-bit, 66 MHz capable PCI Local Bus Revision 2.2 interface, and is compatible to 32-bit or 33 MHz PCI devices. The PCI controller provides the following functions: • • • • • Target Access (external Bus Master access to SRAM, DRAM, and CSRs) Master Access (the Intel XScale® core access to PCI Target devices) Two DMA Channels Mailbox and Doorbell registers for the Intel XScale® core to Host communication PCI arbiter The IXP2800 Network Processor can be configured to act as PCI central function (for use in a stand-alone system), where it provides the PCI reset signal, or as an add-in device, where it uses the PCI reset signal as the chip reset input. The choice is made by connecting the cfg_rst_dir input pin low or high. 2.9.1 Target Access There are three Base Address Registers (BARs) to allow PCI Bus Masters to access SRAM, DRAM, and CSRs, respectively. Examples of PCI Bus Masters include a Host Processor (for example a Pentium® processor), or an I/O device such as an Ethernet controller, SCSI controller, or encryption coprocessor. The SRAM BAR can be programmed to sizes of 16, 32, 64, 128, or 256 Mbytes, or no access. The DRAM BAR can be programmed to sizes of 128, 256, or 512 Mbytes or 1 Gbyte, or no access. The CSR BAR is 8 KB. PCI Boot Mode is supported, in which the Host downloads the Intel XScale® core boot image into DRAM, while holding the Intel XScale® core in reset. Once the boot image has been loaded, the Intel XScale® core reset is deasserted. The alternative is to provide the boot image in a Flash ROM attached to the Slowport. 2.9.2 Master Access The Intel XScale® core and Microengines can directly access the PCI bus. The Intel XScale® core can do loads and stores to specific address regions to generate all PCI command types. Microengines use PCI instruction, and also use address regions to generate different PCI commands. 2.9.3 DMA Channels There are two DMA Channels, each of which can move blocks of data from DRAM to the PCI or from the PCI to DRAM. The DMA channels read parameters from a list of descriptors in SRAM, perform the data movement to or from DRAM, and stop when the list is exhausted. The descriptors are loaded from predefined SRAM entries or may be set directly by CSR writes to DMA Channel registers. There is no restriction on byte alignment of the source address or the destination address. Hardware Reference Manual 71 Intel® IXP2800 Network Processor Technical Description For PCI to DRAM transfers, the PCI command is Memory Read, Memory Read line, or Memory Read Multiple. For DRAM to PCI transfers, the PCI command is Memory Write. Memory Write Invalidate is not supported. Up to two DMA channels are running at a time with three descriptors outstanding. Effectively, the active channels interleave bursts to or from the PCI Bus. Interrupts are generated at the end of DMA operation for the Intel XScale® core. However, Microengines do not provide an interrupt mechanism. 
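The descriptor format that the DMA channels fetch from SRAM is a four-word block on a 16-byte boundary (see Table 17 and the subsections below). A minimal C sketch of that layout, using hypothetical names, is shown here; the end-of-chain flag in the byte count word and the zero next-pointer convention for an open chain are described in Sections 2.9.3.1 and 2.9.3.4.

#include <stdint.h>

#define DMA_END_OF_CHAIN  (UINT32_C(1) << 31)   /* bit 31 of the byte count word */

/* Hypothetical C view of one descriptor as laid out in SRAM (Table 17);
 * the hardware requires the four words to sit on a 16-byte boundary. */
struct ixp_dma_desc {
    uint32_t byte_count;    /* transfer length; bit 31 marks end of chain          */
    uint32_t pci_addr;      /* PCI address of the transfer                         */
    uint32_t dram_addr;     /* DRAM address of the transfer                        */
    uint32_t next_desc;     /* SRAM address of the next descriptor; 0 = open chain */
};

/* Link 'first' to 'second' (located at second_sram_addr in SRAM) and
 * terminate the chain at 'second'. */
static void dma_link_and_terminate(struct ixp_dma_desc *first,
                                   uint32_t second_sram_addr,
                                   struct ixp_dma_desc *second)
{
    first->next_desc    = second_sram_addr;
    second->byte_count |= DMA_END_OF_CHAIN;
    second->next_desc   = 0;
}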
The DMA Channel will instead use an Event Signal to notify the particular Microengine on completion of DMA. 2.9.3.1 DMA Descriptor Each descriptor uses four 32-bit words in SRAM, aligned on a 16-byte boundary. The DMA channels read the descriptors from SRAM into working registers once the control register has been set to initiate the transaction. This control must be set explicitly; this starts the DMA transfer. Register names for DMA channels are listed in Figure 15 and Table 17 lists the descriptor contents. Figure 15. DMA Descriptor Reads Working Register Local SRAM Last Descriptor 4 Next Descriptor 3 DMA Channel Register Channel Register Name Byte Count Register CHAN_X_BYTE_COUNT PCI Address Register CHAN_X_PCI_ADDR DRAM Address REgister CHAN_X_DRAM_ADDR Descriptor Pointer Register CHAN_X_DESC_PTR (X can be 1, 2, or 3) Control Register 1 2 Prior Descriptor DMA Channel Register Channel Register Name Control Register CHAN_X_CONTROL (X can be 1, 2, or 3) Current Descriptor A9368-01 After a descriptor is processed, the next descriptor is loaded in the working registers. This process repeats until the chain of descriptors is terminated (i.e., the End of Chain bit is set). Table 17. DMA Descriptor Format Offset from Descriptor Pointer 72 Description 0x0 Byte Count 0x4 PCI Address 0x8 DRAM Address 0xC Next Descriptor Address Hardware Reference Manual Intel® IXP2800 Network Processor Technical Description 2.9.3.2 DMA Channel Operation The DMA channel can be set up to read the first descriptor in SRAM, or with the first descriptor written directly to the DMA channel registers. When descriptors and the descriptor list are in SRAM, the procedure is as follows: 1. The DMA channel owner writes the address of the first descriptor into the DMA Channel Descriptor Pointer register (DESC_PTR). 2. The DMA channel owner writes the DMA Channel Control register (CONTROL) with miscellaneous control information and also sets the channel enable bit (bit 0). The channel initial descriptor bit (bit 4) in the CONTROL register must also be cleared to indicate that the first descriptor is in SRAM. 3. Depending on the DMA channel number, the DMA channel reads the descriptor block into the corresponding DMA registers, BYTE_COUNT, PCI_ADDR, DRAM_ADDR, and DESC_PTR. 4. The DMA channel transfers the data until the byte count is exhausted, and then sets the channel transfer done bit in the CONTROL register. 5. If the end of chain bit (bit 31) in the BYTE_COUNT register is clear, the channel checks the Chain Pointer value. If the Chain Pointer value is not equal to 0. it reads the next descriptor and transfers the data (step 3 and 4 above). If the Chain Pointer value is equal to 0, it waits for the Descriptor Added bit of the Channel Control register to be set before reading the next descriptor and transfers the data (step 3 and 4 above). If bit 31 is set, the channel sets the channel chain done bit in the CONTROL register and then stops. 6. Proceed to the Channel End Operation. When single descriptors are written into the DMA channel registers, the procedure is as follows: 1. The DMA channel owner writes the descriptor values directly into the DMA channel registers. The end of chain bit (bit 31) in the BYTE_COUNT register must be set, and the value in the DESC_PTR register is not used. 2. The DMA channel owner writes the base address of the DMA transfer into the PCI_ADDR to specify the PCI starting address. 3. 
When the first descriptor is in the BYTE_COUNT register, the DRAM_ADDR register must be written with the address of the data to be moved. 4. The DMA channel owner writes the CONTROL register with miscellaneous control information, along with setting the channel enable bit (bit 0). The channel initial descriptor in register bit (bit 4) in the CONTROL register must also be set to indicate that the first descriptor is already in the channel descriptor registers. 5. The DMA channel transfers the data until the byte count is exhausted, and then sets the channel transfer done bit (bit 2) in the CONTROL register. 6. Since the end of the chain bit (bit 31) in the BYTE_CONT register is set, the channel sets the channel chain done bit (bit 7) in the CONTROL register and then stops. 7. Proceed to the Channel End Operation. Hardware Reference Manual 73 Intel® IXP2800 Network Processor Technical Description 2.9.3.3 DMA Channel End Operation 1. Channel owned by PCI: If not masked via the PCI Outbound Interrupt Mask register, the DMA channel interrupts the PCI host after the setting of the DMA done bit in the CHAN_X_CONTROL register, which is readable in the PCI Outbound Interrupt Status register. 2. Channel owned by the Intel XScale® core: If enabled via the Intel XScale® core Interrupt Enable registers, the DMA channel interrupts the Intel XScale® core by setting the DMA channel done bit in the CHAN_X_CONTROL register, which is readable in the Intel XScale® core Interrupt Status register. 3. Channel owned by Microengine: If enabled via the Microengine Auto-Push Enable registers, the DMA channel signals the Microengine after setting the DMA channel done bit in the CHAN_X_CONTROL register, which is readable in the Microengine Auto-Push Status register. 2.9.3.4 Adding Descriptors to an Unterminated Chain It is possible to add a descriptor to a chain while a channel is running. To do so, the chain should be left unterminated, i.e., the last descriptor should have End of Chain clear, and the Chain Pointer value equal to 0. A new descriptor (or linked list of descriptors) can be added to the chain by overwriting the Chain Pointer value of the unterminated descriptor (in SRAM) with the Local Memory address of the (first) added descriptor (the added descriptor must actually be valid in Local Memory prior to that). After updating the Chain Pointer field, the software must write a 1 to the Descriptor Added bit of the Channel Control register. This is necessary for the case where the channel was paused to reactivate the channel. However, software need not check the state of the channel before writing that bit; there is no side-effect of writing that bit in the case where the channel had not yet read the unlinked descriptor. If the channel was paused or had read an unlinked Pointer, it will re-read the last descriptor processed (i.e., the one that originally had the 0 value for Chain Pointer) to get the address of the newly added descriptor. A descriptor cannot be added to a descriptor that has End of Chain set. 2.9.4 Mailbox and Message Registers Mailbox and Doorbell registers provide hardware support for communication between the Intel XScale® core and a device on the PCI Bus. Four 32-bit mailbox registers are provided so that messages can be passed between the Intel XScale® core and a PCI device. All four registers can be read and written with byte resolution from both the Intel XScale® core and PCI. 
How the registers are used is application dependent, and the messages are not used internally by the PCI Unit in any way. The mailbox registers are often used with the Doorbell interrupts.

Doorbell interrupts provide an efficient method of generating an interrupt as well as encoding the purpose of the interrupt. The PCI Unit supports a 32-bit Intel XScale® core DOORBELL register that is used by a PCI device to generate an Intel XScale® core interrupt, and a separate 32-bit PCI DOORBELL register that is used by the Intel XScale® core to generate a PCI interrupt. A source generating the Doorbell interrupt can write a software-defined bitmap to the register to indicate a specific purpose. This bitmap is translated into a single interrupt signal to the destination (either a PCI interrupt or an Intel XScale® core interrupt). When an interrupt is received, the DOORBELL registers can be read and the bit mask can be interpreted. If a larger bit mask is required than is provided by the DOORBELL register, the MAILBOX registers can be used to pass up to 16 bytes of data. The doorbell interrupts are controlled through the registers shown in Table 18.

Table 18. Doorbell Interrupt Registers
Register Name: Description
XSCALE DOORBELL: Used to generate the Intel XScale® core Doorbell interrupts.
XSCALE DOORBELL SETUP: Used to initialize the Intel XScale® core Doorbell register and for diagnostics.
PCI DOORBELL: Used to generate the PCI Doorbell interrupts.
PCI DOORBELL SETUP: Used to initialize the PCI Doorbell register and for diagnostics.

2.9.5 PCI Arbiter

The PCI unit contains a PCI bus arbiter that supports two external masters in addition to the PCI Unit's initiator interface. If more than two external masters are used in the system, the arbiter can be disabled and an arbiter external to the IXP2800 Network Processor used. In that case, the IXP2800 Network Processor will provide its PCI request signal to the external arbiter, and use that arbiter's grant signal.

The arbiter uses a simple round-robin priority algorithm; it asserts the grant signal corresponding to the next request in the round-robin during the currently executing transaction on the PCI bus (this is also called hidden arbitration). If the arbiter detects that an initiator has failed to assert frame_l after 16 cycles of both grant assertion and PCI bus idle condition, the arbiter deasserts the grant. That master does not receive any more grants until it deasserts its request for at least one PCI clock cycle. Bus parking is implemented in that the last bus grant will stay asserted if no request is pending. To prevent bus contention, if the PCI bus is idle, the arbiter never asserts one grant signal in the same PCI cycle in which it deasserts another. It deasserts one grant, and then asserts the next grant after one full PCI clock cycle has elapsed to provide for bus driver turnaround.

2.10 Control and Status Register Access Proxy

The Control and Status Register Access Proxy (CAP) contains a number of chip-wide control and status registers. Some provide miscellaneous control and status, while others are used for inter-Microengine or Microengine to Intel XScale® core communication (note that rings in Scratchpad Memory and SRAM can also be used for inter-process communication).
These include: • INTERTHREAD SIGNAL — Each thread (or context) on a Microengine can send a signal to any other thread by writing to InterThread_Signal register. This allows a thread to go to sleep waiting completion of a task by a different thread. • THREAD MESSAGE — Each thread has a message register where it can post a software- specific message. Other Microengine threads, or the Intel XScale® core, can poll for availability of messages by reading theTHREAD_MESSAGE_SUMMARY register. Both the THREAD_MESSAGE and corresponding THREAD_MESSAGE_SUMMARY clear upon a read of the message; this eliminates a race condition when there are multiple message readers. Only one reader will get the message. • SELF DESTRUCT — This register provides another type of communication. Microengine software can atomically set individual bits in the SELF_DESTRUCT registers; the registers clear upon read. The meaning of each bit is software-specific. Clearing the register upon read eliminates a race condition when there are multiple readers. • THREAD INTERRUPT — Each thread can interrupt the Intel XScale® core on two different interrupts; the usage is software-specific. Having two interrupts allows for flexibility, for example, one can be assigned to normal service requests and one can be assigned to error conditions. If more information needs to be associated with the interrupt, mailboxes or Rings in Scratchpad Memory or SRAM could be used. • REFLECTOR — CAP provides a function (called “reflector”) where any Microengine thread can move data between its registers and those of any other thread. In response to a single write or read instruction (with the address in the specific reflector range) CAP will get data from the source Microengine and put it into the destination Microengine. Both the sending and receiving threads can optionally be signaled upon completion of the data movement. 2.11 Intel XScale® Core Peripherals 2.11.1 Interrupt Controller The Interrupt Controller provides the ability to enable or mask interrupts from a number of chip wide sources, for example: • • • • Timers (normally used by Real-Time Operating System). Interrupts generated by Microengine software to request services from the Intel XScale® core. External agents such as PCI devices. Error conditions, such as DRAM ECC error, or SPI-4 parity error. Interrupt status is read as memory mapped registers; the state of an interrupt signal can be read even if it is masked from interrupting. Enabling and masking of interrupts is done as writes to memory mapped registers. 76 Hardware Reference Manual Intel® IXP2800 Network Processor Technical Description 2.11.2 Timers The IXP2800 Network Processor contains four programmable 32-bit timers, which can be used for software support. Each timer can be clocked by the internal clock, by a divided version of the clock, or by a signal on an external GPIO pin. Each timer can be programmed to generate a periodic interrupt after a programmed number of clocks. The range is from several ns to several minutes depending on the clock frequency. In addition, timer 4 can be used as a watchdog timer. In this use, software must periodically reload the timer value; if it fails to do so and the timer counts to 0, it will reset the chip. This can be used to detect if software “hangs” or for some other reason fails to reload the timer. 2.11.3 General Purpose I/O The IXP2800 Network Processor contains eight General Purpose I/O (GPIO) pins. 
These can be programmed as either input or output and can be used for slow speed I/O such as LEDs or input switches. They can also be used as interrupts to the Intel XScale® core, or to clock the programmable timers. 2.11.4 Universal Asynchronous Receiver/Transmitter The IXP2800 Network Processor contains a standard RS-232 compatible Universal Asynchronous Receiver/Transmitter (UART), which can be used for communication with a debugger or maintenance console. Modem controls are not supported; if they are needed, GPIO pins can be used for that purpose. The UART performs serial-to-parallel conversion on data characters received from a peripheral device and parallel-to-serial conversion on data characters received from the processor. The processor can read the complete status of the UART at any time during operation. Available status information includes the type and condition of the transfer operations being performed by the UART and any error conditions (parity, overrun, framing or break interrupt). The serial ports can operate in either FIFO or non-FIFO mode. In FIFO mode, a 64-byte transmit FIFO holds data from the processor to be transmitted on the serial link and a 64-byte receive FIFO buffers data from the serial link until read by the processor. The UART includes a programmable baud rate generator that is capable of dividing the internal clock input by divisors of 1 to 216 - 1 and produces a 16X clock to drive the internal transmitter logic. It also drives the receive logic. The UART can be operated in polled or in interrupt driven mode as selected by software. 2.11.5 Slowport The Slowport is an external interface to the IXP2800 Network Processor, used for Flash ROM access and 8, 16, or 32-bit asynchronous device access. It allows the Intel XScale® core do read/ write data transfers to these slave devices. The address bus and data bus are multiplexed to reduce the pin count. In addition, 24 bits of address are shifted out on three clock cycles. Therefore, an external set of buffers is needed to latch the address. Two chip selects are provided. Hardware Reference Manual 77 Intel® IXP2800 Network Processor Technical Description The access is asynchronous. Insertion of delay cycles for both data setup and hold time is programmable via internal Control registers. The transfer can also wait for a handshake acknowledge signal from the external device. 2.12 I/O Latency Table 19 shows the latencies for transferring data between the Microengine and the other subsystem components. The latency is measured in 1.4 GHz cycles. Table 19. I/O Latency Sub-system DRAM (RDR) Transfer Size Average Read Latency Average Write Latency SRAM (QDR) Scratch MSF 4 bytes 4 bytes 8 bytes 100 (light load) – 160 (heavy load) ~ 100 cycles (range 53 – 152) range 53 – 120 (note 3) ~ 53 cycles ~ 53 cycles ~ 40 cycles 8 bytes – 16 bytes (note 2) ~ 295 cycles (RBUF) ~ 48 cycles (TBUF) Note1: RDR, QDR, MSF, and Scratch values are extracted from a simulation model. Note 2: Minimum DRAM burst size on pins is 16 bytes. Transfers less than 16 bytes incur the same as a 16-byte transfer. Note 3: At 1016 MHz, read latency should be ~ 240 cycles. 2.13 Performance Monitor The Intel XScale® core hardware provides two 32-bit performance counters that allow two unique events to be monitored simultaneously. 
In addition, the Intel XScale® core implements a 32-bit clock counter that can be used in conjunction with the performance counters; its sole purpose is to count the number of core clock cycles, which is useful in measuring total execution time.

3 Intel XScale® Core

This section contains information describing the Intel XScale® core, the Intel XScale® core gasket, and the Intel XScale® core Peripherals (XPI). For additional information about the Intel XScale® architecture, refer to the Intel XScale® Core Developers Manual available on Intel’s Developers web site (http://www.developer.intel.com).

3.1 Introduction

The Intel XScale® core is an ARM* V5TE compliant microprocessor. It has been designed for high performance and low power, leading the industry in mW/MIPS. The Intel XScale® core incorporates an extensive list of architecture features that allow it to achieve high performance. Many of the architectural features added to the Intel XScale® core help hide memory latency, which often is a serious impediment to high performance processors. These include:
• The ability to continue instruction execution even while the data cache is retrieving data from external memory.
• A write buffer.
• Write-back caching.
• Various data cache allocation policies that can be configured differently for each application.
• Cache locking.
All these features improve the efficiency of the memory bus external to the core.
ARM* Version 5 (V5) Architecture added floating point instructions to ARM* Version 4. The Intel XScale® core implements the integer instruction set architecture of ARM* V5, but does not provide hardware support for the floating point instructions. The Intel XScale® core provides the Thumb instruction set (ARM* V5T) and the ARM* V5E DSP extensions.

3.2 Features

Figure 16 shows the major functional blocks of the Intel XScale® core.

Figure 16. Intel XScale® Core Architecture Features
(Block diagram showing: Data Cache — max 32 Kbytes, 32 ways, write-back or write-through, hit under miss, re-map of up to 28 Kbytes as data RAM; Instruction Cache — 32 Kbytes, 32 ways, lockable by line; Mini-Data Cache — 2 Kbytes, 2 ways; Branch Target Buffer — 128 entries; IMMU and DMMU — 32-entry fully-associative TLBs, lockable by entry; Fill Buffer — 4–8 entries; Write Buffer — 8 entries, full coalescing; MAC — single-cycle throughput (16×32), 16-bit SIMD, 40-bit accumulator; Performance Monitoring; Debug — hardware breakpoint, branch history table; Power Management — idle, drowsy, sleep; JTAG.)

3.2.1 Multiply/Accumulate (MAC)

The MAC unit supports early termination of multiplies/accumulates in two cycles and can sustain a throughput of a MAC operation every cycle. Architectural enhancements to the MAC support audio coding algorithms, including a 40-bit accumulator and support for 16-bit packed data.

3.2.2 Memory Management

The Intel XScale® core implements the Memory Management Unit (MMU) Architecture specified in the ARM* Architecture Reference Manual (see the ARM* website at http://www.arm.com). The MMU provides access protection and virtual to physical address translation. The MMU Architecture also specifies the caching policies for the instruction cache and data memory.
These policies are specified as page attributes and include:
• identifying code as cacheable or non-cacheable
• selecting between the mini-data cache or data cache
• write-back or write-through data caching
• enabling data write allocation policy, and
• enabling the write buffer to coalesce stores to external memory

3.2.3 Instruction Cache

The Intel XScale® core implements a 32-Kbyte, 32-way set associative instruction cache with a line size of 32 bytes. All requests that “miss” the instruction cache generate a 32-byte read request to external memory. A mechanism to lock critical code within the cache is also provided.

3.2.4 Branch Target Buffer (BTB)

The Intel XScale® core provides a Branch Target Buffer to predict the outcome of branch type instructions. It provides storage for the target address of branch type instructions and predicts the next address to present to the instruction cache when the current instruction address is that of a branch. The BTB holds 128 entries.

3.2.5 Data Cache

The Intel XScale® core implements a 32-Kbyte, 32-way set associative data cache and a 2-Kbyte, 2-way set associative mini-data cache. Each cache has a line size of 32 bytes and supports write-through or write-back caching. The data/mini-data cache is controlled by page attributes defined in the MMU Architecture and by coprocessor 15. The Intel XScale® core allows applications to reconfigure a portion of the data cache as data RAM. Software may place special tables or frequently used variables in this RAM.

3.2.6 Performance Monitoring

Two performance monitoring counters have been added to the Intel XScale® core that can be configured to monitor various events. These events allow a software developer to measure cache efficiency, detect system bottlenecks, and reduce the overall latency of programs.

3.2.7 Power Management

The Intel XScale® core incorporates a power and clock management unit that can assist in controlling clocking and managing power.

3.2.8 Debugging

The Intel XScale® core supports software debugging through two instruction address breakpoint registers, one data-address breakpoint register, one data-address/mask breakpoint register, and a trace buffer.

3.2.9 JTAG

Testability is supported on the Intel XScale® core through the Test Access Port (TAP) Controller implementation, which is based on the IEEE 1149.1 (JTAG) Standard Test Access Port and Boundary-Scan Architecture. The purpose of the TAP controller is to support test logic internal and external to the Intel XScale® core such as built-in self-test, boundary-scan, and scan.

3.3 Memory Management

The Intel XScale® core implements the Memory Management Unit (MMU) Architecture specified in the ARM Architecture Reference Manual. To accelerate virtual to physical address translation, the Intel XScale® core uses both an instruction Translation Look-aside Buffer (TLB) and a data TLB to cache the latest translations. Each TLB holds 32 entries and is fully associative. Not only do the TLBs contain the translated addresses, but also the access rights for memory references.
If an instruction or data TLB miss occurs, a hardware translation-table-walking mechanism is invoked to translate the virtual address to a physical address. Once translated, the physical address is placed in the TLB along with the access rights and attributes of the page or section.
These translations can also be locked down in either TLB to guarantee the performance of critical routines.
The Intel XScale® core allows system software to associate various attributes with regions of memory:
• cacheable
• bufferable
• line allocate policy
• write policy
• I/O
• mini Data Cache
• Coalescing
• P bit
Note: The virtual address with which the TLBs are accessed may be remapped by the PID register.

3.3.1 Architecture Model

3.3.1.1 Version 4 versus Version 5

ARM* MMU Version 5 Architecture introduces the support of tiny pages, which are 1 Kbyte in size. The reserved field in the first-level descriptor (encoding 0b11) is used as the fine page table base address.

3.3.1.2 Memory Attributes

The attributes associated with a particular region of memory are configured in the memory management page table and control the behavior of accesses to the instruction cache, data cache, mini-data cache and the write buffer. These attributes are ignored when the MMU is disabled.
To allow compatibility with older system software, the new Intel XScale® core attributes take advantage of encoding space in the descriptors that was formerly reserved.

3.3.1.2.1 Page (P) Attribute Bit

The P bit assigns a page attribute to a memory region. Refer to the Intel® IXP2400 and IXP2800 Network Processor Programmer’s Reference Manual for details about the P bit.

3.3.1.2.2 Instruction Cache

When examining these bits in a descriptor, the Instruction Cache only utilizes the C bit. If the C bit is clear, the Instruction Cache considers a code fetch from that memory to be non-cacheable, and will not fill a cache entry. If the C bit is set, then fetches from the associated memory region will be cached.

3.3.1.2.3 Data Cache and Write Buffer

All of these descriptor bits affect the behavior of the Data Cache and the Write Buffer. If the X bit for a descriptor is 0 (see Table 20), the C and B bits operate as mandated by the ARM* architecture. If the X bit for a descriptor is one, the C and B bits’ meaning is extended, as detailed in Table 21.

Table 20. Data Cache and Buffer Behavior when X = 0
C B | Cacheable? | Bufferable? | Write Policy | Line Allocation Policy | Notes
0 0 | N | N | — | — | Stall until complete (note 1)
0 1 | N | Y | — | — |
1 0 | Y | Y | Write Through | Read Allocate |
1 1 | Y | Y | Write Back | Read Allocate |
Note 1: Normally, the processor will continue executing after a data access if no dependency on that access is encountered. With this setting, the processor will stall execution until the data access completes. This guarantees to software that the data access has taken effect by the time execution of the data access instruction completes. External data aborts from such accesses will be imprecise.

Table 21. Data Cache and Buffer Behavior when X = 1
C B | Cacheable? | Bufferable? | Write Policy | Line Allocation Policy | Notes
0 0 | — | — | — | — | Unpredictable; do not use
0 1 | N | Y | — | — | Writes will not coalesce into buffers (note 1)
1 0 | (Mini Data Cache) | — | — | — | Cache policy is determined by MD field of Auxiliary Control register
1 1 | Y | Y | Write Back | Read/Write Allocate |
Note 1: Normally, bufferable writes can coalesce with previously buffered data in the same address range.

3.3.1.2.4 Details on Data Cache and Write Buffer Behavior

If the MMU is disabled, all data accesses will be non-cacheable and non-bufferable. This is the same behavior as when the MMU is enabled and a data access uses a descriptor with X, C, and B all set to 0.
The X, C, and B bits determine when the processor should place new data into the Data Cache.
The cache places data into the cache in lines (also called blocks). Thus, the basis for making a decision about placing new data into the cache is called a “Line Allocation Policy.”
If the Line Allocation Policy is read-allocate, all load operations that miss the cache request a 32-byte cache line from external memory and allocate it into either the data cache or mini-data cache (this is assuming the cache is enabled). Store operations that miss the cache will not cause a line to be allocated. If read/write-allocate is in effect, load or store operations that miss the cache will request a 32-byte cache line from external memory if the cache is enabled.
The other policy determined by the X, C, and B bits is the Write Policy. A write-through policy instructs the Data Cache to keep external memory coherent by performing stores to both external memory and the cache. A write-back policy only updates external memory when a line in the cache is cleaned or needs to be replaced with a new line. Generally, write-back provides higher performance because it generates less data traffic to external memory.

3.3.1.2.5 Memory Operation Ordering

A fence memory operation (memop) is one that guarantees all memops issued prior to the fence will execute before any memop issued after the fence. Thus software may issue a fence to impose a partial ordering on memory accesses. Table 22 shows the circumstances in which memops act as fences. Any swap (SWP or SWPB) to a page that would create a fence on a load or store is a fence.

Table 22. Memory Operations that Impose a Fence
operation | X | C | B
load | — | 0 | —
store | 1 | 0 | 1
load or store | 0 | 0 | 0

3.3.2 Exceptions

The MMU may generate prefetch aborts for instruction accesses and data aborts for data memory accesses. Data address alignment checking is enabled by setting bit 1 of the Control register (CP15, register 1). Alignment faults are still reported even if the MMU is disabled. All other MMU exceptions are disabled when the MMU is disabled.

3.3.3 Interaction of the MMU, Instruction Cache, and Data Cache

The MMU, instruction cache, and data/mini-data cache may be enabled/disabled independently. The instruction cache can be enabled with the MMU enabled or disabled. However, the data cache can only be enabled when the MMU is enabled. Therefore only three of the four combinations of the MMU and data/mini-data cache enables are valid (see Table 23). The invalid combination will cause undefined results.

Table 23. Valid MMU and Data/Mini-Data Cache Combinations
MMU | Data/Mini-data Cache
Off | Off
On | Off
On | On

3.3.4 Control

3.3.4.1 Invalidate (Flush) Operation

The entire instruction and data TLB can be invalidated at the same time with one command, or they can be invalidated separately. An individual entry in the data or instruction TLB can also be invalidated.
Globally invalidating a TLB will not affect locked TLB entries. However, the invalidate-entry operations can invalidate individual locked entries. In this case, the locked entry remains in the TLB, but will never “hit” on an address translation; effectively, this leaves a hole in the TLB. This situation may be rectified by unlocking the TLB.

3.3.4.2 Enabling/Disabling

The MMU is enabled by setting bit 0 in coprocessor 15, register 1 (Control register).
When the MMU is disabled, accesses to the instruction cache default to cacheable and all accesses to data memory are made non-cacheable. A recommended code sequence for enabling the MMU is shown in Example 14. Example 14. Enabling the MMU ; ; ; ; ; ; ; ; This routine provides software with a predictable way of enabling the MMU. After the CPWAIT, the MMU is guaranteed to be enabled. Be aware that the MMU will be enabled sometime after MCR and before the instruction that executes after the CPWAIT. Programming Note: This code sequence requires a one-to-one virtual to physical address mapping on this code since the MMU may be enabled part way through. This would allow the instructions after MCR to execute properly regardless the state of the MMU. MRC P15,0,R0,C1,C0,0; Read CP15, register 1 ORR R0, R0, #0x1; Turn on the MMU MCR P15,0,R0,C1,C0,0; Write to CP15, register 1 ; The MMU is guaranteed to be enabled at this point; the next instruction or ; data address will be translated. Hardware Reference Manual 85 Intel® IXP2800 Network Processor Intel XScale® Core 3.3.4.3 Locking Entries Individual entries can be locked into the instruction and data TLBs. If a lock operation finds the virtual address translation already resident in the TLB, the results are unpredictable. An invalidate by entry command before the lock command will ensure proper operation. Software can also accomplish this by invalidating all entries, as shown in Example 15. Locking entries into either the instruction TLB or data TLB reduces the available number of entries (by the number that was locked down) for hardware to cache other virtual to physical address translations. A procedure for locking entries into the instruction TLB is shown in Example 15. If a MMU abort is generated during an instruction or data TLB lock operation, the Fault Status register is updated to indicate a Lock Abort, and the exception is reported as a data abort. Example 15. Locking Entries into the Instruction TLB ; R1, R2 and R3 contain the virtual addresses to translate and lock into ; the instruction TLB. ; The value in R0 is ignored in the following instruction. ; Hardware guarantees that accesses to CP15 occur in program order MCR P15,0,R0,C8,C5,0 ; Invalidate the entire instruction TLB MCR P15,0,R1,C10,C4,0 ; ; MCR P15,0,R2,C10,C4,0 ; ; MCR P15,0,R3,C10,C4,0 ; ; Translate virtual address (R1) and lock into instruction TLB Translate virtual address (R2) and lock into instruction TLB Translate virtual address (R3) and lock into instruction TLB CPWAIT ; The MMU is guaranteed to be updated at this point; the next instruction will ; see the locked instruction TLB entries. Note: If exceptions are allowed to occur in the middle of this routine, the TLB may end up caching a translation that is about to be locked. For example, if R1 is the virtual address of an interrupt service routine and that interrupt occurs immediately after the TLB has been invalidated, the lock operation will be ignored when the interrupt service routine returns back to this code sequence. Software should disable interrupts (FIQ or IRQ) in this case. As a general rule, software should avoid locking in all other exception types. 86 Hardware Reference Manual Intel® IXP2800 Network Processor Intel XScale® Core The proper procedure for locking entries into the data TLB is shown in Example 16. Example 16. 
Locking Entries into the Data TLB
; R1 and R2 contain the virtual addresses to translate and lock into the data TLB
MCR P15,0,R1,C8,C6,1   ; Invalidate the data TLB entry specified by the
                       ; virtual address in R1
MCR P15,0,R1,C10,C8,0  ; Translate virtual address (R1) and lock into data TLB
; Repeat sequence for virtual address in R2
MCR P15,0,R2,C8,C6,1   ; Invalidate the data TLB entry specified by the
                       ; virtual address in R2
MCR P15,0,R2,C10,C8,0  ; Translate virtual address (R2) and lock into data TLB
CPWAIT                 ; wait for locks to complete
; The MMU is guaranteed to be updated at this point; the next instruction will
; see the locked data TLB entries.

Note: Care must be exercised here when allowing exceptions to occur during this routine whose handlers may have data that lies in a page that is trying to be locked into the TLB.

3.3.4.4 Round-Robin Replacement Algorithm

The line replacement algorithm for the TLBs is round-robin; there is a round-robin pointer that keeps track of the next entry to replace. The next entry to replace is the one sequentially after the last entry that was written. For example, if the last virtual to physical address translation was written into entry 5, the next entry to replace is entry 6.
At reset, the round-robin pointer is set to entry 31. Once a translation is written into entry 31, the round-robin pointer gets set to the next available entry, beginning with entry 0 if no entries have been locked down. Subsequent translations move the round-robin pointer to the next sequential entry until entry 31 is reached, where it will wrap back to entry 0 upon the next translation.
A lock pointer is used for locking entries into the TLB and is set to entry 0 at reset. A TLB lock operation places the specified translation at the entry designated by the lock pointer, moves the lock pointer to the next sequential entry, and resets the round-robin pointer to entry 31. Locking entries into either TLB effectively reduces the available entries for updating. For example, if the first three entries were locked down, the round-robin pointer would be entry 3 after it rolled over from entry 31.
Only entries 0 through 30 can be locked in either TLB; entry 31 can never be locked. If the lock pointer is at entry 31, a lock operation will update the TLB entry with the translation and ignore the lock. In this case, the round-robin pointer will stay at entry 31.
Figure 17 illustrates locked entries in the TLB.

Figure 17. Example of Locked Entries in TLB
(Entries 0 through 7 locked; entries 8 through 31 available for replacement — 8 entries locked, 24 entries available for round robin replacement.)

3.4 Instruction Cache

The Intel XScale® core instruction cache enhances performance by reducing the number of instruction fetches from external memory. The cache provides fast execution of cached code. Code can also be locked down when guaranteed or fast access time is required.
Figure 18 shows the cache organization and how the instruction address is used to access the cache. The instruction cache is a 32-Kbyte, 32-way set associative cache; this means there are 32 sets with each set containing 32 ways. Each way of a set contains eight 32-bit words and one valid bit, which is referred to as a line. The replacement policy is a round-robin algorithm and the cache also supports the ability to lock code in at a line granularity.
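As an illustration of how a set is selected, the following fragment derives the fields of Figure 18 from a virtual instruction address. It is a sketch only: the register assignments are arbitrary, and the bit positions are simply those shown in the figure (tag in bits 31:10, set index in bits 9:5, word in bits 4:2).

; Sketch only: compute the instruction cache fields for the virtual address in r0.
MOV r1, r0, LSR #5    ; drop the word/byte offset (bits 4:0)
AND r1, r1, #0x1F     ; r1 = set index (bits 9:5), selects one of the 32 sets
MOV r2, r0, LSR #10   ; r2 = tag (bits 31:10), compared against the CAM of that set
MOV r3, r0, LSR #2
AND r3, r3, #0x7      ; r3 = word (bits 4:2) within the 8-word line

For example, virtual address 0x0000_1040 yields set index 2 and word 0, so the fetch is looked up in set 2.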
Figure 18. Instruction Cache Organization
(32 sets, each containing 32 ways; each way holds an 8-word (32-byte) cache line with a CAM tag and a Data field. The virtual instruction address is divided into Tag (bits 31:10), Set Index (bits 9:5), and Word (bits 4:2) fields; the example shows Set 0 being selected by the Set Index, with word select delivering a 4-byte instruction word. Note: CAM = Content Addressable Memory.)

The instruction cache is virtually addressed and virtually tagged. The virtual address presented to the instruction cache may be remapped by the PID register.

3.4.1 Instruction Cache Operation

3.4.1.1 Operation when Instruction Cache is Enabled

When the cache is enabled, it compares every instruction request address to the addresses of instructions that it is holding in cache. If the requested instruction is found, the access “hits” the cache, which returns the requested instruction. If the instruction is not found, the access “misses” the cache, which requests a fetch from external memory of the 8-word line (32 bytes) that contains the instruction (using the fetch policy). As the fetch returns instructions to the cache, they are put in one of two fetch buffers and the requested instruction is delivered to the instruction decoder.
A fetched line is written into the cache if it is cacheable (code is cacheable if the MMU is disabled, or if the MMU is enabled and the cacheable (C) bit is set to 1 in its corresponding page).
Note: An instruction fetch may “miss” the cache but “hit” one of the fetch buffers. If this happens, the requested instruction is delivered to the instruction decoder in the same manner as a cache “hit.”

3.4.1.2 Operation when Instruction Cache is Disabled

Disabling the cache prevents any lines from being written into the instruction cache. Although the cache is disabled, it is still accessed and may generate a “hit” if the data is already in the cache.
Disabling the instruction cache does not disable instruction buffering that may occur within the instruction fetch buffers. Two 8-word instruction fetch buffers will always be enabled in the cache-disabled mode. As instruction fetches continue to “hit” within either buffer (even in the presence of forward and backward branches), no external fetches for instructions are generated. A miss causes one or the other buffer to be filled from external memory using the fill policy.

3.4.1.3 Fetch Policy

An instruction-cache “miss” occurs when the requested instruction is not found in the instruction fetch buffers or instruction cache; a fetch request is then made to external memory. The instruction cache can handle up to two “misses.” Each external fetch request uses a fetch buffer that holds 32 bytes and eight valid bits, one for each word.
A miss causes the following:
1. A fetch buffer is allocated.
2. The instruction cache sends a fetch request to the external bus. This request is for a 32-byte line.
3. Instruction words are returned from the external bus, at a maximum rate of 1 word per core cycle. As each word returns, the corresponding valid bit is set for the word in the fetch buffer.
4. As soon as the fetch buffer receives the requested instruction, it forwards the instruction to the instruction decoder for execution.
5.
When all words have returned, the fetched line will be written into the instruction cache if it is cacheable and if the instruction cache is enabled. The line chosen for update in the cache is controlled by the round-robin replacement algorithm. This update may evict a valid line at that location. 6. Once the cache is updated, the eight valid bits of the fetch buffer are invalidated. 3.4.1.4 Round-Robin Replacement Algorithm The line replacement algorithm for the instruction cache is round-robin. Each set in the instruction cache has a round-robin pointer that keeps track of the next line (in that set) to replace. The next line to replace in a set is the one after the last line that was written. For example, if the line for the last external instruction fetch was written into way 5-set 2, the next line to replace for that set would be way 6. None of the other round-robin pointers for the other sets are affected in this case. After reset, way 31 is pointed to by the round-robin pointer for all the sets. Once a line is written into way 31, the round-robin pointer points to the first available way of a set, beginning with way0 if no lines have been locked into that particular set. Locking lines into the instruction cache effectively reduces the available lines for cache updating. For example, if the first three lines of a set were locked down, the round-robin pointer would point to the line at way 3 after it rolled over from way 31. 90 Hardware Reference Manual Intel® IXP2800 Network Processor Intel XScale® Core 3.4.1.5 Parity Protection The instruction cache is protected by parity to ensure data integrity. Each instruction cache word has 1 parity bit. (The instruction cache tag is not parity protected.) When a parity error is detected on an instruction cache access, a prefetch abort exception occurs if the Intel XScale® core attempts to execute the instruction. Before servicing the exception, hardware place a notification of the error in the Fault Status register (Coprocessor 15, register 5). A software exception handler can recover from an instruction cache parity error. This can be accomplished by invalidating the instruction cache and the branch target buffer and then returning to the instruction that caused the prefetch abort exception. A simplified code example is shown in Example 17. A more complex handler might choose to invalidate the specific line that caused the exception and then invalidate the BTB. Example 17. Recovering from an Instruction Cache Parity Error ; Prefetch abort handler MCR P15,0,R0,C7,C5,0 ; Invalidate the instruction cache and branch target ; buffer CPWAIT ; wait for effect ; SUBS PC,R14,#4 ; Returns to the instruction that generated the ; parity error ; The Instruction Cache is guaranteed to be invalidated at this point If a parity error occurs on an instruction that is locked in the cache, the software exception handler needs to unlock the instruction cache, invalidate the cache and then re-lock the code in before it returns to the faulting instruction. 3.4.1.6 Instruction Cache Coherency The instruction cache does not detect modification to program memory by loads, stores or actions of other bus masters. Several situations may require program memory modification, such as uploading code from disk. The application program is responsible for synchronizing code modification and invalidating the cache. In general, software must ensure that modified code space is not accessed until modification and invalidating are completed. 
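As a minimal sketch of such a sequence for cacheable code, the following combines operations shown elsewhere in this chapter (the drain of buffered writes and the invalidation of the instruction cache and BTB) with a clean-data-cache-line operation. The clean-line encoding (CP15, c7, c10, 1) and the single-line scope are assumptions for illustration; a real routine would clean every modified line and should be checked against the Intel XScale® Core Developers Manual.

; Sketch only: make one newly written 32-byte line of code (address in R0)
; visible to the instruction stream.
MCR P15, 0, R0, C7, C10, 1  ; clean the data cache line holding the new code
                            ; (encoding assumed; verify against the core manual)
MCR P15, 0, R0, C7, C10, 4  ; drain the write buffer to external memory
MCR P15, 0, R0, C7, C5, 0   ; invalidate the instruction cache and BTB
CPWAIT                      ; ensure the invalidate has taken effect before
                            ; executing the modified code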
To achieve cache coherence, instruction cache contents can be invalidated after code modification in external memory is complete. If the instruction cache is not enabled, or code is being written to a non-cacheable region, software must still invalidate the instruction cache before using the newly-written code. This precaution ensures that state associated with the new code is not buffered elsewhere in the processor, such as the fetch buffers or the BTB. Naturally, when writing code as data, care must be taken to force it completely out of the processor into external memory before attempting to execute it. If writing into a non-cacheable region, flushing the write buffers is sufficient precaution. If writing to a cacheable region, then the data cache should be submitted to a Clean/Invalidate operation to ensure coherency. Hardware Reference Manual 91 Intel® IXP2800 Network Processor Intel XScale® Core 3.4.2 Instruction Cache Control 3.4.2.1 Instruction Cache State at Reset After reset, the instruction cache is always disabled, unlocked, and invalidated (flushed). 3.4.2.2 Enabling/Disabling The instruction cache is enabled by setting bit 12 in coprocessor 15, register 1 (Control register). This process is illustrated in Example 18. Example 18. Enabling the Instruction Cache ; Enable the ICache MRC P15, 0, R0, C1, C0, 0 ORR R0, R0, #0x1000 MCR P15, 0, R0, C1, C0, 0 ; Get the control register ; set bit 12 -- the I bit ; Set the control register CPWAIT 3.4.2.3 Invalidating the Instruction Cache The entire instruction cache along with the fetch buffers are invalidated by writing to coprocessor 15, register 7. This command does not unlock any lines that were locked in the instruction cache nor does it invalidate those locked lines. To invalidate the entire cache including locked lines, the unlock instruction cache command needs to be executed before the invalidate command. There is an inherent delay from the execution of the instruction cache invalidate command to where the next instruction will see the result of the invalidate. The routine in Example 19 can be used to guarantee proper synchronization. Example 19. Invalidating the Instruction Cache MCR P15,0,R1,C7,C5,0 ; Invalidate the instruction cache and branch ; target buffer CPWAIT ; The instruction cache is guaranteed to be invalidated at this point; the next ; instruction sees the result of the invalidate command. The Intel XScale® core also supports invalidating an individual line from the instruction cache. 3.4.2.4 Locking Instructions in the Instruction Cache Software has the ability to lock performance critical routines into the instruction cache. Up to 28 lines in each set can be locked; hardware will ignore the lock command if software is trying to lock all the lines in a particular set (i.e., ways 28 – 31can never be locked). When this happens, the line is still allocated into the cache, but the lock will be ignored. The round-robin pointer will stay at way 31 for that set. Lines can be locked into the instruction cache by initiating a write to coprocessor 15. Register Rd contains the virtual address of the line to be locked into the cache. 92 Hardware Reference Manual Intel® IXP2800 Network Processor Intel XScale® Core There are several requirements for locking down code: 1. The routine used to lock lines down in the cache must be placed in non-cacheable memory, which means the MMU is enabled. As a corollary: no fetches of cacheable code should occur while locking instructions into the cache. 2. 
The code being locked into the cache must be cacheable.
3. The instruction cache must be enabled and invalidated prior to locking down lines.
Failure to follow these requirements will produce unpredictable results when accessing the instruction cache.
System programmers should ensure that the code to lock instructions into the cache does not reside closer than 128 bytes to a non-cacheable/cacheable page boundary. If the processor fetches ahead into a cacheable page, then the first requirement noted above could be violated.
Lines are locked into a set starting at way 0 and may progress up to way 27; which set a line gets locked into depends on the set index of the virtual address. Figure 19 is an example of where lines of code may be locked into the cache along with how the round-robin pointer is affected.

Figure 19. Locked Line Effect on Round Robin Replacement
(Notes: set 0: 8 ways locked, 24 ways available for round robin replacement; set 1: 23 ways locked, 9 ways available for round robin replacement; set 2: 28 ways locked, only ways 28 – 31 available for replacement; set 31: all 32 ways available for round robin replacement.)

Software can lock down several different routines located at different memory locations. This may cause some sets to have more locked lines than others as shown in Figure 19.
Example 20 shows how a routine, called “lockMe” in this example, might be locked into the instruction cache. Note that it is possible to receive an exception while locking code.

Example 20. Locking Code into the Cache
lockMe:      ; This is the code that will be locked into the cache
    mov r0, #5
    add r5, r1, r2
    . . .
lockMeEnd:
    . . .

codeLock:    ; here is the code to lock the “lockMe” routine
    ldr r0, =(lockMe AND NOT 31)    ; r0 gets a pointer to the first line we should lock
    ldr r1, =(lockMeEnd AND NOT 31) ; r1 contains a pointer to the last line we should lock
lockLoop:
    mcr p15, 0, r0, c9, c1, 0       ; lock next line of code into ICache
    cmp r0, r1                      ; are we done yet?
    add r0, r0, #32                 ; advance pointer to next line
    bne lockLoop                    ; if not done, do the next line

3.4.2.5 Unlocking Instructions in the Instruction Cache

The Intel XScale® core provides a global unlock command for the instruction cache. Writing to coprocessor 15, register 9 unlocks all the locked lines in the instruction cache and leaves them valid. These lines then become available for the round-robin replacement algorithm.

3.5 Branch Target Buffer (BTB)

The Intel XScale® core uses dynamic branch prediction to reduce the penalties associated with changing the flow of program execution. The Intel XScale® core features a branch target buffer that provides the instruction cache with the target address of branch type instructions. The branch target buffer is implemented as a 128-entry, direct mapped cache.

3.5.1 Branch Target Buffer Operation

The BTB stores the history of branches that have executed along with their targets. Figure 20 shows an entry in the BTB, where the tag is the instruction address of a previously executed branch and the data contains the target address of the previously executed branch along with two bits of history information.
Figure 20. BTB Entry
(Each entry consists of a TAG — the Branch Address[31:9,1] — and DATA — the Target Address[31:1] plus History Bits[1:0].)

The BTB takes the current instruction address and checks to see if this address is a branch that was previously seen. It uses bits [8:2] of the current address to read out the tag and then compares this tag to bits [31:9,1] of the current instruction address. If the current instruction address matches the tag in the cache and the history bits indicate that this branch has usually been taken in the past, the BTB uses the data (target address) as the next instruction address to send to the instruction cache.
Bit[1] of the instruction address is included in the tag comparison to support Thumb execution. This organization means that two consecutive Thumb branch (B) instructions, with instruction address bits[8:2] the same, will contend for the same BTB entry. Thumb also requires 31 bits for the branch target address. In ARM* mode, bit[1] is 0.
The history bits represent four possible prediction states for a branch entry in the BTB. Figure 21 shows these states along with the possible transitions. The initial state for branches stored in the BTB is Weakly-Taken (WT). Every time a branch that exists in the BTB is executed, the history bits are updated to reflect the latest outcome of the branch, either taken or not-taken.
The BTB does not have to be managed explicitly by software; it is disabled by default after reset and is invalidated when the instruction cache is invalidated.

Figure 21. Branch History
(State diagram showing the four prediction states and the taken/not-taken transitions between them. Notes: SN: Strongly Not Taken; WN: Weakly Not Taken; WT: Weakly Taken; ST: Strongly Taken.)

3.5.1.1 Reset

After Processor Reset, the BTB is disabled and all entries are invalidated.

3.5.2 Update Policy

A new entry is stored into the BTB when the following conditions are met:
• The branch instruction has executed
• The branch was taken
• The branch is not currently in the BTB
The entry is then marked valid and the history bits are set to WT. If another valid branch exists at the same entry in the BTB, it will be evicted by the new branch. Once a branch is stored in the BTB, the history bits are updated upon every execution of the branch as shown in Figure 21.

3.5.3 BTB Control

3.5.3.1 Disabling/Enabling

The BTB is always disabled with Reset. Software can enable the BTB through a bit in a coprocessor register.
Before enabling or disabling the BTB, software must invalidate it (described in the following section). This action will ensure correct operation in case stale data is in the BTB. Software should not place any branch instruction between the code that invalidates the BTB and the code that enables/disables it.

3.5.3.2 Invalidation

There are four ways the contents of the BTB can be invalidated:
1. Reset.
2. Software can directly invalidate the BTB via a CP15, register 7 function.
3. The BTB is invalidated when the Process ID register is written.
4. The BTB is invalidated when the instruction cache is invalidated via CP15, register 7 functions.

3.6 Data Cache

The Intel XScale® core data cache enhances performance by reducing the number of data accesses to and from external memory. There are two data cache structures in the Intel XScale® core, a 32-Kbyte data cache and a 2-Kbyte mini-data cache.
An eight-entry write buffer and a four-entry fill buffer are also implemented to decouple the Intel XScale® core instruction execution from external memory accesses, which increases overall system performance.

3.6.1 Overviews

3.6.1.1 Data Cache Overview

The data cache is a 32-Kbyte, 32-way set associative cache, i.e., there are 32 sets and each set has 32 ways. Each way of a set contains 32 bytes (one cache line) and one valid bit. There also exist two dirty bits for every line, one for the lower 16 bytes and the other one for the upper 16 bytes. When a store hits the cache, the dirty bit associated with it is set. The replacement policy is a round-robin algorithm and the cache also supports the ability to reconfigure each line as data RAM. Figure 22 shows the cache organization and how the data address is used to access the cache.
Cache policies may be adjusted for particular regions of memory by altering page attribute bits in the MMU descriptor that controls that memory.
The data cache is virtually addressed and virtually tagged. It supports write-back and write-through caching policies. The data cache always allocates a line in the cache when a cacheable read miss occurs and will allocate a line into the cache on a cacheable write miss when write allocate is specified by its page attribute. Page attribute bits determine whether a line gets allocated into the data cache or mini-data cache.

Figure 22. Data Cache Organization
(32 sets of 32 ways; each way holds a 32-byte cache line with a CAM tag and a Data field. The virtual data address is divided into Tag (bits 31:10), Set Index (bits 9:5), Word (bits 4:2), and Byte (bits 1:0) fields; the example shows Set 0 being selected by the Set Index. Word select, byte alignment, sign extension, and byte select deliver a 4-byte data word to the destination register. Note: CAM = Content Addressable Memory.)

3.6.1.2 Mini-Data Cache Overview

The mini-data cache is a 2-Kbyte, 2-way set associative cache; this means there are 32 sets with each set containing 2 ways. Each way of a set contains 32 bytes (one cache line) and one valid bit. There also exist 2 dirty bits for every line, one for the lower 16 bytes and the other one for the upper 16 bytes. When a store hits the cache, the dirty bit associated with it is set. The replacement policy is a round-robin algorithm. Figure 23 shows the cache organization and how the data address is used to access the cache.
The mini-data cache is virtually addressed and virtually tagged and supports the same caching policies as the data cache. However, lines cannot be locked into the mini-data cache.

Figure 23. Mini-Data Cache Organization
(32 sets of 2 ways; each way holds a 32-byte cache line. The virtual data address is divided into Tag (bits 31:10), Set Index (bits 9:5), Word (bits 4:2), and Byte (bits 1:0) fields; the example shows Set 0 being selected by the Set Index. Note: CAM = Content Addressable Memory.)

3.6.1.3 Write Buffer and Fill Buffer Overview

The Intel XScale® core employs an eight-entry write buffer, each entry containing 16 bytes. Stores to external memory are first placed in the write buffer and subsequently taken out when the bus is available.
The write buffer supports the coalescing of multiple store requests to external memory. An incoming store may coalesce with any of the eight entries. The fill buffer holds the external memory request information for a data cache or mini-data cache fill or non-cacheable read request. Up to four 32-byte read request operations can be outstanding in the fill buffer before the Intel XScale® core needs to stall. The fill buffer has been augmented with a four-entry pend buffer that captures data memory requests to outstanding fill operations. Each entry in the pend buffer contains enough data storage to hold one 32-bit word, specifically for store operations. Cacheable load or store operations that hit an entry in the fill buffer get placed in the pend buffer and are completed when the associated fill completes. Any entry in the pend buffer can be pended against any of the entries in the fill buffer; multiple entries in the pend buffer can be pended against a single entry in the fill buffer. Pended operations complete in program order. 3.6.2 Data Cache and Mini-Data Cache Operation The following discussions refer to the data cache and mini-data cache as one cache (data/minidata) since their behavior is the same when accessed. 3.6.2.1 Operation when Caching is Enabled When the data/mini-data cache is enabled for an access, the data/mini-data cache compares the address of the request against the addresses of data that it is currently holding. If the line containing the address of the request is resident in the cache, the access “hits’ the cache. For a load operation the cache returns the requested data to the destination register and for a store operation the data is stored into the cache. The data associated with the store may also be written to external memory if write-through caching is specified for that area of memory. If the cache does not contain the requested data, the access ‘misses’ the cache, and the sequence of events that follows depends on the configuration of the cache, the configuration of the MMU and the page attributes. 3.6.2.2 Operation when Data Caching is Disabled The data/mini-data cache is still accessed even though it is disabled. If a load hits the cache it will return the requested data to the destination register. If a store hits the cache, the data is written into the cache. Any access that misses the cache will not allocate a line in the cache when it’s disabled, even if the MMU is enabled and the memory region’s cacheability attribute is set. Hardware Reference Manual 99 Intel® IXP2800 Network Processor Intel XScale® Core 3.6.2.3 3.6.2.3.1 Cache Policies Cacheability Data at a specified address is cacheable given the following: • The MMU is enabled • The cacheable attribute is set in the descriptor for the accessed address • The data/mini-data cache is enabled 3.6.2.3.2 Read Miss Policy The following sequence of events occurs when a cacheable load operation misses the cache: 1. The fill buffer is checked to see if an outstanding fill request already exists for that line. — If so, the current request is placed in the pending buffer and waits until the previously requested fill completes, after which it accesses the cache again, to obtain the request data and returns it to the destination register. — If there is no outstanding fill request for that line, the current load request is placed in the fill buffer and a 32-byte external memory read request is made. If the pending buffer or fill buffer is full, the Intel XScale® core will stall until an entry is available. 2. 
A line is allocated in the cache to receive the 32 bytes of fill data. The line selected is determined by the round-robin pointer (see Section 3.6.2.4). The line chosen may contain a valid line previously allocated in the cache. In this case both dirty bits are examined and if set, the four words associated with a dirty bit that’s asserted will be written back to external memory as a 4-word burst operation. 3. When the data requested by the load is returned from external memory, it is immediately sent to the destination register specified by the load. A system that returns the requested data back first, with respect to the other bytes of the line, will obtain the best performance. 4. As data returns from external memory, it is written into the cache in the previously allocated line. A load operation that misses the cache and is not cacheable makes a request from external memory for the exact data size of the original load request. For example, LDRH requests exactly two bytes from external memory, LDR requests four bytes from external memory, etc. This request is placed in the fill buffer until, the data is returned from external memory, which is then forwarded back to the destination register(s). 100 Hardware Reference Manual Intel® IXP2800 Network Processor Intel XScale® Core 3.6.2.3.3 Write Miss Policy A write operation that misses the cache, requests a 32-byte cache line from external memory if the access is cacheable and write allocation is specified in the page; then, the following events occur: 1. The fill buffer is checked to see if an outstanding fill request already exists for that line. — If so, the current request is placed in the pending buffer and waits until the previously requested fill completes, after which it writes its data into the recently allocated cache line. — If there is no outstanding fill request for that line, the current store request is placed in the fill buffer and a 32-byte external memory read request is made. If the pending buffer or fill buffer is full, the Intel XScale® core will stall until an entry is available. 2. The 32 bytes of data can be returned back to the Intel XScale® core in any word order, i.e, the eight words in the line can be returned in any order. Note that it does not matter, for performance reasons, which order the data is returned to the Intel XScale® core since the store operation has to wait until the entire line is written into the cache before it can complete. 3. When the entire 32-byte line has returned from external memory, a line is allocated in the cache, selected by the round-robin pointer (see Section 3.6.2.4). The line to be written into the cache may replace a valid line previously allocated in the cache. In this case both dirty bits are examined and if any are set, the four words associated with a dirty bit that’s asserted will be written back to external memory as a 4-word burst operation. This write operation will be placed in the write buffer. 4. The line is written into the cache along with the data associated with the store operation. If the above condition for requesting a 32-byte cache line is not met, a write miss will cause a write request to external memory for the exact data size specified by the store operation, assuming the write request does not coalesce with another write operation in the write buffer. 3.6.2.3.4 Write-Back versus Write-Through The Intel XScale® core supports write-back caching or write-through caching, controlled through the MMU page attributes. 
When write-through caching is specified, all store operations are written to external memory even if the access hits the cache. This feature keeps the external memory coherent with the cache, i.e., no dirty bits are set for this region of memory in the data/mini-data cache. This however does not guarantee that the data/mini-data cache is coherent with external memory, which is dependent on the system level configuration, specifically if the external memory is shared by another master. When write-back caching is specified, a store operation that hits the cache will not generate a write to external memory, thus reducing external memory traffic. Hardware Reference Manual 101 Intel® IXP2800 Network Processor Intel XScale® Core 3.6.2.4 Round-Robin Replacement Algorithm The line replacement algorithm for the data cache is round-robin. Each set in the data cache has a round-robin pointer that keeps track of the next line (in that set) to replace. The next line to replace in a set is the next sequential line after the last one that was just filled. For example, if the line for the last fill was written into way 5-set 2, the next line to replace for that set would be way 6. None of the other round-robin pointers for the other sets are affected in this case. After reset, way 31 is pointed to by the round-robin pointer for all the sets. Once a line is written into way 31, the round-robin pointer points to the first available way of a set, beginning with way 0 if no lines have been reconfigured as data RAM in that particular set. Reconfiguring lines as data RAM effectively reduces the available lines for cache updating. For example, if the first three lines of a set were reconfigured, the round-robin pointer would point to the line at way 3 after it rolled over from way 31. Refer to Section 3.6.4 for more details on data RAM. The mini-data cache follows the same round-robin replacement algorithm as the data cache except that there are only two lines the round-robin pointer can point to such that the round-robin pointer always points to the least recently filled line. A least recently used replacement algorithm is not supported because the purpose of the mini-data cache is to cache data that exhibits low temporal locality, i.e., data that is placed into the mini-data cache is typically modified once and then written back out to external memory. 3.6.2.5 Parity Protection The data cache and mini-data cache are protected by parity to ensure data integrity; there is one parity bit per byte of data. (The tags are not parity protected.) When a parity error is detected on a data/mini-data cache access, a data abort exception occurs. Before servicing the exception, hardware will set bit 10 of the Fault Status register. A data/mini-data cache parity error is an imprecise data abort, meaning R14_ABORT (+8) may not point to the instruction that caused the parity error. If the parity error occurred during a load, the targeted register may be updated with incorrect data. A data abort due to a data/mini-data cache parity error may not be recoverable if the data address that caused the abort occurred on a line in the cache that has a write-back caching policy. Prior updates to this line may be lost; in this case the software exception handler should perform a “clean and clear” operation on the data cache, ignoring subsequent parity errors, and restart the offending process. This operation is shown in Section 3.6.3.3.1. 
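A data abort handler can test for this case by reading the Fault Status register noted above. The fragment below is a sketch only: the label names are hypothetical, and branching to a global clean-and-clear routine (such as the one in Section 3.6.3.3.1) is one possible recovery policy, not a requirement.

; Sketch only: distinguish a data/mini-data cache parity error from other
; data aborts. Label names are hypothetical.
MRC P15, 0, R0, C5, C0, 0   ; read the Fault Status register (CP15, register 5)
TST R0, #0x400              ; bit 10 set -> data/mini-data cache parity error
BNE cache_parity_recover    ; hypothetical: clean-and-clear, then restart process
B   other_data_abort        ; hypothetical: handle all other data abort causes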
3.6.2.6 Atomic Accesses The SWP and SWPB instructions generate an atomic load and store operation allowing a memory semaphore to be loaded and altered without interruption. These accesses may hit or miss the data/ mini-data cache depending on configuration of the cache, configuration of the MMU, and the page attributes. Refer to Section 3.11.4 for more information. 102 Hardware Reference Manual Intel® IXP2800 Network Processor Intel XScale® Core 3.6.3 Data Cache and Mini-Data Cache Control 3.6.3.1 Data Memory State After Reset After processor reset, both the data cache and mini-data cache are disabled, all valid bits are set to 0 (invalid), and the round-robin bit points to way 31. Any lines in the data cache that were configured as data RAM before reset are changed back to cacheable lines after reset, i.e., there are 32 KBytes of data cache and 0 bytes of data RAM. 3.6.3.2 Enabling/Disabling The data cache and mini-data cache are enabled by setting bit 2 in coprocessor 15, register 1 (Control register). Example 21 shows code that enables the data and mini-data caches. Note that the MMU must be enabled to use the data cache. Example 21. Enabling the Data Cache enableDCache: MCR p15, 0, r0, c7, c10, 4; Drain pending data operations... ; MRC p15, 0, r0, c1, c0, 0; Get current control register ORR r0, r0, #4 ; Enable DCache by setting ‘C’ (bit 2) MCR p15, 0, r0, c1, c0, 0; And update the Control register 3.6.3.3 Invalidate and Clean Operations Individual entries can be invalidated and cleaned in the data cache and mini-data cache via coprocessor 15, register 7. Note that a line locked into the data cache remains locked even after it has been subjected to an invalidate-entry operation. This will leave an unusable line in the cache until a global unlock has occurred. For this reason, do not use these commands on locked lines. This same register also provides the command to invalidate the entire data cache and mini-data cache. These global invalidate commands have no effect on lines locked in the data cache. Locked lines must be unlocked before they can be invalidated. This is accomplished by the Unlock Data Cache command. Hardware Reference Manual 103 Intel® IXP2800 Network Processor Intel XScale® Core 3.6.3.3.1 Global Clean and Invalidate Operation A simple software routine is used to globally clean the data cache. It takes advantage of the lineallocate data cache operation, which allocates a line into the data cache. This allocation evicts any cache dirty data back to external memory. Example 22 shows how data cache can be cleaned. Example 22. Global Clean Operation ; ; ; ; ; Global Clean/Invalidate THE DATA CACHE R1 contains the virtual address of a region of cacheable memory reserved for this clean operation R0 is the loop count; Iterate 1024 times which is the number of lines in the data cache ;; Macro ALLOCATE performs the line-allocation cache operation on the ;; address specified in register Rx. ;; MACRO ALLOCATE Rx MCR P15, 0, Rx, C7, C2, 5 ENDM MOV R0, #1024 LOOP1: ALLOCATE R1 ; ; ; ; Allocate a line at the virtual address specified by R1. Increment the address in R1 to the next cache line Decrement loop count ADD R1, R1, #32 SUBS R0, R0, #1 BNE LOOP1 ; ;Clean the Mini-data Cache ; Can’t use line-allocate command, so cycle 2KB of unused data through. ; R2 contains the virtual address of a region of cacheable memory reserved for ; cleaning the Mini-data Cache ; R0 is the loop count; Iterate 64 times which is the number of lines in the ; Mini-data Cache. 
MOV R0, #64 LOOP2: LDR R3,[R2],#32 ; Load and increment to next cache line SUBS R0, R0, #1 ; Decrement loop count BNE LOOP2 ; ; Invalidate the data cache and mini-data cache MCR P15, 0, R0, C7, C6, 0 ; The line-allocate operation does not require physical memory to exist at the virtual address specified by the instruction, since it does not generate a load/fill request to external memory. Also, the line-allocate operation does not set the 32 bytes of data associated with the line to any known value. Reading this data will produce unpredictable results. The line-allocate command will not operate on the mini Data Cache, so system software must clean this cache by reading two Kbytes of contiguous unused data into it. This data must be unused and reserved for this purpose so that it will not already be in the cache. It must reside in a page that is marked as mini Data Cache cacheable. The time it takes to execute a global clean operation depends on the number of dirty lines in cache. 104 Hardware Reference Manual Intel® IXP2800 Network Processor Intel XScale® Core 3.6.4 Reconfiguring the Data Cache as Data RAM Software has the ability to lock tags associated with 32-byte lines in the data cache, thus creating the appearance of data RAM. Any subsequent access to this line will always hit the cache unless it is invalidated. Once a line is locked into the data cache it is no longer available for cache allocation on a line fill. Up to 28 lines in each set can be reconfigured as data RAM, such that the maximum data RAM size is 28 Kbytes. Hardware does not support locking lines into the mini-data cache; any attempt to do this will produce unpredictable results. There are two methods for locking tags into the data cache; the method of choice depends on the application. One method is used to lock data that resides in external memory into the data cache and the other method is used to reconfigure lines in the data cache as data RAM. Locking data from external memory into the data cache is useful for lookup tables, constants, and any other data that is frequently accessed. Reconfiguring a portion of the data cache as data RAM is useful when an application needs scratch memory (bigger than the register file can provide) for frequently used variables. These variables may be strewn across memory, making it advantageous for software to pack them into data RAM memory. Refer to the Intel XScale® Core Developers Manual for code examples. Tags can be locked into the data cache by enabling the data cache lock mode bit located in coprocessor 15, register 9. Once enabled, any new lines allocated into the data cache will be locked down. Note that the PLD instruction will not affect the cache contents if it encounters an error while executing. For this reason, system software should ensure the memory address used in the PLD is correct. If this cannot be ascertained, replace the PLD with a LDR instruction that targets a scratch register. Lines are locked into a set starting at way 0 and may progress up to way 27; which set a line gets locked into depends on the set index of the virtual address of the request. Figure 19 is an example of where lines of code may be locked into the cache along with how the round-robin pointer is affected. Software can lock down data located at different memory locations. This may cause some sets to have more locked lines than others as shown in Figure 19. Lines are unlocked in the data cache by performing an unlock operation. 
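A rough sketch of the reconfiguration flow is shown below, subject to the cautions in the following paragraphs. The CP15 register 9 encoding used to enter and exit lock mode, the line count, and the label are assumptions for illustration; the Intel XScale® Core Developers Manual contains the authoritative code examples.

; Sketch only: reconfigure 16 lines (512 bytes) starting at the virtual
; address in r0 as data RAM. Encodings marked "assumed" must be verified.
    MOV  r1, #1
    MCR  p15, 0, r1, c9, c2, 0  ; assumed: set the data cache lock mode bit
    CPWAIT
    MOV  r2, #16                ; arbitrary number of 32-byte lines to lock
lockRam:
    MCR  p15, 0, r0, c7, c2, 5  ; line-allocate at r0 (same operation used by
                                ; the global clean routine in Example 22)
    ADD  r0, r0, #32            ; advance to the next cache line
    SUBS r2, r2, #1
    BNE  lockRam
    MOV  r1, #0
    MCR  p15, 0, r1, c9, c2, 0  ; assumed: clear lock mode; subsequent fills
                                ; are no longer locked
    CPWAIT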
Before locking, the programmer must ensure that no part of the target data range is already resident in the cache. The Intel XScale® core will not refetch such data, which will result in it not being locked into the cache. If there is any doubt as to the location of the targeted memory data, the cache should be cleaned and invalidated to prevent this scenario. If the cache contains a locked region that the programmer wishes to lock again, then the cache must be unlocked before being cleaned and invalidated. Hardware Reference Manual 105 Intel® IXP2800 Network Processor Intel XScale® Core 3.6.5 Write Buffer/Fill Buffer Operation and Control The write buffer is always enabled, which means stores to external memory will be buffered. The K bit in the Auxiliary Control register (CP15, register 1) is a global enable/disable for allowing coalescing in the write buffer. When this bit disables coalescing, no coalescing will occur regardless the value of the page attributes. If this bit enables coalescing, the page attributes X, C, and B are examined to see if coalescing is enabled for each region of memory. All reads and writes to external memory occur in program order when coalescing is disabled in the write buffer. If coalescing is enabled in the write buffer, writes may occur out of program order to external memory. Program correctness is maintained in this case by comparing all store requests with all the valid entries in the fill buffer. The write buffer and fill buffer support a drain operation, such that before the next instruction executes, all the Intel XScale® core data requests to external memory have completed. Writes to a region marked non-cacheable/non-bufferable (page attributes C, B, and X all 0) will cause execution to stall until the write completes. If software is running in a privileged mode, it can explicitly drain all buffered writes. 3.7 Configuration The System Control Coprocessor (CP15) configures the MMU, caches, buffers and other system attributes. Where possible, the definition of CP15 follows the definition of the StrongARM* products. Coprocessor 14 (CP14) contains the performance monitor registers and the trace buffer registers. CP15 is accessed through MRC and MCR coprocessor instructions and allowed only in privileged mode. Any access to CP15 in user mode or with LDC or STC coprocessor instructions will cause an undefined instruction exception. CP14 registers can be accessed through MRC, MCR, LDC, and STC coprocessor instructions and allowed only in privileged mode. Any access to CP14 in user mode will cause an undefined instruction exception. The Intel XScale® core Coprocessors, CP15 and CP14, do not support access via CDP, MRRC, or MCRR instructions. An attempt to access these coprocessors with these instructions will result in an Undefined Instruction exception. Many of the MCR commands available in CP15 modify hardware state sometime after execution. A software sequence is available for those wishing to determine when this update occurs. Like certain other ARM* architecture products, the Intel XScale® core includes an extra level of virtual address translation in the form of a PID (Process ID) register and associated logic. Privileged code needs to be aware of this facility because, when interacting with CP15, some addresses are modified by the PID and others are not. An address that has yet to be modified by the PID (“PIDified”) is known as a virtual address (VA). 
An address that has been through the PID logic, but not translated into a physical address, is a modified virtual address (MVA). Non-privileged code always deals with VAs, while privileged code that programs CP15 occasionally needs to use MVAs. For details, refer to the Intel XScale® Core Developer's Manual.

3.8 Performance Monitoring

The Intel XScale® core hardware provides two 32-bit performance counters that allow two unique events to be monitored simultaneously. In addition, the Intel XScale® core implements a 32-bit clock counter that can be used in conjunction with the performance counters; its sole purpose is to count the number of core clock cycles, which is useful in measuring total execution time.

The Intel XScale® core can monitor either occurrence events or duration events. When counting occurrence events, a counter is incremented each time a specified event takes place; when measuring duration, a counter counts the number of processor clocks that occur while a specified condition is true. If any of the three counters overflows, an IRQ or FIQ will be generated if it is enabled. Each counter has its own interrupt enable. The counters continue to monitor events even after an overflow occurs, until disabled by software. Refer to the Intel® IXP2400 and IXP2800 Network Processor Programmer's Reference Manual for more detail.

Each of these counters can be programmed to monitor any one of various events. To further augment performance monitoring, the Intel XScale® core clock counter can be used to measure the executing time of an application. This information, combined with a duration event, yields the percentage of time the event occurred with respect to overall execution time.

Each of the three counters and the performance monitoring control register are accessible through Coprocessor 14 (CP14), registers 0–3. Access is allowed in privileged mode only.

The following are a few notes about controlling the performance monitoring mechanism:

• An interrupt will be reported when a counter's overflow flag is set and its associated interrupt enable bit is set in the PMNC register. The interrupt will remain asserted until software clears the overflow flag by writing a one to the flag that is set. Note: the product-specific interrupt unit and the CPSR must have enabled the interrupt in order for software to receive it.
• The counters continue to record events even after they overflow.

3.8.1 Performance Monitoring Events

Table 24 lists events that may be monitored by the PMU. Each of the Performance Monitor Count registers (PMN0 and PMN1) can count any listed event. Software selects which event is counted by each PMNx register by programming the evtCountx fields of the PMNC register.

Table 24. Performance Monitoring Events

Event Number (evtCount0 or evtCount1) | Event Definition
0x0  | Instruction cache miss requires fetch from external memory.
0x1  | Instruction cache cannot deliver an instruction. This could indicate an ICache miss or an ITLB miss. This event will occur every cycle in which the condition is present.
0x2  | Stall due to a data dependency. This event will occur every cycle in which the condition is present.
0x3  | Instruction TLB miss.
0x4  | Data TLB miss.
0x5  | Branch instruction executed; the branch may or may not have changed program flow.
0x6  | Branch mispredicted. (B and BL instructions only.)
0x7  | Instruction executed.
0x8  | Stall because the data cache buffers are full. This event will occur every cycle in which the condition is present.
0x9  | Stall because the data cache buffers are full. This event will occur once for each contiguous sequence of this type of stall.
0xA  | Data cache access, not including Cache Operations.
0xB  | Data cache miss, not including Cache Operations.
0xC  | Data cache write-back. This event occurs once for each ½ line (four words) that is written back from the cache.
0xD  | Software changed the PC. This event occurs any time the PC is changed by software and there is not a mode change. For example, a mov instruction with PC as the destination will trigger this event. Executing a swi from User mode will not trigger this event, because it will incur a mode change.
0x10 – 0x17 | Refer to the Intel® IXP2400 and IXP2800 Network Processor Programmer's Reference Manual for more details.
all others  | Reserved; unpredictable results.

Some typical combinations of counted events are listed in this section and summarized in Table 25. In this section, such an event combination is called a mode.

Table 25. Some Common Uses of the PMU

Mode                          | PMNC.evtCount0                | PMNC.evtCount1
Instruction Cache Efficiency  | 0x7 (instruction count)       | 0x0 (ICache miss)
Data Cache Efficiency         | 0xA (DCache access)           | 0xB (DCache miss)
Instruction Fetch Latency     | 0x1 (ICache cannot deliver)   | 0x0 (ICache miss)
Data/Bus Request Buffer Full  | 0x8 (DBuffer stall duration)  | 0x9 (DBuffer stall)
Stall/Writeback Statistics    | 0x2 (data stall)              | 0xC (DCache writeback)
Instruction TLB Efficiency    | 0x7 (instruction count)       | 0x3 (ITLB miss)
Data TLB Efficiency           | 0xA (DCache access)           | 0x4 (DTLB miss)

3.8.1.1 Instruction Cache Efficiency Mode

PMN0 totals the number of instructions that were executed, which does not include instructions fetched from the instruction cache that were never executed. This can happen if a branch instruction changes the program flow; the instruction cache may retrieve the next sequential instructions after the branch before it receives the target address of the branch. PMN1 counts the number of instruction fetch requests to external memory. Each of these requests loads 32 bytes at a time.

Statistics derived from these two events:

• Instruction cache miss-rate. This is derived by dividing PMN1 by PMN0.
• The average number of cycles it took to execute an instruction, commonly referred to as cycles-per-instruction (CPI). CPI can be derived by dividing CCNT by PMN0, where CCNT was used to measure total execution time.

3.8.1.2 Data Cache Efficiency Mode

PMN0 totals the number of data cache accesses, which includes cacheable and non-cacheable accesses, mini-data cache accesses, and accesses made to locations configured as data RAM. Note that STM and LDM will each count as several accesses to the data cache, depending on the number of registers specified in the register list. LDRD will register two accesses.

PMN1 counts the number of data cache and mini-data cache misses. Cache operations do not contribute to this count.

The statistic derived from these two events is:

• Data cache miss-rate. This is derived by dividing PMN1 by PMN0.

3.8.1.3 Instruction Fetch Latency Mode

PMN0 accumulates the number of cycles when the instruction cache is not able to deliver an instruction to the Intel XScale® core due to an instruction-cache miss or instruction-TLB miss.
This event means that the processor core is stalled. PMN1 counts the number of instruction fetch requests to external memory. Each of these requests loads 32 bytes at a time. This is the same event as measured in instruction cache efficiency mode and is included in this mode for convenience so that only one performance monitoring run is need. Statistics derived from these two events: • The average number of cycles the processor stalled waiting for an instruction fetch from external memory to return. This is calculated by dividing PMN0 by PMN1. If the average is high then the Intel XScale® core may be starved of the bus external to the Intel XScale® core. • The percentage of total execution cycles the processor stalled waiting on an instruction fetch from external memory to return. This is calculated by dividing PMN0 by CCNT, which was used to measure total execution time. 3.8.1.4 Data/Bus Request Buffer Full Mode The Data Cache has buffers available to service cache misses or uncacheable accesses. For every memory request that the Data Cache receives from the processor core, a buffer is speculatively allocated in case an external memory request is required or temporary storage is needed for an unaligned access. If no buffers are available, the Data Cache will stall the processor core. The frequency of Data Cache stalls depends on the performance of the bus external to the Intel XScale® core and what the memory access latency is for Data Cache miss requests to external memory. If the Intel XScale® core memory access latency is high (possibly due to starvation) these Data Cache buffers will become full. This performance monitoring mode is provided to determine whether the Intel XScale® core is being starved of the bus external to the Intel XScale® core — which affects the performance of the application running on the Intel XScale® core. PMN0 accumulates the number of clock cycles by which the processor is stalled due to this condition and PMN1 monitors the number of times this condition occurs. Hardware Reference Manual 109 Intel® IXP2800 Network Processor Intel XScale® Core Statistics derived from these two events: • The average number of cycles the processor stalled on a data-cache access that may overflow the data-cache buffers. This is calculated by dividing PMN0 by PMN1. This statistic lets you know if the duration event cycles are due to many requests or are attributed to just a few requests. If the average is high, the Intel XScale® core may be starved of the bus external to the Intel XScale® core. • The percentage of total execution cycles the processor stalled because a Data Cache request buffer was not available. This is calculated by dividing PMN0 by CCNT, which was used to measure total execution time. 3.8.1.5 Stall/Writeback Statistics When an instruction requires the result of a previous instruction and that result is not yet available, the Intel XScale® core stalls, to preserve the correct data dependencies. PMN0 counts the number of stall cycles due to data dependencies. Not all data dependencies cause a stall; only the following dependencies cause such a stall penalty: • Load-use penalty: attempting to use the result of a load before the load completes. To avoid the penalty, software should delay using the result of a load until it’s available. This penalty shows the latency effect of data-cache access. • Multiply/Accumulate-use penalty: attempting to use the result of a multiply or multiply- accumulate operation before the operation completes. 
Again, to avoid the penalty, software should delay using the result until it’s available. • ALU use penalty: there are a few isolated cases where back-to-back ALU operations may result in one cycle delay in the execution. PMN1 counts the number of writeback operations emitted by the data cache. These writebacks occur when the data cache evicts a dirty line of data to make room for a newly requested line or as the result of clean operation (CP15, register 7). Statistics derived from these two events: • The percentage of total execution cycles the processor stalled because of a data dependency. This is calculated by dividing PMN0 by CCNT, which was used to measure total execution time. Often, a compiler can reschedule code to avoid these penalties when given the right optimization switches. • Total number of data writeback requests to external memory can be derived solely with PMN1. 110 Hardware Reference Manual Intel® IXP2800 Network Processor Intel XScale® Core 3.8.1.6 Instruction TLB Efficiency Mode PMN0 totals the number of instructions that were executed, which does not include instructions that were translated by the instruction TLB and never executed. This can happen if a branch instruction changes the program flow; the instruction TLB may translate the next sequential instructions after the branch, before it receives the target address of the branch. PMN1 counts the number of instruction TLB table-walks that occurs when there is a TLB miss. If the instruction TLB is disabled, PMN1 will not increment. Statistics derived from these two events: • Instruction TLB miss-rate. This is derived by dividing PMN1 by PMN0. • The average number of cycles it took to execute an instruction or commonly referred to as cycles-per-instruction (CPI). CPI can be derived by dividing CCNT by PMN0, where CCNT was used to measure total execution time. 3.8.1.7 Data TLB Efficiency Mode PMN0 totals the number of data cache accesses, which includes cacheable and non-cacheable accesses, mini-data cache access and accesses made to locations configured as data RAM. Note that STM and LDM will each count as several accesses to the data TLB depending on the number of registers specified in the register list. LDRD will register two accesses. PMN1 counts the number of data TLB table-walks, which occurs when there is a TLB miss. If the data TLB is disabled PMN1 will not increment. The statistic derived from these two events is: • Data TLB miss-rate. This is derived by dividing PMN1 by PMN0. 3.8.2 Multiple Performance Monitoring Run Statistics Even though only two events can be monitored at any given time, multiple performance monitoring runs can be done, capturing different events from different modes. For example, the first run could monitor the number of writeback operations (PMN1 of mode, Stall/Writeback) and the second run could monitor the total number of data cache accesses (PMN0 of mode, Data Cache Efficiency). From the results, a percentage of writeback operations to the total number of data accesses can be derived. 3.9 Performance Considerations This section describes relevant performance considerations that compiler writers, application programmers, and system designers need to be aware of to efficiently use the Intel XScale® core. Performance numbers discussed here include interrupt latency, branch prediction, and instruction latencies. 
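The monitoring modes of Section 3.8.1 all reduce to the same CP14 programming pattern: select the two events in PMNC, reset and enable the counters, run the workload, then read CCNT, PMN0, and PMN1 and form the ratios described above. The following sketch shows this pattern for the instruction cache efficiency mode. The CP14 register numbers (PMNC = register 0, CCNT = 1, PMN0 = 2, PMN1 = 3) are described in Section 3.8; the PMNC field positions used here (evtCount0 in bits 19:12, evtCount1 in bits 27:20, reset/enable in bits 2:0) are assumptions and should be checked against the Programmer's Reference Manual before use.

; Sketch: instruction cache efficiency mode (see Table 25).
startCounters:
        MOV  r0, #0x7000              ; evtCount0 = 0x7 (instructions executed);
                                      ; evtCount1 = 0x0 (ICache miss) - field left at zero
        ORR  r0, r0, #0x7             ; reset CCNT/PMN0/PMN1 and set the enable bit
        MCR  p14, 0, r0, c0, c0, 0    ; write PMNC - counting starts
        ; ... run the code under measurement ...
stopAndRead:
        MRC  p14, 0, r1, c0, c0, 0
        BIC  r1, r1, #1
        MCR  p14, 0, r1, c0, c0, 0    ; clear the enable bit to freeze the counters
        MRC  p14, 0, r2, c1, c0, 0    ; CCNT: total execution cycles
        MRC  p14, 0, r3, c2, c0, 0    ; PMN0: instructions executed
        MRC  p14, 0, r4, c3, c0, 0    ; PMN1: instruction cache misses
        ; miss rate = PMN1/PMN0, CPI = CCNT/PMN0 (Section 3.8.1.1)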
Hardware Reference Manual 111 Intel® IXP2800 Network Processor Intel XScale® Core 3.9.1 Interrupt Latency Minimum Interrupt Latency is defined as the minimum number of cycles from the assertion of any interrupt signal (IRQ or FIQ) to the execution of the instruction at the vector for that interrupt. The point at which the assertion begins is TBD. This number assumes best case conditions exist when the interrupt is asserted, e.g., the system isn’t waiting on the completion of some other operation. A useful number to work with is the Maximum Interrupt Latency. This is typically a complex calculation that depends on what else is going on in the system at the time the interrupt is asserted. Some examples that can adversely affect interrupt latency are: • • • • The instruction currently executing could be a 16-register LDM. The processor could fault just when the interrupt arrives. The processor could be waiting for data from a load, doing a page table walk, etc. There are high core-to-system (bus) clock ratios. Maximum Interrupt Latency can be reduced by: • Ensuring that the interrupt vector and interrupt service routine are resident in the instruction cache. This can be accomplished by locking them down into the cache. • Removing or reducing the occurrences of hardware page table walks. This also can be accomplished by locking down the application’s page table entries into the TLBs, along with the page table entry for the interrupt service routine. 3.9.2 Branch Prediction The Intel XScale® core implements dynamic branch prediction for the ARM* instructions B and BL and for the Thumb instruction B. Any instruction that specifies the PC as the destination is predicted as not taken. For example, an LDR or a MOV that loads or moves directly to the PC will be predicted not taken and incur a branch latency penalty. These instructions -- ARM B, ARM BL and Thumb B -- enter into the branch target buffer when they are “taken” for the first time. (A “taken” branch refers to when they are evaluated to be true.) Once in the branch target buffer, the Intel XScale® core dynamically predicts the outcome of these instructions based on previous outcomes. Table 26 shows the branch latency penalty when these instructions are correctly predicted and when they are not. A penalty of 0 for correct prediction means that the Intel XScale® core can execute the next instruction in the program flow in the cycle following the branch. Table 26. Branch Latency Penalty Core Clock Cycles Description ARM* Thumb +0 +0 Predicted Correctly. The instruction is in the branch target cache and is correctly predicted. +5 Mispredicted. There are three occurrences of branch misprediction, all of which incur a 4-cycle branch delay penalty. 1. The instruction is in the branch target buffer and is predicted not-taken, but is actually taken. 2. The instruction is not in the branch target buffer and is a taken branch. 3. The instruction is in the branch target buffer and is predicted taken, but is actually not-taken +4 112 Hardware Reference Manual Intel® IXP2800 Network Processor Intel XScale® Core 3.9.3 Addressing Modes All load and store addressing modes implemented in the Intel XScale® core do not add to the instruction latencies numbers. 3.9.4 Instruction Latencies The latencies for all the instructions are shown in the following sections with respect to their functional groups: branch, data processing, multiply, status register access, load/store, semaphore, and coprocessor. The following section explains how to read these tables. 
3.9.4.1 Performance Terms • Issue Clock (cycle 0) The first cycle when an instruction is decoded and allowed to proceed to further stages in the execution pipeline (i.e., when the instruction is actually issued). • Cycle Distance from A to B The cycle distance from cycle A to cycle B is (B-A) – that is, the number of cycles from the start of cycle A to the start of cycle B. Example: the cycle distance from cycle 3 to cycle 4 is one cycle. • Issue Latency The cycle distance from the first issue clock of the current instruction to the issue clock of the next instruction. The actual number of cycles can be influenced by cache-misses, resourcedependency stalls, and resource availability conflicts. • Result Latency The cycle distance from the first issue clock of the current instruction to the issue clock of the first instruction that can use the result without incurring a resource dependency stall. The actual number of cycles can be influenced by cache-misses, resource-dependency stalls, and resource availability conflicts • Minimum Issue Latency (without Branch Misprediction) The minimum cycle distance from the issue clock of the current instruction to the first possible issue clock of the next instruction assuming best case conditions (i.e., that the issuing of the next instruction is not stalled due to a resource dependency stall; the next instruction is immediately available from the cache or memory interface; the current instruction does not incur resource dependency stalls during execution that cannot be detected at issue time; and if the instruction uses dynamic branch prediction, correct prediction is assumed). • Minimum Result Latency The required minimum cycle distance from the issue clock of the current instruction to the issue clock of the first instruction that can use the result without incurring a resource dependency stall assuming best case conditions (i.e., that the issuing of the next instruction is not stalled due to a resource dependency stall; the next instruction is immediately available from the cache or memory interface; and the current instruction does not incur resource dependency stalls during execution that cannot be detected at issue time). • Minimum Issue Latency (with Branch Misprediction) The minimum cycle distance from the issue clock of the current branching instruction to the first possible issue clock of the next instruction. This definition is identical to Minimum Issue Latency except that the branching instruction has been mispredicted. It is calculated by adding Hardware Reference Manual 113 Intel® IXP2800 Network Processor Intel XScale® Core Minimum Issue Latency (without Branch Misprediction) to the minimum branch latency penalty number from Table 26, which is four cycles. • Minimum Resource Latency The minimum cycle distance from the issue clock of the current multiply instruction to the issue clock of the next multiply instruction assuming the second multiply does not incur a data dependency and is immediately available from the instruction cache or memory interface. Example 23 contains a code fragment and an example of computing latencies. Example 23. Computing Latencies UMLALr6,r8,r0,r1 ADD r9,r10,r11 SUB r2,r8,r9 MOV r0,r1 Table 27 shows how to calculate Issue Latency and Result Latency for each instruction. Looking at the issue column, the UMLAL instruction starts to issue on cycle 0 and the next instruction, ADD, issues on cycle 2, so the Issue Latency for UMLAL is two. 
From the code fragment, there is a result dependency between the UMLAL instruction and the SUB instruction. In Table 27, UMLAL starts to issue at cycle 0 and the SUB issues at cycle 5; so the Result Latency is 5. Table 27. Latency Example Cycle 114 Issue Executing 0 umlal (1st cycle) -- 1 umlal (2nd cycle) umlal 2 add umlal 3 sub (stalled) umlal & add 4 sub (stalled) umlal 5 sub umlal 6 mov sub 7 -- mov Hardware Reference Manual Intel® IXP2800 Network Processor Intel XScale® Core 3.9.4.2 Branch Instruction Timings Table 28. Branch Instruction Timings (Predicted by the BTB) Mnemonic Minimum Issue Latency when Correctly Predicted by the BTB Minimum Issue Latency with Branch Misprediction B 1 5 BL 1 5 ( Table 29. Branch Instruction Timings (Not Predicted by the BTB) 1. 3.9.4.3 Mnemonic Minimum Issue Latency when the branch is not taken Minimum Issue Latency when the branch is taken BLX(1) N/A 5 BLX(2) 1 5 BX 1 5 Data Processing Instruction with PC as the destination Same as Table 30 4 + numbers in Table 30 LDR PC,<> 2 8 LDM with PC in register list 3 + numreg1 10 + max (0, numreg-3) numreg is the number of registers in the register list including the PC. Data Processing Instruction Timings Table 30. Data Processing Instruction Timings Mnemonic 1.is not a Shift/Rotate by Register is a Shift/Rotate by Register or is RRX Minimum Issue Latency Minimum Result Latency1 Minimum Issue Latency Minimum Result Latency1 ADC 1 1 2 2 ADD 1 1 2 2 AND 1 1 2 2 BIC 1 1 2 2 CMN 1 1 2 2 CMP 1 1 2 2 EOR 1 1 2 2 MOV 1 1 2 2 MVN 1 1 2 2 ORR 1 1 2 2 RSB 1 1 2 2 RSC 1 1 2 2 SBC 1 1 2 2 SUB 1 1 2 2 TEQ 1 1 2 2 TST 1 1 2 2 If the next instruction needs to use the result of the data processing for a shift by immediate or as Rn in a QDADD or QDSUB, one extra cycle of result latency is added to the number listed. Hardware Reference Manual 115 Intel® IXP2800 Network Processor Intel XScale® Core 3.9.4.4 Multiply Instruction Timings Table 31. Multiply Instruction Timings (Sheet 1 of 2) Mnemonic MLA Rs Value (Early Termination) S-Bit Value Minimum Issue Latency Minimum Result Latency1 Minimum Resource Latency (Throughput) Rs[31:15] = 0x00000 or Rs[31:15] = 0x1FFFF 0 1 2 1 1 2 2 2 Rs[31:27] = 0x00 or Rs[31:27] = 0x1F 0 1 3 2 1 3 3 3 0 1 4 3 1 4 4 4 Rs[31:15] = 0x00000 or Rs[31:15] = 0x1FFFF 0 1 2 1 1 2 2 2 Rs[31:27] = 0x00 or Rs[31:27] = 0x1F 0 1 3 2 1 3 3 3 0 1 4 3 1 4 4 4 Rs[31:15] = 0x00000 or Rs[31:15] = 0x1FFFF 0 2 RdLo = 2; RdHi = 3 2 1 3 3 3 Rs[31:27] = 0x00 or Rs[31:27] = 0x1F 0 2 RdLo = 3; RdHi = 4 3 1 4 4 4 0 2 RdLo = 4; RdHi = 5 4 1 5 5 5 all others MUL all others SMLAL all others SMLALxy N/A N/A 2 RdLo = 2; RdHi = 3 2 SMLAWy N/A N/A 1 3 2 SMLAxy N/A N/A 1 2 1 Rs[31:15] = 0x00000 or Rs[31:15] = 0x1FFFF 0 1 RdLo = 2; RdHi = 3 2 1 3 3 3 Rs[31:27] = 0x00 or Rs[31:27] = 0x1F 0 1 RdLo = 3; RdHi = 4 3 1 4 4 4 0 1 RdLo = 4; RdHi = 5 4 1 5 5 5 SMULL all others SMULWy N/A N/A 1 3 2 SMULxy N/A N/A 1 2 1 0 2 RdLo = 2; RdHi = 3 2 1 3 3 3 0 2 RdLo = 3; RdHi = 4 3 1 4 4 4 0 2 RdLo = 4; RdHi = 5 4 1 5 5 5 Rs[31:15] = 0x00000 UMLAL Rs[31:27] = 0x00 all others 116 Hardware Reference Manual Intel® IXP2800 Network Processor Intel XScale® Core Table 31. Multiply Instruction Timings (Sheet 2 of 2) Rs Value (Early Termination) Mnemonic Minimum Issue Latency Minimum Result Latency1 0 1 RdLo = 2; RdHi = 3 2 1 3 3 3 0 1 RdLo = 3; RdHi = 4 3 1 4 4 4 0 1 RdLo = 4; RdHi = 5 4 1 5 5 5 Rs[31:15] = 0x00000 UMULL Rs[31:27] = 0x00 all others 1. 
Minimum Resource Latency (Throughput) S-Bit Value If the next instruction needs to use the result of the multiply for a shift by immediate or as Rn in a QDADD or QDSUB, one extra cycle of result latency is added to the number listed. Table 32. Multiply Implicit Accumulate Instruction Timings Rs Value (Early Termination) Minimum Issue Latency Minimum Result Latency Minimum Resource Latency (Throughput) Rs[31:16] = 0x0000 or Rs[31:16] = 0xFFFF 1 1 1 Rs[31:28] = 0x0 or Rs[31:28] = 0xF 1 2 2 all others 1 3 3 MIAxy N/A 1 1 1 MIAPH N/A 1 2 2 Mnemonic MIA Table 33. Implicit Accumulator Access Instruction Timings Mnemonic MAR MRA 1. 3.9.4.5 Minimum Issue Latency Minimum Result Latency Minimum Resource Latency (Throughput) 2 2 2 1 (RdLo = 2; RdHi = 3)1 2 If the next instruction needs to use the result of the MRA for a shift by immediate or as Rn in a QDADD or QDSUB, one extra cycle of result latency is added to the number listed. Saturated Arithmetic Instructions h Table 34. Saturated Data Processing Instruction Timings Minimum Issue Latency Minimum Result Latency QADD Mnemonic 1 2 QSUB 1 2 QDADD 1 2 QDSUB 1 2 Hardware Reference Manual 117 Intel® IXP2800 Network Processor Intel XScale® Core 3.9.4.6 Status Register Access Instructions Table 35. Status Register Access Instruction Timings Mnemonic 3.9.4.7 Minimum Issue Latency Minimum Result Latency MRS 1 2 MSR 2 (6 if updating mode bits) 1 Load/Store Instructions Table 36. Load and Store Instruction Timings Mnemonic Minimum Issue Latency Minimum Result Latency LDR 1 3 for load data; 1 for writeback of base LDRB 1 3 for load data; 1 for writeback of base LDRBT 1 3 for load data; 1 for writeback of base LDRD 1 (+1 if Rd is R12) 3 for Rd; 4 for Rd+1; 2 for writeback of base LDRH 1 3 for load data; 1 for writeback of base LDRSB 1 3 for load data; 1 for writeback of base LDRSH 1 3 for load data; 1 for writeback of base LDRT 1 3 for load data; 1 for writeback of base PLD 1 N/A STR 1 1 for writeback of base STRB 1 1 for writeback of base STRBT 1 1 for writeback of base STRD 2 1 for writeback of base STRH 1 1 for writeback of base STRT 1 1 for writeback of base Table 37. Load and Store Multiple Instruction Timings Minimum Issue Latency1 Minimum Result Latency LDM 3 – 23 1 – 3 for load data; 1 for writeback of base STM 3 – 18 1 for writeback of base Mnemonic 1. 3.9.4.8 LDM issue latency is 7 + N if R15 is in the register list and 2 + N if it is not. STM issue latency is calculated as 2 + N. N is the number of registers to load or store. Semaphore Instructions Table 38. Semaphore Instruction Timings Mnemonic 118 Minimum Issue Latency Minimum Result Latency SWP 5 5 SWPB 5 5 Hardware Reference Manual Intel® IXP2800 Network Processor Intel XScale® Core 3.9.4.9 Coprocessor Instructions Table 39. CP15 Register Access Instruction Timings Mnemonic Minimum Issue Latency Minimum Result Latency MRC 4 4 MCR 2 N/A Table 40. CP14 Register Access Instruction Timings Mnemonic 3.9.4.10 Minimum Issue Latency Minimum Result Latency MRC 7 7 MCR 7 N/A LDC 10 N/A STC 7 N/A Miscellaneous Instruction Timing Table 41. SWI Instruction Timings Mnemonic Minimum latency to first instruction of SWI exception handler SWI 6 Table 42. Count Leading Zeros Instruction Timings Mnemonic Minimum Issue Latency Minimum Result Latency 1 1 CLZ 3.9.4.11 Thumb Instructions The timing of Thumb instructions are the same as their equivalent ARM* instructions. This mapping can be found in the ARM* Architecture Reference Manual. 
The only exception is the Thumb BL instruction when H = 0; the timing in this case would be the same as an ARM* data processing instruction. 3.10 Test Features This section gives a brief overview of the Intel XScale® core JTAG features. The Intel XScale® core provides test features compatible with the IEEE Standard Test Access Port and Boundary Scan Architecture (IEEE Std. 1149.1). These features include a TAP controller, a 5-bit instruction register, and test data registers to support software debug. The Intel XScale® core also provides support for a boundary-scan register, device ID register, and other data test registers. A full description of these features can be found in the Intel® IXP2400 and IXP2800 Network Processor Programmer’s Reference Manual. Hardware Reference Manual 119 Intel® IXP2800 Network Processor Intel XScale® Core 3.10.1 IXP2800 Network Processor Endianness Endianness defines the way bytes are addressed within a word. A little-endian system is one in which byte 0 is the least significant byte (LSB) in the word and byte 3 is the most significant byte (MSB). A big-endian system is one in which byte 0 is the MSB and byte 3 is the LSB. For example, the value of 0x12345678 at address 0x0 in a 32-bit little-endian system looks like this: Table 43. Little-Endian Encoding Address/Byte Lane 0x0/ByteLane 3 0x0/ByteLane 2 0x0/ByteLane 1 0x0/ByteLane 0 Byte Value 0x12 0x34 0x56 0x78 The same value stored in a big-endian system is shown in Table 44: Table 44. Big-Endian Encoding Address/Byte Lane 0x0/ByteLane 3 0x0/ByteLane 2 0x0/ByteLane 1 0x0/ByteLane 0 Byte Value 0x78 0x56 0x34 0x12 Bits within a byte are always in little-endian order. The least significant bit resides at bit location 0 and the most significant bit resides at bit location 7 (7:0). The following conventions are used in this document: 1 Byte: 8-bit data 1 Word: 16-bit data 1 Longword: 32-bit data Longword Little-Endian Format (LWLE) 32-bit data (0x12345678) arranged as {12 34 56 78} 64-bit data 0x12345678 9ABCDE56 arranged as {12 34 56 78 9A BC DE 56} Longword Big-Endian format (LWBE): 32-bit data (0x12345678) arranged as {78 56 34 12} 64-bit data 0x12345678 9ABCDE56 arranged as {78 56 34 12, 56 DE BC 9A} Endianness for the IXP2800 network processor can be divided into three major categories: • Read and write transactions initiated by the Intel XScale® core: — Reads initiated by the Intel XScale® core — Writes initiated by the Intel XScale® core • SRAM and DRAM access: — 64-bit Data transfer between DRAM and the Intel XScale® core — Byte, word, or longword transfer between SRAM/DRAM and the Intel XScale® core — Data transfer between SRAM/DRAM and PCI — Microengine-initiated access to SRAM and DRAM • PCI Accesses — Intel XScale® core generated reads/writes to PCI in memory space — Intel XScale® core generated read/write of external/internal PCI configuration registers 120 Hardware Reference Manual Intel® IXP2800 Network Processor Intel XScale® Core 3.10.1.1 Read and Write Transactions Initiated by the Intel XScale® Core The Intel XScale® core may be used in either a little-endian or big-endian configuration. The configuration affects the entire system in which the Intel XScale® core microarchitecture exists. Software and hardware must agree on the byte ordering to be used. In software, a system’s byte order is configured with CP15 register 1, the control register. Bit 7 of this register, the B bit, informs the processor of the byte order in use by the system. 
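A minimal sketch of configuring the byte order, using the same Control register access sequence as Example 21, is shown below; clear bit 7 instead of setting it for a little-endian system.

; Sketch: select big-endian operation by setting the B bit.
setBigEndian:
        MRC  p15, 0, r0, c1, c0, 0    ; read the current Control register
        ORR  r0, r0, #0x80            ; set 'B' (bit 7) - big-endian byte order
        MCR  p15, 0, r0, c1, c0, 0    ; write it back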
Note that this bit takes effect even if the MMU is not otherwise in use or enabled. The state of this bit is reflected in the cbiBigEndian signal. Although it is the responsibility of system hardware to assign correct byte lanes to each byte field in the data bus, in the IXP2800 network processor, it is left to the software to interpret byte lanes in accordance with the endianness of the system. As shown in Figure 24, system byte lanes 0 – 3 are connected directly to the Intel XScale® core byte lanes 0 – 3. This means that byte lane 0 (M[7:0]) of the system is connected to byte lane 0 (X[7:0]) of the Intel XScale® core, byte lane 1 (M[15:8]) of the system is connected to byte lane 1 (X[15:8]) of the Intel XScale® core, etc. Interface operation of the Intel XScale® core and the rest of the IXP2800 network processor can be divided into two parts: • Intel XScale® core reading from the IXP2800 network processor • Intel XScale® core writing to the IXP2800 network processor 3.10.1.1.1 Reads Initiated by the Intel XScale® Core Intel XScale® core reads can be one of the following three types: • Byte read • 16-bits (word) read • 32-bits (longword) read Byte Read When reading a byte, the Intel XScale® core generates the byte_enable that corresponds to the proper byte lane as defined by the endianness setting. Table 45 summarizes byte-enable generation for this mode. Table 45. Byte-Enable Generation by the Intel XScale® Core for Byte Transfers in Little- and Big-Endian Systems Byte Number to be Read Byte-Enables for Little-Endian System Byte-Enables for Big-Endian System X_BE[0] X_BE[1] X_BE[2] X_BE[3] X_BE[0] X_BE[1] X_BE[2] X_BE[3] Byte 0 1 0 0 0 0 0 0 1 Byte 1 0 1 0 0 0 0 1 0 Byte 2 0 0 1 0 0 1 0 0 Byte 3 0 0 0 1 1 0 0 0 The 4-to-1 multiplexer steers the byte read into the byte lane 0 location of the read register inside the Intel XScale® core. Select signals for the multiplexer are generated based on endian setting and ByteEnable generated by the Intel XScale® core as defined in Figure 24. Hardware Reference Manual 121 Intel® IXP2800 Network Processor Intel XScale® Core 16-Bit (Word) Read When reading a word, the Intel XScale® core generates the byte_enable that corresponds to the proper byte lane as defined by the endianness setting. Figure 25 summarizes byte enable generation for this mode. The 4-to-1 multiplexer steers byte lane 0 or byte lane 2 into the byte 0 location of the read register inside the Intel XScale® core. The 2-to-1 multiplexer steers byte lane 1 or byte lane 3 into the byte 1 location of the read register inside the Intel XScale® core. The Intel XScale® core does not allow word access to an odd-byte address. Select signals for the multiplexer are generated based on endian setting and ByteEnable generated by the Intel XScale® core, as defined in Figure 24. Table 46 summarizes byte-enable generation for this mode. Figure 24. Byte Steering for Read and Byte-Enable Generation by the Intel XScale® Core Intel XScale® Core D[7:0] X[7:0] Byte 0 0 S0 1 2 3 D[15:8] M[7:0] X[15:8] Byte 1 S1 0 1 BE1 BE2 BE3 M[23:16] X[31:24] Byte 3 D[31:24] BE0 M[15:8] X[23:16] Byte 2 D[23:16] Intel® IXP2800 Core Gasket M[31:24] 0 1 X_BE[0] 0 1 X_BE[1] 0 1 X_BE[2] 0 1 X_BE[3] Big Endian =0 Little Endian = 1 Notes: For 32-bit Operation S0[3:0] = 0001; S1[1:0] = 01 Otherwise: S0[3:0] = X_BE[3:0]; S1[1:0] = X_BE[1:2] A9694-03 122 Hardware Reference Manual Intel® IXP2800 Network Processor Intel XScale® Core Table 46. 
Byte-Enable Generation by the Intel XScale® Core for 16-Bit Data Transfers in Littleand Big-Endian Systems Word to be Read Byte-Enables for Little-Endian System Byte-Enables for Big-Endian System X_BE[0] X_BE[1] X_BE[2] X_BE[3] X_BE[0] X_BE[1] X_BE[2] X_BE[3] Byte 0, Byte 1 1 1 0 0 0 0 1 1 Byte 2, Byte 3 0 0 1 1 1 1 0 0 32-Bit (Longword) Read 32-bit (longword) reads are independent of endianness. Byte lane 0 from the Intel XScale® core’s data bus gets into the byte 0 location of the read register inside the Intel XScale® core, byte lane 1 from the Intel XScale® core’s data bus gets into the byte 1 location of the read register inside the Intel XScale® core, etc. The software determines byte location, based on the endian setting. 3.10.1.1.2 The Intel XScale® Core Writing to the IXP2800 Network Processor Writes by the Intel XScale® core can also be divided into the following three categories: • Byte Write • Word Write (16 bits) • Longword write (32 bits) Byte Write When the Intel XScale® core writes a single byte to external memory, it puts the byte in the byte lane where it intends to write it, along with the byte enable for that byte turned ON, based on the endian setting of the system. Intel XScale® core register bits [7:0] always contain the byte to be written, regardless of the B-bit setting. For example, if the Intel XScale® core wants to write to byte 0 in the little-endian system, it puts the byte in byte lane 0 and turns X_BE[0] to ON. If the system is big-endian, the Intel XScale® core puts byte 0 in byte lane 3 and turns X_BE[3] to ON. Other possible combinations of byte lanes and byte enables are shown in the Table 47. Byte lanes other than the one currently being driven by the Intel XScale® core, contain undefined data. Table 47. Byte-Enable Generation by the Intel XScale® Core for Byte Writes in Little- and Big-Endian Systems Byte Number to be Written Byte-Enables for Little-Endian Systems Byte-Enables for Big-Endian Systems X_BE[0] X_BE[1] X_BE[2] X_BE[3] X_BE[0] X_BE[1] X_BE[2] X_BE[3] Byte 0 1 0 0 0 0 0 0 1 Byte 1 0 1 0 0 0 0 1 0 Byte 2 0 0 1 0 0 1 0 0 Byte 3 0 0 0 1 1 0 0 0 Hardware Reference Manual 123 Intel® IXP2800 Network Processor Intel XScale® Core Word Write (16-Bits Write) When the Intel XScale® core writes a 16-bit word to external memory, it puts the bytes in the byte lanes where it intends to write them along with the byte enables for those bytes turned ON based on the endian setting of the system. The Intel XScale® core does not allow a word write on an odd-byte address. The Intel XScale® core register bits [15:0] always contain the word to be written regardless of the B-bit setting. For example, if the Intel XScale® core wants to write one word to a little-endian system at address 0x0002, it will copy byte 0 to byte lane 2 and byte 1 to byte lane 3 along with X_BE[2] and X_BE[3] turned ON. If the Intel XScale® core wants to write one word to a big-endian system at address 0x0002, it will copy byte 0 to byte lane 0 and byte 1 to byte lane 1 along with X_BE[0] and X_BE[1] turned ON. Table 48 shows other possible combinations of byte lanes and byte enables. Byte lanes other than those currently driven by the Intel XScale® core contain undefined data. Table 48. 
Byte-Enable Generation by the Intel XScale® Core for Word Writes in Little- and Big-Endian Systems Word to be Written Byte-Enables for Little-Endian Systems Byte-Enables for Big-Endian Systems X_BE[0] X_BE[1] X_BE[2] X_BE[3] X_BE[0] X_BE[1] X_BE[2] X_BE[3] Byte 0, Byte 1 1 1 0 0 0 0 1 1 Byte 2, Byte 3 0 0 1 1 1 1 0 0 Longword (32-Bits) Write The longword to be written is put on the Intel XScale® core’s data bus with byte 0 on X[7:0], byte 1 on X[15:8], byte 2 on X[23:16], and byte 4 on X[31:24] (see Figure 25). All of the byte enables are turned ON. A 32-bit longword write (0x12345678) by the Intel XScale® core to address 0x0000 regardless of endianness, causes byte 0 (0x78) to be written to address 0x0000, byte 1 (0x56) to address 0x0001, byte 2 (0x34) to address 0x0002, and byte 3 (0x12) to address 0x0003. Figure 25. Intel XScale® Core-Initiated Write to the IXP2800 Network Processor Byte Write by Intel XScale® Core Byte Write X [7:0] M[7:0] Intel® IXP2800 Core Gasket X_BE [0] X [15:8] M[15:8] X_BE [1] X [23:18] M[23:16] X_BE [2] X [31:24] M[31:24] X_BE [3] A9695-03 124 Hardware Reference Manual Intel® IXP2800 Network Processor Intel XScale® Core Figure 26. Intel XScale® Core-Initiated Write to the IXP2800 Network Processor (Continued) Word Write by Intel XScale® Core Intel® IXP2800 Network Processor X [7:0] Byte 0 Write Word 0 M[7:0] Byte 1 Write X_BE [0] X [15:8] M[15:8] X_BE [1] X [23:18] Word 1 M[23:16] X_BE [2] X [31:24] M[31:24] X_BE [3] Long Word (32 bits)Write by Intel XScale® Core Intel® IXP2800 Network Processor Byte 0 Write X [7:0] M[7:0] X_BE [0] Byte 1 Write X [15:8] M[15:8] X_BE [1] Byte 2 Write X [23:18] M[23:16] X_BE [2] Byte 3 Write X [31:24] M[31:24] X_BE [3] A9696-03 3.11 Intel XScale® Core Gasket Unit 3.11.1 Overview The Intel XScale® core uses the Core Memory Bus (CMB) to communicate with the functional blocks. The rest of the IXP2800 Network Processor functional blocks use the Command Push Pull (CPP) as the global bus to pass data. Therefore, the gasket is needed to translate Core Memory Bus commands to Command Push Pull commands. This gasket has a set of local CSRs, including interrupt registers. These registers can be accessed by the Intel XScale® core via the gasket internal bus.The CSR Access Proxy (CAP) is only allowed to do a set on these interrupt registers. Hardware Reference Manual 125 Intel® IXP2800 Network Processor Intel XScale® Core The Intel XScale® core coprocessor bus is not used in the IXP2800 Network Processors, therefore all accesses are only through the Command Memory Bus. Figure 27 shows the block diagram of the global bus connections to the gasket. The gasket unit has the following features: • Interrupts are sent to the Intel XScale® core via the gasket, with the interrupt controller registers used for masking the interrupts. • The gasket converts CMB reads and writes to CPP format. • All the atomic operations are applied on SRAM and SCRATCH only, not DRAM. • There is a stepping-stone sitting between the Intel XScale® core and the gasket. The Intel XScale® core runs at 600 – 700 MHz. The gasket currently supports a 1:1 (IXP2800 Network Processor) clock ratio. For a 2:1 ratio, the Command Push Pull bus will be running at ½ of the frequency of the Intel XScale® core. • In IXP2800 memory controllers, read after write ordering is enforced. There is no write after read enforcement for the Intel XScale® core. The gasket will perform enforcement by employing Content Addressable Memory (CAM) to detect a write to an address with read pending. 
This only applies for writes to SRAM. • The gasket CPP interface contains one command bus, one D_Push bus, one D_Pull bus, one S_Push bus, and one S_Pull bus, each with a 32-bit data width. A maximum four outstanding reads and four outstanding writes from the Intel XScale® core are allowed. Figure 27. Global Buses Connection to the Intel XScale® Core Gasket Intel XScale® Core Gasket Local CSR Req CAP CSR CMD_BUS SRAM_PULL_BUS SRAM_PUSH_BUS DRAM_PULL_BUS DRAM_PUSH_BUS A9697-03 126 Hardware Reference Manual Intel® IXP2800 Network Processor Intel XScale® Core 3.11.2 Intel XScale® Core Gasket Functional Description 3.11.2.1 Command Memory Bus to Command Push/Pull Conversion The primary function of the Intel XScale® core gasket unit is to translate commands initiated from the Intel XScale® core in the Intel XScale® core command bus format, into the IXP2800 internal command format (Command Push/Pull format). Table 49 shows how many CPP commands are generated by the gasket from each CMB command. Write data is guaranteed to be 32-bit (longword) aligned. Table 49 shows only the Store command. In the Load case, the gasket simply converts it to the CPP format. No command splitting is required. A Load can only be a byte (8 bits), a word (16 bits), longword (32 bits), or eight longwords (8x32). Table 49. CMB Write Command to CPP Command Conversion Store Length CPP SRAM Cmd Count CPP DRAM Cmd Count Byte, word, longword 1 1 2 longwords 1 or 2 1 or 2 3 longwords 1 or 3 2 Remark SRAM uses 4-bit mask, and DRAM uses an 8-bit mask. SRAM: If there is any mask bit detected as ‘0’,two commands will be generated. DRAM: If it starts with odd word address, two commands will be generated. SRAM: If there is a mask bit of ‘0’ detected, Three SRAM commands will be generated. DRAM: always two DRAM commands. 4 longwords 1 or 4 1 or 2 8 longwords 3.11.3 SRAM: If there is a mask bit of ‘0’ detected, four commands will be generated. DRAM: If there is a mask bit of ‘0’ detected, two commands will be generated. Not allowed in a write. CAM Operation In the SRAM controller, access ordering is guaranteed only for a read coming after a write. The gasket enforces order rules in the following two cases. 1. Write coming after a read. 2. Read-Modify-Write coming after read. The address CAMing is on 8-word boundaries. The SRAM effective address is 28 bits. Deduct five bits (two bits for the word address and three bits for eight words), and the tag width for the CAM is 23 bits wide. The CAM only operates on SRAM accesses. Hardware Reference Manual 127 Intel® IXP2800 Network Processor Intel XScale® Core 3.11.4 Atomic Operations The Intel XScale® core has Swap (SWP) and Swap Byte (SWPB) instructions that generate an atomic read-write pair to a single address. These instructions are supported for the SRAM and Scratch space, and also to any other address space if it is done by a Read command followed by Write command. cbiIO is asserted when a data cache request is initiated to a memory region with cacheable and bufferable bits in the translation table first-level descriptor set to 0. Also, if cbiIO is asserted during the CMB read portion of the SWP, then it also does a Read Command followed by Write Command, regardless of address. In those cases, the SWP/SWPB is atomic with respect to processes running on the Intel XScale® core, but not with respect to the Microengines. The following summarizes the Atomic operation. 
Address Space      | cbiIO | Operation
SRAM/Scratch       | 0     | RMW Command
Not SRAM/Scratch   | x     | Read Command followed by Write Command
Any                | 1     | Read Command followed by Write Command

When the Intel XScale® core presents the read command portion of the SWP, it asserts the cbiLock signal. The gasket will "ack" the read and save the BufID in the push_ff. It will not arbitrate for the command bus at that time; rather, it will wait for the corresponding write of the SWP (which will also have cbiLock asserted). At that time the gasket will arbitrate for the command bus to send a command with the atomic operation in the command field (the command is based on the address space chosen for the SRAM/Scratch, which has multiple aliased address ranges). The SRAM or Scratch controller will pull the data, do the atomic read-modify-write, and then push the read data back. The gasket will use the saved BufID when returning the data to the CMB.

Note: Unrelated reads, such as instruction and Page Table fetches, can come in the interval between the read-lock and write-unlock, and will be handled by the gasket. No other data reads or writes will come in that interval. Also, the Intel XScale® core will not wait for the SWP read data before presenting the write data.

The gasket uses address aliases to generate the following atomic operations:

• Bit Set
• Bit Clear
• Add
• Subtract
• Swap

For the alias address type of atomic operation, the Intel XScale® core will issue a SWP command with an alias address if it needs data returned, and a write command with an alias address if it does not. Xscale_IF will not check the second address; as long as it detects one write after one read, both with cbiLock enabled, it will take the write address and put it in the command.

3.11.4.1 Summary of Rules for the Atomic Command Regarding I/O

The following rules summarize the Atomic command regarding I/O.

• SWP to SRAM/Scratch with cbiIO not asserted: Xscale_IF generates an Atomic operation command.
• SWP to any address that is not SRAM/Scratch: treated as separate read and write commands. No Atomic command is generated.
• SWP to SRAM/Scratch with cbiIO asserted: treated as separate read and write commands. No Atomic command is generated.

3.11.4.2 Intel XScale® Core Access to SRAM Q-Array

The Intel XScale® core can access the SRAM controller's queue function to do buffer allocation and freeing. Allocation does an SRAM dequeue (deq) operation, and freeing does an SRAM enqueue (enq) operation. Alias addresses are used as shown in Table 50 to access the freelist. Each SRAM channel supports up to 64 lists, so there are 64 addresses per channel.

Table 50. IXP2800 Network Processor SRAM Q-Array Access Alias Addresses

Channel | Address Range
0       | 0xCC00 0100 – 0xCC00 01FC
1       | 0xCC40 0100 – 0xCC40 01FC
2       | 0xCC80 0100 – 0xCC80 01FC
3       | 0xCCC0 0100 – 0xCCC0 01FC

Address bits 7:2 select which Queue_Array entry within the SRAM channel is used. Doing a load to an address in the table will do a deq, and the SRAM controller returns the dequeued information (i.e., the buffer pointer) as the load data; a store to an address in the table will do an enq, and the data to be enqueued is taken from the Intel XScale® core store data.
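A minimal sketch of this allocation/freeing pattern follows. The choice of SRAM channel 0, Queue_Array entry 0 (alias address 0xCC00 0100 from Table 50) is illustrative only, and the alias page is assumed to be mapped so that these accesses are not satisfied from the data cache.

; Sketch: buffer allocation and freeing through the SRAM Q-Array aliases.
allocBuffer:
        LDR  r1, =0xCC000100          ; channel 0, Queue_Array entry 0 (address bits 7:2)
        LDR  r0, [r1]                 ; load performs a deq; r0 receives the buffer pointer
        ; ... use the buffer ...
freeBuffer:                           ; r0 holds the buffer pointer to return
        LDR  r1, =0xCC000100
        STR  r0, [r1]                 ; store performs an enq of the buffer pointer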
The gasket generates command fields as follows, based on address and cbiLd: Target_ID = SRAM (00 0010) Command = deq (1011) if cbiLd, enq (1100) if ~cbiLd Token[1:0] = 0x0 Byte_Mask = 0xFF Length = 0x1 Address = {XScale_Address[23:22], XScale_Address[7:2], XScale_Write_Data[25:2]} Note: On the command bus, address[31:30] selects the SRAM channel, address[29:24] is the Q_Array number, and address[23:0] is the SRAM longword address. For Dequeue, the SRAM controller ignores address[23:0]. Hardware Reference Manual 129 Intel® IXP2800 Network Processor Intel XScale® Core 3.11.5 I/O Transaction The Intel XScale® core can request an I/O transaction by asserting xsoCBI_IO concurrently with xsoCBI_Req. The value of xsoCBI_IO is undefined when xsoCBI_Req is not asserted. When the gasket sees an I/O request with xsoCBI_IO asserted, it will raise xsiCBR_Ack but will not acknowledge future requests until the IO transaction is complete. The gasket will check if all of the command FIFOs and write data FIFOs are empty or not. It will also check if the command counters (SRAM and DRAM) are equal to 0. All of these checks are to guarantee that: • Writes are issued to the target, and targets have pulled the data. • Pending reads have their data all back to the gasket. When the gasket sees that all of the conditions are satisfied, it will assert xsiCBR_SynchDone to the Intel XScale® core. XsiCBR_SynchDone is one cycle long and does not need to coincide with xsiCBR_DataValid. 3.11.6 Hash Access Hash accesses are accomplished by the gasket Local_CSR accesses from the Intel XScale® core. There are two sets of registers in the gasket that are involved in Hash accesses. • Four 32-bit XG_GCSR_Hash[3:0] registers for holding the data to be hashed and index returned as well. • A XG_GCSR_CTR0(valid) register to hold the status of the Hash Access. The procedure for the Intel XScale® core to setup a Hash access is as follows. 1. The Intel XScale® core writes data to XG_GCSR_Hash by Local_CSR access, using address [X:yy:zz]. X selects Hash register set, yy selects hash_48, hash_64, or hash_128 mode, and zz selects one of four Hash_Data registers. 2. The data write order is 3-2-1-0 (for hash_128) and 1-0 (for hash_48 or hash_64). When the data write to Hash_Data[0] is performed, it triggers the Hash request to go out on the CPP bus. At the same time, XG_GCSR_Hash(valid) is cleared by hardware. 3. The Intel XScale® core starts to poll Hash_Result_Valid periodically by Local_CSR read. 4. After a period of time, the Hash_Result is returned to XG_GCSR_Hash, and XG_GCSR_CTR0(valid) is set to indicate that Hash_Result is ready to be retrieved. 5. The Intel XScale® core issues a Local_CSR read to read back the Hash_Result. Note: Each Hash command requests only one index returned. The Hash CSR is in the gasket local CSR space. See Section 3.11.7. 130 Hardware Reference Manual Intel® IXP2800 Network Processor Intel XScale® Core 3.11.7 Gasket Local CSR There are two sets of Control and Status registers residing in the gasket Local CSR space. ICSR refers to the Interrupt CSR. The ICSR address range is 0xd600_0000 – 0xd6ff_ffff. The Gasket CSR (GCSR) refers to the Hash CSRs and debug CSR. It has a range of 0xd700_0000 – 0xd7ff_ffff. GCSR is shown in Table 51. Note: The Gasket registers are defined in the IXP2400 and IXP2800 Network Processor Programmer’s Reference Manual. Table 51. GCSR Address Map (0xd700 0000) Bits Name R/W Description Hash word 0 [31:0] XG_GCSR_HASH0 R/W Write from Intel XScale® core. Rd/Wr from CPP. 
Hash word 1 [31:0] XG_GCSR_HASH1 R/W Write from Intel XScale® core. Rd/Wr from CPP. Address Offset 0x00—for 48-bit Hash 0x10—for 64-bit Hash 0x20—for 128-bit Hash 0x04—for 48-bit Hash 0x14—for 64-bit Hash 0x24—for 128-bit Hash Hash word 2 [31:0] XG_GCSR_HASH2 R/W Write from Intel XScale® core. 0x28—for 128-bit Hash Rd/Wr from CPP. Hash word 3 [31:0] XG_GCSR_HASH3 R/W Write from Intel XScale® core. 0x2c—for 128-bit Hash Rd/Wr from CPP. [31:1] reserved. [0] hash valid flag. [31:0] XG_GCSR_CTR0 R Read from Intel XScale® core. 0x30 Set by LCSR control. [31:1] reserved. [0] Break_Function [31:0] XG_GCSR_CTR1 R/W When set to 1, the debug break signal is used to stop the clocks. 0x3c When set to 0, the debug break signal is used to cause an Intel XScale® core debug breakpoint Hardware Reference Manual 131 Intel® IXP2800 Network Processor Intel XScale® Core 3.11.8 Interrupt The Intel XScale® core CSR controller contains local CSR(s) and interrupts inputs from multiple sources. The diagram in Figure 28 shows the flow through the controller. Within the Interrupt/CSR register block there are raw status registers, enable registers, and local CSR(s). The raw status registers are the un-masked interrupt status. These interrupt status are masked or steered to theIntel XScale® core’s IRQ or FIQ inputs by multiple levels of enable registers. Refer to Figure 29. • {IRQ,FIQ}Status = (RawStatus & {IRQ,FIQ}Enable) • {IRQ,FIQ}ErrorStatus = (ErrorRawStatus & {IRQ,FIQ}ErrorEnable) • {IRQ,FIQ}ThreadStatus_$_# = ({IRQ,FIQ}ThreadRawStatus_$_# & {IRQ,FIQ}ThreadEnable_$_#) Each interrupt input is visible in the RawStatus register and is masked or steered by two level of interrupt enable registers. The error and thread status are masked by one level of enable registers. Their combination along with other interrupt sources contributes to the RawStatusReg. The RawStatus is masked via IRQEnable/FIQEnable to trigger the IRQ and FIQ interrupt to the Intel XScale® core. The enable register’s bits are set and cleared through EnableSet and EnableClear registers. The Status, RawStatus, and Enable registers are read-only, and EnableSet and EnableClear are write-only. Also, Enable and EnableSet share the same address for reads and writes respectively. Note that software needs to take into account the delay between the clearing of an interrupt condition and having its status updated in the RawStatus registers. Also in the case of simultaneous writes to the same registers, the value of the last write is recorded. Figure 28. Flow Through the Intel XScale® Core Interrupt Controller IRQ FIQ CAP_CSR_WR_ADDR CAP_CSR_WR CSR Decode Interrupt/ CSR Registers cbrData From cbiAdr CAP_CSR_WR_DATA From cbiData A9698-01 132 Hardware Reference Manual Intel® IXP2800 Network Processor Intel XScale® Core Figure 29. Interrupt Mask Block Diagram {Error,Thread}RawStatusReg {Error,Thread}RawStatus {Error,Thread}RQEnReg IRQ{Error,Thread}Status {Error,Thread}FIQEnReg FIQ{Error,Thread}Status RawStatusReg Interrupt{Error,Thread}RawStatus Interrupts,IRQ{Error,Thread}Status IRQEnReg To IRQ Interrupts,FIQ{Error,Thread}Status FIQEnReg Other Enabled Interrupts To FIQ A9699-01 Hardware Reference Manual 133 Intel® IXP2800 Network Processor Intel XScale® Core 3.12 Intel XScale® Core Peripheral Interface This section describes the Intel XScale® core Peripheral Interface unit (XPI). The XPI is the block that connects to all the slow and serial interfaces that communicate with the Intel XScale® core through the APB. 
These can also be accessed by the Microengines and PCI unit. This section does not describe the Intel XScale® core interface protocol, only how to interface with the peripheral devices connected to the core. The I/O units described are: • • • • UART Watchdog timers GPIO Slowport All the peripheral units are memory mapped from the Intel XScale® core point of view. 3.12.1 XPI Overview Figure 30 shows the XPI location in the IXP2800 Network Processor. The XPI receives read and write commands from the Command Push Pull bus to addresses the memory has mapped to I/O devices. The SHaC (Scratchpad, Hash Unit, and CSRs) acts like a bridge to control the access from the Intel XScale® core or other host (like the PCI Unit). The extended APB is used to communicate between the XPI and the SHaC. The extended APB has only one signal, APB_RDY, added. This signal is used to tell the SHaC when the transaction should be terminated. The XPI is responsible for passing the data between the extended APB and the internal blocks, like the UART, GPIO, Timer, and Slowport, which will in turn pass these data to an external peripheral device with a corresponding bus format. The XPI is always a master on the Slowport bus and all the Slowport devices act like slaves. On the other side, the SHaC is always the master and the XPI is the slave with respect to the APB. 134 Hardware Reference Manual Intel® IXP2800 Network Processor Intel XScale® Core Figure 30. XPI Interfaces for IXP2800 Network Processor Intel® IXP2400/2800 Network Processor XPI [7:0]/[15:0]/[31:0] UART Intel XScale® Core Reset Sequential Logic [31:0] GPIO APB Bus CPP Bus PCI SlowPort rx,tx SONET/SDH Microprocessor Interface [7:0] Demultiplexor SHaC [3:0] [7:0] PROM Timer watchdog_reset B1740-02 3.12.1.1 Data Transfers The current rate for data transfers is four bytes, except for the Slowport. The 8-bit and 16-bit accesses are only available in the Slowport bus. The devices connected to the Slowport dictate this data width. The user has to configure the data width register resident in the Slowport to run a different type of data transaction. There is no burst to Slowport. 3.12.1.2 Data Alignment For all the CSR accesses, a 32-bit data bus is assumed. Therefore, the lower two bits of the address bus are ignored. For the Slowport accesses, 8-, 16-, or 32-bit data access is dictated by the external device connected to the Slowport. The APB Bus should be able to match the data width according to which devices it is talking to. SeeTable 52 for additional details on data alignment. Hardware Reference Manual 135 Intel® IXP2800 Network Processor Intel XScale® Core Table 52. Data Transaction Alignment Interface Units APB Bus Read Write GRegs 32 bits 32 bits 32 bits UART 32 bits 32 bits 32 bits GPIO 32 bits 32 bits 32 bits Timer 32 bits 32 bits 32 bits Slowport Microprocessor Access Slowport 1 Flash Memory Access CSR Access 1. 3.12.1.3 8 bits 8 bits 8 bits 16 bits 16 bits 16 bits 32 bits 32 bits 32 bits Assemble 8 bits into 32-bit data for 32-bit read mode; 8 bits for register read mode (8-bit read mode). 8 bits 32 bits 32 bits 32 bits for 32-bit read mode, 8 bits for register read mode; 8 bits for write; 32 bits The flash memory interface only supports 8-bit wide flash devices. APB write transactions are assumed to be 8-bits wide, and correspond to one write cycle at the flash interface. APB read transactions are assumed to be 32-bits wide, and correspond to four flash read cycles for the 32-bit read mode set in the SP_FRM register. 
However, for the flash register read mode (8-bit read mode), it only needs one flash read cycle of 8-bit data and passes it back to APB directly. By default, the 32-bit read mode is set. It is advisable to stay in this mode most of the time and not change them dynamically during accesses. Address Spaces for XPI Internal Devices Table 53 shows the address space assignment for XPI devices. Table 53. Address Spaces for XPI Internal Devices 136 Units Starting Address Ending Address GPIO 0xC0010000 0xC0010040 TIMER 0xC0020000 0xC0020040 UART 0xC0030000 0xC003001C PMU 0xC0050000 0xC0050E00 Slowport CSR 0xC0080000 0xC0080028 Slowport Device 0xC4000000 0xC7FFFFFF Hardware Reference Manual Intel® IXP2800 Network Processor Intel XScale® Core 3.12.2 UART Overview The UART performs serial-to-parallel conversion on data characters received from a peripheral device and parallel-to-serial conversion on data characters received from the network processor. The processor can read the complete status of the UART at any time during the functional operation. Available status information includes the type and condition of the transfer operations being performed by the UART and any error conditions (parity, overrun, framing, or break interrupt). The serial ports can operate in either FIFO or non-FIFO mode. In FIFO mode, a 64-byte transmit FIFO holds data from the processor to be transmitted on the serial link and a 64-byte receive FIFO buffers data from the serial link until read by the processor. The UART includes a programmable baud rate generator that is capable of dividing the clock input by divisors of 1 to 216 - 1 and produces a 16X clock to drive the internal transmitter logic. It also drives the receive logic. The UART has a processor interrupt system. The UART can be operated in polled or in interrupt driven mode as selected by software. The UART has the following features • Functionally compatible with National Semiconductor*’s PC16550D for basic receive and transmit. • Adds or deletes standard asynchronous communications bits (start, stop, and parity) to or from the serial data • Independently controlled transmit, receive, line status • Programmable baud rate generator allows division of clock by 1 to (216 - 1) and generates an internal 16X clock • • • • • • • • • 5-, 6-, 7-, or 8-bit characters Even, odd, or no parity detection 1, 1½, or 2 stop bit generation Baud rate generation False start bit detection 64-byte Transmit FIFO 64-byte Receive FIFO Complete status reporting capability Internal diagnostic capabilities include: — Break — Parity — Overrun — Framing error simulation • Fully prioritized interrupt system controls Hardware Reference Manual 137 Intel® IXP2800 Network Processor Intel XScale® Core 3.12.3 UART Operation The format of a UART data frame is shown in Figure 31. Figure 31. UART Data Frame Start Bit Data <0> Data <1> Data <2> Data <3> Data <4> Data <5> Data <6> Data <7> Parity Bit Stop Bit 1 Stop Bit 2 TXD or RXD pin LSB MSB Notes: Receive data sample counter frequency = 16x bit frequency, each bit is sampled three times in the middle. Shaded bits are optional and can be proammed by users. B1741-02 Each data frame is between 7 bits and 12 bits long depending on the size of data programmed, if parity is enabled and if two stop bits is selected. The frame begins with a start bit that is represented by a high to low transition. Next, either 5 to 8 bits of data are transmitted, beginning with the least significant bit. 
An optional parity bit follows, which is set if even parity is enabled and an odd number of ones exist within the data byte, or if odd parity is enabled and the data byte contains an even number of ones. The data frame ends with one, one and a half, or two stop bits as programmed by the user, which is represented by one or two successive bit periods of a logic one. 3.12.3.1 UART FIFO OPERATION The UART has one transmit FIFO and one receive FIFO. The transmit FIFO is 64 bytes deep and eight bits wide. The receive FIFO is 64 bytes deep and 11 bits wide. 3.12.3.1.1 UART FIFO Interrupt Mode Operation – Receiver Interrupt When the Receive FIFO and receiver interrupts are enabled (UART_FCR[0]=1 and UART_IER[0]=1), receiver interrupts occur as follows: • The receive data available interrupt is invoked when the FIFO has reached its programmed trigger level. The interrupt is cleared when the FIFO drops below the programmed trigger level. • The UART_IIR receive data available indication also occurs when the FIFO trigger level is reached, and like the interrupt, the bits are cleared when the FIFO drops below the trigger level. • The receiver line status interrupt (UART_IIR = C6H), as before, has the highest priority. The receiver data available interrupt (UART_IIR=C4H) is lower. The line status interrupt occurs only when the character at the top of the FIFO has errors. • The data ready bit (DR in UART_LSR register) is set to 1 as soon as a character is transferred from the shift register to the Receive FIFO. This bit is reset to 0 when the FIFO is empty. 138 Hardware Reference Manual Intel® IXP2800 Network Processor Intel XScale® Core Character Time-out Interrupt When the receiver FIFO and receiver time-out interrupt are enabled, a character time-out interrupt occurs when all of the following conditions exist: • At least one character is in the FIFO. • The last received character was longer than four continuous character times ago (if two stop bits are programmed the second one is included in this time delay). • The most recent processor read of the FIFO was longer than four continuous character times ago. The maximum time between a received character and a time-out interrupt is 160 ms at 300 baud with a 12-bit receive character (i.e., 1 start, 8 data, 1 parity, and 2 stop bits). When a time-out interrupt occurs, it is cleared and the timer is reset when the processor reads one character from the receiver FIFO. If a time-out interrupt has not occurred, the time-out timer is reset after a new character is received or after the processor reads the receiver FIFO. Transmit Interrupt When the transmitter FIFO and transmitter interrupt are enabled (UART_FCR[0]=1, UART_IER[1]=1), transmit interrupts occur as follows: • The Transmit Data Request interrupt occurs when the transmit FIFO is half empty or more than half empty. The interrupt is cleared as soon as the Transmit Holding register is written (1 to 64 characters may be written to the transmit FIFO while servicing the interrupt) or the IIR is read. 3.12.3.1.2 FIFO Polled Mode Operation With the FIFOs enabled (TRFIFOE bit of UART_FCR set to 1), setting UART_IER[4:0] to all 0s puts the serial port in the FIFO polled mode of operation. Since the receiver and the transmitter are controlled separately, either one or both can be in the polled mode of operation. In this mode, software checks receiver and transmitter status via the UART_LSR. As stated in the register description: • UART_LSR[0] is set as long as there is one byte in the receiver FIFO. 
• UART_LSR[1] through UART_LSR[4] specify which error(s) has occurred for the character at the top of the FIFO. Character error status is handled the same way as interrupt mode. The UART_IIR is not affected since UART_IER[2] = 0. • UART_LSR[5] indicates when the transmitter FIFO needs data. • UART_LSR[6] indicates that both the transmitter FIFO and shift register are empty. • UART_LSR[7] indicates whether there are any errors in the receiver FIFO. 3.12.4 Baud Rate Generator The baud rate generator is a programmable block and generates a clock used in the transmit block. The output frequency of the baud rate generator is 16X the baud rate; baud rate is calculated as: BaudRate = APB Clock / (16 X Divisor) The Divisor ranges from 1 to 216 - 1. For example, for an APB clock of 1 MHz and a baud rate of 300 bps, the divisor is 209. Hardware Reference Manual 139 Intel® IXP2800 Network Processor Intel XScale® Core 3.12.5 General Purpose I/O (GPIO) The IXP2800 Network Processor has eight General Purpose Input/Output (GPIO) port pins for use in generating and capturing application-specific input and output signals. Each pin is programmable as an input or output or as an interrupt signal sourcing from an external device. The GPIO can be used with appropriate software in I2C application. Each GPIO pin can be configured as a input or an output by programming the corresponding GPIO pin direction register. When programmed as an input, the current state of the GPIO can be read through the corresponding GPIO pin level register. The register can be read at any time and can be used to confirm the state of the pin when it is configured as an output. In addition, each GPIO pin can be programmed to detect a rising or a falling edge by setting the corresponding GPIO rising/ falling edge detect registers. When configured as an output, the pin can be controlled by writing to the GPIO set register to write a 1 and by writing to the GPIO clear register to write a 0. These registers can be written regardless of whether the pin is configured as an input or a output. Each of the GPIO pins is designed the same and instantiated to the number of GPIO port pins. Figure 32 shows a GPIO functional diagram. The GPIO pin as seen can be programmed based on the configuration registers. Figure 32. GPIO Functional Diagram Pin direction set/clear/prog register Decode Logic Pin set/clear/ prog register APB Bus GPIO Pin Edge detect status register Pin Level Register Edge detect logic Rising/Falling edge detect enable register A9701-01 140 Hardware Reference Manual Intel® IXP2800 Network Processor Intel XScale® Core 3.12.6 Timers The IXP2800 Network Processor supports four timers. These timers are clocked by the Advanced Peripheral/Bus Clock (APB-CLK), which runs at 50 MHz to produce the PLPL_APB_CLK, PLPL_APB_CLK/16, or PLPL_APB_CLK/256 signals. The counters are loaded with an initial value, count down to 0, and raise an interrupt (if interrupts are not masked). In addition, timer 4 can be used as a watchdog timer when the watchdog enable bits are configured to 1. When used as a watchdog timer, and when a count of 0 is encountered, it will initiate the reset sequence. Figure 33 shows the timer control unit interfacing with other functional blocks. Figure 33. Timer Control Unit Interfacing Diagram CPP ME SHaC APB bus Intel XScale® Core Timer1,2,3,4 Timers gpio[3:0] GPIO Watchdog Reset * Intel® IXP2800 Network Processor A9702-04 3.12.6.1 Timer Operation Each timer consists of a 32-bit counter. 
By default, the timer counter load register (TCLD) is set to 0xFFFFFFFF. The timer will count down from 0xFFFFFFFF to 0x00000000, then wrap back to 0xFFFFFFFF and continue to decrement if the TCLD is not programmed to any value. If a different value is programmed in the TCLD, then the counter will load this value every time it counts down to 0. An interrupt is issued to the Intel XScale® core whenever the counter reaches 0. The interrupt signals can be enabled or disabled by the IRQEnable/FIQEnable registers. The interrupt remains asserted until it is cleared by writing a 1 to the corresponding timer clear register (TCLR). The counter can be advanced by the clock, clock divided by 16, clock divided by 256, and the GPIO signals. The clock rate is controlled by the TCTL value programmed into the TCTL registers. There are four gpio signals, GPIO[3:0] that correspond to Timer 1, 2, 3, and 4, respectively. These signal are synchronized within the timer-clock domain before driving the counter. Hardware Reference Manual 141 Intel® IXP2800 Network Processor Intel XScale® Core Figure 34 shows the Timer Internal logic. Figure 34. Timer Internal Logic Diagram Timer Registers Block TCTL TCLD WRITE_DATA Timer Control Logic TCLR TWDE TCSR READ_DATA APB_SEL APB_WR ADDRESS Decoder & Control Logic Watchdog Watchdog Logic Reset ENABLE CLK Divided by 16 Counter Logic Interrupts Divided by 16 GP_TM[3:0] A9703-01 3.12.7 Slowport Unit The IXP2800 Network Processor Slowport Unit supports basic PROM access and 8-, 16-, and 32-bit microprocessor device access. It allows a master, (Intel XScale® core or Microengine), to do a read/ write data transfer to these slave devices. The address bus and data bus are multiplexed to reduce the pin count. In addition, the address bus is also compressed from A[25:0] down to A[7:0] and shifted out with three clock cycles. Therefore, an external set of buffers is needed for address storage and latch. The access can be asynchronous. Insertion of delay cycles is possible for both setup and hold data. A programmable timing control mechanism is provided for this purpose. There are two types of interfaces supported in the Slowport Unit: • Flash memory interface • Microprocessor interface. 142 Hardware Reference Manual Intel® IXP2800 Network Processor Intel XScale® Core The Flash memory interface is used for the PROM device. The microprocessor interface can be used for SONET/SDH Framer microprocessor access. There are two ports in the Slowport unit. The first is dedicated to the flash memory device while the second to the microprocessor device. 3.12.7.1 PROM Device Support For all the Flash Memory access, only 8-bit devices are supported. APB write transactions are assumed to be 8-bits wide, and correspond to one write cycle at the flash interface. The extended APB read transactions are assumed to be 32-bits wide, and correspond to four read cycles at the flash memory interface for all the flash memory data read. However, for the flash register read inside the flash memory, like the flash status register, the returned data is one byte and placed in the lower order byte location. In this case, only one external transaction cycle is involved. To accomplish this, a register (SP_FRM) is installed to allow to configure between 8-bit read mode and 32-bit read mode. By default, it goes to 32-bit read mode. For the 8-bit read mode, one read cycle is involved. No packing process is needed. The data will be directly placed onto the lower order byte, [7:0] and passed to APB. 
For the 32-bit read mode, it needs four read cycles. All 4 bytes are packed into a 32-bit data and passed to the APB. 16-bit mode is not supported for read. Write always accesses the flash with one 8-bit cycle. Therefore, no unpacking process is needed. The PROM devices supported are listed in Figure 54: Table 54. 8-Bit Flash Memory Device Density 3.12.7.2 Vendor Part Number Size Intel 28F128J3A 16 MB Intel 28F640J3A 8 MB Intel 28F320J3A 4 MB Microprocessor Interface Support for the Framer The Slowport Unit also supports a microprocessor interface with Framer components. Some supported devices are listed in Table 55: Table 55. SONET/SDH Devices (Sheet 1 of 2) Vendor Part Number Microprocessor Interface SP_PCR register Setting DW Setting in SP_ADC register PMC-Sierra* PM3386 16 bits 0x3 0x1 PMC-Sierra* PM5345 8 bits 0x2 0x0 PMC-Sierra* PM5346 8 bits 0x2 0x0 PMC-Sierra* PM5347 8 bits 0x2 0x0 PMC-Sierra* PM5348 8 bits 0x2 0x0 PMC-Sierra* PM5349 8 bits 0x2 0x0 PMC-Sierra* PM5350 8 bits 0x2 0x0 PMC-Sierra* PM5351 8 bits 0x2 0x0 PMC-Sierra* PM5352 8 bits 0x2 0x0 Hardware Reference Manual 143 Intel® IXP2800 Network Processor Intel XScale® Core Table 55. SONET/SDH Devices (Sheet 2 of 2) Vendor Part Number Microprocessor Interface SP_PCR register Setting DW Setting in SP_ADC register PMC-Sierra* PM5355 8 bits 0x2 0x0 PMC-Sierra* PM5356 8 bits 0x2 0x0 PMC-Sierra* PM5357 8 bits 0x2 0x0 PMC-Sierra* PM5358 16 bits 0x2 0x1 PMC-Sierra* PM5381 16 bits 0x2 0x1 PMC-Sierra* PM5382 8 bits 0x2 0x0 PMC-Sierra* PM5386 16 bits 0x2 0x1 AMCC* S4801 (AMAZON) 8 bits 0x0 0x0 AMCC* S4803 (YUKON) 8 bits 0x0 0x0 AMCC* S4804 (RHINE) 8/16 bits 0x0/0x3 0x0/0x1 Intel IXF6012 (Volga) 16 bits 0x3/0x41 0x1 16 bits 0x3/0x4 1 0x1 0x3/0x4 1 — Intel IXF6048 (Amazon-A) Intel 1. 3.12.7.3 Centaur — 1 Intel IXF6501 — 0x3/0x4 — Intel Ben Nevis 32 bits 0x3/0x41 0x2 Lucent* TDAT042G5 16 bits 0x1/ 0x1 Lucent* TDAT04622 16 bits 0x1 0x1 Lucent* TDAT021G2 16 bits 0x1 0x1 Usually there are two different protocols, Intel or Motorola*, of microprocessor interface in the Intel framer; the setting in the PCR should match with protocols activated in the framer. Slowport Unit Interfaces Figure 35 shows the Slowport unit interface diagram. Figure 35. Slowport Unit Interface Diagram SHaC Intel XScale® Core PROM/ FLASH APB bus CPP PCI SP_INT SlowPort SlowPort Address/ Data Convertor Peripherals A9704-02 144 Hardware Reference Manual Intel® IXP2800 Network Processor Intel XScale® Core 3.12.7.4 Address Space The total address space is defined as 64 Mbytes, which is further divided into two segments of 32 Mbytes each. Two devices can be connect to this bus. If these peripheral devices have a density of 256 Mbits (32 Mbytes) each, all the address space is going to be filled like a contiguous address space. However, if a small capacity device is used (like a device of 4, 8, or 16 MBytes), there will be a memory hole left in between these two devices. Figure 36 is a 4-Mbyte memory example. Trying to read the space in between, you will get the repeating value for each 4-Mbyte location Figure 36. Address Space Hole Diagram 3FFFFFFh 23FFFFFh 2000000h 4 MB 03FFFFFh 0000000h 4 MB A9705-01 3.12.7.5 Slowport Interfacing Topology Figure 37 demonstrates one of the topologies used to connect to an 8-bit device. From the diagram, we can observe that address is shifted out eight bits at a time and buffered into three 74F377 or equivalent tri-state buffer devices in three consecutive clock cycles. 
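As an illustration of the address phasing shown in Figures 38 through 41 — SP_A[1:0] carries A[1:0] for the whole transaction, while A[9:2], A[17:10], and A[24:18] are shifted out on SP_AD[7:0] during three consecutive SP_ALE_L cycles and latched into the external 74F377 buffers — the minimal C sketch below splits a device address into those fields. The function name is hypothetical, and selection between the two 32-Mbyte segments is assumed to be made by the SP_CS_L[1:0] chip selects as drawn in Figure 37.

#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical illustration of the Slowport address phases: A[1:0] is
 * driven on SP_A[1:0] for the whole transaction, while A[9:2], A[17:10],
 * and A[24:18] are shifted out on SP_AD[7:0] in three consecutive
 * SP_ALE_L cycles and latched into the external 74F377 buffers.
 * Segment (device) selection is assumed to come from SP_CS_L[1:0].
 */
static void sp_address_phases(uint32_t addr)
{
    uint8_t sp_a   = addr & 0x3;            /* SP_A[1:0]                 */
    uint8_t phase0 = (addr >> 2)  & 0xFF;   /* A[9:2],   first ALE cycle */
    uint8_t phase1 = (addr >> 10) & 0xFF;   /* A[17:10], second ALE cycle*/
    uint8_t phase2 = (addr >> 18) & 0x7F;   /* A[24:18], third ALE cycle */

    printf("SP_A[1:0]=0x%x  SP_AD phases: 0x%02x 0x%02x 0x%02x\n",
           sp_a, phase0, phase1, phase2);
}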
These buffers also output separately to form a 25-bit wide address bus to address the 8-bit devices. The data are expected to be driven out after the address has been placed into the buffers. There are two devices shown in Figure 37. The top one is the fix-timed device, while the lower one, self-timing device. For the self-timing device, the access latency depends on the SP_ACK_L responded back from this device. Three extra signals, SP_CP, SP_OE_L, and SP_DIR, are added to pack and unpack the data when a 16-bit or 32-bit device is hooked up to Slowport. They are used for special application only as described below. Hardware Reference Manual 145 Intel® IXP2800 Network Processor Intel XScale® Core Figure 37. Slowport Example Application Topology SP_RD_L OE_L SP_WR_L WE_L SP_CS_L[0] CS_L SP_CS_L[1] SP_A[1:0] A[1:0] SP_AD[7:0] D[7:0] A[24:2] SP_ALE_L SP_CLK Intel® IXP2800 Network Processor Clock Driver CY2305 CE# D[7:0] CP Q[7:0] CE# D[7:0] CP Q[7:0] 74f377 A[24:18] A[24:2] 74f377 OE_L WE_L A[17:10] CE# D[7:0] CP Q[7:0] 74f377 CS_L A[1:0] D[7:0] A[9:2] SP_ACK_L ACK_L A9318-02 3.12.7.6 Slowport 8-Bit Device Bus Protocols The write/read transfer protocols are discussed in the following sections. The burst transfers are going to be broken down into single mode transfer. For each single write/read transaction, it can be either fixed-timed transaction or self-timing transaction. The fixed-timed transaction has the response fixed in a certain period, that can be controlled by the timing control registers. For the self-timing transaction, the response timing is dictated by the peripheral device. Hence, wait states can be inserted during the transaction. All the back-to-back transactions are intervened with one clock cycle. The Slowport clock, SP_CLK, shown in the following waveform diagrams, is generated by dividing the PLPL_APB_CLK. The divisor used is specified in the clock control register, SP_CCR. 146 Hardware Reference Manual Intel® IXP2800 Network Processor Intel XScale® Core 3.12.7.6.1 Mode 0 Single Write Transfer for Fixed-Timed Device Figure 38, shows the single write transfer for a fixed-timed device with the CSR programmed to a value of setup=4, pulse width=0, and hold=2, followed by another read transfer. Figure 38. Mode 0 Single Write Transfer for a Fixed-Timed Device 0 2 4 6 8 10 12 14 16 18 20 SP_CLK SP_ALE_L SP_CS_L [1:0] SP_WR_L SP_RD_L SP_A[1:0] SP_AD[7:0] A[1:0] 9:2 17:10 24:18 D[7:0] 9:2 17:10 24:18 A9706-02 The transaction is initiated with SP_ALE_L asserted. It latches the address from the SP_AD[7:0] bus into the external buffer, using three clock cycles. After that, it should deassert the SP_ALE_L to disable latching the address into the buffers. The SP_A[1:0] signals span the whole transaction cycle. For the write, it drives the data onto the SP_AD[7:0]. Meanwhile, it asserts the SP_CS_L[1:0] signals. Depending on the timing control setup parameter, for this case, the SP_WR_L is not asserted until four clock cycles have elapsed. The SP_CS_L[1:0] signals are deasserted two clocks after the SP_WR_L is deasserted. Hardware Reference Manual 147 Intel® IXP2800 Network Processor Intel XScale® Core 3.12.7.6.2 Mode 0 Single Write Transfer for Self-Timing Device Figure 39 depicts the single write transfer for a self-timing device with the CSR programmed to setup=4, pulse width=0, and hold=3. Similarly, a read transaction is attached behind. Figure 39. 
Mode 0 Single Write Transfer for a Self-Timing Device 0 2 4 6 8 10 12 14 16 18 20 SP_CLK SP_ALE_L SP_CS_L [1:0] SP_WR_L SP_RD_L SP_A[1:0] SP_AD[7:0] A[1:0] 9:2 17:10 24:18 D[7:0] 9:2 17:10 24:18 SP_ACK_L A9707-03 Similar to the single write for fixed-timed device, the ALE_L, CS_L[1:0], AD[7:0], and A[1:0] follow the same pattern, and the timing is controlled by the timing control register — except for the WR_L, which is terminated depending on the SP_ACK_L returned from the self-timing device. The time-out counter will be set to 255. If no SP_ACK_L responds back when the time-out counter reaches 0, the transaction is terminated with a time-out. An interrupt signal is issued to the bus master simultaneously with the time-out register update. 148 Hardware Reference Manual Intel® IXP2800 Network Processor Intel XScale® Core 3.12.7.6.3 Mode 0 Single Read Transfer for Fixed-Timed Device Figure 40 demonstrates the single read transfer issued to a fixed-timed PROM device followed by another write transaction. The CSR is assumed to be configured to the value setup=2, pulse width=10, and hold=1. Figure 40. Mode 0 Single Read Transfer for Fixed-Timed Device 0 2 4 6 8 10 12 14 16 18 20 SP_CLK SP_ALE_L SP_CS_L [1:0] SP_WR_L SP_RD_L SP_A[1:0] SP_AD[7:0] A[1:0] A9:2 A A 17:10 24:18 D[7:0] A9708-02 The address is loaded onto the external buffer in three clock cycles with the ALE_L asserted. Then, a clock cycle is inserted to tri-state all the AD[7:0] signals. The CS_L[1:0] signals come asserted on the fourth clock cycle. Now, the values stored in the timing control registers take effect. The RD_L is asserted after two clock cycles. It keeps asserted for ten clock cycles. The CS_L[1:0] should be de-asserted one clock cycle after RD_L is de-asserted. The data will be valid at clock cycle 16 as shown in the diagram. Since the hold delay has two cycles, and the transaction is terminated at clock cycle 16. Hardware Reference Manual 149 Intel® IXP2800 Network Processor Intel XScale® Core 3.12.7.6.4 Single Read Transfer for a Self-Timing Device Figure 41 demonstrates the single read transfer issued to a self-timing PROM device followed by another write transaction. The CSR assumed to be programmed to the value of setup=4, pulse width=0, and hold=2. Figure 41. Mode 0 Single Read Transfer for a Self-Timing Device 0 2 4 6 8 10 12 14 16 18 20 SP_CLK SP_ALE_L SP_CS_L [1:0] SP_WR_L SP_RD_L SP_A[1:0] SP_AD[7:0] A[1:0] 9:2 17:10 24:18 D[7:0] 9:2 17:10 24:18 D[7:0] SP_ACK_L A9709-01 The only difference for self-timed mode is in the SP_ACK_L signal. It has a dominant effect on the length of the transaction cycle or it overrides the value in the timing control register. A time-out counter is set to 256. The SP_ACK_L should arrive before the time-out counter counts down to 0. Similarly to the single write for self-timing device, an interrupt is launched for the time-out event and the time-out register is updated. In this case, the data will be sampled at clock cycle 12. 3.12.7.7 SONET/SDH Microprocessor Access Support To support the SONET/SDH Microprocessor Interface, extra logic is added into this unit. Here we consider three SONET/SDH available components, including the Lucent* TDAT042G5, PMC-Sierra* PM5351, Intel, and AMCC* SONET/SDH devices. However, because these microprocessor interfaces are not standardized, we treat them separately and a configuration register is installed to activate the bus to work with different interface protocol at a time. Extra pins are also added to accomplish this task. 
A microprocessor interface type register is used to provide these kinds of solutions. The user is allowed to configure the interface to the following four different modes. The pin functionality and the interface protocol will be changed accordingly. By default, it activates the mode 0 with 8-bit generic PROM device support as mentioned above. 150 Hardware Reference Manual Intel® IXP2800 Network Processor Intel XScale® Core 3.12.7.7.1 Mode 1: 16-Bit Microprocessor Interface Support with 16-Bit Address Lines The address size control register is programmed to 16-bit address space for this case. This mode is designated for the devices with the similar protocol with the Lucent* TDAT042G5 SONET/SDH device. 16-Bit Microprocessor Interfacing Topology with 16-Bit address lines Figure 42 shows a solution for the 16-bit microprocessor interface. This solution bridges the Lucent* TDAT042G5 SONET/SDH 16-bit interface. From Figure 42, we observe that the control pins SP_RD_L and SP_WR_L are converted to R/W and ADS. The CS and DT are still compactible with SP_CS_L[1] and SP_ACK_L protocol. Extra pins are added to accomplish the task of multiplexing and demultiplexing the data bus. The total pin count is 18. During the write cycle, 8-bit data are stacked into 16-bit data. They are first shifted into two tri-state buffers, 74F646 or equivalent by SP_CP, using two consecutive clock cycle; then the SP_CS_L is used for output of the 16-bit data, which is shared with the CS. During the read cycle, the 16-bit data are unpacked into 8-bit data by SP_CP. Two 74F646 or equivalent tri-state buffers are used. First, the 16-bit data are stored into these buffers. Then they are shifted out by SP_DIR, using two consecutive clock cycles. Hardware Reference Manual 151 Intel® IXP2800 Network Processor Intel XScale® Core Figure 42. An Interface Topology with Lucent* TDAT042G5 SONET/SDH SP_RD_L R/W# SP_WR_L ADS# SP_CS_L[1] CS# SP_ACK_L DT# SP_AD[7:0] SP_ALE_L CE# D[7:0] CP SP_CLK Clock Driver CY2305 ADDR[16:0] ADDR[16] CE# D[7:0] 74F377 Q[7:0] CP Intel® IXP2000 Network Processor 74F377 Q[7:0] ADDR[15:8] CE# D[7:0] CP 74F377 Q[7:0] Lucent TDAT042G5* ADDR[7:1] 74F646 D[7:0] SP_CP CPAB VCC SAB SBA DATA[15:0] CPBA SP_OE_L OE# DATA[15:8] O[7:0] DIR 74F646 D[7:0] CPAB CPBA SP_DIR OE# DIR VCC SAB SBA DATA[7:8] O[7:0] * Other names and brands may be claimed as property of others. A9370-03 152 Hardware Reference Manual Intel® IXP2800 Network Processor Intel XScale® Core 16-Bit Microprocessor Write Interface Protocol Figure 43 uses the Lucent* TDAT042G5 device. In this case, the user should program the P_PCR register to mode 1 and also program the write timing control register to setup=7, pulse width=5, and hold=1, which represent seven clock cycles for CS, five clock cycles for DT delay, and one clock cycle for ADS. They are intervened with two idle cycles. From Figure 43, we observe that there are a total of twelve clock cycles used for write access, (i.e., 240 ns), not including an intervened turnaround cycle after the write transaction. The throughput is 8.3 Mbytes per second. Figure 43. 
Mode 1 Single Write Transfer for Lucent* TDAT042G5 Device (B0) T0 0 2 4 6 T1 T2 T3 T4 8 10 T5 T6 12 T0 14 T1 16 T2 T3 T4 18 20 SP_CLK SP_ALE_L SP_CS_L[1] /CS# SP_WR_L/ADS# SP_RD_L/R/W# SP_AD[7:0] A A A D [7:0] [15:8] [23:16] [7:0] D[15:8] A [7:0] A A [15:8] [23:16] A [7:0] A [15:0] SP_ACK_L /DT# SP_CP SP_OE_L SP_DIR ADDR[15:0] DATA[15:0] A [7:0] A [15:0] A[23:0] D [7:0] A[23:0] D[15:0] B1742-04 Hardware Reference Manual 153 Intel® IXP2800 Network Processor Intel XScale® Core 16-Bit Microprocessor Read Interface Protocol Figure 44, likewise depicts a single read transaction launched from the IXP2800 Network Processor to the Lucent* TDAT042G5 device followed by a single read transaction. However, in this case the read timing control register has to be programmed to setup=0, pulse width=7, and hold =0. In Figure 44, we can count twelve clock cycles used for the read transaction in total, (i.e., 240 ns) for a clock cycle of 10 ns, excluding a turnaround cycle after that. It has the throughput of 7.7 Mbytes per second. Figure 44. Mode 1 Single Read Transfer for Lucent* TDAT042G5 Device (B0) T0 0 2 T1 4 T2 T3 T4 T5 6 8 T6 T7 10 12 14 16 18 20 22 24 SP_CLK SP_ALE_L SP_CS_L[1] /CS# SP_WR_L/ADS# SP_RD_L/R/W# SP_AD[7:0] A A A [7:0] [15:8] [23:16] D[15:8] D[7:0] A A A D [7:0] [15:8] [23:16] [7:0] D[7:0] SP_ACK_L /DT# SP_CP SP_OE_L SP_DIR ADDR[15:0] DATA[15:0] A [7:0] A [15:0] A [7:0] A[23:0] D[15:0] D[15:0] 2x[15:8] A [15:0] A[23:0] D[7:0] D[15:0] B1746-04 154 Hardware Reference Manual Intel® IXP2800 Network Processor Intel XScale® Core 3.12.7.7.2 Mode 2: Interface with 8 Data Bits and 11 Address Bits This application is designed for the PMC-Sierra* PM5351 S/UNI-TETRA* device. For the PMC-Sierra* PM5351, the address space is programmed to 11 bits; otherwise, other address space should be specified. 8-Bit PMC-Sierra* PM5351 S/UNI-TETRA* Interfacing Topology Figure 45 displays one of the topologies used to connect to the Slowport with the PMC-Sierra* PM5351 S/UNI-TETRA* device. From Figure 45, because the protocols are very close to the generic Slowport protocol, the pin counts and the functionality is quite compatible. We do not need to use any more pins in this case. The only difference is in the INTB signal, which will be connected to the SP_ACK_L. Therefore, the SP_ACK_L needs to be converted to an interrupt signal. Also because the address contains only 11bits, two 74F377 or equivalent buffers are needed. The AS field in the SP_ADC register should be programmed to a 16-bit addressing space with the upper five address bits unconnected. The timing controls are similar to the generic case. Figure 45. An Interface Topology with PMC-Sierra* PM5351 S/UNI-TETRA* VCC ALE SP_RD_L RDB SP_WR_L WRB SP_CS_L[1] CSB SP_ACK_L INTB SP_AD[7:0] DATA[7:0] SP_ALE_L SP_CLK Intel® IXP2800 Network Processor Clock Driver CY2305 CE# D[7:0] CP Q[7:0] CE# D[7:0] CP Q[7:0] 74F377 ADDR[10:8] 74F377 ADDR[10:0] PMC-Sierra* PM5351 ADDR[7:0] A9369-04 Hardware Reference Manual 155 Intel® IXP2800 Network Processor Intel XScale® Core PMC-Sierra* PM5351 S/UNI-TETRA* Write Interface Protocol Figure 46 depicts a single write transaction launched from the IXP2800 to the PMC-Sierra* PM5351 device followed by single read transaction. The write transaction for the PMC-Sierra* component has six clock cycles or a 120-ns access time for a 50-MHz Slowport clock. In this case, no intervening cycle is added after the transaction. The I/O throughput is 8.3 Mbytes per second. 
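The quoted throughput figures follow directly from the cycle count, the 20-ns SP_CLK period at 50 MHz, and the number of bytes moved per transaction. The helper below is only a cross-check of that arithmetic (it does not correspond to any hardware register or interface); it reproduces the 8.3-Mbyte/s figure above and the Mode 1 write figure from Figure 43.

#include <stdio.h>

/*
 * Back-of-the-envelope check of the quoted Slowport throughput numbers.
 * The cycle counts and bytes per transfer are taken from the waveform
 * descriptions; the helper itself is illustrative only.
 */
static double sp_throughput_mbytes(double sp_clk_mhz, unsigned cycles,
                                   unsigned bytes_per_xfer)
{
    double period_ns = 1000.0 / sp_clk_mhz;    /* 20 ns at 50 MHz       */
    double xfer_ns   = cycles * period_ns;     /* transaction duration  */
    return bytes_per_xfer * 1000.0 / xfer_ns;  /* Mbytes per second     */
}

int main(void)
{
    /* PM5351 write: 6 cycles, 1 byte at 50 MHz -> ~8.3 Mbytes/s */
    printf("%.1f Mbytes/s\n", sp_throughput_mbytes(50.0, 6, 1));
    /* Mode 1 (TDAT042G5) write: 12 cycles, 2 bytes -> ~8.3 Mbytes/s */
    printf("%.1f Mbytes/s\n", sp_throughput_mbytes(50.0, 12, 2));
    return 0;
}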
The SP_PCR should be programmed to mode 2 and the fields of SU, PW, and HD in the SP_WTC2 should be set to 1, 2, and 1, respectively. Here, SU, PW, and HD represent the SP_CS_L[1] pulse width, the SP_WR_L pulse width, and the SP_CP pulse width, respectively. Figure 46. Mode 2 Single Write Transfer for PMC-Sierra* PM5351 Device (B0) 0 2 4 6 8 10 12 14 16 18 20 SP_CLK SP_ALE_L SP_CS_L[1]/CSB SP_WR_L/WRB SP_RD_L/RDB SP_AD[7:0] A A [7:0] [10:8] D[7:0] A A [7:0] [10:8] D[7:0] SP_ACK_L/INTB SP_CP SP_OE_L SP_DIR ADDR[15:0] A [7:0] A [10:8] DATA[7:0] A [7:0] A [10:8] A [15:8] A[10:0] D[7:0] A [7:0] A [10:8] A[10:0] D[7:0] B1747-04 PMC-Sierra* PM5351 S/UNI-TETRA* Read Interface Protocol 156 Hardware Reference Manual Intel® IXP2800 Network Processor Intel XScale® Core Figure 47, depicts a single read transaction launched from the IXP2800 Network Processor to the PMC-Sierra* PM5351 device, followed by a single write transaction. In this case, there are ten clock cycles of access time, or 200 ns in total, with three turnaround cycles attached at the back. The I/O throughput is 11.2 Mbytes per second. Figure 47. Mode 2 Single Read Transfer for PMC-Sierra* PM5351 Device (B0) 0 2 4 6 8 10 12 14 16 18 20 22 24 SP_CLK SP_ALE_L SP_CS_L[1]/CSB SP_WR_L/WRB SP_RD_L/RDB SP_AD[7:0] A A [7:0] [10:8] D[7:0] A A [7:0] [10:8] D[7:0] A A [7:0] [10:8] SP_ACK_L/INTB SP_CP SP_OE_L SP_DIR A [15:8] ADDR[15:0] DATA[7:0] A [7:0] A [10:8] A[10:0] D[7:0] A [7:0] A [10:8] A [7:0] A [10:8] A[10:0] D[7:0] A [7:0] A A [10:0] [10:0] A [7:0] A [10:8] B1748-03 3.12.7.7.3 Mode 3: Support for the Intel and AMCC* 2488 Mbps SONET/SDH Microprocessor Interface The user has to configure the address bus to 10 bits. Mode 3 Interfacing Topology Figure 48 demonstrates one of the topologies used to connect the Slowport to the Intel and AMCC* 2488-Mbps SONET/SDH device. Similar to the Lucent* TDAT042G5 interface, the address and the data need demultiplexing. Totally, it requires four buffers to accomplish this task. The SP_RD_L, SP_WR_L, and SP_CS_L[1] entirely match the RDB, WRB, and CSB pins in the Intel and AMCC* component. However, the INT has to be connected to the SP_ACK_L as the PMC-Sierra* Interface does. The ALE pin shares the SP_CP signal. If the timing does not meet specification, then ALE can be tied high as shown in Figure 49. It employs the same method as Lucent*’s TDAT042G5’s topology to pack and unpack the data between the IXP2800 Slowport interface and the Intel and AMCC* microprocessor interface. Hardware Reference Manual 157 Intel® IXP2800 Network Processor Intel XScale® Core For a write, SP_CP loads the data onto the 74F646 (or equivalent) tri-state buffers, using two clock cycles. To reduce the pin count, the 16-bit data is latched with the same pin (SP_CS_L[1]), assuming that a turnaround cycle is inserted between the transaction cycles. For a read, data are shifted out of two 74F646 or equivalent tri-state buffers by SP_CP, using two consecutive clock cycles. Figure 48. An Interface Topology with Intel / AMCC* SONET/SDH Device VCC SP_RD_L RDB SP_WR_L WRB SP_CS_L[1] CSB SP_ACK_L INT MCUTYPE SP_AD[7:0] ALE ADDR[9:0] SP_ALE_L CE# D[7:0] CP SP_CLK Intel® IXP2800 Network Processor Clock Driver CY2305 ADDR[10:8] CE# D[7:0] CP Intel® or AMCC* SONET/SDH ADDR[7:1] VCC 74F646 DATA[15:0] SAB SBA CPAB CPBA SP_OE_L 74F377 Q[7:0] D[7:0] SP_CP 74F377 Q[7:0] OE# DIR DATA[15:8] O[7:0] VCC 74F646 D[7:0] CPAB SAB SBA CPBA SP_DIR OE# DIR DATA[7:0] O[7:0] * Other names and brands may be claimed as property of others. 
A9714-02 158 Hardware Reference Manual Intel® IXP2800 Network Processor Intel XScale® Core Figure 49. Mode 3 Second Interface Topology with Intel / AMCC* SONET/SDH Device VCC SP_RD_L E SP_WR_L RWB SP_CS_L[1] CSB MCUTYPE INT SP_ACK_L VCC SP_AD[7:0] ALE ADDR[9:0] SP_ALE_L SP_CLK Intel® IXP2800 Network Processor Clock Driver CY2305 CE# D[7:0] CP Q[7:0] 74F377 ADDR[10:8] CE# D[7:0] CP Q[7:0] 74F377 ADDR[7:1] VCC 74F646 DATA[15:0] D[7:0] SP_CP SP_OE_L CPAB CPBA OE# DIR SAB SBA DATA[15:8] O[7:0] VCC 74F646 D[7:0] SP_DIR Intel® or AMCC* SONET/SDH CPAB CPBA OE# DIR SAB SBA DATA[7:0] O[7:0] * Other names and brands may be claimed as property of others. A9715-02 Hardware Reference Manual 159 Intel® IXP2800 Network Processor Intel XScale® Core Mode 3 Write Interface Protocol Figure 50 depicts a single write transaction launched from the IXP2800 Network Processor to the Intel and AMCC* SONET/SDH device, followed by two consecutive reads. Compared with the Lucent* TDAT042G5, this device has a shorter access time, about eight clock cycles (i.e., 160 ns). In this case, an intervening cycle may not be needed for the write transactions. Therefore, the throughput is about 12.5 Mbytes per second. Figure 50. Mode 3 Single Write Transfer Followed by Read (B0) 0 2 4 6 8 10 12 14 16 18 20 SP_CLK SP_ALE_L SP_CS_L[1]/CSB SP_WR_L/WRB SP_RD_L/RDB SP_AD[7:0] A A [7:0] [15:8] D [7:0] D[15:8] A [7:0] A [15:8] D[15:8] D[7:0] SP_ACK_L/INT SP_CP SP_OE_L SP_DIR ADDR[15:0] DATA[15:0] A [7:0] A [10:8] A[10:1] D [15:8] D[15:0] A [7:0] A[10:1] D[15:0] D[15:0] 2xD[15:8] B1749-04 160 Hardware Reference Manual Intel® IXP2800 Network Processor Intel XScale® Core Mode 3 Read Interface Protocol Figure 51 depicts a single read transaction launched from the IXP2800 to the Intel and AMCC* SONET/SDH device, followed by two consecutive writes. Similarly, the access time is much better than the Lucent* TDAT042G5. The access time is eight clock cycles or 160 ns for a 50-MHz Slowport clock. Here, there are three intervening cycles between transactions. Therefore, the throughput is 11.1 Mbytes per second. Figure 51. Mode 3 Single Read Transfer Followed by Write (B0) 0 2 4 6 8 10 12 14 16 18 20 22 24 SP_CLK SP_ALE_L SP_CS_L[1] /CSB SP_WR_L/WRB SP_RD_L/RDB SP_AD[7:0] A [7:0] A [15:8] D[15:8] D[7:0] A [7:0] A [15:8] A [7:0] A [10:8] D [7:0] D[15:8] A [7:0] A[15:8] A[10:1] A [10:8] SP_ACK_L /INT SP_CP SP_OE_L SP_DIR ADDR[15:0] A [7:0] DATA[15:0] A[10:1] D[15:0] D[15:0] 2xD[15:8] D[15:0] D[15:0] B1752-07 Mode 4 Interfacing Topology Figure 52 demonstrates one of the topologies used to connect Slowport to the Intel and AMCC* SONET/SDH device. Similar to the Lucent* TDAT042G5 interface, the address and the data need demultiplexing. It requires a total of six buffers. The RD_L, WR_L, and CS_L[1] entirely match the E, RWB, and CSB pins respectively, in the Intel framer configured to Motorola* mode. However, the INT has to be connected to the SP_ACK_L as the PMC-Sierra* Interface does. The ALE pin can share the SP_CP. However, if it does not meet the timing, the ALE pin can be tied high as shown in Figure 53. Hardware Reference Manual 161 Intel® IXP2800 Network Processor Intel XScale® Core It employs the same way to pack and unpack the data between the IXP2800 Network Processor Slowport interface and the Intel and AMCC* microprocessor interface. For a write, W2B loads the data onto the 74F646 or equivalent tri-state buffers, using two clock cycles. 
To reduce the pin count, the 16-bit data are latched with the same pin (CS_L[1]), assuming that a turnaround cycle is inserted between the transaction cycles. For a read, data are pipelined out of two 74F646 or equivalent tri-state buffers by B2S, using two consecutive clock cycles. Figure 52. An Interface Topology with Intel / AMCC* SONET/SDH Device in Motorola* Mode SP_RD_L E SP_WR_L RWB SP_CS_L[1] CSB SP_ACK_L INT MCUTYPE SP_AD[7:0] ALE ADDR[9:0] SP_ALE_L SP_CLK Intel® IXP2800 Network Processor Clock Driver CY2305 CE# D[7:0] CP Q[7:0] ADDR[10.8] CE# D[7:0] CP Q[7:0] VCC DATA[15:0] D[7:0] SP_OE_L 74F377 Intel® or AMCC* SONET/SDH ADDR[7:1] 74F646 SP_CP 74F377 SAB SBA CPAB CPBA DATA[15:8] OE# DIR O[7:0] VCC 74F646 D[7:0] CPAB SAB SBA CPBA SP_DIR OE# DIR DATA[7:0] O[7:0] * Other names and brands may be claimed as property of others. A9718-02 162 Hardware Reference Manual Intel® IXP2800 Network Processor Intel XScale® Core Figure 53. Second Interface Topology with Intel / AMCC* SONET/SDH Device SP_RD_L E SP_WR_L RWB SP_CS_L[1] CSB MCUTYPE INT SP_ACK_L VCC SP_AD[7:0] ALE ADDR[9:0] SP_ALE_L SP_CLK Intel® IXP2800 Network Processor Clock Driver CY2305 CE# D[7:0] CP Q[7:0] ADDR[10:8] CE# D[7:0] CP Q[7:0] VCC 74F646 SP_OE_L DATA[15:0] CPAB CPBA OE# DIR SAB SBA DATA[15:8] O[7:0] VCC 74F646 D[7:0] SP_DIR 74F377 Intel® or AMCC* SONET/SDH ADDR[7:1] D[7:0] SP_CP 74F377 CPAB CPBA OE# DIR SAB SBA DATA[7:0] O[7:0] * Other names and brands may be claimed as property of others. A9719-02 Hardware Reference Manual 163 Intel® IXP2800 Network Processor Intel XScale® Core Mode 4 Write Interface Protocol Figure 54 depicts a single write transaction launched from the IXP2800 Network Processor to the Intel and AMCC* SONET/SDH device, followed by two consecutive reads. Compared with the Lucent* TDAT042G5, this device has a shorter access time, about eight clock cycles (i.e., 120 ns). In this case, an intervened cycle may not be needed; therefore, the throughput is about 12.5 Mbytes per second. Figure 54. Mode 4 Single Write Transfer (B0) 0 2 4 6 8 10 12 14 16 18 SP_CLK SP_ALE_L SP_CS_L[1]/CSB SP_WR_L/RWB SP_RD_L/E SP_AD[7:0] A [7:0] A [15:8] D [7:0] D[15:8] A [7:0] A [15:8] D[15:8] D[7:0] SP_ACK_L/INT SP_CP SP_OE_L SP_DIR ADDR[15:0] DATA[15:0] A [7:0] A[10:1] D [7:0] D[15:0] A [7:0] A[10:1] D[15:0] D[15:0] 2xD[15:8] B1756-04 164 Hardware Reference Manual Intel® IXP2800 Network Processor Intel XScale® Core Mode 4 Read Interface Protocol Figure 55 shows a single read transaction launched from the IXP2800 Network Processor to the Intel and AMCC* SONET/SDH device, followed by two consecutive writes. Similarly, the access time is much better than the Lucent* TDAT042G5. The access time is about eight clock cycles or 160 ns. Here, we need an intervened cycle at the back. Therefore, the throughput is 11.2 Mbytes per second. Figure 55. Mode 4 Single Read Transfer (B0) 0 2 4 6 8 10 12 14 16 18 20 22 24 26 SP_CLK SP_ALE_L SP_CS_L[1]/CSB SP_WR_L/RWB SP_RD_L/E SP_AD[7:0] A [7:0] A [15:8] D[15:8] D[7:0] A [7:0] A [15:8] D [7:0] D[15:8] A [7:0] SP_ACK_L/INT SP_CP SP_OE_L SP_DIR ADDR[15:0] A [7:0] DATA[15:0] A[10:1] D[15:0] D[15:0] 2xD[15:8] A [7:0] A[10:1] D [7:0] D[15:0] B1757-07 Hardware Reference Manual 165 Intel® IXP2800 Network Processor Intel XScale® Core 166 Hardware Reference Manual Intel® IXP2800 Network Processor Microengines Microengines 4 This section defines the Network Processor Microengine (ME). This is the second version of the Microengine, and is often referred to as the MEv2 (Microengine Version 2). 
4.1 Overview The following sections describe the programmer’s view of the Microengine. The block diagram in Figure 56 is used in the description. Note that this block diagram is simplified for clarity, not all interface signals are shown, and some blocks and connectivity have been omitted to make the diagram more readable. This block diagram does not show any pipeline stages, rather it shows the logical flow of information. The Microengine provides support for software controlled multi-threaded operation. Given the disparity in processor cycle times versus external memory times, a single thread of execution will often block waiting for external memory operations to complete. Having multiple threads available allows for threads to interleave operation—there is often at least one thread ready to run while others are blocked. Hardware Reference Manual 167 Intel® IXP2800 Network Processor Microengines Figure 56. Microengine Block Diagram NNData_In (from previous ME) 640 Local Mem d e c o d e S_Push (from SRAM Scratchpad, MSF, Hash, PCI, CAP) D_Push (from DRAM) 128 GPRs (A Bank) 128 GPRs (B Bank) 128 Next Neighbor 128 D XFER In 128 S XFER In Control Store Lm_addr_1 Lm_addr_0 A_Src B_Src T_Index NN_Get CRC_Remainder Immed CRC Unit B_Operand A_Operand Execution Datapath (Shift, Add, Subtract, Multiply Logicals, Find First Bit, CAM) ALU_Out Dest S_Push NN_Data_Out (to next ME) 128 D XFER Out Local CSRs CMD FIFO (4) 128 S XFER Out Command D_Pull Control Data S_Pull B1670-01 168 Hardware Reference Manual Intel® IXP2800 Network Processor Microengines 4.1.1 Control Store The Control Store is a static RAM that holds the program that the Microengine executes. It holds 8192 instructions, each of which is 40 bits wide. It is initialized by an external device that writes to Ustore_Addr and Ustore_Data Local CSRs. The Control Store can optionally be protected by parity against soft errors. The parity protection is optional, so that it can be disabled for implementations that don’t need or want to pay the cost for it. Parity checking is enabled by CTX_Enable[Control Store Parity Enable]. A parity error on an instruction read will halt the Microengine and assert an output signal that can be used as an interrupt. 4.1.2 Contexts There are eight hardware Contexts available in the Microengine. To allow for efficient context swapping, each Context has its own register set, Program Counter, and Context specific Local registers. Having a separate copy per Context eliminates the need to move Context specific information to/from shared memory and Microengine registers for each Context swap. Fast context swapping allows a Context to do computation while other Contexts wait for IO (typically external memory accesses) to complete or for a signal from another Context or hardware unit. Note: a context swap is similar to a taken branch in timing. Each of the eight Contexts is always in one of four states. 1. Inactive — Some applications may not require all eight contexts. A Context is in the Inactive state when its CTX_Enable CSR enable bit is a 0. 2. Executing — A Context is in Executing state when its context number is in Active_CTX_Status CSR. The executing Context’s PC is used to fetch instructions from the Control Store. A Context will stay in this state until it executes an instruction that causes it to go to Sleep state (there is no hardware interrupt or preemption; Context swapping is completely under software control). At most one Context can be in Executing state at any time. 3. 
Ready — In this state, a Context is ready to execute, but is not because a different Context is executing. When the Executing Context goes to Sleep state, the Microengine’s context arbiter selects the next Context to go to the Executing state from among all the Contexts in the Ready state. The arbitration is round robin. 4. Sleep — Context is waiting for external event(s) specified in the CTX_#_Wakeup_Events CSR to occur (typically, but not limited to, an IO access). In this state the Context does not arbitrate to enter the Executing state. The state diagram in Figure 57 illustrates the Context state transitions. Each of the eight Contexts will be in one of these states. At most one Context can be in Executing state at a time; any number of Contexts can be in any of the other states. Hardware Reference Manual 169 Intel® IXP2800 Network Processor Microengines Figure 57. Context State Transition Diagram CTX_ENABLE bit is set by Intel XScale® Core Reset Inactive CTX_ENABLE bit is cleared Sleep Ready CTX_ENABLE bit is cleared Ex te lE rn a ve ig nt S n al es arriv Context executes CTX Arbitration instruction Executing Context goes to Sleep state, and this Context is the highest round-robin priority. Executing Note: After reset, the Intel XScale® Core processor must load the starting address of the CTX_PC, load the CTX_WAKEUP_EVENTS to 0x1 (voluntary), and then set the appropriate CTX_ENABLE bits to begin executing Context(s). A9352-03 The Microengine is in Idle state whenever no Context is running (all Contexts are in either Inactive or Sleep states). This state is entered: 1. After reset (because CTX_Enable Local CSR is clear, putting all Contexts into Inactive states). 2. When a context swap is executed, but no context is ready to wakeup. 3. When a ctx_arb[bpt] instruction is executed by the Microengine (this is a special case of condition 2 above, since the ctx_arb[bpt] clears CTX_Enable, putting all Contexts into Inactive states). The Microengine provides the following functionality during Idle state: 1. The Microengine continuously checks if a Context is in Ready state. If so, a new Context begins to execute. If no Context is Ready, the Microengine remains in the Idle state. 2. Only the ALU instructions are supported. They are used for debug via special hardware defined in number 3 below. 3. A write to the Ustore_Addr Local CSR with the Ustore_Addr[ECS] bit set, causing the Microengine to repeatedly execute the instruction pointed by the address specified in the Ustore_Addr CSR. Only the ALU instructions are supported in this mode. Also, the result of the execution is written to the ALU_Out Local CSR rather than a destination register. 4. A write to the Ustore_Addr Local CSR with the Ustore_Addr[ECS] bit set, followed by a write to the Ustore_Data Local CSR loads an instruction into the Control Store. After the Control Store is loaded, execution proceeds as described in number 3 above. Note that the write to Ustore_Data causes Ustore_Addr to increment, so it must be written back to the address of the desired instruction. 170 Hardware Reference Manual Intel® IXP2800 Network Processor Microengines 4.1.3 Datapath Registers As shown in the block diagram in Figure 56, each Microengine contains four types of 32-bit datapath registers: • • • • 4.1.3.1 256 General Purpose registers 512 Transfer registers 128 Next Neighbor registers 640 32-bit words of Local Memory General-Purpose Registers (GPRs) GPRs are used for general programming purposes. 
They are read and written exclusively under program control. GPRs, when used as a source in an instruction, supply operands to the execution datapath. When used as a destination in an instruction, they are written with the result of the execution datapath. The specific GPRs selected are encoded in the instruction. The GPRs are physically and logically contained in two banks, GPR A and GPR B, defined in Table 57. Note: 4.1.3.2 The Microengine registers are defined in the IXP2400 and IXP2800 Network Processor Programmer’s Reference Manual. Transfer Registers There are four types of transfer (abbreviated as Xfer) registers used for transferring data to and from the Microengine and locations external to the Microengine (DRAMs, SRAMs, etc.). • • • • S_TRANSFER_IN S_TRANSFER_OUT D_TRANSFER_IN D_TRANSFER_OUT Transfer_In registers, when used as a source in an instruction, supply operands to the execution datapath. The specific register selected is either encoded in the instruction or selected indirectly using T_Index. Transfer_In registers are written by external units based on the Push_ID input to the Microengine. Transfer_Out registers, when used as a destination in an instruction, are written with the result from the execution datapath. The specific register selected is encoded in the instruction, or selected indirectly via T_Index. Transfer_Out registers supply data to external units based on the Pull_ID input to the Microengine. As shown in Figure 56, the S_TRANSFER_IN and D_TRANSFER_IN registers connect to both the S_Push and D_Push buses via a multiplexor internal to the Microengine. Additionally, the S_TRANSFER_OUT and D_TRANSFER_OUT Transfer registers connect to both the S_Pull and D_Pull buses. This feature enables a programmer to use the either type of transfer register regardless of the source or destination of the transfer. Hardware Reference Manual 171 Intel® IXP2800 Network Processor Microengines Typically, the external units access the Transfer registers in response to commands sent by the Microengines. The commands are sent in response to instructions executed by the Microengine (for example, the command instructs a SRAM controller to read from external SRAM, and place the data into a S_TRANSFER_IN register). However, it is possible for an external unit to access a given Microengine’s Transfer registers either autonomously, or under control of a different Microengine, or the Intel XScale® core, etc. The Microengine interface signals controlling writing/ reading of the Transfer_In/Transfer_Out registers are independent of the operation of the rest of the Microengine. 4.1.3.3 Next Neighbor Registers A new feature added for the Microengine Version 2 are 128 Next Neighbor registers that provide a dedicated datapath for transferring data from the previous/next neighbor Microengine. Next Neighbor registers, when used as a source in an instruction, supply operands to the execution datapath. They are written in two different ways: (1) by an external entity, typically, but not limited to, another adjacent Microengine, or (2) by the same Microengine they are in, as controlled by CTX_Enable[NN_Mode]. The specific register is selected in one of two ways: (1) Contextrelative, the register number is encoded in the instruction, or (2) as a Ring, selected via NN_Get and NN_Put CSR registers. When CTX_Enable[NN_Mode] is ‘0’ – When Next Neighbor is used as a destination in an instruction, the instruction result data is sent out of the Microengine, typically to another, adjacent Microengine. 
When CTX_Enable[NN_Mode] is ‘1’– When Next Neighbor is used as a destination in an instruction, the instruction result data is written to the selected Next Neighbor register in the Microengine. Note that there is a 5-instruction latency until the newly written data can be read. The data is not sent out of the Microengine as it would be when CTX_Enable[NN_Mode] is ‘0’. Table 56. Next Neighbor Write as a Function of CTX_Enable[NN_Mode] Where the Write Goes NN_Mode 4.1.3.4 External? NN Register in this Microengine? 0 Yes No 1 No Yes Local Memory Local Memory is addressable storage located in the Microengine, organized as 640 32-bit words. Local Memory is read and written exclusively under program control. Local Memory supplies operands to the execution datapath as a source, and receives results as a destination. The specific Local Memory location selected is based on the value in one of the Local Memory_Addr registers, which are written by local_CSR_wr instructions. There are two LM_Addr registers per Context and a working copy of each. When a Context goes to the Sleep state, the value of the working copies is put into the Context’s copy of LM_Addr. When the Context goes to the Executing state, the value in its copy of LM_Addr is put into the working copies. The choice of LM_Addr_0 or LM_Addr_1 is selected in the instruction. 172 Hardware Reference Manual Intel® IXP2800 Network Processor Microengines It is also possible to make use of both or one LM_Addrs as global by setting CTX_Enable[LM_Addr_0_Global] and/or CTX_Enable[LM_Addr_1_Global]. When used globally, all Contexts use the working copy of LM_Addr in place of their own Context specific one; the Context specific ones are unused. 4.1.4 Addressing Modes GPRs can be accessed in two different addressing modes: Context-Relative and Absolute. Some instructions can specify either mode; other instructions can specify only Context-Relative mode. • Transfer and Next Neighbor registers can be accessed in Context-Relative and Indexed modes. • Local Memory is accessed in Indexed mode. • The addressing mode in use is encoded directly into each instruction, for each source and destination specifier. 4.1.4.1 Context-Relative Addressing Mode The GPRs are logically subdivided into equal regions such that each Context has exclusive access to one of the regions. The number of regions (four or eight) is configured in the CTX_Enable CSR. Thus, a Context-Relative register name is actually associated with multiple different physical registers. The actual register to be accessed is determined by the Context making the access request (the Context number is concatenated with the register number specified in the instruction — see Table 57). Context-Relative addressing is a powerful feature that enables eight different contexts to share the same microcode, yet maintain separate data. Table 57 shows how the Context number is used in selecting the register number in relative mode. The register number in Table 57 is the Absolute GPR address, or Transfer or Next Neighbor Index number to use to access the specific Context-Relative register. For example, with eight active Contexts, Context-Relative Register 0 for Context 2 is Absolute Register Number 32. Table 57. 
Registers Used by Contexts in Context-Relative Addressing Mode Number of Active Contexts 8 4 Hardware Reference Manual Active Context Number GPR Absolute Register Numbers S_Transfer or Neighbor Index Number D_Transfer Index Number A Port B Port 0 0 – 15 0 – 15 0 – 15 0 – 15 1 16 – 31 16 – 31 16 – 31 16 – 31 2 32 – 47 32 – 47 32 – 47 32 – 47 3 48 – 63 48 – 63 48 – 63 48 – 63 4 64 – 79 64 – 79 64 – 79 64 – 79 5 80 – 95 80 – 95 80 – 95 80 – 95 6 96 – 111 96 – 111 96 – 111 96 – 111 7 112 – 127 112 – 127 112 – 127 112 – 127 0 0 – 31 0 – 31 0 – 31 0 – 31 2 32 – 63 32 – 63 32 – 63 32 – 63 4 64 – 95 64 – 95 64 – 95 64 – 95 6 96 – 127 96 – 127 96 – 127 96 – 127 173 Intel® IXP2800 Network Processor Microengines 4.1.4.2 Absolute Addressing Mode With Absolute addressing, any GPR can be read or written by any one of the eight Contexts in a Microengine. Absolute addressing enables register data to be shared among all of the Contexts, e.g., for global variables or for parameter passing. All 256 GPRs can be read by Absolute address. 4.1.4.3 Indexed Addressing Mode With Indexed addressing, any Transfer or Next Neighbor register can be read or written by any one of the eight Contexts in an Microengine. Indexed addressing enables register data to be shared among all of the Contexts. For indexed addressing the register number comes from the T_Index register for Transfer registers or NN_Put and NN_Get registers (for Next Neighbor registers). 4.2 Local CSRs Local Control and Status registers (CSRs) are external to the Execution Datapath, and hold specific purpose information. They can be read and written by special instructions (local_csr_rd and local_csr_wr) and are typically accessed less frequently than datapath registers. Because Local CSRs are not built in the datapath, there is a write to use delay of either three or four cycles, and a read to consume penalty of one cycle. 4.3 Execution Datapath The Execution Datapath can take one or two operands, perform an operation, and optionally write back a result. The sources and destinations can be GPRs, Transfer registers, Next Neighbor registers, and Local Memory. The operations are shifts, addition, subtraction, logicals, multiplication, byte-align, and “find first bit set”. 4.3.1 Byte Align The datapath provides a mechanism to move data from source register(s) to any destination register(s) with byte aligning. Byte aligning takes four consecutive bytes from two concatenated values (eight bytes), starting at any of four byte boundaries (0, 1, 2, 3), and based on the endian type (which is defined in the instruction opcode), as shown in Table 58. The four bytes are taken from two concatenated values. Four bytes are always supplied from a temporary register that always holds the A or B operand from the previous cycle, and the other four bytes from the B or A operand of the Byte Align instruction. The operation is described below using the block diagram Figure 58. The alignment is controlled by the two LSBs of the Byte_Index Local CSR. Table 58. Align Value and Shift Amount Align Value (in Byte_Index[1:0]) 174 Right Shift Amount (Number of Bits in Decimal) Little-Endian Big-Endian 0 0 32 1 8 24 2 16 16 3 24 8 Hardware Reference Manual Intel® IXP2800 Network Processor Microengines Figure 58. Byte Align Block Diagram Prev_B Prev_A . . . . . . A_Operand B_Operand Shift Byte_Index Result A9353-01 Example 24 shows an align sequence of instructions and the value of the various operands. Table 59 shows the data in the registers for this example. 
The value in Byte_Index[1:0] CSR (which controls the shift amount) for this example is 2. Table 59. Register Contents for Example 23 Register Byte 3 [31:24] Byte 2 [23:16] Byte 1 [15:8] Byte 0 [7:0] 0 0 1 2 3 1 4 5 6 7 2 8 9 A B 3 C D E F Example 24. Big-Endian Align Instruction Byte_align_be[--, r0] Byte_align_be[dest1, r1] Prev B A Operand B Operand Result -- -- 0123 -- 0123 0123 4567 2345 Byte_align_be[dest2, r2] 4567 4567 89AB 6789 Byte_align_be[dest3, r3] 89AB 89AB CDEF ABCD NOTE: A Operand comes from Prev_B register during byte_align_be instructions. Hardware Reference Manual 175 Intel® IXP2800 Network Processor Microengines Example 25 shows another sequence of instructions and the value of the various operands. Table 60, shows the data in the registers for this example. The value in Byte_Index[1:0] CSR (which controls the shift amount) for this example is 2. Table 60. Register Contents for Example 24 Register Byte 3 [31:24] Byte 2 [23:16] Byte 1 [15:8] Byte 0 [7:0] 0 3 2 1 0 1 7 6 5 4 2 B A 9 8 3 F E D C Example 25. Little-Endian Align Instruction Byte_align_le[--, r0] A Operand B Operand Prev A Result 3210 -- -- -- Byte_align_le[dest1, r1] 7654 3210 3210 5432 Byte_align_le[dest2, r2] BA98 7654 7654 9876 Byte_align_le[dest3, r3] FEDC BA98 BA98 DCBA NOTE: B Operand comes from Prev_A register during byte_align_le instructions. As the examples show, byte aligning “n” words takes “n+1” cycles due to the first instruction needed to start the operation. Another mode of operation is to use the T_Index register with post-increment, to select the source registers. T_Index operation is described later in this chapter. 4.3.2 CAM The block diagram in Figure 59 is used to explain the CAM operation. The CAM has 16 entries. Each entry stores a 32-bit value, which can be compared against a source operand by instruction: CAM_Lookup[dest_reg, source_reg]. All entries are compared in parallel, and the result of the lookup is a 9-bit value that is written into the specified destination register in bits 11:3, with all other bits of the register set to 0 (the choice of bits 11:3 is explained below). The result can also optionally be written into either of the LM_Addr registers (see below in this section for details). The 9-bit result consists of four State bits (dest_reg[11:8]), concatenated with a 1-bit Hit/Miss indication (dest_reg[7]), concatenated with 4-bit entry number (dest_reg[6:3]). All other bits of dest_reg are written with 0. Possible results of the lookup are: • miss (0) — lookup value is not in CAM, entry number is Least Recently Used entry (which can be used as a suggested entry to replace), and State bits are 0000. • hit (1) — lookup value is in CAM, entry number is entry that has matched; State bits are the value from the entry that has matched. 176 Hardware Reference Manual Intel® IXP2800 Network Processor Microengines Note: The State bits are data associated with the entry. State bits are only used by software. There is no implication of ownership of the entry by any Context. The State bits hardware function is: • the value is set by software (when the entry is loaded or changed in an already-loaded entry). • its value is read out on a lookup that hits, and used as part of the status written into the destination register. • its value can be read out separately (normally only used for diagnostic or debug). The LRU (Least Recently Used) Logic maintains a time-ordered list of CAM entry usage. When an entry is loaded, or matches on a lookup, it is marked as MRU (Most Recently Used). 
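The lookup-result encoding and LRU behavior just described can be summarized with a small host-side model. The following C sketch is illustrative only: it is not Microengine microcode, the type and function names are invented for the example, and it models a 16-entry CAM whose result is packed into bits [11:3] as {State, Hit, Entry} and whose LRU list is updated only when an entry is loaded or a lookup hits.

#include <stdint.h>

#define CAM_ENTRIES 16

typedef struct {
    uint32_t tag[CAM_ENTRIES];
    uint8_t  state[CAM_ENTRIES];   /* 4 State bits per entry                     */
    uint8_t  lru[CAM_ENTRIES];     /* lru[0] = LRU entry ... lru[15] = MRU entry */
} cam_t;

/* Move an entry to the MRU position; used when an entry is loaded or hits. */
static void cam_touch(cam_t *cam, unsigned entry)
{
    unsigned i, j;
    for (i = 0; i < CAM_ENTRIES; i++) {
        if (cam->lru[i] == entry) {
            for (j = i; j < CAM_ENTRIES - 1; j++)
                cam->lru[j] = cam->lru[j + 1];
            cam->lru[CAM_ENTRIES - 1] = (uint8_t)entry;
            return;
        }
    }
}

/* Model of CAM_Lookup: the 9-bit result is packed into bits [11:3] of the
 * destination as {State[11:8], Hit[7], Entry[6:3]}; all other bits are 0.  */
static uint32_t cam_lookup(cam_t *cam, uint32_t value)
{
    unsigned e;
    for (e = 0; e < CAM_ENTRIES; e++) {
        if (cam->tag[e] == value) {
            cam_touch(cam, e);     /* a hit marks the entry MRU */
            return ((uint32_t)cam->state[e] << 8) | (1u << 7) | (e << 3);
        }
    }
    /* Miss: State bits 0000, Hit bit 0, entry number = current LRU entry.
     * The LRU list is not modified on a miss.                             */
    return (uint32_t)cam->lru[0] << 3;
}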
Note that a lookup that misses does not modify the LRU list. The CAM is loaded by instruction: CAM_Write[entry_reg, source_reg, state_value]. The value in the register specified by source_reg is put into the Tag field of the entry specified by entry_reg. The value for the State bits of the entry is specified in the instruction as state_value. The value in the State bits for an entry can be written, without modifying the Tag, by instruction: CAM_Write_State[entry_reg, state_value]. Note: CAM_Write_State does not modify the LRU list. Figure 59. CAM Block Diagram Lookup Value (from A port) Tag State Tag State Tag State Tag State Match Match Match Status and LRU Logic Match Lookup Status (to Dest Req) State Status Entry Number 0000 Miss 0 LRU Entry State Hit 1 Hit Entry A9354-01 Hardware Reference Manual 177 Intel® IXP2800 Network Processor Microengines One possible way to use the result of a lookup is to dispatch to the proper code using instruction: jump[register, label#],defer [3] where the register holds the result of the lookup. The State bits can be used to differentiate cases where the data associated with the CAM entry is in flight, or is pending a change, etc. Because the lookup result was loaded into bits[11:3] of the destination register, the jump destinations are spaced eight instructions apart. This is a balance between giving enough space for many applications to complete their task without having to jump to another region, versus consuming too much Control Store. Another way to use the lookup result is to branch on just the hit miss bit, and use the entry number as a base pointer into a block of Local Memory. When enabled, the CAM lookup result is loaded into Local_Addr as follows: LM_Addr[5:0] = 0 ([1:0] are read-only bits) LM_Addr[9:6] = lookup result [6:3] (entry number) LM_Addr[11:10] = constant specified in instruction This function is useful when the CAM is used as a cache, and each entry is associated with a block of data in Local Memory. Note that the latency from when CAM_Lookup executes until the LM_Addr is loaded is the same as when LM_Addr is written by a Local_CSR_Wr instruction. The Tag and State bits for a given entry can be read by instructions: CAM_Read_Tag[dest_reg, entry_reg] CAM_Read_State[dest_reg, entry_reg] The Tag value and State bits value for the specified entry is written into the destination register, respectively for the two instructions (the State bits are placed into bits [11:8] of dest_reg, with all other bits 0). Reading the tag is useful in the case where an entry needs to be evicted to make room for a new value — the lookup of the new value results in a miss, with the LRU entry number returned as a result of the miss. The CAM_Read_Tag instruction can then be used to find the value that was stored in that entry. An alternative would be to keep the tag value in a GPR. These two instructions can also be used by debug and diagnostic software. Neither of these modify the state of the LRU pointer. Note: The following rules must be adhered to when using the CAM. • CAM is not reset by Microengine reset. Software must either do a CAM_clear prior to using the CAM to initialize the LRU and clear the tags to 0, or explicitly write all entries with CAM_write. • No two tags can be written to have the same value. If this rule is violated, the result of a lookup that matches that value will be unpredictable, and LRU state is unpredictable. The value, 0x00000000 can be used as a valid lookup value. 
However, note that the CAM_clear instruction puts 0x00000000 into all tags.To avoid violating rule 2 after doing CAM_clear, it is necessary to write all entries to unique values prior to doing a lookup of 0x00000000. An algorithm for debug software to determine the contents of the CAM is shown in Example 26. 178 Hardware Reference Manual Intel® IXP2800 Network Processor Microengines Example 26. Algorithm for Debug Software to Determine the Contents of the CAM ; First read each of the tag entries. Note that these reads ; don’t modify the LRU list or any other CAM state. tag[0] = CAM_Read_Tag(entry_0); ...... tag[15] = CAM_Read_Tag(entry_15); ; Now read each of the state bits state[0] = CAM_Read_State(entry_0); ... state[15] = CAM_Read_State(entry_15); ; Knowing what tags are in the CAM makes it possible to ; create a value that is not in any tag, and will therefore ; miss on a lookup. ; Next loop through a sequence of 16 lookups, each of which will ; miss, to obtain the LRU values of the CAM. for (i = 0; i < 16; i++) BEGIN_LOOP ; Do a lookup with a tag not present in the CAM. On a ; miss, the LRU entry will be returned. Since this lookup ; missed the LRU state is not modified. LRU[i] = CAM_Lookup(some_tag_not_in_cam); ; Now do a lookup using the tag of the LRU entry. This ; lookup will hit, which makes that entry MRU. ; This is necessary to allow the next lookup miss to ; see the next LRU entry. junk = CAM_Lookup(tag[LRU[i]]); END_LOOP ; Because all entries were hit in the same order as they were ; LRU, the LRU list is now back to where it started before the ; loop executed. ; LRU[0] through LRU[15] holds the LRU list. The CAM can be cleared with CAM_Clear instruction. This instruction writes 0x00000000 simultaneously to all entries tag, clears all the state bits, and puts the LRU into an initial state (where entry 0 is LRU, ..., entry 15 is MRU). 4.4 CRC Unit The CRC Unit operates in parallel with the Execution Datapath. It takes two operands, performs a CRC operation, and writes back a result. CRC-CCITT, CRC-32, CRC-10, CRC-5, and iSCSI polynomials are supported. One of the operands is the CRC_Remainder Local CSR, and the other is a GPR, Transfer_In register, Next Neighbor, or Local Memory, specified in the instruction and passed through the Execution Datapath to the CRC Unit. The instruction specifies the CRC operation type, whether to swap bytes and or bits, and the bytes of the operand to be included in the operation. The result of the CRC operation is written back into CRC_Remainder. The source operand can also be written into a destination register (however the byte/bit swapping and masking do not affect the destination register; they only affect the CRC computation). This allows moving data, for example, from S_TRANSFER_IN registers to S_TRANSFER_OUT registers at the same time as computing the CRC. Hardware Reference Manual 179 Intel® IXP2800 Network Processor Microengines 4.5 Event Signals Event Signals are used to coordinate a program with completion of external events. For example, when a Microengine issues a command to an external unit to read data (which will be written into a Transfer_In register), the program must insure that it does not try to use the data until the external unit has written it. There is no hardware mechanism to flag that a register write is pending, and then prevent the program from using it. Instead the coordination is under software control, with hardware support. 
When the program issues the command to the external event, it can request that the external unit supply an indication (called an Event Signal) that the command has been completed. There are 15 Event Signals per Context that can be used, and Local CSRs per Context to track which Event Signals are pending and which have been returned. The Event Signals can be used to move a Context from Sleep state to Ready state, or alternatively, the program can test and branch on the status of Event Signals. Event Signals can be set in nine different ways. 1. When data is written into S_TRANSFER_IN registers (part of S_Push_ID input) 2. When data is written into D_TRANSFER_IN registers (part of D_Push_ID input) 3. When data is taken from S_TRANSFER_OUT registers (part of S_Pull_ID input) 4. When data is taken from D_TRANSFER_OUT registers (part of D_Pull_ID input) 5. On InterThread_Sig_In input 6. On NN_Sig_In input 7. On Prev_Sig_In input 8. On write to Same_ME_Signal Local CSR 9. By Internal Timer Any or all Event Signals can be set by any of the above sources. When a Context goes to the Sleep state (executes a ctx_arb instruction, or a Command instruction with ctx_swap token), it specifies which Event Signal(s) it requires to be put in the Ready state. The ctx_arb instruction also specifies whether the logical AND or logical OR of the Event Signal(s) is needed to put the Context into the Ready state. When a Context Event Signals arrive, it goes to the Ready state, and then to the Executing state. In the case where the Event Signal is linked to moving data into or out of Transfer registers (numbers 1 through 4 in the list above), the code can safely use the Transfer register as the first instruction (for example, using a Transfer_In register as a source operand will get the new read data). The same is true when the Event Signal is tested for branches (br_=signal or br_!signal instructions). The ctx_arb instruction, CTX_Sig_Events, and CTX_Wakeup_#_Events Local CSR descriptions provide details. 180 Hardware Reference Manual Intel® IXP2800 Network Processor Microengines 4.5.1 Microengine Endianness Microengine operation from an “endian” point of view can be divided into following categories: • • • • • • 4.5.1.1 Read from RBUF (64 bits) Write to TBUF (64 bits) Read/write from/to SRAM Read/write from/to DRAM Read/write from/to SHaC and other CSRs Write to Hash Read from RBUF (64 Bits) Data in RBUF is arranged in LWBE order. Whenever the Microengine reads from RBUF, the low order longword (LDW0) is transferred into Microengine transfer register 0 (treg0), the high order longword (LDW1) is transferred into treg1, etc. This is explained in Figure 60. Figure 60. Read from RBUF (64 Bits) MicroEngine 0123 4567 8 9 10 11 12 13 14 15 treg0 treg1 treg2 treg3 LDW1 LDW0 4567 12 13 14 15 0123 8 9 10 11 RBUF A8941-01 Hardware Reference Manual 181 Intel® IXP2800 Network Processor Microengines 4.5.1.2 Write to TBUF Data in TBUF is arranged in LWBE order. When writing from the Microengine transfer registers to TBUF, treg0 goes into LDW0, treg1 goes into LDW1, etc. See Figure 61. Figure 61. Write to TBUF (64 Bits) TBUF 0123 8 9 10 11 MicroEngine 4567 12 13 14 15 0123 4567 8 9 10 11 12 13 14 15 treg0 treg1 treg2 treg3 A8942-01 4.5.1.3 Read/Write from/to SRAM Data inside SRAM is in big-endian order. While transferring data from SRAM to a Microengine, no endianness is involved and first read data goes into the first transfer register specified, the next read data into the second, etc. 
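The LWBE (longword big-endian) convention used by RBUF and TBUF above, and by DRAM in the next subsection, can be illustrated with a small host-side sketch. The fragment below (C, purely for illustration; the function name is invented and this is not microcode) shows only the longword ordering of an RBUF read as described in Section 4.5.1.1: the low-order longword LDW0 lands in the first transfer register and the high-order longword LDW1 in the second, while bytes within each longword remain in big-endian order.

#include <stdint.h>

/* One 64-bit RBUF quadword split into two 32-bit transfer registers. */
static void rbuf_quadword_to_tregs(uint64_t qword, uint32_t treg[2])
{
    treg[0] = (uint32_t)(qword & 0xFFFFFFFFu);  /* LDW0 (bytes 0-3) -> treg0 */
    treg[1] = (uint32_t)(qword >> 32);          /* LDW1 (bytes 4-7) -> treg1 */
}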
4.5.1.4 Read/Write from/to DRAM Data inside DRAM is in LWBE order. When a Microengine reads from DRAM, LDW0 goes into the first transfer register specified in the instruction, LDW1 goes into the next, and so on. While writing to DRAM, treg0 goes first, then followed by treg1, and both are combined in the DRAM controller as {LDW1, LDW0} and written as a 64-bit quantity into DRAM. 4.5.1.5 Read/Write from/to SHaC and Other CSRs Read and write from SHaC and other CSRs happen as 32-bit operations only and are endianindependent. The low byte goes into the low byte of the transfer register and the high byte goes into the high byte of the transfer register. 182 Hardware Reference Manual Intel® IXP2800 Network Processor Microengines 4.5.1.6 Write to Hash Unit Figure 62 explains 48-, 64-, and 128-bit hash operations. When the Microengine transfers a 48-bit hash operand to the hash unit, the operand resides in two transfer registers and is transferred, as shown in Figure 62. In the second longword transfer, only the lower half is valid. Hash unit concatenates the two longwords as shown in Figure 62. Similarly, 64-bit and 128-bit hash operand transfers from the Microengine to the hash unit happen as shown in Figure 62. Figure 62. 48-, 64-, and 128-Bit Hash Operand Transfers 48-bit Hash 63 64-bit Hash 32 31 11 10 9 8 0 63 76543210 0 76543210 15 14 13 12 11 10 9 8 S-Push / S-Pull Bus MicroEngine Transfer Registers 32 31 S-Push / S-Pull Bus 76543210 treg0 11 10 9 8 treg1 MicroEngine Transfer Registers treg0 76543210 15 14 13 12 11 10 9 8 treg1 128-bit Hash 127 96 95 64 63 32 31 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 0 76543210 S-Push / S-Pull Bus MicroEngine Transfer Registers 76543210 treg0 15 14 13 12 11 10 9 8 treg1 23 22 21 20 19 18 17 16 treg2 31 30 29 28 27 26 25 24 treg3 A8943-01 4.5.2 Media Access Media operation can be divided in two parts: • Read from RBUF (Section 4.5.2.1) • Write to TBUF (Section 4.5.2.2) Hardware Reference Manual 183 Intel® IXP2800 Network Processor Microengines 4.5.2.1 Read from RBUF To analyze the endianness on the media-receive interface and the way in which bytes are arranged inside RBUF, a brief introduction of how bytes are generated from the serial interface is provided here. Pipe A denotes the serial stream of data received at the serial interface (SERDES). Bit 0 of byte 0 comes first, followed by bit 1, etc. Pipe B converts this bit stream into byte stream byte 0 — byte 7, etc. So, byte 0 currently is the least significant byte received. In Pipe C, before being transmitted to the SPI-4 interface, these bytes are organized in 16-bit words in big-endian order where byte 0 is at B[15:8] and byte 1 is at B[7:0]. When the SPI-4 interface inside the IXP2800 received these 16-bit words, they are put into RBUF in LWBE order where longwords inside one RBUF entry are organized in little-endian order as shown in one RBUF element in Figure 63. In the least-significant-longword, byte 0 is at a higher address than byte 3 (therefore, big-endian). Similarly, in the most-significant-longword, byte4 is at a higher address than byte 7 (therefore, big-endian). While transferring from RBUF to Microengine, the least significant longword from one RBUF element is transferred first, followed by the most significant longword into the Microengine transfer registers. . Figure 63. 
Bit, Byte, and Longword Organization in One RBUF Element B63 B32 B31 B0 byte byte byte byte byte byte byte byte 4 5 6 7 0 1 2 3 RBUF Element Offset n SPI-4 Bus addr15 addr8 addr7 addr0 byte 0 byte 2 byte 4 byte 1 byte 3 byte 5 byte 6 byte 7 Pipe C {7 6 5 4 3 2 1 0} byte 0 {15 14 13 12 11 10 9 8} byte 1 {23 22 21 20 19 18 17 16} byte 2 {31 30 29 28 27 26 25 24} byte 3 {63 62 61 60 59 58 57 56} byte 7 Pipe B byte 0 0 1 2 3 4 5 6 7 8 57 58 59 60 61 62 63 Pipe A A9725-01 184 Hardware Reference Manual Intel® IXP2800 Network Processor Microengines 4.5.2.2 Write to TBUF For writing to TBUF, the header comes from the Microengine and data comes from RBUF or DRAM. Since the Microengine to TBUF header transfer happened in 8-byte chunks, it is possible that the first longword that is inside tr0 may not contain any data if the valid header begins in transfer register tr1. Since data in tr0 goes to the LW1 location at offset 0 and data in tr2 goes to the LW0 location at offset 0, there are some invalid bytes at the beginning of the header, at offset 0. These invalid bytes are removed by the aligner on the way out of TBUF, based on the control word for this TBUF element. The data from tr2, tr3, ... tr6 is placed in TBUF, as shown in Figure 64 in big-endian order. Figure 64. Write to TBUF To SPI-4 Remove the empty bytes based on the control word LW1 X X X X X h1 h4 h5 h6 h7 h8 h9 h10 h11 offset 1 h12 h13 X X X X X X offset 2 3 4 5 6 7 offset 3 X tr0 X X X X tr1 X h1 h2 h3 tr2 h4 h5 h6 h7 tr3 h8 h9 h10 h11 LW0 X X h2 8 9 10 11 12 13 14 15 offset 4 16 17 18 19 20 21 22 23 offset 5 X X X 3 4 5 6 7 A63 tr4 h12 h13 X X tr5 X X X X MicroEngine Transfer Registers h3 offset 0 4 64-bit Read from addr 1 by MicroEngine treg 0 gets 0123 treg 1 gets 4567 addr0 5 6 7 0 12 13 14 15 8 1 2 3 addr 0000_0000 9 10 11 addr 0001_0000 20 21 22 23 16 17 18 19 RBUF or DRAM A8945-01 Hardware Reference Manual 185 Intel® IXP2800 Network Processor Microengines Since data in RBUF or DRAM is arranged in LWBE order, it is swapped on the way into the TBUF to make it truly big-endian, as shown in Figure 64. Again, the invalid bytes at the beginning of the payload that starts at offset 3 and at the end-of-header at offset 2 is removed by the aligner on the way out of TBUF. 4.5.2.3 TBUF to SPI-4 Transfer Figure 65 shows how the MSF interface removes invalid bytes from TBUF data and transfers them onto the SPI-4 interface in 16-bit (2-byte) chunks. Figure 65. MSF Interface To SPI-4 Interface A15 A8 A7 4 A0 h1 h2 h3 h4 h5 h7 h9 h6 h8 h10 h11 h12 h13 4 6 8 3 5 7 9 10 11 12 13 14 15 LW1 LW0 X X X X h1 h5 h6 h7 h8 h9 h10 h11 offset 1 h12 h13 X X X X X X offset 2 3 4 5 6 7 offset 3 X Byte to bit-stream conversion To Serial Link After removing the invalid bytes, data is packed in two byte chunks. X X h3 h2 h1 Word to Byte conversion h4 X 3 h2 h3 offset 0 8 9 10 11 12 13 14 15 offset 4 16 17 18 19 20 21 22 23 offset 5 A8946-01 186 Hardware Reference Manual Intel® IXP2800 Network Processor DRAM DRAM 5 This section describes Rambus* DRAM operation. 5.1 Overview The IXP2800 Network Processor has controllers for three Rambus* DRAM (RDRAM) channels. Either one, two, or three channels can be enabled. When more than one channel is enabled, the channels are interleaved (also known as striping) on 128-byte boundaries to provide balanced access to all populated channels. Interleaving is performed in hardware and is transparent to the programmer. The programmer views the DRAM memory space as a contiguous block of memory. 
The total address space of two Gbytes is supported by the DRAM interface regardless of the number of channels that are enabled. The controllers support 64-, 128-, 256-, and 512-Mbyte, and 1-Gbyte devices; however, with interleaving, each of the channels must have the same number, size, and speed of RDRAMs populated. Each channel can be populated with up to 32 RDRAM devices. While each channel must have the same size and speed RDRAMs, it is possible for each individual channel to have different size and speed RDRAMs, as long as the total amount of memory is the same for all of the channels. ECC (Error Correcting Code) is supported. Enabling ECC requires that x18 RDRAMs be used. If ECC is disabled, x16 RDRAMs can be used. The Microengines (MEs), Intel XScale® core, and PCI (external Bus Masters and DMA Channels) have access to the DRAM memory space. The controllers also automatically perform refresh as well as IO driver calibration to account for variations in operating conditions due to process, temperature, voltage, and board layout. RDRAM Powerdown and nap modes are not supported. Hardware Reference Manual 187 Intel® IXP2800 Network Processor DRAM 5.2 Size Configuration Each channel can be populated with 1 – 4 RDRAMs (Short Channel Mode). For supported loading configurations, refer to Table 61. The RAM technology used determines the increment size and maximum memory per channel as shown in Table 62. Note: One or two channels can be left unpopulated if desired. Table 61. RDRAM Loading Bus Interface Maximum Number of Loads Trace Length (inches) Short Channel: 400 and 533 MHz 4 devices per channel. 201 Long Channel: 400 MHz 2 RIMMs per channel – a maximum of 32 devices in both RIMMs. 201 Long Channel: 533 MHz 1 RIMM and 1 C-RIMM per channel – a maximum of 16 devices. 201 1. For termination, the DRAMs should be located as close as possible to the IXP2800 Network Processor. Table 62. RDRAM Sizes RDRAM Technology1 Increment Size Maximum per Channel 64/72 MB 8 MBs 256 MB 128/144 MB 16 MB 512 MB 256/288 MB 32 MB 1 GB2 512/576 MB 64 MB 2 GB2 NOTES: 1. The two numbers shown for each technology indicate x16 parts and x18 parts. 2. The maximum memory that can be addressed across all channels is 2 Gbytes. This limitation is based on the partitioning of the 4-Gbyte address space (32-bit addresses). Therefore, if all three channels are used, each can be populated up to a maximum of 768 Mbytes. Two channels can be populated to a maximum of 1 Gbyte each. A single channel could be populated to a maximum of 1 Gbyte. RDRAMs with 1 x 16 dependent banks, 2 x 16 dependent banks, and four independent banks are supported. 188 Hardware Reference Manual Intel® IXP2800 Network Processor DRAM 5.3 DRAM Clocking Figure 66 shows the clock generation for one channel (this description is just for reference; for more information, refer to Rambus* design literature). The other channels use the same configuration. Note: Refer to Section 10 for additional information on clocking. Figure 66. Clock Configuration RDRAM Intel® IXP2800 Network Processor CTM, nCTM CFM, nCFM PCLKM RDRAM Direct Rambus* Clock Generator (DRCG) SYNCLKN CLK_PHASE_REF REF_CLK A9726-02 The RDRAM Controller receives two clocks, both generated internal to the IXP2800 Network Processor. The internal clock, is used to control all logic associated with communication with other on-chip Units. This clock is ½ of the Microengine frequency, and is in the range of 500 – 700 MHz. 
The other clock, the Rambus* Memory Controller (RMC) clock, is internally divided by two and brought out on the CLK_PHASE_REF pin, which is then used as the reference clock for the DRCG (see Figure 67 and Figure 68). The reason for this is that our internal RMC clock is derived from the Microengine clock (supported programmable divide range is from 8 – 15 for the A stepping and 6 – 15 for the B stepping) at a Microengine frequency of 1.4 GHz (the available RMC clock frequencies are 100 MHz and 127 MHz). In the RMC implementation, we have a fixed 1:1 clock relationship between the RMC clock and the SYNCLK (SYNCLK = Clock-to-Master(CTM)/4); to maintain this relationship, we provide the clock to the DRCG. The CTM is received by the DRAM controller which it drives back out as Clock-from-Master (CFM). Additionally, the controller creates PCLKM and SYNCLKN, which are also driven to the DRCG. Hardware Reference Manual 189 Intel® IXP2800 Network Processor DRAM Figure 67. IXP2800 Clocking for RDRAM at 400 MHz DRCG 50 MHz Bus CLK = 400 MHz x8 Phase Detector 25 MHz CLK_PHASE_REF /2 System 100 MHz Ref_Clk 25 MHz /4 /4 PLL PCLK = 100 MHz SYNCLK = 100 MHz RMC /4 RAC A9727-01 Figure 68. IXP2800 Clocking for RDRAM at 508 MHz DRCG 63.5 MHz Bus CLK = 508 MHz x8 Phase Detector 31.75 MHz CLK_PHASE_REF /2 System 100 MHz Ref_Clk /4 31.75 MHz /4 PLL PCLK = 127 MHz RMC SYNCLK = 127 MHz /4 RAC A9728-01 5.4 Bank Policy The RDRAM Controller uses a “closed bank” policy. Banks are activated long enough to do an access and then closed and precharged. They are not left open in anticipation of another access to the same page. This is unlike many CPU applications, where there is a high degree of locality. Since that locality does not exist in the typical applications in which the IXP2800 Network Processor uses RDRAM, the “closed bank” policy is used. 190 Hardware Reference Manual Intel® IXP2800 Network Processor DRAM 5.5 Interleaving The RDRAM channels are interleaved on 128-byte boundaries in hardware to improve concurrency and bandwidth utilization. Contiguous addresses are directed to different channels by rearranging the physical address bits in a programmable manner described in Section 5.5.1 through Section 5.5.3 and then remapped as described in Section 5.5.4. The block diagram in Figure 69 illustrates the flow. Figure 69. Address Mapping Flow Bank 0 CMD FIFO Microengine, Intel XScale® core, PCIinitiated address Channel Selection In-Channel Address Address Remapping Bank 1 CMD FIFO Bank 2 CMD FIFO RDRAM_CONTROL[NO_CHAN] RDRAM_CONTROL[BANK_REMAP] Bank 3 CMD FIFO The mapping of addresses to channels is completely transparent to software. Software deals with physical addresses in RDRAM space; the mapping is done completely by hardware. Note: 5.5.1 Accessing an address above the amount of RDRAM populated will cause unpredictable results. Three Channels Active (3-Way Interleave) When all three channels are active, the interleave scheme selects the channel for each block, using modulo-3 reduction (address bits [31:7] are summed as modulo-3, and the remainder is the selected channel number). The algorithm ensures that adjacent blocks are mapped to different channels. The address within the DRAM is then selected by rearranging the received address, as shown in Table 63. In this case, the number of DRAMs on a channel must be either 1, 2, 4, 8, 16, or 32. For Rev. B, the address within the DRAM is selected by adding the received address to the contents of one of the CSRs (K0 – K11), or 0, as shown in Table 64. 
The values to load into K0 – K11 are a function of the amount of memory on the channel, and are specified in the IXP2400 and IXP2800 Network Processor Programmer’s Reference Manual. For memory sizes of 32, 64, or 128 Mbytes, etc., the specified constants give the same remapping as was done in a previous revision. Hardware Reference Manual 191 Intel® IXP2800 Network Processor DRAM Table 63. Address Rearrangement for 3-Way Interleave (Sheet 1 of 2) When these bits of address are all “1”s…1 Shift 30:7 right by this many bits 30:7 Add this amount to shifted 30:7 (based on amount of memory on the channel) Address within channel is {30:7+table_value), 6:0} 8 MB3 16 MB 32 MB3 64 MB 128 MB3 256 MB 512 MB3 1 GB 26 N/A2 N/A N/A N/A N/A N/A N/A 8388607 28:7 24 N/A N/A N/A N/A N/A 2097151 4194303 8388606 26:7 22 N/A N/A N/A 524287 1048575 2097150 4194300 8388600 24:7 20 N/A 131071 262143 524286 1048572 2097144 4194288 8388576 22:7 18 65535 131070 262140 524280 1048560 2097120 4194240 8388480 20:7 16 65532 131064 262128 524256 1048512 2097024 4194048 8388096 18:7 14 65520 131040 262080 524160 1048320 2096640 4193280 8386560 16:7 12 65472 130944 261888 523776 1047552 2095104 4190208 8380416 14:7 10 65280 130560 261120 522240 1044480 2088960 4177920 8355840 12:7 8 64512 129024 258048 516096 1032192 2064384 4128768 8257536 10:7 6 61440 122880 245760 491520 983040 1966080 3932160 7864320 8:7 4 49152 98304 196608 393216 786432 1572864 3145728 6291456 None 2 0 0 0 0 0 0 0 0 NOTES: 1. This is a priority encoder; when multiple lines satisfy the condition, the line with the largest number of ones is used. 2. N/A means not applicable. 3. For these cases, the top 3 blocks (each block is 128 bytes) of the logical address space is not accessible. For example if each channel has 8 Mbytes, only (24 Mbytes - 384) total bytes are usable. This is an artifact of the remapping method. 4. The numbers in the table are derived as follows: For the first pair of ones (8:7) value is 3/4 the number of blocks. For each subsequent pair of ones, the value is the previous value, plus another 3/4 the remaining blocks. • [8:7]==11 - 3/4 * blocks • [10:7]==1111 - (3/4 + 3/16) * blocks • [12:7]==111111 - (3/4 + 3/16 + 3/64) * blocks • etc. 192 Hardware Reference Manual Intel® IXP2800 Network Processor DRAM Table 64. Address Rearrangement for 3-Way Interleave (Sheet 2 of 2) (Rev B) When these bits of address are all “1”s…1 Add the value in this CSR to the address 30:7 K11 28:7 K10 26:7 K9 24:7 K8 22:7 K7 20:7 K6 18:7 K5 16:7 K4 14:7 K3 12:7 K2 10:7 K1 8:7 K0 None Value 0 added. NOTES: 1. This is a priority encoder; when multiple lines satisfy the condition, the line with the largest number of ones is used. 5.5.2 Two Channels Active (2-Way Interleave) It is possible to have only two channels populated for system cost and area savings. If only two channels are desired, than channels 0 and 1 should be populated and channel 2 should be left empty. In the Two Channel Mode, the address interleaving is designed with the goal of spreading adjacent accesses across the 2 channels. When two channels are active, address bit 7 is used as the channel select. Addresses that have address 7 equal to 0 are mapped to channel 0 while those with address 7 equal to 1 are mapped to channel 1. The address within the channel is {[31:8], [6:0]}. 5.5.3 One Channel Active (No Interleave) When only one channel is active, all accesses go to that channel. In this case, it is possible for an access to split across two DRAM banks (which could be in different RDRAMs). 
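To make the channel-selection rules of Sections 5.5.1 through 5.5.3 concrete, the sketch below models in C which channel a physical address is directed to for the three-, two-, and one-channel configurations, plus the in-channel address used in the two-channel case. This is for illustration only; the hardware performs the mapping transparently and the function names are invented.

#include <stdint.h>

/* Channel selection for 3-, 2-, and 1-channel configurations. */
static unsigned dram_channel(uint32_t addr, unsigned num_channels)
{
    uint32_t block = addr >> 7;        /* 128-byte block number (bits 31:7) */

    switch (num_channels) {
    case 3:
        return block % 3;              /* modulo-3 reduction of bits [31:7] */
    case 2:
        return (addr >> 7) & 1;        /* address bit 7 selects the channel */
    default:
        return 0;                      /* single channel: all accesses go to it */
    }
}

/* Two-channel in-channel address: {addr[31:8], addr[6:0]}, i.e. bit 7 removed. */
static uint32_t two_way_in_channel_addr(uint32_t addr)
{
    return ((addr >> 8) << 7) | (addr & 0x7Fu);
}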
5.5.4 Interleaving Across RDRAMs and Banks
In addition to interleaving across the different RDRAM channels, addresses are also interleaved across RDRAM chips and internal banks. This improves utilization, since certain operations to different banks can be performed concurrently. The interleaving is done by rearranging the remapped address derived in Section 5.5.1, Section 5.5.2, and Section 5.5.3 as a function of the memory size, as shown in Table 65. The two MSBs of the rearranged address are used to select which Bank Command FIFO the command is placed in. The rearranged address is also partitioned to choose the RDRAM chip, the bank within the RDRAM, and the page within the bank.
Table 65. Address Bank Interleaving
Remapped Address Based on RDRAM_Control[Bank_Remap]
Memory Size on Channel (MB, Note 3)    00             01                  10                   11
8      7:14, 22:15    9:14, 7:8, 22:15    11:14, 7:10, 22:15    13:14, 7:12, 22:15
16     7:14, 23:15    9:14, 7:8, 23:15    11:14, 7:10, 23:15    13:14, 7:12, 23:15
32     7:14, 24:15    9:14, 7:8, 24:15    11:14, 7:10, 24:15    13:14, 7:12, 24:15
64     7:14, 25:15    9:14, 7:8, 25:15    11:14, 7:10, 25:15    13:14, 7:12, 25:15
128    7:14, 26:15    9:14, 7:8, 26:15    11:14, 7:10, 26:15    13:14, 7:12, 26:15
256    7:14, 27:15    9:14, 7:8, 27:15    11:14, 7:10, 27:15    13:14, 7:12, 27:15
512    7:14, 28:15    9:14, 7:8, 28:15    11:14, 7:10, 28:15    13:14, 7:12, 28:15
1024   7:14, 29:15    9:14, 7:8, 29:15    11:14, 7:10, 29:15    13:14, 7:12, 29:15
Bits used to select Bank Command FIFO:    7:8    9:10    11:12    13:14
NOTES:
1. Table shows device/bank sorting of the channel remapped block address, which is in address 31:7. LSBs of the address are always 6:0 (byte within the block), which are not remapped.
2. Unused MSBs of address have a value of 0.
3. Size is programmed in RDRAM_Control[Size].
5.6 Parity and ECC
DRAM can be optionally protected by byte parity or by an 8-bit error detecting and correcting code (ECC). RDRAMn_Control[ECC] for each channel selects whether that channel uses Parity, ECC, or no protection. When parity or ECC is enabled, x18 RDRAMs must be used, with the extra bits connected to the dqa[8] and dqb[8] signals. Eight bits of ECC code cover eight bytes of data (aligned to an 8-byte boundary).
5.6.1 Parity and ECC Disabled
• On reads, the data is delivered to the originator of the read; no error is possible.
• Partial writes (writes of less than eight bytes) are done as masked writes.
5.6.2 Parity Enabled
On writes, odd byte parity is computed for each byte and written into the corresponding parity bit. Partial writes (writes of less than eight bytes) are done as masked writes. On reads, odd byte parity is computed on each byte of data and compared to the corresponding parity bit. If there is an error, the RDRAMn_Error_Status_1[Uncorr_Err] bit is set, which can interrupt the Intel XScale® core if enabled. The Data Error signal will be asserted when the read data is delivered on D_Push_Data. The address of the error, along with other information, is logged in RDRAMn_Error_Status_1[ADDR] and RDRAMn_Error_Status_2. Once the error bit is set, those registers are locked. That is, information relating to subsequent errors is not loaded until the error status bit is cleared by an Intel XScale® core write.
5.6.3 ECC Enabled
On writes, eight ECC check bits are computed based on 64 bits of data, and are written into the check bits. Partial writes (writes of less than eight bytes) cause the channel controller to do a read-modify-write.
Any single-bit error detected during the read portion is corrected prior to merging with the write data. An uncorrectable error detected during the read does not modify the data. Either type of error will set the appropriate error status bit, as described below. On reads, the correct value for the check bits is computed from the data and is compared to the ECC check bits. If there is no error, data is delivered to the originator of the read, because it came from the RDRAMs. If there is a single-bit error, it is corrected before being delivered (the correction is done automatically, the reader is given the correct data). The error is also logged by setting the RDRAMn_Error_Status_1[Corr_Err] bit, which can interrupt the Intel XScale® core if enabled. If there is an uncorrectable error, the RDRAMn_Error_Status_1[Uncorr_Err] bit is set, which can interrupt the Intel XScale® core if enabled. The Data Error signal is asserted when the read data is delivered on D_Push_Data, unless the token, Ignore Data Error, was asserted in the command. In that case, the RDRAM controller does not assert Data Error and does not assert a Signal (it will use 0xF, which is a null signal, in place of the requested signal number). In both correctable and uncorrectable cases, the address of the error, along with other information, is logged in RDRAMn_Error_Status_1[ADDR] and RDRAMn_Error_Status_2. Once either of the error bits is set, those registers are locked. That is, the information relating to subsequent errors is not loaded until both error status bits are clear. That does not prevent the correction of single-bit errors, only the logging. Note: When a single-bit error is corrected, the corrected data is not written back into memory (scrubbed) by hardware; this can be done by software if desired, because all of the information pertaining to the error is logged. Hardware Reference Manual 195 Intel® IXP2800 Network Processor DRAM To avoid the detection of false ECC errors, the RDRAM ECC mode must be initialized using the procedure described below: 1. Ensure that parity/ECC is not enabled: program DRAM_CTRL[15:14] = 00 2. Write all zeros (0x00000000) to all the memory locations. By default, this initializes the memory with odd parity and in this case (data all 0), it coincides with ECC and does not require any read-modify-writes because ECC is not enabled. 3. Ensure that all of the writes are completed prior to enabling ECC. This is done by performing a read operation to 1000 locations. 4. Enable ECC mode: program DRAM_CTRL[15:14] accordingly. 5.6.4 ECC Calculation and Syndrome The ECC check bits are calculated by forming parity checks on groups of data bits. The check bits are stored in memory during writes via the dqa[8] and dqb[8] signals. Note that memory initialization code must put good ECC into all of memory by writing each location before it can be read. Writing any arbitrary data into memory – for example 0, will accomplish this. This will take several milliseconds per Mbyte of memory. On reads, the expected code is calculated from the data, and then compared to (XORed) the ECC that was read. The result of the comparison is called the syndrome. If the syndrome is equal to 0, then there was no error. There are eight syndromes that are calculated based on the read data and its corresponding ECC bit. When ECC is enabled, upon detecting a single-bit error, the syndrome is used to determine which bit needs to be flipped to correct the error. 
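As a software illustration of the checks described in Sections 5.6.2 and 5.6.4, the following C sketch computes the odd byte parity used in Parity mode and forms an ECC syndrome as the XOR of the stored and recomputed check bits. It is a host-side example only; the actual 8-bit check-bit equations are not reproduced here, and the function names are invented.

#include <stdint.h>

/* Odd byte parity: each parity bit is chosen so that its data byte plus the
 * parity bit contain an odd number of ones.  Bit i of the return value is
 * the parity bit for byte i of 'data'.                                      */
static uint8_t odd_byte_parity(uint64_t data)
{
    uint8_t parity = 0;
    unsigned byte;
    for (byte = 0; byte < 8; byte++) {
        uint8_t b = (uint8_t)(data >> (8 * byte));
        b ^= b >> 4;                               /* fold byte to its parity */
        b ^= b >> 2;
        b ^= b >> 1;
        parity |= (uint8_t)((~b & 1u) << byte);    /* invert for odd parity   */
    }
    return parity;
}

/* ECC syndrome: XOR of the check bits read from memory and the check bits
 * recomputed from the read data; zero means no error.                       */
static uint8_t ecc_syndrome(uint8_t stored_check, uint8_t computed_check)
{
    return (uint8_t)(stored_check ^ computed_check);
}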
5.7 Timing Configuration Table 66 shows the example timing settings for RDRAMs of various speeds. The parameters are programmed in the RDRAM_Config CSRs (refer to the PRM for register descriptions) . Table 66. RDRAM Timing Parameter Settings Parameter Name 196 -40800 -45800 -50800 -45711 -50711 -45600 -53600 CfgTrcd 7 9 11 7 9 5 7 CfgTrasSyn 5 5 6 5 5 4 5 CfgTrp 8 8 10 8 8 6 8 CfgToffpSyn 4 4 4 4 4 4 4 CfgTrasrefSyn 5 5 6 5 5 4 5 CfgTprefSyn 2 2 2 2 2 2 2 Hardware Reference Manual Intel® IXP2800 Network Processor DRAM 5.8 Microengine Signals Upon completion of a read or write, the RDRAM controller can signal a Microengine context, when enabled. It does so using the sig_done token; see Example 27. Example 27. RDRAM Controller Signaling a Microengine Context dram [read,$xfer6,addr_a,0,1], sig_done_4 dram [read,$xfer7,addr_b,0,1], sig_done_6 ctx_arb[4, 5, 6, 7] Because the RDRAM address space is interleaved, consecutive accesses can go to different RDRAM channels. There is no ordering guaranteed among different channels, so a separate signal is needed for each. In addition, because accesses start at any address, and can specify up to 16 64-bit words (128 bytes), they can also split across two channels (refer to Section 5.5). The ctx_arb instruction must set two Wakeup_Events (an odd/even pair) per access. The RDRAM controllers coordinate as follows: • If the access split across two channels, the channel handling the low part of the split delivers the even-numbered Event Signal, and the channel handling the upper part of the split delivers the odd-numbered Event Signal. • If the access does not split, the channel delivers both Event Signals (by coordinating with the D_Push or D_Pull arbiter for read and writes respectively). • In all cases, the channel delivers the Event Signal with the last data Push or Pull of a burst. Using the above rules, the Microengine will be put into the Ready State (ready to resume executing) only when all accesses have completed. 5.9 Serial Port The RDRAM chips are configured through a serial port, which consists of signals D_SIO, D_CMD, and D_SCK. Access to the serial port is via the RDRAM_Serial_Command and RDRAM_Serial_Data CSRs (refer to the IXP2400 and IXP2800 Network Processor Programmer’s Reference Manual for the register descriptions). All serial commands are initiated by a write to RDRAM_Serial_Command. Because the serial port is slow, RDRAM_Serial_Command has a Busy bit, which indicates that a serial port command is in progress. Software must test this bit before initiating a command. This ensures that software will not lose a command, while eliminating the need for a hardware FIFO for serial commands. Serial writes are done by the following steps: 1. Read RDRAM_Serial_Command; test Busy bit until its a 0. 2. Write RDRAM_Serial_Data. 3. Write RDRAM_Serial_Command to start the write. Hardware Reference Manual 197 Intel® IXP2800 Network Processor DRAM Serial reads are done by the following steps: 1. Read RDRAM_Serial_Command; test Busy bit until it is a 0. 2. Write RDRAM_Serial_Command to start the read. 3. Read RDRAM_Serial_Command; test Busy bit until it is a 0. 4. Read RDRAM_Serial_Data to collect the serial read data. 5.10 RDRAM Controller Block Diagram The RDRAM controller consists of three pieces. Figure 70 is a simplified block diagram. Figure 70. 
RDRAM Controller Block Diagram Pre_RMC RMC RAC CMD Bus RQ Intel® IXP2800 Network Processor RDRAMs D_Push Bus DQ D_Pull Bus A9729-02 Pre_RMC — has the queues for commands, data (both in and out), and interfaces to internal buses. It checks incoming commands and addresses to determine if they are targeted to the channel, and if so, enqueues them (if a command splits across two channels, the channel must enqueue the portion of the command that it owns). It sorts the enqueued commands to RDRAM banks, selects the command to be executed based on policy to get good bank utilization, and then hands off that command to RMC. It also arbitrates for refresh and calibration, which it requests RMC to perform. Pre_RMC also contains the ECC logic, and the CSRs that set size, timing, ECC, etc. RMC — Rambus* Memory Controller, that handles the pin protocol. It controls all timing dependencies, pin turnaround, RAS-CAS, RAS-RAS, etc., including bank interactions. RMC handles all commands in the order that it receives them. RMC is based on the Rambus* RMC. RAC — Rambus* ASIC Cell, a high-speed parallel-to-serial and serial-to-parallel interface. This is a hard macro that contains the I/O pads and drivers, DLL, and associated pin interface logic. The following is a brief explanation of command operation: Pre_RMC enqueues commands and sends them to RMC. It is responsible for initiating Pull operations to get Microengine/RBUF/Intel XScale® core/PCI data into the Pull_Data FIFO. A write is not eligible to go to RMC until Pre_RMC has all the data in the Pull Data FIFO. Pre_RMC provides the Full signal to the Command Arbiter to inform it stop allowing RDRAM commands. 198 Hardware Reference Manual Intel® IXP2800 Network Processor DRAM 5.10.1 Commands When a valid command is placed on the command bus, the control logic checks to see if the address matches the channel’s address range, based on interleaving as described in Section 5.5. The command, address, length, etc. are enqueued into the command Inlet FIFO. If the command Inlet FIFO becomes full, the channel sends a signal to the command arbiter, which will prevent it from sending further DRAM commands. The full signal must be asserted while there is still enough room in the FIFOs to hold the worst case number of in-flight commands. 5.10.2 DRAM Write When a write (or RBUF_RD, which does a DRAM write) command is at the head of the Command Inlet FIFO, it is moved to the proper Bank CMD FIFO, and the Pull_ID is sent to the Pull arbiter. This can only be done if there is room for the command in the Bank’s CMD FIFO and for the pull data in the Bank’s Pull Data FIFO (which must take into account all pull data in flight). If there is not enough room in the Bank’s CMD FIFO or the Bank’s Pull Data FIFO, the write command waits at the head of the Command Inlet FIFO. When the Pull_ID is sent to the Pull Arbiter, the Bank number is put into the PP (Pull in Progress) FIFO; this allows the channel to sort the Pull Data into the proper Bank Pull Data FIFO when it arrives. The source of the Pull Data can be either RBUF, PCI, Microengine, or the Intel XScale® core, and is specified in the Pull_ID. When the source is RBUF or PCI, data will be supplied to the Pull Data FIFO, at 64 bits per cycle. When the source is Microengine or the Intel XScale® core, data will be supplied at 32 bits per cycle, justified to the low 32 bits of Pull Data. The Pull Arbiter must merge and pack data as required. 
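For the 32-bit sources mentioned above, the merge amounts to packing two pull-data beats into one 64-bit word before it enters the Bank Pull Data FIFO. The following minimal C sketch (illustrative only; the function name is invented) shows that packing, with the beats combined as {LDW1, LDW0} as noted in Section 4.5.1.4.

#include <stdint.h>

/* Pack two 32-bit pull-data beats (first beat = LDW0, second = LDW1)
 * into one 64-bit word for the Bank Pull Data FIFO.                  */
static uint64_t pack_pull_beats(uint32_t ldw0, uint32_t ldw1)
{
    return ((uint64_t)ldw1 << 32) | ldw0;
}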
In addition, the data must be aligned according to the start address, which has longword resolution; this is done in Pre_RMC. The Length field of the command at the head of the Bank CMD FIFO is compared to the number of 64-bit words in the Bank Pull_Data FIFO. When the number of 64-bit words in Pull_Data FIFO is greater or equal to the length, the write arbitrates for the RMC. When it wins arbitration, it sends the address and command to RMC, which requests the write data from Pull_Data FIFO at the proper time to send it to the RDRAMs. Note: 5.10.2.1 The Microengine is signaled when the last data is pulled. Masked Write Masked writes (write of less than eight bytes) are done as either Read-Modify-Writes when ECC is enabled, or as Rambus*-masked writes (using COLM packets), when ECC is not enabled. In both cases, the masked write will modify seven or fewer bytes because the command bus limits a masked write to a ref_count of 1. If a RMW is used, no commands from that Bank’s CMD FIFO are started between the read and the write; other Bank commands can be done during that time. Hardware Reference Manual 199 Intel® IXP2800 Network Processor DRAM 5.10.3 DRAM Read When a read (or TBUF_WR, which does a DRAM read) command is at the head of the Command Inlet FIFO, it is moved to the proper Bank CMD FIFO if there is room. If there is not enough room in the Bank’s CMD FIFO, the read command waits at the head of the Command Inlet FIFO. When a read command is at the head of the Bank CMD FIFO, and there is room for the read data in the Push Data FIFO (including all reads in flight at the RDRAM), it will arbitrate for RMC. When it wins arbitration, it sends the address and command to RMC. The Push_ID is put into the RP FIFO (Read in Progress) to coordinate it with read data from RMC. When read data is returned from RMC, it is placed into the Push_Data FIFO. Each Push_Data is sent to the Push Arbiter with a Push_ID; the RDRAM controller increments the Push_ID for each data phase. If Push Arbiter asserts the full signal, Push Data is stopped and held in the Push Data skid FIFO. The Push Data is sent to the read destination under control of the Push Arbiter. The destination of the Push Data can be either Intel XScale® core, PCI, TBUF, or Microengine, and is specified in the Push_ID. When the destination is TBUF or PCI, data is taken at 64 bits per cycle. When the destination is the Microengine or the Intel XScale® core, data is taken at 32 bits per cycle. The Push Arbiter justifies the data to the low 32 bits of Push Data. The Microengine is signaled when the last data is pushed. 5.10.4 CSR Write When a CSR write command is at the head of the Command Inlet FIFO, it is moved to the CSR CMD register, and the Pull_ID is sent to the Pull arbiter. This can only be done if the CSR CMD register is not currently occupied. If it is, the CSR write command waits at the head of the Command Inlet FIFO. When the Pull_ID is sent to the Pull Arbiter, a tag is put into the PP FIFO (Pull in Progress); this allows the channel to identify the Pull Data as CSR data when it arrives. When the CSR pull data arrives, it is put into the addressed CSR, and the CSR CMD register is marked as empty. 5.10.5 CSR Read When a CSR read command is at the head of the Command Inlet FIFO, it is moved to the CSR CMD register. This can only be done if the CSR CMD register is not currently occupied. If it is, the CSR read command waits at the head of the Command Inlet FIFO. 
On the first available cycle in which RDRAM data from RMC is not being put into the Push Data FIFO, the CSR data will be put into the Push Data FIFO. If it is convenient to guarantee a slot by putting a bubble on the RMC input, then that will be done.
5.10.6 Arbitration
The channel needs to arbitrate among several different operations at RMC. The arbitration rules for those cases, from highest to lowest priority, are:
• Refresh RDRAM.
• Current calibrate RDRAM.
• Bank operations. When there are multiple bank operations ready, the rules are: (1) round robin among banks to avoid bank collisions, and (2) skip a bank to avoid DQ bus turnarounds. No bank can be skipped more than twice.
Commands are given to RMC in the order in which they will be executed.
5.10.7 Reference Ordering
Table 67 lists the ordering of reads and writes to the same address for DRAM. First and second are defined by the time at which each command is valid on the command bus.
Table 67. Ordering of Reads and Writes to the Same Address for DRAM
First Access    Second Access    Ordering Rules
Read            Read             None. If there are no side-effects on reads, both readers get the same data.
Read            Write            Reader must get the pre-modified data. This is not enforced in hardware. The write instruction must not be executed until after the Microengine receives the signal of read completion (i.e., the program must use sig_done on the read).
Write           Read             Reader must get the post-modified data. This is not enforced in hardware. The read instruction must not be executed until after the Microengine receives the signal of write completion (i.e., the program must use the sig_done token on the write instruction and wait for the signal before executing the read instruction).
Write           Write            The hardware guarantees that the writes complete in the order in which they are issued.
5.11 DRAM Push/Pull Arbiter
The DRAM Push/Pull Arbiter contains the push and pull arbiters for the D-Cluster (DRAM Cluster). Both the PUSH and PULL data buses have multiple masters and multiple targets. The DRAM Push/Pull Arbiter determines which master gets to drive the data bus for a given transaction and makes sure that the data is delivered correctly. This unit has the following features:
• Up to three DX Unit (DRAM Unit) masters.
• 64-bit wide push and pull data buses.
• Round-robin arbitration scheme.
• Peak delivery of 64 bits per cycle.
• Supports third-party data transfers; the Microengines can command data movements between the MSF (Media) and either the DX Units or the CR Units.
When that command is at the head of the FIFO, and it is either the requesting unit’s turn to go based on the round-robin arbitration policy, or there are no other requesters, then the arbiter will “grant” the request. This grant means that the arbiter delivers the push data to the correct target with all the correct handshakes and retires the request (a data transaction is always eight bytes). The DRAM pull arbiter (DPLA) is slightly different because it functions on bursts of data transactions instead of single transactions. For a pull transaction, a pull master drives a command to the pull arbiter and into a dedicated request FIFO. When the command gets to the head of the FIFO it is evaluated, s was done for the push arbiter. The difference is that each command may reference bursts of data movements (always in multiples of eight bytes). The arbiter grants the command, and keeps it granted until it increments through all of the data movements required by the command. As the data is read from its source, the command is modified to address the next data address, and a handshake to the requesting unit is driven when the data is valid. 202 Hardware Reference Manual Intel® IXP2800 Network Processor DRAM 5.11.2 DRAM Push Arbiter Description The general data flow for a push operation is as shown in Table 68. The DRAM Push Arbiter functional blocks are shown in Figure 72. Table 68. DRAM Push Arbiter Operation Push Bus Master/Requestor Data Source Data Destination IXP2800 Network Processor TC0 Cluster (ME 0 – 7) D0 Unit D1 Unit D2 Unit Current Master TC1 Cluster (ME 10 – 17) Intel XScale® core PCI Unit MSF Unit The push arbiter takes push requests from any requestors. Each requestor has a dedicated request FIFO. A request comes in the form of a PUSH_ID, and is accompanied by the data to be pushed, a data error bit, and a chain bit. All of this information is enqueued in the correct FIFO for each request, i.e., for each eight bytes of data. The push arbiter must drive a full signal to the requestor if the FIFO reaches a predefined “full” level to apply backpressure and stop requests from coming. The FIFO is 64 entries deep and goes full at 40 entries. The long skid allows for burst reads in flight to finish before stalling the DRAM controller. If the FIFO is not full, the push arbiter can enqueue a new request from each requestor on every cycle. The push arbiter monitors the heads of each FIFO, and does a round robin arbitration between any available requestors. If the chain bit is asserted, it indicates that once the head request of a queue is granted, the arbiter should continue to grant that queue until the chain bit de-asserts. It is expected that the requestor will assert the chain bit for no longer than a full burst length. The push arbiter must also take special notice of requests destined for the receive buffer in the Media Switch Fabric (MSF). Finally, the push arbiter must manage the delivery of data at different rates, depending on how wide the bus is going into a given target. The Microengines, PCI, and the Intel XScale® core all have 32-bit data buses. For these targets, the push arbiter takes two clock cycles to deliver 64 bits of data by first delivering bits 31:0 in the first cycle, and then putting bits 63:32 onto the low 32 bits of the PUSH_DATA in the second cycle. Hardware Reference Manual 203 Intel® IXP2800 Network Processor DRAM Figure 72. 
DRAM Push Arbiter Functional Blocks

[Figure 72 (A9732-01) shows the D0/D1/D2 PUSH_REQ, PUSH_ID, and PUSH_DATA inputs feeding per-source request FIFOs and a round-robin arbiter that drives DPXX_PUSH_ID and DPXX_PUSH_DATA.]

The DRAM Push Arbiter boundary conditions are:
• Make sure each of the push_request queues asserts the full signal and back-pressures the requesting unit.
• Maintain 100% bus utilization, i.e., no holes.

5.12 DRAM Pull Arbiter Description

The general data flow for a pull operation is as shown in Table 69. The DRAM Pull Arbiter functional blocks are shown in Figure 73.

Table 69. DPLA Description
Pull Bus Master/Requestor: D0 Unit, D1 Unit, D2 Unit
Data Source: IXP2800 Network Processor: TC0 Cluster (Microengine 0 – 7), TC1 Cluster (Microengine 8 – 15), Intel XScale® core, PCI Unit, or MSF Unit
Data Destination: Current Master

The pull arbiter is very similar to the push arbiter, except that it gathers the data from a data source ID and delivers it to the requesting unit, where it is written to DRAM memory.

When a requestor gets a pull command on the CMD_BUS, the requestor sends the command to the pull arbiter. This is enqueued into a requestor-dedicated FIFO. The pull request FIFOs are much smaller than the push request FIFOs because pull requests can request up to 128 bytes of data. Each FIFO is eight entries deep and asserts full when it has six entries, to account for in-flight requests.

The pull arbiter monitors the heads of each of the three FIFOs. A round-robin arbitration scheme is applied to all valid requests. When a request is granted, the request is completed regardless of how many data transfers are required. Therefore, one request can take as many as 16 – 32 DRAM cycles.

The pull data bus can only use 32 bits when transferring data from the Microengines, PCI, and the Intel XScale® core. For these data sources, it takes two cycles to pull every eight bytes requested; otherwise, it takes only one cycle per eight bytes. On four-byte cycles, data is delivered as it is pulled.

Figure 73. DRAM Pull Arbiter Functional Blocks

[Figure 73 (A9733-02) shows the D0/D1/D2 pull requests and PULL_IDs feeding per-source request FIFOs and a round-robin arbiter; the arbiter drives DPXX_PUSH_ID and DPXX_TAKE, and multiplexes ME_CLUSTER_0_DATA, ME_CLUSTER_1_DATA, XSCALE_DATA (Intel XScale® microarchitecture), PCI_PULL_DATA, and MSF_PULL_DATA onto DPXX_PULL_DATA[63:0].]

6 SRAM Interface

6.1 Overview

The IXP2800 Network Processor contains four independent SRAM controllers. The SRAM controllers support pipelined QDR synchronous static RAM (SRAM) and a coprocessor that adheres to QDR signaling. Any or all controllers can be left unpopulated if the application does not need them.

Reads and writes to SRAM are generated by Microengines (MEs), the Intel XScale® core, and PCI Bus masters. They are connected to the controllers through Command Buses and Push and Pull Buses. Each of the SRAM controllers takes commands from the command bus and enqueues them. The commands are de-queued according to priority, and successive accesses to the SRAMs are performed. Each SRAM controller receives commands using two Command Buses, one of which may be tied off as inactive, depending on the chip implementation. The SRAM Controller can enqueue a command from each Command Bus in each cycle. Data movement between the SRAM controllers and the Microengines is through the S_Push bus and S_Pull bus.
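The DRAM push and pull arbiters described in the preceding chapter are built around one request FIFO per master and a round-robin grant among the FIFOs that currently hold a request; the SRAM and Scratchpad push/pull arbiters referenced in this chapter arbitrate in a similar round-robin fashion. The following C fragment is a minimal behavioral sketch of that pattern only; the names and FIFO depth are illustrative and do not come from the hardware:

    #include <stdint.h>

    #define NUM_MASTERS 3      /* e.g., D0, D1, D2 units          */
    #define FIFO_DEPTH  8      /* illustrative depth only         */

    struct req_fifo {
        uint64_t entry[FIFO_DEPTH]; /* queued PUSH_IDs/PULL_IDs (not used below) */
        int head;
        int count;                  /* number of pending requests               */
    };

    /* Round-robin pick among FIFOs that have a pending request.
     * 'last' remembers the most recently granted master so the
     * search starts one past it on the next call.                */
    static int rr_grant(struct req_fifo f[NUM_MASTERS], int *last)
    {
        for (int i = 1; i <= NUM_MASTERS; i++) {
            int m = (*last + i) % NUM_MASTERS;
            if (f[m].count > 0) {
                *last = m;
                return m;       /* grant this master's head request */
            }
        }
        return -1;              /* no requester this cycle          */
    }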
The overall structure of the SRAM controllers is shown in Figure 74. Hardware Reference Manual 207 Intel® IXP2800 Network Processor SRAM Interface Figure 74. SRAM Controller/Chassis Block Diagram Command Bus from ME Cluster 0 SRAM chips and/or co-processor Command Bus from ME Cluster 1 SRAM Controller Push Bus / ID to ME Cluster 0 Push Arb Push Bus / ID to ME Cluster 1 Push Arb Pulll ID to ME Cluster 0 Pull Arb Pulll ID to ME Cluster 1 Pull Arb Pull Data from ME Cluster 0 SRAM Controller SRAM Controller SRAM Controller Pull Data from ME Cluster 1 A8951-01 6.2 SRAM Interface Configurations Memory is logically four bytes (one longword) wide while physically, the data pins are two bytes wide and double-clocked. Byte parity is supported. Each of the four bytes has a parity bit, which is written when the byte is written and checked when the longword is read. There are byte-enables that select the bytes to write, for lengths of less than a longword. The QDR controller implements a big-endian ordering scheme at the interface pins. For write operations, bytes 0/1, (data bits [31:16]), and associated parity and byte-enables are written on the rising edge of the K clock while bytes 2/3, (data bits [15:0]), and associated parity and byte-enables are written on the rising edge of the K_n clock. For read operations, bytes 0/1, (data bits [31:16]), and associated parity and byte-enables are captured on the rising edge of CIN0 clock while bytes 2/3, (data bits [15:0]), and associated parity and byte-enables are captured on the rising edge of CIN0_n clock. 208 Hardware Reference Manual Intel® IXP2800 Network Processor SRAM Interface In general, QDR and QDR II bursts of two SRAMs are supported at speeds up to 233 MHz. As other (larger) QDR SRAMs are introduced, they will also be supported. The SRAM controller can also be configured to interface to an external coprocessor that adheres to the QDR or QDR II electrical and functional specification. 6.2.1 Internal Interface Each SRAM channel receives commands through the command bus mechanism and transfers data to and from the Microengines, the Intel XScale® core, and PCI, using SRAM push and SRAM pull buses. 6.2.2 Number of Channels The IXP2800 Network Processor supports four channels. 6.2.3 Coprocessor and/or SRAMs Attached to a Channel Each channel supports the attachment of QDR SRAMs, a co-processor, or both, depending on the module level signal integrity and loading. 6.3 SRAM Controller Configurations There are enough address pins (24) to support up to 64 Mbytes of SRAM. The SRAM controllers can directly generate multiple port enables (up to five pairs) to allow for depth expansion. Two pairs of pins are dedicated for port enables. Smaller RAMs use fewer address signals than the number provided to accommodate the largest RAMs, so some address pins (23:18) are configurable as either address or port-enable based on CSR SRAM_Control[Port_Control] as shown in Table 70. Note: All of the SRAMs on a given channel must be the same size. Note: Table 70 shows the capability of the logic — 1, 2, or 4 loads are supported as shown in the table, but this is subject to change. Table 70. 
SRAM Controller Configurations SRAM Configuration SRAM Size Addresses Needed to Index SRAM Addresses Used as Port Enables Total Number of Port Select Pairs Available 512K x 18 1 Mbyte 17:0 23:22, 21:20 4 1M x 18 2 Mbytes 18:0 23:22, 21:20 4 2M x 18 4 Mbytes 19:0 23:22, 21:20 4 4M x 18 8 Mbytes 20:0 23:22 3 8M x 18 16 Mbytes 21:0 23:22 3 16M x 18 32 Mbytes 22:0 None 2 32M x 18 64 Mbytes 23:0 None 1 Hardware Reference Manual 209 Intel® IXP2800 Network Processor SRAM Interface Each channel can be expanded in depth according to the number of port enables available. If external decoding is used, then the number of SRAMs is not limited by the number of port enables generated by the SRAM controller. Note: External decoding may require external pipeline registers to account for the decode time, depending on the desired frequency. Maximum SRAM system sizes are shown in Table 71. Shaded entries require external decoding, because they use more port-enables than the SRAM controller can directly supply. Table 71. Total Memory per Channel Number of SRAMs on Channel SRAM Size 1 2 3 4 5 6 7 8 512K x 18 1 MB 2 MB 3 MB 4 MB 5 MB 6 MB 7 MB 8 MB 1M x 18 2 MB 4 MB 6 MB 8 MB 10 MB 12 MB 14 MB 16 MB 2M x 18 4 MB 8 MB 12 MB 16 MB 20 MB 24 MB 28 MB 32 MB 4M x 18 8 MB 16 MB 24 MB 32 MB 64 MB NA NA NA 8M x 18 16 MB 32 MB 48 MB 64 MB NA NA NA NA 16M x 18 32 MB 64 MB NA NA NA NA NA NA 32M x 18 64 MB NA NA NA NA NA NA NA Figure 75 shows how the SRAM clocks on a channel are connected. For receiving data from the SRAMs, the clock path and data path are matched to meet hold time requirements. Figure 75. SRAM Clock Connection on a Channel SRAM SRAM Intel® IXP2800 Network Processor C, C_n K, K_n A9734-02 It is also possible to pipeline the SRAM signals with external registers. This is useful for the case when there is considerable loading on the address and data signals, which would slow down the cycle time. The pipeline stages make it possible to keep the cycle time fast by fanning out the address, byte write, and data signals. The RAM read data may also be put through a pipeline register, depending on configuration. External decoding of port selects can also be done to expand the number of SRAMs supported. Figure 76 is a block diagram that shows the concept of external pipelining. 210 Hardware Reference Manual Intel® IXP2800 Network Processor SRAM Interface A side-effect of the pipeline registers is to add latency to reads, and the SRAM controller must account for that delay by waiting extra cycles (relative to no external pipeline registers) before it registers the read data. The number of extra pipeline delays is programmed in SRAM_Control[Pipeline]. Figure 76. External Pipeline Registers Block Diagram SRAM SRAM Intel® IXP2800 Network Processor Q Register Addr, BWE, etc. Register A9735-01 6.4 Command Overview This section will give an overview of the SRAM commands and their operation. The details will be given later in the document. Memory reference ordering will be specified along with the detailed command operation. 6.4.1 Basic Read/Write Commands The basic read and write commands will transfer from 1 – 16 longwords of data to or from the QDR SRAM external to the IXP2800 Network Processor. For a read command, the SRAM is read and the data placed on the Push bus, one longword at a time. The command source (for example, the Microengine) is signaled that the command is complete during the last data phase of the push bus transfer. 
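To make the read-completion signaling concrete, the following C fragment models the per-longword push loop just described (the function and parameter names are hypothetical); the write case, described next, mirrors this behavior on the pull bus:

    #include <stdint.h>
    #include <stdbool.h>

    /* Behavioral model: return up to 16 longwords of read data over the
     * push bus, one longword per data phase, and signal the command
     * source only on the last phase.                                    */
    static void sram_read_push(const uint32_t *sram_data, int ref_count,
                               void (*push)(uint32_t data, bool last))
    {
        for (int i = 0; i < ref_count; i++) {
            bool last = (i == ref_count - 1);
            push(sram_data[i], last);  /* 'last' carries the completion signal */
        }
    }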
For a write command, the data is first pulled from the source, then written to the SRAM in consecutive SRAM cycles. The command source is signaled that the command is complete during the last data phase of the pull bus transfer. If a read operation stalls due to the pull-data FIFO filling, any concurrent write operation that is in progress to the same address is temporarily stopped. This technique results in atomic data reads. 6.4.2 Atomic Operations The SRAM Controller does read-modify-writes for the atomic operations, and the pre-modified data can be returned if desired. Other (non-atomic) readers and writers can access the addressed location between the read and write portions of the read-modify-write. Table 72 describes the atomic operations supported by the SRAM Controller. Hardware Reference Manual 211 Intel® IXP2800 Network Processor SRAM Interface Table 72. Atomic Operations Instruction Pull Operand Value Written to SRAM Set_bits Optional1 SRAM_Read_Data or Pull_Data Clear_bits Optional SRAM_Read_Data and not Pull_Data Increment No SRAM_Read_Data + 0x00000001 Decrement No SRAM_Read_Data - 0x00000001 Add Optional SRAM_Read_Data + Pull_Data Swap Optional Pull_Data 1. There are two versions of the Set, Clear, Add, and Swap instructions. One version pulls operand data from the Microengine transfer registers, and the second version passes the operand data to the SRAM unit as part of the command. Up to two Microengine signals are assigned to each read-modify-write reference. Microcode should always tag the read-modify-write reference with an even-numbered signal. If the operation requires a pull, the requested signal is sent on the pull. If the pre-modified data is to be returned to the Microengine, then the Microengine is sent (requested signal OR 1) when that data is pushed. In Example 28, the version of Test_and_Set requires both a pull and a push: Example 28. SRAM Test_and_Set with Pull Data immed [$xfer0, 0x1] SRAM[test_and_set, $xfer0, test_address, 0, 1], sig_done_2 // SIGNAL_2 is set when $xfer0 is pulled from this ME. SIGNAL_3 is // set when $xfer0 is written with the test value. Sleep until both // SIGNALS have arrived. CTX_ARB[signal_2, signal_3] In Example 29, the version of Test_and_Set does not require a pull, but does issue a push. A signal is generated when the push is complete. Example 29. SRAM Test_and_Set with Optional No-Pull Data #define #define #define #define no_pull_mode_bit 24 byte_mask_override_bit 20 no_pull_data_bit 12 upper_part_bit 21 // This constant can be created once at init time ALU[no_pull_constant, --, b, 0x3, < ) in the link longword. A ring is an ordered list of data words stored in a fixed block of contiguous addresses. A ring is described by a head pointer and a tail pointer. Data is written, using the put command, to a ring at the address contained in the tail pointer and the tail pointer is incremented. Data is read, using the get command, from a ring at the address contained in the head pointer and the head pointer is incremented. Whenever either pointer reaches the end of the ring, the pointer is wrapped back to the address of the start of the ring. A journal is similar to a ring. It is generally used for debugging. Journal commands only write to the data structure. New data overwrites the oldest data. Microcode can choose to tag the journal data with the Microengine number and CTX number of the journal writer. The Q_array to support queuing, rings and journals contains 64 registers per SRAM channel. 
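Before turning to how queue descriptors are cached in the Q_array, the put/get behavior of a ring described above can be summarized in C. This is only a behavioral sketch of the head/tail arithmetic with illustrative names; the real state lives in the Q_array registers and the ring data lives in SRAM, and the caller is responsible for ensuring there is room before a put:

    #include <stdint.h>

    struct ring {
        uint32_t *base;   /* start of the fixed block reserved for the ring */
        uint32_t  size;   /* number of longwords in the ring                */
        uint32_t  head;   /* next longword to read (get)                    */
        uint32_t  tail;   /* next longword to write (put)                   */
        uint32_t  count;  /* longwords currently on the ring                */
    };

    static void ring_put(struct ring *r, uint32_t word)
    {
        r->base[r->tail] = word;
        r->tail = (r->tail + 1) % r->size;  /* wrap back to the start */
        r->count++;
    }

    static int ring_get(struct ring *r, uint32_t *word)
    {
        if (r->count == 0)
            return -1;                      /* ring empty */
        *word = r->base[r->head];
        r->head = (r->head + 1) % r->size;
        r->count--;
        return 0;
    }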
For a design with a large number of queues, the queue descriptors cannot all be stored on the chip, and thus a subset of the queue descriptors (16) is cached in the Q_array. (To implement the cache, 16 contiguous Q_array registers must be allocated.) The cache tag (the mapping of queue number to Q_array registers) for the Q_array is maintained by microcode in the CAM of a Microengine. The writeback and load of the cached registers in the Q_array is under the control of that microcode. Note: The size of the Q_array does not set a limit on the number of queues supported. For other queues (free buffer pools, for example), rings, and journals, the information does not need to be subsetted and thus can be loaded into the Q_array at initialization time and left there to be updated solely by the SRAM controller. The sum total of the cached queue descriptors plus the number of rings, journals, and static queues must be less than or equal to 64 for a given SRAM channel. The fields and sizes of the Q_array registers are shown in Table 73 and Table 74. All addresses are of type longword, and are 32 bits in length. Table 73. Queue Format Name Longword Number Bit Number1 Definition EOP 0 31 End of Packet — decrement Q_count on dequeue SOP 0 30 Start of Packet — used by the programmer Cell Count 0 29:24 Number of cells in the buffer Head 0 23:0 Head pointer Tail 1 23:0 Tail pointer Q_count 2 23:0 Number of packets on the queue or number of buffers on the queue SW_Private 2 31:24 Ignored by hardware, returned to Microengine Head Valid N/A Cached head pointer valid — maintained by hardware Tail Valid N/A Cached tail pointer valid — maintained by hardware 1. Bits 31:24 of longword number 2 are available for use by microcode. Hardware Reference Manual 215 Intel® IXP2800 Network Processor SRAM Interface Table 74. Ring/Journal Format Longword Number Name Note: Ring Size 0 Bit Number 31:29 Definition See Table 75 for size encoding. Head 0 23:0 Get pointer Tail 1 23:0 Put pointer Ring Count 2 23:0 Number of longwords on the ring For a Ring or Journal, Head and Tail must be initialized to the same address. Journals/Rings can be configured to be one of eight sizes, as shown in Table 75. Table 75. Ring Size Encoding Ring Size Encoding Size of Journal/Ring Area Head/Tail Field Base Head and Tail Field Increment 000 512 longwords 23:9 8:0 001 1K 23:10 9:0 010 2K 23:11 10:0 011 4K 23:12 11:0 100 8K 23:13 12:0 101 16K 23:14 13:0 110 32K 23:15 14:0 111 64K 23:16 15:0 The following sections contain pseudo-code to describe the operation of the various queue and ring instructions. Note: 6.4.3.1 For these examples, NIL is the value 0. Read_Q_Descriptor Commands These commands are used to bring the queue descriptor data from QDR SRAM memory into the Q_array. Only portions of the Q_descriptor are read with each variant of the command, to minimize QDR SRAM bandwidth utilization. It is assumed that microcode has previously evicted the Q_descriptor data for the entry prior to overwriting the entry data with the new Q_descriptor data. Refer to the section, “SRAM (Read Queue Descriptor)”, in the IXP2400 and IXP2800 Network Processor Programmer’s Reference Manual, for more information. . 6.4.3.2 Write_Q_Descriptor Commands The write_Q_descriptor commands are used to evict an entry in the Q_array and return its contents to QDR SRAM memory. Only the valid fields of the Q_descriptor are written, to minimize QDR SRAM bandwidth utilization. 
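For reference, the three-longword queue-descriptor layout of Table 73 can be expressed as C accessors. This is a hypothetical host-side view, used here only to make the bit positions concrete:

    #include <stdint.h>
    #include <stdbool.h>

    /* Longword 0: EOP[31], SOP[30], Cell Count[29:24], Head pointer[23:0] */
    /* Longword 1: Tail pointer[23:0]                                      */
    /* Longword 2: SW_Private[31:24], Q_count[23:0]                        */
    static bool     q_eop(const uint32_t lw[3])        { return (lw[0] >> 31) & 0x1; }
    static bool     q_sop(const uint32_t lw[3])        { return (lw[0] >> 30) & 0x1; }
    static uint32_t q_cell_count(const uint32_t lw[3]) { return (lw[0] >> 24) & 0x3F; }
    static uint32_t q_head(const uint32_t lw[3])       { return lw[0] & 0x00FFFFFF; }
    static uint32_t q_tail(const uint32_t lw[3])       { return lw[1] & 0x00FFFFFF; }
    static uint32_t q_count(const uint32_t lw[3])      { return lw[2] & 0x00FFFFFF; }
    static uint32_t q_sw_private(const uint32_t lw[3]) { return (lw[2] >> 24) & 0xFF; }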
Refer to the section, “SRAM (Write Queue Descriptor)”, in the IXP2400 and IXP2800 Network Processor Programmer’s Reference Manual, for more information. 216 Hardware Reference Manual Intel® IXP2800 Network Processor SRAM Interface 6.4.3.3 ENQ and DEQ Commands These commands add or remove elements from the queue structure while updating the Q_array registers. Refer to the sections, “SRAM (Enqueue)” and “SRAM (Dequeue)”, in the IXP2400 and IXP2800 Network Processor Programmer’s Reference Manual, for more information. 6.4.4 Ring Data Structure Commands The ring structure commands use the Q_array registers to hold the head tail and count data for a ring data structure, which is a fixed-size array of data with insert and remove pointers. Refer to the section, “SRAM (Ring Operations)” in the IXP2400 and IXP2800 Network Processor Programmer’s Reference Manual, for more information. 6.4.5 Journaling Commands Journaling commands use the Q_array registers to index into an array of memory in the QDR SRAM that will be periodically written with information to help debug applications running on the IXP2400 and IXP2800 processors. Once the array has been completely written once, subsequent journal writes overwrite the previously written data — only the most recent data will be present in the data structure. Refer to the section, “SRAM (Journal Operations)”, in the IXP2400 and IXP2800 Network Processor Programmer’s Reference Manual, for more information. 6.4.6 CSR Accesses CSR accesses will write or read CSRs within each controller. The upper address bits will determine which channel will respond, while the CSR address within a channel are given in the lower address bits. 6.5 Parity SRAM can be optionally protected by byte parity. Even parity is used — the combination of eight data bits and the corresponding parity bit will have an even number of ‘1s’. The SRAM controller generates parity on all SRAM writes. When parity is enabled (SRAM_Control[Par_Enable]), the SRAM controller checks for correct parity on all reads. Upon detection of a parity error on a read, or the read portion of an atomic read-modify-write, the SRAM controller records the address of the location with bad parity in SRAM_Parity[Address] and sets the appropriate SRAM_Parity[Error] bit(s). Those bit(s) interrupt the Intel XScale® core when enabled in IRQ_Enable[SRAM_Parity] or FIQ_Enable[SRAM_Parity]. The Data Error signal in the Push_CMD is asserted when the data to be read is delivered (unless the token Ignore Data Error was asserted in the command; in that case, the SRAM controller does not assert Data Error). When Data Error is asserted, the Push Arbiter suppresses the Microengine signal if the read was originated by a Microengine (it uses 0x0, which is a null signal, in place of the requested signal number). Hardware Reference Manual 217 Intel® IXP2800 Network Processor SRAM Interface Note: If incorrect parity is detected on the read portion of an atomic read-modify-write, the incorrect parity is preserved after the write (that is, the byte(s) with bad parity during the read will have incorrect parity written during the write). When parity is used, the Intel XScale® core software must initialize the SRAMs by: 1. Enabling parity (write a 1 to SRAM_Control[Par_Enable]). 2. Writing to every SRAM address. SRAM should not be read prior to doing the above initialization; otherwise, parity errors are likely to be recorded. 6.6 Address Map Each SRAM channel occupies a 1-Gbyte region of addresses. 
Channel 0 starts at 0, Channel 1 at 1 Gbyte, etc. Each SRAM controller receives commands from the command buses. It compares the target ID to the SRAM target ID, and address bits 31:30 to the channel number. If they both match, then the controller processes the command. See Table 76. Table 76. Address Map Note: Start Address End Address Responder 0x0000 0000 0x3FFF FFFF Channel 0 0x4000 0000 0x7FFF FFFF Channel 1 0x8000 0000 0xBFFF FFFF Channel 2 0xc000 0000 0xFFFF FFFF Channel 3 If an access addresses a non-existent address within an SRAM controller’s address space, the results are unpredictable.For example the result of accessing address 0x0100 0000 when there is only one Mbyte of SRAM populated on the channel, produces unpredictable results. For SRAM (memory or CSR) references from the Intel XScale® core, the channel select is in address bits 29:28. The Gasket shifts those bits to 31:30 to match addresses generated by the Microengines. Thus, the SRAM channel select logic is the same whether the command source is a Microengine or the Intel XScale® core. The same channel start and end addresses are used both for SRAM memory and CSR references. CSR references are distinguished from memory references through the CSR encoding in the command field. Note: Reads and writes to undefined CSR addresses yield unpredictable results. The IXP2800 addresses are byte addresses. The fundamental unit of operation of the SRAM controller is longword access, so the SRAM controller ignores the two low-order address bits in all cases and utilizes the byte mask field on memory address space writes to determine the bytes to write into the SRAM. Any combination of the four bytes can be masked. The operation of byte writes with a length other than 1 are unpredictable. That is, microcode should not use a ref_count greater than one longword when a byte_mask is active. CSRs are not bytewritable. 218 Hardware Reference Manual Intel® IXP2800 Network Processor SRAM Interface 6.7 Reference Ordering This section describes the ordering between accesses to any one SRAM controller. Various mechanisms are used to guarantee order — for example, references that always go to the same FIFOs remain in order. There is a CAM associated with write addresses that is used to order reads behind writes. Lastly, several counter pairs are used to implement “fences”. The input counter is tagged to a command and the command is not permitted to execute until the output counter matches the fence tag. All of this will be discussed in more detail in this section. 6.7.1 Reference Order Tables Table 77 shows the architectural guarantees of order of accesses to the same SRAM address between a reference of any given type (shown in the column labels) and a subsequent reference of any given type (shown in the row labels). The definition of first and second is defined by the time the command is valid on the command bus. Verification requires testing only the order rules shown in Table 77 and Table 78. Note that a blank entry means no order is enforced. Table 77. Address Reference Order 1st ref 2nd ref Memory Read Memory Read CSR Read Memory Write Order CSR Read Memory Write Queue / Ring / Q_ Descr Commands Atomics Order Order Order CSR Write Atomics CSR Write Queue / Ring / Q_Descr Commands Order Order Order Order See Table 78. 
Table 78 shows the architectural guarantees of order to access to the same SRAM Q_array entry between a reference of any given type (shown in the column labels) and a subsequent reference of any given type (shown in the row labels). The terms first and second are defined with reference to the time the command is valid on the command bus. The same caveats that apply to Table 77 apply to Table 78. Hardware Reference Manual 219 Intel® IXP2800 Network Processor SRAM Interface Table 78. Q_array Entry Reference Order 1st ref 2nd ref Read_Q _Descr head, tail Read_ Q_Des cr other Read_Q_Descr head,tail Read_Q_ Descr other Write_Q _Descr Enqueue Dequeue Put Get Journal Order1 Order Write_Q_ Descr2 Enqueue Order Order Order Order3 Dequeue Order Order Order3 Order Put Order Get Order Journal 1. 2. 3. 6.7.2 Order The order of Read_Q_Descr_head/tail after Write_Q_Descr to the same element will be guaranteed only if it is to a different descriptor SRAM address. The order of Read_Q_Descr_head/tail after Write_Q_Descr to the same element with the same descriptor SRAM address is not guaranteed and should be handled by the Microengine code. Write_Q_Descr reference order is not guaranteed after any of the other references. The Queue array hardware assumes that the Microengine managing the cached entries will flush an element ONLY when it becomes the LRU in the Microengine CAM. Using this scheme, the time between the last use of this element and the write reference is sufficient to guarantee the order. Order between Enqueue references and Dequeue references are guaranteed only when the Queue is empty or near empty. Microcode Restrictions to Maintain Ordering The microcode programmer must ensure order where the program flow requires order and where the architecture does not guarantee that order. One mechanism that can be used to do this is signaling. For example, if the microcode needs to update several locations in a table, a location in SRAM can be used to lock access to the table. Example 31 is the microcode for this table update. Example 31. Table Update Microcode IMMED [$xfer0, 1] SRAM [write, $xfer0, ; At this point, the the table updates. SRAM [write, $xfer1, SRAM [write, $xfer3, CTX_ARB [SIG_DONE_3, ; At this point, the flag to allow access IMMED [$xfer0, 0] SRAM [write, $xfer0, 220 flag_address, 0, 1, ctx_swap [SIG_DONE_2] write to flag_address has passed the point of coherency. Do table_base, offset1, 2] , sig_done [SIG_DONE_3] table_base, offset2, 2] , sig_done [SIG_DONE_4] SIG_DONE_4] table writes have passed the point of coherency. Clear the by other threads. flag_address, 0, 1, ctx_swap [SIG_DONE_2] Hardware Reference Manual Intel® IXP2800 Network Processor SRAM Interface Other microcode rules: • All access to atomic variables should be through read-modify-write instructions. • If the flow must know that a write is completed (actually in the SRAM itself), follow the write with a read to the same address. The write is guaranteed to be complete when the read data has been returned to the Microengine. • With the exception of initialization, never do write commands to the first three longwords of a queue_descriptor data structure (these are the longwords that hold head, tail, and count). All accesses to this data must be through the Q commands. • To initialize the Q_array registers, perform a memory write of at least three longwords, followed by a memory read to the same address (to guarantee that the write completed). 
Then, for each entry in the Q_array, perform a read_q_descriptor_head followed by a read_q_descriptor_other using the address of the same three longwords. 6.8 Coprocessor Mode Each SRAM controller may interface to an external coprocessor through its standard QDR interface. This interface allows for the cohabitation of both SRAM devices and coprocessors operating on the same bus, and the coprocessor behaves as a memory-mapped device on the SRAM bus. Figure 82 is a simplified block diagram of the SRAM controller. Figure 82 shows the connection to a coprocessor through a standard QDR interface. Note: Most coprocessors do not need a large number of address bits — connect as many bits of An as required by the coprocessor. Figure 82. Connection to a Coprocessor Though Standard QDR Interface SRAM_Control Coproessor SRAM Push Bus Internal Bus Control State Mechanics Push Data FIFO Read Cmd FIFO Read Data Read Address SRAM Cmd Bus Qn[17:0] RPE_Ln[1:0] An[x:0] Write Cmd FIFO SRAM Pull Bus Pin Control State Mechanics Coprocessor Pull Data FIFO Write Address BWEn[1:0] Write Data WPE_Ln[1:0] Dn[17:0] A9746-01 Hardware Reference Manual 221 Intel® IXP2800 Network Processor SRAM Interface The external coprocessor interface is based on FIFO communication. A thread can send parameters to the coprocessor by doing a normal SRAM write instruction: sram[write, $sram_xfer_reg, src1, src2, ref_count], optional_token The number of parameters (longwords) passed is specified by ref_count. The address can be used to support multiple coprocessor FIFO ports. The coprocessor performs some operation using the parameters, and then will later pass back some number of results values (the number of parameters and results will be known by the coprocessor designers). The time between the input parameter and return values is not fixed; it is determined by the amount of time the coprocessor requires to do its processing and can be variable. When the coprocessor is ready to return the results, it signals back to the SRAM controller through a mailbox-valid bit that the data in the read FIFO is valid. A thread can get the return values by doing a normal SRAM read instruction: sram[read, $sram_xfer_reg, src1, src2, ref_count], optional_token Figure 83 shows the coprocessor with 1-to-n memory-mapped FIFO ports. Figure 83. Coprocessor with Memory Mapped FIFO Ports Coprocessor RPE_Ln[0] Read Control Logic Port 1 Mail Box Valid Port 2 Mail Box Valid Network Processor FIFO An[x:0] Port n Mail Box FIFO … Qn[17:0] FIFO Valid BWEn[1:0] Write Control Logic Port 1 Mail Box FIFO Port 2 Mail Box FIFO Port n Mail Box FIFO … WPE_Ln[0] Dn[17:0] A9749-01 If the read instruction executes before the return values are ready, the coprocessor signals data-invalid through the mailbox register on the read data bus (Qn[17:0]). Signaling a thread upon pushing its read data works exactly as in a normal SRAM read. 222 Hardware Reference Manual Intel® IXP2800 Network Processor SRAM Interface There can be multiple operations in progress in the coprocessor. The SRAM controller sends parameters to the coprocessor in response to each SRAM write instruction without waiting for return results of previous writes. If the coprocessor is capable of re-ordering operations — i.e., returning the results for a given operation before returning the results of an earlier arriving operation — Microengine code must manage matching results to operations. 
Tagging the operation by putting a sequence value into the parameters, and having the coprocessor copy that value into the results is one way to accomplish this requirement. Flow control is under the Network Processor’s Microengine control. A Microengine thread accessing a coprocessor port maintains a count of the number of entries in that coprocessor’s writeFIFO port. Each time an entry is written to that coprocessor port, the count is incremented. When a valid entry is read from that coprocessor read-port, the count is decremented by the thread. Hardware Reference Manual 223 Intel® IXP2800 Network Processor SRAM Interface 224 Hardware Reference Manual Intel® IXP2800 Network Processor SHaC — Unit Expansion SHaC — Unit Expansion 7 This section covers the operation of the Scratchpad, Hash Unit, and CSRs (SHaC). 7.1 Overview The SHaC unit is a multifunction block containing Scratchpad memory and logic blocks used to perform hashing operations and interface with the Intel XScale® core peripherals and control status registers (CSRs) through the Advanced Peripheral Bus (APB) and CSR buses, respectively. The SHaC also houses the global registers, as well as Reset logic. The SHaC unit has the following features: • Communication to Intel XScale® core peripherals, such as GPIOs and timers, through the APB. • • • • • Creation of hash indices of 48-, 64-, or 128-bit widths. Communication ring used by Microengines for interprocess communication. Third-option memory storage usable by Intel XScale® core and Microengines. CSR bus interface to permit fast writes to CSRs, as well as standard read and writes. Push/Pull Reflector to transfer data from the Pull bus to the Push bus. The CSR and ΑRM* Advanced Peripheral Bus (APB) bus interfaces are controlled by the Scratchpad state machine and will be addressed in the Scratchpad design detail section. (See Section 7.1.2.) Note: 7.1.1 Detailed information about CSRs is contained in the Intel® IXP2400 and IXP2800 Network Processor Programmer’s Reference Manual. SHaC Unit Block Diagram The SHaC unit contains two functional units: the Scratchpad and Hash Unit. Each will be described in greater detail in the following sections. The CAP and APB interfaces are described as part of the Scratchpad description. Hardware Reference Manual 225 Intel® IXP2800 Network Processor SHaC — Unit Expansion Figure 84. SHaC Top Level Diagram SH_APB_CTL Command Arbiters TAXX_CMD_BUS_B SH_CMDQ_FULL Scratch/CAP SH_APB_WR_DATA Control XP_RD_DATA Logic XP_RDY SP0_PULL_DATA Intel XScale® Core SH_CSR_CTL SP1_PULL_DATA Pull Arbiters SH_CSR_WR_DATA SP0_PULLQ_FULL CSR_RD_DATA SP1_PULLQ_FULL CSRs CSR_RDY SP0_TAKE_DATA SP1_TAKE_DATA SH_PULL_CMD SP0_PUSHQ_FULL Push Arbiters SP1_PUSHQ_FULL SH_PUSH_ID SH_PUSH_DE SH_PUSH_DATA SCR_HASH_TAKE_PULL1_DATA SCR_HASH_TAKE_PULL0_DATA Scratch RAM (4 K x 32) SCR_SEND_HASH_DATA SCR_HASH_CMD Hash Control Logic HASH_PUSH_DATA_REQ HASH_PUSH_DATA HASH_PUSH_CMD A9751-03 226 Hardware Reference Manual Intel® IXP2800 Network Processor SHaC — Unit Expansion 7.1.2 Scratchpad 7.1.2.1 Scratchpad Description The SHaC Unit contains a 16-Kbyte Scratchpad memory, organized as 4K 32-bit words, that is accessible by the Intel XScale® core and Microengines. The Scratchpad connects to the internal Command, S_Push and S_Pull, CSR, and APB buses, as shown in Figure 85. The Scratchpad memory provides the following operations: • Normal reads and writes. 1 — 16 longwords (32 bits) can be read/written with a single command. Note that Scratchpad is not byte-writable. 
Each write must write all four bytes. • Atomic read-modify-write operations, bit-set, bit-clear, increment, decrement, add, subtract, and swap. The Read-Modify-Write (RMW) operations can also optionally return the premodified data. • 16 Hardware Assisted Rings for interprocess communication.1 • Standard support of APB peripherals such as UART, Timers, and GPIOs through the ARM* Advanced Peripheral Bus (APB). • Fast write and standard read and write operations to CSRs through the CSR Bus. For a fast write, the write data is supplied with the command, rather than pulled from the source. • Push/Pull Reflector Mode that supports reading from a device on the pull bus and writing the data to a device on the push bus (reflecting the data from one bus to the other). A typical implementation of this mode is to allow a Microengine to read or write the transfer registers or CSRs in another Microengine. Note that the Push/Pull Reflector Mode only connects to a single Push/Pull bus. If a chassis implements more than one Push/Pull bus, it can only connect one specific bus to the CAP. Scratchpad memory is provided as a third memory resource (in addition to SRAM and DRAM) that is shared by the Microengines and Intel XScale® core. The Microengines and Intel XScale® core can distribute memory accesses between these three types of memory resources to provide a greater number of memory accesses occurring in parallel. 1. A ring is a FIFO that uses a head and tail pointer to store/read information in Scratchpad memory. Hardware Reference Manual 227 Intel® IXP2800 Network Processor SHaC — Unit Expansion Figure 85. Scratchpad Block Diagram CSR_FAST_WR_DATA Scratchpad State Machine SH_PULL_ID APB_CONTROL_SIGNALS To XPI SH_PUSH_ID SH_PUSH_DE PULL_CMD GENERATOR SP1_PULLQ_FULL SCR_HASH_CMD CMD_PIPE_FULL From PULL1 Arb SP0_TAKE_DATA HASH_TAKE_ PULL0_DATA HASH_PUSH_DATA_REQ HASH_TAKE_ PULL1_DATA SP0_PULL_DATA SP1_PULL_DATA HASH_PUSH_CMD TAKE_DATA CONTROL SCR_TAKE_PULL1_DATA SP1_TAKE_DATA SCR_TAKE_PULL0_DATA From PULL0 Arb To Hash From Pull ArbS SCR_SEND_HASH_DATA SCR_READ_DATA_SEL SCR_ADDR SCR_RD SCR_WR To Hash SP0_PULLQ_FULL To CSRs From Hash SH_PULL_LEN CSR_CONTROL_SIGNALS To Push Arb CMD_INLET_ QUEUE SCR_PUSH_DATA_SEL APB_READ_DATA (from Intel XScale® Core PULL0 FIFO (16 x 32 bit) CSR_READ_DATA (from CSRs) PULL1 FIFO (16 x 32 bit) HASH_PUSH_DATA (from Hash Unit) Scratchpad RAM SCR_READ_DATA PUSH_ DATA To Push Arb From CMD Arb 8-Stage CMD Pipe TA_CMD_ BUS_B (4 K x 32) SH_CSR_WR_DATA SCR_PULL_DATA SCR_RMW_DATA SH_APB_WR_DATA A9756-02 228 Hardware Reference Manual Intel® IXP2800 Network Processor SHaC — Unit Expansion 7.1.2.2 Note: Scratchpad Interface The Scratchpad command and S_Push and S_Pull bus interfaces actually are shared with the Hash Unit. Only one command, to either of those units, can be accepted per cycle. The CSR and APB buses are described in detail in the following sections. 7.1.2.2.1 Command Interface The Scratchpad accepts commands from the Command Bus and can accept one command every cycle. For Push/Pull reflector write and read commands, the command bus is rearranged before being sent to the Scratchpad state machine to allow a single state (REFLECT_PP) to be used to handle both commands. 7.1.2.2.2 Push/Pull Interface The Scratchpad has the capability to interface to either one or two pairs of push/pull (PP) bus pairs. The interface from the Scratchpad to the PP bus pair is through the Push/Pull Arbiters. 
Each PP bus has a separate Push and Pull arbiter through which access to the Push bus and Pull bus, respectively, is regulated. Refer to the SRAM Push Arbiter and SRAM Pull Arbiter chapters for more information. When the Scratchpad is used in a chip that only utilizes one pair of PP buses, the other interface is unused. 7.1.2.2.3 CSR Bus Interface The CSR Bus provides fast write and standard read and write operations from the Scratchpad to the CSRs in the CSR block. 7.1.2.2.4 Advanced Peripherals Bus Interface (APB) The Advanced Peripheral Bus (APB) is part of the Advanced Microcontroller Bus Architecture (AMBA) hierarchy of buses that are optimized for minimal power consumption and reduced design complexity. Note: 7.1.2.3 The SHaC Unit uses a modified APB interface in which the APB peripheral is required to generate an acknowledge signal (APB_RDY_H) during read operations. This is done to indicate that valid data is on the bus. The addition of the acknowledge signal is an enhancement added specifically for the IXP2800 Network Processor architecture. (For more details refer to the ARM* AMBA Specification 1.6.1.3.) Scratchpad Block Level Diagram Scratchpad Command Overview This section describes the operations performed for each Scratchpad command. Command order is preserved because all commands go through a single command inlet FIFO. When a valid command is placed on the command bus, the control logic checks the instruction field for the Scratchpad or CAP ID. The command, address, length, etc., are enqueued into the Command Inlet FIFO. If the command requires pull data, signals are generated and immediately sent to the Pull Arbiter. The command is pushed from the Inlet FIFO to the command pipe where it is serviced according to the command type. Hardware Reference Manual 229 Intel® IXP2800 Network Processor SHaC — Unit Expansion If the Command Inlet FIFO becomes full, the Scratchpad controller sends a full signal to the command arbiter that prevents it from sending further Scratchpad commands. 7.1.2.3.1 Scratchpad Commands The basic read and write commands transfer from 1 – 16 longwords of data to/from the Scratchpad. Reads When a read command is at the head of the Command queue, the Push Arbiter is checked to see if it has enough room for the data. If so, the Scratchpad RAM is read, and the data is sent to the Push Arbiter one 32-bit word at a time (the Push_ID is updated for each word pushed). The Push Data is sent to the specified destination. The read data is placed on the S_Push bus, one 32-bit word at a time. If the master is a Microengine, it is signaled that the command is complete during the last phase of the push bus transfer. Other masters (Intel XScale® core and PCI) must count the number of data pushes to know when the transfer is complete. Writes When a write command is at the head of the Command Inlet FIFO, signals are sent to the Pull Arbiter. If there is room in the queue, the command is sent to the Command pipe. When a write command is at the head of the Command pipe, the command waits for a signal from the Pull Data FIFO, indicating that the data to be written is valid. Once the first longword is received, the data is written on consecutive cycles to the Scratchpad RAM until the burst (up to 16 longwords) is completed. If the master is a Microengine, it is signaled that the command is complete during the last pull bus transfer. Other masters (Intel XScale® core and PCI) must count the number of data pulls to know when the transfer is complete. 
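The write flow just described amounts to a small loop: wait for each pulled longword, write it into the RAM on consecutive cycles, and treat the burst as complete when the pull count reaches the reference count, which is how the non-Microengine masters detect completion. A behavioral C sketch with hypothetical names:

    #include <stdint.h>

    /* Behavioral model of a Scratchpad burst write of up to 16 longwords.
     * pull_next() models one longword arriving from the Pull Data FIFO.  */
    static void scratch_burst_write(uint32_t *scratch_ram, uint32_t addr,
                                    int ref_count, uint32_t (*pull_next)(void))
    {
        for (int pulled = 0; pulled < ref_count; pulled++) {
            uint32_t data = pull_next();        /* wait for valid pull data */
            scratch_ram[addr + pulled] = data;  /* consecutive-cycle writes */
        }
        /* Completion: a Microengine master is signaled on the last pull;
         * the Intel XScale core and PCI count the pulls (== ref_count).  */
    }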
Atomic Operations The Scratchpad supports the following atomic operations. • • • • • • • bit set bit clear increment decrement add subtract swap The Scratchpad does read-modify-writes for the atomic operations, and the pre-modified data also can be returned, if desired. The atomic operations operate on a single longword. There is one cycle between the read and write while the modification is done. In that cycle, no operation is done, so an access cycle is lost. When a read-modify-write command requiring pull data from a source is at the head of the Command Inlet FIFO, a signal is generated and sent to the Pull Arbiter (if there is room). 230 Hardware Reference Manual Intel® IXP2800 Network Processor SHaC — Unit Expansion When the RMW command reaches the head of the Command pipe, the Scratchpad reads the memory location in the RAM. If the source requests the pre-modified data (Token[0] set), it is sent to the Push Arbiter at the time of the read. If the RMW requires pull data, the command waits for the data to be placed into the Pull Data FIFO before performing the operation; otherwise the operation is performed immediately. Once the operation has been performed, the modified data is written back to the Scratchpad RAM. Up to two Microengine signals are assigned to each read-modify-write reference. Microcode should always tag the read-modify-write reference with an even-numbered signal. If the operation requires a pull, then the requested signal is sent on the pull. If the read data is to be returned to the Microengine, then the Microengine is sent (requested signal OR 1) when that data is pushed. For all atomic operations, whether or not the read data is returned, is determined by Command bus Token[0]. Note: 7.1.2.3.2 The Intel XScale® core can do atomic commands using aliased addresses in Scratchpad. An Intel XScale® core Store instruction to an atomic command address will do the RMW without returning the read data, and an Intel XScale® core Swap instruction (SWP) to an atomic command address will do the RMW and return the read data to Intel XScale® core. Ring Commands The Scratchpad provides 16 Rings used for interprocess communication. The rings provide two operations. • Get(ring, length) • Put(ring, length) Ring is the number of the ring (0 — 15) to get from or put to, and length specifies the number of longwords to transfer. A logical view of one of the rings is shown in Figure 86. Figure 86. Ring Communication Logic Diagram Address Decoder Scratchpad RAM Read/Write/Atomic Addresses 1 of 16 Head Tail Base Size Full A9757-01 Hardware Reference Manual 231 Intel® IXP2800 Network Processor SHaC — Unit Expansion Head, Tail, Base, and Size are registers in the Scratchpad Unit. Head and Tail point to the actual ring data, which is stored in the Scratchpad RAM. For each ring in use, a region of Scratchpad RAM must be reserved for the ring data. The reservation is by software convention. The hardware does not prevent other accesses to the region of Scratchpad used by the ring. Also, the regions of Scratchpad memory allocated to different rings must not overlap. Head points to the next address to be read on a get, and Tail points to the next address to be written on a put. The size of each ring is selectable from the following choices: 128, 256, 512, or 1,024 32-bit words. The size is specified in the Ring_Base register. Note: The above rule stating that rings must not overlap implies that many configurations are not legal. 
For example, programming five rings to a size of 1024 words would exceed the total size of Scratchpad memory, and therefore is not legal. Note: The region of Scratchpad used for a ring is naturally aligned to its size. Each ring asserts an output signal that is used as a state input to the Microengines. The software configures whether the Scratchpad asserts the signal if a ring becomes empty or if the ring is nearly full. If configured to assert status when the rings are nearly full, Microengines must test the input state (by doing Branch on Input Signal) before putting data onto a ring. There is a lag in time from a put instruction executing to the Full signal being updated to reflect that put. To be guaranteed that a put does not overfill the ring, there is a limit on the number of Contexts and the number of 32-bit words per write, based on the size of the ring, as shown in Table 79. Each Context should test the Full signal, then do the put if not Full, and then wait until the Context has been signaled that the data has been pulled, before testing the Full signal again. Table 79. Ring Full Signal Use – Number of Contexts and Length versus Ring Size Number of Contexts Ring Size 128 256 512 1024 1 16 16 16 16 2 16 16 16 16 4 8 16 16 16 8 4 12 16 16 16 2 6 14 16 24 1 4 9 16 32 1 3 7 15 40 Illegal 2 5 12 48 Illegal 2 4 10 64 Illegal 1 3 7 128 Illegal Illegal 1 3 NOTE: 1. Number in each table entry is the largest length that should be put. 16 is the largest length that a single put instruction can generate. 2. Illegal - With that number of Contexts, even a length of 1 could cause the ring to overfill. 232 Hardware Reference Manual Intel® IXP2800 Network Processor SHaC — Unit Expansion The ring commands operate as outlined in the pseudo-code in Example 32. The operations are atomic, meaning that multi-word “Gets” and “Puts” do all the reads and writes, with no other intervening Scratchpad accesses. Example 32. Ring Command Pseudo-Code GET Command Get(ring, length) If count[ring] >= length //enough data in the ring? ME <-- Scratchpad[head[ring]] // each data phase head[ring]+= length % ringSize count[ring] -= length else ME <--nil // 1 data phase signals read off empty list NOTE: The Microengine signal is delivered with last data. In the case of nil, the signal is delivered with the 1 data phase. PUT Command Before issuing a PUT command, it is the responsibility of the Microengine thread issuing the command to make sure the Ring has enough room. Put(ring, length) SRAM[tail[ring]] <-- ME pull data // each data phase tail[ring]+= length % ringSize Count[ring] += length Table 80. Head/Tail, Base, and Full Threshold – by Ring Size Size (Number of 32-Bit Words) Base Address Head/Tail Offset Full Threshold (Entries) 128 13:9 8:2 32 256 13:10 9:2 64 512 13:11 10:2 128 1024 13:12 11:2 256 NOTE: Note that bits [1:0] of the address are assumed to be 00. Prior to using the Scratchpad rings, software must initialize the Ring registers (by CSR writes). The Base address of the ring must be written, and also the size field that determines the number of 32-bit words for the Ring. Note: Detailed information about CSRs is provided in the Intel® IXP2400 and IXP2800 Network Processor Programmer’s Reference Manual. Writes For an APB or CAP CSR write, the Scratchpad arbitrates for the S_Pull_Bus, pulls the write data from the source identified in the instruction (either a Microengine transfer register or an Intel XScale® core write buffer), and puts it into one of the Pull Data FIFOs. 
It then drives the address and writes data onto the appropriate bus. CAP CSRs locally decode the address to match their own. The Scratchpad generates a separate APB device select signal for each peripheral device (up to 15 devices). If the write is to an APB CSR, the control logic maintains valid signaling until the APB_RDY_H signal is returned (the APB RDY signal is an extension to the APB bus specification, specifically added for the Network Processor). Upon receiving the APB_RDY_H signal, the APB select signal is deasserted and the state machine returns to the idle state between commands. The CAP CSR bus does not support a similar acknowledge signal on writes since the Fast Write functionality requires that a write operation be retired on each cycle. Hardware Reference Manual 233 Intel® IXP2800 Network Processor SHaC — Unit Expansion For writes using the Reflector mode, Scratchpad arbitrates for the S_Pull_Bus, pulls the write data from the source identified in the instruction (either a Microengine transfer register or an Intel XScale® core write buffer), and puts it into one of the Pull Data FIFOs (same as for APB and CAP CSR writes). The data is then removed from the Pull Data FIFO and sent to the Push Arbiter. For CSR Fast Writes, the command bypasses the Inlet Command FIFO and is acted on at first opportunity. The CSR control logic has an arbiter that gives highest priority to fast writes. If an APB write is in progress when a fast write arrives, both write operations will complete simultaneously. For a CSR fast write, the Scratchpad extracts the write data from the command, instead of pulling the data from a source over the Pull bus. It then drives the address and writes data to all CSRs on the CAP CSR bus, using the same method used for the CAP CSR write. The Scratchpad unit supports CAP write operations with burst counts greater than 1, except for fast writes, which only support a burst count of 1. Burst support is required primarily for Reflector mode and software must ensure that burst is performed to a non-contiguous set of registers. CAP looks at the length field on the command bus and breaks each count into a separate APB write cycle, incrementing the CSR number for each bus access. Reads For an APB read, the Scratchpad drives the address, write, select, and enable signals, and then waits for the acknowledge signal (APB_RDY_H) from the APB device. For a CAP CSR read, the address is driven, which controls a tree of multiplexers to select the appropriate CSR. CAP then waits for the acknowledge signal (CAP_CSR_RD_RDY). Note: The CSR bus can support an acknowledge signal since the read operations occur on a separate read bus and will not interfere with Fast Write operations. In both cases, when the data is returned, the data is sent to the Push Arbiter and the Push Arbiter pushes the data to the destination. For reads using the Reflector mode, the write data is pulled from the source identified in ADDRESS (either a Microengine transfer register or an Intel XScale® core write buffer), and put into one of the Scratchpad Pull Data FIFOs. The data is then sent to the Push Arbiter. The arbiter then moves the data to the destination specified in the command. Note that this is the same as a Reflector mode write, except that the source and destination are identified using opposite fields. The Scratchpad performs one read operation at a time. In other words, CAP does not begin an APB read until a CSR read has completed, or vice versa. 
This simplifies the design by ensuring that, when lengths are greater than 1, the data is sent to the Push Arbiter in a contiguous order and not interleaved with data from a read on the other bus. Signal Done CAP can provide a signal to a Microengine upon completion of a command. For APB and CAP CSR operations, CAP signals the Microengine using the same method as any other target. For Reflector mode reads and writes, CAP uses the TOKEN field of the Command to determine whether to signal the command initiator, the Microengine that is the target of the reflection, both, or neither. 234 Hardware Reference Manual Intel® IXP2800 Network Processor SHaC — Unit Expansion 7.1.2.3.3 Clocks and Reset Clock generation and distribution is handled outside of CAP and is dependent on the specific chip implementation. Separate clock rates are required for CAP CSRs/Push/Pull Buses and ARB since APB devices tend to run slower. CAP provides reset signals for the CAP CSR block and APB devices. These resets are based on the system reset signal and synchronized to the appropriate bus clock. Table 81 shows the Intel XScale® core and Microengine instructions used to access devices on these buses and it shows the buses that are used during the operation. For example, to read an APB peripheral such as a UART CSR, a Microengine would execute a csr[read] instruction and the Intel XScale® core would execute a Load (ld) instruction. Data is then moved between the CSR and the Intel XScale® core/Microengine by first reading the CSR via the APB bus and then writing the result to the Intel XScale® core/Microengine via the Push Bus. Table 81. Intel XScale® Core and Microengine Instructions Accessing Read Operation Access Method: APB Peripheral Write Operation Access Method: Microengine: csr[read] Microengine: csr[write] Intel XScale® core: ld Intel XScale® core: st Bus Usages: Bus Usages: Read source: APB bus Read source: Pull Bus Write dest: Push bus Write dest: APB bus Access Method: Microengine: csr[read] Intel XScale® core: ld Access Method: Microengine: csr[write], fast_wr Intel XScale® core: st Bus Usages: CAP CSR Bus Usages: Read source: CSR bus Write dest: Push bus csr[write] and st Read source: Pull Bus Write dest: CSR bus fast_wr Write dest: CSR bus Access Method: Microengine CSR or Xfer register (Reflector Mode) 7.1.2.3.4 Access Method: Microengine: csr[read] Microengine: csr[write] Intel XScale® core: ld Intel XScale® core: st Bus Usages: Bus Usages: Read source: Pull bus (Address) Reads: Pull Bus (PP_ID) Write dest: Push bus(PP_ID) Write dest: Push bus (Address) Reset Registers The reset registers reside in the SHaC. For more information on chip reset, refer to Section 10, “Clocks and Reset”. Strapping pins are used to select the reset count (currently 140 cycles after deassert). Options for reset count will be 64 (default), 128, 512, and 2048. Hardware Reference Manual 235 Intel® IXP2800 Network Processor SHaC — Unit Expansion 7.1.3 Hash Unit The SHaC unit contains a Hash Unit that can take 48-, 64-, or 128-bit data and produce a 48-, 64-, or a 128-bit hash index, respectively. The Hash Unit is accessible by the Microengines and the Intel XScale® core. Figure 87 is a block diagram of the Hash Unit. Figure 87. Hash Unit Block Diagram 3-Stage Command Pipe HASH_ PUSH_CMD From SCR HASH_CMD Hash State Machine SCR_HASH_CMD To SCR . 
HASH_PUSH_ DATA_REQ From From PULL1 SCR Arb SP0_PULL_DATA SCR_HASH_TAKE_ PULL0_DATA HASH_DATA 1 Hash Multiplier PULL0 FIFO (32 x 32 bit) HASH_ PULL_DATA Hash Select 2 HASH_ RESULT 128 Bit To SCR HASH_PULL_ DATA_SEL Hash Algorithm From From PULL0 SCR Arb HASH_REMINDER HASH_RESULT HASH_CMD_VALID 3-Stage Output Buffer SP1_PULL_DATA SCR_HASH_TAKE_ PULL_DATA PULL1 FIFO (32 x 32 bit) Notes: 1. 128 bits, shifted 16 bits per CLK. 2. 128 bit, 64-bit, or 4-bit A9758-01 236 Hardware Reference Manual Intel® IXP2800 Network Processor SHaC — Unit Expansion 7.1.3.1 Hashing Operation Up to three hash indexes (see Example 33) can be created by using one Microengine instruction. Example 33. Microengine Hash Instructions hash1_48[$xfer], optional_token hash2_48[$xfer], optional_token hash3_48[$xfer], optional_token hash1_64[$xfer], optional_token hash2_64[$xfer], optional_token hash3_64[$xfer], optional_token hash1_128[$xfer], optional_token hash2_128[$xfer], optional_token hash3_128[$xfer], optional_token Where: $xfer The beginning of a contiguous set of registers that supply the data used to create the hash input and receive the hash index upon completion of the hash operation. optional_token sig_done, ctx_swap, defer [1] A Microengine initiates a hash operation by writing a contiguous set of SRAM Transfer registers and then executing the hash instruction. The SRAM Transfer registers can be specified as either Context-Relative or Indirect; Indirect allows any of the SRAM Transfer registers to be used. Two SRAM Transfer registers are required to create hash indexes for 48-bit and 64-bit, and four SRAM Transfer registers to create 128-bit hash indexes, as shown in Table 82. In the case of the 48-bit hash, the Hash Unit ignores the upper two bytes of the second Transfer register. Table 82. S_Transfer Registers Hash Operands (Sheet 1 of 2) Register Address 48-Bit Hash Operations Don't care hash 3[47:32] hash 3 [31:0] Don't care $xfer n+4 hash 2[47:32] hash 2 [31:0] Don't care $xfer n+5 $xfer n+3 $xfer n+2 hash 1[47:32] hash 1 [31:0] $xfer n+1 $xfer n 64-Bit Hash Operations Hardware Reference Manual hash 3 [63:32] $xfer n+5 hash 3 [31:0] $xfer n+4 hash 2 [63:32] $xfer n+3 hash 2 [31:0] $xfer n+2 hash 1 [63:32] $xfer n+1 hash 1 [31:0] $xfer n 237 Intel® IXP2800 Network Processor SHaC — Unit Expansion Table 82. S_Transfer Registers Hash Operands (Sheet 2 of 2) Register Address 128-Bit Hash Operations hash 3 [127:96] $xfer n+11 hash 3 [95:64] $xfer n+10 hash 3 [63:32] $xfer n+9 hash 3 [31:0] $xfer n+8 hash 2 [127:96] $xfer n+7 hash 2 [95:64] $xfer n+6 hash 2 [63:32] $xfer n+5 hash 2 [31:0] $xfer n+4 hash 1 [127:96] $xfer n+3 hash 1 [64:95] $xfer n+2 hash 1 [63:32] $xfer n+1 hash 1 [31:0] $xfer n The Intel XScale® core initiates a hash operation by writing a set of memory-mapped Hash Operand registers (which are built into the Intel XScale® core gasket) with the data to be used to generate the hash index. There are separate registers for 48-, 64-, and 128-bit hashes. Only one hash operation of each type can be done at a time. Writing to the last register in each group informs the gasket logic that it has all of the operands for that operation, and it will then arbitrate for the Command bus to send the command to the Hash Unit. Note: Detailed information about CSRs is contained in the Intel® IXP2400 and IXP2800 Network Processor Programmer’s Reference Manual. For Microengine-generated commands and those generated by the Intel XScale® core, the command enters the Command Inlet FIFO. 
As with the Scratchpad write and RMW operations, signals are generated and sent to the Pull Arbiter. The Hash unit Pull Data FIFO allows the data for up to three hash operations to be read into the Hash Unit in a single burst. When the command is serviced, the first data to be hashed enters the hash array while the next two wait in the FIFO. The Hash Unit uses a hard-wired polynomial algorithm and a programmable hash multiplier to create hash indexes. Three separate multipliers are supported — one each, for 48-, 64-, and 128-bit hash operations. The multiplier is programmed through the registers, HASH_MULTIPLIER_64_1, HASH_MULTIPLIER_64_2, HASH_MULTIPLIER_48_1, HASH_MULTIPLIER_48_2, HASH_MULTIPLIER_128_1, HASH_MULTIPLIER_128_2, HASH_MULTIPLIER_128_3, and HASH_MULTIPLIER_128_4. The multiplicand is shifted into the hash array 16 bits at a time. The hash array performs a 1’s-complement multiply and polynomial divide, calculated by using the multiplier and 16 bits of the multiplicand. The result is placed into an output register and is also fed back into the array. This process is repeated three times for a 48-bit hash (16 bits x 3 = 48), four times for a 64-bit hash (16 bits x 4 = 64), and eight times for a 128-bit hash (16 x 8 = 128). After an entire multiplicand has been passed through the hash array, the resulting hash index is placed into a two-stage output pipeline and the next hash is immediately started. 238 Hardware Reference Manual Intel® IXP2800 Network Processor SHaC — Unit Expansion The Hash Unit shares the Scratchpad’s Push Data FIFO. After each hash index is completed, the index is placed into a three-stage output pipe and the Hash Unit sends a PUSH_DATA_REQ to the Scratchpad to indicate that it has a valid hash index to put into the Push Data FIFO for transfer. The Scratchpad issues a SEND_HASH_DATA signal, transfers the hash index to the Push Data FIFO, and sends the data to the Arbiter. For hash operations initiated by the Intel XScale® core, the core reads the results from its memorymapped Hash Result registers. The addresses of Hash Results are the same as the Hash Operand registers. Because of queuing delays at the Hash Unit, the time to complete an operation is not fixed. The Intel XScale® core can do one of two operations to get the hash results: • Poll the Hash Done register. This register is cleared when the Hash Operand registers are written. Bit [0] of the Hash Done register is set when the Hash Result registers get the result from the Hash Unit (when the last word of the result is returned). The Intel XScale® core software can poll on Hash Done, and read Hash Result when Hash Done equals 0x00000001. • Read Hash Result directly. The gasket logic acknowledges the read only when the result is valid. The Intel XScale® core stalls if the result is not valid when the read happens. The number of clock cycles required to perform a single hash operation equals: two or four cycles through the input buffers, three, four, or eight cycles through the hash array, and two or four cycles through the output buffers. With the pipeline characteristics of the Hash Unit, performance is improved if multiple hash operations are initiated with a single instruction, rather than with separate hash instructions for each hash operation. 7.1.3.2 Hash Algorithm The hashing algorithm allows flexibility and uniqueness since it can be programmed to provide different results for a given input. The algorithm uses binary polynomial multiplication and division under modulo-2 addition. 
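The arithmetic behind this, formalized in Equations 1 through 10 below, is carry-less: both the multiplication by the hash multiplier M(x) and the division by the generator G(x) use XOR in place of addition. The following C sketch is a minimal software model of the 48-bit case, assuming the generator polynomial given in Equation 7; it is illustrative only and is not the hardware implementation, which shifts the multiplicand through the hash array 16 bits per clock.

```c
#include <stdint.h>

/* Software model of the 48-bit hash arithmetic (illustrative only).
 * Multiplication and division are carry-less ("modulo-2" addition,
 * i.e., XOR).  G48(x) = x^48 + x^36 + x^25 + x^10 + 1 (Equation 7);
 * the 48-bit hash index is the remainder R(x) of A(x)*M(x) / G48(x).
 * Uses the unsigned __int128 extension available in GCC and Clang.
 */
typedef unsigned __int128 u128;

#define G48 (((u128)1 << 48) | ((u128)1 << 36) | ((u128)1 << 25) | \
             ((u128)1 << 10) | 1)

static uint64_t hash48(uint64_t a, uint64_t m)  /* a, m: 48-bit operands */
{
    u128 prod = 0;

    /* Carry-less multiply: prod = A(x) * M(x), at most 95 bits wide. */
    for (int i = 0; i < 48; i++)
        if ((m >> i) & 1)
            prod ^= (u128)a << i;

    /* Modulo-2 polynomial division: reduce by G48(x), keep remainder. */
    for (int j = 94; j >= 48; j--)
        if ((prod >> j) & 1)
            prod ^= G48 << (j - 48);

    return (uint64_t)prod & 0xFFFFFFFFFFFFULL;  /* R(x): 48-bit hash index */
}
```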
The input to the algorithm is a 48-, 64-, or 128-bit value. The data used to generate the hash index is considered to represent the coefficients of an order-47, order-63, or order-127 polynomial in x. The input polynomial (designated as A(x)) has the form:

Equation 1. A48(x) = a0 + a1*x + a2*x^2 + … + a46*x^46 + a47*x^47 (48-bit hash operation)
Equation 2. A64(x) = a0 + a1*x + a2*x^2 + … + a62*x^62 + a63*x^63 (64-bit hash operation)
Equation 3. A128(x) = a0 + a1*x + a2*x^2 + … + a126*x^126 + a127*x^127 (128-bit hash operation)

This polynomial is multiplied by a programmable hash multiplier using modulo-2 addition. The hash multiplier, M(x), is stored in Hash Unit CSRs and represents the polynomial:

Equation 4. M48(x) = m0 + m1*x + m2*x^2 + … + m46*x^46 + m47*x^47 (48-bit hash operation)
Equation 5. M64(x) = m0 + m1*x + m2*x^2 + … + m62*x^62 + m63*x^63 (64-bit hash operation)
Equation 6. M128(x) = m0 + m1*x + m2*x^2 + … + m126*x^126 + m127*x^127 (128-bit hash operation)

Since multiplication is performed using modulo-2 addition, the result is an order-94 polynomial, an order-126 polynomial, or an order-254 polynomial with coefficients that are also 1 or 0. This product is divided by a fixed generator polynomial given by:

Equation 7. G48(x) = 1 + x^10 + x^25 + x^36 + x^48 (48-bit hash operation)
Equation 8. G64(x) = 1 + x^17 + x^35 + x^54 + x^64 (64-bit hash operation)
Equation 9. G128(x) = 1 + x^33 + x^69 + x^98 + x^128 (128-bit hash operation)

The division results in a quotient Q(x), a polynomial of order-46, order-62, or order-126, and a remainder R(x), a polynomial of order-47, order-63, or order-127. The operands are related by the equation:

Equation 10. A(x)M(x) = Q(x)G(x) + R(x)

The generator polynomial has the property of irreducibility. As a result, for a fixed multiplier M(x), there is a unique remainder R(x) for every input A(x). The quotient Q(x) can then be discarded, since input A(x) can be derived from its corresponding remainder R(x). A given bounded set of input values A(x) — for example, 8K or 16K table entries — with bit weights of an arbitrary density function can be mapped one-to-one into a set of remainders R(x) such that the bit weights of the resulting Hashed Arguments (a subset of all values of R(x) polynomials) are all approximately equal. In other words, there is a high likelihood that the low-order set of bits from the Hashed Arguments are unique, so they can be used to build an index into the table. If the hash algorithm does not provide a uniform hash distribution for a given set of data, the programmable hash multiplier (M(x)) may be modified to provide better results.

8 Media and Switch Fabric Interface

8.1 Overview

The Media and Switch Fabric (MSF) Interface connects the IXP2800 Network Processor to a physical layer device (PHY) and/or to a Switch Fabric. The MSF consists of separate receive and transmit interfaces, each of which can be separately configured for either the SPI-4 Phase 2 (System Packet Interface) protocol, for PHY devices, or the CSIX-L1 protocol, for Switch Fabric interfaces. The receive and transmit ports are unidirectional and independent of each other.
Each port has 16 data signals, a clock, a control signal, and a parity signal, all of which use LVDS (differential) signaling, and are sampled on both edges of the clock. There is also a flow control port consisting of a clock, data, and ready status bits, for communicating between two IXP2800 Network Processors, or a IXP2800 Network Processor and a Switch Fabric Interface; these are also LVDS, dual-edge data transfer. Signal usage and the receive and transmit functions, are illustrated in Figure 88, and described in the sections that follow. Note: Detailed information about CSRs is contained in the Intel® IXP2400 and IXP2800 Network Processor Programmer’s Reference Manual. Hardware Reference Manual 241 Intel® IXP2800 Network Processor Media and Switch Fabric Interface Figure 88. Example System Block Diagram Receive protocol is SPI-4 Transmit mode is CSIX Ingress Intel® IXP2800 Network Processor RDAT TDAT Framing/MAC Device (PHY) RSTAT Flow Control SPI-4 Protocol Egress Intel IXP2800 Network Processor Optional Gasket (Note 1 ) Switch Fabric CSIX Protocol TSTAT RDAT TDAT Receive protocol is CSIX Transmit mode is SPI-4 Notes: 1. Gasket is used to convert 16-bit, dual-data Intel IXP2800 Network Processor signals to wider single edge CWord signals used by Switch Fabric, if required. 2. Per the CSIX specification, the terms "egress" and ingress" are with respect to the Switch Fabric. So the egress processor handles traffic received from the Switch Fabric and the ingress processor handles traffic sent to the Switch Fabric. A9759-01 The use of some of the receive and transmit pins is based on protocol, SPI-4 or CSIX. For the LVDS pins, only the active high name is given (for LVDS, there are two pins per signal). The definitions of the pins can be found in the SPI-4 and CSIX specs, referenced below. An alternate system configuration is shown in the block diagram in Figure 89. In this case, a single IXP2800 Network Processor is used for both Ingress and Egress. The bit-rate supported would be less than in Figure 88. A hypothetical Bus Converter chip, external to the IXP2800 Network Processor, is used. The block diagram in Figure 89 is only an illustrative example. 242 Hardware Reference Manual Intel® IXP2800 Network Processor Media and Switch Fabric Interface Figure 89. Full-Duplex Block Diagram Receive and transmit protocol is SPI-4 and CSIX on transferby-transfer basis. Intel® IXP2800 Network Processor RDAT Framing/MAC Device (PHY) TDAT Tx Rx Switch Fabric Bus Converter UTOPIA-3 or IXBUS Protocol Tx Rx CSIX Protocol Notes: The Bus Converter chip receives and transmits both SPI-4 and CSIX protocols from/to Intel IXP2800 Network Processor. It steers the data, based on protocol, to either PHY device or Switch Fabric. PHY interface can be UTOPIA-3, IXBUS, or any other required protocol. A9357-02 8.1.1 SPI-4 SPI-4 is an interface for packet and cell transfer between a physical layer (PHY) device and a link layer device (the IXP2800 Network Processor), for aggregate bandwidths of OC-192 ATM and Packet over SONET/SDH (POS), as well as 10 Gb/s Ethernet applications. The Optical Internetworking Forum (OIF), www.oiforum.com, controls the SPI-4 Implementation Agreement document. SPI-4 has two types of transfers — Data when the RCTL signal is deasserted; Control when the RCTL signal is asserted. The Control Word format is shown in Table 83 (this information is from the SPI-4 specification, shown here for convenience). 
Hardware Reference Manual 243 Intel® IXP2800 Network Processor Media and Switch Fabric Interface Table 83. SPI-4 Control Word Format Bit Position Label 15 Type Description Control Word Type. • 1—payload control word (payload transfer will immediately follow the control word). • 0—idle or training control word. End-of-Packet (EOP) Status. Set to the following values below according to the status of the immediately preceding payload transfer. • 00—Not an EOP. 14:13 EOPS • 01—EOP Abort (application-specific error condition). • 10—EOP Normal termination, 2 bytes valid. • 11—EOP Normal termination, 1 byte valid. EOPS is valid in the first Control Word following a burst transfer. It is ignored and set to 00 otherwise. Start-of-Packet (SOP). 12 SOP • Set to 1 if the payload transfer immediately following the Control Word corresponds to the start of a packet; set to 0 otherwise. • Set to 0 in all idle and training control words. Port Address. 11:4 ADR 8-bit port address of the payload data transfer immediately following the Control Word. None of the addresses are reserved (all are available for payload transfer). • Set to all zeros in all idle Control Words. • Set to all ones in all training Control Words. 4-bit Diagonal Interleaved Parity. 3:0 DIP-4 4-bit odd parity computed over the current Control Word and the immediately preceding data words (if any) following the last Control Word. Control words are inserted only between burst transfers; once a transfer has begun, data words are sent uninterrupted until either End of Packet or a multiple of 16 bytes is reached. The order of bytes within the SPI-4 data burst is shown in Table 84. The most significant bits of the bytes correspond to bits 15 and 7. On data transfers that do not end on an even byte-boundary, the unused byte on bits [7:0] is set to all zeros. 244 Hardware Reference Manual Intel® IXP2800 Network Processor Media and Switch Fabric Interface Table 84 shows the order of bytes on SPI-4; this example shows a 43-byte packet. Table 84. Order of Bytes1 within the SPI-4 Data Burst Bit 15 1. 2. Bit 8 Bit 7 Bit 0 Data Word 1 Byte 1 Byte 2 Data Word 2 Byte 3 Byte 4 Data Word 3 Byte 5 Byte 5 Data Word 4 Byte 7 Byte 6 … … … … … … … … … Data Word 21 Byte 41 Byte 42 Data Word 22 Byte 432 00 These bytes are valid only if EOP is set. All transfers on the SPI-4 bus must be in multiples of 16 bytes if it is not associated with an End of Packet (EOP) transfer, to comply with the protocol. Hence, this 43-byte example would only be valid for an EOP transfer. Figure 90 shows two ways in which the SPI-4 clocking can be done. Note that it is also possible to use an internally-supplied clock and leave TCLK_REF unused. Figure 90. Receive and Transmit Clock Generation PHY chip generates RDCLK internally and supplies it to Ingress Intel® IXP2800 Network Processor. Ingress IXP2800 Network Processor Oscillator supplies TCLK_REF to Egress Intel IXP2800 Network Processor, used to generate TDCLK. Ingress IXP2800 Network Processor RDCLK RDCLK PHY PHY RCLK_REF RCLK_REF Osc TCLK_REF TDCLK TCLK_REF TDCLK Egress IXP2800 Network Processor Ingress IXP2800 Network Processor supplies RCLK_REF to TCLK_REF, so TDCLK is same frequency as RDCLK. Egress IXP2800 Network Processor PHY uses TDCLK to generate RDCLK to Ingress IXP2800 Network Processor. RCLK_REF is not used. 
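For reference, the field positions in Table 83 can be expressed as simple shift-and-mask operations. The C sketch below is illustrative only (the struct and function names are ours, not from the SPI-4 specification); it decodes a received 16-bit Control Word into its Type, EOPS, SOP, ADR, and DIP-4 fields.

```c
#include <stdint.h>

/* Decode a 16-bit SPI-4 Control Word using the bit positions in
 * Table 83 (illustrative helper; field and struct names are ours).
 */
struct spi4_ctl {
    unsigned type;  /* bit  15    : 1 = payload control, 0 = idle/training */
    unsigned eops;  /* bits 14:13 : EOP status of preceding transfer       */
    unsigned sop;   /* bit  12    : start of packet for following transfer */
    unsigned adr;   /* bits 11:4  : port address                           */
    unsigned dip4;  /* bits 3:0   : diagonal interleaved parity            */
};

static struct spi4_ctl spi4_decode_ctl(uint16_t w)
{
    struct spi4_ctl c;
    c.type = (w >> 15) & 0x1;
    c.eops = (w >> 13) & 0x3;
    c.sop  = (w >> 12) & 0x1;
    c.adr  = (w >> 4)  & 0xFF;
    c.dip4 =  w        & 0xF;
    return c;
}
```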
A9760-01 Hardware Reference Manual 245 Intel® IXP2800 Network Processor Media and Switch Fabric Interface 8.1.2 CSIX CSIX_L1 (Common Switch Interface) defines an interface between a Traffic Manager (TM) and a Switch Fabric (SF) for ATM, IP, MPLS, Ethernet, and similar data communications applications. The Network Processor Forum (NPF) www.npforum.org, controls the CSIX_L1 specification. The basic unit of information transferred between TMs and SFs is called a CFrame. There are a number of CFrame types defined as shown in Table 85. Table 85. CFrame Types Type Encoding CFrame Type 0 Idle 1 Unicast 2 Multicast Mask 3 Multicast ID 4 Multicast Binary Copy 5 Broadcast 6 Flow Control 7 Command and Status 8-F CSIX Reserved For transmission from the IXP2800 Network Processor, CFrames are constructed for transmission under Microengine software control, and written into the Transmit Buffer (TBUF). On receive to the IXP2800 Network Processor, CFrames are either discarded, placed into Receive Buffer (RBUF), or placed into Flow Control Egress FIFO (FCEFIFO), according to mapping defined in the CSIX_Type_Map CSR. CFrames put into RBUF are passed to a Microengine to be parsed by software. CFrames put into FCEFIFO are sent to the Ingress IXP2800 Network Processor over the Flow Control bus. Link-level Flow Control information (CSIX Ready field) in the Base Header of all CFrames (including Idle) is handled by hardware. 8.1.3 CSIX/SPI-4 Interleave Mode SPI-4 packets and CSIX CFrames are interleaved when the RBUF and TBUF are configured in 3-partition mode. When the protocol signal RPROT or TPROT is high, the data bus is transferring CSIX CFRAMES or IDLE cycles. When protocol is low, the data bus is transferring SPI-4 packets or idle cycles. When operating in interleave mode, RPROT must be driven high (logic 1) for the entire CSIX CFRAME or low (logic 0) for the entire SPI-4 burst. When in 3-partition mode, the SPI-4 interval should be padded using SPI-4 idle cycles so that it ends on a 32-bit boundary or a complete RCLK or TCLK clock cycle. The actual SPI-4 data length can be any size. However, the SPI-4 interval, which includes the SPI-4 control words and payload data, must end on a 32-bit boundary. 246 Hardware Reference Manual Intel® IXP2800 Network Processor Media and Switch Fabric Interface 8.2 Receive The receive section consists of: • • • • • • Receive Pins (Section 8.2.1) Checksum (Section 8.2.2) Receive Buffer (RBUF) (Section 8.2.2) Full Element List (Section 8.2.3) Rx_Thread_Freelist (Section 8.2.4) Flow Control Status (Section 8.2.7) Figure 91 is a simplified block diagram of the receive section. Figure 91. Simplified Receive Section Block Diagram Checksum RDAT RCTL RPAR CSIX Protocol Logic - SPI-4 Protocol Logic RCLK RCLK REF Full Element List SPI-4 Flow Control Clock for Receive Functions - 128 - Buffers 32 S_Push_Data (to MEs) 64 D_Pull_Data (to DRAM) Full Indication to Flow Control RPROT RSTAT RBUF - - - - - - - - - - - - - Control Receive Thread Freelists CSR Write CSIX CFrames mapped by RX_Port_Map CSR (normally Flow Control CFrames are mapped here) FCEFIFO - - - - - - - - - - - - - - - - - - - - TXCFC (FCIFIFO full) TXCDAT A9339-01 Hardware Reference Manual 247 Intel® IXP2800 Network Processor Media and Switch Fabric Interface 8.2.1 Receive Pins The use of the receive pins is a function of RPROT input, as shown in Table 86. Table 86. 
Receive Pins Usage by Protocol Name Direction SPI-4 Use CSIX Use RCLK Input RDCLK TxClk RDAT[15:0] Input RDAT[15:0] TxData[15:0] RCTL Input RCTL TxSOF RPAR Input Not Used TxPar RSCLK Output RSCLK Not Used RSTAT[1:0] Output RSTAT[1:0] Not Used In general, hardware does framing, parity checking, and flow control message handling. Interpretation of frame header and payload data is done by Microengine software. The internal clock used is taken from the RCLK pin. RCLK_Ref output is a buffered version of the clock. It can be used to supply TCLK_Ref of the Egress IXP2800 Network Processor if desired. The receive pins RDAT[15:0], RCTL, RPAR are sampled relative to RCLK. To work at high frequencies, each of those pins has de-skewing logic as described in Section 8.6. 8.2.2 RBUF RBUF is a RAM that holds received data. It stores received data in sub-blocks (referred to as elements), and is accessed by a Microengine or the Intel XScale® core reading the received information. Details of how RBUF elements are allocated and filled is based on the receive data protocol, and is described in Section 8.2.2.1 – Section 8.2.2.2. When data is received, the associated status is put into the Full_Element_List FIFO and subsequently sent to a Microengine for processing. Full_Element_List insures that received elements are sent to a Microengine in the order in which the data was received. RBUF contains a total of eight Kbytes of data. Table 87 shows the order in which received data is stored in RBUF. Each number represents a byte, in order of arrival from the receiver interface. Table 87. Order in which Received Data Is Stored in RBUF Data/Payload 4 5 6 7 0 Address Offset (Hex) 1 2 3 0 C D E F 8 9 A B 8 14 15 16 17 10 11 12 13 10 The mapping of elements to address offset in RBUF is based on the RBUF partition and element size, as programmed in the MSF_Rx_Control CSR. RBUF can be partitioned into one, two, or three partitions based on MSF_Rx_Control[RBUF_Partition]. The mapping of received data to partitions is shown in Table 88. 248 Hardware Reference Manual Intel® IXP2800 Network Processor Media and Switch Fabric Interface Table 88. Mapping of Received Data to RBUF Partitions Data Use by Partition, Fraction of RBUF Used, Start Byte Offset (Hex) Number of Partitions in Use Receive Data Protocol Partition Number 0 1 2 n/a n/a SPI-4 1 SPI-4 only All Byte 0 CSIX Data 2 CSIX only Both SPI-4 and CSIX 3 CSIX Control 3/4 of RBUF 1/4 of RBUF Byte 0 Byte 0x1800 n/a CSIX Data SPI-4 1/2 of RBUF 3/8 of RBUF 1/8 of RBUF Byte 0 Byte 0x1000 Byte 0x1C00 CSIX Control The data in each partition is further broken up into elements, based on MSF_Rx_Control[RBUF_Element_Size_#] (n = 0, 1, 2). There are three choices of element size – 64, 128, or 256 bytes. Table 89 shows the RBUF partition options. Note that the choice of element size is independent for each partition. Table 89. Number of Elements per RBUF Partition Partition Number RBUF_Partition Field RBUF_Element_Size_# Field 0 00 (1 partition) 01 (2 partitions) 10 (3 partitions) 00 (64 bytes) 128 01 (128 bytes) 64 10 (256 bytes) 32 1 2 Unused Unused 00 (64 bytes) 96 32 01 (128 bytes) 48 16 10 (256 bytes) 24 8 00 (64 bytes) 64 48 16 01 (128 bytes) 32 24 8 10 (256 bytes) 16 12 4 Unused The Microengine can read data from the RBUF to Microengine S_TRANSFER_IN registers using the msf[read] instruction, where the starting byte number is specified (which must be aligned to 4-byte units), and also the number of 32-bit words to read. 
The number in the instruction can be either the number of 32-bit words, or the number of 32-bit word pairs, using the single- and doubleinstruction modifiers, respectively. The data is pushed to the Microengine on the S_Push_Bus by RBUF control logic: msf[read, $s_xfer_reg, src_op_1, src_op_2, ref_cnt], optional_token Hardware Reference Manual 249 Intel® IXP2800 Network Processor Media and Switch Fabric Interface The src_op_1 and src_op_2 operands are added together to form the address in RBUF (note that the base address of the RBUF is 0x2000). The ref_cnt operand is the number of 32-bit words or word pairs, that are pushed into two sequential S_TRANSFER_IN registers, starting with $s_xfer_reg. Using the data in RBUF in Table 87 above, reading eight bytes from offset 0 into transfer registers 0 and 1 would yield the result in Example 34. Example 34. Data from RBUF Moved to Microengine Transfer Registers Transfer Register Number Bit Number within Transfer Register 31 24 23 16 15 8 7 0 0 0 1 2 3 1 4 5 6 7 Microengine can move data from RBUF to DRAM using the instruction: dram[rbuf_rd, --, src_op1, src_op2, ref_cnt], indirect_ref The src_op_1 and src_op_2 operands are added together to form the address in DRAM, so the dram instruction must use the indirect_ref modifier to specify the RBUF address (refer to the IXP2800 Network Processor chassis chapter for details). The ref_cnt operand is the number of 64-bit words that are read from RBUF. Using the data in RBUF in Table 87 above, reading 16 bytes from offset 0 in RBUF into DRAM would yield the result in Example 35 in DRAM (addresses in DRAM must be aligned to 8-byte units. The data from lower-offset RBUF offsets goes into lower addresses in DRAM.) Example 35. Data from RBUF Moved to DRAM 63 56 55 48 47 40 39 32 31 24 23 16 15 8 7 0 4 5 6 7 0 1 2 3 C D E F 8 9 A B For both types of RBUF read, reading an element does not modify any RBUF data, and does not free the element, so buffered data can be read as many times as desired. 8.2.2.1 SPI-4 SPI-4 data is placed into RBUF as follows: • At chip reset all elements are marked invalid (available). • When a SPI-4 Control Word is received (i.e., when RCTL is asserted) it is placed in a temporary holding register. The Checksum accumulator is cleared. The subsequent action is based on the Type field. — If Type is Idle or Training, the Control Word is discarded. — If Type is not Idle or Training: An available RBUF element is allocated by receive control logic.(If there is no available element, the data is discarded and MSF_Interrupt_Status[RBUF_Overflow is set.) Note that this normally should not happen because, when the number of RBUF elements falls below a programmed limit, the flow control status is sent back to the PHY device (refer to 250 Hardware Reference Manual Intel® IXP2800 Network Processor Media and Switch Fabric Interface Section 8.2.7.1). The SPI-4 Control Word Type, EOPS, SOP, and ADR fields are placed into a temporary status register. The Byte_Count field of the element status is set to 0x0. As each Data Word is received, the data is written into the element, starting at offset 0x0 in the element, and Byte_Count is updated. Subsequent Data transfers are placed at higher offsets (0x2, 0x4, etc.). The 16-bit Checksum Accumulator is also updated with the 1’scomplement addition of each byte pair. (Note that, if the data transfer has an odd number of bytes, a byte of zeros is appended as the more significant byte, before the checksum addition is done.) 
• If a Control Word is received before the element is full — the element is marked valid. EOP for the element is taken from the value of the EOPS field (see Table 83) from the just-received Control Word. If the EOPS field from the just-received Control Word indicates that EOP is asserted, then the Byte_Count for the element is decremented by 0 or 1, according to the EOPS field (i.e., decrement by 0 if two bytes are valid, and by 1 if one byte is valid). If the EOPS field indicates Abort, Byte_Count is rounded up to the next multiple of 4. The temporary status register value is put into Full_Element_List. • If the element becomes full before receipt of another Control Word — the element is marked as pre-valid. The eventual status is based on the next SPI-4 transfer(s). • If the next transfer is a Data Word — the previous element is changed from pre-valid to valid. The EOP for the element is 0. The temporary status register value is put into Full_Element_List. Another available RBUF element is allocated, and the new data is written into it. The temporary status for the new element gets the same ADR field of the previous element, and SOP is set to 0. The Status word Byte_Count field is set to 0x2, and will be incremented as more Data Words arrive. The Checksum Accumulator is cleared. • If the next transfer is a Control Word — the previous element is changed from pre-valid to valid. EOP for the element is taken from the value of the EOPS field from the just-received Control Word. If the EOPS field from the just-received Control Word indicates that EOP is asserted, then the Byte_Count for the element is decremented by 0 or 1, according to the EOPS field (i.e., decrement by 0 if two bytes valid, and by 1 if one byte is valid). The temporary status register value is put into Full_Element_List. Data received from the bus is placed into the element lowest offset first, in big-endian order (i.e., with the first byte received in the most significant byte of the 32-bit word, etc.). Hardware Reference Manual 251 Intel® IXP2800 Network Processor Media and Switch Fabric Interface The status contains the following information: 2 3 2 2 Element 6 2 6 1 6 0 5 9 5 8 2 1 2 0 1 9 1 8 1 7 1 6 Byte Count 5 7 5 6 5 5 5 4 5 3 5 2 5 1 5 0 4 9 4 8 1 5 1 4 1 3 1 2 4 7 1 1 1 0 4 6 4 5 4 4 9 8 Null 2 4 Type 2 5 Par Err 2 6 Abort Err 2 7 Len Err RPROT 6 3 2 8 Err 2 9 EOP 3 0 SOP 3 1 4 3 4 2 4 1 4 0 Reserved 7 6 5 4 3 2 1 0 3 4 3 3 3 2 ADR 3 9 3 8 3 7 3 6 3 5 Checksum The definitions of the fields are shown in Table 90. Table 90. RBUF SPIF-4 Status Definition Field 252 Definition RPROT This bit is a 0, indicating that the Status is for SPI-4. It is derived from the RPROT input signal. Element The element number in the RBUF that holds the data. This is equal to the offset in RBUF of the first byte in the element, shifted right by six places Byte_Count Indicates the number of Data bytes, from 1 to 256, in the element (value 0x00 means 256). This field is derived from the number of data transfers that fill the element, and also the EOPS field of the Control Word that most recently succeeded the data transfer. SOP Indicates whether the element is the start of a packet. This field is taken from the SOP field of the Control Word that most recently preceded the data transfer for the first element allocated after a Control Word. For subsequent elements (i.e., if more than one element worth of data follow the Control Word) this value is 0. EOP Indicates whether the element is the end of a packet. 
This field is taken from the EOPS field of the Control Word that most recently succeeded the data transfer. Err Error. This is the logical OR of Par Err, Len Err, and Abort Err. Len Err A non-EOP burst occurred that was not a multiple of 16 bytes. Par Err Parity Error was detected in the DIP-4 parity field. See the description in Section 8.2.8.1. Abort Err An EOP with Abort was received on bits [14:13] of the Control Word that most recently succeeded the data transfer. Null Null receive. If this bit is set, it means that the Rx_Thread_Freelist timeout expired before any more data was received, and that a null Receive Status Word is being pushed, to keep the receive pipeline flowing. The rest of the fields in the Receive Status Word must be ignored; there is no data or RBUF entry associated with a null Receive Status Word. Type This field is taken from the Type field of the Control Word that most recently preceded the data transfer. ADR The port number to which the data is directed. This field is taken from the ADR field of the Control Word that most recently preceded the data transfer. Checksum Checksum calculated over the Data Words in the element. This can be used for TCP. Hardware Reference Manual Intel® IXP2800 Network Processor Media and Switch Fabric Interface 8.2.2.2 CSIX CSIX CFrames are placed into either RBUF or FCEFIFO as follows: At chip reset, all RBUF elements are marked invalid (available) and FCEFIFO is empty. When a Base Header is sent (i.e., when RxSof is asserted) it is placed in a temporary holding register. The Ready Field is extracted and held to be put into FC_Egress_Status CSR when (and if) the entire CFrame is received without error. The Type field is extracted and used to index into CSIX_Type_Map CSR to determine one of four actions. • • • • Note: Discard (except for the Ready Field as described in Section 8.2.7.2.1). Place into RBUF Control CFrame partition. Place into RBUF Data CFrame partition. Place into FCEFIFO. Normally Idle CFrames (Type 0x0) will be discarded, Command and Status CFrames (Type 0x7) will be placed into Control Partition, Flow Control CFrames (Type 0x6) will be placed into FCEFIFO, and all others will be placed into Data Partition (see Table 87). The remapping done through the CSIX_Type_Map CSR allows for more flexibility in usage, if desired. If the action is Discard, the CFrame is discarded (except for the Ready Field as described in Section 8.2.7.2.1). The Base Header, as well as Extension Header and Payload (if any) are discarded. If the destination is FCEFIFO: The Payload is placed into the FCEFIFO, to be sent to the Ingress IXP2800 Network Processor over the TXCDAT pins. If there is not enough room in FCEFIFO for the entire CFrame, based on the Payload Size in the Base Header, the entire CFrame is discarded and MSF_Interrupt_Status[FCEFIFO_Overflow] is set. If the destination is RBUF (either Control or Data): An available RBUF element of the corresponding type is allocated by receive control logic. If there is not an available element, the CFrame is discarded and MSF_Interrupt_Status[RBUF_Overflow] is set. Note that this normally should not happen because, when the number of RBUF elements falls below a programmed limit, backpressure is sent to the Switch Fabric. (Refer to Section 8.2.7.2.) The Type, Payload Length, CR (CSIX Reserved), and P (Private) bits, and (subsequently arriving) Extension Header are placed into a temporary status register. 
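The Type-to-destination mapping selected by the CSIX_Type_Map CSR can be modeled as a small lookup. The sketch below is illustrative only: the enum, table, and function names are ours, and the actual CSR field encoding is defined in the Programmer's Reference Manual. The table is filled in with the "normal" mapping described in the Note above, with the CSIX-reserved types 0x8 through 0xF left at Discard as an assumption.

```c
/* Illustrative model of the per-Type dispatch selected by CSIX_Type_Map. */
enum cframe_dest {
    DEST_DISCARD, DEST_RBUF_CONTROL, DEST_RBUF_DATA, DEST_FCEFIFO
};

static const enum cframe_dest csix_type_map[16] = {
    [0x0] = DEST_DISCARD,       /* Idle                  */
    [0x1] = DEST_RBUF_DATA,     /* Unicast               */
    [0x2] = DEST_RBUF_DATA,     /* Multicast Mask        */
    [0x3] = DEST_RBUF_DATA,     /* Multicast ID          */
    [0x4] = DEST_RBUF_DATA,     /* Multicast Binary Copy */
    [0x5] = DEST_RBUF_DATA,     /* Broadcast             */
    [0x6] = DEST_FCEFIFO,       /* Flow Control          */
    [0x7] = DEST_RBUF_CONTROL,  /* Command and Status    */
    /* Types 0x8-0xF are CSIX reserved; left as DEST_DISCARD here. */
};

static enum cframe_dest dispatch_cframe(unsigned type)
{
    return csix_type_map[type & 0xF];
}
```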
As the Payload (including padding if any) is received, it is placed into the allocated RBUF element, starting at offset 0x0. (Note that it is more exact to state that the first four bytes after the Base Header are placed into the status register as Extension Header. For Flow Control CFrames, there is no Extension Header; the first four bytes are part of the Payload. They would be found in the Extension Header field of the Status — no bytes are lost.) When all of the Payload data (including padding if any), as indicated by the Payload Length field, and Vertical Parity has been received, the element is marked valid. If another RxSof is received prior to receiving the entire Payload, the element is also marked valid, and the Length Error status bit is set. If the Payload Length field of the Base Header is greater than the element size (as configured in MSF_Rx_Control[RBUF_Element_Size], then the Length Error bit in the status will be set, and all payload bytes above the element size will be discarded. The temporary status register value is put into Full_Element_List. Hardware Reference Manual 253 Intel® IXP2800 Network Processor Media and Switch Fabric Interface Note: In CSIX protocol, an RBUF element is allocated only on RxSof assertion. Therefore, the element size must be programmed based on the Switch Fabric usage. For example, if the switch never sends a payload greater than 128 bytes, then 128-byte elements can be selected. Otherwise, 256-byte elements must be selected. Data received from the bus is placed into the element lowest-offset first in big-endian order (that is, with the first byte received in the most significant byte of the 32-bit word, etc.). The status contains the following information: 2 6 2 5 2 4 2 3 2 2 6 1 6 0 5 9 5 8 1 9 1 8 1 7 1 6 Payload Length 5 7 5 6 5 5 5 4 5 3 5 2 5 1 5 0 4 9 4 8 1 5 1 4 1 3 1 2 1 1 1 0 9 4 4 4 3 4 2 4 1 P 6 2 2 0 Err Element 2 1 Null 2 7 CR RPROT 6 3 2 8 VP Err 2 9 HP Err 3 0 Len Err 3 1 4 7 4 6 4 5 8 7 6 5 4 3 Reserved 4 0 3 9 3 8 3 7 2 1 0 Type 3 6 3 5 3 4 3 3 3 2 Extension Header The definitions of the fields are shown in Table 91. Table 91. RBUF CSIX Status Definition Field Definition RPROT This bit is a 1, indicating that the Status is for CSIX-L1. It is derived from the RPROT input signal. Element The element number in the RBUF that holds the data. This is equal to the offset in RBUF of the first byte in the element, shifted right by six places. Payload Length Payload Length Field from the CSIX Base Header. A value of 0x0 indicates 256 bytes. CR CR (CSIX Reserved) bit from the CSIX Base Header. P P (Private) bit from the CSIX Base Header. Err Error. This is the logical OR of VP Err, HP Err, and Len Err. Length Error; either Len Err amount of Payload received (before receipt of next Base Header) did not match value indicated in Base Header Payload Length field) or Payload Length field was greater than size of RBUF element. 254 HP Err Horizontal Parity Error was detected on the CFrame. See description in Section 8.2.8.2.1. VP Err Vertical Parity Error was detected on the CFrame. See description in Section 8.2.8.2.2. Null Null receive. If this bit is set, it means that the Rx_Thread_Freelist timeout expired before any more data was received, and that a null Receive Status Word is being pushed, to keep the receive pipeline flowing. The rest of the fields in the Receive Status Word must be ignored; there is no data or RBUF entry associated with a null Receive Status Word. Type Type Field from the CSIX Base Header. 
Extension Header The Extension Header from the CFrame. The bytes are received in big-endian order; byte 0 is in bits 63:56, byte 1 is in bits 55:48, byte 2 is in bits 47:40, and byte 3 is in bits 39:32. Hardware Reference Manual Intel® IXP2800 Network Processor Media and Switch Fabric Interface 8.2.3 Full Element List Receive control hardware maintains the Full Element List to hold the status of valid RBUF elements, in the order in which they were received. When an element is marked valid (as described in Section 8.2.2.1 for SPI-4 and Section 8.2.2.2 for CSIX), its status is added to the tail of the Full Element List. When a Microengine is notified of element arrival (by having the status written to its S_Transfer register; see Section 8.2.4), it is removed from the head of the Full Element List. 8.2.4 Rx_Thread_Freelist_# Each Rx_Thread_Freelist_# is a FIFO that indicates Microengine Contexts that are awaiting an RBUF element to process. This allows the Contexts to indicate their ready-status prior to the reception of the data, as a way to eliminate latency. Each entry added to a Freelist also has an associated S_Transfer register and signal number. The receive logic maintains either one, two, or three separate lists based on MSF_Rx_Control[RBUF_Partition], MSF_Rx_Control[CSIX_Freelist], and Rx_Port_Map as shown in Table 92. Table 92. Rx_Thread_Freelist Use Number of Partitions1 Use 1 SPI-4 only 2 CSIX only 3 Both SPI-4 and CSIX Rx_Thread_Freelist_# Used CSIX_Freelist2 0 1 2 n/a SPI-4 Ports equal to or below Rx_Port_Map SPI-4 Ports above Rx_Port_Map Not Used 0 CSIX Data CSIX Control Not Used 1 CSIX Data and CSIX Control Not Used Not Used 0 CSIX Data SPI-4 CSIX Control 1 CSIX Data and CSIX Control SPI-4 Not Used 1. Programmed in MSF_Rx_Control[RBUF_Partition]. 2. Programmed in MSF_Rx_Control[CSIX_Freelist]. To be added as ready to receive an element, an Microengine does an msf[write] or msf[fast_write] to the Rx_Thread_Freelist_# address; the write data is the Microengine/ Context/S_Transfer register number to add to the Freelist. Note that using the data (rather than the command bus ID) permits a Context to add either itself or other Contexts as ready. When there is valid status at the head of the Full Element List, it will be pushed to a Microengine. The receive control logic pushes the status information (which includes the element number) to the Microengine in the head entry of Rx_Thread_Freelist_#, and sends an Event Signal to the Microengine. It then removes that entry from the Rx_Thread_Freelist_#, and removes the status from Full Element List. (Note that this implies the restriction — a Context waiting on status must not read the S_Transfer register until it has been signaled.) See Section 8.2.6 for more information. In the event that Rx_Thread_Freelist_# is empty, the valid status will be held in Full Element List until an entry is put into Rx_Thread_Freelist_#. Hardware Reference Manual 255 Intel® IXP2800 Network Processor Media and Switch Fabric Interface 8.2.5 Rx_Thread_Freelist_Timeout_# Each Rx_Thread_Freelist_# has an associated countdown timer. If the timer expires and no new receive data is available yet, the receive logic will autopush a Null Receive Status Word to the next thread on the Rx_Thread_Freelist_#. A Null Receive Status Word has the “Null” bit set, and does not have any data or RBUF entry associated with it. The Rx_Thread_Freelist_# timer is useful for certain applications. 
Its primary purpose is to keep the receive processing pipeline (implemented as microcode running on the Microengine) moving even when the line has gone idle. It is especially useful if the pipeline is structured to handle mpackets in groups, i.e., eight mpackets at a time. If seven mpackets are received, the line goes idle, and the timeout triggers the autopush of a null Receive Status Word, filling the eighth slot and allowing the pipeline to advance. Another example is if one valid mpacket is received before the line goes idle for a long period; seven null Receive Status Words will be autopushed, allowing the pipeline to proceed. Typically, the timeout interval is programmed to be slightly larger than the minimum arrival time of the incoming cells or packets. The timer is controlled by using the Rx_Thread_Freelist_Timeout_# CSR. The timer may be enabled or disabled, and the timeout value specified using this CSR. The following rules define the operation of the Rx_Thread_Freelist timer. 1. Writing a non-zero value to the Rx_Thread_Freelist_Timeout_# CSR both resets the timer and enables it. Writing a zero value to this CSR resets the timer and disables it. 2. If the timer is disabled, then only valid (non-null) Receive Status Words are autopushed to the receive threads; null Receive Status Words are never pushed. 3. If the timer expires and the Rx_Thread_Freelist_# is non-empty, but there is no mpacket available, this will trigger the autopush of a null Receive Status Word. 4. If the timer expires and the Rx_Thread_Freelist_# is empty, the timer stays in the EXPIRED state and is not restarted. A null Receive Status Word cannot be autopushed, since the logic has no destination to push anything to. 5. An expired timer is reset and restarted if and only if an autopush, null or non-null, is performed. 6. Whenever there is a choice, the autopush of a non-null Receive Status Word takes precedence over a null Receive Status Word. 8.2.6 Receive Operation Summary During receive processing, received CFrames, cells, and packets (which in this context are all called mpackets) are placed into the RBUF, and then, when marked valid, are immediately handed off to a Microengine for processing. Normally, by application design, some number of Microengine Contexts will be assigned to receive processing. Those Contexts will have their number added to the proper Rx_Thread_Freelist_# (via msf[write]or msf[fast_write]), and then will go to sleep to wait for arrival of an mpacket (or alternatively poll waiting for arrival of an mpacket). 256 Hardware Reference Manual Intel® IXP2800 Network Processor Media and Switch Fabric Interface When an mpacket becomes valid as described in Section 8.2.2.1 for SPI-4 and Section 8.2.2.2 for CSIX, receive control logic will autopush eight bytes of information for the element to the Microengine/Context/S_Transfer registers at the head of Rx_Thread_Freelist_#. The information pushed is (see Table 90 and Table 91 for detailed definitions): • Status Word (SPI-4) or Header Status (CSIX) to Transfer register n (n is the Transfer register programmed to the Rx_Thread _Freelist_#) • Checksum (SPI-4) or Extension Header (CSIX) to Transfer register n+1 To handle the case where the receive Contexts temporarily fall behind and Rx_Thread_Freelist_# is empty, all received element numbers are held in the Full Element List. In that case, as soon as an Rx_Thread_Freelist_# entry is entered, the status of the head element of Full Element List will be pushed to it. 
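A minimal software model of the Section 8.2.5 timer rules is sketched below; it is illustrative pseudologic only (the struct, enum, and function names are ours, and the real decision is made by receive control hardware). It shows when a valid versus a null Receive Status Word would be autopushed and when the expired timer is restarted.

```c
#include <stdbool.h>

/* Illustrative model of the Rx_Thread_Freelist_# timer rules (1-6 above). */
struct rx_freelist {
    bool timer_enabled;
    bool timer_expired;
    bool freelist_empty;     /* no thread waiting for status               */
    bool mpacket_available;  /* valid status at head of Full Element List  */
};

enum push_action { PUSH_NONE, PUSH_VALID_STATUS, PUSH_NULL_STATUS };

static enum push_action rx_freelist_step(struct rx_freelist *f)
{
    if (f->freelist_empty)
        return PUSH_NONE;          /* rule 4: nowhere to push; timer stays EXPIRED */

    if (f->mpacket_available) {    /* rule 6: valid status takes precedence */
        f->timer_expired = false;  /* rule 5: any autopush restarts the timer */
        return PUSH_VALID_STATUS;
    }

    if (f->timer_enabled && f->timer_expired) {  /* rule 3 */
        f->timer_expired = false;                /* rule 5 */
        return PUSH_NULL_STATUS;
    }

    return PUSH_NONE;  /* rule 2: no null pushes while the timer is disabled */
}
```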
The Microengine may read part of (or the entire) RBUF element to their S_Transfer registers (via msf[read] instruction) for header processing, etc., and may also move the element data to DRAM (via dram[rbuf_rd] instruction). When a Context is done with an element, it does an msf[write]or msf[fast_write] to the RBUF_Element_Done address; the write data is the element number. This marks the element as free and available to be re-used. There is no restriction on the order in which elements are freed; Contexts can do different amounts of processing per element based on the contents of the element — therefore, elements can be returned in a different order than they were handed to Contexts. The states that an RBUF element goes through are shown in Figure 92. Figure 92. RBUF Element State Diagram Reset Free. Element is empty and available to be allocated to received information from the rx pins Allocate new element (Done by Rx control logic) Set valid (done by Rx control logic) msf[write]or msf[fast_write]to RBUF_Element_Done Processing. Element status has been pushed to an ME context. ME is processing the data. Allocated. Element is being filled with data from rx pins. Autopush Status to ME Valid. Element has been set valid. Status has not yet been pushed to an ME context. A9340-01 Hardware Reference Manual 257 Intel® IXP2800 Network Processor Media and Switch Fabric Interface Table 93 summarizes the differences in RBUF operation between the SPI-4 and CSIX protocols. Table 93. Summary of SPI-4 and CSIX RBUF Operations Operation SPI-4 CSIX When is RBUF Element Allocated? Upon receipt of Payload Control Word or when Element data section fills and more Data Words arrive. The Payload Control Word allocates an element for data that will be received subsequent to it. Start of Frame and Base Header Type is mapped to RBUF (in the CSIX_Type_Map CSR). How Much Data is Put into Element? All Data Words received between two Payload Control Words, or number of bytes in the element, whichever is less. Number of bytes specified in Payload Length field of Base Header. How is RBUF Element Set Valid? Upon receipt of Payload Control Word or when Element data section fills. The Payload Control Word validates the element holding data received prior to it. All Payload is received (or if premature SOF, which will set an error bit in Element Status). How is RBUF Element Handed to Microengine? Element Status is pushed to Microengine at the head of the appropriate Rx_Thread_Freelist_# (based on the protocol). Status is pushed to two consecutive Transfer registers; bits[31:0] of Element Status to the first Transfer register and bits[63:32] to the next higher numbered Transfer register. How is RBUF Element returned to free list? CSR write to RBUF_Element_Done. 8.2.7 Receive Flow Control Status Flow control is handled in hardware. There are specific functions for SPI-4 and CSIX. 8.2.7.1 SPI-4 SPI-4, FIFO status information is sent periodically over the RSTAT signals from the Link Layer device (which is the IXP2800 Network Processor) to the PHY device. (Note that TXCDAT pins can act as RSTAT based on the MSF_Rx_Control[RSTAT_Select] bit.) The information to be sent is based on the number of RBUF elements available to receive SPI-4. The FIFO status of each port is encoded in a 2-bit data structure — code 0x3 is used for framing the data, and the other three codes are valid status values. The FIFO status words are sent according to a repeating calendar sequence. 
Each sequence begins with the framing code to indicate the start of a sequence, followed by the status codes, followed by a parity code covering the preceding frame. The length of the calendar is defined in Rx_Calendar_Length, which is a CSR field that is initialized with the length of the calendar, since in many cases fewer than 256 ports are in use. When TRAIN_DATA[RSTAT_En] is disabled, RSTAT is held at 0x3. The IXP2800 Network Processor transmits FIFO status only if TRAIN_DATA[RSTAT_En] is set. The logic sends “Satisfied,” Hungry,” or “Starving” based on either the upper limit of the RBUF, a global override value set in MSF_Rx_Control[RSTAT_OV_VALUE], or a port-specific override value set in RX_PORT_CALENDAR_STATUS_#. The choice is controlled by MSF_RX_CONTROL[RX_Calendar_Mode]. When set to Conservative_Value, the status value sent for each port is the most conservative of: • The RBUF upper limit • MSF_RX_CONTROL[RSTAT_OV_VALUE] • RX_PORT_CALENDAR_STATUS_# “Satisfied” is more conservative than “Hungry,” which is more conservative than “Starving.” 258 Hardware Reference Manual Intel® IXP2800 Network Processor Media and Switch Fabric Interface When MSF_RX_CONTROL[RX_Calendar_Mode] is set to Force_Override, the value of RX_PORT_CALENDAR_STATUS_# is used to determine which status value is sent. If RX_PORT_CALENDAR_STATUS_# is set to 0x3, then the global status value set in MSF_RX_CONTROL[RSTAT_OV_VALUE] is sent; otherwise, the port-specific status value set in RX_PORT_CALENDAR_STATUS_# is sent. The RBUF upper limit is based on the MSF_RX_CONTROL register and is defined in Table 89. The upper limit is programmed in HWM_Control[RBUF_S_HWM]. Note that either RBUF partition 0 or partition 1 will be used for SPI-4 (Table 88). 8.2.7.2 CSIX There are two types of CSIX flow control: • Link-level • Virtual Output Queue (VOQ) Information received from the Switch Fabric by the Egress IXP2800 Network Processor, must be communicated to the Ingress IXP2800 Network Processor, which is sending data to the Switch Fabric. 8.2.7.2.1 Link-Level Link-level flow control can be used to stop all transmission. Separate Link-level flow control is provided for Data CFrames and Control CFrames. CSIX protocol provides link-level flow control as follows. Every CFrame Base Header contains a Ready Field, which contains two bits; one for Control traffic (bit 6 of byte 1) and one for Data traffic (bit 7 of byte 1). The CSIX requirement for response is: From the tick that the Ready Field leaves a component the maximum response time for a pause operation is defined as: n*T, n=C+L where: • T is the clock period of the interface • n is the maximum number of ticks for the response • C is a constant for propagating the field within the “other” component (or chipset as the case may be) to the interface logic controlthe reverse direction data flow. C is defined to be 32 ticks. • L is the maximum number of ticks to transport the maximum fabric CFrame size. As each CFrame is received, the value of these bits is copied (by receive hardware) into the FC_Egress_Status[SF_CReady] and FC_Egress_Status[SF_DReady] respectively. The value of these two bits is sent from the Egress to the Ingress IXP2800 Network Processor on the TXCSRB signal, and can be used to stop transmission to the Switch Fabric, as described in Section 8.3.4.2. The TXCSRB signal is described in Section 8.5.1. 
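The Ready Field handling just described amounts to extracting two bits from byte 1 of every Base Header. The C sketch below is illustrative only (struct and function names are ours); it shows the bits that receive hardware copies into FC_Egress_Status[SF_CReady] and FC_Egress_Status[SF_DReady].

```c
#include <stdint.h>
#include <stdbool.h>

/* Illustrative extraction of the link-level Ready bits from byte 1 of a
 * CSIX Base Header: bit 6 = Control Ready, bit 7 = Data Ready, as
 * described above.  Names are ours.
 */
struct csix_ready {
    bool sf_cready;  /* fabric can accept Control CFrames */
    bool sf_dready;  /* fabric can accept Data CFrames    */
};

static struct csix_ready extract_ready(uint8_t base_header_byte1)
{
    struct csix_ready r;
    r.sf_cready = (base_header_byte1 >> 6) & 1;
    r.sf_dready = (base_header_byte1 >> 7) & 1;
    return r;
}
```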
Hardware Reference Manual 259 Intel® IXP2800 Network Processor Media and Switch Fabric Interface 8.2.7.2.2 Virtual Output Queue CSIX protocol provides Virtual Output Queue Flow Control via Flow Control CFrames. CFrames that were mapped to FCEFIFO (via the CSIX_Type_Map CSR) are parsed by the receive control logic and placed into FCEFIFO, which provides buffering while they are sent from the Egress IXP2800 Network Processor to the Ingress IXP2800 Network Processor over the TXCDAT signals (normally Flow Control CFrames would be mapped to FCEFIFO). The entire CFrame is sent over TXCDAT, including the Base Header and Vertical Parity field. The 32-bit CWord is sent four bits at a time, most significant bits first. The CFrames are forwarded in a “cut-through” manner, meaning that the Egress IXP2800 Network Processor does not wait for the entire CFrame to be received before forwarding (each CWord can be forwarded as it is received). If FCEFIFO gets full, as defined by HWM_Control[FCEFIFO_HWM], then the FC_Egress_Status[TM_CReady] bit will be deasserted (to inform the Ingress IXP2800 Network Processor to deassert Control Ready in CFrames sent to the Switch Fabric). Section 8.3.4.2 describes how Flow Control information is used in the Ingress IXP2800 Network Processor. 8.2.8 Parity 8.2.8.1 SPI-4 The receive logic computes 4-bit Diagonal Interleaved Parity (DIP-4) as specified in the SPI-4 specification. The DIP-4 field received in a control word contains odd parity computed over the current Control Word and the immediately preceding data words (if any) following the last Control Word. Figure 93 shows the extent of the DIP-4 codes. Figure 93. Extent of DIP-4 Codes Payload Control Control Control Payload Control DIP-4 Extent (between arrows) A9342-01 There is a DIP-4 Error Flag and a 4-bit DIP-4 Accumulator register. After each Control Word is received, the Flag is conditionally reset (see Note below this paragraph) and the Accumulator register is cleared. As each Data Word (if any), and the first succeeding Control Word is received, DIP-4 parity is accumulated in the register, as defined in the SPI-4 spec. The accumulated parity is compared to the value received in the DIP-4 field of that first Control Word. If it does not match, the DIP-4 Error Flag is set. The value of the flag becomes the element status Par Err bit. Note: 260 An error in the DIP-4 code invalidates the transfers before and after the Control Word, since the control information is assumed to be in error. Therefore the DIP-4 Error Flag is not reset after a Control Word with bad DIP-4 parity. It is only reset after a Control Word with correct DIP-4 parity. Hardware Reference Manual Intel® IXP2800 Network Processor Media and Switch Fabric Interface 8.2.8.2 8.2.8.2.1 CSIX Horizontal Parity The receive logic computes Horizontal Parity on each 16 bits of each received Cword (there is a separate parity for data received on rising and falling edge of the clock). There is an internal HP Error Flag. At the end of each CFrame, the flag is reset. As each 16 bits of each Cword is received, the expected odd-parity value is computed from the data, and compared to the value received on RxPar. If there is a mismatch, the flag is set. The value of the flag becomes the element status HP Err bit. 
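As a reference for the check just described, odd parity over one 16-bit half of a CWord can be computed as in the C sketch below (illustrative helper only; the function name is ours). A mismatch between this value and the bit received on RxPar sets the HP Error Flag, with the consequences listed next.

```c
#include <stdint.h>

/* Odd parity over one 16-bit half of a CWord (illustrative helper).
 * Odd parity means the data bits plus the parity bit contain an odd
 * number of ones; the result is the value expected on RxPar.
 */
static unsigned expected_rxpar(uint16_t half_cword)
{
    unsigned ones = 0;
    for (int i = 0; i < 16; i++)
        ones += (half_cword >> i) & 1;
    return (ones & 1) ? 0 : 1;  /* make the total count of ones odd */
}
```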
If the HP Error Flag is set: • the FC_Egress_Status[SF_CReady] and FC_Egress_Status[SF_DReady] bits are cleared • the MSF_Interrupt_Status[HP_Error] bit is set (which can interrupt the Intel XScale® core if enabled) 8.2.8.2.2 Vertical Parity The receive logic computes Vertical Parity on CFrames. There is a VP Error Flag and a 16-bit VP Accumulator register. At the end of each CFrame, the flag is reset and the register is cleared. As each Cword is received, odd parity is accumulated in the register as defined in the CSIX specification (16 bits of vertical parity are formed on 32 bits of received data by treating the data as words; i.e., bit 0 and bit 16 of the data are accumulated into parity bit 0, bit 1, and bit 17 of the data are accumulated into parity bit 1, etc.). After the entire CFrame has been received (including the Vertical Parity field; the two bytes following the Payload) the accumulated value should be 0xFFFF. If it is not, the VP Error Flag is set. The value of the flag becomes the element status VP Err bit. Note: The Vertical Parity always follows the Payload, which may include padding to the CWord width if the Payload Length field is not an integral number of CWords. The CWord width is programmed in MSF_Rx_Control[Rx_CWord_Size]. If the VP Error Flag is set: • the FC_Egress_Status[SF_CReady] and FC_Egress_Status[SF_DReady] bits are cleared • the MSF_Interrupt_Status[VP_Error] bit is set (which can interrupt the Intel XScale® core) 8.2.9 Error Cases Receive errors are specific to the protocol, SPI-4 or CSIX. The element status, described in Table 90 and Table 91, has appropriate error bits defined. Also, there are some IXP2800 Network Processor specific error cases — for example, when an mpacket arrives with no free elements — that are logged in the MSF_Interrupt_Status register, which can interrupt the Intel XScale® core if enabled. Hardware Reference Manual 261 Intel® IXP2800 Network Processor Media and Switch Fabric Interface 8.3 Transmit The transmit section consists of: • Transmit Pins (Section 8.3.1) • Transmit Buffer (Section 8.3.2) • Byte Aligner (Section 8.3.2) Each of these is described below. Figure 94 is a simplified block diagram of the MSF transmit block. S_Pull_Data (32-bits from ME) - D_Push_Data (64-bits from DRAM) TBUF - - - - - - - - - - - - - SPI-4 Protocol Logic - CSIX Protocol Logic Byte Align Figure 94. Simplified Transmit Section Block Diagram TDAT TCTL TPAR Control Valid Element Logic ME Reads (S_Push_Bus) From Other CSRs TCLK TCLK REF - RXCSRB (Ready Bits) Internal Clock RXCDAT RXCFC (FCIFIFO full) 8.3.1 Internal Clock for Transmit Logic FCIFIFO - - - - - - - - - - - - - - - - - A9343-01 Transmit Pins The use of the transmit pins is a function of the protocol (which is determined by TBUF partition in MSF_Tx_Control CSR) as shown in Table 94. Table 94. Transmit Pins Usage by Protocol (Sheet 1 of 2) Name TCLK 262 Direction Output SPI-4 Use CSIX Use TDCLK RxClk TDAT[15:0] Output TDAT[15:0] RxData[15:0] TCTL Output TCTL RxSOF Hardware Reference Manual Intel® IXP2800 Network Processor Media and Switch Fabric Interface Table 94. Transmit Pins Usage by Protocol (Sheet 2 of 2) Name 8.3.2 Direction SPI-4 Use CSIX Use TPAR Output Not Used RTxPar TSCLK Input TSCLK Not Used TSTAT[1:0] Input TSTAT[1:0] Not Used TBUF The TBUF is a RAM that holds data and status to be transmitted. The data is written into subblocks referred to as elements, by Microengine or the Intel XScale® core. TBUF contains a total of 8 Kbytes of data, and associated control. 
Table 95 shows the order in which data is written into TBUF. Each number represents a byte, in order of transmission onto the tx interface. Note that this is reversed on a 32-bit basis relative to RBUF — the swap of 4 low bytes and 4 high bytes is done in hardware to facilitate the transmission of bytes. Table 95. Order in which Data is Transmitted from TBUF Data/Payload 0 1 2 3 4 Address Offset (Hex) 5 6 7 0 8 9 A B C D E F 8 10 11 12 13 14 15 16 17 10 The mapping of elements to address offset in TBUF is based on the TBUF partition and element size, as programmed in MSF_Tx_Control CSR. TBUF can be partitioned into one, two, or three partitions based on MSF_Tx_Control[TBUF_Partition]. The mapping of partitions to transmit data is shown in Table 96. Table 96. Mapping of TBUF Partitions to Transmit Protocol Data Use by Partition, Fraction of TBUF Used, Start Byte Offset (Hex) Number of Partitions in Use Transmit Data Protocol Partition Number 0 1 2 n/a n/a SPI-4 1 SPI-4 only All Byte 0 CSIX Data 2 3 CSIX only Both SPI-4 and CSIX CSIX Control 3/4 of TBUF 1/4 of TBUF Byte 0 Byte 0x1800 n/a CSIX Data SPI-4 1/2 of TBUF 3/8 of TBUF 1/8 of TBUF Byte 0 Byte 0x1000 Byte 0x1C00 CSIX Control The data in each segment is further broken up into elements, based on MSF_Tx_Control[TBUF_Element_Size_#] (n = 0,1,2). There are three choices of element size: 64, 128, or 256 bytes. Hardware Reference Manual 263 Intel® IXP2800 Network Processor Media and Switch Fabric Interface Table 97 shows the TBUF partition options. Note that the choice of element size is independent for each partition. Table 97. Number of Elements per TBUF Partition Partition Number TBUF_Partition Field TBUF_Element_Size_# Field 0 00 (1 partition) 01 (2 partitions) 10 (3 partitions) 1 2 Unused Unused 00 (64 bytes) 128 01 (128 bytes) 64 10 (256 bytes) 32 00 (64 bytes) 96 32 01 (128 bytes) 48 16 Unused 10 (256 bytes 24 8 00 (64 bytes) 64 48 16 01 (128 bytes) 32 24 8 10 (256 bytes) 16 12 4 The Microengine can write data from Microengine S_TRANSFER_OUT registers to the TBUF using the msf[write] instruction, where they specify the starting byte number (which must be aligned to four bytes), and number of 32-bit words to write. The number in the instruction can be either the number of 32-bit words, or number of 32-bit word pairs, using the single and double instruction modifiers, respectively. Data is pulled from the Microengine to TBUF via S_Pull_Bus. msf[write, $s_xfer_reg, src_op_1, src_op_2, ref_cnt], optional_token The src_op_1 and src_op_2 operands are added together to form the address in TBUF (note that the base address of the TBUF is 0x2000). The ref_cnt operand is the number of 32-bit words or word pairs, which are pulled from sequential S_TRANSFER_OUT registers, starting with $s_xfer_reg. The Microengine can move data from DRAM to TBUF using the instruction dram[tbuf_wr, --, src_op1, src_op2, ref_cnt], indirect_ref The src_op_1 and src_op_2 operands are added together to form the address in DRAM, so the dram instruction must use indirect mode to specify the TBUF address. The ref_cnt operand is the number of 64-bit words that are written into TBUF. Data is stored in big-endian order. The most significant (lowest numbered) byte of each 32-bit word is transmitted first. All elements within a TBUF partition are transmitted in the order. Control information associated with the element (Section 98 and Section 99) defines which bytes are valid. The data from the TBUF will be shifted and byte-aligned to the TDAT pins as required. 
Four parameters are defined. Prepend Offset — Number of the first byte to send. This is information that is prepended onto the payload, for example as a header. It need not start at offset 0 in the element. Prepend Length — Number of bytes of prepended information. This can be 0 to 31 bytes. If it is 0, then the Prepend Offset must also be 0. 264 Hardware Reference Manual Intel® IXP2800 Network Processor Media and Switch Fabric Interface Payload Offset — Number of bytes to skip from the last 64-bit word of the Prepend to the start of Payload. The absolute byte number of the first byte of Payload in the element is: ((Prepend Offset + Prepend Length + 0x7) && 0xF8) + Payload Offset. Payload Length — Number of bytes of Payload. The sum of Prepend Length, Payload length, and any gaps in between them (((prepend_offset + prepend_length + 7) & 0xF8) + payload_offset + payload_length) must be no greater than the number of bytes in the element. Typically, the Prepend is computed by a Microengine and written into the TBUF by msf[write] and the Payload will be written by dram[tbuf_wr]. These two operations can be done in either order; the microcode is responsible for making sure the element is not marked valid to transmit until all data is in the element (see Section 8.3.3). Example 36 illustrates the usage of the parameters. The element in Example 36 is shown as 8 bytes wide because the smallest unit that can be moved into the element is 8 bytes. In Example 36, bytes to be transmitted are shown in black (the offsets are byte numbers); bytes in gray are written into TBUF (because the writes always write 8 bytes), but are not transmitted. Prepend Offset = 6 (bytes 0x0 — 0x5 are not transmitted). Prepend Length = 16 (bytes 0x6 — 0x15 are transmitted). Payload Offset = 7 (bytes 0x16 — 0x1E are not transmitted). The Payload starts in the next 8-byte row (i.e., the next “empty” row above where the Prepend stops), even if there is room in the last row containing Prepend information. This is done because the TBUF does not have byte write capability, and therefore would not merge the msf[write] and dram[tbuf_wr]. The software computing the Payload Offset only needs to know how many bytes of the payload that were put into DRAM need to be removed. Payload Length = 33 (bytes 0x1F through 0x3F are transmitted). Example 36. TBUF Prepend and Payload 0 1 2 3 4 5 6 7 8 9 A B C D E F 10 11 12 13 14 15 16 17 18 19 1A 1B 1C 1D 1E 1F 20 21 22 23 24 25 26 27 28 29 2A 2B 2C 2D 2E 2F 30 31 32 33 34 35 36 37 38 39 3A 3B 3C 3D 3E 3F The transmit logic will send the valid bytes onto TDAT correctly aligned and with no gaps. The protocol transmitted, SPI-4 or CSIX (and the value of the TPROT output) are based on which partition of TBUF the data was placed (see Table 95). Hardware Reference Manual 265 Intel® IXP2800 Network Processor Media and Switch Fabric Interface 8.3.2.1 SPI-4 For SPI-4, data is put into the data portion of the element, and information for the SPI-4 Control Word that will precede the data is put into the Element Control Word. 
When the Element Control Word is written the information is (the data comes from two consecutive Transfer registers; bits [31:0] from the lower numbered and bits[63:32] from the higher numbered): 2 9 2 8 2 7 2 6 2 5 2 4 2 3 5 9 5 8 5 7 5 6 5 5 5 4 5 3 1 8 1 7 1 6 1 5 5 1 5 0 4 9 4 8 1 3 1 2 1 1 1 0 9 8 4 1 4 0 Payload Offset 4 7 4 4 4 3 4 2 Prepend Length 5 2 1 4 Skip 6 0 1 9 Abort 6 1 2 0 Res 6 2 2 1 Prepend Offset Payload Length 6 3 2 2 EOP 3 0 SOP 3 1 4 6 4 5 7 6 5 4 3 2 1 0 3 4 3 3 3 2 ADR 3 9 3 8 3 7 3 6 3 5 Res The definitions of the fields are shown in Table 98. Table 98. TBUF SPI-4 Control Definition Field Definition Payload Length Indicates the number of Payload bytes, from 1 to 256, in the element. The value of 0x00 means 256 bytes. The sum of Prepend Length and Payload Length will be sent. That value will also control the EOPS field (1 or 2 bytes valid indicated) of the Control Word that will succeed the data transfer. Note 1. Prepend Offset Indicates the first valid byte of Prepend, from 0 to 7, as defined in Section 8.3.2. Prepend Length Indicates the number of bytes in Prepend, from 0 to 31. Payload Offset Indicates the first valid byte of Payload, from 0 to 7, as defined in Section 8.3.2. Skip Allows software to allocate a TBUF element and then not transmit any data from it. 0—transmit data according to other fields of Control Word. 1—free the element without transmitting any data. Abort Indicates if the element is the end of a packet that should be aborted. If this bit is set, the status code of EOP Abort will be sent in the EOPS field of the Control Word that will succeed the data transfer. Note 1. SOP Indicates if the element is the start of a packet. This field will be sent in the SOPC field of the Control Word that will precede the data transfer. EOP Indicates if the element is the end of a packet. This field will be sent in the EOPS field of the Control Word that will succeed the data transfer. Note 1. ADR The port number to which the data is directed. This field will be sent in the ADR field of the Control Word that will precede the data transfer. NOTE: 1. Normally EOPS is sent on the next Control Word (along with ADR and SOP) to start the next element. If there is no valid element pending at the end of sending the data, the transmit logic will insert an Idle Control Word with the EOPS information. 266 Hardware Reference Manual Intel® IXP2800 Network Processor Media and Switch Fabric Interface 8.3.2.2 CSIX For CSIX protocol, the TBUF should be set to two partitions in MSF_Tx_Control[TBUF_Partition], one for Data traffic and one for Control traffic. Payload information is put into the Payload area of the element, and Base and Extension Header information is put into the Element Control Word. Data is stored in big-endian order. The most significant byte of each 32-bit word is transmitted first. When the Element Control Word is written the information is (note that the data comes from two consecutive Transfer registers; bits [31:0] from the lower numbered and bits[63:32] from the higher numbered): 2 6 2 5 2 4 2 3 6 2 6 1 6 0 5 9 5 8 2 1 Prepend Offset Payload Length 6 3 2 2 5 7 5 6 5 5 5 4 5 3 2 0 1 9 1 8 1 7 1 6 Prepend Length 5 2 5 1 5 0 4 9 4 8 1 2 1 1 1 0 9 Payload Offset P 2 7 CR 2 8 Res 2 9 Skip 3 0 Res 3 1 1 5 1 4 4 7 4 4 4 3 4 2 4 1 4 0 4 6 1 3 4 5 8 7 6 5 4 3 2 Res 3 9 3 8 3 7 1 0 Type 3 6 3 5 3 4 3 3 3 2 Extension Header The definitions of the fields are shown in Table 99. Table 99. 
TBUF CSIX Control Definition Field Definition Payload Length Indicates the number of Payload bytes, from 1 to 256, in the element. The value of 0x00 means 256 bytes. The sum of Prepend Length and Payload Length will be sent, and also put into the CSIX Base Header Payload Length field. Note that this length does not include any padding that may be required. Padding is inserted by transmit hardware as needed. Prepend Offset Indicates the first valid byte of Prepend, from 0 to 7, as defined in Section 8.3.2. Prepend Length Indicates the number of bytes in Prepend, from 0 to 31. Payload Offset Indicates the first valid byte of Payload, from 0 to 7, as defined in Section 8.3.2. Skip Allows software to allocate a TBUF element and then not transmit any data from it. 0—transmit data according to other fields of Control Word 1—free the element without transmitting any data. CR CR (CSIX Reserved) bit to put into the CSIX Base Header. P P (Private) bit to put into the CSIX Base Header. Type Type Field to put into the CSIX Base Header. Idle type is not legal here. Extension Header The Extension Header to be sent with the CFrame. The bytes are sent in big-endian order; byte 0 is in bits 63:56, byte 1 is in bits 55:48, byte 2 is in bits 47:40, and byte 3 is in bits 39:32. Hardware Reference Manual 267 Intel® IXP2800 Network Processor Media and Switch Fabric Interface 8.3.3 Transmit Operation Summary During transmit processing data to be transmitted is placed into the TBUF under Microengine control, which allocates an element in software. The transmit hardware processes TBUF elements within a partition, in strict sequential order so the software can track the element to allocate next. Microengines may write directly into an element by the msf[write] instruction, or have data from DRAM written into the element by the dram[tbuf_wr] instruction. Data can be merged into the element by doing both. There is a Transmit Valid bits per element, which marks the element as ready to be transmitted. Microengines move all data into the element, by either or both of the msf[write] and dram[tbuf_wr] instructions to the TBUF. The Microengines also write the element Transmit Control Word with information about the element. The Microengines should use a single operation to perform the TCW write, i.e., a single msf[write] with a ref_count of 2. When all of the data movement is complete, the Microengine sets the element valid bit as shown in the following steps. 1. Move data into TBUF by either or both of msf[write] and dram[tbuf_wr] instructions to the TBUF. 2. Wait for 1 to complete. 3. Write Transmit Control Word at TBUF_Element_Control_# address. Using this address sets the Transmit Valid bit. Note: When moving data from DRAM to TBUF using dram[tbuf_wr], it is possible that there could be an uncorrectable error on the data read from DRAM (if ECC is enabled). In that case, the Microengine does not get an Event Signal, to prevent use of the corrupt data. The error is recorded in the DRAM controller (including the number of the Microengine that issued the TBUF_Wr command — refer to the DRAM chapter for details), and will interrupt the Intel XScale® core, if enabled, so that it can take appropriate action. Such action is beyond the scope of this document. However, it must include recovering the TBUF element by setting it valid with the Skip bit set in the Control Word. The transmit pipeline will be stalled since all TBUF elements must be transmitted in order; it will be un-stalled when the element is skipped. 
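The offset arithmetic of Section 8.3.2 is easy to get wrong, so the following C sketch shows the computation of the first Payload byte and the fits-in-element check that microcode should perform before writing the Transmit Control Word. The function names are illustrative only.

    #include <stdbool.h>
    #include <stdint.h>

    /* First Payload byte within the element: the end of the Prepend is rounded
     * up to the next 8-byte row before the Payload Offset is applied
     * (Section 8.3.2). */
    static uint32_t payload_start_byte(uint32_t prepend_offset,
                                       uint32_t prepend_length,
                                       uint32_t payload_offset)
    {
        uint32_t prepend_end_rounded = (prepend_offset + prepend_length + 7u) & ~7u;
        return prepend_end_rounded + payload_offset;
    }

    /* Prepend, the gap, and Payload must all fit within the element. */
    static bool element_lengths_ok(uint32_t prepend_offset, uint32_t prepend_length,
                                   uint32_t payload_offset, uint32_t payload_length,
                                   uint32_t element_size /* 64, 128, or 256 */)
    {
        return (payload_start_byte(prepend_offset, prepend_length, payload_offset)
                + payload_length) <= element_size;
    }

With the values of Example 36 (Prepend Offset 6, Prepend Length 16, Payload Offset 7), payload_start_byte() returns 0x1F, matching the first transmitted Payload byte in the example; adding the Payload Length of 33 exactly fills the 64-byte element.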
8.3.3.1 SPI-4 Transmit control logic sends valid elements on the transmit pins in element order. First, a Control Word is sent — it is formed as shown in Table 100. After the Control Word, the data is sent; the number of bytes to send is the total of Element Control Word Prepend Length field plus the Element Control Word Payload Length. Table 100. Transmit SPI-4 Control Word SPI-4 Control Word Field 268 Derived from: Type Type Bit of Element Control Word EOPS EOP Bit, Prepend Length, Payload Length of previous element’s Element Control Word SOP SOP Bit of Element Control Word ADR ADR field of Element Control Word DIP-4 Parity accumulated on previous element’s data and this Control Word Hardware Reference Manual Intel® IXP2800 Network Processor Media and Switch Fabric Interface If the next sequential element is not valid when its turn comes up: 1. Send an idle Control Word with SOP set to 0, and EOPS set to the values determined from the most recently sent element, ADR field 0x00, correct parity. 2. Until an element becomes valid, send idle Control Words with SOP set to 0, EOPS set to 00, ADR field 0x00, and correct parity. Note: Sequential elements with same ADR are not “merged”, a Control Word is sent for each element. Note: SPI-4 requires that all data transfers, except the last fragment (with EOP), be multiples of 16 bytes. It is up to the software loading the TBUF element to enforce this rule. After an element has been sent on the transmit pins, the valid bit for that element is cleared. The Tx_Sequence register is incremented when the element has been transmitted; by also maintaining a sequence number of elements that have been allocated (in software), the microcode can determine how many elements are in-flight. 8.3.3.2 CSIX Transmit control logic sends valid elements on the transmit pins in element order. Each element sends a single CFrame. First the Base Header is sent — it is formed as shown in Table 101. Next, the Extension Header is sent. Finally, the data is sent; the number of bytes to send is the total of Element Control Word Prepend Length field plus the Element Control Word Payload Length, plus padding to fill the final CWord if required (the CWord Size is programmed in MSF_Tx_Control[Tx_CWord_Size]). Both Horizontal Parity and Vertical Parity are transmitted, as described in Section 8.3.5.2.1 and Section 8.3.5.2.2. Note: When transmitting a Flow Control CFrame, the entire payload must be written into the TBUF entry. The extension header field of the Transmit Control Word is not used for Flow Control CFrames. Table 101. Transmit CSIX Header CSIX Header Field Derived From Type Type field of Element Control Word Data Ready FC_Ingress_Status[TM_DReady] Control Ready FC_Ingress_Status[TM_CReady] Payload Length Element Control Word Prepend Length + Element Control Word Payload Length P P Bit of Element Control Word CR CR Bit of Element Control Word Extension Header Extension Header field of Element Control Word Control elements and Data elements share use of the transmit pins. Each will alternately transmit a valid element, if present. If the next sequential element is not valid when its turn comes up, or if transmission is disabled by FC_Ingress_Status[SF_CReady] or FC_Ingress_Status[SF_DReady], then transmit logic will alternate sending Idle CFrames with Dead Cycles; it will continue to do so until a valid element is ready. Idle CFrames get the value for the Ready Field from FC_Ingress_Status[TM_Cready] and FC_Ingress_Status[TM_DReady], the Payload Length is set to 0. 
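As noted at the end of Section 8.3.3.1, microcode can determine how many elements remain in flight by comparing its own allocation count against the Tx_Sequence register. A minimal C sketch of that bookkeeping follows; the CSR-read helper and the wrap mask are assumptions for illustration, not defined hardware behavior.

    #include <stdint.h>

    #define SEQ_MASK 0xFFu                      /* assumed counter wrap width */

    extern uint32_t tx_sequence_read(void);     /* hypothetical read of TX_Sequence_# */

    /* Number of allocated-but-not-yet-transmitted TBUF elements. Valid as long
     * as the in-flight count never exceeds the partition's element count. */
    static uint32_t tbuf_elements_in_flight(uint32_t allocated_seq)
    {
        return (allocated_seq - tx_sequence_read()) & SEQ_MASK;
    }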
Hardware Reference Manual 269 Intel® IXP2800 Network Processor Media and Switch Fabric Interface Note: A Dead Cycle is any cycle after the end of a CFrame, and prior to the start of another CFrame (i.e., SOF is not asserted). The end of a CFrame is defined as after the Vertical Parity has been transmitted. This in turn is found by counting the Payload Bytes specified in the Base Header and rounding up to CWord size. After an element has been sent on the transmit pins, the valid bit for that element is cleared. The Tx_Sequence register is incremented when the element has been transmitted; by also maintaining a sequence number of elements that have been allocated (in software), the microcode can determine how many elements are in-flight. 8.3.3.3 Transmit Summary The states that a TBUF element goes through (Free, Allocated, Transmitting, and Valid) are shown in Figure 95. Figure 95. TBUF State Diagram Reset Free. Element is empty and available to be allocated to be filled. Allocate new element (Next element is kept by ME software) All data in element transmitted Transmitting. Data in element is being sent out on Tx pins. Allocated. Element is being filled with data under ME control. There is no limit to how many elements may be in this state. Set valid by msf[write] All previous elements transmitted Valid. Element has been set valid by ME code using one of two methods. In this state, it will wait to be transmitted (FIFO order is maintained). A9344-02 8.3.4 Transmit Flow Control Status Transmit Flow Control is handled partly by hardware and partly by software. Information from the Egress IXP2800 Network Processor can be transmitted to the Ingress IXP2800 Network Processor (as described in Section 8.2.7 on Receive Flow Control); how it is used is described in the remainder of this section. 270 Hardware Reference Manual Intel® IXP2800 Network Processor Media and Switch Fabric Interface 8.3.4.1 SPI-4 FIFO status information is sent periodically over the TSTAT signals from the PHY to the Link Layer device, which is the IXP2800 Network Processor. (The RXCDAT pins can act as TSTAT based on the MSF_Tx_Control[TSTAT_Select] bit.) The FIFO status of each port is encoded in a 2-bit data structure — code 0x3 is used for framing the data, and the other three codes are valid status values, which are interpreted by Microengine software. The FIFO status words are received according to a repeating calendar sequence. Each sequence begins with the framing code to indicate the start of a sequence, followed by the status codes, followed by a DIP-2 parity code covering the preceding frame. The length of the calendar, as well as the port values, are defined in this section, and shown in Figure 96. Figure 96. Tx Calendar Block Diagram Tx_Calendar_Length 2 Calendar Counter 32 8 Tx Port Status To MSF_Interrupt_Status and MSF_Tx_Control Registers 256 Tx Multiple 16 Port Status Tx_Calendar 256 Frame Pattern Counter Start of Frame Detect CSR Reads CSR Reads Parity TSTAT A9761-02 Tx_Port_Status_# is a register file containing 256 registers, one for each of the SPI-4.2 ports. The port status is updated each time a new calendar status is received for each port, according to the mode programmed in MSF_Tx_Control[Tx_Status_Update_Mode]. The Tx_Port_Status_# register file holds the latest received status for each port, and can be read by CSR reads. There are 16 Tx_Multiple_Port_Status_# registers. Each aggregates the status for each group of 16 ports. 
These registers provide an alternative method for reading the FIFO status of multiple ports with a single CSR read. For example, Tx_Multiple_Port_Status_0 contains the 2-bit status for ports 0 – 16, and provides the same status as reading the individual registers, Tx_Port_Status_0 through Tx_Port_Status_15. Hardware Reference Manual 271 Intel® IXP2800 Network Processor Media and Switch Fabric Interface The TX_Port_Status_# or the TX_Multiple_Port_Status_# registers must be read by the software to determine the status of each port and send data to them accordingly. The MSF hardware does not check these registers for port status before sending data out to a particular port. The MSF_Tx_Control[Tx_Status_Update_Mode] field is used to select one of two methods for updating the port status. The first method updates the port status with the new status value, regardless of the value received. The second method updates the port status only when a value is received that is equal to or less than the current value. Note: Detailed information about the status update modes is contained in the Intel® IXP2400 and IXP2800 Network Processor Programmer’s Reference Manual. Reading a port status causes its value to be changed. This provides a way to avoid reading stale status bits. The MSF_Tx_Control[Tx_Status_Read_Mode] field is used to select the method for changing the bits after they are read. Tx_Calendar is a RAM with 256 entries of eight bits each. It is initialized with the calendar information by software (the calendar is a list that indicates the sequence of port status that will be sent — the PHY and the IXP2800 Network Processor must be initialized with the same calendar). Tx_Calendar_Length is a CSR field that is initialized with the length of the calendar, since in many cases, not all 256 entries of Tx_Calendar are used. When the start of a Status frame pattern is detected (by a value of 0x3 on TSTAT) the Calendar Counter is initialized to 0. On each data cycle, the Calendar Counter is used to index into Tx_Calendar to read a port number. The port number is used as an index to Tx_Port_Status, and the information received on TSTAT is put into that location in Tx_Port_Status. The count is incremented each cycle. DIP-2 Parity is also accumulated on TSTAT. At the start of the frame, parity is cleared. When the count reaches Tx_Calendar_Length, the next value on TSTAT is used to compare to the accumulated parity. The control logic then looks for the next frame start. If the received parity does not match the expected value, the MSF_Interrupt_Status[TSTAT_Par_Err] bit is set, which can interrupt the Intel XScale® core if enabled. Note: An internal status flag records whether or not the most recently received DIP-2 was correct. When that flag is set (indicating bad DIP-2 parity) all reads to Tx_Port_Status return a status of “Satisfied” instead of the value in the Tx_Port_Status RAM. The flag is re-loaded at the next parity sample; so the implication is that all ports will return “Satisfied” status for at least one calendar. SPI-4 protocol uses a continuous stream of repeated frame patterns to indicate a disabled status link. The IXP2800 Network Processor flow control status block has a Frame Pattern Counter that counts up each time a frame pattern is received on TSTAT, and is cleared when any other pattern is received. 
When the Frame Pattern Counter reaches 32, MSF_Interrupt_Status[Detect_No_Calendar] is set and Train_Data[Detect_No_Calendar] is asserted (MSF_Interrupt_Status[Detect_No_Calendar] must be cleared by a write to the MSF_Interrupt_Status register; Train_Data[Detect_No_Calendar] will reflect the current status and will deassert when the frame pattern stops). The transmit logic will generate training sequence on transmit pins while both Train_Data[Detect_No_Calendar] and Train_Data[Train_Enable_TSTAT] are asserted. 272 Hardware Reference Manual Intel® IXP2800 Network Processor Media and Switch Fabric Interface 8.3.4.2 CSIX There are two types of CSIX flow control: • Link-level • Virtual Output Queue (VOQ) 8.3.4.2.1 Link-Level The Link-level flow control function is done via hardware and consists of two parts: 1. Enable/disable transmission of valid TBUF elements. 2. Ready field to be sent in CFrames sent to the Switch Fabric. As described in Section 8.2.7, the Ready Field of received CFrames is placed into FC_Egress_Status[SF_CReady] and FC_Egress_Status[SF_DReady]. The value in those bits is sent to the Ingress IXP2800 Network Processor on TXCSRB. In Full Duplex Mode, the information is received on RXCSRB by the Ingress IXP2800 Network Processor and put into FC_Ingress_Status[SF_CReady] and FC_Ingress_Status[SF_DReady]. Those bits allow or stop transmission of Control and Data elements, respectively. When one of those bits transitions from allowing transmission to stopping transmission, the current CFrame in progress (if any) is completed, and the next CFrame of that type is prevented from starting. As described in Section 8.2.7, if the Egress IXP2800 Network Processor RBUF gets near full, or if the Egress IXP2800 Network Processor FCEFIFO gets near full, it will send that information on TXCSRB. Those bits are put into FC_Ingress_Status[TM_CReady] and FC_Ingress_Status[TM_DReady], and are used as the value in CFrame Base Header Control Ready and Data Ready, respectively. 8.3.4.2.2 Virtual Output Queue The Virtual Output Queue flow control function is done by software, with hardware support. As described in Section 8.2.7, the CSIX Flow Control CFrames received on the Egress IXP2800 Network Processor are passed to the Ingress IXP2800 Network Processor over TXCDAT. The information is received on RXCDAT and placed into the FCIFIFO. A Microengine reads that information by msf[read], and uses it to maintain per-VOQ information. The way in which that information is used is application-dependent and is done in software. The hardware mechanism is described in Section 8.5.3. 8.3.5 Parity 8.3.5.1 SPI-4 DIP-4 parity is computed by Transmit hardware placed into the Control Word sent at the beginning of transmission of a TBUF element, and also on Idle Control Words sent when no TBUF element is valid. The value to place into the DIP-4 field is computed on the preceding Data Words (if any), and the current Control Word. Hardware Reference Manual 273 Intel® IXP2800 Network Processor Media and Switch Fabric Interface 8.3.5.2 CSIX 8.3.5.2.1 Horizontal Parity The transmit logic computes odd Horizontal Parity for each transmitted 16-bits of each Cword, and transmits it on TxPar. 8.3.5.2.2 Vertical Parity The transmit logic computes Vertical Parity on CFrames. There is a 16-bit VP Accumulator register. At the beginning of each CFrame, the register is cleared. 
As each Cword is transmitted, odd parity is accumulated in the register as defined in the CSIX specification (16 bits of vertical parity are formed on 32 bits of transmitted data by treating the data as words; i.e., bit 0 and bit 16 of the data are accumulated into parity bit 0, bit 1, and bit 17 of the data are accumulated into parity bit 1, etc.). The accumulated value is transmitted in the Cword along with the last byte of Payload and any padding, if required. 8.4 RBUF and TBUF Summary Table 102 summarizes and contrasts the RBUF and TBUF operations. Table 102. Summary of RBUF and TBUF Operations (Sheet 1 of 2) Operation RBUF TBUF SPI-4 Allocate element Hardware allocates an element upon receipt of a non-idle Control Word, or when a previous element becomes full and another Data Word arrives with no intervening Control Word. Any available element in the SPI-4 partition may be allocated, however, elements are guaranteed to be handed to threads in the order in which they arrive. CSIX Hardware allocates an element upon receipt of RxSof asserted. Any available element in the CSIX Control or CSIX Data partition may be allocated (according to the type), however, elements are guaranteed to be handed to threads in the order in which they arrive. SPI-4 Fill element Hardware fills the element with Data Words. CSIX Hardware fills the element with Payload. Microengine allocates an element. Because the elements are transmitted in FIFO order (within each TBUF partition), the Microengine can keep the number of the next element in software. Microcode fills the element from DRAM using the dram[tbuf_wr] instruction and from Microengine registers using msf[write] instruction. SPI-4 Set element valid Set valid by hardware when either it becomes full or when a Control Word is received. CSIX Set valid by hardware when the number of bytes in Payload Length have been received. 274 The element’s Transmit Valid bit is set. This is done by a write to the TBUF_Element_Control_$_# CSR ($is A or B, # is the element number). Hardware Reference Manual Intel® IXP2800 Network Processor Media and Switch Fabric Interface Table 102. Summary of RBUF and TBUF Operations (Sheet 2 of 2) Operation RBUF Remove data from element Return element to Free List 8.5 TBUF Microcode moves data from the element to DRAM using the dram[rbuf_rd] instruction and to Microengine registers using the msf[read] instruction. Microcode writes to Rx_Element_Done with the number of the element to free. Hardware transmits information from the element to the Tx pins. Transmission of elements is in FIFO order within each partition; that is an element will be transmitted only when all preceding elements in that partition have been transmitted. Choice of element to transmit among partitions is round-robin. Microengine software uses the TX_Sequence_n CSRs to track elements that have been transmitted. CSIX Flow Control Interface This section describes the Flow Control Interface. Section 8.2 and Section 8.3 of this chapter also contain descriptions of how those functions interact with Flow Control. There are two modes — Full Duplex, where flow control information goes from Egress IXP2800 Network Processor to the Ingress IXP2800 Network Processor, and Simplex mode, where the information from the Switch Fabric is sent directly to the Ingress IXP2800 Network Processor, and from the Egress IXP2800 Network Processor to the Switch Fabric. 8.5.1 TXCSRB and RXCSRB Signals TXCSRB and RXCSRB are used only in Full Duplex mode. (See Figure 97.) 
They send information from the Egress to the Ingress IXP2800 Network Processor for two reasons: 1. Pass the CSIX Ready Field (link-level flow control) from the Switch Fabric to the Ingress IXP2800. The information is used by the Ingress IXP2800’s transmit control logic to stop transmission of CFrames to the Switch Fabric. 2. Set the value of the Ready field sent from the Ingress IXP2800 to the Switch Fabric. This is to inform the Switch Fabric to stop transmitting CFrames to the Egress IXP2800, based on receive buffer resource availability in the Egress IXP2800. Figure 97. CSIX Flow Control Interface — TXCSRB and RXCSRB Ingress Intel® IXP2800 Network Processor msf[read] TBUF FC_Ingress_Status CSR Egress Intel IXP2800 Network Processor msf[read] FC_Egress_Status CSR TDAT Link Level Flow Control RXCSRB Switch Fabric TXCSRB RDAT A9762-01 Hardware Reference Manual 275 Intel® IXP2800 Network Processor Media and Switch Fabric Interface The information transmitted on TXCSRB can be read in FC_Egress_Status CSR, and the information received on RXCSRB can be read in FC_Ingress_Status CSR. The TXCSRB or RXCSRB signals carry the Ready information in a serial stream. Four bits of data are carried in 10 clock phases, LSB first, as shown in Table 103. Table 103. SRB Definition by Clock Phase Number Clock Cycle Number 0–5 6 7 Description Source of bit on Egress IXP2800 Network Processor (TXCSRB) Framing information. Data is 000001; this pattern allows the Ingress IXP2800 Network Processor to get synchronized to the serial stream regardless of the data values. Most recently received Control Ready from a CFrame Base Header. Also visible in FC_Egress_Status[SF_CReady]. Most recently received Data Ready from a CFrame Base Header. Also visible in FC_Egress_Status[SF_DReady] RBUF or FCEFIFO are above high water mark. 8 9 Use of bit on Ingress IXP2800 Network Processor (RXCSRB) Also visible in FC_Egress_Status[TM_CReady]. RBUF is above high water mark. Also visible in FC_Egress_Status[TM_DReady]. When 0—Stop sending Control CFrames to the Switch Fabric. When 1—OK to send Control CFrames to the Switch Fabric. Also visible in FC_Ingress_Status[SF_CReady]. When 0—Stop sending Data CFrames to the Switch Fabric. When 1—OK to send Data CFrames to the Switch Fabric. Also visible in FC_Ingress_Status[SF_DReady]. Place this bit in the Control Ready bit of all outgoing CSIX Base Headers. Also visible in FC_Ingress_Status[TM_CReady]. Place this bit in the Data Ready bit of all outgoing CSIX Base Headers. Also visible in FC_Ingress_Status[TM_DReady]. The Transmit Data Ready bit sent from Egress to Ingress IXP2800 Network Processor will be deasserted if the following condition is met. • RBUF CSIX Data partition is full, based on HWM_Control[RBUF_D_HWM]. The Transmit Control Ready bit sent from the Egress to the Ingress IXP2800 Network Processor will be deasserted if either of the following conditions is met. • RBUF CSIX Control partition is full, based on HWM_Control[RBUF_C_HWM]. • FCEFIFO full, based on HWM_Control[FCEFIFO_HWM]. 8.5.2 FCIFIFO and FCEFIFO Buffers FCIFIFO and FCEFIFO are 1-Kbyte (256 entry x 32-bit) buffers for the flow control information. FCEFIFO holds data while it is being transmitted off of the Egress IXP2800 Network Processor. FCIFIFO holds data received into the Ingress IXP2800 Network Processor until Microengines can read it. There are two usage models for the FIFOs — Full Duplex Mode and Simplex Mode, selected by MSF_Rx_Control[Duplex_Mode]. 
276 Hardware Reference Manual Intel® IXP2800 Network Processor Media and Switch Fabric Interface 8.5.2.1 Full Duplex CSIX In Full Duplex Mode, the information from the Switch Fabric is sent to the Egress IXP2800 Network Processor and must be communicated to the Ingress IXP2800 Network Processor via TXCSRB or RXCSRB. CSIX CFrames received from the Switch Fabric on the Egress IXP2800 Network Processor are put into FCEFIFO, based on the mapping in the CSIX_Type_Map CSR (normally they will be the Flow Control CFrames). The entire CFrame is put in, including the Base Header and Vertical Parity field. The CFrames are forwarded in a “cut-through” manner, meaning the Egress IXP2800 Network Processor does not wait for the entire CFrame to be received before forwarding. The Egress processor will corrupt the Vertical Parity of the CFrame being forwarded if either a Horizontal or Vertical Parity is detected during receive, to inform the Ingress processor that an error occured.The Ingress IXP2800 Network Processor checks both Horizontal Parity and Vertical Parity and will discard the entire CFrame if bad parity is detected. The signal protocol details of how the information is sent from the Egress IXP2800 Network Processor to the Ingress IXP2800 Network Processor is described in Section 8.5.3. (See Figure 98.) Figure 98. CSIX Flow Control Interface — FCIFIFO and FCEFIFO in Full Duplex Mode Ingress Intel® IXP2800 Network Processor msf[read] TDAT FCI_Not_Empty To MEs Switch Fabric FCI_Full FCIFIFO RXCSRB TXCSRB Egress Intel® IXP2800 Network Processor RXCFC TXCFC RXCDAT, RXCPAR, RXCSOF TXCDAT, TXCPAR, TXCSOF FCEFIFO MSF_Rx_Control[Duplex_Mode] From MEs RDAT A9763-01 The Ingress IXP2800 Network Processor puts the CFrames into the FCIFIFO, including the Base Header and Vertical Parity fields. It does not make a CFrame visible in the FCIFIFO until the entire CFrame has been received without errors. If there is an error, the entire CFrame is discarded and MSF_Interrupt_Status[FCIFIFO_Error] is set. CFrames in the FCIFIFO of the Ingress IXP2800 Network Processor are read by Microengines, which use them to keep current VOQ Flow Control information. (The application software determines how and where that information is stored and used.) Hardware Reference Manual 277 Intel® IXP2800 Network Processor Media and Switch Fabric Interface The FCIFIFO supplies two signals to Microengines, which can be tested using the BR_STATE instruction: 1. FCI_Not_Empty — indicates that there is at least one CWord in the FCIFIFO. This signal stays asserted until all CWords have been read. (Note that when FCIFIFO is empty, this signal will not assert until a full CFrame has been received into FCIFIFO; as that CFrame is removed by the Microengine, this signal will stay asserted until all CWords have been removed, including any subsequently received CFrames.) 2. FCI_Full — indicates that FCIFIFO is above the upper limit defined in HWM_Control[FCIFIFO_Int_HWM]. The Microengine that has been assigned to handle FCIFIFO must read the CFrame, 32 bits at a time, from the FCIFIFO by using the msf[read] instruction to the FCIFIFO address; the length of the read can be anywhere from 1 to 16. The FCIFIFO handler thread must examine the Base Header to determine how long the CFrame is and perform the necessary number of reads from the FCIFIFO to dequeue the entire CFrame. If a read is issued to FCIFIFO when it is empty, an Idle CFrame will be read back (0x0000FFFF). 
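A Microengine assigned as the FCIFIFO handler therefore follows a simple read loop. The C sketch below captures that logic; fcififo_read_cword() stands in for an msf[read] of one 32-bit CWord from the FCIFIFO address, cframe_cwords_remaining() for the Base Header parsing that determines how many CWords are left, and voq_update() for the application-specific VOQ bookkeeping. All three are hypothetical helpers, not part of the hardware interface.

    #include <stdint.h>

    extern uint32_t fcififo_read_cword(void);               /* msf[read] wrapper (assumed)        */
    extern uint32_t cframe_cwords_remaining(uint32_t hdr);  /* Base Header length parse (assumed) */
    extern void     voq_update(uint32_t cword);             /* application-specific               */

    #define FCIFIFO_IDLE_CWORD 0x0000FFFFu    /* returned when the FIFO is empty */

    static void fcififo_drain_one_cframe(void)
    {
        uint32_t first = fcififo_read_cword();    /* first CWord carries the Base Header    */
        if (first == FCIFIFO_IDLE_CWORD)
            return;                               /* FIFO was empty: Idle CFrame read back  */

        voq_update(first);
        for (uint32_t n = cframe_cwords_remaining(first); n > 0; n--)
            voq_update(fcififo_read_cword());     /* dequeue the rest of the CFrame         */
    }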
Note that when FCIFIFO is receiving a CFrame, it does not make it visible until the entire CFrame has been received without errors. The nearly-full signal is based on the upper limit programmed into HWM_Control[FCIFIFO_Int_HWM]. When asserted, this means that higher priority needs to be given to draining the FCIFIFO to prevent flow control from being asserted to the Egress IXP2800 Network Processor (by assertion of RXCFC). 8.5.2.2 Simplex CSIX In Simplex Mode, the Flow Control signals are connected directly to the Switch Fabric; flow control information is sent directly from the Egress IXP2800 Network Processor to the Switch Fabric, and directly from the Switch Fabric to the Ingress IXP2800 Network Processor. (See Figure 99.) Figure 99. CSIX Flow Control Interface — FCIFIFO and FCEFIFO in Simplex Mode Ingress Intel® IXP2800 Network Processor msf[read] TDAT FCI_Not_Empty To MEs Switch Fabric FCI_Full FCIFIFO RXCSRB TXCSRB Egress Intel® IXP2800 Network Processor RXCFC TXCFC RXCDAT, RXCPAR, RXCSOF TXCDAT, TXCPAR, TXCSOF FCEFIFO MSF_Rx_Control[Duplex_Mode] From MEs RDAT A9764-01 278 Hardware Reference Manual Intel® IXP2800 Network Processor Media and Switch Fabric Interface The TXCSRB and RXCSRB pins are not used in Simplex Mode. The RXCFC and TXCFC pins are used for flow control in both Simplex and Duplex Modes. The Egress IXP2800 Network Processor uses the TXCSOF, TXCDAT, and TXCPAR pins to send CFrames to the Switch Fabric. The Ingress IXP2800 Network Processor uses the RXCSOF, RXCDAT, and RXCPAR pins to receive CFrames from the Switch Fabric (the Switch Fabric is expected to send Flow Control CFrames on these pins instead of the RDAT pins in Simplex Mode). The FC_Ingress_Status[SF_CReady] and FC_Ingress_Status[SF_DReady] bits are set are from the “Ready bits” received in all incoming CFrames received on this interface. Transmit hardware in the Ingress IXP2800 Network Processor uses the FC_Ingress_Status[SF_CReady] and FC_Ingress_Status[SF_DReady] bits to flow control the data and control transmit on TDAT. CFrames in the FCIFIFO of the Ingress IXP2800 Network Processor are read by Microengines, which use them to keep current VOQ Flow Control information (this is the same as for Full Duplex Mode). The FCI_Not_Empty and FCI_Full status flags, as described in Section 8.5.2.1 let the Microengine know if the FCIFIFO has any CWords in it. When FCI_Full is asserted, FC_Ingress_Status[TM_CReady] will be deasserted; that bit is put into the Ready field of CFrames going to the Switch Fabric, to inform it to stop sending Control CFrames. Flow Control CFrames to the Switch Fabric are put into FCEFIFO, instead of TBUF, as in the Full Duplex Mode case. In this mode, the Microengines create CFrames and write them into FCEFIFO using the msf[write] instruction to the FCEFIFO address; the length of the write can be from 1 – 16. The Microengine creating the CFrame must put a header (conforming to CSIX Base Header format) in front of the message, indicating to the hardware how many bytes to send. The Microengine first tests if there is room in FCEFIFO by reading the FC_Egress_Status[FCEFIFO_Full] status bit. After the CFrame has been written to FCEFIFO, the Microengine writes to the FCEFIFO_Validate register, indicating that the CFrame should be sent out on TXCDAT; this prevents underflow by ensuring that the entire CFrame is in FCEFIFO before it can be transmitted. A validated CFrame at the head of FCEFIFO is started on TXCDAT if FC_Egress_Status[SF_CReady] is asserted, and held off, if it is deasserted. 
However, once started, the entire CFrame is sent, regardless of changes in FC_Egress_Status[SF_CReady]. The FC_Egress_Status[SF_DReady] is ignored in controlling FCEFIFO. FC_Egress_Status[TM_CReady] and FC_Egress_Status[TM_DReady] are placed by hardware into the Base Header of outgoing CFrames. Horizontal and Vertical parity are created by hardware. If there is no valid CFrame in FCEFIFO, or if FC_Egress_Status[SF_CReady] is deasserted, then idle CFrames are sent on TXCDAT. The idle CFrames also carry (in the Base Header Ready Field), both FC_Egress_Status[TM_CReady] and FC_Egress_Status[TM_DReady]. In all cases, the Switch Fabric must honor the “ready bits” to prevent overflowing RBUF. Note: For simplex mode, there is a condition in which the Flow Control Bus may take too long to properly control incoming traffic on CSIX. This condition may occur when large packets are transmitted on the Flow Control Bus and small packets are transmitted on CSIX. For example, this condition may occur if the Switch Fabric’s CSIX Receive FIFO is full, and the FIFO wants to deassert the x_RDY bit, but a maximum-sized flow control CFrame just went out. The Flow Control Bus is a 4-bit wide LVDS interface that sends data on both the rising and falling edges of the clock. As such, it takes 260 clock cycles to transmit a maximum-sized CFrame, which consists of 256 bytes, plus a 4-byte base header/vertical parity (i.e, 260 bytes total). The interface does not see the transition of the X_RDY bit until this CFrame has been transmitted or until 260 cycles later. Hardware Reference Manual 279 Intel® IXP2800 Network Processor Media and Switch Fabric Interface 8.5.3 TXCDAT/RXCDAT, TXCSOF/RXCSOF, TXCPAR/RXCPAR, and TXCFC/RXCFC Signals TXCDAT and RXCDAT, along with TXCSOF/RXCSOF and TXCPAR/RXCPAR are used to send CSIX Flow Control information from the Egress IXP2800 Network Processor to the Ingress IXP2800 Network Processor. The protocol is basically the same as CSIX-LI, but with only four data signals. TXCSOF is asserted to indicate start of a new CFrame. The format is the same as any normal CFrame — Base Header, followed by Payload and Vertical Parity; the only difference is that each CWord is sent on TXCDAT in four cycles, with the most significant bits first. TXCPAR carries odd parity for each four bits of data. The transmit logic also creates valid Vertical Parity at the end of the CFrame, with one exception. If the Egress IXP2800 Network Processor detected an error on the CFrame, it will create bad Vertical parity so that the Ingress IXP2800 Network Processor will detect that and discard it. The Egress IXP2800 Network Processor sends CFrames from FCEFIFO in cut-though manner. If there is no data in FCEFIFO, then the Egress IXP2800 Network Processor alternates sending Idle CFrames and Dead Cycles. (Note that FCIFIFO never enqueues Idle CFrames in either Full Duplex or Simplex Modes. The transmitted Idle CFrames are injected by the control state machine, not taken from the FCEFIFO.) The Ingress IXP2800 Network Processor asserts RXCFC to indicate that FCIFIFO is full, as defined by HWM_Control[FCIFIFO_Ext_HWM]. The Egress IXP2800 Network Processor, upon receiving that signal asserted, will complete the current CFrame, and then transmit Idle CFrames until RXCFC deasserts. During that time, the Egress IXP2800 Network Processor can continue to buffer Flow Control CFrames in FCEFIFO; however, if that fills, the further CFrames mapped to FCEFIFO will be discarded. 
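In Simplex Mode, the corresponding transmit-side sequence from Section 8.5.2.2 (check for room, write the CFrame, then validate it) can be sketched in C as follows. The three helpers are hypothetical stand-ins for the FC_Egress_Status read, the msf[write] to the FCEFIFO address, and the write to FCEFIFO_Validate.

    #include <stdbool.h>
    #include <stdint.h>

    extern bool fcefifo_full(void);                            /* FC_Egress_Status[FCEFIFO_Full] */
    extern void fcefifo_write(const uint32_t *w, unsigned n);  /* msf[write] of n CWords (1..16) */
    extern void fcefifo_validate(void);                        /* write to FCEFIFO_Validate      */

    /* Queue one flow control CFrame (Base Header first) for transmission on
     * TXCDAT. Returns false if there is no room; the caller may retry later. */
    static bool send_fc_cframe(const uint32_t *cframe, unsigned n_cwords)
    {
        if (fcefifo_full())
            return false;                  /* do not start a CFrame that cannot fit     */

        fcefifo_write(cframe, n_cwords);   /* CFrame begins with a CSIX Base Header     */
        fcefifo_validate();                /* CFrame becomes eligible for transmission  */
        return true;
    }

Validating only after the last write is what prevents underflow on TXCDAT: the transmit logic never starts a CFrame that is not yet completely buffered.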
Note: 8.6 If there is no Switch Fabric present, this port could be used for interchip message communication. FC pins must connect between network processors as in Full Duplex Mode. Set MSF_RX_CONTROL[DUPLEX_MODE] = 0 and MSF_TX_CONTROL[DUPLEX_MODE] = 0 (Simplex) and FC_STATUS_OVERRIDE=0x3ff. Microengines write CFrames to the FCEFIFO CSR as in Simplex Mode. The RXCFC and TXCFC pins must be connected between network processors to provide flow control. Deskew and Training There are three methods of operation that can be used, based on the application requirements. 1. Static Alignment — the receiver latches all data and control signals at a fixed point in time, relative to clock. 2. Static Deskew — the receiver latches each data and control signal at a programmable point in time, relative to clock. The programming value for each signal is characterized for a given system design and loaded into deskew control registers at system boot time. 3. Dynamic Deskew — the transmitter periodically sends a training pattern that the receiver uses to automatically select the optimal timing point for each data and control signal. The timing values are loaded into the deskew control registers by the training hardware. 280 Hardware Reference Manual Intel® IXP2800 Network Processor Media and Switch Fabric Interface The IXP2800 Network Processor supports all three methods. There are three groups of high-speed pins to which this applies, as shown in Table 104, Table 105, and Table 106. The groups are defined by the clock signal that is used. Table 104. Data Deskew Functions Clock Signals RDAT RCLK RCTL IXP2800 Network Processor Operation 1. Sample point for each pin is programmed in Rx_Deskew. 2. Deskew values set automatically when training pattern (Section 8.6.1) is received and is enabled in Train_Data[Ignore_Training]. RPAR RPROT TCLK TDAT 1. Send training pattern TCTL • under software control (write to Train_Data[Continuous_Train] or Train_Data[Single_Train]) TPAR • when TSTAT input has framing pattern for more than 32 cycles and enabled in Train_Data[Train_Enable]. TPROT Table 105. Calendar Deskew Functions Clock Signals IXP2800 Network Processor Operation RSCLK RSTAT 1. Used to indicate need for data training on receive pins by forcing to continual framing pattern (write to Train_Data[RSTAT_En]). 2. Send training pattern under software control (write to Train_Calendar[Continuous_Train] or Train_Calendar[Single_Train]). TSTAT 1. Sample point for each pin is set in Rx_Deskew, either by manual programming or automatically. 2. Deskew values set automatically when training pattern (Section ) is received and is enabled in Train_Calendar[Ignore_Training]. 3. Received continuous framing pattern can be used to initiate data training (Train_Data[Detect_No_Calendar]), and/or interrupt the Intel XScale® core. TSCLK Table 106. Flow Control Deskew Functions Clock Signals RXCSOF RXCCLK RXCDAT RXCPAR IXP2800 Network Processor Operation 1. Sample point for each pin is programmed in Rx_Deskew. 2. Deskew values set automatically when training pattern (Section 8.6.2) is received and is enabled in Train_Flow_Control[Ignore_Training]. Note 1, 2 RXCSRB TXCCLK TXCSOF 1. Send training pattern TXCDAT • under software control (write to Train_Flow_Control[Continuous_Train] or Train_Flow_Control[Single_Train]) TXCPAR • when TXCFC input has been asserted for more than 32 cycles and enabled in Train_Flow_Control[Train_Enable]. TXCSRB Notes 1, 2 NOTES: 1. TXCFC is not trained. 
RXCFC is driven out relative to RXCCLK; TXCFC is received relative to TXCCLK, but is treated as asynchronous. 2. RXCFC can be forced asserted by write to Train_Flow_Control[RXCFC_En]. Hardware Reference Manual 281 Intel® IXP2800 Network Processor Media and Switch Fabric Interface 8.6.1 Data Training Pattern The data pin training sequence is shown in Table 107. This is a superset of SPI-4 training sequence, because it includes the TPAR/RPAR and TPROT/RPOT pins, which are not included in SPI-4. Table 107. Data Training Sequence 0 2 to 11 0 12 to 21 CTL PROT 1 (Note 5) DATA PAR Cycle (Note 4) 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 x 1 0 x x 0 0 0 0 0 0 0 0 0 a b c d 1 1 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 20α-18 to 20α-9 0 1 1 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 20α-8 to 20α+1 1 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 NOTES: 1. In cycle 1, x and abcd depend on the contents of the interval after the last preceding control word. This is an Idle Control Word. 2. α represents the number of repeats, as specified in SPI-4 specification. When the IXP2800 Network Processor is transmitting training sequences the value is in Train_Data[Alpha]. 3. On receive, the IXP2800 Network Processor will do dynamic deskew when Train_Data[Ignore_Training] is 0, and RCTL = 1 and RDATA = 0x0FFF for three consecutive samples. Note that RPROT and RPAR are ignored when recognizing the start of training sequence. 4. These are really phases (i.e.,each edge of the clock is counted as one sample). 5. This cycle is valid for SPI-4, it is not used in CSIX training. 8.6.2 Flow Control Training Pattern This section defines training for the flow control pins (Table 108). These pins are normally used for CSIX flow control (Section 8.5), but can be programmed for use as SPI-4 Status Channel. The training pattern used is based on the usage. The flow control pin training sequence when the pins are used for CSIX flow control is shown in Table 108. Table 108. Flow Control Training Sequence 3 2 1 0 XCPAR XCSRB 1 to 10 XCDAT XCSOF Cycle (Note 3) 1 1 1 0 0 0 0 11 to 20 0 0 0 1 1 1 1 20α-19 to 20α-10 1 1 1 0 0 0 0 20α-9 to 20α 0 0 0 1 1 1 1 NOTE: 1. α represents the number of repeats, as specified in SPI-4 specification. When the IXP2800 Network Processor is transmitting training sequences the value is in Train_Flow_Control[Alpha]. 2. On receive, the IXP2800 Network Processor will do dynamic deskew when Train_Flow_Control[Ignore_Training] is 0, and RXCSOF = 1, RXCDATA = 0xC, RXCPAR =0, and RXCSRB = 0 for three consecutive samples. 3. These are really phases (i.e.,each edge of the clock is counted as one sample). 282 Hardware Reference Manual Intel® IXP2800 Network Processor Media and Switch Fabric Interface The training sequence when the pins are used for SPI-4 Status Channel is shown in Table 109. This is compatible to SPI-4 training sequence. Table 109. Calendar Training Sequence Cycle (Note 3) 1 to 10 XCDAT 1 0 0 0 11 to 20 1 1 20α-19 to 20α-10 0 0 20α-9 to 20α 1 1 NOTE: 1. α represents the number of repeats, as specified in SPI-4 specification. When the IXP2800 Network Processor is transmitting training sequences the value is in Train_Calendar[Alpha]. 2. On receive, the IXP2800 Network Processor will do dynamic deskew when Train_Calendar[Ignore_Training] is 0, and TCDAT= 0x0 for ten consecutive samples. 3. These are really phases (i.e.,each edge of the clock is counted as one sample). 4. Only XCDAT[1:0] are included in training. 
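On the receive side, the start-of-training condition in note 3 of Table 107 (RCTL high with RDATA equal to 0x0FFF for three consecutive samples, with Train_Data[Ignore_Training] cleared) is simple to express in C. The sketch below is illustrative only; the sampling and deskew are of course performed by hardware.

    #include <stdbool.h>
    #include <stdint.h>

    struct train_detect { unsigned consecutive; };

    /* Call once per sample phase; returns true once the start of a data
     * training sequence has been recognized. */
    static bool data_training_detected(struct train_detect *s,
                                       bool rctl, uint16_t rdat,
                                       bool ignore_training)
    {
        if (!ignore_training && rctl && rdat == 0x0FFFu)
            s->consecutive++;
        else
            s->consecutive = 0;

        return s->consecutive >= 3;
    }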
8.6.3 Use of Dynamic Training Dynamic training is done by cooperation of hardware and software as defined in this section. The IXP2800 Network Processor will need training at reset or it loses training. Loss of training will typically be detected by parity errors on received data. Table 110 lists the steps to initiate the training. SPI-4, CSIX Full Duplex, and CSIX Simplex cases follow similar but slightly different sequences. The SPI-4 protocol uses the calendar status pins, TSTAT/RSTAT (or RXCDAT/ TXCDAT if those are used for calendar status), as an indicator that data training is required. For CSIX use, the IXP2800 Network Processor uses a proprietary method of in-band signaling using Idle CFrames and Dead Cycles to indicate the need for training. Until the LVDS IOs are deskewed correctly, DIP-4 errors will occur. At startup, the receiver should request training followed by the transmitting device being sent training. The receiver should initially see received_training set and DIP-4 parity errors. The receiver should then clear the parity errors, wait for receive_training set and dip4_error cleared and check that all of the applicable RX_PHASEMON registers indicate no training errors. Then the LVDS IOs are properly trained. Hardware Reference Manual 283 Intel® IXP2800 Network Processor Media and Switch Fabric Interface Table 110. IXP2800 Network Processor Requires Data Training Step SPI-4 (IXP2800 Network Processor is Ingress Device) CSIX (IXP2800 Network Processor is Egress Device) Full Duplex Simplex 1 Detect need for training (for example, reset or excessive parity errors). 2 Force RSTAT (when using LVTTL status channel) to continuous framing pattern (Write a 0 to Train_Data[RSTAT_En]), or force RXCDAT (when using LVDS status channel) to continuous training (Write a 1 to Train_Calendar [Continuous_Train]). Force Transmission of Idle CFrames on Flow Control (Write a 1 to Train_Flow_Control [Force_FCIdle]). 3 Framer device detects RSTAT in continuous framing (when using LVTTL status channel, or RXCDAT in continuos training (when using LVDS status channel). Ingress IXP2800 Flow Control port detects Idle CFrames and sets Train_Flow_Control [Detect_FCIdle]. 4 5 Framer device transmits Training Sequence (IXP2800 receives on RDAT). 6 Ingress IXP2800 sends Dead Cycles on TDAT (if Train_Data [Dead_Enable_FCIdle] is set). Force Transmission of Dead Cycles on Flow Control (Write a 1 to Train_Flow_Control [Force_FCDead]). Switch Fabric detects Dead Cycles on Flow Control. Switch Fabric detects Dead Cycles on Data. Switch Fabric transmits Training Sequence on Data. When MSF_Interrupt_Status[Received_Training_Data] interrupt indicates training happened, and all of the applicable RX_PHASEMON registers indicate no training errors. Write MSF_Interrupt_Status[DIP4_ERR] to clear previous errors. 7 Write a 1 to Train_Data[RSTAT_En] or Write a 0 to Train_Calendar [Continuous_Train]. Write a 0 to Train_Flow_Control [Force_FCIdle]. Write a 0 to Train_Flow_Control [Force_FCDead]. The second case is when the Switch Fabric or SPI-4 framing device indicates it needs Data training. Table 111 lists that sequence. 284 Hardware Reference Manual Intel® IXP2800 Network Processor Media and Switch Fabric Interface Table 111. 
Switch Fabric or SPI-4 Framer Requires Data Training CSIX Step SPI-4 Full Duplex Simplex 1 Framer sends continuous framing code on IXP2800 calendar status pins TSTAT (when using LVTTL status channel) or sends continuos training on IXP2800 calendar status pins RXCDAT (when using LVDS status channel). Switch Fabric sends continuous Dead Cycles on Data. Switch Fabric sends continuous Dead Cycles on Flow Control. 2 IXP2800 detects no calendar on TSTAT (when using LVTTL status channel) or detects continuos training on RXCDAT (when using LVDS status channel), and sets Train_Data [Detect_No_Calendar]. Egress IXP2800 detects Dead Cycles and sets Train_Data[Detect_CDead]. Egress IXP2800 Flow Control port sends continuous Dead Cycles if Train_Flow_Control [TD_Enable_CDead]. 3 4 IXP2800 transmits Training Pattern (if Train_Data [Train_Enable_TDAT] is set). Ingress IXP2800 detects Dead Cycles and sets Train_Flow_Control [Detect_FCDead]. Ingress IXP2800 Flow Control port detects continuous Dead Cycles and set Train_Flow_Control [Detect_FCDead]. Ingress IXP2800 transmits continuous Training Sequence on data if Train_Data[Train_EN_FCDead]. 5 When Framer/Switch Fabric is trained it indicates that fact by reverting to normal operation. 6 Framer stops continuous framing code on calendar status pins. Switch Fabric stops continuous Dead Cycles on Data. Switch Fabric stops continuous Dead Cycles on Flow Control. The IXP2800 Network Processor needs training at reset, or whenever it loses training. Loss of training is typically detected by parity errors on received flow control information. Hardware Reference Manual 285 Intel® IXP2800 Network Processor Media and Switch Fabric Interface Table 112 lists the steps to initiate the training. CSIX Full Duplex and CSIX Simplex cases follow similar, but slightly different sequences. Table 112. IXP2800 Network Processor Requires Flow Control Training Step CSIX (IXP2800 Network Processor is Ingress Device) Full Duplex Simplex 1 Force TXCFC pin asserted (Write a 0 to Train_Flow_Control [RXCFC_En]). Force Data pins to continuos Dead Cycles (Write a 1 to Train_Data[Force_CDead]). 2 Egress IXP2800 Network Processor Flow Control port detects RXCFC sustained assertion and sets Train_Flow_Control [Detect_TXCFC_Sustained]. Switch Fabric detects Dead Cycles on Data. 3 Ingress IXP2800 Network Processor transmits Training Sequence on Flow Control pins (if Train_Flow_Control [Train_Enable_CFC] is set). Switch Fabric transmits Training Sequence on Flow Control pins. 4 When MSF_Interrupt_Status[Received_Training_FC] interrupt indicates training happened and all of the applicable RX_PHASEMON registers indicate no training errors, write CSR bits set in Step 1 to inactive value. Write a 1 to Train_Flow_Control [RXCFC_En]. Write a 1 to Train_Data[Force_CDead]. The last case is when the Switch Fabric indicates it needs Flow Control training. Table 113 lists that sequence. Table 113. Switch Fabric Requires Flow Control Training Step 286 Simplex (IXP2800 Network Processor is Egress Device) 1 Switch Fabric sends continuous Dead Cycles on Data. 2 Egress IXP2800 Network Processor detects Dead Cycles and sets Train_Data [Detect_CDead]. 3 Egress IXP2800 Network Processor transmits Training Sequence on Flow Control pins (if Train_Flow_Control [Train_Enable_CDead] is set). 4 Switch Fabric, upon getting trained stops continuous Dead Cycles on Data. 
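For the SPI-4 case of Table 110, the software side of the sequence reduces to a few CSR accesses. The following C sketch assumes an LVTTL status channel and uses hypothetical helper functions in place of the actual register accessors; it is an outline of the steps in the table, not a definitive driver.

    #include <stdbool.h>

    extern void train_data_rstat_en_write(bool value);   /* Train_Data[RSTAT_En]                          */
    extern bool received_training_data(void);            /* MSF_Interrupt_Status[Received_Training_Data]  */
    extern bool rx_phasemon_ok(void);                     /* all applicable RX_PHASEMON show no errors     */
    extern void clear_dip4_err(void);                     /* write MSF_Interrupt_Status[DIP4_ERR] to clear */

    static void request_spi4_data_training(void)
    {
        train_data_rstat_en_write(false);    /* step 2: force RSTAT to continuous framing          */

        /* Steps 3 through 5: the framer detects the framing pattern and sends
         * the training sequence on RDAT; the receiver deskews automatically.  */

        while (!received_training_data() || !rx_phasemon_ok())
            ;                                /* step 6: wait for training with no phase errors     */
        clear_dip4_err();                    /* step 6: discard DIP-4 errors seen before training  */

        train_data_rstat_en_write(true);     /* step 7: resume normal calendar operation on RSTAT  */
    }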
Hardware Reference Manual Intel® IXP2800 Network Processor Media and Switch Fabric Interface 8.7 CSIX Startup Sequence This section defines the sequence required to startup the CSIX interface. 8.7.1 CSIX Full Duplex 8.7.1.1 Ingress IXP2800 Network Processor 1. On reset, FC_STATUS_OVERRIDE[Egress_Force_En] is set to force the Ingress IXP2800 to send Idle CFrames with low CReady and DReady bits to the Egress IXP2800 over TXCSRB. 2. The Microengine or the Intel XScale® core writes a 1 to MSF_Rx_Control[RX_En_C] so that Idle CFrames can be received. 3. The Microengine or the Intel XScale® core polls on MSF_Interrupt_Status[Detected_CSIX_Idle] to see when the first Idle CFrame is received. The Intel XScale® core may use the Detected_CSIX_Idle Interrupt if MSF_Interrupt_Enable[Detected_CSIX_Idle] is set. 4. When the first Idle CFrame is received, the Microengine or the Intel XScale® core writes a 0 to FC_STATUS_OVERRIDE[Egress_Force_En] to deactivate SRB Override or writes 2'b11 to FC_STATUS_OVERRIDE[7:6] ([TM_CReady] and [TM_DReady]). This will inform the Egress IXP2800 that the Switch Fabric has sent an Idle CFrame and the Ingress IXP2800 has detected it. 8.7.1.2 Egress IXP2800 Network Processor 1. On reset, FC_STATUS_OVERRIDE[Ingress_Force_En] is set. 2. The Microengine or the Intel XScale® core writes a 1 to MSF_Tx_Control[Transmit_Idle] and MSF_Tx_Control[Transmit_Enable] so that Idle CFrames with low CReady and Dready bits are sent over TDAT. 3. The Microengine or the Intel XScale® core writes a 0 to FC_STATUS_OVERRIDE[Ingress_Force_En]. The Egress IXP2800 will then be sending Idle CFrames with CReady and DReady according to what is received on RXCSRB from the Ingress IXP2800. If the Egress IXP2800 has not detected an Idle CFrame, low TM_CReady and TM_DReady bits will be transmitted over its TXCSRB pin. If it has detected an Idle CFrame, the TM_CReady and TM_DReady bits are high. The TM_CReady and TM_DReady bits received on RXCSRB by the Ingress IXP2800 are used in the Base Headers of CFrames transmitted over TDAT. 4. The Microengine or the Intel XScale® core polls on FC_Ingress_Status[TM_CReady] and FC_Ingress_Status[TM_DReady]. When they are seen active, the Microengine or the Intel XScale® core writes a 1 to MSF_Tx_Control[TX_En_CC] and MSF_Tx_Control[TX_En_CD]. Egress IXP2800 then resumes normal operation. Likewise, when the Switch Fabric recognizes Idle CFrames with “ready bits” high, it will assume normal operation. Hardware Reference Manual 287 Intel® IXP2800 Network Processor Media and Switch Fabric Interface 8.7.1.3 Single IXP2800 Network Processor 1. The Microengine or the Intel XScale® core writes a 1 to MSF_Tx_Control[Transmit_Idle] and MSF_Tx_Control[Transmit_Enable] so that Idle CFrames with low CReady and DReady bits are sent over TDAT. 2. The Microengine or the Intel XScale® core writes a 1 to MSF_Rx_Control[RX_En_C] so that Idle CFrames can be received. 3. The Microengine or the Intel XScale® core writes a 0 to FC_STATUS_OVERRIDE[Ingress_Force_En]. 4. The Microengine or the Intel XScale® core polls on MSF_Interrupt_Status[Detected_CSIX_Idle] to see when the first Idle CFrame is received. The Intel XScale® core may use the Detected_CSIX_Idle Interrupt if MSF_Interrupt_Enable[Detected_CSIX_Idle] is set. 5. When the first Idle CFrame is received, the Microengine or the Intel XScale® core writes a 0 to FC_STATUS_OVERRIDE[Egress_Force_En] to deactivate SRB Override or writes 2'b11 to FC_STATUS_OVERRIDE[7:6] ([TM_CReady and TM_DReady]). 6. 
The Microengine or the Intel XScale® core writes a 1 to MSF_Tx_Control[TX_En_CC] and MSF_Tx_Control[TX_En_CD]. IXP2800 resumes normal operation. 8.7.2 CSIX Simplex 8.7.2.1 Ingress IXP2800 Network Processor 1. On reset, FC_STATUS_OVERRIDE[Egress_Force_En] is set to force Ingress IXP2800 to send Idle CFrames with low CReady and DReady bits to Switch Fabric over TXCDAT. 2. The Microengine or the Intel XScale® core writes a 1 to MSF_Rx_Control[RX_En_C] so that Idle CFrames can be received. 3. The Microengine or the Intel XScale® core polls on MSF_Interrupt_Status[Detected_CSIX_Idle] to see when the first Idle CFrame is received. The Intel XScale® core may use the Detected_CSIX_Idle Interrupt if MSF_Interrupt_Enable[Detected_CSIX_Idle] is set. 4. When the first Idle CFrame is received, the Microengine or the Intel XScale® core writes a 0 to FC_STATUS_OVERRIDE[Egress_Force_En]. Idle CFrames with “ready bits” high will be transmitted over TXCDAT. Ingress IXP2800 may resume normal operation. 288 Hardware Reference Manual Intel® IXP2800 Network Processor Media and Switch Fabric Interface 8.7.2.2 Egress IXP2800 Network Processor 1. On reset, FC_STATUS_OVERRIDE[Ingress_Force_En] is set. 2. The Microengine or the Intel XScale® core writes a 1 to MSF_Tx_Control[Transmit_Idle] and MSF_Tx_Control[Transmit_Enable] so that Idle CFrames with low CReady and DReady bits are sent over TDAT. 3. The Microengine or the Intel XScale® core polls on MSF_Interrupt_Status[Detected_CSIX_FC_Idle] to see when the first Idle CFrame is received. The Intel XScale® core may use the Detected_CSIX_FC_Idle Interrupt if MSF_Interrupt_Enable[Detected_CSIX_FC_Idle] is set. 4. When the first Idle CFrame is received, the Microengine or the Intel XScale® core writes a 0 to FC_STATUS_OVERRIDE[Ingress_Force_En] to deactivate SRB Override. 5. The Microengine or the Intel XScale® core polls on FC_Ingress_Status[TM_CReady] and FC_Ingress_Status[TM_DReady]. When they are seen active, the Microengine or the Intel XScale® core writes a 1 to MSF_Tx_Control[TX_En_CC] and MSF_Tx_Control[TX_En_CD]. Egress IXP2800 then resumes normal operation. Likewise, when the Switch Fabric recognizes Idle CFrames with “ready bits” high, it will assume normal operation. 8.7.2.3 Single IXP2800 Network Processor Both CSIX startup routines described above will be needed to complete the CSIX startup sequence. Using Simplex mode on a single IXP2800 with RDAT, TDAT and RXCDAT, TXCDAT using CSIX, there are essentially two independent CSIX receive and transmit buses. Hardware Reference Manual 289 Intel® IXP2800 Network Processor Media and Switch Fabric Interface 8.8 Interface to Command and Push and Pull Buses Figure 100 shows the interface of the MSF to the command and push and pull buses. Data transfers to and from the TBUF/RBUF are done in the following cases (refer to section): • • • • • RBUF or MSF CSR to Microengine S_TRANSFER_IN Register for Instruction: Microengine S_TRANSFER_OUT Register to TBUF or MSF CSR for Instruction: Microengine to MSF CSR for Instruction: From RBUF to DRAM for Instruction: From RBUF to DRAM for Instruction: Figure 100. 
MSF to Command and Push and Pull Buses Interface Block Diagram TBUF Write Data *Data 128 S0_Pull_Bus S1_Pull_Bus 32 SPull Data FIFO 32 SPull Data FIFO To MSF CSRs (Goes to both Push Arbiters) To Transmit Pins Control* Address Decode D_Push_ID U_S_Pull_ID Byte Align TBUF DPush Data Req 64 D_Push_Bus TBUF Address Pull_ID Buffer Write CMD FIFO To MSF CSRs fast_wr_CMD CMD B Bus Command Inlet FIFO ME, Intel XScale® Core Commands Read CMD FIFO To MSF CSRs U_S_Push_ID RBUF Address (Goes to both Push Arbiters) D_Pull_ID From Receive Pins U_D_Pull_Bus U_S_Push-ID (Goes to both Push Arbiters) 64 32 DPull Data Req SPush Data Req RBUF Address Decode *Data Read Data 128 MSF CSR Data * The RBUF, TBUF, TBUF Control can be addressed on 32-bit work boundaries. B1630-02 290 Hardware Reference Manual Intel® IXP2800 Network Processor Media and Switch Fabric Interface 8.8.1 RBUF or MSF CSR to Microengine S_TRANSFER_IN Register for Instruction: msf[read, $s_xfer_reg, src_op_1, src_op_2, ref_cnt], optional_token For transfers to a Microengine, the MSF acts as a target. Commands from Microengines and the Intel XScale® core are received on the command bus. The commands are checked to see if they are targeted to the MSF. If so, they are enqueued into the Command Inlet FIFO, and then moved to the Read Cmd FIFO. When the Command Inlet FIFO is nearly full, it asserts a signal to the command arbiters. The command arbiters prevent further commands to the MSF until after the full signal is asserted. The RBUF element or CSR specified in the address field of the command is read and the data is registered in the SPUSH_DATA register. The control logic then arbitrates for S_PUSH_BUS, and when granted, it drives the data. 8.8.2 Microengine S_TRANSFER_OUT Register to TBUF or MSF CSR for Instruction: msf[write, $s_xfer_reg, src_op_1, src_op_2, ref_cnt], optional_token For transfers from a Microengine, the MSF acts as a target. Commands from Microengines are received on the two command buses. The commands are checked to see if they are targeted to the MSF. If so, they are enqueued into the Command Inlet FIFO, and then moved to the Write Cmd FIFO. When the Command Inlet FIFO is nearly full, it asserts a signal to the command arbiters. The command arbiters prevent further commands to the MSF until after the full signal is asserted. The control logic then arbitrates for S_PULL_BUS, and when granted, it receives and registers the data from the Microengine into the S_PULL_DATA register. It then writes that data into the TBUF element or CSR specified in the address field of the command. 8.8.3 Microengine to MSF CSR for Instruction: msf[fast_write, src_op_1, src_op_2] For fast write transfers from the Microengine, the MSF acts as a target. Commands from Microengines are received on the two command buses. The commands are checked to see if they are targeted to the MSF. If so, they are enqueued into the Command Inlet FIFO, and then moved to the Write Cmd FIFO. When the Command Inlet FIFO is nearly full, it asserts a signal to the command arbiters. The command arbiters prevent further commands to the MSF until after the full signal is asserted. The control logic uses the address and data, both found in the address field of the command. It then writes the data into the CSR specified. 8.8.4 From RBUF to DRAM for Instruction: dram[rbuf_rd, --, src_op1, src_op2, ref_cnt], indirect_ref For the transfers to DRAM, the RBUF acts like a slave. The address of the data to be read is given in D_PULL_ID. 
The data is read from RBUF and registered in the D_PULL_DATA register. It is then multiplexed and driven to the DRAM channel on D_PULL_BUS. Hardware Reference Manual 291 Intel® IXP2800 Network Processor Media and Switch Fabric Interface 8.8.5 From DRAM to TBUF for Instruction: dram[tbuf_wr, --, src_op1, src_op2, ref_cnt], indirect_ref For the transfers from DRAM, the TBUF acts like a slave. The address of the data to be written is given in D_PUSH_ID. The data is registered and assembled from D_PUSH_BUS, and then written into TBUF. 8.9 Receiver and Transmitter Interoperation with Framers and Switch Fabrics The Intel® IXP2800 Network Processor can process data received at a peak rate of 16 Gb/s and transmit data at a peak rate of 16 Gb/s. In addition, data may be received and transmitted via the PCI bus at an aggregate peak rate of 4.2 Gb/s, as shown in Figure 101. Intel® IXP2800 Network Processor Transmitter 16 Gb/s Peak Receiver Figure 101. Basic I/O Capability of the Intel® IXP2800 Network Processor 16 Gb/s Peak PCI 4.2 Gb/s Peak B2734-01 The network processor’s receiver and transmitter can be independently configured to support either an SPI-4.2 framer interface or a fabric interface consisting of DDR LVDS signaling and the CSIXL1 protocol. The dynamic training sequence of SPI-4.2, used for de-skewing the signals, has been optionally incorporated into the fabric interface. “SPI-4.2 is an interface for packet and cell transfer between a physical layer (PHY) device and a link layer device, for aggregate bandwidths of OC-192 ATM and Packet over SONET/SDH (POS), as well as 10 Gb/s Ethernet applications.”1 “CSIX-L1 is the Common Switch Interface. It defines a physical interface for transferring information between a traffic manager (Network Processor) and a switching fabric…”2 The network processor adopts the protocol of CSIX-L1, but uses a DDR LVDS physical interface rather than an LVCMOS or HSTL physical interface. 1. 2. 292 “System Packet Interface Level 4 (SPI-4) Phase 2: OC-192 System Interface for Physical and Link Layer Devices,” Implementation Agreement: OIF-SPI4-02.0, Optical Internetworking Forum “CSIX-L1: Common Switch Interface Specification-L1,” CSIX Hardware Reference Manual Intel® IXP2800 Network Processor Media and Switch Fabric Interface SPI-4.2 supports up to 256 port addresses, with independent flow control for each. For data received by the PHY and passed to the link layer device, flow control is optional. The flow control mechanism is based upon independent pools of credits, corresponding to 16-byte blocks, for each port. The CSIX-L1 protocol supports 4096 ports and 256 unicast classes of traffic. It supports various forms of multicast and 256 multicast queues of traffic. The protocol supports independent linklevel flow control for data and control traffic and supports virtual output queue (VOQ) flow control for data traffic. 8.9.1 Receiver and Transmitter Configurations The network processor receiver and transmitter independently support three different configurations: • Simplex (SPI-4.2 or CSIX-L1 protocol), described in Section 8.9.1.1. • Hybrid simplex (transmitter only, SPI-4.2 data path, and CSIX-L1 protocol flow control), described in Section 8.9.1.2. • Dual Network Processor, full duplex (CSIX-L1 protocol), described in Figure 8.9.1.3. Additionally, the combined receiver and transmitter support a single Network Processor, fullduplex configuration using two different protocols: • Multiplexed SPI-4.2 protocol, described in Section 8.9.1.4. 
• CSIX-L1 protocol, described in Section 8.9.1.5. In both the simplex and hybrid simplex configurations, the path receiving from a framer, fabric, or Network Processor is independent of the path transmitting to a framer, fabric, or Network Processor. In a full duplex configuration, the receiving path forwards CSIX-L1 control information for the transmit path and vice versa. 8.9.1.1 Simplex Configuration In the simplex configuration, as shown in Figure 102, the reverse path provides control information to the transmitter. This control information may include flow control information and requests for dynamic training sequences. Figure 102. Simplex Configuration Reverse Path (3 to 7 Signals) Receiver Transmitter Forward Path (18 to 20 Signals) B2735-01 Hardware Reference Manual 293 Intel® IXP2800 Network Processor Media and Switch Fabric Interface The SPI-4.2 mode of the simplex configuration supports an LVTTL reverse path or status interface clocked at up to 125 MHz or a DDR LVDS reverse path or status interface clocked at up to 500 MHz. The SPI-4.2 mode status interface consists of a clock signal and two data signals. The CSIX-L1 protocol mode of the simplex configuration supports a full-duplex implementation of the CSIX-L1 protocol, but no Data CFrames are transferred on the reverse path and the reverse path is a quarter of the width of the forward path. The CSIX-L1 protocol mode supports a DDR LVDS reverse path interface clocked at up to 500 MHz. The CSIX-L1 protocol mode reverse path control interface consists of a clock signal, four data signals, a parity signal, and a start-of-frame signal. 8.9.1.2 Hybrid Simplex Configuration In the hybrid simplex configuration, data transfers and link-level flow control is supported via the SPI-4.2 modes of the receiver and transmitter, as shown in Figure 103. Only the LVTTL SPI-4.2 status interface is supported in this configuration. Figure 103. Hybrid Simplex Configuration Network Processor Transmitter SPI-4.2 LVTTL Reverse Path DDR LVDS Flow Control Fabric Receiver SPI-4.2 Forward Path CSIX Protocol DDR LVDS Reverse Path B2736-02 Virtual output queue flow control information (or other information) is delivered to the transmitter via the CSIX-L1 protocol via an interface similar to the reverse path of the CSIX-L1 protocol mode of the simplex configuration. Flow control for the CSIX-L1 CFrames is provided by an asynchronous LVDS signal back to the fabric and not by the “ready bits” of the CSIX-L1 protocol. The hybrid simplex configuration for a fabric interface may be especially useful to implementers when an SPI-4.2 interface implementation is readily available. The CSIX-L1 protocol reverse path may not need to operate at a clock rate as aggressive as the SPI-4.2 interface and, as such, may be easier to implement than a full-rate data interface. 294 Hardware Reference Manual Intel® IXP2800 Network Processor Media and Switch Fabric Interface 8.9.1.3 Dual Network Processor Full Duplex Configuration In the dual Network Processor, full duplex configuration, an ingress Network Processor and an egress Network Processor are integrated to offer a single full duplex interface to a fabric, similar to the CSIX-L1 interface, as shown in Figure 104. This configuration provides an interface that is closest to the standard CSIX-L1 interface. It is easiest to bridge between this configuration and an actual CSIX-L1 interface. 
Fabric Interface Chip Transmitter Receiver Egress Network Processor Flow Control Serial CSIX Ready Bits PCI CSIX Flow Control CFramers Transmitter Ingress Network Processor Receiver Figure 104. Dual Network Processor, Full Duplex Configuration B2737-02 Flow control CFrames are forwarded by the egress Network Processor to the ingress Network Processor over a separate flow control interface. The bandwidth of this interface is a quarter of the primary interface offered to the fabric. A signal from ingress Network Processor to egress Network Processor provides flow control for this interface. (This interface is the same interface that was used in the hybrid simplex configuration.) A separate signal from egress Network Processor to ingress Network Processor provides the state of the CSIX-L1 “ready bits” that were received from the fabric, conveying the state of the fabric receiver, and those that should be sent to the fabric, conveying the state of the egress Network Processor receiver. The PCI may be used to convey additional information between the egress Network Processor and ingress Network Processor. Hardware Reference Manual 295 Intel® IXP2800 Network Processor Media and Switch Fabric Interface 8.9.1.4 Single Network Processor Full Duplex Configuration (SPI-4.2) The single Network Processor, full duplex configuration (SPI-4.2 only) allows a single Network Processor to interface to multiple discrete devices, processing both the receiver and transmitter data for each, as shown in Figure 105 (where N=255). Up to 256 devices can be addressed by the SPI4.2 implementation. The bridge chip implements the specific interfaces for each of those devices. Figure 105. Single Network Processor, Full Duplex Configuration (SPI-4.2 Protocol) Intel® IXP2800 Network Processor Full Duplex SPI-4.2 Interface Device 0 Bridge Chip (Provides multiple interfaces to other devices.) Device 1 Device N Device N-1 B2743-01 296 Hardware Reference Manual Intel® IXP2800 Network Processor Media and Switch Fabric Interface 8.9.1.5 Single Network Processor, Full Duplex Configuration (SPI-4.2 and CSIX-L1) The Single Network Processor, Full Duplex Configuration (SPI-4.2 and CSIX-L1 Protocol) allows a single Network Processor to interface to a fabric via a CSIX-L1 interface and to multiple other discrete devices, as shown in Figure 106. The CSIX-L1 and SPI-4.2 protocols are multiplexed on the network processor receiver and transmitter interface. Independent processing and buffering resources are allocated to each protocol. Figure 106. Single Network Processor, Full Duplex Configuration (SPI-4.2 and CSIX-L1 Protocols) Intel® IXP2800 Network Processor Multiplexed Full Duplex SPI-4.2 and CSIX Protocols Bridge Chip Device 0 (A single CSIX protocol instance is bridged to the CSIX-L1 interface. The SPI-4.2 port addresses are mapped to other devices.) Device 1 CSIX-L1 Interface Fabric Interface Chip Device N B2744-01 8.9.2 System Configurations The receiver and transmitter configurations in the preceding Section 8.9.1 enable several system designs, as shown in Figure 107 through Figure 111. Hardware Reference Manual 297 Intel® IXP2800 Network Processor Media and Switch Fabric Interface 8.9.2.1 Framer, Single Network Processor Ingress and Egress, and Fabric Interface Chip Figure 107 illustrates the baseline system configuration consisting of the dual chip, full-duplex fabric configuration of network processors with a framer chip and a fabric interface chip Figure 107. 
Framer, Single Network Processor Ingress and Egress, and Fabric Interface Chip Ingress Intel® IXP2800 Network Processor Framer Flow Control PCI Fabric Interface Chip Egress Intel IXP2800 Network Processor B2745-01 8.9.2.2 Framer, Dual Network Processor Ingress, Single Network Processor Egress, and Fabric Interface Chip If additional processing capacity is required in the ingress path, an additional network processor can be added to the configuration, as shown in Figure 108. The configuration of the interface between the two ingress network processors can use either the SPI-4.2 or CSIX-L1 protocol. Figure 108. Framer, Dual Processor Ingress, Single Processor Egress, and Fabric Interface Chip Ingress Intel® IXP2800 Network Processor 0 Framer Ingress Intel IXP2800 Network Processor 1 PCI Flow Control Fabric Interface Chip Egress Intel IXP2800 Network Processor B2746-01 298 Hardware Reference Manual Intel® IXP2800 Network Processor Media and Switch Fabric Interface 8.9.2.3 Framer, Single Network Processor Ingress and Egress, and CSIX-L1 Chips for Translation and Fabric Interface To interface to existing standard CSIX-L1 fabric interface chips, a translation bridge can be employed, as shown in Figure 109. Translation between the network processor interface and standard CSIX-L1 is very simple by design. Figure 109. Framer, Single Network Processor Ingress, Single Network Processor Egress, CSIX-L1 Translation Chip and CSIX-L1 Fabric Interface Chip Ingress Intel® IXP2800 Network Processor Framer Flow Control PCI Egress Intel IXP2800 Network Processor Translation Chip DDR LVDS to CSIX-L1 HSTL CSIX-L1 Fabric Interface Chip B2747-01 8.9.2.4 CPU Complex, Network Processor, and Fabric Interface Chip If a processor card requires access to the fabric, a single network processor can provide both ingress and egress access to the fabric for the processor via the PCI interface, as shown in Figure 110. In many cases the available aggregate peak bandwidth of 4.2 Gb/s is sufficient for the processor’s capacity. Figure 110. CPU Complex, Network Processor, and Fabric Interface Chips CPU Memory Controller PCI Ingress Intel® IXP2800 Network Processor Fabric Interface Chip Memory B2748-01 Hardware Reference Manual 299 Intel® IXP2800 Network Processor Media and Switch Fabric Interface 8.9.2.5 Framer, Single Network Processor, Co-Processor, and Fabric Interface Chip The network processor supports multiplexing the SPI-4.2 and CSIX-L1 protocols over its physical interface via a protocol signal. This capability enables using a bridge chip to allow a single network processor to support the ingress and egress paths between a framer and a fabric, provided the aggregate system bandwidth does not exceed the capabilities of that single network processor, as shown in Figure 111. Figure 111. Framer, Single Network Processor, Co-Processor, and Fabric Interface Chip Intel® IXP2800 Network Processor Framer Bridge Chip CSIX-L1 Fabric Interface Chip Co-Processor B2749-01 300 Hardware Reference Manual Intel® IXP2800 Network Processor Media and Switch Fabric Interface 8.9.3 SPI-4.2 Support Data is transferred across the SPI-4.2 interface in variously-sized bursts and encapsulated with a leading and trailing control word. The control words provide annotation of the data with port address (0-255) information, start-of-packet and end-of-packet markers, and an error detection code (DIP-4). Data must be transferred in 16-byte integer multiples, except for the final burst of a packet. Figure 112. 
SPI-4.2 Interface Reference Model with Receiver and Transmitter Labels Corresponding to Link Layer Device Functions Receiver Signals PHY Device SPI-4.2 Interface Transmitter Signals Ingress Intel® IXP2800 Network Processor Link Layer Device Egress Intel IXP2800 Network Processor B2750-01 The status interface transfers state as an array of state or calendar, two bits per port, for all of the supported ports. The status information provides for reporting one of three status states for each port (satisfied, hungry, and starving) corresponding to credit availability for the port. The mapping of calendar offset to port is flexible. Individual ports may be repeated multiple times for greater frequency of update. 8.9.3.1 SPI-4.2 Receiver The network processor receiver stores received SPI-4.2 bursts into receiver buffers. The buffers may be configured as 128 buffers of 64 bytes, 64 buffers of 128 bytes, or 32 buffers of 256 bytes. Information from the control words, the length of the burst, and the TCP checksum of the data are stored in an additional eight bytes of control storage. The buffers support storage of bursts containing an amount of data that is less than or equal to the buffer size. A burst that is greater than the configured size of the buffers is stored in multiple buffers. Each buffer is made available to software as it becomes filled. As the filling of each buffer completes, the buffer is dispatched to a thread of a Microengine that has been registered in a free list of threads, and the eight bytes of control information are forwarded to the register context of the thread. If no thread is currently available, the receiver waits for a new thread to become available as other buffers are also filled (and then also have “waiting queues”). Hardware Reference Manual 301 Intel® IXP2800 Network Processor Media and Switch Fabric Interface As threads complete processing of the data in a buffer, the buffer is returned to a free list. Subsequently, the thread also returns to a separate free list. The return of buffers and threads to the free lists may occur in a different order than the order of their removal. All SPI-4.2 ports sharing the interface have equal access to the buffering resources. Flow control can transition to a non-starving state when 25%, 50%, 75%, or 87.5% of the buffers are consumed, as configured by HWM_Control[RBUF_S_HWM]. At this point, the remaining buffers are available and, additionally, 2K bytes of packed FIFO (corresponding to 128 SPI-4.2 credits) are available for incoming data storage. If receiver flow control is expected to be asserted and for a sufficiently large number of ports and values of MaxBurst1 or MaxBurst2, it may be necessary for the PHY device to discard credits already granted if a state of Satisfied is reported by the network processor to the device, treating the Satisfied state more as an XOFF state. Otherwise, excessive credits may be outstanding for the storage available and receiver overruns may occur. For more information about the SPI-4.2 receiver, see Section 8.2.7. 8.9.3.2 SPI-4.2 Transmitter The network processor transmitter transfers SPI-4.2 bursts from transmitter buffers. The buffers may be configured as 128 buffers of 64 bytes, 64 buffers of 128 bytes, or 32 buffers of 256 bytes. The control word information and other control information for the burst are stored in additional control storage. The buffers are always transmitted in a fixed order. 
Software can determine the index of the last buffer transmitted, and keep track of the last buffer committed to the transmitter. The transmitter buffers are used as a ring, with the “get index” updated by the transmitter and the “put index” updated due to committing a buffer element to transmission. Each transmit buffer supports a limited gather capability to stitch together a protocol header and a payload. The buffer supports independent prefix (or prepended) data and payload data. The prefix data can begin at any offset from 0 to 7 and have a length of from 0 to 31 bytes. The payload begins at an offset of 0 to 7 bytes from the next octal-byte boundary following the prefix and can fill out the remainder of the buffer. For more complicated merging or shifting of data within a burst, the data should be passed through a Microengine to perform any arbitrary merging and/or shifting. Buffers may be statically allocated to different ports in an inter-leaved fashion so that bandwidth availability is balanced for each of the ports. Transmit buffers may be flagged to be skipped if no data is available for a particular port. The transmitter scheduler, implemented on a Microengine, is responsible for reacting to the status information provided by the PHY device. The status information can be read via registers. The status information is available in two formats: a single status per register and status for 16 ports in a single register. For more information, see Section 8.3.4, “Transmit Flow Control Status” on page 270. 302 Hardware Reference Manual Intel® IXP2800 Network Processor Media and Switch Fabric Interface 8.9.4 CSIX-L1 Protocol Support 8.9.4.1 CSIX-L1 Interface Reference Model: Traffic Manager and Fabric Interface Chip The CSIX-L1 protocol operates between a Traffic Manger and a Fabric Interface Chip(s) across a full-duplex interface. It supports mechanisms to interface to a fabric that avoid congestion using virtual output queue (VOQ) flow control and enables a fabric that offers lossless, non-blocking transfer of data from ingress port to egress ports. Both data and control information pass over the receiver and transmitter interfaces. Figure 113. CSIX-L1 Interface Reference Model with Receiver and Transmitter Labels Corresponding to Fabric Interface Chip Functions Ingress Network Processor Traffic Manager Egress Network Processor Receiver Signals CSIX-L1 Interface Fabric Interface Chip(s) Transmitter Signals Printed Circuit Card B2751-01 The Traffic Manger on fabric ingress is responsible for segmentation of packet data and scheduling the transmission of data segments into the fabric. The fabric on ingress is responsible for influencing the scheduling of data transmission through link-level flow control and Virtual Output Queue (VOQ) flow control so that the fabric does not experience blocking or data loss due to congestion. The fabric on egress is responsible for scheduling the transfer of data to the Traffic Manager according to the flow control indications from the Traffic Manager. The CSIX-L1 protocol supports addressing up to 4096 fabric ports and identifies up to 256 classes of unicast traffic. It optionally supports multicast and broadcast traffic, supporting identification of up to 256 queues of such traffic. Virtual output queue flow control is supported at the ingress to the fabric and the egress from the fabric. The standard CSIX-L1 interface supports interface widths of 32, 64, 94, and 128 bits. A single clocked transfer of information across the interface is called a CWord. 
The CWord size is the width of the interface. Hardware Reference Manual 303 Intel® IXP2800 Network Processor Media and Switch Fabric Interface Information is passed across the interface in CFrames. CFrames are padded out to an integer multiple of CWords. CFrames consist of a 2-byte base header, an optional 4-byte extension header, a payload of 1 to 256 bytes, padding, and a 2-byte vertical parity. Transfers across the interface are protected by a horizontal parity. When there is no information to pass over the interface, an alternating sequence of Idle CFrames and Dead Cycles are passed across the interface. There are 16 possible codes for CFrame types. Each CFrame type is either a data CFrame or a control CFrame. Data CFrame types include Unicast, Multicast Mask, Multicast ID, Multicast Binary Copy, and Broadcast. Control CFrames include Flow Control. CSIX-L1 supports independent link-layer flow control for data CFrames and control CFrames by using “ready bits” (CRdy and DRdy) in the base header. The response time for link-level flow control is specified to be 32 interface clock ticks, but allows for additional time to complete transmission of any CFrame already in progress at the end of that interval. 8.9.4.2 Intel® IXP2800 Support of the CSIX-L1 Protocol The adaptation of the CSIX-L1 protocol to the network processor physical interface has been accomplished in a straightforward manner. 8.9.4.2.1 Mapping to 16-Bit Wide DDR LVDS The CSIX-L1 interface is built in units of 32 data bits. For each group of 32 data signals, there is a clock signal (RxClk, TxClk), a start-of-frame signal (RxSOF, TxSOF) and a horizontal-parity signal (RxPar, TxPar). If the CWord or interface width is greater than 32 bits, the assertion of the Start-of-Frame signal associated with each group of 32 data bits is used to synchronize the transfers across the independently clocked individual 32-bit interfaces. The network processor supports 32-bit data transfers across two transfers or clock edges of the SPI-4.2 16-bit DDR LVDS data interface. The CSIX-L1 RxSOF and TxSOF signals are mapped to the SPI-4.2 TCTL and RCTL signals. For the transfer of CFrames, the start-of-frame signal is asserted on only the first edge of the 32-bit transfer. (Assertion of the start-of-frame signal for multiple contiguous clock edges denotes the start of a de-skew training sequence as described below.) Receiver logic for the interface should align the start of 32-bit transfers to the assertion of the startof-frame signal. The network processor always transmits the high order bits of a 32-bit transfer on the rising edge of the transmit clock, but a receiver may de-skew the signals and align the received data with the falling edge of the clock. The network processor receiver always aligns the received data according to the assertion of the start-of-frame signal. The network processor supports CWord widths of 32, 64, 96, and 128 bits. It will pad out CFrames (including Idle CFrames) and Dead Cycles according to this CWord width. The physical interface remains just 16 data bits. The start-of-frame signal is only asserted for the high order 16 bits of the first 32-bit transfer; it is not asserted for each 32-bit transfer. Support for multiple CWord widths is intended to facilitate implementation of IXP2800-to-CSIX-L1 translator chips and to facilitate implementation of chips with native network processor interfaces, but with wider internal transfer widths. The network processor supports a horizontal parity signal (RPAR, TPAR). 
The horizontal parity signal covers the 16 data bits that are transferred on each edge of the clock. It does not cover 32 bits as in CSIX-L1. Support for horizontal-parity requires an additional physical signal beyond that required for SPI-4.2. Checking of the horizontal parity can be optionally disabled on reception. If a fabric interface chip does not support TPAR, then the checking of RPAR should be disabled. 304 Hardware Reference Manual Intel® IXP2800 Network Processor Media and Switch Fabric Interface The network processor supports a variation of the standard CSIX-L1 vertical parity. Instead of a single vertical XOR for the calculation of the vertical parity, the network processor can be configured to calculate as DIP-16 code, as documented within the SPI-4.2 specification. If horizontal parity is not enabled for the interface, the use of the DIP-16 code is recommended to provide for better error coverage than that provided by a vertical parity. 8.9.4.2.2 Support for Dual Chip, Full-Duplex Operation A dual-chip configuration of network processors consisting of an ingress and egress network processor, can present a full-duplex interface to a fabric interface chip, consistent with the expectations of the CSIX-L1 protocol. A flow control interface is supported between the ingress and egress chips to forward necessary flow control information from the egress network processor to the ingress network processor. Additional information can be transferred between the ingress and egress network processors through the PCI bus. The flow control interface consists of a data transfer signal group, a serial signal for conveying the state of the CSIX-L1 “ready bits” (TXCSRB, RXCSRB), and a backpressure signal (TXCFC, RXCFC) to avoid overrunning the receiver in the ingress network processor. (The orientation of the signal names is consistent with the egress network processor, receiving CFrames from the fabric, and forwarding flow control information out through the transmit flow control pins.) The data transfer signal group consists of: • • • • Four data signals (TXCDAT[0..3], RXCDAT[0..3]) A clock (TXCCLK, RXCCLK) A start-of-frame signal (TXCSOF, RXCSOF) A horizontal-parity signal (TXCPAR, RXCPAR) The network processor receiver forwards Flow Control CFrames from the fabric in a cut-through fashion over the flow control interface. The flow control interface has one-fourth of the bandwidth of the network processor fabric data interface. The Crdy bit in the base header of the CSIX-L1 protocol (link-level flow control) prevents overflowing of the FIFO for transmitting out the flow control interface from the egress network processor. The fabric can implement a rate limit on the transmission of Flow Control CFrames to the egress network processor, consistent with the bandwidth available on the flow control interface. With a rate limit, the fabric can detect congestion of Flow Control CFrames earlier, instead of waiting for the assertion of cascaded backpressure signals. The CRdy and DRdy bits of CFrames sent across the flow control interface are set to 0 on transmission and ignored upon reception at the ingress network processor. If no CFrames are available to send from the egress network processor to the ingress network processor, an alternating sequence of Idle CFrames and Dead Cycles is sent from the egress to the ingress network processor, consistent with the CSIX-L1 protocol. 
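Because CFrames are padded to an integer multiple of the configured CWord width (Section 8.9.4.2.1), the number of bytes a CFrame occupies on the wire depends on the CWord configuration as well as on the payload. The small worked example below applies the CFrame layout from Section 8.9.4.1 (2-byte base header, optional 4-byte extension header, payload, 2-byte vertical parity); it assumes, for illustration, that the padding rounds the entire CFrame, vertical parity included, up to a CWord boundary.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

static uint32_t cframe_wire_bytes(uint32_t payload_bytes,   /* 1..256 */
                                  bool     has_ext_header,
                                  uint32_t cword_bits)       /* 32/64/96/128 */
{
    uint32_t cword_bytes = cword_bits / 8;
    /* 2-byte base header + optional 4-byte extension header + payload
     * + 2-byte vertical parity, rounded up to the next CWord boundary. */
    uint32_t raw = 2u + (has_ext_header ? 4u : 0u) + payload_bytes + 2u;
    return ((raw + cword_bytes - 1u) / cword_bytes) * cword_bytes;
}

int main(void)
{
    /* A 41-byte unicast payload with an extension header occupies 52
     * bytes with 32-bit CWords but 64 bytes with 128-bit CWords. */
    printf("%u\n", cframe_wire_bytes(41, true, 32));   /* 52 */
    printf("%u\n", cframe_wire_bytes(41, true, 128));  /* 64 */
    return 0;
}
```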
The state of the CRdy and DRdy bits sent to the egress network processor by the fabric and the state of the CRdy and DRdy bits that should be sent to the fabric by the ingress network processor, reflecting the state of the egress network processor buffering, are sent through the TXCSRB signal and received through the RXCSRB signal. A new set of bits are conveyed every 10 clock edges or five clock cycles, of the interface. A de-assertion of a “ready bit” is forwarded immediately upon processing the “ready bit”. An assertion of a “ready bit” is forwarded only after all of the horizontal parities and the vertical parity of the CFrame are checked. A configuration of ingress and egress network processors is expected to respond to the de-assertion of a CRdy or DRdy bit within 32 clock cycles (RCLK), consistent with the formulation described for CSIX-L1. Hardware Reference Manual 305 Intel® IXP2800 Network Processor Media and Switch Fabric Interface The backpressure signal (TXCFC, RXCFC) is an asynchronous signal and is asserted by the ingress network processor to prevent overflow of the ingress network processor ingress flow control FIFO. If the egress network processor is so optionally configured, it will react to assertion of the backpressure signal for 32 clock cycles (64 edges) as a request for a de-skew training sequence to be transmitted on the flow control interface. The flow control interface only supports a 32-bit CWord. Flow Control CFrames that are received by the egress network processor are stripped of any padding associated with large CWord widths and forwarded to the flow control interface. The various options for parity calculation and checking supported on the data interface are supported on the flow control interface. Horizontal parity checking may be optionally disabled. The standard calculation of vertical parity may be replaced with a DIP-16 calculation. 8.9.4.2.3 Support for Simplex Operation The network processor supports a mode of operation that supports the CSIX-L1 protocol, but offers an independent interface for the ingress and egress network processors. In this mode, the ingress and egress network processors each offer an independent full-duplex CSIX-L1 flavor of interface to the fabric, but the network processor-to-fabric interface on the egress network processor and the fabric-to-network processor interface of the ingress network processor are of reduced width, consisting of four (instead of 16) data signals. These narrow interfaces are referred to as Reverse Path Control Interfaces and use the same physical interface as the flow control interface in the dual-chip, full duplex configuration. They support the transfer of Flow Control CFrames and the CRdy and DRdy “ready” bits, but are not intended to support the transfer of data CFrames. Figure 114. Reference Model for IXP2800 Support of the Simplex Configuration Using Independent Ingress and Egress Interfaces Ingress Network Processor Primary Interface RPCI Egress Network Processor Fabric Interface Chip(s) RPCI Primary Interface Printed Circuit Card B2752-01 The Reverse Path Control Interfaces (RPCI) support only the 32-bit CWord width of the dual chip, full duplex flow control interface. The variations of parity support provided by the data interface and the flow control interface are supported by the RPCI. 306 Hardware Reference Manual Intel® IXP2800 Network Processor Media and Switch Fabric Interface The transfer time of CFrames across the RPCI is four times that of the data interface. 
The latency of link-level flow control notifications depends on the frequency of sending new CFrame base headers. As such, the maximum size of CFrames supported on the RPCI should be limited to provide sufficient link-level flow control responsiveness. The behavior of state machines for a full-duplex interface regarding interface initialization, linklevel flow control, and requests to send a de-skew training sequence is supported by the data interface in combination with its reverse path control interface as if the two interfaces were equivalent to a full-duplex interface. The simplex mode of interfacing to the ingress and egress network processor is an alternative to the dual chip full-duplex configuration. It provides earlier notification of Flow Control CFrame congestion within the ingress network processor and marginally less latency for delivery of Flow Control CFrames to the ingress network processor. It allows more of the bandwidth on the data interface to be used for the transfer of data CFrames as Flow Control CFrames are transferred on the RPCI. The simplex configuration provides a straightforward mechanism for the egress network processor to send VOQ flow control to the fabric if the fabric supports such functionality. In the dual chip, full-duplex configuration, the egress network processor sends a request across the PCI to the ingress network processor, requesting that a Flow Control CFrame be sent to the fabric. 8.9.4.2.4 Support for Hybrid Simplex Operation The SPI-4.2 interface may be used to transfer data to and from a fabric, although there is no standard protocol for such conveyance. The necessary addressing information for the fabric and egress network processor may be encoded within the address bits of the preceding control word or stored in the initial data words of the SPI-4.2 burst. The LVTTL status interface may be used to provide link-level flow control for the data bursts. (The SPI-4.2 LVDS status interface cannot be used, because it shares the same pins with the fabric flow control interface.) Figure 115. Reference Model for Hybrid Simplex Operation SPI-4.2 Data Interface Ingress Network Processor SPI-4.2 Status Interface Back Pressure Fabric Interface Chip(s) Flow Control Interface Egress Network Processor SPI-4.2 Status Interface SPI-4.2 Data Interface Printed Circuit Card B2753-01 Hardware Reference Manual 307 Intel® IXP2800 Network Processor Media and Switch Fabric Interface The SPI-4.2 interface does not support a virtual output queue (VOQ) flow control mechanism. The Intel® IXP2800 Network Processor supports use of the CSIX-L1 protocol-based flow control interface (as used in the dual chip, full-duplex configuration) on the ingress network processor, while SPI-4.2 is operational on the data interface. This interface can provide VOQ flow control information from the fabric and allow the transmitter scheduler, implemented in a Microengine within the ingress network processor, to avoid sending data bursts to congested destinations. The fabric should send alternating Idle CFrames and Dead Cycles when there are no Flow Control CFrames to transmit. The CRdy and DRdy “ready bits” should be set to 0 on transmission and are ignored on reception. The fabric should respond to the RXCFC backpressure signal. In this mode of operation, the RXCSRB signal that would normally receive the state of the CRdy and DRdy “ready bits” is not used. 
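As a concrete illustration of how a Microengine transmit scheduler might use the VOQ flow control information delivered over this interface, the sketch below keeps a per-destination XON/XOFF bitmap and consults it before committing a burst. It is only a sketch: the payload format of Flow Control CFrames is fabric-specific, so voq_fc_update() assumes a (port, xoff) pair has already been parsed out elsewhere, and all names here are hypothetical rather than part of the IXP2800 programming interface.

```c
#include <stdint.h>
#include <stdbool.h>

#define VOQ_PORTS 4096u                    /* CSIX-L1 port address space */

static uint32_t voq_xoff[VOQ_PORTS / 32];  /* 1 bit per destination port */

/* Record an XON/XOFF indication extracted from a Flow Control CFrame. */
void voq_fc_update(uint32_t port, bool xoff)
{
    uint32_t word = port / 32;
    uint32_t bit  = 1u << (port % 32);
    if (xoff)
        voq_xoff[word] |= bit;
    else
        voq_xoff[word] &= ~bit;
}

/* The scheduler consults this before committing an SPI-4.2 burst for a
 * destination, so congested ports are skipped on this round. */
bool voq_may_send(uint32_t port)
{
    return (voq_xoff[port / 32] & (1u << (port % 32))) == 0;
}
```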
If dynamic de-skew is configured on the interface, and the backpressure signal is asserted for 32 clock cycles, the fabric sends a (de-skew) training sequence on the flow control interface. It may be acceptable in this configuration to operate the flow control interface at a sufficiently low clock rate that dynamic de-skew is not required. Operation in the hybrid simplex mode for the ingress network processor is slightly more taxing on the transmit scheduler computation than the homogenous CSIX-L1 protocol configurations. The status reported for the data interface must be polled by the transmit scheduler. In this configuration, the response to link-level flow control is performed in software and is slower than in the homogenous CSIX-L1 protocol configurations where it is accomplished in hardware. 8.9.4.2.5 Support for Dynamic De-Skew Training The SPI-4.2 interface incorporates a training sequence for dynamic de-skew of its signals relative to the source synchronous clock. This training sequence has been extended and incorporated into the CSIX-L1 protocol support of the Intel® IXP2800 Network Processor. The training pattern for the 16-bit data interface consists of 20 words, 10 repetitions of 0x0fff followed by 10 repetitions of 0xf000. The CTL and PAR signals are asserted for the first 10 words and de-asserted for the second 10 words. The PROT signal (see below) is de-asserted for the first 10 words and asserted for the second 10 words. A training sequence consists of “alpha” repetitions of the training pattern. The idle control word that precedes a training sequence in SPI-4.2 is not used in conjunction with the CSIX-L1 protocol. See Section 8.6.1 for more information. A receiver should detect a training sequence in the context of the CSIX-L1 protocol implementation by the assertion of the start-of-frame signal for three adjacent clock edges and the correct value on the data signals for those three adjacent clock edges. A receiver may request a training sequence to be sent by transmitting continuous Dead Cycles on the interface. Reception of two adjacent Dead Cycles triggers the transmission of a training sequence in the opposite direction. If an interface is sending Dead Cycles and a training sequence becomes pending, the interface must send the training sequence at a higher priority than the Dead Cycles. Otherwise, a deadlocked situation may arise. In the simplex configuration, the request for training, and the response to it, occur between a primary interface and its associated reverse path control interface. In the dual chip, full-duplex configuration, requests for training and Dead Cycles are encoded across the flow control interface as either continuous Dead Cycles or continuous Idle CFrames, both of which violate the standard CSIX-L1 protocol. 308 Hardware Reference Manual Intel® IXP2800 Network Processor Media and Switch Fabric Interface The training pattern for the flow control data signals consists of 10 nibbles of 0xc followed by 10 nibbles of 0x3. The parity and serial “ready bits” signal is de-asserted for the first 10 nibbles and asserted for the second 10 nibbles. The start-of-frame signal is asserted for the first 10 nibbles and de-asserted for the second 10 nibbles. See Section 8.6.2 for more information. When a training sequence is received, the receiver should update the state of the received CRdy and DRdy “ready bits” to a de-asserted state until they are updated by a subsequent CFrame. 
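For reference, the training pattern for the 16-bit data interface described above can be written down directly. The sketch below only models the pattern (10 words of 0x0fff with CTL and PAR asserted and PROT de-asserted, then 10 words of 0xf000 with the opposite polarities); generating and sampling the pattern is the MSF hardware's job, and the struct and function names are invented for illustration.

```c
#include <stdint.h>
#include <stdbool.h>

struct train_word {
    uint16_t data;
    bool     ctl;   /* TCTL / RCTL */
    bool     par;   /* TPAR / RPAR */
    bool     prot;  /* TPROT / RPROT */
};

/* Fill out[0..19] with one repetition of the data-interface training
 * pattern; a full training sequence is "alpha" repetitions of this. */
void build_training_pattern(struct train_word out[20])
{
    for (int i = 0; i < 20; i++) {
        bool first_half = (i < 10);
        out[i].data = first_half ? 0x0fff : 0xf000;
        out[i].ctl  = first_half;
        out[i].par  = first_half;
        out[i].prot = !first_half;
    }
}
```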
8.9.4.3 CSIX-L1 Protocol Receiver Support The Intel® IXP2800 Network Processor receiver support for the CSIX-L1 protocol is similar to that for SPI-4.2. CFrames are stored in the receiver data buffers. The buffers are configured to be of a size of 64, 128, or 256 bytes. The contents of the CFrame base header and extension header are stored in separate storage with the reception status of the CFrame. Unlike SPI-4.2 data bursts, the entire CFrame must fit into a single buffer. The receiver does not progress to the next buffer to store subsequent parts of a single CFrame. (The buffer is required only to be sufficiently large to accommodate the payload, not the header, the padding, or the vertical parity.) Designated CFrame types, typically Flow Control CFrames, are forwarded in cut-through mode directly to the flow control egress FIFO and not stored in the receiver buffers. The receiver resources are separately allocated to the processing of data and control CFrames. Separate free lists of buffers and Microengine threads for each category of CFrame type are maintained. The size of the buffers in each resource pool is separately configurable. The mapping of CFrame type to data or control category is completely configurable via the CSIX_Type_Map register. This register also allows for any types to be designated for cut-through forwarding to the flow control egress FIFO. Typically, only the Flow Control CFrame type is configured in this way. The receiver buffers are partitioned into two pools via MSF_Rx_Control[RBUF_Partition], providing 75% of the buffer memory (6 Kbytes) for data CFrames and 25% of the buffer memory (2 Kbytes) for control CFrames. The number of buffers available per pool depends on the configured buffer size. For 64-byte buffers, there are 96 and 32 buffers, respectively. For 128-byte buffers, there are 48 and 16 buffers, respectively. For 256-byte buffers, there are 24 and 8 buffers, respectively. As with SPI-4.2, link-level flow control for a buffer pool can be asserted by configuration when buffer consumption reaches 25%, 50%, 75%, or 87.5% within that pool. The receiver has an additional 1024 bytes of packed FIFO storage for each traffic category to accept additional CFrames after link-level flow control (CRdy or DRdy) is asserted. Link-level flow control for control CFrames (CRdy) is also asserted if the flow-control egress FIFO contents exceeds a threshold as configured by HWM_Control[FCEFIFO_HWM]. The threshold may be set to 16, 32, 64, or 128 32-bit words. The total capacity of the FIFO is 512 32-bit words. Within the base header, the receiver hardware processes the CRdy bit, the DRdy bit, the Type field, and the Payload Length. Only the Flow Control Frame CFrame is expected to lack the 32-bit extension header. The receiver hardware validates the vertical parity of the CFrame and only writes it to the receiver buffer if the write operation also includes payload data. The hardware supports configuration options for processing all 16 CFrame types. In all other respects, processing of the CFrame contents is done entirely by software. Variations in the CSIX-L1 protocol are supported that only affect the software processing. These variations might include address swapping (egress port address swapping with ingress port address) and use of reserve bits to encode start and end of packets. When the network processor is configured to forward Flow Control Frame CFrames to the flow control egress FIFO, software does not process those CFrames. 
Processor interrupts occur if there are reception errors, but the actual CFrames are not made available for further processing. Hardware Reference Manual 309 Intel® IXP2800 Network Processor Media and Switch Fabric Interface 8.9.4.4 CSIX-L1 Protocol Transmitter Support The Intel® IXP2800 Network Processor transmitter support for the CSIX-L1 protocol is similar to that for SPI-4.2. The transmitter fetches CFrames from transmitter buffers. An entire CFrame must fit within a single buffer. In the case of SPI-4.2, the array of transmitter buffers operates as a single ring. In the case of CSIX-L1 protocol support, the array of buffers operates as two rings, one for data CFrames and another for control CFrames. The partitioning of the transmitter buffers is configured via MSF_Tx_Control[TBUF_Partition]. The portion of the aggregate transmitter buffer storage (8 Kbytes) allocated to data CFrames is 75% (6 Kbytes), with the remainder (2 Kbytes) allocated to control CFrames. The size of the buffers within each partition is independently configurable to a size of 64, 128, or 256 bytes. The payload size of CFrames sent from the buffers may vary from 1 to the size of the buffer. The CSIX-L1 protocol link-level flow control operates directly upon the hardware that processes the two (control and data) transmitter rings. The transmitter services the two rings in round-robin order when allowed by link-level flow control. The transmitter transmits Idle CFrames and Dead Cycles according to the CSIX-L1 protocol if there are no CFrames to transmit. Virtual output queue flow control is accommodated by a transmit scheduler implemented on a Microengine. In all three network processor ingress configurations, Flow Control CFrames are loaded by hardware into the flow control ingress FIFO. Two state bits associated with this FIFO are distributed to all of the Microengines: (1) the FIFO is non-empty, and (2) the FIFO contains more than a threshold amount of CFrame 32-bit words (HWM_Control[FCIFIFO_Int_HWM]) Any Microengine can perform transmitter scheduling by sensing the state associated with the flow control ingress FIFO, using the branch-on-state instruction. If the FIFO is not empty, the transmit scheduler processes some of the FIFO by performing a read of the FCIFIFO registers. A single Microengine instruction can perform a block read of up to 16 32-bit words. The data for the read is likely to arrive after several subsequent scheduling decisions. The scheduler should incorporate the new information from the newly-read Flow Control CFrame(s) in its later scheduling decisions. If the FIFO state indicates that the threshold capacity has been exceeded, the scheduler should suspend further scheduling decisions until the FIFO is sufficiently processed, otherwise it risks making scheduling decisions with information that is stale. The responsiveness of the network processor to VOQ flow control depends on the transmit pipeline length, from transmit scheduler to CFrames on the interface signals. For rates at or above 10 Gb/s, the pipeline length is likely to be 32 – 64 CFrames, assuming four pipeline stages (schedule, dequeue, data movement, and transmit) and 8 – 16 CFrames concurrently processed per stage. In the simplex configuration, the egress network processor can send CFrames over the Reverse Path Control Interface. The CFrames are loaded into the flow control egress FIFO by performing writes of 32-bit words to the FCEFIFO registers. 
The base header, the extension header, the payload, the padding, and a dummy vertical parity must be written to the FIFO. The transmitter hardware calculates the actual vertical parity as the CFrame is transmitted. Note: The transmitter hardware for the transmitter buffers and the flow control egress FIFO expect that only the Flow Control CFrame type does not have an extension header of 32 bits (all other types have this header). The hardware disregards the contents of the extension header or the payload. The limited gather capability described for SPI-4.2 also is available for CFrames. A prefix header of up to 31 bytes and a disjoint payload is supported. The prefix header may start at an offset of 0 to 7 bytes. The payload may start at an offset of 0 to 7 bytes from the octal-byte boundary following the end of the prefix header. For more complicated merging or shifting of data within a CFrame, the data should be passed through a Microengine to perform any arbitrary merging and/or shifting. 310 Hardware Reference Manual Intel® IXP2800 Network Processor Media and Switch Fabric Interface 8.9.4.5 Implementation of a Bridge Chip to CSIX-L1 The Intel® IXP2800 Network Processor support for the CSIX-L1 protocol in the dual chip, fullduplex configuration minimizes the difficulty in implementing a bridge chip to a standard CSIX-L1 interface. If dynamic de-skew training is not employed, the bridge chip can directly pass through the different CSIX-L1 protocol elements, CFrames, and Dead Cycles. The horizontal parity must be recalculated on each side of the bridge chip. If the standard CSIX-L1 interface implements a CWord width that is greater than 32 bits, it must implement a synchronization mechanism for aligning the received 32-bit portions of the CWord before passing the CWord to the network processor. For transmitting the standard CSIX-L1 interface, the bridge chip must assert the start-of-frame signal for each 32-bit portion of the CWord, as the network processor only asserts it for the first 32-bit portion. If the bridge chip requires clock frequencies on the network processor interface and the standard CSIX-L1 interface to be appropriate, exact multiples of each other (2x for 32-bit CWord, 4x for 64-bit CWord, 6x for 96-bit CWord, and 8x for 128-bit CWord), then the bridge chip requires only minimal buffering and does not need to implement any flow control mechanisms. A slightly more complicated bridge allows incorporating dynamic de-skew training and/or independent clock frequencies for the network processor and standard CSIX-L1 interfaces. The bridge chip must implement a control and data FIFO for each direction and the link-level flow control mechanisms specified in the protocol using CRdy and DRdy. The FIFOs must be large enough to accommodate the response latency of the link-level flow control mechanisms. Idle CFrames and Dead Cycles are not directly passed through this more complicated bridge chip, but are discarded on reception and generated on transmission. The network processor interface of this bridge chip can support the dynamic de-skew training protocol extensions implemented on the network processor because it can send a training sequence to the network processor between CFrames without regard to CFrames arriving over the standard CSIX-L1 interface. (In the simpler bridge design, these CFrames must be forwarded immediately to the network processor.) 
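As a quick cross-check of the buffer partitioning arithmetic in Sections 8.9.4.3 and 8.9.4.4, the fragment below derives the per-pool buffer counts from the 8-Kbyte aggregate RBUF/TBUF storage and the 75%/25% data/control split; it reproduces the 96/32, 48/16, and 24/8 figures quoted in the text.

```c
#include <stdio.h>

int main(void)
{
    const unsigned total_bytes   = 8 * 1024;            /* aggregate RBUF or TBUF   */
    const unsigned data_bytes    = total_bytes * 3 / 4; /* 6 Kbytes for data CFrames */
    const unsigned control_bytes = total_bytes / 4;     /* 2 Kbytes for control      */
    const unsigned sizes[] = { 64, 128, 256 };

    for (int i = 0; i < 3; i++)
        printf("%3u-byte buffers: %u data, %u control\n",
               sizes[i], data_bytes / sizes[i], control_bytes / sizes[i]);
    /* Prints 96/32, 48/16, and 24/8, matching Section 8.9.4.3. */
    return 0;
}
```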
Hardware Reference Manual 311 Intel® IXP2800 Network Processor Media and Switch Fabric Interface 8.9.5 Dual Protocol (SPI and CSIX-L1) Support In many system designs that are less bandwidth-intensive, a single network processor can forward and process data from the framer to the fabric and from the fabric to the framer. A bridge chip must pass data between the network processor and multiple physical devices. The network processor supports multiplexing SPI-4.2 and CSIX-L1 protocol elements over the same transmitter and receiver physical interfaces, differentiated by a protocol signal that is de-asserted for SPI-4.2 protocol elements and asserted for CSIX-L1 protocol elements. In the dual protocol configuration, the CSIX-L1 configuration of the network processor corresponds to the dual chip, full duplex configuration. The flow control transmitter interface is looped back to the flow control receiver interface, either externally or internally. Only the LVTTL status interface is available for the SPI-4.2 interface. 8.9.5.1 Dual Protocol Receiver Support When the network processor receiver is configured for dual protocol support, the aggregate receiver buffer is partitioned in three ways: 50% for data CFrames (4 Kbytes), 37.5% for SPI-4.2 bursts (3 Kbytes) and 12.5% for control CFrames (1 Kbyte). The buffer sizes within each partition are independently configurable. Link-level flow control can be independently configured for assertion at thresholds of 25%, 50%, 75%, or 87.5%. For the traffic associated with each partition, an additional 680 bytes of packed FIFO storage is available to accommodate received traffic after assertion of link-level flow control. 8.9.5.2 Dual Protocol Transmitter Support When the network processor transmitter is configured for dual protocol support, the aggregate transmitter buffer is partitioned three ways, in the same proportions as the receiver. Each partition operates as a separate ring. The transmitter services each ring in round-robin order. If no CFrames are pending, an Idle CFrame is transmitted to update link-level flow control. If no SPI-4.2 bursts are pending, idle control words are not sent. 312 Hardware Reference Manual Intel® IXP2800 Network Processor Media and Switch Fabric Interface 8.9.5.3 Implementation of a Bridge Chip to CSIX-L1 and SPI-4.2 A bridge chip can provide support for both standard CSIX-L1 and standard physical layer device interfaces such as SPI-3 or UTOPIA Level 3. The bridge chip must implement the functionality of the less trivial CSIX-L1 bridge chip described previously and additionally, implement bridge functionality between SPI-4.2 and the other physical device interfaces. The size of the FIFOs must be in accordance with the response times of the flow control mechanisms. Figure 116 is a block diagram of a dual protocol (SPI-4.2 and CSIX-L1) bridge chip. Figure 116. Block Diagram of Dual Protocol (SPI-4.2 and CSIX-L1) Bridge Chip DE-MUX CSIX-L1 Control SPI ARB Data DE-MUX SPI/ UTOPIA-3 SPI/ UTOPIA-3 Control ARB ARB DE-MUX Dual-Protocol Intel® IXP2800 Network Processor Interface Data SPI/ UTOPIA-3 SPI B2754-01 Hardware Reference Manual 313 Intel® IXP2800 Network Processor Media and Switch Fabric Interface 8.9.6 Transmit State Machine Table 114 describes the transmitter state machine by providing guidance in interfacing to the network processor. The state machine is described as three separate state machines for SPI-4.2, training, and CSIX-L1. When each machine is inactive, it tracks the states of the other two state machines. 
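One way to hold the three cooperating transmitter state machines in mind before reading the tables is as a single active machine plus two tracking machines, as in the minimal modelling sketch below. This is purely a reading aid for Table 114 through Table 116: the real transitions are implemented in hardware, and the enum and field names here are invented rather than part of any Intel software interface.

```c
enum tx_machine { TX_SPI42, TX_TRAINING, TX_CSIX };

struct tx_state {
    enum tx_machine active;      /* machine currently driving the pins       */
    int             spi_state;   /* e.g. Idle Control, Payload Control, ...  */
    int             train_state; /* e.g. Training Control, Training Data     */
    int             csix_state;  /* e.g. SoF CWord, Idle CFrame, Dead Cycle  */
};

/* When one machine hands over (for example, the Training SM "entering
 * CSIX-L1 state"), the other two remain in their tracking states so
 * they can resume consistently later. */
void tx_handover(struct tx_state *s, enum tx_machine next)
{
    s->active = next;
}
```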
8.9.6.1 SPI-4.2 Transmitter State Machine The SPI-4.2 Transmit State Machine makes state transitions on each bus transfer of 16 bits, as described in Table 114. Table 114. SPI-4.2 Transmitter State Machine Transitions on 16-Bit Bus Transfers Current State Next State Conditions Idle Control No data pending and no training sequence pending, CSIX-L1 mode disabled. Payload Control Data pending and no training sequence pending, CSIX-L1 mode disabled. Training Training sequence pending, CSIX-L1 mode disabled. CSIX CSIX-L1 mode enabled. Payload Control Data Burst Always Data Burst Data Burst Until end of burst as programmed by software. Payload Control Data pending and no training sequence pending and CSIX-L1 mode not enabled. Idle Control No data to send or training sequence pending or CSIXL1 mode enabled. Idle Control Tracking Other State Machine States Training CSIX 314 Training Training SM not entering CSIX-L1 or SPI state. CSIX Training SM entering CSIX-L1 state. Payload Control Training SM entering SPI state and data pending. Idle Control Training SM entering SPI state and no data pending. CSIX CSIX-L1 SM not entering Training or SPI state. Training CSIX-L1 SM entering Training state. Payload Control CSIX-L1 SM entering SPI state and data pending. Idle Control CSIX-L1 SM entering SPI state and no data pending. Hardware Reference Manual Intel® IXP2800 Network Processor Media and Switch Fabric Interface 8.9.6.2 Training Transmitter State Machine The Training State Machine makes state transitions on each bus transfer of 16 bits, as described in Table 115. Table 115. Training Transmitter State Machine Transitions on 16-Bit Bus Transfers Current State Training Control Training Data Next State Conditions Training Control Until 10 control cycles. Training Data After 10 control cycles. Training Data Until 10 data cycles. Training Control After 10 data cycles and repetitions of training sequence or new training sequence pending. CSIX After 10 data cycles and no training sequence pending and CSIX-L1 mode enabled. SPI After 10 data cycles and No training sequence pending and CSIX-L1 mode disabled. Tracking Other State Machine States CSIX SPI 8.9.6.3 CSIX CSIX-L1 SM not entering SPI or Training state. SPI CSIX-L1 SM entering SPI state. Training Control CSIX-L1 SM entering Training state. SPI SPI SM not entering CSIX-L1 or Training state. CSIX SPI SM entering CSIX-L1 state. Training Control SPI SM entering Training state. CSIX-L1 Transmitter State Machine The CSIX-L1 Transmit State Machine makes state transitions on CWord boundaries. CWords can be configured to consist of 32, 64, 96, or 128 bits, corresponding to 2, 4, 6, or 8 bus transfers, as described in Table 116. Table 116. CSIX-L1 Transmitter State Machine Transitions on CWord Boundaries (Sheet 1 of 2) Current State Next State Conditions SoF CWord CFrame CWord Dead Cycle CFrame fits in a CWord. CFrame CWord CFrame CWord CFrame remainder pending. SoF CWord Un-flow-controlled CFrame pending, no training sequence pending, and SPI mode not enabled. Dead Cycle No un-flow-controlled CFrame pending or training sequence pending or requesting training sequence or SPI mode enabled and data pending. SoF CWord Un-flow-controlled CFrame pending and no training sequence pending and no SPI data pending and not requesting training sequence. Idle CFrame No un-flow-controlled CFrame pending and no training sequence pending and no SPI data pending and not requesting training sequence. Dead Cycle Hardware Reference Manual CFrame longer than a CWord. 
315 Intel® IXP2800 Network Processor Media and Switch Fabric Interface Table 116. CSIX-L1 Transmitter State Machine Transitions on CWord Boundaries (Sheet 2 of 2) Current State Next State Dead Cycle Idle CFrame Conditions Requesting reception of training sequence and no training sequence pending. Training Training sequence pending. SPI Training sequence not pending and SPI data pending and not requesting training sequence. Dead Cycle Always. Tracking Other State Machine States SPI Training 8.9.7 SPI SPI SM not entering CSIX-L1 or Training state. SoF CWord SPI SM entering CSIX-L1 state and un-flow-controlled CFrame pending. Idle CFrame SPI SM entering CSIX-L1 state and un-flow-controlled CFrame not pending. Training SPI SM entering Training state. Training Training SM not entering CSIX-L1 or Training state. SoF CWord Training SM entering CSIX-L1 state and un-flowcontrolled CFrame pending. Idle CFrame Training SM entering CSIX-L1 state and un-flowcontrolled CFrame not pending. SPI Training SM entering SPI state. Dynamic De-Skew The Intel® IXP2800 Network Processor supports optional dynamic de-skew for the signals of the 16-bit data interface and the signals of the 4-bit flow control interface or the signals of the 2-bit SPI-4.2 LVDS status interface. (The flow control interface and the LVDS status interface are alternate configurations of the same signal balls and pads. They share the same de-skew circuits.) In both cases, eight evenly-spaced phases of the received clock are generated for each bit time. As the transition occurs during training a pattern, the best pair of clock phases is identified for sampling each received signal. An interpolated clock is generated from a pair of clock phases for each signal and that clock is used as a reference for sampling the data. This provides maximum quantization error in the sampling of the signals of 6.25%. 316 Hardware Reference Manual Intel® IXP2800 Network Processor Media and Switch Fabric Interface 8.9.8 Summary of Receiver and Transmitter Signals Figure 117 summarizes the Receiver and Transmitter Signals. Figure 117. Summary of Receiver and Transmitter Signaling RDAT (CSIX:TxData) [15:0] TDAT (CSIX:RxData) [15:0] DDR LVDS SPI-4.2 Data Path and Interface for CSIX Protocol RCTL (CSIX:TxSOF) RCLK (CSIX:TxClk) RPAR (CSIX:TxPar) TCTL (CSIX:RxSOF) TCLK (CSIX:RxClk) TPAR (CSIX:RxPar) RPROT TPROT RSTAT[1:0] LVTTL SPI-4.2 Status Interface TXCCLK or RSCLK TXCSOF TSCLK TSTAT[1:0] RXCCLK or TSCLK TXCDAT[1:0] or RSTAT[1:0] TXCDAT[3:2] Transmitter RSCLK Reciever TCLK_REF RXCDAT[1:0] or TSTAT[1:0] DDR LVDS SPI-4.2 Status Interface and Inter-Chip CSIX Flow Control RXCDAT[3:2] RXCSOF TXCPAR RXCPAR TXCFC RXCFC TXCSRB RXCSRB Intel® IXP2800 Network Processor B2755-01 Hardware Reference Manual 317 Intel® IXP2800 Network Processor Media and Switch Fabric Interface 318 Hardware Reference Manual Intel® IXP2800 Network Processor PCI Unit PCI Unit 9 This section contains information on the IXP2800 Network Processor PCI Unit. 9.1 Overview The PCI Unit allows PCI target transactions to internal registers, SRAM, and DRAM. It also generates PCI initiator transactions from the DMA Engine, Intel XScale® core, and Microengines. 
The PCI Unit main functional blocks are shown in Figure 118 and include: • • • • • • • PCI Core Logic PCI Bus Arbiter DRAM Interface Logic SRAM Interface Logic Mailbox and Message registers DMA Engine Intel XScale® core Direct Access to PCI The main function of the PCI Unit is to transfer data between the PCI Bus and the internal devices, which are the Intel XScale® core, the internal registers, and memories. These are the data transfer paths supported as shown in Figure 119: • PCI Slave read and write between PCI and internal buses — CSRs (PCI_CSR_BAR) — SRAM (PCI_SRAM_BAR) — DRAM (PCI_DRAM_BAR) • Push/Pull Master (Intel XScale® core, Microengine, or PCI) accesses to internal registers within PCI unit • DMA — Descriptor read from SRAM — Data transfers between PCI and DRAM • Push/Pull Master (Intel XScale® core and Microengines) direct read and write to PCI Bus Note: Detailed information about CSRs is contained in the Intel® IXP2400 and IXP2800 Network Processor Programmer’s Reference Manual. Hardware Reference Manual 319 Intel® IXP2800 Network Processor PCI Unit Figure 118. PCI Functional Blocks 64-bit PCI Bus (@ 33 / 66 MHz) PCI UNIT Core Interface Initiator Address FIFO PCI Bus Host Functions Initiator Read FIFO Initiator Write FIFO PCI Configuration Target Read FIFO Target Write FIFO Target Address FIFO Slave Write Buffer Slave Address Register FIFO Bus (FBUS) Master Address Register PCI CSRs DMA Rad/Write Buffer DMA DRAM Interface DMA SRAM Interface Direct Buffer Direct Interface Slave Interface Slave Interface DRAM Data Interface SRAM Data Interface Address Interace Master Interface Command Bus Slave 32 Pull SRAM 32 Push Bus 91 Command Bus Command Bus Master 91 Command Bus 32 32 Pull Push SRAM BUS 64 64 Pull Push DRAM BUS A9765-01 320 Hardware Reference Manual Intel® IXP2800 Network Processor PCI Unit Figure 119. Data Access Paths PCI UNIT TGT CSR R/W CSRs (via SRAM Push/Pull Buses) TGT DRAM R/W DRAM (via DRAM Push/Pull Buses) TGT SRAM R/W SRAM (via SRAM Push/Pull Buses) Target FIFO PCI Bus Slave Buffer CSR & Conf Registers Local Internal Reg R/W Descriptor Registers DMA Buffer Master FIFO/ Register Note: Command Master Intel XScale® Core Register Intel XScale® Core, MIcroengines, and PCI (via SRAM Push/Pull Buses) Unit Descriptor Read SRAM (via SRAM Push/Pull Buses) DMA Memory R/W DRAM (via DRAM Push/Pull Buses) Push/Pull Bus to PCI R/W Push/Pull Command Master (via SRAM Push/Pull Buses) Command Slave A9766-03 9.2 PCI Pin Protocol Interface Block This block generates the PCI compliant protocol logic. It operates either as an initiator or a target device on the PCI Bus. As an initiator, all bus cycles are generated by the core. As a PCI target, the core responds to bus cycles that have been directed towards it. On the PCI Bus, the interface supports interrupts, 64-bit data path, 32-bit addressing, and single configuration space. The local configuration registers are accessible from the PCI Bus or from the Intel XScale® core through an internal path. The PCI block interfaces with the other sub-blocks with a FIFO bus called FBus. The FBus speed is the same as the internal Push/Pull bus speed. The FIFOs are implemented with clock synchronization logic between the PCI speed and the internal Push/Pull bus speed. There are four data FIFOs and two address FIFOs in the core. The separate slave and master data FIFOs allows simultaneous operations and multiple outstanding PCI bus transfers. Table 117 lists the FIFO sizes. The target address FIFO latches up to four PCI read or write addresses. 
Hardware Reference Manual 321 Intel® IXP2800 Network Processor PCI Unit If a read address is latched, the subsequent cycles will be retried and no address will be latched until the read completes. The initiator address FIFO can accumulate up to four addresses that can be PCI reads or writes. These FIFOs are inside the PCI Core, which stores data received from the PCI Bus or data to be sent out to the PCI Bus. There are additional buffers implemented in other sub-blocks that buffers data to and from the internal push/pull buses. Table 117. PCI Block FIFO Sizes Location Depth Target Address 4 Target Write Data 8 Target Read Data 8 Initiator Address 4 Initiator Write Data 8 Initiator Read Data 8 Table 118 lists the maximum PCI Interface loading. Table 118. Maximum Loading Bus Interface PCI 9.2.1 Maximum Number of Loads Four loads at 66-MHz bus frequency Eight loads at 33-MHz bus frequency Trace Length (inches) 5 to 7 PCI Commands Table 119 lists the supported PCI commands and identifies them as either a target or initiator. Table 119. PCI Commands (Sheet 1 of 2) Support C_BE_L Target Initiator Interrupt Acknowledge Not Supported Supported 0x1 Special Cycle Not Supported Supported 0x2 IO Read cycle Not Supported Supported 0x3 IO Write cycle Not Supported Supported 0x4 Reserved – – 0x0 322 Command 0x5 Reserved – – 0x6 Memory Read Supported Supported 0x7 Memory Write Supported Supported 0x8 Reserved – – 0x9 Reserved – – 0xA Configuration Read Supported Supported 0xB Configuration Write Supported Supported Hardware Reference Manual Intel® IXP2800 Network Processor PCI Unit Table 119. PCI Commands (Sheet 2 of 2) Support C_BE_L Command Target Initiator 0xC Memory Read Multiple Aliased as Memory Read except SRAM accesses where the number of Dwords to read is given by the cache line size. Supported 0xD Reserved — — 0xE Memory read line Aliased as Memory Read except SRAM accesses where the number of Dwords to read is given by the cache line size. Supported 0xF Memory Write and Invalidate Aliased as Memory Write. Not Supported PCI functions not supported by the PCI Unit include: • • • • • • 9.2.2 IO Space response as a target Cacheable memory VGA palette snooping PCI Lock Cycle Multi-function devices Dual Address cycle IXP2800 Network Processor Initialization When the IXP2800 Network Processor is a target, the internal CSR, DRAM, or SRAM address is generated when the PCI address matches the appropriate base address register. The window sizes to the SRAM and DRAM Base Address Registers (BARs) can be optionally set by PCI_SWIN and PCI_DWIN strap pins or mask registers depending on the state of the PROM_BOOT signal. There are two initialization modes supported. They are determined by the PROM_BOOT signal sampled on the de-assertion edge of Chip Reset. If PROM_BOOT is asserted, then there is a boot prom in the system. The Intel XScale® core will boot from the prom and be able to program the BAR space mask registers. If PROM_BOOT is not asserted, the Intel XScale® core is held in reset and the BAR sizes are determined by strap pins. Hardware Reference Manual 323 Intel® IXP2800 Network Processor PCI Unit 9.2.2.1 Initialization by the Intel XScale® Core The PCI unit is initialized to an inactive, disabled state until the Intel XScale® core has set the Initialize Complete bit in the Control register. This bit is set after the Intel XScale® core has initialized the various PCI base address and mask registers (which should occur within 1 ms of the end of PCI_RESET). 
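A minimal sketch of that boot-time sequence is shown below, assuming hypothetical CSR indices, illustrative mask values, and an assumed bit position for Initialize Complete; the real addresses and encodings are defined in the Programmer's Reference Manual. The mask-register details continue after the sketch.

```c
#include <stdint.h>

/* Stand-in for memory-mapped CSR access; on real hardware these would be
 * volatile MMIO accesses at addresses from the Programmer's Reference
 * Manual.  All indices and bit positions below are assumptions used only
 * to illustrate the sequence. */
static uint32_t csr_file[64];
static void     csr_write(unsigned reg, uint32_t v) { csr_file[reg] = v; }
static uint32_t csr_read(unsigned reg)              { return csr_file[reg]; }

enum {
    PCI_SRAM_BAR_MASK = 0,   /* assumed index: SRAM BAR window mask */
    PCI_DRAM_BAR_MASK = 1,   /* assumed index: DRAM BAR window mask */
    PCI_CONTROL       = 2,   /* register named in this chapter      */
};
#define PCI_CONTROL_INIT_COMPLETE (1u << 0)   /* assumed bit position */

/* Sketch of the PROM-boot flow of Section 9.2.2.1: program the BAR mask
 * registers, then set Initialize Complete in PCI_CONTROL.  This must
 * happen within 1 ms of the end of PCI_RESET; until then the PCI unit
 * retries configuration cycles (non-host) or holds PCI reset (host). */
void pci_unit_early_init(void)
{
    csr_write(PCI_SRAM_BAR_MASK, 0x03FFFFFFu);   /* illustrative 64-Mbyte window  */
    csr_write(PCI_DRAM_BAR_MASK, 0x0FFFFFFFu);   /* illustrative 256-Mbyte window */
    csr_write(PCI_CONTROL, csr_read(PCI_CONTROL) | PCI_CONTROL_INIT_COMPLETE);
}
```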
The mask registers are used to initialize the PCI base address registers to values other than the default power-up values, which includes the base address visible to the PCI host and the prefetchable bit in the base registers (see Table 120). Table 120. PCI BAR Programmable Sizes Base Address Register Address Space Sizes PCI_CSR_BAR CSR 1 Mbyte PCI_SRAM_BAR SRAM 0 Bytes; 128, 256, or 512 Kbytes; 1, 2, 4, 8, 16, 32, 64, 128, or 256 Mbytes PCI_DRAM_BAR DRAM 0 Bytes; 1, 2, 4, 8, 16, 32, 64, 128, 256, or 512 Mbytes; 1 Gbyte When the PCI unit is in the inactive state, it returns retry responses as the target of PCI configuration cycles if the PCI Unit is not configured as the PCI host. In the case of PCI Unit being configured as the PCI host, the PCI bus will be held in reset until the Intel XScale® core completes the PCI Bus configurations and clears the PCI Reset (as described in Section 9.2.11). Note: 9.2.2.2 During PCI bus enumeration initiated by the Intel XScale® core, reading a non-existent address (an address for which no target asserts DEVSEL) results in a Master Abort. The Master Abort then results in an Intel XScale® Core Data Abort Exception that must be handled by the enumeration software. When this occurs, the RMA bit in the PCI_CONTROL register and the RX_MA bit in the PCI_CMD_STAT register is set. The enumeration software must then clear these bits before continuing with the enumeration process. Initialization by a PCI Host In this mode, the PCI Unit is not hosting the PCI Bus regardless of the PCI_CFG[0] signal. The host processor is allowed to configure the internal CSRs while the Intel XScale® core is held in reset. The host processor configures the PCI address space, the memory controllers, and other interfaces. Also, the program code for the Intel XScale® core may be downloaded into local memory. The host processor then clears the Intel XScale® core reset bit in the PCI Reset register. This deasserts the internal reset signal to the Intel XScale® core and the core begins its initialization process. The PCI_SWIN and PCI_DWIN strap signals are used to select the window sizes to SRAM BAR and DRAM BAR (see Table 121). Table 121. PCI BAR Sizes with PCI Host Initialization Base Address Register 324 Address Space Sizes PCI_CSR_BAR CSR 1 Mbyte PCI_SRAM_BAR SRAM 32, 64, 128, or 256 Mbytes PCI_DRAM_BAR DRAM 128, 256, or 512 Mbytes; 1 Gbyte Hardware Reference Manual Intel® IXP2800 Network Processor PCI Unit 9.2.3 PCI Type 0 Configuration Cycles A PCI access to a configuration register occurs when the following conditions are satisfied: • PCI_IDSEL is asserted. (PCI_IDSEL only supports PCI_AD[23:16] bits). • The PCI command is a configuration write or read. • The PCI_AD [1:0] are 00. A configuration register is selected by PCI_AD[7:2]. If the PCI master attempts to do a burst longer than one 32-bit Dword, the PCI unit signals a target disconnect. PCI unit does not issue PCI_ACK64 for configuration cycle. 9.2.3.1 Configuration Write A write occurs if the PCI command is a Configuration Write. The PCI byte-enables determine which bytes are written.If a nonexistent configuration register is selected within the configuration register address range, the data is discarded and no error action is taken. 9.2.3.2 Configuration Read A read occurs if the PCI command is a Configuration Read. The data from the configuration register selected by PCI_AD[7:2] is returned on PCI_AD[31:0]. 
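The following sketch illustrates the Type 0 configuration decode just described, using the Configuration Read/Write command encodings of Table 119 (0xA and 0xB). The function names are illustrative only, not part of any driver interface.

```c
#include <stdbool.h>
#include <stdint.h>

/* Type 0 configuration decode from Section 9.2.3: a configuration
 * register is accessed when PCI_IDSEL is asserted, the command is
 * Configuration Read (0xA) or Configuration Write (0xB), and
 * PCI_AD[1:0] == 00.  The register index comes from PCI_AD[7:2]. */
static bool is_type0_config_access(bool idsel, uint8_t cmd, uint32_t ad)
{
    bool cfg_cmd = (cmd == 0xA) || (cmd == 0xB);
    return idsel && cfg_cmd && ((ad & 0x3u) == 0);
}

static unsigned config_register_index(uint32_t ad)
{
    return (ad >> 2) & 0x3Fu;   /* PCI_AD[7:2] selects one of 64 Dword registers */
}
```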
If a nonexistent configuration register is selected within the configuration register address range, the data returned is all zeros and no error action is taken.

9.2.4 PCI 64-Bit Bus Extension

The PCI Unit is in 64-bit mode when PCI_REQ64_L is sampled active on the de-assertion edge of PCI Reset. The general rules for assertion of PCI_REQ64_L and PCI_ACK64_L are as follows:

As a target:
1. The PCI Unit asserts PCI_ACK64_L only in 64-bit mode.
2. The PCI Unit asserts PCI_ACK64_L only for target cycles that match the PCI_SRAM_BAR or PCI_DRAM_BAR and for which a 64-bit transaction is negotiated.
3. The PCI Unit does not assert PCI_ACK64_L for target cycles that match the PCI_CSR_BAR, even if a 64-bit transaction is negotiated.

As an initiator:
1. The PCI Unit asserts PCI_REQ64_L only in 64-bit mode.
2. The PCI Unit asserts PCI_REQ64_L to negotiate a 64-bit transaction only if the address is double Dword aligned (PCI_AD[2] must be 0 during the address phase).
3. If the target responds to PCI_REQ64_L with PCI_ACK64_L de-asserted, the PCI Unit completes the transaction acting as a 32-bit master by not asserting PCI_REQ64_L on subsequent cycles.
4. If the target responds to PCI_REQ64_L with PCI_ACK64_L de-asserted and PCI_STOP_L asserted, the PCI Unit completes the transaction by not asserting PCI_REQ64_L on subsequent cycles.

9.2.5 PCI Target Cycles

The following PCI transactions are not supported by the PCI Unit as a target:
• IO read or write
• Type 1 configuration read or write
• Special cycle
• IACK cycle
• PCI Lock cycle
• Multi-function devices
• Dual Address cycle

9.2.5.1 PCI Accesses to CSR

A PCI access to a CSR occurs if the PCI address matches the CSR base address register (PCI_CSR_BAR). The PCI Bus is disconnected after the first data phase if the access is more than one data phase. For 64-bit CSR accesses, the PCI Unit does not assert PCI_ACK64_L on the PCI bus.

9.2.5.2 PCI Accesses to DRAM

A PCI access to DRAM occurs if the PCI address matches the DRAM base address register (PCI_DRAM_BAR).

9.2.5.3 PCI Accesses to SRAM

A PCI access to SRAM occurs if the PCI address matches the SRAM base address register (PCI_SRAM_BAR). The SRAM is organized as three distinct channels and the address space is not contiguous. The PCI_SRAM_BAR programmed window size is used as the total memory space. The upper two bits of the address are used as the channel number to address the particular channel, and the remaining address bits are used as the memory address.

9.2.5.4 Target Write Accesses from the PCI Bus

A PCI write occurs if the PCI address matches one of the base address registers and the PCI command is either a Memory Write or Memory Write and Invalidate. The core stores up to four write addresses into the target address FIFO along with the BAR IDs of the transactions. The write data is stored into the target write FIFO. When either the address FIFO or the data FIFO is full, a retry is forced on the PCI Bus in response to write accesses. The FIFO data is forwarded to an internal slave buffer before being written into SRAM or DRAM. If the FIFO fills during the write, if the address crosses a 64-byte address boundary, or if the command is a burst to the CSR space, the PCI unit signals a target disconnect to the PCI master.
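The SRAM decode of Section 9.2.5.3 can be illustrated with the following sketch, which assumes the two channel-select bits are the top bits of the programmed (power-of-two) PCI_SRAM_BAR window; the structure and function names are illustrative only.

```c
#include <stdint.h>

struct sram_decode {
    unsigned channel;   /* 0..3; the SRAM has three channels, so one encoding may be unused */
    uint32_t offset;    /* address within the selected channel */
};

/* Sketch of the SRAM target decode from Section 9.2.5.3: the upper two
 * bits of the offset inside the PCI_SRAM_BAR window select the channel
 * and the remaining bits form the memory address.  window_size is the
 * programmed PCI_SRAM_BAR size in bytes (assumed to be a power of two,
 * as in Table 120). */
static struct sram_decode sram_target_decode(uint32_t pci_addr,
                                             uint32_t sram_bar_base,
                                             uint32_t window_size)
{
    uint32_t off = pci_addr - sram_bar_base;        /* offset into the BAR window */
    unsigned chan_shift = 0;
    while ((1u << (chan_shift + 2)) < window_size)  /* locate the top two bits    */
        chan_shift++;

    struct sram_decode d = {
        .channel = (off >> chan_shift) & 0x3u,
        .offset  = off & ((1u << chan_shift) - 1u),
    };
    return d;
}
```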
326 Hardware Reference Manual Intel® IXP2800 Network Processor PCI Unit 9.2.5.5 Target Read Accesses from the PCI Bus A PCI read occurs if the PCI address matches one of the base address registers and the PCI command is either a Memory Read, Memory Read Line, or Memory Read Multiple. The read is completed as a PCI delayed read. That is, on the first occurrence of the read, the PCI unit signals a retry to the PCI master,. If there is no prior read pending, the PCI unit latches the address and command and places it into the target address FIFO. When the address reaches the head of the FIFO, the PCI unit reads the DRAM. Subsequent reads will also get retry responses until data is available. When the read data is returned into the PCI Read FIFO, the PCI unit begins to decrement its discard timer. If the PCI bus master has not repeated the read by the time the timer reaches 0, the PCI unit discards the read data, invalidates the delayed read address and sets Discard Timer Expired (bit 16) in the Control register (PCI_CONTROL). If enabled, the PCI unit interrupts the Intel XScale® core. The discard timer counts 215 (32,768) PCI clocks. When the master repeats the read command, the PCI unit compares the address and checks that the command is a Memory Read, a Memory Read Line, or a Memory Read Multiple. If there is a match, the response is as follows: • If the read data has not yet been read, the response is retry. • If the read data has been read, assert trdy_l and deliver the data. If the master attempts to continue the burst past the amount of data read, the PCI unit signals a target disconnect. • CSR reads are always 32-bit reads. • If the discard timer has expired for a read, the subsequent read will be treated as a new read. 9.2.6 PCI Initiator Transactions PCI master transactions are caused by either the Intel XScale® core loads and stores that fall into the various PCI address spaces, Microengine read and write commands, or by the DMA engine. The command register (PCI_COMMAND) bus master bit (BUS_MASTER) must be set for the PCI unit to perform any of the initiator transactions. The PCI cycle is initiated when there is an entry in the PCI Core Interface initiator address FIFO. The core handshakes with the master interface with the FBus FIFO status signals. The PCI core supports both burst and non-burst master read transfers by the burst count inputs (FB_BstCntr[7:0]), driven by Master Interface to inform the core the burst size. For a Master write, FB_WBstonN indicates to the PCI core whether the transfers are burst or non-burst, on a 64-bit double Dword basis. The PCI core supports read and write memory cycles as an initiator while taking care of all disconnect/retry situations on the PCI Bus. 9.2.6.1 PCI Request Operation If an external arbiter is used (PCI_CFG_ARB[1] is not active), the reql[0] and gnt[0] are connected to the PCI_REQ_L and PCI_GNT_L pins. Otherwise, they are connected to the internal arbiter. The PCI unit asserts req_l[0] to act as a bus master on the PCI. If gnt_l[0] is asserted, the PCI unit can start a PCI transaction regardless of the state of req_l[0]. When the PCI unit requests the PCI bus, it performs a PCI transaction when gnt_l[0] is received. Once req_l[0] is asserted, the PCI unit Hardware Reference Manual 327 Intel® IXP2800 Network Processor PCI Unit never de-asserts it prior to receiving gnt_l[0] or de-asserts it after receiving gnt_l[0] without doing a transaction. 
PCI Unit de-asserts req_l[0] for two cycles when it receives a retry or disconnect response from the target. 9.2.6.2 PCI Commands The following PCI transactions are not generated by PCI Unit as an initiator: • PCI Lock Cycle • Dual Address cycle • Memory Write and Invalidate 9.2.6.3 Initiator Write Transactions The following general rules apply to the write command transactions: • If the PCI unit receives either a target retry response or a target disconnect response before all of the write data has been delivered, it resumes the transaction at the first opportunity, using the address of the first undeliverable data. • If the PCI unit receives a master abort, it discards all of the write data from that transaction and sets the status register (PCI_STATUS) received master abort bit, which, if enabled, interrupts the Intel XScale® core. • If the PCI unit receives a target abort, it discards all of the remaining write data from that transaction, if any, and sets the status registers (PCI_STATUS) received target abort bit, which, if enabled, interrupts the Intel XScale® core. • The PCI unit can dessert frame_l prior to delivering all data due to the master latency timer, If this occurs, it resumes the write at the first opportunity, using the address of the first undeliverable data. 9.2.6.4 Initiator Read Transactions The following general rules apply to the read command transactions: • If the PCI unit receives a target retry, it repeats the transaction at the first opportunity until the whole transaction is completed. • If the PCI unit receives a master abort, it substitutes 0xFFFF FFFF for the read data and sets the status register (PCI_STATUS) received master abort bit, which, if enabled, interrupts the Intel XScale® core. • If the PCI unit receives a target abort, it sets the status registers (PCI_STATUS) received target abort bit, which, if enabled, interrupts the Intel XScale® core and does not try to get any more read data. PCI unit will substitute 0xFFFF FFFF for the data which are not read and complete the cycle. 9.2.6.5 Initiator Latency Timer When the PCI unit begins PCI transaction as an initiator, asserting frame_l, it begins to decrement its master latency timer. When the timer value reaches 0, the PCI unit checks the value of gnt_l[0]. If gnt_l[0] is de-asserted, the PCI unit de-asserts frame_l (if it is still asserted) at the earliest opportunity. This is normally the next data phase for all transactions. 328 Hardware Reference Manual Intel® IXP2800 Network Processor PCI Unit 9.2.6.6 Special Cycle As an initiator, special cycles are broadcast to all PCI agents, so DEVSEL_L is not asserted and no error can be received. 9.2.7 PCI Fast Back-to-Back Cycles The core supports fast back-to-back target cycles on the PCI Bus. The core does not generate initiator fast back-to-back cycles on the PCI Bus regardless of the value in the fast back-to-back enable bit of the Status and Command register in the PCI configuration space. 9.2.8 PCI Retry As a slave, the PCI Unit generates retry on: • A slave write when the Data write FIFO is full. • When address FIFO is full • Data read is handled as delay transactions. If the HOG_MODE bit is set in the PCI_CONTROL register, the bus will be held for 16 PCI clocks before asserting retry. As an initiator, the core supports retry by maintaining an internal counter of the current address. On receiving a retry, the core de-asserts PciFrameN and then re-assert PciFrameN with the current address from the counter. 
9.2.9 PCI Disconnect

As a slave, the PCI Unit disconnects under the following conditions:
• Bursted PCI configuration cycles.
• Bursted accesses to PCI_CSR_BAR.
• PCI reads past the amount of data in the read FIFO.
• PCI burst cycles that cross a 1K PCI address boundary, which includes PCI burst cycles that cross from memory decodes belonging to the core as a target to decodes that are outside the core (e.g., a burst that starts inside a BAR and ends outside of that BAR).

As an initiator, the core supports retry and disconnect by maintaining an internal counter of the current address. On receiving a retry or disconnect, the core de-asserts PciFrameN and then re-asserts PciFrameN with the current address plus the current transfer byte size from the counter.

9.2.10 PCI Built-In System Test

The IXP2800 Network Processor supports BIST when there is an external PCI host. The PCI host sets the STRT bit in the PCI_CACHE_LAT_HDR_BIST configuration register. An interrupt is generated to the Intel XScale® core if it is enabled by the Intel XScale® core Interrupt Enable register. The Intel XScale® core software can respond to the interrupt by running an application-specific test. Upon successful completion of the test, the Intel XScale® core resets the STRT bit. If this bit is not reset two seconds after the PCI host sets the STRT bit, the host indicates that the Network Processor failed the test.

9.2.11 PCI Central Functions

The CFG_RSTDIR pin is active high for enabling the PCI Unit central function. The CFG_PCI_ARB(GPIO[2]) pin is the strap pin for the internal arbiter; when this strap pin is high during reset, the PCI Unit owns the arbitration. The CFG_PCI_BOOT_HOST(GPIO[1]) pin is the strap pin for the PCI host; when CFG_PCI_BOOT_HOST is asserted during reset, the PCI Unit operates as the PCI host.

Table 122. Legal Combinations of the Strap Pin Options
Support / CFG_PCI_BOOT_HOST (GPIO[1]) / CFG_PCI_ARB (GPIO[2]) / CFG_PCI_RSTDIR (Central function) / CFG_PROM_BOOT (GPIO[0])
OK 0 0 0 0
OK 0 0 0 1
OK 0 0 1 1
Not supported 0 1 0 x
OK 0 1 1 1
Not supported 1 0 0 x
OK 1 0 1 1
Not supported 1 1 0 x
OK 1 1 1 1
Note:
* CFG_PCI_RSTDIR = 1 selects the central function.
* CFG_PCI_BOOT_HOST requires the central function.
* CFG_PCI_ARB requires the central function.

9.2.11.1 PCI Interrupt Inputs

The PCI Unit supports two interrupt lines from the PCI Bus as host. One of the interrupt lines is an open-drain output and input. The other interrupt line is selected as a PCI interrupt input. Both interrupt lines can be enabled in the Intel XScale® core Interrupt Enable register.

9.2.11.2 PCI Reset Output

If the IXP2800 Network Processor is the central function (CFG_RSTDIR = 1), the PCI Unit asserts PCI_RST_L after system power-on. The Intel XScale® core has to write the PCI External Reset bit in the IXP2800 Network Processor's Reset register to de-assert PCI_RST_L. In this case, the chip reset (CLK_NRESET) is driven by a signal other than PCI_RST_L. When the PCI Unit is not configured as the central function (CFG_RSTDIR = 0), PCI_RST_L is used as a chip reset input.

9.2.11.3 PCI Internal Arbiter

The PCI unit contains a PCI bus arbiter that supports two external masters in addition to the PCI Unit's initiator interface. To enable the PCI arbiter, the CFG_PCI_ARB(GPIO[2]) strapping pin must be 1 during reset. As shown in Figure 120, the local bus request and grant pair are not externally visible.
These signals will be made available to external debug pins for debug purpose. Figure 120. PCI Arbiter Configuration Using CFG_PCI_ARB(GPIO[2]) Pin CFG_PCI_ARB(GPIO[2]) = 0 (during reset) CFG_PCI_ARB(GPIO[2]) = 1(during reset) GNT_L[0] PCI Bus Grant Input to IXP2800 Network Processor PCI Bus Grant Output to Master 1 GNT_L[1] Not Used, Float PCI Bus Grant Output to Master 2 REQ_L[0] PCI Bus Request Output from IXP2800 Network Processor PCI Bus Request Input from Master 1 REQ_L[1] Not Used, Tied High PCI Bus Request Input from Master 1 PCI UNIT PCI Arbiter GNT_L[1] REQ_L[2:0] GNT_L[0] GNT_L[2:0] 1 PCI Master State Machine 0 CFG_PCI_ARB GNT_L REQ_L REQ_L[0] REQ_L[1] A9767-03 The arbiter uses a simple round-robin priority algorithm, The arbiter asserts the grant signal corresponding to the next request in the round-robin during the current executing transaction on the PCI bus (this is also called hidden arbitration). If the arbiter detects that an initiator has failed to assert frame_l after 16 cycles of both grant assertion and PCI bus idle condition, the arbiter deasserts the grant. That master does not receive any more grants until it de-asserts its request for at least one PCI clock cycle. Bus parking is implemented in that the last bus grant will stay asserted if no request is pending. To prevent bus contention, if the PCI bus is idle, the arbiter never asserts one grant signal in the same PCI cycle in which it de-asserts another. It de-asserts one grant, and then asserts the next grant after one full PCI clock cycle has elapsed to provide for bus driver turnaround. Hardware Reference Manual 331 Intel® IXP2800 Network Processor PCI Unit 9.3 Slave Interface Block The slave interface logic supports internal slave devices interfacing to the target port of the FBus. • CSR — register access cycles to local CSRs. • DRAM — memory access cycles to the DRAM push/pull Bus. • SRAM — memory access cycles to the SRAM push/pull Bus. The slave port of the FBus is connected to a 64-byte write buffer to support bursts of up to 64 bytes to the memory interfaces. The slave read data are directly downloaded into the FBus read FIFO. See Table 123. Table 123. Slave Interface Buffer Sizes Location Slave Address Slave Write Slave Read Buffer Depth 1 64 Byte 0 Usage CSR, SRAM, DRAM SRAM, DRAM None As a push/pull command bus master, the PCI Unit translates these accesses into different types of push/pull command. As the push/pull data bus target, the write data is sent through the pull data bus and the read data is received on the push data bus. 9.3.1 CSR Interface The internal Control and Status registers data is directed to or from the Slave FIFO port of the PCI core FBus when the BAR id matches PCI_CSR_BAR (BAR0). The CSR accesses from the PCI Bus directed towards CSRs not in PCI Unit is translated into a push/pull CSR type command. PCI local CSRs are handled within the PCI Unit. For writes, the data is sent when the pull bus is valid and the ID matches. The address is unloaded from the FBus target address FIFO as indication to the PCI core logic that the cycle is completed. The slave write buffer is not used for CSR access. For reads, the data is loaded into the target receive FIFO as soon as the push bus is valid and the ID matches. The address is unloaded from the FBus address FIFO. Note: Target reads to the Scratch unit must always be in multiples of 32-bit (PCI_CBE_L[3:0] =0x0) as the Scratch unit only supports 32-bit accesses. 
One example of a PCI host access to internal registers is the initialization of internal registers and memory to enable the Intel XScale® core to boot off the DRAM in the absence of a boot up PROM. The accesses to the CSRs inside the PCI Unit are completed internally without sending the transaction out to the push pull bus, just like the other internal register accesses. 332 Hardware Reference Manual Intel® IXP2800 Network Processor PCI Unit 9.3.2 SRAM Interface The SRAM interface connects the FBus to the internal push/pull command bus and the SRAM push/pull data buses. Request to memory is sent on the command bus. Data request is received as valid push/pull ID sent by the SRAM push/pull data bus. If the PCI_SRAM_BAR is used, the target state machine generates a request to the command bus for SRAM access. Once the grant is received, the address, then data is directed between the slave FIFOs of the PCI core and the SRAM push/pull bus. 9.3.2.1 SRAM Slave Writes The slave write buffer is used to support memory burst accesses. The buffer is added to guarantee data transfer for each clock and burst size can be determined before memory request is issued. Data is assembled in the buffers before being sent to memory for SRAM write. On the push/pull bus, AM access can start at any address and have length up to 16 Dwords as shown in Figure 121. For masked writes, only size 1 is supported to transfer up to four bytes. Figure 121. Example of Target Write to SRAM of 68 Bytes Memory Transfer Address Byte Enables Size 0011 2 bytes 1111 64 bytes 0x0 PCI Bus Byte Enables Internal Bus Data 0x8 00111111 Byte Lane 00000000 1111 00000000 1111 00000000 1111 00000000 1111 00000000 1111 00000000 1111 00000000 1111 00000000 1111 11111100 1111 1111 1111 1111 1111 1111 1111 Slave Write Burst to memory Starting address = 0x4 0x48 1100 2 bytes A9768-01 The slave interface also has to make sure there is enough data in the slave write buffer to complete the memory data transfer before making a memory request. Hardware Reference Manual 333 Intel® IXP2800 Network Processor PCI Unit 9.3.2.2 SRAM Slave Reads For a slave read from SRAM, a 32-bit DWORD is fetched from the memory for memory read command, one cache line is fetched for memory read line command, and two cache lines are read for memory read multiple command. Cache line size is programmable in the CACHE_LINE field of the PCI_CACHE_LAT_HDR_BIST configuration register. If the computed read size is greater than 64 bytes, the PCI SRAM read will default to the maximum of 64 bytes. No pre-fetch is supported in that the PCI Unit will not read beyond the computed read size. The PCI core resets the target read FIFO before issuing a memory read data request on FBus. The maximum size of SRAM data read is 64 bytes. The PCI core will disconnect at the 64-byte address boundary. 9.3.3 DRAM Interface The memory is accessed using the push/pull mechanism. Request to memory is sent on the command bus. If the PCI_DRAM_BAR is used, the target state machine generates a request to the command bus for DRAM access with the address in the slave address FIFO. Once the push/pull request is received. The data is directed between the Slave FIFOs of the PCI core and DRAM push/ pull bus. 9.3.3.1 DRAM Slave Writes The slave write buffer is used to support memory burst accesses. The buffer is added to guarantee data transfer for each clock and burst size can be determined before memory request is issued. Data is assembled in the buffers before being sent to memory for memory write. 
DRAM target write access is only required to be 8-byte address aligned and the address does not wrap around the 64-byte address boundary on a DRAM burst. Each 8-byte access that is a partial write to the memory, is treated as single write. Remaining writes of the 64-byte segment is written as one single burst. Transfers that cross a 64 -byte segment are split into separate transfers. Figure 123 splits the 68-byte transfers into two partial 8-byte transfers to address 06 and address 48 and one 56-byte burst transfer in the first 64-byte segment from address 08 to 38 and one 8-byte transfer to address 40. For write to DRAM on the push/pull bus, the burst must be broken down into address aligned smaller transfer sizes (see Figure 122). The Target interface also must make sure there is enough data in the target write buffer to complete the memory data transfer before making a memory request. 334 Hardware Reference Manual Intel® IXP2800 Network Processor PCI Unit Figure 122. Example of Target Write to DRAM of 68 Bytes Memory Transfer Address Byte Enables Size 0x0 PCI Bus Byte Enables 00111111 00000000 Internal Bus Data 00000011 1 64-bit double Dword 0x08 11111111 11111111 Byte Lane 11111111 00000000 11111111 00000000 11111111 00000000 11111111 00000000 11111111 00000000 11111111 00000000 00000000 11111100 6 64-bit double Dwords 0x48 11000000 1 64-bit double Dword Slave Write Burst to memory Starting address = 0x6 A9769-02 9.3.3.2 DRAM Slave Reads For target reads from IXP2800 Network Processor memory, the entire 64-byte block is fetched from DRAM. For target reads from IXP2800/IXP2850 Network Processor memory, the block size is 16 bytes. Depending on the address for the target request, extra data is discarded at the beginning until the target address is reached. Also, extra data is discarded at the end of the transfer also when the burst ends in the middle of a data block. No pre-fetch is supported for DRAM access. See Figure 123. Hardware Reference Manual 335 Intel® IXP2800 Network Processor PCI Unit Note: The IXP2800/IXP2850 always disconnects after transferring 16-bytes for DRAM target reads. The PCI core will also disconnect at a 64-byte address boundary. Figure 123. Example of Target Read from DRAM Using 64-Byte Burst Memory Transfer Address Byte Enables Size 11111111 16 Byte Internal Bus Data PCI Bus Address Byte Enables 0x00 0x0 00000000 00000000 11111111 0x10 Disconnect 11111111 11111111 11111111 0x10 Byte Lane Swap 00000000 00000000 11111111 11111111 11111111 Slave Read Burst from memory Starting address = 0x0 Transfer Size - 32 bytes A9770-02 The PCI core resets the read FIFO before issuing a memory read data request on FBus. The PCI core will disconnect at the 64-byte address boundary. 9.3.4 Mailbox and Doorbell Registers Mailbox and Doorbell registers provide hardware support for communication between the Intel XScale® core and a device on the PCI Bus. Four mailbox registers are provided so that messages can be passed between the Intel XScale® core and a PCI device. All four registers are 32 bits and can be read and written with byte resolution from both the Intel XScale® core and PCI. How the registers are used is application dependent and the messages are not used internally by the PCI Unit in any way. The mailbox registers are often used with the Doorbell interrupts. Doorbell interrupts provide an efficient method of generating an interrupt as well as encoding the purpose of the interrupt. 
The PCI Unit supports an Intel XScale® core Doorbell register that is used by a PCI device to generate an Intel XScale® core FIQ and a separate PCI Doorbell register that is used by the Intel XScale® core to generate a PCI interrupt. A source generating the Doorbell interrupt can write a software defined bitmap to the register to indicate a specific purpose. This bitmap is translated into a single interrupt signal to the destination (either a PCI interrupt or a IXP2800 Network Processor interrupt). When an interrupt is received, the Doorbell registers can be read and the bit mask can be interpreted. If a larger bit mask is required than that is provided by the Doorbell register, the Mailbox registers can be used to pass up to four 32-bit blocks of data. 336 Hardware Reference Manual Intel® IXP2800 Network Processor PCI Unit The doorbell interrupts are controlled through the registers shown in Table 124. Table 124. Doorbell Interrupt Registers Register Name Description Intel XScale® core Doorbell Used to generate the Intel XScale® core Doorbell interrupts. Intel XScale® core Doorbell Setup Used to initialize the Intel XScale® core Doorbell register and for diagnostics. PCI Doorbell Used to generate the PCI Doorbell interrupts. PCI Doorbell Setup Used to initialize the PCI Doorbell register and for diagnostics. The Intel XScale® core and PCI devices write to the corresponding DOORBELL register to generate up to 32 doorbell interrupts. Each bit in the DOORBELL register is implemented as an SR flip-flop. The Intel XScale® core writes a 1 to set the flip-flop and the PCI device writes a 1 to clear the flip-flop. Writing a 0 has no effect on the registers. The PCI interrupt signal is the output of an NOR functions of all the PCI DOORBELL register bits (outputs of the SR flip-flops). The Intel XScale® core interrupt signal is the output of an NAND function of all the Intel XScale® core DOORBELL register bits (outputs of the SR flip-flops). To assert an interrupt (i.e., to “push a doorbell”): • A write of 1 to the corresponding bit of the DOORBELL register generates an interrupt. This is the case for either PCI device or the Intel XScale® core, since writing 1 changes the doorbell bit to the proper asserted state (i.e., 0 for an Intel XScale® core interrupt and 1 for a PCI interrupt). To dismiss an interrupt: • A write of 1 to the corresponding bit of the DOORBELL register clears an interrupt. This is the case for either PCI device or the Intel XScale® core, since writing 1 changes the doorbell bit to the proper de-asserted state (i.e., 1 for an Intel XScale® core interrupt and 0 for a PCI interrupt). Figure 124 and Figure 125 illustrate how a Doorbell interrupt is asserted and cleared by both the Intel XScale® core and a PCI device. Figure 124. Generation of the Doorbell Interrupts to PCI PCI_INT# Q 1. Write 1 to set bit and Generate a PCI interrupt. S DOORBELL R Register D 2. PCI Reads PCI_DOORBELL to determine the Mailbox interrupt (e.g., reads 0x8000 0300). 3. PCI Writes back read value to clear interrupt. (e.g., write 0x8000 03000). A9771-01 Hardware Reference Manual 337 Intel® IXP2800 Network Processor PCI Unit Figure 125. Generation of the Doorbell Interrupts to the Intel XScale® Core FIQ or IRQ 2. Intel XScale® Core Reads XSCALE_DOORBELL to determine the Doorbell interrupt (e.g.; reads 0x0030 F2F1). Q R Intel XScale® Core DOORBELL Register D S 1. PCI device write 1 to clear bit and generate a FIZ/IRQ. 3. 
Intel XScale® Core inverts the read value and write back the results to clear interrupt (e.g., write 0x0030 F2F1 ^ 0xFFFF FFFF = 0xFFCF 0C0E). A9772-02 The Doorbell Setup register allows the Intel XScale® core and a PCI device to perform two functions that are not possible using the Doorbell register. This register is used during setup and diagnostics and is not used during normal operations. First, it allows the Intel XScale® core and PCI device to clear an interrupt that it has generated to the other device. If the Intel XScale® core sets an interrupt to PCI device using the Doorbell register, the PCI device is the only one that can use the Doorbell register to clear the interrupt by writing one. With the Doorbell setup register, the Intel XScale® core can clear the interrupt by write 0 to it. Second, it allows the Intel XScale® core and PCI device to generate a doorbell interrupt to itself. This can be used for diagnostic testing. Each bit in the Doorbell Setup register is mapped directly to the data input of the Doorbell register such that the data is directly written into the Doorbell register. During system initialization, the doorbell registers must be initialized by clearing the interrupt bits in the Doorbell register using the Doorbell Setup register. This is done by writing zeros to the PCI Doorbell setup register and ones to the Intel XScale® core Doorbell setup register. 338 Hardware Reference Manual Intel® IXP2800 Network Processor PCI Unit 9.3.5 PCI Interrupt Pin An external PCI interrupt can be generated in the following way: The Intel XScale® core initiates a Doorbell interrupt XSCALE_INT_ENABLE. • • • • One or more of the DMA channels have completed the DMA transfers. The PNI bit is cleared by the Intel XScale® core to generate a PCI interrupt An internal functional unit generates either an interrupt or an error directly to the PCI host. Table 125 describes how IRQ are generated for each silicon stepping. Table 125. IRQ Interrupt Options by Stepping Stepping Description A stepping IRQ interrupts can be handled only by the Intel XScale® core. B Stepping IRQ interrupts can be handled by either the Intel XScale® core or a PCI host. Refer to the description of the PCI_OUT_INT_MASK and PCI_OUT_INT_STATUS registers in the Intel® IXP2400 and IXP2800 Network Processor Programmer’s Reference Manual. Figure 126 shows how PCI interrupts are managed via the PCI and the Intel XScale® core. Figure 126. PCI Interrupts Intel XScale® Core writes PNI bit to set PCI interrupt PCI_CONTROL Other interrupts Read Intel XScale® Core PCI interrupt to determine interrupt source {FIQ,IRQ} RAW_INT_STATUS Enable PCI interrupt from all doorbell bits PCI_OUT_INT_MASK Intel XScale® Core sets Doorbells bits to generate an interrupt to the PCI Tells whether or not PCI interrupt was from Doorbell PCI_INTA# PCI_OUT_INT_STATUS PCI_DOORBELL {FIQ,IRQ} _INT_ENABLE Intel XScale® Core interrupt Registers accessed by Intel® XScale® Core DMA Channels (done) PCI sets Doorbells bits to generate an interrupt to the Intel XScale® Core XSCALE_DOORBELL Bitwise AND XSCALE_INT_ENABLE XSCALE_INT_STATUS PCI_INTB# A9773-02 Hardware Reference Manual 339 Intel® IXP2800 Network Processor PCI Unit 9.4 Master Interface Block The Master Interface consists of the DMA engine and the Push/pull target interface. Both can generate initiator PCI transactions. 9.4.1 DMA Interface There are two DMA channels, each of which can move blocks of data from DRAM to the PCI or from the PCI to DRAM. 
The DMA channels read parameters from a list of descriptors in SRAM, perform the data movement to or from DRAM, and stop when the list is exhausted. The descriptors are loaded from predefined SRAM entries or may be set directly by CSR writes to DMA registers. There is no restriction on byte alignment of the source address or the destination address. For PCI to DRAM transfers, the PCI command is Memory Read, Memory Read line, or Memory Read Multiple. For DRAM to PCI transfers, the PCI command is Memory Write. Memory Write Invalidate is not supported. DMA reads are unmasked reads (all byte enables asserted) from DRAM. After each transfer, the byte count is decremented by the number of bytes read, and the source address is incremental by one 64-bit double Dword. The whole data block is fetched from the DRAM. For a system using RDRAM (like the IXP2800 Network Processor), the block size is 16 bytes. DMA reads are masked reads from the PCI and writes are masked for both the PCI and DRAM. When moving a block of data, the internal hardware adjusts the byte enables so that the data is aligned properly on block boundaries and that only the correct bytes are transferred if the initial and final data requires masking. For DMA data, the DMA FIFO consists of two separate FBus initiator read FIFOs and two initiator write FIFOs, which are inside the PCI Core and three DMA buffers (corresponding to the DMA channels), which buffer data to and from the DRAM. Since there is no simultaneous DMA read and write outstanding, one shared 64-byte buffer is used for both read and write DRAM data Up to two DMA channels are running at a time with three descriptors outstanding. The two DMA channels and the direct access channel to PCI Bus from Command Bus Master are contending to use the address, read and write FIFOs inside the Core. Effectively, the active channels interleave bursts to or from the PCI Bus. Each channel is required to arbitrate for the PCI FIFOs after each PCI burst request. 340 Hardware Reference Manual Intel® IXP2800 Network Processor PCI Unit 9.4.1.1 Allocation of the DMA Channels Static allocation are employed such that the DMA resources are controlled exclusively by a single device for each channel. The Intel XScale® core, a Microengine and the external PCI host can access the two DMA channels. The first two channels can function in one of the following modes, as determined by the DMA_INF_MODE register: • • • • The Intel XScale® core owns both DMA channel 1 and channel 2. The Microengines owns both DMA channel 1 and channel 2. PCI host owns both DMA channel 1 and channel 2. The Intel XScale® core owns both DMA channel 1 and channel 2. The third channel can be allocated to either the Intel XScale® core, PCI host, or Microengines. The DMA mode can be changed only by the Intel XScale® core under software control. The software should signal to suspend DMA transactions and wait until all DMA channels are free before changing the mode. Software should determine when all DMA channels are free either by polling XSCALE_INT_STATUS register bits DMA1 and DMA3 until both DMA channels are done. 9.4.1.2 Special Registers for Microengine Channels Interrupts are generated at the end of DMA operation for the Intel XScale® core and PCI-initiated DMA. However, the Microengine does not provide the interrupt mechanism. The PCI Unit will instead use an “Auto-Push” mechanism to signal the particular Microengine on completion of DMA. 
When the Microengine sets up the DMA channel, it would also write the CHAN_X_ME_PARAM with Microengine number, Context number, Register number, and Signal number. When the DMA channel completes, it writes some status information (Error or OK status) to the Microengine/ Context/Register/Signal. PCI Unit will arbitrate for the SRAM Push bus. The Push ID is from the parameters in the register. The ME_PUSH_STATUS reflects the DMA Done bit in each of the CHAN_X_CONTROL registers. The Auto-Push operation will proceed after the DMA is done for the particular DMA channel if the corresponding enable bit in the ME_PUSH_ENABLE is set. Hardware Reference Manual 341 Intel® IXP2800 Network Processor PCI Unit 9.4.1.3 DMA Descriptor Each descriptor occupies four 32-bit Dwords and is aligned on a 16-byte boundary. The DMA channels read the descriptors from local SRAM into the four DMA working registers once the control register has been set to initiate the transaction. This control must be set explicitly. This starts the DMA transfer. The register names for the DMA channels are listed in Figure 127. Figure 127. DMA Descriptor Reads Local SRAM Last Descriptor 4 Next Descriptor 3 Prior Descriptor Current Descriptor 1 2 Working Register DMA Channel Register Byte Count Register PCI Address Register DRAM Address Register Descriptor Pointer Register Channel Register Name (X can be 1, 2, or 3) CHAN_X_BYTE_COUNT CHAN_X_PCI_ADDR CHAN_X_DRAM_ADDR CHAN_S_DESC_PTR Control Register DMA Channel Register Control Register Channel Register Name (X can be 1, 2, or 3) CHAN_X_CONTROL A9774-01 After a descriptor is processed, the next descriptor is loaded in the working registers. This process repeats until the chain of descriptors is terminated (i.e., the End of Chain bit is set). See Table 126. Table 126. DMA Descriptor Format 342 Offset from Descriptor Pointer Description 0x0 Byte Count 0x4 PCI Address 0x8 DRAM Address 0xC Next Descriptor Address Hardware Reference Manual Intel® IXP2800 Network Processor PCI Unit 9.4.1.4 DMA Channel Operation Since a PCI device, Microengine, or the Intel XScale® core can access the internal CSRs and memory in a similar way, the DMA channel operation description that follows will apply to all channels. CHAN_1_, CHAN_2_, or CHAN_3_ can be placed before the name for the DMA registers. The DMA channel owner can either set up the descriptors in SRAM or it can write the first descriptor directly to the DMA channel registers. When descriptors and the descriptor list are in SRAM, the procedure is as follows: 1. The DMA channel owner writes the address of the first descriptor into the DMA Channel Descriptor Pointer register (DESC_PTR). 2. The DMA channel owner writes the DMA Channel Control register (CONTROL) with miscellaneous control information and also sets the channel enable bit (bit 0). The channel initial descriptor bit (bit 4) in the CONTROL register must also be cleared to indicate that the first descriptor is in SRAM. 3. Depending on the DMA channel number, the DMA channel reads the descriptor block into the corresponding DMA registers, BYTE_COUNT, PCI_ADDR, DRAM_ADDR, and DESC_PTR. 4. The DMA channel transfers the data until the byte count is exhausted, and then sets the channel transfer done (bit 2) in the CONTROL register. 5. If the end of chain bit (bit 31) in the BYTE_COUNT register is clear, the channel checks the Chain Pointer value. If the Chain Pointer value is not equal to 0. it reads the next descriptor and transfers the data (step 3 and 4 above). 
IF the Chain Pointer value is equal to 0, it waits for the Descriptor Added bit of the Channel Control register to be set before reading the next descriptor and transfers the data (step 3 and 4 above). If bit 31 is set, the channel sets the channel chain done bit (bit 7) in the CONTROL register and then stops. 6. Proceed to the Channel End Operation. (See Section 9.4.1.5.) When single descriptors are written directly into the DMA channel registers, the procedure is as follows: 1. The DMA channel owner writes the descriptor values directly into the DMA channel registers. The end of chain bit (bit 31) in the BYTE_COUNT register must be set, and the value in the DESC_PTR register is not used. 2. The DMA channel owner writes the base address of the DMA transfer into the PCI_ADDR to specify the PCI starting address. 3. When the first descriptor is in the BYTE_COUNT register, the DRAM_ADDR register must be written with the address of the data to be moved. 4. The DMA channel owner writes the CONTROL register with miscellaneous control information, along with setting the channel enable bit (bit 0). The channel initial descriptor in register bit (bit 4) in the CONTROL register must also be set to indicate that the first descriptor is already in the channel descriptor registers. 5. The DMA channel transfers the data until the byte count is exhausted, and then sets the channel transfer done bit (bit 2) in the CONTROL register. 6. Since the end of the chain bit (bit 31) in the BYTE_CONT register is set, the channel sets the channel chain done bit (bit 7) in the CONTROL register and then stops. 7. Proceed to the Channel End Operation. (See Section 9.4.1.5.) Hardware Reference Manual 343 Intel® IXP2800 Network Processor PCI Unit 9.4.1.5 DMA Channel End Operation 1. Channel owned by PCI: If not masked via the PCI Outbound Interrupt Mask register, the DMA channel interrupts the PCI host after the setting of the DMA done bit in the CHAN_X_CONTROL register, which is readable in the PCI Outbound Interrupt Status register. 2. Channel owned by the Intel XScale® core: If enabled via the Intel XScale® core Interrupt Enable registers, the DMA channel interrupts the Intel XScale® core by setting the DMA channel done bit in the CHAN_X_CONTROL register, which is readable in the Intel XScale® core Interrupt Status register. 3. Channel owned by Microengine: If enabled via the Microengine Auto-Push Enable registers, the DMA channel signals the Microengine after setting the DMA channel done bit in the CHAN_X_CONTROL register, which is readable in the Microengine Auto-Push Status register. 9.4.1.6 Adding Descriptor to an Unterminated Chain It is possible to add a descriptor to a chain while a channel is running. To do so the chain should be left un-terminated, that is the last descriptor should have End of Chain clear, and the Chain Pointer value equal to 0. A new descriptor (descriptors) can be added to the chain by overwriting the Chain Pointer value of the un-terminated descriptor (in SRAM) with the Local Memory address of the (first) added descriptor (Note that the added descriptor must actually be valid in Local Memory prior to that). After updating the Chain Pointer field, the software must write a 1 to the Descriptor Added bit of the Channel Control register. This is necessary for the case where the channel was paused to reactivate the channel. 
However, software need not check the state of the channel before writing that bit; there is no side-effect of writing that bit in the case where the channel had not yet read the unlinked descriptor. If the channel was paused or had read an unlinked Pointer, it will re-read the last descriptor processed (i.e., the one that originally had the 0 value for Chain Pointer) to get the address of the newly added descriptor. A descriptor cannot be added to a descriptor that has End of Chain set. 9.4.1.7 DRAM to PCI Transfer For a DRAM-to-PCI transfer, the DMA channel reads data from DRAM and places it into the DMA buffer for transfer to the FBus FIFO when the following conditions are met: • There is at least free space for a read block in the buffer. • The DRAM controller issues data valid on DRAM push data bus to the DMA engine. • DMA transfer is not done. Before data is stored into the DMA buffer, the DRAM starting address is evaluated. Extra data will be discarded in case the DRAM starting address does not start at aligned addresses. The lower address bits determine the byte enables for the first data double Dword. At the end of the DMA transfer, extra data will be discarded and byte enables are calculated for the last 64-bit double Dword. After the data is loaded into the buffer, the PCI starting address is evaluated and the buffer is shifted byte wise to align the starting DRAM data with the starting PCI starting address. 344 Hardware Reference Manual Intel® IXP2800 Network Processor PCI Unit A 64-bit double Dword with byte enables is pushed into the FBus FIFO from the DMA buffers as soon as there is data available in the buffer and there is space in the FBus FIFO. The Core logic will transfer the exact number of bytes to the PCI Bus. The maximum burst size on the PCI bus varies according to the stepping and is described in Table 127 Table 127. PCI Maximum Burst Size Stepping A Stepping Description The maximum burst size is 64 bytes. The maximum burst size can be greater than 64 bytes for certain operations. The register PCI_IXP_PARAM configures the burst length for target write operations. B Stepping The register CHAN_#_CONTROL configures the burst length for DMA read and write operations. The register PCI_CONTROL configures the atomic feature for target write operations of 64 bytes or fewer. Note: Bursts longer than 64 bytes are not supported for PCI target read operations. 9.4.1.8 PCI to DRAM Transfer The DMA channel issues a sequence of PCI read request commands through the FBus address FIFO to read the precise byte count from PCI. The DMA engine will continue to load the DMA write buffer with FBus FIFO data as soon as data is available. The DMA engine determines the largest size of memory request possible with the current DRAM address and remaining byte count. It also has to make sure there is enough data in the write buffer before sending the memory request. 9.4.2 Push/Pull Command Bus Target Interface Through the command bus target interface, the command bus masters (PCI, Intel XScale® core, and Microengines) can access the PCI Unit internal registers including the local PCI configuration registers and the local PCI Unit CSRs. Also, the Microengine and the Intel XScale® core can issue transactions on the PCI bus. The requests are generated from the command master to the command bus arbiter. The arbiter selects a master and sends it a grant. That master then sends a command, which is passed through by the arbiter. 
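Referring back to the alignment behavior described in Section 9.4.1.7, the sketch below shows how the first- and last-quadword byte enables can be derived from an arbitrary DRAM start address and byte count. The lane numbering (lane 0 = lowest-addressed byte) and the helper name are illustrative assumptions; only the discard-and-enable behavior itself comes from the text above.

```c
#include <stdint.h>

/* Derive the byte enables for the first and last 64-bit Dwords of a transfer:
 * bytes below the DRAM start address and past the end of the transfer are
 * discarded, so their lanes are disabled. Bit i of each mask corresponds to
 * byte lane i (1 = byte valid). */
static void dma_edge_byte_enables(uint64_t dram_addr, uint32_t byte_count,
                                  uint8_t *first_be, uint8_t *last_be)
{
    uint32_t head = (uint32_t)(dram_addr & 0x7);   /* offset into first quadword   */
    uint64_t end  = dram_addr + byte_count;        /* first byte past the transfer */
    uint32_t tail = (uint32_t)(end & 0x7);         /* valid bytes in last quadword */

    *first_be = (uint8_t)(0xFFu << head);                        /* drop leading bytes  */
    *last_be  = tail ? (uint8_t)(0xFFu >> (8 - tail)) : 0xFFu;   /* drop trailing bytes */

    /* If the whole transfer fits in a single quadword, both masks describe
     * the same Dword, so combine them. */
    if ((dram_addr & ~0x7ull) == ((end - 1) & ~0x7ull)) {
        *first_be = (uint8_t)(*first_be & *last_be);
        *last_be  = *first_be;
    }
}
```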
PCI Unit will issue the push and pull data responses to the SRAM push/pull data buses. When the read command is received, the PCI Unit will issue the push data request on the SRAM push data bus. When the write command is received, PCI Unit will issue the pull command on the SRAM pull data bus. 9.4.2.1 Command Bus Master Access to Local Configuration Registers The configuration register within the PCI unit can be accessed by push/pull command bus access to configuration space through the FBus interface of the PCI core. When the IXP2800 Network Processor is a PCI host, these registers have to be accessed through this internal path and no PCI bus cycle will be generated. Hardware Reference Manual 345 Intel® IXP2800 Network Processor PCI Unit 9.4.2.2 Command Bus Master Access to Local Control and Status Registers These are CSRs within the PCI Unit that are accessible from push/pull bus masters. The masters include the Intel XScale® core, Microengines. There is no PCI bus cycles generated. The CSRs within the PCI Unit can be accessed internally by external PCI devices. 9.4.2.3 Command Bus Master Direct Access to PCI Bus The Intel XScale® core and Microengines are the only command bus masters that have direct access to the PCI bus as a PCI Bus initiator. The PCI Bus can be accessed by push/pull command bus access to PCI bus address space. The PCI Unit will share the internal SRAM push/pull data bus with SRAM for the data transfers. Data from the SRAM push/pull data bus is transferred through the master data port of the FBus interface of the PCI core. The PCI Core handles all of the PCI Bus protocol handshakes. The SRAM pull data received for a write command will be transferred to the Master write FIFO for PCI writes. For PCI reads, data is transferred from the read FIFO to the SRAM push data bus. A 32byte Direct buffer is used to support up to 32 bytes of data responses to the direct access to PCI Bus. The Command Bus Master access to the PCI bus will require internal arbitration to gain access to the data FIFOs inside the core, which are shared between the DMA engine and direct access to PCI. 9.4.2.3.1 PCI Address Generation for IO and MEM Cycles When the push/pull command bus master is accessing the PCI Bus, the PCI address is generated based on the PCI address extension register (PCI_ADDR_EXT). Figure 128 shows how the address is generated from a Command Bus Master address. Figure 128. PCI Address Generation for Command Bus Master to PCI 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 PIOADD 8 PMSA 7 6 5 4 3 2 1 PCI Address for PCI Memory Accesses RES 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 PMSA PIOADD 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 PIOADD 0 8 7 6 RES 5 4 3 Intel XScale® Core Address[15:2] 2 1 5 4 3 2 1 0 PCI Extension Register 0 00 PCI Address for PCI I/O Accesses A9775-02 346 Hardware Reference Manual Intel® IXP2800 Network Processor PCI Unit 9.4.2.3.2 PCI Address Generation for Configuration Cycles When a push/pull command bus master is accessing the PCI Bus to generate a configuration cycle, the PCI address is generated based on the a Command Bus Master address as shown in Table 128 and Figure 129: Table 128. Command Bus Master Configuration Transactions Cycle Result Type 1 Configuration Cycle Command Bus address bits [31:24] are equal to 0xDA Type 0 Configuration Cycle Command Bus address bits [31:24] are equal to 0xDB. Figure 129. 
PCI Address Generation for Command Bus Master to PCI Configuration Cycle 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 0000 0000 8 7 6 5 4 Intel XScale® Core Address[23:2] 3 2 1 0 00 A9776-02 9.4.2.3.3 PCI Address Generation for Special and IACK Cycles The PCI address is undefined for special and IACK PCI cycles. 9.4.2.3.4 PCI Enables The PCI byte-enables are generated based on the Command Bus Master instruction, and the PCI unit does not change the states of the enables. 9.4.2.3.5 PCI Command The PCI command is derived from the Command Bus Master address space map. The different spaces supported are listed in Table 129: Table 129. Command Bus Master Address Space Map to PCI PCI Command Intel XScale® Core Address Space PCI Memory 0xE000 0000 – 0xFFFF FFFF Local CSR 0xDF00 0000 – 0xDFFF FFFF Local Configuration Register 0xDE00 0000 – 0xDEFF FFFF PCI Special Cycle/PCI IACK Read 0xDC00 0000 – 0xDDFF FFFF PCI Type 1 Configuration Cycle 0xDB00 0000 – 0xDBFF FFFF PCI Type 0 Configuration Cycle 0xDA00 0000 – 0xDAFF FFFF PCI I/O 0xD800 0000 – 0xD8FF FFFF Hardware Reference Manual 347 Intel® IXP2800 Network Processor PCI Unit 9.5 PCI Unit Error Behavior 9.5.1 PCI Target Error Behavior 9.5.1.1 Target Access Has an Address Parity Error 1. If PCI_CMD_STAT[PERR_RESP] is not set, PCI Unit will ignore the parity error. 2. If PCI_CMD_STAT[PERR_RESP] is set: a. PCI core will not claim the cycle regardless of internal device select signal. b. PCI core will let the cycle terminate with master abort. c. PCI core will not assert PCI_SERR_L. d. Slave Interface sets PCI_CONTROL[TGT_ADR_ERR], which will interrupt the Intel XScale® core if enabled. 9.5.1.2 Initiator Asserts PCI_PERR_L in Response to One of Our Data Phases 1. Core does nothing. 2. Responsibility lies with the initiator to discard data, report this to the system, etc. 9.5.1.3 Discard Timer Expires on a Target Read 1. PCI unit discards the read data. 2. PCI Unit invalidates the delayed read address 3. PCI Unit sets Discard Timer Expired bit (DTX) in the PCI_CONTROL. 4. If enabled (XSCALE_INT_ENABLE [DTE]), the PCI unit interrupts the Intel XScale® core. 9.5.1.4 Note: Target Access to the PCI_CSR_BAR Space Has Illegal Byte Enables The acceptable byte enables are: 1. PCI local CSRs - PCI_BE[3:0] = 0x0or 0xF. 2. CSRs not in the PCI Unit - PCI_BE[3:0] = 0x0, 0xE, 0xD, 0xB, 0x7, 0xC, 0x3, or 0xF. When byte-enables are detected, the hardware asserts the following error conditions: 1. Slave Interface will set PCI_CONTROL[TGT_CSR_BE]. 2. Slave Interface will issue target abort for target read and drop the transaction for target write. 348 Hardware Reference Manual Intel® IXP2800 Network Processor PCI Unit 9.5.1.5 Target Write Access Receives Bad Parity PCI_PAR with the Data 1. If PCI_CMD_STAT[PERR_RESP] is not set, PCI Unit will ignore the parity error. 2. If PCI_CMD_STAT[PERR_RESP] is set: a. core asserts PCI_PERR_L and sets PCI_CMD_STAT[PERR]. b. Slave Interface sets PCI_CONTROL[TGT_WR_PAR], which will interrupt the Intel XScale® core if enabled. c. Data is discarded. 9.5.1.6 SRAM Responds with a Memory Error on One or More Data Phases on a Target Read 1. Slave Interface sets PCI_CONTROL[TGT_SRAM_ERR], which will interrupt the Intel XScale® core if enabled. 2. Assert PCI Target Abort at or before the data in question is driven on PCI. 9.5.1.7 DRAM Responds with a Memory Error on One or More Data Phases on a Target Read 1. Slave Interface sets PCI_CONTROL[TGT_DRAM_ERR], which will interrupt the Intel XScale® core if enabled. 2. 
Slave Interface asserts PCI Target Abort at or before the data in question is driven on PCI. 9.5.2 As a PCI Initiator During a DMA Transfer 9.5.2.1 DMA Read from DRAM (Memory-to-PCI Transaction) Gets a Memory Error 1. Set PCI_CONTROL[DMA_DRAM_ERR] which will interrupt the Intel XScale® core if enabled. 2. Master Interface terminates transaction before bad data is transferred (okay to terminate earlier). 3. Master Interface clears the Channel Enable bit in CHAN_X_CONTROL. 4. Master Interface sets DMA channel error bit in CHAN_X_CONTROL. 5. Master Interface does not reset the DMA CSRs; This leaves the descriptor pointer pointing to the DMA descriptor of the failed transfer. 6. Master Interface resets the state machines and DMA buffers. Hardware Reference Manual 349 Intel® IXP2800 Network Processor PCI Unit 9.5.2.2 DMA Read from SRAM (Descriptor Read) Gets a Memory Error 1. Set PCI_CONTROL[DMA_SRAM_ERR] which will interrupt the Intel XScale® core if enabled. 2. Master Interface clears the Channel Enable bit in CHAN_X_CONTROL. 3. Master Interface sets DMA channel error bit in CHAN_X_CONTROL. 4. Master Interface does not reset the DMA CSRs; This leaves the descriptor pointer pointing to the DMA descriptor of the failed transfer. 5. Master Interface resets the state machines and DMA buffers. 9.5.2.3 DMA from DRAM Transfer (Write to PCI) Receives PCI_PERR_L on PCI Bus 1. If PCI_CMD_STAT[PERR_RESP] is not set, PCI Unit will ignore the parity error. 2. If PCI_CMD_STAT[PERR_RESP] is set: a. Master Interface sets PCI_CONTROL[DPE] which will interrupt the Intel XScale® core if enabled. b. Master Interface clears the Channel Enable bit in CHAN_X_CONTROL. c. Master Interface sets DMA channel error bit in CHAN_X_CONTROL. d. Master Interface does not reset the DMA CSRs; This leaves the descriptor pointer pointing to the DMA descriptor of the failed transfer. e. Master Interface resets the state machines and DMA buffers. f. Core sets PCI_CMD_STAT[PERR] if properly enabled. 9.5.2.4 DMA To DRAM (Read from PCI) Has Bad Data Parity 1. If PCI_CMD_STAT[PERR_RESP] is not set, PCI Unit will ignore the parity error. 2. If PCI_CMD_STAT[PERR_RESP] is set: a. Core asserts PCI_PERR_L on PCI if PCI_CMD_STAT[PERR_RESP] is set. b. Master Interface sets PCI_CONTROL[DPED] which can interrupt the Intel XScale® core if enabled. c. Master Interface clears the Channel Enable bit in CHAN_X_CONTROL. d. Master Interface sets DMA channel error bit in CHAN_X_CONTROL. e. Master Interface does not reset the DMA CSRs; This leaves the descriptor pointer pointing to the DMA descriptor of the failed transfer. f. Master Interface resets the state machines and DMA buffers. 350 Hardware Reference Manual Intel® IXP2800 Network Processor PCI Unit 9.5.2.5 Note: DMA Transfer Experiences a Master Abort (Time-Out) on PCI That is, nobody asserts DEVSEL during the DEVSEL window. 1. Master Interface sets PCI_CONTROL[RMA] which will interrupt the Intel XScale® core if enabled. 2. Master Interface clears the Channel Enable bit in CHAN_X_CONTROL. 3. Master Interface sets DMA channel error bit in CHAN_X_CONTROL. 4. Master Interface does not reset the DMA CSRs; This leaves the descriptor pointer pointing to the DMA descriptor of the failed transfer. 5. Master Interface resets the state machines and DMA buffers 9.5.2.6 DMA Transfer Receives a Target Abort Response During a Data Phase 1. Core terminates the transaction. 2. Master Interface sets PCI_CONTROL[RTA] which can interrupt the Intel XScale® core if enabled. 3. 
Master Interface clears the Channel Enable bit in CHAN_X_CONTROL. 4. Master Interface sets DMA channel error bit in CHAN_X_CONTROL. 5. Master Interface does not reset the DMA CSRs; This leaves the descriptor pointer pointing to the DMA descriptor of the failed transfer. 6. Master Interface resets the state machines and DMA buffers. 9.5.2.7 DMA Descriptor Has a 0x0 Word Count (Not an Error) 1. No data is transferred. 2. Descriptor is retired normally. 9.5.3 As a PCI Initiator During a Direct Access from the Intel XScale® Core or Microengine 9.5.3.1 Master Transfer Experiences a Master Abort (Time-Out) on PCI 1. Core aborts the transaction. 2. Master Interface sets PCI_CONTROL[RMA] which will interrupt the Intel XScale® core if enabled. 9.5.3.2 Master Transfer Receives a Target Abort Response During a Data Phase 1. Core aborts the transaction. 2. Master Interface sets PCI_CONTROL[RTA] which will interrupt the Intel XScale® core if enabled. Hardware Reference Manual 351 Intel® IXP2800 Network Processor PCI Unit 9.5.3.3 Master from the Intel XScale® Core or Microengine Transfer (Write to PCI) Receives PCI_PERR_L on PCI Bus 1. If PCI_CMD_STAT[PERR_RESP] is not set, PCI Unit will ignore the parity error. 2. If PCI_CMD_STAT[PERR_RESP] is set: a. Core sets PCI_CMD_STAT[PERR]. b. Master Interface sets PCI_CONTROL[DPE] which will interrupt the Intel XScale® core if enabled. 9.5.3.4 Master Read from PCI (Read from PCI) Has Bad Data Parity 1. If PCI_CMD_STAT[PERR_RESP] is not set, PCI Unit will ignore the parity error. 2. If PCI_CMD_STAT[PERR_RESP] is set: a. Core asserts PCI_PERR_L on PCI. b. Master Interface sets PCI_CONTROL[DPED] which will interrupt the Intel XScale® core if enabled. c. Data that has been read from PCI is sent to the Intel XScale® core or Microengine with a data error indication. 9.5.3.5 Master Transfer Receives PCI_SERR_L from the PCI Bus Master Interface sets PCI_CONTROL[RSERR] which will interrupt the Intel XScale® core if enabled. 9.5.3.6 Intel XScale® Core Microengine Requests Direct Transfer when the PCI Bus is in Reset Master Interface will complete the transfer and drop the write data and return all ones on the read data. 9.6 PCI Data Byte Lane Alignment During any endian conversion, PCI does not need to do any longword swapping between two 32-bit longwords (LW1, LW0). But PCI may need to do byte swapping within the 32-bit longwords. Because of the different endian convention between PCI Bus and the memory, all data going between the PCI core FIFO and memory data bus passes through the byte lane reversal as shown in Table 130 through Table 137. PCI allows byte-enable swapping only without the data swapping or allow data swapping only without byte enable swapping. When PCI handle the mis align data in above two cases, PCI will only care about valid data. So PCI will drive any data values for those mis-aligned invalid data portions. 352 Hardware Reference Manual Intel® IXP2800 Network Processor PCI Unit -- Table 130. Byte Lane Alignment for 64-Bit PCI Data In (64 Bits PCI Little-Endian to Big-Endian with Swap) PCI Data IN[63:56] IN[55:48] IN[47:40] IN[39:32] IN[31:24] IN[23:16] IN[15:8] IN[7:0] OUT[7:0] OUT[15:8] OUT[23:16] OUT[31:24] OUT[7:0] OUT[15:8] OUT[23:16] OUT[31:24] SRAM Data Longword1 (32 bits) LW1 drive after LW0 DRAM Data OUT[39:32] OUT[47:40] Longword0 (32 bits) LW0 drive first OUT[55:48] OUT[63:56] OUT[7:0] OUT[15:8] OUT[23:16] OUT[31:24] Table 131. 
Byte Lane Alignment for 64-Bit PCI Data In (64 Bits PCI Big-Endian to Big-Endian without Swap) PCI Data SRAM Data DRAM Data IN[39:32] IN[47:40] OUT[7:0] OUT[15:8] IN[55:48] IN[63:56] OUT[23:16] OUT[31:24] IN[7:0] IN[15:8] IN[23:16] IN[31:24] OUT[7:0] OUT[15:8] OUT[23:16] OUT[31:24] Longword1 (32 bits) LW1 drive after LW0 OUT[39:32] Longword0 (32 bits) LW0 drive first OUT[47:40] OUT[55:48] OUT[63:56] OUT[7:0] OUT[15:8] OUT[23:16] OUT[31:24] Table 132. Byte Lane Alignment for 32-Bit PCI Data In (32 Bits PCI Little-Endian to Big-Endian with Swap) PCI Add[2]=1 PCI Add[2]=0 Longword1 (32 bits) LW1 drive after LW0 PCI Data SRAM Data Longword0 ((32 bits) LW0 drive first IN[31:24] IN[23:16] IN[15:8] IN[7:0] IN[31:24] IN[23:16] IN[15:8] IN[7:0] OUT[7:0] OUT[15:8] OUT[23:16] OUT[31:24] OUT[7:0] OUT[15:8] OUT[23:16] OUT[31:24] Longword1 (32 bits) LW1 drive after LW0 DRAM Data OUT[39:32] OUT[47:40] Longword0 ((32 bits) LW0 drive first OUT[55:48] OUT[63:56] OUT[7:0] OUT[15:8] OUT[23:16] OUT[31:24] Table 133. Byte Lane Alignment for 32-Bit PCI Data In (32 Bits PCI Big-Endian to Big-Endian without Swap) PCI Add[2]=1 PCI Add[2]=0 Longword1 (32 bits) LW1 drive after LW0 PCI Data SRAM Data Longword0 ((32 bits) LW0 drive first IN[7:0] IN[15:8] IN[23:16] IN[31:24] IN[7:0] IN[15:8] IN[23:16] IN[31:24] OUT[7:0] OUT[15:8] OUT[23:16] OUT[31:24] OUT[7:0] OUT[15:8] OUT[23:16] OUT[31:24] Longword1 (32 bits) LW1 drive after LW0 Longword0 ((32 bits) LW0 drive first direct map pci to dram IN[7:0] IN[15:8] IN[23:16] IN[31:24] IN[7:0] IN[15:8] IN[23:16] IN[31:24] DRAM Data OUT[39:32] OUT[47:40] OUT[55:48] OUT[63:56] OUT[7:0] OUT[15:8] OUT[23:16] OUT[31:24] Hardware Reference Manual 353 Intel® IXP2800 Network Processor PCI Unit Table 134. Byte Lane Alignment for 64-Bit PCI Data Out (Big-Endian to 64 Bits PCI Little Endian with Swap) IN[7:0] IN[15:8] SRAM Data DRAM Data PCI Side IN[23:16] IN[31:24] IN[7:0] IN[15:8] Longword1 (32 bits) LW1 drive after LW0 IN[39:32] IN[47:40] IN[55:48] IN[23:16] IN[31:24] Longword0 ((32 bits) LW0 drive first IN[63:56] IN[7:0] IN[15:8] OUT[63:56] OUT[55:48] OUT[47:40] OUT[39:32] OUT[31:24] OUT[23:16] IN[23:16] IN[31:24] OUT[15:8] OUT[7:0] Table 135. Byte Lane Alignment for 64-Bit PCI Data Out (Big-Endian to 64 Bits PCI Big-Endian without Swap) SRAM Data IN[7:0] IN[15:8] IN[23:16] IN[31:24] IN[7:0] IN[15:8] Longword1 (32 bits) LW1 drive after LW0 DRAM Data direct map pci to dram PCI Side IN[23:16] IN[31:24] Longword0 ((32 bits) LW0 drive first IN[39:32] IN[47:40] IN[55:48] IN[63:56] IN[7:0] IN[15:8] IN[23:16] IN[31:24] IN[7:0] IN[15:8] IN[23:16] IN[31:24] IN[7:0] IN[15:8] IN[23:16] IN[31:24] OUT[39:32] OUT[47:40] OUT[7:0] OUT[15:8] OUT[23:16] OUT[31:24] OUT[55:48] OUT[63:56] Table 136. Byte Lane Alignment for 32-Bit PCI Data Out (Big-Endian to 32 Bits PCI Little Endian with Swap) SRAM Data IN[7:0] IN[15:8] IN[23:16] IN[31:24] IN[7:0] IN[15:8] Longword1 (32 bits) LW1 drive after LW0 DRAM Data PCI Data IN[23:16] IN[31:24] Longword0 ((32 bits) LW0 drive first IN[39:32] IN[47:40] IN[55:48] IN[63:56] IN[7:0] IN[15:8] IN[23:16] IN[31:24] OUT[31:24] OUT[23:16] OUT[15:8] OUT[7:0] OUT[31:24] OUT[23:16] OUT[15:8] OUT[7:0] Longword1 (32 bits) LW1 drive after LW0 Longword0 ((32 bits) LW0 drive first PCI Add[2]=1 PCI Add[2]=0 Table 137. 
Byte Lane Alignment for 32-Bit PCI Data Out (Big-Endian to 32 Bits PCI Big-Endian without Swap) SRAM Data IN[7:0] IN[15:8] IN[23:16] IN[31:24] IN[7:0] IN[15:8] Longword1 (32 bits) LW1 drive after LW0 IN[31:24] Longword0 ((32 bits) LW0 drive first DRAM Data IN[39:32] IN[47:40] IN[55:48] IN[63:56] IN[7:0] IN[15:8] IN[23:16] IN[31:24] PCI Data OUT[7:0] OUT[15:8] OUT[23:16] OUT[31:24] OUT[7:0] OUT[15:8] OUT[23:16] OUT[31:24] Longword1 (32 bits) LW1 drive after LW0 PCI Add[2]=1 354 IN[23:16] Longword0 ((32 bits) LW0 drive first PCI Add[2]=0 Hardware Reference Manual Intel® IXP2800 Network Processor PCI Unit The BE_DEMI bit of the PCI_CONTROL register can be set to enable big-endian on the incoming data from the PCI Bus to both the SRAM and DRAM. The BE_DEMO bit of the PCI_CONTROL register can be set to enable big-endian on the outgoing data to the PCI Bus from both the SRAM and DRAM. 9.6.1 Endian for Byte Enable During any endian conversion, PCI does not need to do any longword byte enable swapping between two 32-bit longwords (LW1, LW0). But PCI may need to do byte enable swapping within the 32-bit longword byte enable. Because of the different endian convention between PCI Bus and the memory, all data going between the PCI core FIFO and memory data bus passes through the byte lane reversal as shown in Table 138 through Table 145: Table 138. Byte Enable Alignment for 64-Bit PCI Data In (64 Bits PCI Little-Endian to BigEndian with Swap) PCI Data SRAM Data IN_BE[7] IN_BE[6] IN_BE[5] IN_BE[4] IN_BE[3] IN_BE[2] IN_BE[1] IN_BE[0] OUT_BE[3] OUT_BE[2] OUT_BE[1] OUT_BE[0] OUT_BE[0] OUT_BE[1] OUT_BE[2] OUT_BE[3] Longword1byte enable LW1 byte enable drive after LW0 byte enable DRAM Data OUT_BE[4] OUT_BE[5] OUT_BE[6] OUT_BE[7] Longword0 byte enable LW0 byte enable drive first OUT_BE[0] OUT_BE[1] OUT_BE[2] OUT_BE[3] Table 139. Byte Enable Alignment for 64-Bit PCI Data In (64 Bits PCI Big-Endian to Big-Endian without Swap) PCI Data SRAM Data DRAM Data IN_BE[4] IN_BE[5] IN_BE[6] IN_BE[7] IN_BE[0] IN_BE[1] IN_BE[2] IN_BE[3] OUT_BE[0] OUT_BE[1] OUT_BE[2] OUT_BE[3] OUT_BE[0] OUT_BE[1] OUT_BE[2] OUT_BE[3] Longword1byte enable LW1 byte enable drive after LW0 byte enable OUT_BE[4] OUT_BE[5] OUT_BE[6] OUT_BE[7] Longword0 byte enable LW0 byte enable drive first OUT_BE[0] OUT_BE[1] OUT_BE[2] OUT_BE[3] Table 140. Byte Enable Alignment for 32-Bit PCI Data In (32 bits PCI Little-Endian to BigEndian with Swap) PCI Add[2]=1 PCI Add[2]=0 Longword1byte enable LW1 byte enable drive after LW0 byte enable PCI Data SRAM Data DRAM Data Longword0 byte enable LW0 byte enable drive first IN_BE[3] IN_BE[2] IN_BE[1] IN_BE[0] IN_BE[3] IN_BE[2] IN_BE[1] IN_BE[0] OUT_BE[0] OUT_BE[1] OUT_BE[2] OUT_BE[3] OUT_BE[0] OUT_BE[1] OUT_BE[2] OUT_BE[3] Longword1byte enable LW1 byte enable drive after LW0 byte enable OUT_BE[4] Hardware Reference Manual OUT_BE[5] OUT_BE[6] OUT_BE[7] Longword0 byte enable LW0 byte enable drive first OUT_BE[0] OUT_BE[1] OUT_BE[2] OUT_BE[3] 355 Intel® IXP2800 Network Processor PCI Unit Table 141. 
Byte Enable Alignment for 32-Bit PCI Data In (32 Bits PCI Big-Endian to Big-Endian without Swap) PCI Add[2]=1 PCI Add[2]=0 Longword1byte enable LW1 byte enable drive after LW0 byte enable PCI Data SRAM Data Longword0 byte enable LW0 byte enable drive first IN_BE[0] IN_BE[1] IN_BE[2] IN_BE[3] IN_BE[0] IN_BE[1] IN_BE[2] IN_BE[3] OUT_BE[0] OUT_BE[1] OUT_BE[2] OUT_BE[3] OUT_BE[0] OUT_BE[1] OUT_BE[2] OUT_BE[3] Longword1byte enable LW1 byte enable drive after LW0 byte enable Longword0 byte enable LW0 byte enable drive first direct map pci to dram IN_BE[0] IN_BE[1] IN_BE[2] IN_BE[3] IN_BE[0] IN_BE[1] IN_BE[2] IN_BE[3] DRAM Data OUT_BE[4] OUT_BE[5] OUT_BE[6] OUT_BE[7] OUT_BE[0] OUT_BE[1] OUT_BE[2] OUT_BE[3] Table 142. Byte Enable Alignment for 64-Bit PCI Data Out (Big-Endian to 64 Bits PCI Little Endian with Swap) IN_BE[0] SRAM Data DRAM Data PCI Side IN_BE[1] IN_BE[2] IN_BE[3] IN_BE[0] Longword1byte enable LW1 byte enable drive after LW0 byte enable IN_BE[1] IN_BE[2] IN_BE[3] Longword0 byte enable LW0 byte enable drive first IN_BE[4] IN_BE[5] IN_BE[6] IN_BE[7] IN_BE[0] IN_BE[1] IN_BE[2] IN_BE[3] OUT_BE[7] OUT_BE[6] OUT_BE[5] OUT_BE[4] OUT_BE[3] OUT_BE[2] OUT_BE[1] OUT_BE[0] Table 143. Byte Enable Alignment for 64-Bit PCI Data Out (Big-Endian to 64 Bits PCI Big Endian without Swap) SRAM Data IN_BE[0] IN_BE[1] IN_BE[2] IN_BE[3] IN_BE[0] Longword1byte enable LW1 byte enable drive after LW0 byte enable DRAM Data PCI Side IN_BE[1] IN_BE[2] IN_BE[3] Longword0 byte enable LW0 byte enable drive first IN_BE[4] IN_BE[5] IN_BE[6] IN_BE[7] IN_BE[0] IN_BE[1] IN_BE[2] IN_BE[3] OUT_BE[4] OUT_BE[5] OUT_BE[6] OUT_BE[7] OUT_BE[0] OUT_BE[1] OUT_BE[2] OUT_BE[3] Table 144. Byte Enable Alignment for 32-Bit PCI Data Out (Big-Endian to 32 Bits PCI Little Endian with Swap) SRAM Data IN_BE[0] IN_BE[1] IN_BE[2] IN_BE[3] IN_BE[0] Longword1byte enable LW1 byte enable drive after LW0 byte enable DRAM Data PCI Data IN_BE[2] IN_BE[3] Longword0 byte enable LW0 byte enable drive first IN_BE[4] IN_BE[5] IN_BE[6] IN_BE[7] IN_BE[0] IN_BE[1] IN_BE[2] IN_BE[3] OUT_BE[3] OUT_BE[2] OUT_BE[1] OUT_BE[0] OUT_BE[3] OUT_BE[2] OUT_BE[1] OUT_BE[0] Longword1byte enable LW1 byte enable drive after LW0 byte enable PCI Add[2]=1 356 IN_BE[1] Longword0 byte enable LW0 byte enable drive first PCI Add[2]=0 Hardware Reference Manual Intel® IXP2800 Network Processor PCI Unit Table 145. Byte Enable Alignment for 32-Bit PCI Data Out (Big-Endian to 32 Bits PCI Big Endian without Swap) SRAM Data IN_BE[0] IN_BE[1] IN_BE[2] IN_BE[3] IN_BE[0] Longword1byte enable LW1 byte enable drive after LW0 byte enable DRAM Data PCI Data IN_BE[4] IN_BE[5] IN_BE[6] IN_BE[7] IN_BE[1] IN_BE[2] IN_BE[3] Longword0 byte enable LW0 byte enable drive first IN_BE[0] IN_BE[1] IN_BE[2] IN_BE[3] OUT_BE[0] OUT_BE[1] OUT_BE[2] OUT_BE[3] OUT_BE[0] OUT_BE[1] OUT_BE[2] OUT_BE[3] Longword1byte enable LW1 byte enable drive after LW0 byte enable PCI Add[2]=1 Longword0 byte enable LW0 byte enable drive first PCI Add[2]=0 The BE_BEMI bit of the PCI_CONTROL register can be set to enable big-endian on the incoming byte enable from the PCI Bus to both the SRAM and DRAM. The BE_BEMO bit of the PCI_CONTROL register can be set to enable big-endian on the outgoing byte enable to the PCI Bus from both the SRAM and DRAM. The B-stepping silicon provides a mechanism to enable byte swapping for PCI I/O operations as described in Table 146. Hardware Reference Manual 357 Intel® IXP2800 Network Processor PCI Unit Table 146. 
PCI I/O Cycles with Data Swap Enable Stepping Description A Stepping A PCI IO cycle is treated like CSR where the data bytes are not swapped. It is sent in the same byte order whether the PCI bus is configured in Big-Endian or Little-Endian mode. When PCI_CONTROL[IEE] is 0, PCI data is sent in the same byte order whether the PCI bus is configured in Big-Endian or Little-Endian mode. When PCI_CONTROL[IEE] is 1, PCI IO data will follow the same memory space swapping rule. The address always follows the physical location, Example: BEs not Swapped (1 byte access) ad[1:0] BE3 BE2 BE1 BE0 BE3 BE2 BE1 BE0 1 1 1 0 11 0 1 1 1 01 1 1 0 1 1 0 1 0 1 1 10 1 0 1 1 0 1 1 1 0 1 11 0 1 1 1 0 0 1 1 1 0 ad[1:0] BE3 BE2 BE1 BE0 BEs Swapped (2 byte access) ad[1:0] BE3 BE2 BE1 BE0 00 1 1 0 0 1 0 0 0 1 1 01 1 0 0 1 0 1 1 0 0 1 10 0 0 1 1 0 0 1 1 0 0 BEs not Swapped (3 byte access) ad[1:0] BE3 BE2 BE1 BE0 BEs Swapped (3 byte access) ad[1:0] BE3 BE2 BE1 BE0 00 1 0 0 0 0 1 0 0 0 1 01 0 0 0 1 0 0 1 0 0 0 BEs not Swapped (4 byte access) ad[1:0] 00 358 ad[1:0] 00 BEs not Swapped (2 byte access) B Stepping BEs Swapped (1 byte access) BE3 BE2 BE1 BE0 0 0 0 0 BEs Swapped (4 byte access) ad[1:0] 0 0 BE3 BE2 BE1 BE0 0 0 0 0 Hardware Reference Manual Intel® IXP2800 Network Processor Clocks and Reset Clocks and Reset 10 This section describes the IXP2800 Network Processor clocks and reset. Refer to the Intel® IXP2800 Network Processor Hardware Initialization Reference Manual for information about the initialization of all units of the IXP2800 Network Processor. 10.1 Clocks The block diagram in Figure 130 shows how the IXP2800 Network Processor implements an onboard clock generator to generate the internal clocks used by the various functional units in the device. It takes an external reference frequency and multiplies it to a higher frequency clock using a PLL. That clock is then divided down by a set of programmable dividers to provide clocks to SRAM and DRAM controllers. The Intel XScale® core and Microengines get clocks using fixed divide ratios. The Media and Switch Fabric Interface clock is selected based on the strap pin (CFG_MSF_FREQ_SEL) so that when CFG_MSF_FREQ_SEL is high, an internally-generated clock using the programmable divider is used and when CFG_MSF_FREQ_SEL is low, an externally-received clock on the MSF interface is used. The PCI controller uses external clocks. Each of the units also interfaces to internal buses, which run at ½ the Microengine frequency. Figure 130 shows the overall clock generation and distribution and Table 147 summarizes the clock usage. Hardware Reference Manual 359 Intel® IXP2800 Network Processor Clocks and Reset Figure 130. Overall Clock Generation and Distribution Slow Port Devices, i.e., Flash, ROM Slow Port Control S_clk1 S_clk2 S_clk3 SRAM0 SRAM1 SRAM2 SRAM3 Media and Switch Fabric Interface Scratch, Hash, CSR Intel XScale® Core External Oscillator S_clk0 tdclk rdclk tclk_ref Gasket ref_clk_l Clock Unit with PLL ref_clk_h Constant (Multiplier) Peripherals (Timers, UART, etc.) PCI DRAM0 DRAM1 DRAM2 D_clk0 D_clk1 D_clk2 PCI_clk Intel® IXP2800 Network Processor MEs Key: Fast Clock ½ Fast Clock Divided Clock A9777-02 Table 147. Clock Usage Summary (Sheet 1 of 2) Unit Name Microengine 360 Description Comment Microengines internal. Internal Buses Command/Push/Pull interface of DRAM, SRAM, Intel XScale® core, Peripheral, MSF, and PCI Units. 1/2 Microengine frequency. Intel XScale® core Intel XScale® core microprocessor, caches, microprocessor side of Gasket. 
1/2 of Microengine frequency. DRAM DRAM pins and control logic (all of DRAM unit except Internal Bus interface). Divide of Microengine frequency. All DRAM channels use the same frequency. Clocks are driven by the IXP2800 Network Processor to external DRAMs. Hardware Reference Manual Intel® IXP2800 Network Processor Clocks and Reset Table 147. Clock Usage Summary (Sheet 2 of 2) Unit Name SRAM Scratch, Hash, CSR Description Comment SRAM pins and control logic (all of the SRAM unit except Internal Bus interface). Divide of Microengine frequency. Each SRAM channel has its own frequency selection. Clocks are driven by the IXP2800 Network Processor to external SRAMs and/or Coprocessors. Scratch RAM, Hash Unit, CSR access block 1/2 of Microengine frequency. Note that Slowport has no clock. Timing for Slowport accesses is defined in Slowport registers. The transmit clock for the Media and Switch interface can be derived in two different ways. MSF Receive and Transmit pins and control logic. • From TCLK input signal (supplied by PHY device). • Divided from internal clock. For details please refer to Chapter 8, “Media and Switch Fabric Interface”. APB APB logic Divide of Microengine frequency. PCI PCI pins and control logic. External reference. Either from Host system or onboard oscillator. The fast frequency on the IXP2800 Network Processor is generated by an on-chip PLL that multiplies a reference frequency provided by an on-board LVDS oscillator (frequency 100 MHz) by a selectable multiplier. The multiplier is selected by using external strap pins SP_AD[5:0] and can be viewed by software via the STRAP_OPTIONS[CFG_PLL_MULT] CAP CSR register bits. The multiplier range is even multiples between 16 and 48, so the PLL can generate a 1.6 GHz to 4.8 GHz clock (with a 100-Mhz reference frequency). The PLL output frequency is divided by 2 to get the Microengine clock and by 4 to get the Intel XScale® core and the internal Command/Push/Pull bus frequency. An additional division (after the divide by 2) is used to generate the clock frequencies for the other internal units. The divisors are programmable via the CLOCK_CONTROL CSR. APB divisor specified in the CLOCK_CONTROL CSR clock is scaled by 4 (i.e., a value of 2 in the CSR selects a divisor of 8). Table 148 shows the frequencies that are available based on a 100-Mhz oscillator and various values of PLL multipliers, for the supported divisor values of 3 to 15. Hardware Reference Manual 361 Intel® IXP2800 Network Processor Clocks and Reset Table 148. Clock Rates Examples Input Oscillator Frequency (MHz) 100 2000 [20] 2200 [22] 2400 [24] 2600 [26] 2800 [28] 4000 [40] 4800 [48] Microengine Frequency2 1000 1100 1200 1300 1400 2000 2400 Intel XScale® core & Command/Push/Pull Bus Frequency 3 500 550 600 650 700 1000 1200 26 500 550 600 650 700 1000 1200 3 333 367 400 433 467 666 800 4 250 275 300 325 350 500 600 5 200 220 240 260 280 400 480 6 167 183 200 217 233 334 400 7 143 157 171 186 200 286 342 8 125 138 150 163 175 250 300 9 111 122 133 144 156 222 266 10 100 110 120 130 140 200 240 Divide Ratio for other Units (except APB)4 1. 2. 3. 4. 5. 6. Divisor5 PLL Output Frequency (MHz) [PLL Multiplier]1 11 91 100 109 118 127 182 218 12 83 92 100 108 117 166 200 13 77 85 92 100 107 154 184 14 71 79 86 93 100 142 172 15 67 73 80 87 93 134 160 This multiplier is selected via SP_AD[5:0] strap pins. This frequency is the PLL output frequency divided by 2. This frequency is the PLL output frequency divided by 4. 
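The frequency relationships above can be summarized in a short calculation. The sketch below applies the formulas given in this section and in Table 148 (reference × multiplier for the PLL, /2 for the Microengines, /4 for the Intel XScale® core and Command/Push/Pull bus, an additional divisor of 3 to 15 for the other units, and a ×4 scaling of the APB divisor); the function and structure names are illustrative only.

```c
#include <stdio.h>

/* Derived clock frequencies in MHz (Section 10.1, Table 148). */
struct ixp_clocks {
    unsigned pll_out;     /* reference * PLL multiplier        */
    unsigned me;          /* PLL output / 2                    */
    unsigned xscale_cpp;  /* PLL output / 4                    */
    unsigned unit;        /* ME frequency / unit divisor       */
    unsigned apb;         /* ME frequency / (APB divisor * 4)  */
};

static struct ixp_clocks ixp2800_clocks(unsigned ref_mhz, unsigned pll_mult,
                                        unsigned unit_div, unsigned apb_div)
{
    struct ixp_clocks c;
    c.pll_out    = ref_mhz * pll_mult;   /* pll_mult: even value 16..48 from SP_AD[5:0] */
    c.me         = c.pll_out / 2;
    c.xscale_cpp = c.pll_out / 4;
    c.unit       = c.me / unit_div;      /* unit_div: 3..15 from CLOCK_CONTROL          */
    c.apb        = c.me / (apb_div * 4); /* APB divisor is scaled by an additional x4   */
    return c;
}

int main(void)
{
    /* 100 MHz reference, x28 multiplier, SRAM divisor 4, APB divisor 2. */
    struct ixp_clocks c = ixp2800_clocks(100, 28, 4, 2);
    printf("PLL %u MHz, ME %u MHz, XScale/CPP %u MHz, SRAM %u MHz, APB %u MHz\n",
           c.pll_out, c.me, c.xscale_cpp, c.unit, c.apb);
    /* Prints: PLL 2800 MHz, ME 1400 MHz, XScale/CPP 700 MHz, SRAM 350 MHz, APB 175 MHz */
    return 0;
}
```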
The ABP divisor specified in the CLOCK_CONTROL CAP CSR is scaled by an additional x4. This divisor is selected via the CLOCK_CONTROL CAP CSR. The Base Frequency is the PLL output frequency divided by 2 This divide ratio is only used by test logic. In the normal functional mode, this ratio is reserved for Push/Pull clocks only. Figure 131 shows the clocks generation circuitry for the IXP2800 Network Processor. When the chip is powered up, bypass clock will be sent to all the units. After the PLL is locked, clock unit will switch all units from bypass clock to a fixed frequency clock which is generated by dividing PLL OUTPUT FREQUENCY by 16. Once Clock Control CSR is written, clock unit will replace fixed frequency clock with the defined clocks for different units. 362 Hardware Reference Manual Intel® IXP2800 Network Processor Clocks and Reset Figure 131. IXP2800 Network Processor Clock Generation PLL Divide by 4 Bypass Clk Divide by 2 Internal Buses (CPP), Intel XScale® Core ME DFT TBD Divide by N (reset value: 15) DRAMs Divide by N (reset value: 15) SRAM0 Divide by N (reset value: 15) SRAM1 Divide by N (reset value: 15) SRAM2 Divide by N (reset value: 15) SRAM3 Divide by N (reset value: 15) MEDIA Divide by Nx4 (reset value: 15) APB A9778-04 10.2 Synchronization Between Frequency Domains Due to the internal design architecture of the IXP2800 Network Processor, it is guaranteed that one of the clock domains of an asynchronous transfer will be the Push/Pull domain (PLL/4). Additionally, all other clocks are derived by further dividing the Microengine clock (PLL/2n where n is 3 or more); refer to Figure 132. Note: The exception is the PCI unit where the PCI clock is fully asynchronous with the PP clock. Therefore in the PCI unit, data is synchronized using the usual 3-flop synchronization method. Therefore, the clock A and clock B relationship will always be apart by at least two PLL clocks. To solve hold problem between clock A and clock B, a delay is added anytime data is transferred from clock A to clock B. The characteristic of this delay element is such that it is high enough to resolve any hold issue in fast environment but in the slow environment its delay is still less than two PLL clocks. Hardware Reference Manual 363 Intel® IXP2800 Network Processor Clocks and Reset Figure 132. Synchronization Between Frequency Domains Clock A domain Delay Element Clock B domain Data_out Data_in Clock A Clock B Clock A and Clock B are guaranteed to be at least two PLL clocks apart; therefore, if the delay element is such that it is more than the hold time required by clock B but less than the setup required by Clock B, data should transfer glitch-free from the Clock A to Clock B domain. 10.3 Reset The IXP2800 Network Processor can be reset four ways. • • • • 10.3.1 Hardware Reset Using nRESET or PCI_RST_L. PCI-Initiated Reset. Watchdog Timer Initiated Reset. Software Initiated Reset. Hardware Reset Using nRESET or PCI_RST_L The IXP2800 Network Processor provides the nRESET pin so that it can be reset by an external device. Asserting this pin resets the internal functions and generates an external reset via the nRESET_OUT pin. Upon power-up, nRESET (or PCI_RST_L) must remain asserted for 1ms after VDD is stable to properly reset the IXP2800 Network Processor and ensure that the external clocks are stable. While nRESET is asserted, the processor is held in reset. When nRESET is released, the Intel XScale® core begins executing from address 0x0. 
If PCI_RST_L is input to the chip, nRESET should be removed before or at the same time as PCI_RST_L. All the strap options are latched with nRESET except for PCI strap option BOARD_IS_64 which is latched with PCI_RST_L only (by latching the status of REQ64_L at the trailing edge of PCI_RST_L). If nRESET is asserted, while the Intel XScale® core is executing, the current instruction is terminated abnormally and the reset sequence is initiated. The nRESET_OUT signal de-assertion depends upon settings of reset_out_strap and IXP_RESET_0[22] also called the EXTRST_EN bit. During power up, IXP_RESET_0[22] is reset to 0; therefore the value to be driven on nRESET_OUT is defined by reset_out_strap. When 364 Hardware Reference Manual Intel® IXP2800 Network Processor Clocks and Reset “reset_out_strap” is sampled as 0 on the trailing edge of reset, nRESET_OUT is de-asserted based on the value of IXP_RESET_0[15] which is written by software. If “reset_out_strap” is sampled as 1 on the trailing edge of reset, nRESET_OUT is de-asserted after PLL locks. During normal function mode, if software wants to pull nRESET_OUT high, it should set IXP_RESET_0[22] = 1 and then set IXP_RESET_0[15] = 1. To pull nRESET_OUT low, software should set the IXP_RESET_0[15] bit back to 0. Figure 133. Reset Out Behavior IXP_RESET0 Register RESET_OUT 0 1 EXTRST [15] EXTRST_EN [22] 1 0 PLL Lock Signal RESET_OUT_STRAP A9780-01 Hardware Reference Manual 365 Intel® IXP2800 Network Processor Clocks and Reset Figure 134. Reset Generation Watchdog History Register (WHR) Watchdog Event D SOFTWARE RESET Reset nRESET# PLL_RST PLL CORE_RST Logic PCI_RST# CFG_PCI_RST_DIR (1: Output, 0:Input) Counter to guarantee minimum assertion time WATCHDOG_RESET Notes: When Watchdog event happens the register gets set. This register gets reset when WHR_Reset gets asserted or software reads it. A9781-01 10.3.2 PCI-Initiated Reset CFG_RST_DIR is not asserted and PCI_RST_L is asserted. When the CFG_RST_DIR strap pin is not asserted (sampled 0), PCI_RST_L is input to the IXP2800 Network Processor and is used to reset all the internal functions. Its behavior is the same as a hardware reset using nRESET pin. 10.3.3 Watchdog Timer-Initiated Reset The IXP2800 Network Processor provides a watchdog timer that can cause a reset if the Watchdog timer expires and the Watchdog enable bit WDE in the Timer Watchdog Enable register is also set. The Intel XScale® core should be programmed to reset the watch dog timer periodically to ensure that it does not expire. If a watchdog timer expires, it is assumed that the Intel XScale® core has ceased executing instructions properly. When the timer expires, the Watchdog History register bit[0] is set which can be read by the software later on. The following sections define IXP2800 Network Processor behavior for the watchdog event. 366 Hardware Reference Manual Intel® IXP2800 Network Processor Clocks and Reset 10.3.3.1 Slave Network Processor (Non-Central Function) • If the Watchdog timer reset enable bit set to 1, Watchdog reset will trigger the soft reset • If the Watchdog timer reset enable bit set to 0, Watchdog reset will trigger the PCI interrupt to external PCI host (if interrupt is enabled by PCI Outbound Interrupt Mask Register[3]). External PCI host can check the IXP2800 error status and log the error then reset the Slave IXP2800 Network Processor only or reset all the PCI devices (assert the PCI_RST_L). 
• If the Watchdog history bit is already set when a new watchdog event happens, the Watchdog timer reset enable bit is disregarded and a soft reset is generated. 10.3.3.2 Master Network Processor (PCI Host, Central Function) • If the Watchdog timer reset enable bit is set to 1, Watchdog reset will trigger the soft reset and set the watchdog history bit. • If the Watchdog timer reset enable bit is set to 0, check the watchdog history bit. If is already set, generate soft reset. If the watchdog history bit is not set already, watchdog reset will just set the watchdog history bit and no further action is taken. 10.3.3.3 Master Network Processor (Central Function) • If the Watchdog timer reset enable bit is set to 0, Watchdog reset will trigger the PCI interrupt to external PCI host (if interrupt is enabled by PCI Outbound Interrupt Mask Register[3]). • If the Watchdog history bit is already set when a new watchdog event happens, the Watchdog timer reset enable bit is disregarded, and a soft reset is generated. • If the Watchdog timer reset enable bit is set to 1, Watchdog reset will trigger the soft reset. 10.3.4 Software-Initiated Reset The Intel XScale® core or external PCI bus master can reset specific functions in the IXP2800 Network Processor by writing to the IXP_RESET0 and IXP_RESET1 registers. All the individual microengines and specific units can be reset individually in this fashion. Software reset initiated by the Reset All bit in the IXP_RESET0 register behaves almost the same as hardware resets in the sense that PLL and rest of the core gets reset. The only difference between soft reset and hard reset is that a 512-cycle counter is added at the output of the RESET_ALL bit going to the PLL unit for chip reset generation. The PCI unit in the meantime detects the bus idle condition and generates a local reset. This local reset is removed once chip reset is generated and chip reset then takes over the reset function of PCI unit. Both hardware and software resets (software reset after 512 cycles delay) combined generate PLL_RST for the PLL logic. During the assertion of PLL_RST, PLL block remains in the bypass mode and passes the incoming clock directly to the core logic. At this time everyone inside the core gets the same basic clock. The Clock Control register is reset to 0x0FFF_FFFF using the same signal. Once the PLL_RST signal goes away, the PLL starts generating divide_by_2 clock for the Microengines, divide_by_4 clock for the Intel XScale® core and divide_by_16 clock for the rest of the chip (not using divide_by_4 clock) after inserting 16 – 32 idle clocks. Once the clock control CSR is written by software, the PLL block detects it by finding a change in value of this register. Hardware Reference Manual 367 Intel® IXP2800 Network Processor Clocks and Reset Once in operation, if the watchdog timer expires with watchdog timer enable bit WDE from Timer Watchdog Enable register set, a reset pulse from the watchdog timer logic goes to PLL unit after passing through a counter to guarantee minimum assertion time, which in turn resets the IXP_RESETn registers that cause the entire chip to be reset. Figure 134 explains the reset generation for the PLL logic and for the rest of the core. CORE_RST is used inside the IXP2800 to reset everything; PLL_RST can be disabled. 10.3.5 Reset Removal Operation Based on CFG_PROM_BOOT Reset removal based on the CFG_PROM_BOOT strap option (BOOT_PROM) can be divided into two parts: 1. When CFG_PROM_BOOT is 1 (BOOT_PROM is present). 2. 
When CFG_PROM_BOOT is 0 (BOOT_PROM is not present). 10.3.5.1 When CFG_PROM_BOOT is 1 (BOOT_PROM is Present) After CORE_RST is de-asserted, reset from the Intel XScale® core, SHaC, and CMDARB is removed. Once the Intel XScale® core reset is removed, the Intel XScale® core starts initializing the chip. The Intel XScale® core writes the ‘clock control CSR’ to define the operating frequencies of different units. The Intel XScale® core writes IXP_RESET0[21] to allow the PCI logic to start accepting transactions on the PCI bus as part of initialization process. 10.3.5.2 When CFG_PROM_BOOT is 0 (BOOT_PROM is Not Present) After CORE_RST is de-asserted, IXP_RESET0[21] is set, allowing the PCI unit to start accepting transactions on the PCI bus. In this mode, the Intel XScale® core is kept in reset. Reset from DRAM logic is removed by the PCI host by writing 0 to specific bits in the IXP_RESET0 register. 10.3.6 Strap Pins The IXP2800 Strap pins for reset and initialization operation are described in Table 149. 368 Hardware Reference Manual Intel® IXP2800 Network Processor Clocks and Reset Table 149. IXP2800 Network Processor Strap Pins Signal Name Description PCI_RST direction pin: (Also called PCI_HOST) Need to be a dedicated pin. CFG_RST_DIR RST_DIR 1—IXP2800 Network Processor is the host supporting central function. PCI_RST_L is output. 0—IXP2800 Network Processor is not central function. PCI_RST_L is input. This pin is stored at XSC[31] (XScale_Control register) at the trailing edge of reset. PCI PROM BOOT Pin: 1—IXP2800 Network Processor will boot from PROM: Whether Intel XScale® core will configure the system or not will be defined by CFG_PCI_BOOT_HOST strap option. CFG_PROM_BOOT GPIO[0] 0—IXP2800 Network Processor will not boot from PROM. So after host has downloaded image od boot code into DRAM, Intel XScale® core will boot from DRAM address 0. This pin is stored at XSC[29] (XScale_Control register) at the trailing edge of reset. PCI BOOT HOST Pin: 1—IXP2800 Network Processor will configure the PCI system. CFG_PCI_BOOT_HOST GPIO[1] 0—IXP2800 Network Processor will not configure the PCI system. This pin is stored at XSC[28] (XScale_Control register) at the trailing edge of reset. PCI Arbiter Pin: CFG_PCI_ARB GPIO[2] 1—IXP2800 Network Processor is the arbiter on the PCI bus. 0—IXP2800 Network Processor is not the arbiter on the PCI bus. PLL Multiplier PLL_MULT[5:0] SP_AD[5:0] RESET_OUT_STRAP SP_AD[7] Valid values are 010000-110000 for a multiplier range of 16 – 48. Other values will result in undefined behavior by PLL. When 1: nRESET_OUT is removed after PLL locks. When 0: nRESET_OUT is removed by software using bit IXP_RESET0[17]. SRAM Bar Window: 11—SRAM BAR size of 256 Mbytes CFG_PCI_SWIN[1:0] GPIO[6:5] 10—SRAM BAR size of 128 Mbytes 01—SRAM BAR size of 64 Mbytes 00—SRAM BAR size of 32 Mbytes DRAM BAR Window: 11—DRAM BAR size of 1024 Mbytes CFG_PCI_DWIN[1:0] GPIO[4:3] 10—DRAM BAR size of 512 Mbytes 01—DRAM BAR size of 256 Mbytes 00—DRAM BAR size of 128 Mbytes Select source of MSF Tx Clock: CFG_MSF_FREQ_SEL SP_AD[6] 0—TCLK_Ref input pin 1—Internally generated clock Hardware Reference Manual 369 Intel® IXP2800 Network Processor Clocks and Reset Table 150 lists the supported Strap combinations of CFG_PROM_BOOT, CFG_RST_DIR, and CFG_PCI_BOOT_HIST. Table 150. 
Supported Strap Combinations CFG_PROM_BOOT, CFG_RST_DIR, CFG_PCI_BOOT_HOST Result 000 Allowed 001 Allowed 010 Not allowed 011 Not allowed 100 Allowed 101 Allowed 110 Allowed 111 Allowed One more restriction in the PCI unit is that, if the IXP2800 Network Processor is a PCI_HOST or PCI_ARBITER, it should also be PCI_CENTRAL_FUNCTION. 10.3.7 Powerup Reset Sequence When the system is powered up, bypass clock is sent to all the units as the chip begins to power up. It will merely be used to allow a gradual power up and to begin clocking state elements to remove possible circuit contention. When PLL gets locked after nRESET is de-asserted, it will start generating divide_by_16 clocks for all the units. Reset from the IXP_RESET register is also removed at the same time. When software updates the clock count register, clocks are again stopped for 32 cycles and then start again. The reset sequence described above is the same in the case when reset happens through the PCI_RST_L signal and CFG_RST_DIR is asserted. Once in operation, if watchdog timer expires with watchdog timer enable bit (bit [0] in the Timer Watchdog Enable register ON, a reset pulse from the watchdog timer logic resets the IXP_RESETn registers and in turn causes the entire network processor to be reset. 10.4 Boot Mode The IXP2800 can boot in following two modes: • Flash ROM • PCI Host Download Figure 135 shows the IXP2800 Network Processor Boot process. 370 Hardware Reference Manual Intel® IXP2800 Network Processor Clocks and Reset Figure 135. Boot Process START Reset Signal asserted (hardware, software, PCI or Watchdog) Reset Signal deasserted. If CFG_RST_DIR is 1, the Network Processor drives PCI RST# signal. If CFG_RST_DIR is 0, PCI_RST# is input. No CFG_PROM_BOOTBoot From Present 1. Intel XScale® Core is held in reset. 2. PCI BAR window sizes are configured by strap options. 3. External PCI host configures PCI registers and DRAM registers. 4. External PCI host loads boot image in DRAM. 5. Release Intel XScale® Core from reset and Intel XScale® Core starts code fetch from DRAM at 0x0. Yes 1. Intel XScale® Core boots off PROM. 2. Configures SRAM, DRAM, Media, etc. 3. If CFG_RST# signal after 1 ms timeout once PCI clock active is detected. 4. Retries PCI config cycles. 5. Programs PCI BAR window size. 6. Intel XScale® Core writes the IXP_RESET0[21] register to enable PCI bus. Yes CFG_PROM_ BOOT_HOST No Intel XScale® Core initializes the system by initiating PCI config cycles. END A9782-03 Hardware Reference Manual 371 Intel® IXP2800 Network Processor Clocks and Reset 10.4.1 Flash ROM At power up, if FLASH_ROM is present, strap pin CFG_PROM_BOOT should be sampled 1 (should be pulled up). Therefore after reset being removed by the PLL logic from the IXP_RESET0 register, the Intel XScale® core reset is automatically removed. Flash Alias Disable (bit [8] of Misc Control register) information is used by the Intel XScale® core gasket to decide where to forward address 0 from the Intel XScale® core when the Intel XScale® core wakes up and starts accessing the code from address 0. In this mode, since “flash alias disable” bit is reset to 0, the Intel XScale® core gasket will convert access to address 0 to PROM access from address 0 using the CAP command. Based on the code residing inside PROM, the Intel XScale® core starts removing reset from SRAM, PCI, DRAM, Microengines etc. by writing 0 in their corresponding bit location of IXP_RESETn register and then initializing their configuration registers. 
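A minimal sketch of this PROM boot initialization is shown below. Only IXP_RESET0[21] (PCI enable) is taken from the text; the register pointers, the function name, and the idea of passing in the unit reset mask and clock value are illustrative assumptions, since the individual SRAM/DRAM/Microengine reset bit positions and CLOCK_CONTROL encodings are defined in the programmer's reference rather than here.

```c
#include <stdint.h>

/* Illustrative MMIO pointers; real boot code would use the CAP CSR addresses
 * from the programmer's reference manual. */
extern volatile uint32_t *IXP_RESET0;
extern volatile uint32_t *CLOCK_CONTROL;

#define RST0_PCI_ENABLE  (1u << 21)   /* IXP_RESET0[21]: PCI unit may accept transactions */

/* Release the units named by unit_reset_mask (SRAM, DRAM, Microengines, ...)
 * from reset. The bit assignments live in IXP_RESET0/IXP_RESET1, so the mask
 * is passed in rather than assumed here. */
void prom_boot_init(uint32_t clock_control_value, uint32_t unit_reset_mask)
{
    /* 1. Define the operating frequencies before waking the other units. */
    *CLOCK_CONTROL = clock_control_value;        /* board-specific value */

    /* 2. Clear the reset bits of the units the boot code wants to bring up. */
    *IXP_RESET0 &= ~unit_reset_mask;

    /* 3. Allow the PCI unit to start accepting transactions on the PCI bus. */
    *IXP_RESET0 |= RST0_PCI_ENABLE;

    /* 4. The SRAM, DRAM, and MSF configuration registers are initialized next. */
}
```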
Boot code in PROM can change flash alias disable bit to 1 anytime to map DRAM at address 0 and therefore block further accesses to PROM at address 0. This change should be done before putting any data in DRAM at address 0. The Intel XScale® core also sets different BARs inside PCI unit to define memory requirements for different windows. The Intel XScale® core behavior as a host is controlled by CFG_PCI_BOOT_HOST strap option. If CFG_PCI_BOOT_HOST is sampled asserted in the de-asserting edge of reset, the Intel XScale® core will behave as boot host and configure the PCI system. 10.4.2 PCI Host Download At power up, if FLASH_ROM is not present, strap pin CFG_PROM_BOOT should be sampled 0 (should be pulled down). In this mode CFG_RST_DIR pin should be 0 at power up signaling PCI_RST_L pin is an input that behaves as global chip reset. 1. Even after reset is removed by the PLL logic from IXP_RESET0 register (after PCI_RST_L reset is de-asserted), the Intel XScale® core reset is not removed. 2. PCI Reset through IXP_RESET0 [16] is removed automatically after being set and reset being removed. 3. IXP_RESET0[21] is set after PCI_RST_L has been removed and PLL_LOCK is sampled asserted. 4. Once IXP_RESET0[21] is set, PCI unit starts responding to transactions. 5. PCI Host first configures CSR, SRAM and DRAM base address registers after reading size requirements for these BARs. The size for CSR, SRAM and DRAM is defined by the use of Strap pins. Pre-fetchability for the window is defined by bit [3] of the respective BAR registers; therefore when host reads these registers, bit [3] is returned as 0 for CSR, SRAM and DRAM defining CSRs and also if SRAM and DRAM are to be non-prefetchable. Type Bits [2:0] are always Read-Only and return the value of 0x0 when read for CSR, SRAM and DRAM BAR registers. 6. PCI Host also programs Clock Control CSR, for PLL unit to generate proper clocks for SRAM, DRAM and other units. Once these base address registers have been programmed, PCI host programs DRAM channels by initializing SDRAM_CSR, SDRAM_MEMCTL0, SDRAM_MEMCTL1 and SDRAM_MEMINIT registers. Once these registers have been programmed, PCI host writes the BOOT Code in DRAM starting at DRAM address 0. PCI Host can also program other registers if required. Once the boot 372 Hardware Reference Manual Intel® IXP2800 Network Processor Clocks and Reset code is written in DRAM, PCI host writes 1 at bit [8] of Misc_Control register called Flash Alias Disable (Reset value 0). The Alias Disable bit can be wired to the Intel XScale® core gasket directly so that gasket knows how to transform address 0 from the Intel XScale® core. After writing 1 at Flash Alias Disable bit, host removes reset from the Intel XScale® core by writing 0 in bit [0] of IXP_RESET0 register. The Intel XScale® core starts booting from address 0, which is now directed by the gasket to DRAM. 10.5 Initialization Refer to the Intel® IXP2800 Network Processor Hardware Initialization Reference Manual for information about the initialization of all units of the IXP2800 Network Processor. 
Hardware Reference Manual 373 Intel® IXP2800 Network Processor Clocks and Reset 374 Hardware Reference Manual Intel® IXP2800 Network Processor Performance Monitor Unit Performance Monitor Unit 11.1 11 Introduction The Performance Monitor Unit (PMU) is a hardware block consisting of counters and comparators that can be programmed and controlled by using a set of configured registers to monitor and to fine tune performance of different hardware units in the IXP2800 Network Processor. The total number of such counters needed is determined based on the different events and functions that must be monitored concurrently. Observation of such events on the chip is used for statistical analysis, uncovering bottlenecks, and to tune the software to fit the hardware resources. 11.1.1 Motivation for Performance Monitors For a given set of functionality, a measure of performance is very important in making decisions on feature sets to be supported, and to tune the embedded software on the chip. An accurate estimate of latency and speed in hardware blocks enables firmware and software designers to understand the limitations of the chip and to make prudent judgments about its software architecture. The current generation does not provide any performance monitor hooks. Since IXP2800 Network Processors are targeted for high performance segments (OC-48 and above), the need for tuning the software to get the most out of the hardware resources becomes extremely critical. The performance monitors provide valuable insight into the chip by providing real-time data on latency and utilization of various resources. See Figure 136 for the Performance Monitor Interface Block Diagram. Hardware Reference Manual 375 Intel® IXP2800 Network Processor Performance Monitor Unit Figure 136. Performance Monitor Interface Block Diagram APB Bus for Read/Write of CHAP Registers Event Multiplexer Control to Hardware Blocks Events from Hardware Block A Status Conditions for Interrupts from CHAP Counter 0 Events from Hardware Block B Performance Monitoring Unit Events from Hardware Block C Status Conditions for Interrupts from CHAP Counter 1 Status Conditions for Interrupts from CHAP Counter N-2 Events from Hardware Block D 11.1.2 Status Conditions for Interrupts from CHAP Counter N-1 Motivation for Choosing CHAP Counters The Chipset Hardware Architecture Performance (CHAP) counters enable statistics gathering of internal hardware events in real-time. This implementation provides users with direct event counting and timing for performance monitoring purposes, and provides enough visibility into the internal architecture to perform utilization studies and workload characterization. This implementation can also be used for chipset validation, higher-performing future chipsets, and applications tuned to the current chipset. The goal is that this will benefit both internal and external hardware and software development. The primary motivation for selecting the CHAP architecture for use in the IXP2800 Network Processor product family is that it has been designed and validated in several Intel desktop chipsets and the framework also provides a software suite that may be reused with little modification. 376 Hardware Reference Manual Intel® IXP2800 Network Processor Performance Monitor Unit 11.1.3 Functional Overview of CHAP Counters At the heart of the CHAP counter’s functionality are counters, each with associated registers. Each counter has a corresponding command, event, status, and data register. 
The smallest implementation has two counters, but if justified for a particular product, this architecture can support many more counters. The primary consideration is available silicon area. The memory-mapped space currently defined can accommodate registers for 256 counters. It can be configured for more, but that is beyond what is currently practical.

Signals that represent events from throughout the chip are routed to the CHAP unit. Software can select the events that are recorded during a measurement session. The number of counters in an implementation defines the number of events that can be recorded simultaneously. Software and hardware events can control the starting, stopping, and sampling of the counters. This can be done in a time-based (polling) or an event-based fashion. Each counter can be incremented or decremented by different events. In addition to simple counting of events, the unit can provide data for histograms, queue analysis, and conditional event counting (for example, the number of times that event A happens before the first event B takes place). When a counter is sampled, the current value of the counter is latched into the corresponding data register. The command, event, status, and data registers are accessible via standard Advanced Peripheral Bus (APB) memory-mapped registers, to facilitate high-speed sampling.

Two optional external pins allow for external visibility and control of the counters. The output pin signals that one of the following conditions generated an interrupt from any one of the counters:
• A programmable threshold condition was true.
• A command was triggered to begin.
• A counter overflow or underflow occurred.
The input pin allows an external source to control when a CHAP command is executed.

Figure 137 represents a single counter block. The multiplexers, registers, and all other logic are repeated for each counter that is present. There is a threshold event from each counter block that feeds into each multiplexer.

Figure 137. Block Diagram of a Single CHAP Counter (signals from the internal units and the external input event pass through event preconditioning and are selected as increment and decrement events for the 32-bit counter; the command, status, and event registers and control logic, the command trigger, and the threshold comparison against the data register are reached over the register access bus)

11.1.4 Basic Operation of the Performance Monitor Unit

At power-up, the Intel XScale® core invokes the performance monitoring software code. The PMU software has the application code to generate different types of data, such as histograms and graphs. It also has a device driver to configure and read data from the PMU in the IXP2800 Network Processor. This software programs the configuration registers in the PMU block to perform a certain set of monitoring and data collection. The PMU CHAP counters execute the commands programmed by the Intel XScale® core and collect various types of data, such as latency and counts. Upon collection, the PMU triggers an interrupt to the Intel XScale® core to indicate the completion of monitoring. The Intel XScale® core either periodically polls the PMU registers or waits for an interrupt to collect the observed data. The Intel XScale® core uses the APB to communicate with the PMU configuration registers. A programming sketch of this flow is shown below; Figure 138 represents a block diagram of the IXP2800 Network Processor and the Performance Monitor Unit (PMU) in relation to the other hardware blocks in the chip.
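The sketch below illustrates this flow as it might look from the Intel XScale® core side. It is a sketch under stated assumptions: the word offsets within a counter block, the command encodings, the example event code, and the pmu_base parameter are hypothetical placeholders; only the overall flow (select an event, issue commands, then read the latched data register over the APB) follows the description above.

    #include <stdint.h>
    #include <stddef.h>

    /* Register indices within one counter block (command, event, status, data);
     * the actual offsets are placeholders. */
    enum { CHAP_COMMAND = 0, CHAP_EVENT = 1, CHAP_STATUS = 2, CHAP_DATA = 3 };
    #define CHAP_BLOCK_WORDS 4

    /* Hypothetical command encodings (placeholders only). */
    #define CHAP_CMD_START   0x1u
    #define CHAP_CMD_SAMPLE  0x2u

    /* pmu_base is the APB-mapped base of the counter array, relative to the
     * Memory Base Address that the configuration software sets in the PMUADR
     * register (see Section 11.2.3). */
    uint32_t chap_measure_once(volatile uint32_t *pmu_base, int counter,
                               uint32_t event_code)
    {
        volatile uint32_t *blk = pmu_base + (size_t)counter * CHAP_BLOCK_WORDS;

        blk[CHAP_EVENT]   = event_code;       /* select the increment event source */
        blk[CHAP_COMMAND] = CHAP_CMD_START;   /* begin counting                    */

        /* ... run the workload of interest, or wait for the PMU interrupt that
         * signals a threshold condition, command trigger, or overflow/underflow. */

        blk[CHAP_COMMAND] = CHAP_CMD_SAMPLE;  /* latch the count into the data register */
        return blk[CHAP_DATA];                /* read the latched value over the APB    */
    }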
Figure 138. Basic Block Diagram of IXP2800 Network Processor with PMU (the PMU and its configuration registers sit on the APB bus alongside the SHaC, media interface, PCI interface, QDR and DDRAM controllers, the Microengines, and the Intel XScale® core; hardware events are routed to the PMU over the push/pull bus structure, and multiplexer control is driven from the PMU)

11.1.5 Definition of CHAP Terminology

Duration Count: The counter is incremented for each clock for which the event signal is asserted as logic high.
MMR: Memory Mapped Register.
OA: Observation Architecture. The predecessor to CHAP counters that facilitates the counting of hardware events.
Occurrence Count: The counter is incremented each time a rising edge of the event signal is detected.
Preconditioning: Altering a design block signal that represents an event such that it can be counted by the CHAP unit. The most common preconditioning is likely to be a 'one-shot' to count occurrences.
RO (register): Read Only. If a register is read-only, writes to this register have no effect.
R/W (register): Read/Write. A register with this attribute can be read and written.
WO (register): Write Once. Once written, a register with this attribute becomes Read Only. This register can only be cleared by a Reset.
WC (register): Write Clear. A register bit with this attribute can be read and written. However, a write of 1 clears (sets to 0) the corresponding bit and a write of 0 has no effect.

11.1.6 Definition of Clock Domains

The following abbreviations are used in the events tables under clock domain.

P_CLK: The Command Push/Pull Clock, also known as the Chassis clock. This clock is derived from the Microengine (ME) clock and is one-half of the Microengine clock.
T_CLK: Microengine Clock.
MTS_CLK: MSF Flow Control Status LVTTL Clock TS_CLK.
MRX_CLK: MSF Flow Control Receive LVDS Clock RX_CLK.
MR_CLK: MSF Receive Data Clock R_CLK.
MT_CLK: MSF Transmit Data Clock T_CLK.
MTX_CLK: MSF Flow Control Transmit LVDS Clock TX_CLK.
D_CLK: DRAM Clock.
S_CLK: SRAM Clock.
APB_CLK: Advanced Peripheral Bus Clock.

11.2 Interface and CSR Description

CAP is a standard logic block provided as part of the Network Processor that interfaces to the ARM APB. This bus supports standard APB peripherals such as the PMU, UART, Timers, and GPIO, as well as CSRs that do not need to be accessed by the Microengines. As shown in Figure 139, CAP uses three bus interfaces to support these modes. CAP supports a target ID of 0101, which Microengine assemblers should identify as a CSR instruction.

Figure 139. CAP Interface to the APB (command bus masters such as the Microengines and the Intel XScale® core gasket reach CAP over the command and push/pull buses; CAP in turn drives the APB bus to the APB peripherals and the CAP CSR bus to the standard and fast CSRs)

Table 151 shows the Intel XScale® core and Microengine instructions used to access devices on these buses, and it shows which buses are used during the operation. For example, to read an APB peripheral such as a UART CSR, a Microengine would execute a csr[read] instruction and the Intel XScale® core would execute a Load (ld) instruction. Data is then moved between the CSR and the Intel XScale® core/Microengine by first reading the CSR via the APB and then writing the result to the Intel XScale® core/Microengine via the Push Bus.
Table 151. APB Usage

Accessing: APB Peripheral
Read Operation: Access Method: Microengine: csr[read]; Intel XScale® core: ld. Bus Usages: read source: APB; write destination: Push bus.
Write Operation: Access Method: Microengine: csr[write]; Intel XScale® core: st. Bus Usages: read source: Pull Bus; write destination: APB.

11.2.1 APB Peripheral

The APB is part of the Advanced Microcontroller Bus Architecture (AMBA) hierarchy of buses, which is optimized for minimal power consumption and reduced design complexity. The PMU needs to operate as an APB peripheral, interfacing with the rest of the chip via the APB. The PMU needs to have an APB interface unit, which can perform APB reads and writes to enable data transfer to and from the PMU registers.

11.2.2 CAP Description

11.2.2.1 Selecting the Access Mode

The CAP selects the appropriate access mode based on the COMMAND and ADDRESS fields from the Command Bus.

11.2.2.2 PMU CSR

Refer to the Intel® IXP2400 and IXP2800 Network Processor Programmer's Reference Manual.

11.2.2.3 CAP Writes

For an APB write, CAP arbitrates for the S_Pull_Bus, pulls the write data from the source identified in PP_ID (either a Microengine transfer register or the Intel XScale® core write buffer), and puts it into the CAP Pull Data FIFO. It then drives the address and write data onto the appropriate bus. CAP CSRs locally decode the address to match their own. CAP generates a separate APB device select signal for each CAP device (up to 15 devices). If the write is to an APB CSR, the Control Logic maintains valid signaling until the APB_RDY_H signal is returned. (The APB RDY signal is an extension to the APB specification specifically added for the Network Processor.) CAP supports write operations with burst counts greater than 1. CAP looks at the length field on the command bus and breaks each count into a separate APB write cycle, incrementing the CSR number for each bus access.

11.2.2.4 CAP Reads

For an APB read, CAP drives the address, write, select, and enable signals, and waits for the acknowledge signal (APB_RDY_H) from the APB device. For a CAP CSR read, CAP drives the address, which controls a tree of multiplexers to select the appropriate CSR. CAP then waits for the acknowledge signal (CAP_CSR_RD_RDY). When the data is returned, CAP puts the read data into the Push Data FIFO, arbitrates for the S_Push_Bus, and then the Push/Pull Arbiter pushes the data to the destination identified in PP_ID.

11.2.3 Configuration Registers

Because the CHAP unit resides on the APB, the offset associated with each of these registers is relative to the Memory Base Address that the configuration software sets in the PMUADR register. Each counter has one command, one event, one status, and one data register associated with it. Each counter is "packaged" with these four registers in a "counter block". Each implementation selects the number of counters it will implement, and therefore how many counter blocks (or slices) it will have. These registers are numbered 0 through N-1, where N is the number of counters. See Figure 140.

Figure 140.
Conceptual Diagram of Counter Array Event Signals Counter Block 0 Counter Block 1 Counter Block 2 Counter Counter Counter Command Register Command Register Command Register Events Register Events Register Events Register Status Register Status Register Status Register Data Register Data Register Data Register Register Interface 11.3 Performance Measurements There are several measurements that can be made on each of the hardware blocks. These measurements together would enable improvements in hardware and software implementation and architectural issues. Table 152 describes the different blocks and their associated performance measurement events. 382 Hardware Reference Manual Intel® IXP2800 Network Processor Performance Monitor Unit Table 152. Hardware Blocks and Their Performance Measurement Events (Sheet 1 of 2) Hardware Block Performance Measurement Event Description Intel XScale® Core DRAM Read Head of Queue Latency Histogram The Intel XScale® core generates a read or write command to the DRAM primarily to either push or pull data of the DDRAM. These commands are scheduled to the DRAM through the push-pull arbiter through a command FIFO in the gasket. The DRAM-read head of queue enables the PMU to monitor when the read and write commands posted by the Intel XScale® core in the gasket gets fetched and delivered to DDRAM. SRAM Read Head of Queue Latency Histogram The Intel XScale® core generates a read or write command to the SRAM primarily to either push or pull data of the SRAM. These commands are scheduled to the SRAM through the push-pull arbiter through a command FIFO in the gasket. The SRAM-read head of queue enables the PMU to monitor when the read and write commands posted by the Intel XScale® core in the gasket gets fetched and delivered to SRAM. Interrupts Number of interrupts seen. Histogram of time between interrupts. Microengines Command FIFO Number of Commands Control Store Measures These statistics give the number of the commands issued by the Microengine in a particular period of time. It also can count each different thread. Count time between two microstore locations (locations can be set by instrumentation software). Histogram time between two microstore locations (locations can be set by instrumentation software) Execution Unit Status Histogram of stall time. Histogram of aborted time. Histogram of swapped out time. Histogram of idle time. Command FIFO Head of Queue Wait Time Histogram (Latency) This is to measure the latency of a command, which is at the head of the queue and is waiting to be sent out to the destination over the chassis. SRAM Commands A count of SRAM commands received. These are maskable by command type such as Put and Get. SRAM Bytes, Cycles Busy This measurement describes the number of bytes transferred and the SRAM busy time. Queue Depth Histogram This measurement analyzes the different queues such as ordered, priority, push queue, pull queue, read lock fail, and HW queues, and provides information about utilization. DRAM Commands This measurement lists the total commands issued to the DRAM, and they can be counted based on command type and error type. DRAM Bytes, Cycles Busy This measurement indicates the DRAM busy time and bytes transferred. Maskable by Read/Write, Microengine, PCI, or the Intel XScale® Core This measurement indicates the different accesses that are initiated to the DRAM. These measurements could be for all the accesses to the memory or can be masked using a specific source such as PCI, the Intel XScale® core, or Microengine. 
This can further be measured based on read or write cycles. SRAM DRAM Hardware Reference Manual 383 Intel® IXP2800 Network Processor Performance Monitor Unit Table 152. Hardware Blocks and Their Performance Measurement Events (Sheet 2 of 2) Hardware Block Performance Measurement Event Description Chassis/Push-Pull Command Bus Utilization These statistics give the number of the command requests issued by the different Masters in a particular period of time. This measurement also indicates how long it takes to issue the grant from the request being issued by the different Masters. Push and Pull Bus Utilization This measurement keeps track of the number of accesses issued and how long it takes to send the data to its destination. Number of Accesses by Command Type This measurement indicates the number of hash accesses issued; this count is maskable, based on command type. Latency of Histogram This monitors the latency through each of the HASH queues. Number of Accesses by Command Type This measurement indicates the number of Scratch accesses issued and this count is maskable, based on command type. Number of Bytes Transfer This measurement indicates total number of bytes transferred to or from Scratch. Latency of Histogram This measurement indicates the latency of performing read or write from the Scratch. Latency in command executions may also be measured. Master Accesses These statistics give the number of Master accesses that were generated by the PCI blocks. This measurement can be counted based on individual command type. Slave Accesses These statistics give the number of Slave accesses that were generated by the PCI blocks. This measurement can be counted based on individual command type. Master/Slave Read Byte Count This statistics give the total number of bytes of data that were generated by the PCI Master/Slave reads access. This measurement can be counted based on individual command type. Master/Slave Write Byte Count These statistics give the total number of bytes of data that were generated by the PCI Master/Slave write accesses. This measurement can be counted based on individual command type. Burst Size Histogram These statistics give a histogram of the number of various burst sizes. TBUF Occupancy Histogram This measurement shows the occupancy rate at different depths of the FIFO. This can help in better utilization of TBUF. RBUF Occupancy Histogram This measurement shows the occupancy rate at different depths of the FIFO. This can help in better utilization of RBUF. Packet/Cell/Frame Count on a PerPort Basis This measurement gives the count of number of packets or cells or frames transferred in Transmitting mode. This measurement gives the count of number of packets or cells or frames transferred in the receiving mode. This may be measured using a per-port basis. Inter-arrival Time for Packets on a Per-Port Basis This measurement can provide information on gaps between packets, thereby indicating effective line rate. Burst Size Histogram This measurement gives the various burst sizes of packets being transmitted and received. Hash Scratch PCI Media Interface 384 Hardware Reference Manual Intel® IXP2800 Network Processor Performance Monitor Unit 11.4 Events Monitored in Hardware Tables in this section describe the events that can be measured, including the name of the event and the Event Selection Code (ESC). Refer to Section 11.4 for tables showing event selection codes. 
The acronyms in the event names typically represent unit names. The guidelines for which events a particular component must implement are provided in the following sections.

11.4.1 Queue Statistics Events

11.4.1.1 Queue Latency

Queue latency is an indicator of control logic performance: how effectively the commands in the Control/Command queue are executed, or how effectively the control logic transfers data from the Data Queue. This kind of monitoring requires observation of specific events such as:
• Enqueue into the Queue. This event indicates when an entry was made into the queue.
• Dequeue from the Queue. This event indicates when an entry was removed from the queue. The time period between when a particular entry was made into the queue and when that entry was removed from the queue indicates the latency of the queue for that entry.
• Queue Full Event. This event indicates when the queue has no room for additional entries.
• Queue Empty Event. This event indicates when the queue has no entries.
Queue Full and Queue Empty events can be used to determine queue utilization and the bandwidth available in the queue, to determine how to handle more traffic.

11.4.1.2 Queue Utilization

Queue utilization is determined by observing the percentage of time each queue is operating at a particular threshold level. Based on queue size, multiple threshold values can be predetermined and monitored. The result of these observations can be used to provide histograms of queue utilization. This kind of observation helps make better use of the available resources in the queue.

11.4.2 Count Events

11.4.2.1 Hardware Block Execution Count

On each of the hardware blocks, events of importance, such as the number of commands executed, the number of bytes transferred, the total number of clocks for which a block is free, and the total amount of time all of the contexts in the Microengine were idle, can be counted as statistics for managing the available resources.

11.4.3 Design Block Select Definitions

Once an event is defined, its definition must remain consistent between products. If the definition changes, it should have a new event selection code. This document contains the master list of all ESCs in all CHAP-enabled products. Not all of the ESCs in this document are listed in numerical order. The recommendation is to group similar events within the following ESC ranges. See Table 153.

Table 153.
PMU Design Unit Selection (Sheet 1 of 2) Target Device Target ID PMU Design Group Block # Null xxx xxx 0000 Description Null (False) Event CHAP Counters Internal Threshold Events Event bit 0 CHAP Counter 0 PMU_Counter xxx xxx 0001 (PMU) Event bit 1 CHAP Counter 1 Event bit 2 CHAP Counter 2 Event bit 3 CHAP Counter 3 Event bit 4 CHAP Counter 4 Event bit 5 CHAP Counter 5 SRAM Group SRAM_DP1 001 001 SRAM_DP0 001 010 SRAM_CH3 001 011 SRAM_CH2 001 100 SRAM_CH1 001 101 SRAM_CH0 001 110 SRAM channel 0 0010 SRAM channel 1 (SRAM Group) SRAM channel 2 one and only one will be selected from same group SRAM channel 3 SRAM d-push SRAM d-pull DRAM Group 386 DRAM_CR1 010 000 DRAM_CR0 010 001 DRAM_DPLA 010 010 DRAM_DPSA 010 011 DRAM_CH2 010 100 DRAM_CH1 010 101 DRAM_CH0 010 110 XPI 000 001 SHaC 000 010 0101 MSF 000 011 0110 0011 (DRAM) one and only one will be selected from same group 0100 (XPI) DRAM channel 0 DRAM channel 1 DRAM channel 2 DRAM d-push DRAM d-pull XPI Media Hardware Reference Manual Intel® IXP2800 Network Processor Performance Monitor Unit Table 153. PMU Design Unit Selection (Sheet 2 of 2) Target Device Target ID PMU Design Group Block # Description Intel XScale® core 000 100 0111 Intel XScale® core PCI 000 101 1000 PCI ME Cluster 0 Group ME07 100 111 ME06 100 110 ME05 100 101 ME04 100 100 ME03 100 011 ME02 100 010 ME01 100 001 ME00 100 000 ME17 110 111 ME16 110 110 ME15 110 101 ME14 110 100 ME13 110 011 ME12 110 010 ME11 110 001 ME10 110 000 1001 (MEC0) one and only one will be selected from same group ME Channel 0 ME00 ME01 ME02 ME03 ME04 ME05 ME06 ME07 ME Cluster 1 Group 1010 (MEC1) one and only one will be selected from same group 1011-1111 11.4.4 ME Channel 1 ME10 ME11 ME12 ME13 ME14 ME15 ME16 ME17 Reserved Null Event Not an actual event. When used as an increment or decrement event, no action takes place. When used as a Command Trigger, it causes the command to be triggered immediately after the command register is written to by the software. Also called False Event. Not reserved. Hardware Reference Manual 387 Intel® IXP2800 Network Processor Performance Monitor Unit 11.4.5 Threshold Events These are the outputs of the threshold comparators. When the value in a data register is compared to its corresponding counter value and the condition is true, a threshold event is generated. This results in: • A pulse on the signal lines that are routed to the event’s input port (one signal line from each comparator). • One piece of functionality this enables is to allow for CHAP commands to be completed only when a Threshold Event occurs. In other words, a Threshold Event can be used as a Command Trigger to control the execution of any CHAP command (start, stop, sample, etc.). See Table 154. Table 154. 
Chap Counter Threshold Events (Design Block # 0001) 388 Clock Domain Single pulse/ Long pulse Burst Multiplexer # Event Name Description 000 Counter 0 Threshold P_CLK single separate Threshold Condition True on Event Counter 0 001 Counter 1 Threshold P_CLK single separate Threshold Condition True on Event Counter 1 010 Counter 2 Threshold P_CLK single separate Threshold Condition True on Event Counter 2 011 Counter 3 Threshold P_CLK single separate Threshold Condition True on Event Counter 3 100 Counter 4 Threshold P_CLK single separate Threshold Condition True on Event Counter 4 101 Counter 5 Threshold P_CLK single separate Threshold Condition True on Event Counter 5 Hardware Reference Manual Intel® IXP2800 Network Processor Performance Monitor Unit 11.4.6 External Input Events 11.4.6.1 XPI Events Target ID(000001) / Design Block #(0100) Table 155. XPI PMU Event List (Sheet 1 of 4) Event Number Event Name Clock Domain Single pulse/ Long pulse Burst Description 0 XPI_RD_P APB_CLK single separate It includes all the read accesses, PMU, timer, GPIO, UART, and Slowport. 1 XPI_WR_P APB_CLK single separate It includes all the write accesses, PMU, timer, GPIO, UART, and Slowport. 2 PMU_RD_P APB_CLK single separate It executes the read access to the PMU unit. 3 PMU_WR_P APB_CLK single separate It executes the write access to the PMU unit. 4 UART_RD_P APB_CLK single separate It executes the read access to the UART unit. 5 UART_WR_P APB_CLK single separate It executes the write access to the UART unit. 6 GPIO_RD_P APB_CLK single separate It executes the read access to the GPIO unit. 7 GPIO_WR_P APB_CLK single separate It executes the write access to the GPIO unit. 8 TIMER_RD_P APB_CLK single separate It executes the read access to the Timer unit. 9 TIMER_WR_P APB_CLK single separate It executes the write access to the Timer unit. 10 SPDEV_RD_P APB_CLK single separate It executes the read access to the Slowport Device. 11 SPDEV_WR_P APB_CLK single separate It executes the write access to the Slowport Device. 12 SPCSR_RD_P APB_CLK single separate It executes the read access to the Slowport CSR. 13 SPCSR_WR_P APB_CLK single separate It executes the write access to the Slowport CSR. 14 TM0_UF_P APB_CLK single separate It shows the occurrence of timer 1 counter underflow. 15 TM1_UF_P APB_CLK single separate It shows the occurrence of timer 2 counter underflow. 16 TM2_UF_P APB_CLK single separate It shows the occurrence of timer 3 counter underflow. 17 TM3_UF_P APB_CLK single separate It shows the occurrence of timer 4 counter underflow. 18 IDLE0_0_P APB_CLK single separate It displays the idle state of the state machine 0 for the mode 0 of Slowport. 19 START0_1_P APB_CLK single separate It enters the start state of the state machine 0 for the mode 0 of Slowport. 20 ADDR10_3_P APB_CLK single separate It enters the first address state, AD[9:2], of the state machine 0 for the mode 0 of Slowport. 21 ADDR20_2_P APB_CLK single separate It enters the second address state, AD[17:10], of the state machine 0 for the mode 0 of Slowport. 22 ADDR30_6_P APB_CLK single separate It enters the third address state, AD[24:18], of the state machine 0 for the mode 0 of Slowport. 23 SETUP0_4_P APB_CLK single separate It enters data setup state of the state machine 0 for the mode 0 of Slowport. 24 PULW0_5_P APB_CLK single separate It enters data duration state of the state machine 0 for the mode 0 of Slowport. 25 HOLD0_D_P APB_CLK single separate It enters data hold state of the state machine 0 for the mode 0 of Slowport. 
Hardware Reference Manual 389 Intel® IXP2800 Network Processor Performance Monitor Unit Table 155. XPI PMU Event List (Sheet 2 of 4) 390 26 TURNA0_C_P APB_CLK single separate It enters the termination state of the state machine 0 for the mode 0 of Slowport. 27 IDLE1_0_P APB_CLK single separate It displays the idle state of the state machine 1 for the mode 1 of Slowport. 28 START1_1_P APB_CLK single separate It enters the start state of the state machine 1 for the mode 1 of Slowport. 29 ADDR11_3_P APB_CLK single separate It enters the first address state, AD[7:0], of the state machine 1 for the mode 1 of Slowport. 30 ADDR21_2_P APB_CLK single separate It enters the second address state, AD[15:8], of the state machine 1 for the mode 1 of Slowport. 31 ADDR31_6_P APB_CLK single separate It enters the second address state, AD[23:16], of the state machine 1 for the mode 1 of Slowport. 32 ADDR41_7_P APB_CLK single separate It enters the second address state, AD[24], of the state machine 1 for the mode 1 of Slowport. 33 WRDATA1_5_P APB_CLK single separate It unpacks the data from the APB onto the Slowport bus for the state machine 1 for the mode 1 of Slowport. 34 PULW1_4_P APB_CLK single separate It enters the pulse width of the data transaction cycle for the state machine 1 for the mode 1 of Slowport. 35 CHPSEL1_C_P APB_CLK single separate It enters the chip select assertion pulse width when the state machine 1 is active for the mode 1 of Slowport. 36 OUTEN1_E_P APB_CLK single separate It enters the cycle when the OE is asserted during running on the state machine 1 for the mode 1 of Slowport. 37 PKDATA1_F_P APB_CLK single separate It enters the read data packing state when the state machine 1 is active for the mode 1 of Slowport. 38 LADATA1_D_P APB_CLK single separate It enters the data capturing cycle when the state machine 1 is active for the mode 1 of Slowport. 39 READY1_9_P APB_CLK single separate It enters the acknowledge state to terminate the read cycle when the state machine 1 is active for the mode 1 of Slowport. 40 TURNA1_8_P APB_CLK single separate It enters the turnaround state of the transaction when the state machine 1 is active for the mode 1 of Slowport. 41 IDLE2_0_P APB_CLK single separate It displays the idle state of the state machine 2 for the mode 2 of Slowport. 42 START2_1_P APB_CLK single separate It enters the start state of the state machine 2 for the mode 2 of Slowport. 43 ADDR12_3_P APB_CLK single separate It enters the first address state, AD[7:0], of the state machine 2 for the mode 2 of Slowport. 44 ADDR22_2_P APB_CLK single separate It enters the second address state, AD[15:8], of the state machine 2 for the mode 2 of Slowport. 45 ADDR32_6_P APB_CLK single separate It enters the second address state, AD[23:16], of the state machine 2 for the mode 2 of Slowport. 46 ADDR42_7_P APB_CLK single separate It enters the second address state, AD[24], of the state machine 2 for the mode 2 of Slowport. 47 WRDATA2_5_P APB_CLK single separate It unpacks the data from the APB onto the Slowport bus for the state machine 2 for the mode 2 of Slowport. Hardware Reference Manual Intel® IXP2800 Network Processor Performance Monitor Unit Table 155. XPI PMU Event List (Sheet 3 of 4) 48 SETUP2_4_P APB_CLK single separate It enters the pulse width of the data transaction cycle for the state machine 2 for the mode 2 of Slowport. 49 PULW2_C_P APB_CLK single separate It enters the pulse width of the data transaction cycle for the state machine 2 for the mode 2 of Slowport. 
50 HOLD2_E_P APB_CLK single separate It enters the data hold period for the state machine 2 for the mode 2 of Slowport. 51 OUTEN2_F_P APB_CLK single separate It starts to assert the OE when the state machine 2 is active for the mode 2 of Slowport. 52 PKDATA2_D_P APB_CLK single separate It enters the read data packing state during the active state machine 2 for the mode 2 of Slowport. 53 LADATA2_9_P APB_CLK single separate It enters the data capturing cycle during the active state machine 2 for the mode 2 of Slowport. 54 READY2_B_P APB_CLK single separate It enters the acknowledge state to terminate the read cycle when the state machine 2 is active for the mode 2 of Slowport. 55 TURNA2_8_P APB_CLK single separate It enters the turnaround state of the transaction when the state machine 2 is active for the mode 2 of Slowport. 56 IDLE3_0_P APB_CLK single separate It displays the idle state of the state machine 3 for the mode 3 of Slowport. 57 START3_1_P APB_CLK single separate It enters the start state of the state machine 3 for the mode 3 of Slowport. 58 ADDR13_3_P APB_CLK single separate It enters the first address state, AD[7:0], of the state machine 3 for the mode 3 of Slowport. 59 ADDR23_2_P APB_CLK single separate It enters the second address state, AD[15:8], of the state machine 3 for the mode 3 of Slowport. 60 ADDR33_6_P APB_CLK single separate It enters the second address state, AD[23:16], of the state machine 3 for the mode 3 of Slowport. 61 ADDR43_7_P APB_CLK single separate It enters the second address state, AD[24], of the state machine 3 for the mode 3 of Slowport. 62 WRDATA3_5_P APB_CLK single separate It unpacks the data from the APB onto the Slowport bus for the state machine 3 for the mode 3 of Slowport. 63 SETUP3_4_P APB_CLK single separate It enters the pulse width of the data transaction cycle for the state machine 3 for the mode 3 of Slowport. 64 PULW3_C_P APB_CLK single separate It enters the pulse width of the data transaction cycle for the state machine 3 for the mode 3 of Slowport. 65 HOLD3_E_P APB_CLK single separate It enters the data hold period for the state machine 3 for the mode 3 of Slowport. 66 OUTEN3_F_P APB_CLK single separate It starts to assert the OE when the state machine 3 is active for the mode 3 of Slowport. 67 PKDATA3_D_P APB_CLK single separate It enters the read data packing state during the active state machine 3 for the mode 3 of Slowport. 68 LADATA3_B_P APB_CLK single separate It enters the data capturing cycle during the active state machine 3 for the mode 3 of Slowport. 69 READY3_9_P APB_CLK single separate It enters the acknowledge state to terminate the read cycle when the state machine 3 is active for the mode 3 of Slowport. Hardware Reference Manual 391 Intel® IXP2800 Network Processor Performance Monitor Unit Table 155. XPI PMU Event List (Sheet 4 of 4) 392 70 TURNA3_8_P APB_CLK single separate It enters the turnaround state of the transaction when the state machine 3 is active for the mode 3 of Slowport. 71 IDLE4_0_P APB_CLK single separate It displays the idle state of the state machine 4 for the mode 4 of Slowport. 72 START4_1_P APB_CLK single separate It enters the start state of the state machine 4 for the mode 4 of Slowport. 73 ADDR14_3_P APB_CLK single separate It enters the first address state, AD[7:0], of the state machine 4 for the mode 4 of Slowport. 74 ADDR24_2_P APB_CLK single separate It enters the second address state, AD[15:8], of the state machine 4 for the mode 4 of Slowport. 
75 ADDR34_6_P APB_CLK single separate It enters the second address state, AD[23:16], of the state machine 4 for the mode 4 of Slowport. 76 ADDR44_7_P APB_CLK single separate It enters the second address state, AD[24], of the state machine 4 for the mode 4 of Slowport. 77 WRDATA4_5_P APB_CLK single separate It unpacks the data from the APB onto the Slowport bus for the state machine 4 for the mode 4 of Slowport. 78 SETUP4_4_P APB_CLK single separate It enters the pulse width of the data transaction cycle for the state machine 4 for the mode 4 of Slowport. 79 PULW4_C_P APB_CLK single separate It enters the pulse width of the data transaction cycle for the state machine 4 for the mode 4 of Slowport. 80 HOLD4_E_P APB_CLK single separate It enters the data hold period for the state machine 4 for the mode 4 of Slowport. 81 OUTEN4_F_P APB_CLK single separate It starts to assert the OE when the state machine 4 is active for the mode 4 of Slowport. 82 PKDATA4_D_P APB_CLK single separate It enters the read data packing state during the active state machine 4 for the mode 4 of Slowport. 83 LADATA4_B_P APB_CLK single separate It enters the data capturing cycle during the active state machine 4 for the mode 4 of Slowport. 84 READY4_9_P APB_CLK single separate It enters the acknowledge state to terminate the read cycle when the state machine 4 is active for the mode 4 of Slowport. 85 TURNA4_8_P APB_CLK single separate It enters the turnaround state of the transaction when the state machine 4 is active for the mode 4 of Slowport. Hardware Reference Manual Intel® IXP2800 Network Processor Performance Monitor Unit 11.4.6.2 SHaC Events Target ID(000010) / Design Block #(0101) Table 156. SHaC PMU Event List (Sheet 1 of 4) Event Number 0 Event Name Scratch Cmd_Inlet_Fifo Not_Empty Clock Domain Single pulse/ Long pulse Burst P_CLK single separate Scratch Command Inlet FIFO Not Empty Description 1 Scratch Cmd_Inlet_Fifo Full P_CLK single separate Scratch Command Inlet FIFO Full 2 Scratch Cmd_Inlet_Fifo Enqueue P_CLK single separate Scratch Command Inlet FIFO Enqueue 3 Scratch Cmd_Inlet_Fifo Dequeue P_CLK single separate Scratch Command Inlet FIFO Dequeue 4 Scratch Cmd_Pipe Not_Empty P_CLK single separate Scratch Command Pipe Not Empty 5 Scratch Cmd_Pipe Full P_CLK single separate Scratch Command Pipe Full 6 Scratch Cmd_Pipe Enqueue P_CLK single separate Scratch Command Pipe Enqueue 7 Scratch Cmd_Pipe Dequeue P_CLK single separate Scratch Command Pipe Dequeue 8 Scratch Pull_Data_Fifo 0 Full P_CLK single separate Scratch Pull Data FIFO Cluster 0 Full 9 Scratch Pull_Data_Fifo 1 Full P_CLK single separate Scratch Pull Data FIFO Cluster 1 Full 10 Hash Pull_Data_Fifo 0 Full P_CLK single separate Hash Pull Data FIFO Cluster 0 Full 11 Hash Pull_Data_Fifo 1 Full P_CLK single separate Hash Pull Data FIFO Cluster 1 Full 12 Scratch Pull_Data_Fifo 0 Not_Empty P_CLK single separate Scratch Pull Data FIFO Cluster 0 Not Empty 13 Scratch Pull_Data_Fifo 0 Enqueue P_CLK single separate Scratch Pull Data FIFO Cluster 0 Enqueue 14 Scratch Pull_Data_Fifo 0 Dequeue P_CLK single separate Scratch Pull Data FIFO Cluster 0 Dequeue 15 Scratch Pull_Data_Fifo 1 Not_Empty P_CLK single separate Scratch Pull Data FIFO Cluster 1 Not Empty 16 Scratch Pull_Data_Fifo 1 Enqueue P_CLK single separate Scratch Pull Data FIFO Cluster 1 Enqueue 17 Scratch Pull_Data_Fifo 1 Dequeue P_CLK single separate Scratch Pull Data FIFO Cluster 1 Dequeue 18 Scratch State Machine Idle P_CLK single separate Scratch State Machine Idle 19 Scratch RAM Write P_CLK single 
separate Scratch RAM Write 20 Scratch RAM Read P_CLK single separate Scratch RAM Read 21 Scratch Ring_0 Status Hardware Reference Manual P_CLK single separate If SCRATCH_RING_BASE_x[26] = 1, RING_0_STATUS indicates empty. If SCRATCH_RING_BASE_x[26] = 0, RING_0_STATUS indicates full. 393 Intel® IXP2800 Network Processor Performance Monitor Unit Table 156. SHaC PMU Event List (Sheet 2 of 4) 22 23 Scratch Ring_2 Status P_CLK P_CLK single single separate separate 24 Scratch Ring_3 Status P_CLK single separate 25 Scratch Ring_4 Status P_CLK single separate 26 27 Scratch Ring_5 Status Scratch Ring_6 Status P_CLK P_CLK single single separate separate 28 Scratch Ring_7 Status P_CLK single separate 29 Scratch Ring_8 Status P_CLK single separate 30 31 Scratch Ring_9 Status Scratch Ring_10 Status P_CLK P_CLK single single separate separate 32 Scratch Ring_11 Status P_CLK single separate 33 Scratch Ring_12 Status P_CLK single separate 34 394 Scratch Ring_1 Status Scratch Ring_13 Status P_CLK single separate If SCRATCH_RING_BASE_x[26] = 1, RING_1_STATUS indicates empty. If SCRATCH_RING_BASE_x[26] = 0, RING_1_STATUS indicates full. If SCRATCH_RING_BASE_x[26] = 1 RING_2_STATUS indicates empty. If SCRATCH_RING_BASE_x[26] = 0 RING_2_STATUS indicates full. If SCRATCH_RING_BASE_x[26] = 1 RING_3_STATUS indicates empty. If SCRATCH_RING_BASE_x[26] = 0 RING_3_STATUS indicates full. If SCRATCH_RING_BASE_x[26] = 1 RING_4_STATUS indicates empty. If SCRATCH_RING_BASE_x[26] = 0, RING_4_STATUS indicates full. If SCRATCH_RING_BASE_x[26] = 1, RING_5_STATUS indicates empty. If SCRATCH_RING_BASE_x[26] = 0, RING_5_STATUS indicates full. If SCRATCH_RING_BASE_x[26] = 1, RING_6_STATUS indicates empty. If SCRATCH_RING_BASE_x[26] = 0, RING_6_STATUS indicates full. If SCRATCH_RING_BASE_x[26] = 1, RING_7_STATUS indicates empty. If SCRATCH_RING_BASE_x[26] = 0, RING_7_STATUS indicates full. If SCRATCH_RING_BASE_x[26] = 1, RING_8_STATUS indicates empty. If SCRATCH_RING_BASE_x[26] = 0, RING_8_STATUS indicates full. If SCRATCH_RING_BASE_x[26] = 1, RING_9_STATUS indicates empty. If SCRATCH_RING_BASE_x[26] = 0, RING_9_STATUS indicates full. If SCRATCH_RING_BASE_x[26] = 1, RING_10_STATUS indicates empty. If SCRATCH_RING_BASE_x[26] = 0, RING_10_STATUS indicates full. If SCRATCH_RING_BASE_x[26] = 1, RING_11_STATUS indicates empty. If SCRATCH_RING_BASE_x[26] = 0, RING_11_STATUS indicates full. If SCRATCH_RING_BASE_x[26] = 1, RING_12_STATUS indicates empty. If SCRATCH_RING_BASE_x[26] = 0, RING_12_STATUS indicates full. If SCRATCH_RING_BASE_x[26] = 1, RING_13_STATUS indicates empty. If SCRATCH_RING_BASE_x[26] = 0, RING_13_STATUS indicates full. Hardware Reference Manual Intel® IXP2800 Network Processor Performance Monitor Unit Table 156. SHaC PMU Event List (Sheet 3 of 4) 35 Scratch Ring_14 Status P_CLK single separate If SCRATCH_RING_BASE_x[26] = 1, RING_14_STATUS indicates empty. If SCRATCH_RING_BASE_x[26] = 0, RING_14_STATUS indicates full. If SCRATCH_RING_BASE_x[26] = 1, RING_15_STATUS indicates empty. 
36 Scratch Ring_15 Status P_CLK single separate 37 CAP CSR Write P_CLK single separate CAP CSR Write 38 CAP CSR Fast Write P_CLK single separate CAP CSR Fast Write 39 CAP CSR Read P_CLK single separate CAP CSR Read 40 DEQUEUE APB data P_CLK single separate Dequeue APB Data 41 apb_push_cmd_wph P_CLK single separate APB Push Command 42 APB_PUSH_DATA_REQ_ RPH P_CLK single separate APB Push Data Request 43 APB pull1 FIFO dequeue P_CLK single separate APB Pull Cluster 1 FIFO Dequeue 44 apb_deq_pull1_data_wph P_CLK single separate APB Pull Cluster 1 Data Dequeue 45 data valid in apb pull1 FIFO P_CLK single separate APB Pull Cluster 1 Data FIFO Valid 46 APB pull0 FIFO dequeue P_CLK single separate APB Pull Cluster 0 FIFO Dequeue 47 SCR_APB_TAKE_PULL0_ DATA_WPH P_CLK single separate APB Pull Cluster 0 Data 48 data valid in apb pull0 FIFO P_CLK single separate APB Pull Cluster 0 Data FIFO Valid 49 CAP APB read P_CLK single separate CAP APB Read 50 CAP APB write P_CLK single separate CAP APB Write 51 APB cmd dequeue P_CLK single separate APB Command Dequeue 52 APB CMD FIFO enqueue P_CLK single separate APB Command FIFO Enqueue 53 APB CMD FIFO FULL P_CLK single separate APB Command FIFO Full 54 APB CMD valid P_CLK single separate APB Command Valid 55 Hash Pull_Data_Fifo 0 Not_Empty P_CLK single separate Hash Pull Data FIFO Cluster 0 Not Empty 56 Hash Pull_Data_Fifo 0 Enqueue P_CLK single separate Hash Pull Data FIFO Cluster 0 Enqueue 57 Hash Pull_Data_Fifo 0 Dequeue P_CLK single separate Hash Pull Data FIFO Cluster 0 Dequeue 58 Hash Pull_Data_Fifo 1 Not_Empty P_CLK single separate Hash Pull Data FIFO Cluster 1 Not Empty 59 Hash Pull_Data_Fifo 1 Enqueue P_CLK single separate Hash Pull Data FIFO Cluster 1 Enqueue 60 Hash Pull_Data_Fifo 1 Dequeue P_CLK single separate Hash Pull Data FIFO Cluster 1 Dequeue 61 Hash Active P_CLK single separate Hash Active 62 Hash Cmd_Pipe Not_Empty P_CLK single separate Hash Command Pipe Not Empty Hardware Reference Manual If SCRATCH_RING_BASE_x[26] = 0, RING_15_STATUS indicates full. 395 Intel® IXP2800 Network Processor Performance Monitor Unit Table 156. SHaC PMU Event List (Sheet 4 of 4) 63 Hash Cmd_Pipe Full P_CLK single separate Hash Command Pipe Full 64 Hash Push_Data_Pipe Not_Empty P_CLK single separate Hash Push Data Pipe Not Empty 65 Hash Push_Data_Pipe Full P_CLK single separate Hash Push Data Pipe Full 11.4.6.3 IXP2800 Network Processor MSF Events Target ID(000011) / Design Block #(0110) Table 157. 
IXP2800 Network Processor MSF PMU Event List (Sheet 1 of 6) Event Number Event Name Clock Domain Pulse/ Level Burst 0 inlet command FIFO enqueue P_CLK pulse separate 1 inlet command FIFO dequeue P_CLK pulse separate 2 inlet command FIFO full P_CLK level separate 3 inlet command FIFO not empty P_CLK level separate 4 read command FIFO enqueue P_CLK pulse separate 5 read command FIFO dequeue P_CLK pulse separate 6 read command FIFO full P_CLK level separate 7 read command FIFO not empty P_CLK level separate 8 write command FIFO enqueue P_CLK pulse separate 9 write command FIFO dequeue P_CLK pulse separate 10 write command FIFO full P_CLK level separate 11 write command FIFO not empty P_CLK level separate 12 S_PULL data FIFO 0 enqueue P_CLK pulse separate 13 S_PULL data FIFO 0 dequeue P_CLK pulse separate 14 S_PULL data FIFO 0 full P_CLK level separate 15 S_PULL data FIFO 0 not empty P_CLK level separate 16 Received Data Training P_CLK level separate 17 Received Calendar Training P_CLK level separate 18 Received Flow Control Training P_CLK level separate 396 Description Hardware Reference Manual Intel® IXP2800 Network Processor Performance Monitor Unit Table 157. IXP2800 Network Processor MSF PMU Event List (Sheet 2 of 6) 19 reserved 20 S_PULL data FIFO 1 enqueue P_CLK pulse separate 21 S_PULL data FIFO 1 dequeue P_CLK pulse separate 22 S_PULL data FIFO 1 full P_CLK level separate 23 S_PULL data FIFO 1 not empty P_CLK level separate 24 Tbuffer Partition 0 full P_CLK level separate Indicates partition 0 of the tbuffer is full 25 Tbuffer Partition 1 full P_CLK level separate Indicates partition 1 of the tbuffer is full 26 Tbuffer Partition 2 full P_CLK level separate Indicates partition 2of the tbuffer is full 27 reserved 28 Rx_Thread_Freelist 0 enqueue P_CLK pulse separate 29 Rx_Thread_Freelist 0 dequeue P_CLK pulse separate 30 Rx_Thread_Freelist 0 full P_CLK level separate 31 Rx_Thread_Freelist 0 not empty P_CLK level separate 32 Rx_Thread_Freelist 1 enqueue P_CLK pulse separate 33 Rx_Thread_Freelist 1 dequeue P_CLK pulse separate 34 Rx_Thread_Freelist 1 full P_CLK level separate 35 Rx_Thread_Freelist 1 not empty P_CLK level separate 36 Rx_Thread_Freelist 2 enqueue P_CLK pulse separate 37 Rx_Thread_Freelist 2 dequeue P_CLK pulse separate 38 Rx_Thread_Freelist 2 empty P_CLK level separate 39 Rx_Thread_Freelist 2 not full P_CLK level separate 40 reserved 41 reserved 42 reserved 43 44 Detect No Calendar Detect FC_IDLE Hardware Reference Manual l MTS_CLK MRX_CLK level level separate Indicates that a framing pattern has been received on the TSTAT inputs for greater than 32 clock cycles; the valid signal from the MTS_CLK domain is synchronized; as such, it yields an approximate value. separate Indicates that an idle cycle has been received on the RXCDAT inputs for greater than 2 clock cycles; the valid signal from the MTS_CLK domain is synchronized; as such, it yields an approximate value. 397 Intel® IXP2800 Network Processor Performance Monitor Unit Table 157. IXP2800 Network Processor MSF PMU Event List (Sheet 3 of 6) 45 46 47 Detect FC_DEAD Detect C_IDLE Detect C_DEAD MRX_CLK MR_CLK MR_CLK level level level separate Indicates that a dead cycle has been received on the RXCDAT inputs for greater than 2 clock cycles; the valid signal from the MTS_CLK domain is synchronized; as such, it yields an approximate value. 
separate Indicates that an idle cycle has been received on the RDAT inputs for greater than 2 clock cycles; the valid signal from the MTS_CLK domain is synchronized; as such, it yields an approximate value. separate Indicates that a dead cycle has been received on the RDAT inputs for greater than 2 clock cycles; the valid signal from the MTS_CLK domain is synchronized; as such, it yields an approximate value. MTX_CLK level separate Indicates that the CFC input flag has been asserted for greater than 32 clock cycles; the valid signal from the MTX_CLK domain is synchronized; as such, it yields an approximate value. Rbuffer Partition 0 empty P_CLK level separate Indicates that partition 0 of the rbuffer is empty. 50 Rbuffer Partition 1 empty P_CLK level separate Indicates that partition 1 of the rbuffer is empty. 51 Rbuffer Partition 2 empty P_CLK level separate Indicates that partition 2of the rbuffer is empty. 52 Full Element List enqueue P_CLK pulse separate 53 Full Element List dequeue P_CLK pulse separate 54 Full Element List full P_CLK level separate 55 Full Element List not empty P_CLK level separate 56 Rbuffer Partition 0 full P_CLK level separate 57 Rbuffer Partition 1 full P_CLK level separate Indicates that partition 1 of the rbuffer is full. 58 Rbuffer Partition 2 full P_CLK level separate Indicates that partition 2of the rbuffer is full. 59 reserved 60 Rx_Valid[0] is set 61 Rx_Valid[8] is set P_CLK separate 62 Rx_Valid[16] is set P_CLK separate 63 Rx_Valid[24] is set P_CLK separate 64 Rx_Valid[32] is set P_CLK separate 65 Rx_Valid[48] is set P_CLK separate 66 Rx_Valid[64] is set P_CLK separate 67 Rx_Valid[96] is set P_CLK separate 68 Data CFrame received P_CLK pulse separate Indicates that the CSIX DATA state machine after the Receive input FIFO has received a CSIX DATA CFRAME. 69 Control CFrame received P_CLK pulse separate Indicates that the CSIX CONTROL state machine after the Receive input FIFO has received a CSIX CONTROL CFRAME. 48 Detect CFC sustained 49 398 P_CLK Indicates that partition 0 of the rbuffer is full. separate Hardware Reference Manual Intel® IXP2800 Network Processor Performance Monitor Unit Table 157. IXP2800 Network Processor MSF PMU Event List (Sheet 4 of 6) 70 SPI-4 Packet received 71 reserved P_CLK pulse separate Indicates that the SPI-4 state machine after the Receive input FIFO has received an SPI-4 packet. 72 Data CFrame transmitted P_CLK level separate Indicates that the transmit buffer state machine is writing a CSIX Data CFRAME into the transmit FIFO; One P_CLK cycle indicates a 32-bit write into the transmit FIFO. 73 Control CFrame transmitted P_CLK level separate Indicates the transmit buffer state machine is writing a CSIX Control CFRAME into the transmit FIFO; One P_CLK cycle indicates a 32-bit write into the transmit FIFO. 74 SPI-4 CFrame transmitted P_CLK level separate Indicates the transmit buffer state machine is writing an SPI-4 Packet into the transmit FIFO; One P_CLK cycle indicates a 32-bit write into the transmit FIFO. 75 reserved 76 Tx_Valid[0] is set P_CLK level separate 77 Tx_Valid[8] is set P_CLK level separate 78 Tx_Valid[16] is set P_CLK level separate 79 Tx_Valid[24] is set P_CLK level separate 80 Tx_Valid[32] is set P_CLK level separate 81 Tx_Valid[48] is set P_CLK level separate 82 Tx_Valid[64] is set P_CLK level separate 83 Tx_Valid[96] is set P_CLK level separate 84 Tbuffer Partition 0 empty P_CLK level separate Indicates partition 0 of the tbuffer is empty. 
85 Tbuffer Partition 1 empty P_CLK level separate Indicates partition 1 of the tbuffer is empty. 86 Tbuffer Partition 2 empty P_CLK level separate Indicates partition 2of the tbuffer is empty. 87 reserved 88 D_PUSH_DATA write to TBUF P_CLK pulse separate Each write is in units of quadwords (8 bytes). 89 S_PULL_DATA_0 write to TBUF P_CLK pulse separate Each write is in units of longwords (4 bytes). 90 S_PULL_DATA_1 write to TBUF P_CLK pulse separate Each write is in units of longwords (4 bytes). 91 CSR write P_CLK pulse separate 92 D_PULL_DATA read from RBUF P_CLK pulse separate Each read is in units of quadwords (8 bytes). 93 S_PUSH_DATA read from RBUF P_CLK pulse separate Each read is in units of longwords (4 bytes). 94 CSR read P_CLK pulse separate 95 CSR fast write P_CLK pulse separate 96 RX Autopush Asserts for Null and Non-Null Autopushes P_CLK pulse separate Hardware Reference Manual 399 Intel® IXP2800 Network Processor Performance Monitor Unit Table 157. IXP2800 Network Processor MSF PMU Event List (Sheet 5 of 6) 97 Rx null autopush P_CLK pulse separate 98 Tx skip P_CLK pulse separate An mpacket was dropped due to the Tx_Skip bit being set in the Transmit Control Word. 99 SF_CRDY P_CLK level separate Only valid in CSIX receive mode and indicates how much of the time the switch fabric is able to receive control CFrames. 100 SF_DRDY P_CLK level separate Only valid in CSIX receive mode and indicates how much of the time the switch fabric is able to receive data CFrames. 101 TM_CRDY P_CLK level separate Only valid in CSIX receive mode; indicates how much of the time the egress processor is able to receive control CFrames. 102 TM_DRDY P_CLK level separate Only valid in CSIX receive mode; indicates how much of the time the egress processor is able to receive data CFrames. 103 FCIFIFO enqueue P_CLK pulse separate 104 FCIFIFO dequeue P_CLK pulse separate 105 FCIFIFO error P_CLK pulse separate Indicates that a bad CFrame was received on the CBus (horizontal or vertical parity error, premature RxSOF); only valid in CSIX transmit mode. 106 FCIFIFO synchronizing FIFO error P_CLK pulse separate Indicates that the CBus ingress logic encountered a FCIFIFO full condition while enqueueing a CFrame into FCIFIFO. 107 Vertical parity error P_CLK pulse separate Only valid in CSIX receive mode. 108 Horizontal parity P_CLK pulse separate Only valid in CSIX receive mode. 109 Dip 4 Parity Error P_CLK pulse separate Only valid in SPI-4 receive mode. 110 Dip 2 Parity Error P_CLK pulse separate Only valid in SPI-4 receive mode. 111 reserved separate Indicates a valid CSIX DATA CFRAME received on the RX_DATA bus and may be used to measure bus utilization; the active signal from the MR_CLK domain is synchronized; as such, it yields an approximate value. separate Indicates a valid CSIX CONTROL CFRAME received on the RX_DATA bus and may be used to measure bus utilization; the active signal from the MR_CLK domain is synchronized; as such, it yields an approximate value. separate Indicates a valid SPI-4 Packet received on the RX_DATA bus and may be used to measure bus utilization; the active signal from the MR_CLK domain is synchronized; as such, it yields an approximate value. 112 113 114 400 CSIX DATA receive active CSIX CONTROL receive active SPI-4 receive active MR_CLK MR_CLK MR_CLK level level level Hardware Reference Manual Intel® IXP2800 Network Processor Performance Monitor Unit Table 157. 
IXP2800 Network Processor MSF PMU Event List (Sheet 6 of 6) 115 116 117 118 119 FCE receive active CSIX DATA transmit active CSIX CONTROL transmit active SPI-4 transmit active FCE transmit active MR_CLK MT_CLK MT_CLK MT_CLK MTX_CLK level level level level level separate Indicates a valid Flow Control Packet received on the RX_DATA bus and may be used to measure bus utilization; the active signal from the MR_CLK domain is synchronized; as such, it yields an approximate value. separate Indicates valid transmit data on the TX_DATA bus and may be used to measure bus utilization; the valid signal from the MT_CLK domain is synchronized; as such, it yields an approximate value. separate Indicates valid transmit data on the TX_DATA bus and may be used to measure bus utilization; the valid signal from the MT_CLK domain is synchronized; as such, it yields an approximate value. separate Indicates valid transmit data on the TX_DATA bus and may be used to measure bus utilization; the valid signal from the MT_CLK domain is synchronized; as such, it yields an approximate value. separate Indicates valid transmit data on the TXC_DATA bus and may be used to measure bus utilization; the valid signal from the MTX_CLK domain is synchronized; as such, it yields an approximate value. 120 FCI receive active MRX_CLK level separate Indicates a valid Flow Control Packet received on the RXC_DATA bus and may be used to measure bus utilization; the active signal from the MRX_CLK domain is synchronized; as such, it yields an approximate value. 121 Receive FIFO error MR_CLK pulse separate The receive FIFO has experienced an underflow or overflow. A pulse from the MR_CLK clock domain is converted to a pulse in the P_CLK clock domain. 122 reserved 123 reserved 124 reserved 125 reserved 126 reserved 127 reserved Hardware Reference Manual 401 Intel® IXP2800 Network Processor Performance Monitor Unit 11.4.6.4 Intel XScale® Core Events Target ID(000100) / Design Block #(0111) Table 158. 
Intel XScale® Core Gasket PMU Event List (Sheet 1 of 4) Event Number Event Name 0 XG_CFIFO_WR_EVEN_XS 1 reserved 2 XG_DFIFO_WR_EVEN_XS 3 reserved 4 XG_SFIFO_WR_EVEN_XS 5 reserved 6 XG_LCFIFO_WR_EVEN_XS 7 XG_LCFIFO_WR_ODD_XS 8 XG_LDFIFO_WR_EVEN_XS 9 XG_LDFIFO_WR_ODD_XS 10 Clock Domain Single pulse/ Long pulse Burst Description P_CLK single separate XG command FIFO even enqueue P_CLK single separate XG DRAM data FIFO even enqueue P_CLK single separate XG SRAM data FIFO even enqueue P_CLK single separate XG lcsr command FIFO even enqueue P_CLK single separate XG lcsr command FIFO odd enqueue P_CLK single separate XG lcsr data FIFO even enqueue P_CLK single separate XG lcsr data FIFO odd enqueue XG_LCSR_RD_EVEN_XS P_CLK single separate XG lcsr return data FIFO even dequeue 11 XG_LCSR_RD_ODD_XS P_CLK single separate XG lcsr return data FIFO odd dequeue 12 XG_LCSR_RD_OR_XS P_CLK single separate XG lcsr return data FIFO even_or_odd dequeue 13 XG_PUFF0_RD_EVEN_XS P_CLK single separate XG push fifo0 even dequeue 14 XG_PUFF0_RD_ODD_XS P_CLK single separate XG push fifo0 odd dequeue 15 XG_PUFF0_RD_OR_XS P_CLK single separate XG push fifo0 even_or_odd dequeue 16 XG_PUFF1_RD_EVEN_XS P_CLK single separate XG push fifo1 even dequeue 17 XG_PUFF1_RD_ODD_XS P_CLK single separate XG push fifo1 odd dequeue 18 XG_PUFF1_RD_OR_XS P_CLK single separate XG push fifo1 even_or_odd dequeue 19 XG_PUFF2_RD_EVEN_XS P_CLK single separate XG push fifo2 even dequeue 20 XG_PUFF2_RD_ODD_XS P_CLK single separate XG push fifo2 odd dequeue 21 XG_PUFF2_RD_OR_XS P_CLK single separate XG push fifo2 even_or_odd dequeue 22 XG_PUFF3_RD_EVEN_XS P_CLK single separate XG push fifo3 even dequeue 23 XG_PUFF3_RD_ODD_XS P_CLK single separate XG push fifo3 odd dequeue 24 XG_PUFF3_RD_OR_XS P_CLK single separate XG push fifo3 even_or_odd dequeue 25 XG_PUFF4_RD_EVEN_XS P_CLK single separate XG push fifo4 even dequeue 26 XG_PUFF4_RD_ODD_XS P_CLK single separate XG push fifo4 odd dequeue 27 XG_PUFF4_RD_OR_XS P_CLK single separate XG push fifo4 even_or_odd dequeue 28 XG_SYNC_ST_XS P_CLK single separate XG in sync. state 29 reserved 30 reserved 31 reserved 402 Hardware Reference Manual Intel® IXP2800 Network Processor Performance Monitor Unit Table 158. 
Intel XScale® Core Gasket PMU Event List (Sheet 2 of 4) 32 reserved 33 reserved 34 XG_CFIFO_EMPTYN_CPP P_CLK single separate XG command FIFO empty flag 35 XG_DFIFO_EMPTYN_CPP P_CLK single separate XG DRAM data FIFO empty flag 36 XG_SFIFO_EMPTYN_CPP P_CLK single separate XG SRAM data FIFO empty flag 37 XG_LCFIFO_EMPTYN_CPP P_CLK single separate XG lcsr command FIFO empty flag 38 XG_LDFIFO_EMPTYN_CPP P_CLK single separate XG lcsr data FIFO empty flag 39 reserved 40 XG_OFIFO_EMPTYN_CPP P_CLK single separate XG cpp command FIFO empty flag 41 XG_OFIFO_FULLN_CPP P_CLK single separate XG cpp command FIFO full flag 42 XG_DP_EMPTYN_CPP P_CLK single separate XG DRAM pull data FIFO empty flag 43 XG_SP_EMPTYN_CPP P_CLK single separate XG SRAM pull data FIFO empty flag 44 XG_HASH_48_CPP P_CLK single separate hash_48 command on cpp bus 45 XG_HASH_64_CPP P_CLK single separate hash_64 command on cpp bus 46 XG_HASH_128_CPP P_CLK single separate hash_128 command on cpp bus 47 XG_LCSR_FIQ_CPP P_CLK single separate XG FIQ generated by interrupt CSR 48 XG_LCSR_IRQ_CPP P_CLK single separate XG IRQ generated by interrupt CSR 49 XG_CFIFO_RD_CPP P_CLK single separate XG command FIFO dequeue 50 XG_DFIFO_RD_CPP P_CLK single separate XG DRAM data FIFO dequeue 51 XG_SFIFO_RD_CPP P_CLK single separate XG SRAM data FIFO dequeue 52 XG_LCFIFO_RD_CPP P_CLK single separate XG lcsr command FIFO dequeue 53 XG_LDFIFO_RD_CPP P_CLK single separate XG lcsr data FIFO dequeue 54 XG_LCSR_WR_CPP P_CLK single separate XG lcsr return data FIFO enqueue 55 XG_OFIFO_RD_CPP P_CLK single separate XG cpp command FIFO dequeue 56 XG_OFIFO_WR_CPP P_CLK single separate XG cpp command FIFO enqueue 57 XG_DPDATA_WR_CPP P_CLK single separate XG DRAM pull data FIFO enqueue 58 XG_DPDATA_RD_CPP P_CLK single separate XG DRAM pull data FIFO dequeue 59 XG_SPDATA_WR_CPP P_CLK single separate XG SRAM pull data FIFO enqueue 60 XG_SPDATA_RD_CPP P_CLK single separate XG SRAM pull data FIFO dequeue 61 XG_PUFF0_WR_CPP P_CLK single separate XG push fifo0 enqueue 62 XG_PUFF1_WR_CPP P_CLK single separate XG push fifo1 enqueue 63 XG_PUFF2_WR_CPP P_CLK single separate XG push fifo2 enqueue 64 XG_PUFF3_WR_CPP P_CLK single separate XG push fifo3 enqueue 65 XG_PUFF4_WR_CPP P_CLK single separate XG push fifo4 enqueue 66 XG_SRAM_RD_CPP P_CLK single separate XG SRAM read command on cpp bus 67 XG_SRAM_RD_1_CPP P_CLK single separate XG SRAM read length=1 on cpp bus 68 XG_SRAM_RD_8_CPP P_CLK single separate XG SRAM read length=8 on cpp bus 69 XG_SRAM_WR_CPP P_CLK single separate XG SRAM write command on cpp bus 70 XG_SRAM_WR_1_CPP P_CLK single separate XG SRAM write length=1 on cpp bus Hardware Reference Manual 403 Intel® IXP2800 Network Processor Performance Monitor Unit Table 158. 
Table 158. Intel XScale® Core Gasket PMU Event List (Sheet 3 of 4)

71  XG_SRAM_WR_2_CPP  P_CLK  single  separate  XG SRAM write length=2 on cpp bus
72  XG_SRAM_WR_3_CPP  P_CLK  single  separate  XG SRAM write length=3 on cpp bus
73  XG_SRAM_WR_4_CPP  P_CLK  single  separate  XG SRAM write length=4 on cpp bus
74  XG_SRAM_CSR_RD_CPP  P_CLK  single  separate  XG SRAM csr read command on cpp bus
75  XG_SRAM_CSR_WR_CPP  P_CLK  single  separate  XG SRAM csr write command on cpp bus
76  XG_SRAM_ATOM_CPP  P_CLK  single  separate  XG SRAM atomic command on cpp bus
77  XG_SRAM_GET_CPP  P_CLK  single  separate  XG SRAM get command on cpp bus
78  XG_SRAM_PUT_CPP  P_CLK  single  separate  XG SRAM put command on cpp bus
79  XG_SRAM_ENQ_CPP  P_CLK  single  separate  XG SRAM enq command on cpp bus
80  XG_SRAM_DEQ_CPP  P_CLK  single  separate  XG SRAM deq command on cpp bus
81  XG_S0_ACC_CPP  P_CLK  single  separate  XG SRAM channel0 access on cpp bus
82  XG_S1_ACC_CPP  P_CLK  single  separate  XG SRAM channel1 access on cpp bus
83  XG_S2_ACC_CPP  P_CLK  single  separate  XG SRAM channel2 access on cpp bus
84  XG_S3_ACC_CPP  P_CLK  single  separate  XG SRAM channel3 access on cpp bus
85  XG_SCR_RD_CPP  P_CLK  single  separate  XG scratch read command on cpp bus
86  XG_SCR_RD_1_CPP  P_CLK  single  separate  XG scratch read length=1 on cpp bus
87  XG_SCR_RD_8_CPP  P_CLK  single  separate  XG scratch read length=8 on cpp bus
88  XG_SCR_WR_CPP  P_CLK  single  separate  XG scratch write command on cpp bus
89  XG_SCR_WR_1_CPP  P_CLK  single  separate  XG scratch write length=1 on cpp bus
90  XG_SCR_WR_2_CPP  P_CLK  single  separate  XG scratch write length=2 on cpp bus
91  XG_SCR_WR_3_CPP  P_CLK  single  separate  XG scratch write length=3 on cpp bus
92  XG_SCR_WR_4_CPP  P_CLK  single  separate  XG scratch write length=4 on cpp bus
93  XG_SCR_ATOM_CPP  P_CLK  single  separate  XG scratch atomic command on cpp bus
94  XG_SCR_GET_CPP  P_CLK  single  separate  XG scratch get command on cpp bus
95  XG_SCR_PUT_CPP  P_CLK  single  separate  XG scratch put command on cpp bus
96  XG_DRAM_RD_CPP  P_CLK  single  separate  XG DRAM read command on cpp bus
97  XG_DRAM_RD_1_CPP  P_CLK  single  separate  XG DRAM read length=1 on cpp bus
98  XG_DRAM_RD_4_CPP  P_CLK  single  separate  XG DRAM read length=4 on cpp bus
99  XG_DRAM_WR_CPP  P_CLK  single  separate  XG DRAM write on cpp bus
100  XG_DRAM_WR_1_CPP  P_CLK  single  separate  XG DRAM write length=1 on cpp bus
101  XG_DRAM_WR_2_CPP  P_CLK  single  separate  XG DRAM write length=2 on cpp bus
102  XG_DRAM_CSR_RD_CPP  P_CLK  single  separate  XG DRAM csr read command on cpp bus
103  XG_DRAM_CSR_WR_CPP  P_CLK  single  separate  XG DRAM csr write command on cpp bus
104  XG_MSF_RD_CPP  P_CLK  single  separate  XG msf read command on cpp bus
105  XG_MSF_RD_1_CPP  P_CLK  single  separate  XG msf read length=1 on cpp bus
106  reserved
107  XG_MSF_WR_CPP  P_CLK  single  separate  XG msf write command on cpp bus
108  XG_MSF_WR_1_CPP  P_CLK  single  separate  XG msf write length=1 on cpp bus
109  XG_MSF_WR_2_CPP  P_CLK  single  separate  XG msf write length=2 on cpp bus
Table 158. Intel XScale® Core Gasket PMU Event List (Sheet 4 of 4)

110  XG_MSF_WR_3_CPP  P_CLK  single  separate  XG msf write length=3 on cpp bus
111  XG_MSF_WR_4_CPP  P_CLK  single  separate  XG msf write length=4 on cpp bus
112  XG_PCI_RD_CPP  P_CLK  single  separate  XG pci read command on cpp bus
113  XG_PCI_RD_1_CPP  P_CLK  single  separate  XG pci read length=1 on cpp bus
114  XG_PCI_RD_8_CPP  P_CLK  single  separate  XG pci read length=8 on cpp bus
115  XG_PCI_WR_CPP  P_CLK  single  separate  XG pci write command on cpp bus
116  XG_PCI_WR_1_CPP  P_CLK  single  separate  XG pci write length=1 on cpp bus
117  XG_PCI_WR_2_CPP  P_CLK  single  separate  XG pci write length=2 on cpp bus
118  XG_PCI_WR_3_CPP  P_CLK  single  separate  XG pci write length=3 on cpp bus
119  XG_PCI_WR_4_CPP  P_CLK  single  separate  XG pci write length=4 on cpp bus
120  XG_CAP_RD_CPP  P_CLK  single  separate  XG cap read command on cpp bus
121  XG_CAP_RD_1_CPP  P_CLK  single  separate  XG cap read length=1 on cpp bus
122  XG_CAP_RD_8_CPP  P_CLK  single  separate  XG cap read length=8 on cpp bus
123  XG_CAP_WR_CPP  P_CLK  single  separate  XG cap write command on cpp bus
124  XG_CAP_WR_1_CPP  P_CLK  single  separate  XG cap write length=1 on cpp bus
125  reserved
126  reserved
127  reserved

11.4.6.5 PCI Events

Target ID(000101) / Design Block #(1000)

Table 159. PCI PMU Event List (Sheet 1 of 5)

Event Number  Event Name  Clock Domain  Pulse/Level  Burst  Description
0  PCI_TGT_AFIFO_FULL  PCI_CLK  single  separate  PCI Target Address FIFO Full
1  PCI_TGT_AFIFO_NEMPTY  P_CLK  single  separate  PCI Target Address FIFO Not Empty
2  PCI_TGT_AFIFO_WR  PCI_CLK  single  separate  PCI Target Address FIFO Write
3  PCI_TGT_AFIFO_RD  P_CLK  single  separate  PCI Target Address FIFO Read
4  PCI_TGT_RFIFO_FULL  P_CLK  single  separate  PCI Target Read FIFO Full
5  PCI_TGT_RFIFO_NEMPTY  PCI_CLK  single  separate  PCI Target Read FIFO Not Empty
6  PCI_TGT_RFIFO_WR  P_CLK  single  separate  PCI Target Read FIFO Write
7  PCI_TGT_RFIFO_RD  PCI_CLK  single  separate  PCI Target Read FIFO Read
8  PCI_TGT_WFIFO_FULL  PCI_CLK  single  separate  PCI Target Write FIFO Full
9  PCI_TGT_WFIFO_NEMPTY  P_CLK  single  separate  PCI Target Write FIFO Not Empty
10  PCI_TGT_WFIFO_WR  PCI_CLK  single  separate  PCI Target Write FIFO Write
11  PCI_TGT_WFIFO_RD  P_CLK  single  separate  PCI Target Write FIFO Read
12  PCI_TGT_WBUF_FULL  P_CLK  single  separate  PCI Target Write Buffer Full

Table 159.
PCI PMU Event List (Sheet 2 of 5) 13 PCI_TGT_WBUF_NEMPTY P_CLK single separate PCI Target Write Buffer Not Empty 14 PCI_TGT_WBUF_WR P_CLK single separate PCI Target Write Buffer Write 15 PCI_TGT_WBUF_RD P_CLK single separate PCI Target Write Buffer Read 16 PCI_MST_AFIFO_FULL P_CLK single separate PCI Master Address FIFO Full 17 PCI_MST_AFIFO_NEMPTY PCI_CLK single separate PCI Master Address FIFO Not Empty 18 PCI_MST_AFIFO_WR P_CLK single separate PCI Master Address FIFO Write 19 PCI_MST_AFIFO_RD PCI_CLK single separate PCI Master Address FIFO Read 20 PCI_MST_RFIFO_FULL PCI_CLK single separate PCI Master Read FIFO Full 21 PCI_MST_RFIFO_NEMPTY P_CLK single separate PCI Master Read FIFO Not Empty 22 PCI_MST_RFIFO_WR PCI_CLK single separate PCI Master Read FIFO Write 23 PCI_MST_RFIFO_RD P_CLK single separate PCI Master Read FIFO Read 24 PCI_MST_WFIFO_FULL P_CLK single separate PCI Master Write FIFO Full 25 PCI_MST_WFIFO_NEMPTY PCI_CLK single separate PCI Master Write FIFO Not Empty 26 PCI_MST_WFIFO_WR P_CLK single separate PCI Master Write FIFO Write 27 PCI_MST_WFIFO_RD PCI_CLK single separate PCI Master Write FIFO Read 28 PCI_DMA1_BUF_FULL P_CLK single separate PCI_DMA_Channel 1 29 PCI_DMA1_BUF_NEMPTY P_CLK single separate PCI DMA Channel 1 Buffer Not Empty 30 PCI_DMA1_BUF_WR P_CLK single separate PCI DMA Channel 1 Buffer Write 31 PCI_DMA1_BUF_RD P_CLK single separate PCI DMA Channel 1 Buffer Read 32 reserved 33 reserved 34 reserved 35 reserved 36 PCI_DMA3_BUF_FULL P_CLK single separate PCI_DMA_Channel 3 37 PCI_DMA3_BUF_NEMPTY P_CLK single separate 38 PCI_DMA3_BUF_WR P_CLK single separate 39 PCI_DMA3_BUF_RD P_CLK single separate 40 PCI_TCMD_FIFO_FULL P_CLK single separate 41 PCI_TCMD_FIFO_NEMPTY P_CLK single separate 42 PCI_TCMD_FIFO_WR P_CLK single separate 43 PCI_TCMD_FIFO_RD P_CLK single separate 44 PCI_TDATA_FIFO_FULL P_CLK single separate 45 PCI_TDATA_FIFO_NEMPTY P_CLK single separate 46 PCI_TDATA_FIFO_WR P_CLK single separate 47 PCI_TDATA_FIFO_RD P_CLK single separate 48 PCI_CSR_WRITE P_CLK single separate 49 PCI_CSR_READ P_CLK single separate 50 PCI_DRAM_WRITE P_CLK single separate 51 PCI_DRAM_READ P_CLK single separate 406 PCI TARGET Command Fifo PCI Push/Pull Data Fifo PCI Write to PCI_CSR_BAR PCI Write to PCI_DRAM_BAR Hardware Reference Manual Intel® IXP2800 Network Processor Performance Monitor Unit Table 159. 
PCI PMU Event List (Sheet 3 of 5) 52 PCI_DRAM_BURST_WRITE P_CLK single separate 53 PCI_DRAM_BURST_READ P_CLK single separate PCI Burst Read to PCI_CSR_BAR 54 PCI_SRAM_WRITE P_CLK single separate PCI Write to PCI_SRAM_BAR 55 PCI_SRAM_READ P_CLK single separate 56 PCI_SRAM_BURST_WRITE P_CLK single separate 57 PCI_SRAM_BURST_READ P_CLK single separate 58 PCI_CSR_CMD P_CLK single separate PCI CSR Command Generated 59 PCI_CSR_PUSH P_CLK single separate PCI CSR Push Command 60 PCI_CSR_PULL P_CLK single separate PCI CSR Pull Command 61 PCI_SRAM_CMD P_CLK single separate PCI SRAM Command 62 PCI_SRAM_PUSH P_CLK single separate PCI SRAM Push Command 63 PCI_SRAM_PULL P_CLK single separate PCI SRAM Pull Command 64 PCI_DRAM_CMD P_CLK single separate PCI DRAM Command 65 PCI_DRAM_PUSH P_CLK single separate 66 PCI_DRAM_PULL P_CLK single separate 67 PCI_CSR_2PCI_WR P_CLK single separate 68 PCI_CSR_2PCI_RD P_CLK single separate 69 PCI_CSR_2CFG_WR PCI_CLK single separate 70 PCI_CSR_2CFG_RD PCI_CLK single separate 71 PCI_CSR_2SRAM_WR P_CLK single separate 72 PCI_CSR_2SRAM_RD P_CLK single separate 73 PCI_CSR_2DRAM_WR P_CLK single separate 74 PCI_CSR_2DRAM_RD P_CLK single separate 75 PCI_CSR_2CAP_WR P_CLK single separate 76 PCI_CSR_2CAP_RD P_CLK single separate 77 PCI_CSR_2MSF_WR P_CLK single separate 78 PCI_CSR_2MSF_RD P_CLK single separate 79 PCI_CSR_2SCRAPE_WR P_CLK single separate 80 PCI_CSR_2SCRAPE_RD P_CLK single separate 81 PCI_CSR_2SCRATCH_RING_WR P_CLK single separate 82 PCI_CSR_2SCRATCH_RING_RD P_CLK single separate 83 PCI_CSR_2SRAM_RING_WR P_CLK single separate 84 PCI_CSR_2SRAM_RING_RD P_CLK single separate 85 PCI_XS_LCFG_RD P_CLK single separate 86 PCI_XS_LCFG_WR P_CLK single separate 87 PCI_XS_CSR_RD P_CLK single separate 88 PCI_XS_CSR_WR P_CLK single separate Hardware Reference Manual PCI Burst Write to PCI_CSR_BAR PCI Burst Write to PCI_SRAM_BAR PCI Target Write to PCI local CSR PCI Target Write to PCI local Config CSR PCI Target Write to SRAM CSR PCI Target Write to DRAM CSR PCI Target Write to CAPCSR PCI Target Write to MSFCSR PCI Target Write to Scrape CSR PCI Target Write to Scratch Ring CSR PCI Target Write to SRAM Ring CSR PCI Intel XScale® Core Read Local Config CSR PCI Intel XScale® Core Read Local CSR 407 Intel® IXP2800 Network Processor Performance Monitor Unit Table 159. 
PCI PMU Event List (Sheet 4 of 5) 89 PCI_XS_CFG_RD P_CLK single separate 90 PCI_XS_CFG_WR P_CLK single separate 91 PCI_XS_MEM_RD P_CLK single separate 92 PCI_XS_MEM_WR P_CLK single separate 93 PCI_XS_BURST_RD P_CLK single separate PCI Intel XScale® Core Read PCI Bus Config Space PCI Intel XScale® Core Read PCI Bus Memory Space PCI Intel XScale® Core Burst Read PCI Bus Memory Space 94 PCI_XS_BURST_WR P_CLK single separate 95 PCI_XS_IO_RD P_CLK single separate 96 PCI_XS_IO_WR P_CLK single separate 97 PCI_XS_SPEC P_CLK single separate PCI Intel XScale® Core Read PCI Bus as Special 98 PCI_XS_IACK P_CLK single separate PCI Intel XScale® Core Read PCI Bus as IACK PCI ME Read Local CSR PCI Intel XScale® Core Read PCI Bus I/O Space 99 PCI_ME_CSR_RD P_CLK single separate 100 PCI_ME_CSR_WR P_CLK single separate 101 PCI_ME_MEM_RD P_CLK single separate 102 PCI_ME_MEM_WR P_CLK single separate 103 PCI_ME_BURST_RD P_CLK single separate 104 PCI_ME_BURST_WR P_CLK single separate 105 PCI_MST_CFG_RD P_CLK single separate 106 PCI_MST_CFG_WR P_CLK single separate 107 PCI_MST_MEM_RD P_CLK single separate 108 PCI_MST_MEM_WR P_CLK single separate 109 PCI_MST_BURST_RD P_CLK single separate 110 PCI_MST_BURST_WR P_CLK single separate 111 PCI_MST_IO_READ P_CLK single separate 112 PCI_MST_IO_WRITE P_CLK single separate 113 PCI_MST_SPEC P_CLK single separate PCI Initiator Read PCI Bus As a Special Cycle 114 PCI_MST_IACK P_CLK single separate PCI Initiator Read PCI Bus As IACK Cycle 115 PCI_MST_READ_LINE P_CLK single separate PCI Initiator Read Line Command to PCI 116 PCI_MST_READ_MULT P_CLK single separate PCI Initiator Read Line Multiple Command to PCI 117 PCI_ARB_REQ[2] PCI_CLK single separate Internal Arbiter PCI Bus Request 2 408 PCI ME Read PCI Bus Memory Space PCI ME Burst Read PCI Bus Memory Space PCI Initiator Read PCI Bus Config Space PCI Initiator Read PCI Bus Memory Space PCI Initiator Burst Read PCI Bus Memory Space PCI Initiator Read PCI Bus I/O Space Hardware Reference Manual Intel® IXP2800 Network Processor Performance Monitor Unit Table 159. PCI PMU Event List (Sheet 5 of 5) 118 PCI_ARB_GNT[2] PCI_CLK single separate 119 PCI_ARB_REQ[1] PCI_CLK single separate 120 PCI_ARB_GNT[1] PCI_CLK single separate 121 PCI_ARB_REQ[0] PCI_CLK single separate 122 PCI_ARB_GNT[0] PCI_CLK single separate 123 PCI_TGT_STATE[4] P_CLK single separate 124 PCI_TGT_STATE[3] P_CLK single separate 125 PCI_TGT_STATE[2] P_CLK single separate 126 PCI_TGT_STATE[1] P_CLK single separate 127 PCI_TGT_STATE[0] P_CLK single separate 11.4.6.6 Internal Arbiter PCI Bus Grant 2 PCI Target State Machine State Bit 4 ME00 Events Target ID(100000) / Design Block #(1001) Table 160. 
ME00 PMU Event List (Sheet 1 of 2)

Event Number  Event Name  Clock Domain  Pulse/Level  Burst  Description
0  ME_FIFO_ENQ_EVEN  T_CLK  single  separate  Even version of Command FIFO Enqueue (pair with event #6)
1  ME_IDLE_EVEN  T_CLK  single  separate  Even version of No Thread running in Microengine (pair with event #7)
2  ME_EXECUTING_EVEN  T_CLK  single  separate  Even version of Valid Instruction (pair with event #8)
3  ME_STALL_EVEN  T_CLK  single  separate  Even version of Microengine stall caused by FIFO Full (pair with event #9)
4  ME_CTX_SWAPPING_EVEN  T_CLK  single  separate  Even version of Occurrence of context swap (pair with event #10)
5  ME_INST_ABORT_EVEN  T_CLK  single  separate  Even version of Instruction aborted due to branch taken (pair with event #11)
6  ME_FIFO_ENQ_ODD  T_CLK  single  separate  Odd version of Command FIFO Enqueue (pair with event #0)
7  ME_IDLE_ODD  T_CLK  single  separate  Odd version of No Thread running in Microengine (pair with event #1)
8  ME_EXECUTING_ODD  T_CLK  single  separate  Odd version of Valid Instruction (pair with event #2)
9  ME_STALL_ODD  T_CLK  single  separate  Odd version of Microengine stall caused by FIFO Full (pair with event #3)
10  ME_CTX_SWAPPING_ODD  T_CLK  single  separate  Odd version of Occurrence of context swap (pair with event #4)
11  ME_INST_ABORT_ODD  T_CLK  single  separate  Odd version of Instruction aborted due to branch taken (pair with event #5)

Table 160. ME00 PMU Event List (Sheet 2 of 2)

12  ME_FIFO_DEQ  P_CLK  single  separate  Command FIFO Dequeue
13  ME_FIFO_NOT_EMPTY  P_CLK  single  separate  Command FIFO not empty

Note:
1. All the Microengines have the same event list.
2. CC_Enable bits [2:0] form the PMU_CTX_Monitor field in the Microengine CSR. This field holds the number of the context to be monitored; the event count only reflects the events that occur while that context is executing. CC_Enable[2:0] = 000 selects context 0, CC_Enable[2:0] = 001 selects context 1, ..., CC_Enable[2:0] = 111 selects context 7.
3. 1.4 GHz events are sampled by the PMU at a 700 MHz rate. For this reason, all 1.4 GHz events have both an even and an odd event. To determine the total number of 1.4 GHz events, add the occurrences of the even and odd events together.
4. For IXP2800 Network Processor Rev B, CC_Enable[3] must be set to 1 on all 16 Microengines for proper PMU functionality.

11.4.6.7 ME01 Events

Target ID(100001) / Design Block #(1001)

Table 161. ME01 PMU Event List

Event Number  Event Name  Clock Domain  Pulse/Level  Burst  Description

Note:
1. All the Microengines have the same event list.
2. CC_Enable bits [2:0] form the PMU_CTX_Monitor field in the Microengine CSR. This field holds the number of the context to be monitored; the event count only reflects the events that occur while that context is executing. CC_Enable[2:0] = 000 selects context 0, CC_Enable[2:0] = 001 selects context 1, ..., CC_Enable[2:0] = 111 selects context 7.
3. 1.4 GHz events are sampled by the PMU at a 700 MHz rate. For this reason, all 1.4 GHz events have both an even and an odd event. To determine the total number of 1.4 GHz events, add the occurrences of the even and odd events together.
4. For IXP2800 Network Processor Rev B, CC_Enable[3] must be set to 1 on all 16 Microengines for proper PMU functionality.
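The Microengine notes above lend themselves to a small amount of helper code on the Intel XScale® core side. The sketch below is illustrative only; it is written in C and the pmu_read_counter(), me_read_cc_enable(), and me_write_cc_enable() accessors are hypothetical placeholders for whatever CSR-access layer a particular platform provides.

#include <stdint.h>

/* Hypothetical register accessors; the real mechanism (memory-mapped CSR
 * reads, a driver call, etc.) is platform-specific. */
extern uint32_t pmu_read_counter(unsigned counter_id);
extern uint32_t me_read_cc_enable(unsigned me_number);
extern void     me_write_cc_enable(unsigned me_number, uint32_t value);

/* Note 3: a 1.4 GHz Microengine event is sampled at 700 MHz as an even/odd
 * pair, so the total event count is the sum of the two counters. */
static uint32_t me_event_total(unsigned even_counter, unsigned odd_counter)
{
    return pmu_read_counter(even_counter) + pmu_read_counter(odd_counter);
}

/* Notes 2 and 4: select the monitored context in PMU_CTX_Monitor
 * (CC_Enable[2:0]) and keep CC_Enable[3] set for Rev B silicon. */
static void me_monitor_context(unsigned me_number, unsigned ctx /* 0..7 */)
{
    uint32_t cc = me_read_cc_enable(me_number);
    cc &= ~(uint32_t)0x7;      /* clear PMU_CTX_Monitor      */
    cc |= (ctx & 0x7);         /* select context number      */
    cc |= (1u << 3);           /* CC_Enable[3] = 1 (Rev B)   */
    me_write_cc_enable(me_number, cc);
}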
11.4.6.8 ME02 Events

Target ID(100010) / Design Block #(1001)

Table 162. ME02 PMU Event List

Event Number  Event Name  Clock Domain  Pulse/Level  Burst  Description

Note:
1. All the Microengines have the same event list.
2. CC_Enable bits [2:0] form the PMU_CTX_Monitor field in the Microengine CSR. This field holds the number of the context to be monitored; the event count only reflects the events that occur while that context is executing. CC_Enable[2:0] = 000 selects context 0, CC_Enable[2:0] = 001 selects context 1, ..., CC_Enable[2:0] = 111 selects context 7.
3. 1.4 GHz events are sampled by the PMU at a 700 MHz rate. For this reason, all 1.4 GHz events have both an even and an odd event. To determine the total number of 1.4 GHz events, add the occurrences of the even and odd events together.
4. For IXP2800 Network Processor Rev B, CC_Enable[3] must be set to 1 on all 16 Microengines for proper PMU functionality.

11.4.6.9 ME03 Events

Target ID(100011) / Design Block #(1001)

Table 163. ME03 PMU Event List

Event Number  Event Name  Clock Domain  Pulse/Level  Burst  Description

Note:
1. All the Microengines have the same event list.
2. CC_Enable bits [2:0] form the PMU_CTX_Monitor field in the Microengine CSR. This field holds the number of the context to be monitored; the event count only reflects the events that occur while that context is executing. CC_Enable[2:0] = 000 selects context 0, CC_Enable[2:0] = 001 selects context 1, ..., CC_Enable[2:0] = 111 selects context 7.
3. 1.4 GHz events are sampled by the PMU at a 700 MHz rate. For this reason, all 1.4 GHz events have both an even and an odd event. To determine the total number of 1.4 GHz events, add the occurrences of the even and odd events together.
4. For IXP2800 Network Processor Rev B, CC_Enable[3] must be set to 1 on all 16 Microengines for proper PMU functionality.

11.4.6.10 ME04 Events

Target ID(100100) / Design Block #(1001)

Table 164.
For this reason, all 1.4 GHz events have both an even and an odd event. To determine the total number of 1.4 GHz events, the occurrences of the even events and odd events should be added together. 4. For IXP2800 Network Processor Rev B, CC_Enable[3] must be set to 1 on all 16 Microengines for proper PMU functionality. 412 Hardware Reference Manual Intel® IXP2800 Network Processor Performance Monitor Unit 11.4.6.12 ME06 Events Target ID(100110) / Design Block #(1001) Table 166. ME06 PMU Event List Event Number Event Name Clock Domain Pulse/ Level Burst Description Note: 1. All the Microengines have the same event list. 2. CC_Enable bit[2:0] is PMU_CTX_Monitor in Microengine CSR, This field holds the number of context to be monitored. The event count only reflects the events that occur when this context is executing. CC_Enable[2:0] = 000, select context number 0, CC_Enable[2:0] = 001, select context number 1, ....... CC_Enable[2:0] = 111, select context number 7. 3. 1.4 GHz events are sampled by the PMU at a 700 MHz rate. For this reason, all 1.4 GHz events have both an even and an odd event. To determine the total number of 1.4 GHz events, the occurrences of the even events and odd events should be added together. 4. For IXP2800 Network Processor Rev B, CC_Enable[3] must be set to 1 on all 16 Microengines for proper PMU functionality. 11.4.6.13 ME07 Events Target ID(100111) / Design Block #(1001) Table 167. ME07 PMU Event List Event Number Event Name Clock Domain Pulse/ Level Burst Description Note: 1. All the Microengines have the same event list. 2. CC_Enable bit[2:0] is PMU_CTX_Monitor in Microengine CSR, This field holds the number of context to be monitored. The event count only reflects the events that occur when this context is executing. CC_Enable[2:0] = 000, select context number 0, CC_Enable[2:0] = 001, select context number 1, ....... CC_Enable[2:0] = 111, select context number 7. 3. 1.4 GHz events are sampled by the PMU at a 700 MHz rate. For this reason, all 1.4 GHz events have both an even and an odd event. To determine the total number of 1.4 GHz events, the occurrences of the even events and odd events should be added together. 4. For IXP2800 Network Processor Rev B, CC_Enable[3] must be set to 1 on all 16 Microengines for proper PMU functionality. Hardware Reference Manual 413 Intel® IXP2800 Network Processor Performance Monitor Unit 11.4.6.14 ME10 Events Target ID(110000) / Design Block #(1010) Table 168. ME10 PMU Event List Event Number Event Name Clock Domain Pulse/ Level Burst Description Note: 1. All the Microengines have the same event list. 2. CC_Enable bit[2:0] is PMU_CTX_Monitor in Microengine CSR, This field holds the number of context to be monitored. The event count only reflects the events that occur when this context is executing. CC_Enable[2:0] = 000, select context number 0, CC_Enable[2:0] = 001, select context number 1, ....... CC_Enable[2:0] = 111, select context number 7. 3. 1.4 GHz events are sampled by the PMU at a 700 MHz rate. For this reason, all 1.4 GHz events have both an even and an odd event. To determine the total number of 1.4 GHz events, the occurrences of the even events and odd events should be added together. 4. For IXP2800 Network Processor Rev B, CC_Enable[3] must be set to 1 on all 16 Microengines for proper PMU functionality. 11.4.6.15 ME11 Events Target ID(110001) / Design Block #(1010) Table 169. ME11 PMU Event List Event Number Event Name Clock Domain Pulse/ Level Burst Description Note: 1. 
All the Microengines have the same event list. 2. CC_Enable bit[2:0] is PMU_CTX_Monitor in Microengine CSR, This field holds the number of context to be monitored. The event count only reflects the events that occur when this context is executing. CC_Enable[2:0] = 000, select context number 0, CC_Enable[2:0] = 001, select context number 1, ....... CC_Enable[2:0] = 111, select context number 7. 3. 1.4 GHz events are sampled by the PMU at a 700 MHz rate. For this reason, all 1.4 GHz events have both an even and an odd event. To determine the total number of 1.4 GHz events, the occurrences of the even events and odd events should be added together. 4. For IXP2800 Network Processor Rev B, CC_Enable[3] must be set to 1 on all 16 Microengines for proper PMU functionality. 414 Hardware Reference Manual Intel® IXP2800 Network Processor Performance Monitor Unit 11.4.6.16 ME12 Events Target ID(110010) / Design Block #(1010) Table 170. ME12 PMU Event List Event Number Event Name Clock Domain Pulse/ Level Burst Description Note: 1. All the Microengines have the same event list. 2. CC_Enable bit[2:0] is PMU_CTX_Monitor in Microengine CSR, This field holds the number of context to be monitored. The event count only reflects the events that occur when this context is executing. CC_Enable[2:0] = 000, select context number 0, CC_Enable[2:0] = 001, select context number 1, ....... CC_Enable[2:0] = 111, select context number 7. 3. 1.4 GHz events are sampled by the PMU at a 700 MHz rate. For this reason, all 1.4 GHz events have both an even and an odd event. To determine the total number of 1.4 GHz events, the occurrences of the even events and odd events should be added together. 4. For IXP2800 Network Processor Rev B, CC_Enable[3] must be set to 1 on all 16 Microengines for proper PMU functionality. 11.4.6.17 ME13 Events Target ID(110011) / Design Block #(1010) Table 171. ME13 PMU Event List Event Number Event Name Clock Domain Pulse/ Level Burst Description Note: 1. All the Microengines have the same event list. 2. CC_Enable bit[2:0] is PMU_CTX_Monitor in Microengine CSR, This field holds the number of context to be monitored. The event count only reflects the events that occur when this context is executing. CC_Enable[2:0] = 000, select context number 0, CC_Enable[2:0] = 001, select context number 1, ....... CC_Enable[2:0] = 111, select context number 7. 3. 1.4 GHz events are sampled by the PMU at a 700 MHz rate. For this reason, all 1.4 GHz events have both an even and an odd event. To determine the total number of 1.4 GHz events, the occurrences of the even events and odd events should be added together. 4. For IXP2800 Network Processor Rev B, CC_Enable[3] must be set to 1 on all 16 Microengines for proper PMU functionality. Hardware Reference Manual 415 Intel® IXP2800 Network Processor Performance Monitor Unit 11.4.6.18 ME14 Events Target ID(110100) / Design Block #(1010) Table 172. ME14 PMU Event List Event Number Event Name Clock Domain Pulse/ Level Burst Description Note: 1. All the Microengines have the same event list. 2. CC_Enable bit[2:0] is PMU_CTX_Monitor in Microengine CSR, This field holds the number of context to be monitored. The event count only reflects the events that occur when this context is executing. CC_Enable[2:0] = 000, select context number 0, CC_Enable[2:0] = 001, select context number 1, ....... CC_Enable[2:0] = 111, select context number 7. 3. 1.4 GHz events are sampled by the PMU at a 700 MHz rate. For this reason, all 1.4 GHz events have both an even and an odd event. 
To determine the total number of 1.4 GHz events, the occurrences of the even events and odd events should be added together. 4. For IXP2800 Network Processor Rev B, CC_Enable[3] must be set to 1 on all 16 Microengines for proper PMU functionality. 11.4.6.19 ME15 Events Target ID(110101) / Design Block #(1010) Table 173. ME15 PMU Event List Event Number Event Name Clock Domain Pulse/ Level Burst Description Note: 1. All the Microengines have the same event list. 2. CC_Enable bit[2:0] is PMU_CTX_Monitor in Microengine CSR, This field holds the number of context to be monitored. The event count only reflects the events that occur when this context is executing. CC_Enable[2:0] = 000, select context number 0, CC_Enable[2:0] = 001, select context number 1, ....... CC_Enable[2:0] = 111, select context number 7. 3. 1.4 GHz events are sampled by the PMU at a 700 MHz rate. For this reason, all 1.4 GHz events have both an even and an odd event. To determine the total number of 1.4 GHz events, the occurrences of the even events and odd events should be added together. 4. For IXP2800 Network Processor Rev B, CC_Enable[3] must be set to 1 on all 16 Microengines for proper PMU functionality. 416 Hardware Reference Manual Intel® IXP2800 Network Processor Performance Monitor Unit 11.4.6.20 ME16 Events Target ID(100110) / Design Block #(1010) Table 174. ME16 PMU Event List Event Number Event Name Clock Domain Pulse/ Level Burst Description Note: 1. All the Microengines have the same event list. 2. CC_Enable bit[2:0] is PMU_CTX_Monitor in Microengine CSR, This field holds the number of context to be monitored. The event count only reflects the events that occur when this context is executing. CC_Enable[2:0] = 000, select context number 0, CC_Enable[2:0] = 001, select context number 1, ....... CC_Enable[2:0] = 111, select context number 7. 3. 1.4 GHz events are sampled by the PMU at a 700 MHz rate. For this reason, all 1.4 GHz events have both an even and an odd event. To determine the total number of 1.4 GHz events, the occurrences of the even events and odd events should be added together. 4. For IXP2800 Network Processor Rev B, CC_Enable[3] must be set to 1 on all 16 Microengines for proper PMU functionality. 11.4.6.21 ME17 Events Target ID(110111) / Design Block #(1010) Table 175. ME17 PMU Event List Event Number Event Name Clock Domain Pulse/ Level Burst Description Note: 1. All the Microengines have the same event list. 2. CC_Enable bit[2:0] is PMU_CTX_Monitor in Microengine CSR, This field holds the number of context to be monitored. The event count only reflects the events that occur when this context is executing. CC_Enable[2:0] = 000, select context number 0, CC_Enable[2:0] = 001, select context number 1, ....... CC_Enable[2:0] = 111, select context number 7. 3. 1.4 GHz events are sampled by the PMU at a 700 MHz rate. For this reason, all 1.4 GHz events have both an even and an odd event. To determine the total number of 1.4 GHz events, the occurrences of the even events and odd events should be added together. 4. For IXP2800 Network Processor Rev B, CC_Enable[3] must be set to 1 on all 16 Microengines for proper PMU functionality. Hardware Reference Manual 417 Intel® IXP2800 Network Processor Performance Monitor Unit 11.4.6.22 SRAM DP1 Events Target ID(001001) / Design Block #(0010) Table 176. SRAM DP1 PMU Event List Event Number Event Name Clock Domain Pulse/ Level Burst Description Note: 1. SRAM DP1/DP0 push/pull arbiter has same event lists. 2. S_CLK = SRAM clock domain 3. 
P_CLK = PP clock domain signals that begin with sps_ correspond to S_Push Arb signals that begin with spl_ correspond to S_Pull Arb signals that contain _pc_ (after the unit designation) correspond to the PCI target interface signals that contain _m_ (after the unit designation) correspond to the MSF target interface signals that contain _sh_ (after the unit designation) correspond to the SHaC target interface signals that contain _s0_ (after the unit designation) correspond to the SRAM0 target interface signals that contain _s1_ (after the unit designation) correspond to the SRAM1 target interface signals that contain _s2_ (after the unit designation) correspond to the SRAM2 target interface signals that contain _s3_ (after the unit designation) correspond to the SRAM3 target interface 11.4.6.23 SRAM DP0 Events Target ID(001010) / Design Block #(0010) Table 177. SRAM DP0 PMU Event List (Sheet 1 of 3) Event Number 418 Event Name Clock Domain Single pulse/ Long pulse Burst Description 0 sps_pc_cmd_valid_rph P_CLK Long separate PCI Push Command Queue FIFO Valid 1 sps_pc_enq_wph P_CLK single separate PCI Push Command Queue FIFO Enqueue 2 sps_pc_deq_wph P_CLK single separate PCI Push Command Queue FIFO Dequeue 3 sps_pc_push_q_full_wph P_CLK Long separate PCI Push Command Queue FIFO Full 4 sps_m_cmd_valid_rph P_CLK Long separate MSF Push Command Queue FIFO Valid 5 sps_m_enq_wph P_CLK single separate MSF Push Command Queue FIFO Enqueue 6 sps_m_deq_wph P_CLK single separate MSF Push Command Queue FIFO Dequeue 7 sps_m_push_q_full_wph P_CLK Long separate MSF Push Command Queue FIFO Full 8 sps_sh_cmd_valid_rph P_CLK Long separate SHaC Push Command Queue FIFO Valid 9 sps_sh_enq_wph P_CLK single separate SHaC Push Command Queue FIFO Enqueue 10 sps_sh_deq_wph P_CLK single separate SHaC Push Command Queue FIFO Dequeue 11 sps_sh_push_q_full_wph P_CLK Long separate SHaC Push Command Queue FIFO Full 12 sps_s0_cmd_valid_rph P_CLK Long separate SRAM0 Push Command Queue FIFO Valid Hardware Reference Manual Intel® IXP2800 Network Processor Performance Monitor Unit Table 177. 
SRAM DP0 PMU Event List (Sheet 2 of 3) 13 sps_s0_enq_wph P_CLK single separate SRAM0 Push Command Queue FIFO Enqueue 14 sps_s0_deq_wph P_CLK single separate SRAM0 Push Command Queue FIFO Dequeue 15 sps_s0_push_q_full_wph P_CLK Long separate SRAM0 Push Command Queue FIFO Full 16 sps_s1_cmd_valid_rph P_CLK Long separate SRAM1 Push Command Queue FIFO Valid 17 sps_s1_enq_wph P_CLK single separate SRAM1 Push Command Queue FIFO Enqueue 18 sps_s1_deq_wph P_CLK single separate SRAM1 Push Command Queue FIFO Dequeue 19 sps_s1_push_q_full_wph P_CLK Long separate SRAM1 Push Command Queue FIFO Full 20 sps_s2_cmd_valid_rph P_CLK Long separate SRAM2 Push Command Queue FIFO Valid 21 sps_s2_enq_wph P_CLK single separate SRAM2 Push Command Queue FIFO Enqueue 22 sps_s2_deq_wph P_CLK single separate SRAM2 Push Command Queue FIFO Dequeue 23 sps_s2_push_q_full_wph P_CLK Long separate SRAM2 Push Command Queue FIFO Full 24 sps_s3_cmd_valid_rph P_CLK Long separate SRAM3 Push Command Queue FIFO Valid 25 sps_s3_enq_wph P_CLK single separate SRAM3 Push Command Queue FIFO Enqueue 26 sps_s3_deq_wph P_CLK single separate SRAM3 Push Command Queue FIFO Dequeue 27 sps_s3_push_q_full_wph P_CLK Long separate SRAM3 Push Command Queue FIFO Full 28 spl_pc_cmd_valid_rph P_CLK Long separate PCI Pull Command Queue FIFO Valid 29 spl_pc_enq_cmd_wph P_CLK single separate PCI Pull Command Queue FIFO Enqueue 30 spl_pc_deq_wph P_CLK single separate PCI Pull Command Queue FIFO Dequeue 31 spl_pc_cmd_que_full_wph P_CLK Long separate PCI Pull Command Queue FIFO Full 32 spl_m_cmd_valid_rph P_CLK Long separate MSF Pull Command Queue FIFO Valid 33 spl_m_enq_cmd_wph P_CLK single separate MSF Pull Command Queue FIFO Enqueue 34 spl_m_deq_wph P_CLK single separate MSF Pull Command Queue FIFO Dequeue 35 spl_m_cmd_que_full_wph P_CLK Long separate MSF Pull Command Queue FIFO Full 36 spl_sh_cmd_valid_rph P_CLK Long separate SHaC Pull Command Queue FIFO Valid 37 spl_sh_enq_cmd_wph P_CLK single separate SHaC Pull Command Queue FIFO Enqueue 38 spl_sh_deq_wph P_CLK single separate SHaC Pull Command Queue FIFO Dequeue 39 spl_sh_cmd_que_full_wph P_CLK Long separate SHaC Pull Command Queue FIFO Full 40 spl_s0_cmd_valid_rph P_CLK Long separate SRAM0 Pull Command Queue FIFO Valid 41 spl_s0_enq_cmd_wph P_CLK single separate SRAM0 Pull Command Queue FIFO Enqueue 42 spl_s0_deq_wph P_CLK single separate SRAM0 Pull Command Queue FIFO Dequeue 43 spl_s0_cmd_que_full_wph P_CLK Long separate SRAM0 Pull Command Queue FIFO Full 44 spl_s1_cmd_valid_rph P_CLK Long separate SRAM1 Pull Command Queue FIFO Valid Hardware Reference Manual 419 Intel® IXP2800 Network Processor Performance Monitor Unit Table 177. 
SRAM DP0 PMU Event List (Sheet 3 of 3) 45 spl_s1_enq_cmd_wph P_CLK single separate SRAM1 Pull Command Queue FIFO Enqueue 46 spl_s1_deq_wph P_CLK single separate SRAM1 Pull Command Queue FIFO Dequeue 47 spl_s1_cmd_que_full_wph P_CLK Long separate SRAM1 Pull Command Queue FIFO Full 48 spl_s2_cmd_valid_rph P_CLK Long separate SRAM2 Pull Command Queue FIFO Valid 49 spl_s2_enq_cmd_wph P_CLK single separate SRAM2 Pull Command Queue FIFO Enqueue 50 spl_s2_deq_wph P_CLK single separate SRAM2 Pull Command Queue FIFO Dequeue 51 spl_s2_cmd_que_full_wph P_CLK Long separate SRAM2 Pull Command Queue FIFO Full 52 spl_s3_cmd_valid_rph P_CLK Long separate SRAM3 Pull Command Queue FIFO Valid 53 spl_s3_enq_cmd_wph P_CLK single separate SRAM3 Pull Command Queue FIFO Enqueue 54 spl_s3_deq_wph P_CLK single separate SRAM3 Pull Command Queue FIFO Dequeue 55 spl_s3_cmd_que_full_wph P_CLK Long separate SRAM3 Pull Command Queue FIFO Full 11.4.6.24 SRAM CH3 Events Target ID(001011) / Design Block #(0010) Table 178. SRAM CH3 PMU Event List Event Number Event Name Clock Domain Pulse/ Level Burst Description Note: 1. All the SRAM Channel has same event lists. 2. S_CLK = SRAM clock domain 3. P_CLK = PP clock domain signals that begin with sps_ correspond to S_Push Arb signals that begin with spl_ correspond to S_Pull Arb signals that contain _pc_ (after the unit designation) correspond to the PCI target interface signals that contain _m_ (after the unit designation) correspond to the MSF target interface signals that contain _sh_ (after the unit designation) correspond to the SHaC target interface signals that contain _s0_ (after the unit designation) correspond to the SRAM0 target interface signals that contain _s1_ (after the unit designation) correspond to the SRAM1 target interface signals that contain _s2_ (after the unit designation) correspond to the SRAM2 target interface signals that contain _s3_ (after the unit designation) correspond to the SRAM3 target interface 420 Hardware Reference Manual Intel® IXP2800 Network Processor Performance Monitor Unit 11.4.6.25 SRAM CH2 Events Target ID(001100) / Design Block #(0010) Table 179. SRAM CH3 PMU Event List Event Number Event Name Clock Domain Pulse/ Level Burst Description Note: 1. All the SRAM Channel has same event lists. 2. S_CLK = SRAM clock domain 3. P_CLK = PP clock domain signals that begin with sps_ correspond to S_Push Arb signals that begin with spl_ correspond to S-Pull Arb signals that contain _pc_ (after the unit designation) correspond to the PCI target interface signals that contain _m_ (after the unit designation) correspond to the MSF target interface signals that contain _sh_ (after the unit designation) correspond to the SHaC target interface signals that contain _s0_ (after the unit designation) correspond to the SRAM0 target interface signals that contain _s1_ (after the unit designation) correspond to the SRAM1 target interface signals that contain _s2_ (after the unit designation) correspond to the SRAM2 target interface signals that contain _s3_ (after the unit designation) correspond to the SRAM3 target interface 11.4.6.26 SRAM CH1 Events Target ID(001101) / Design Block #(0010) Table 180. SRAM CH3 PMU Event List Event Number Event Name Clock Domain Pulse/ Level Burst Description Note: 1. All the SRAM Channel has same event lists. 2. S_CLK = SRAM clock domain 3. 
P_CLK = PP clock domain signals that begin with sps_ correspond to S-Push Arb signals that begin with spl_ correspond to S-Pull Arb signals that contain _pc_ (after the unit designation) correspond to the PCI target interface signals that contain _m_ (after the unit designation) correspond to the MSF target interface signals that contain _sh_ (after the unit designation) correspond to the SHaC target interface signals that contain _s0_ (after the unit designation) correspond to the SRAM0 target interface signals that contain _s1_ (after the unit designation) correspond to the SRAM1 target interface signals that contain _s2_ (after the unit designation) correspond to the SRAM2 target interface signals that contain _s3_ (after the unit designation) correspond to the SRAM3 target interface Hardware Reference Manual 421 Intel® IXP2800 Network Processor Performance Monitor Unit 11.4.6.27 SRAM CH0 Events Target ID(001110) / Design Block #(0010) Table 181. SRAM CH0 PMU Event List (Sheet 1 of 2) Event Number 422 Event Name Clock Domain Single pulse/ Long pulse Burst Description 0 QDR I/O Read S_CLK single separate QDR I/O Read 1 QDR I/O Write S_CLK single separate QDR I/O Write 2 Read Cmd Dispatched P_CLK single separate Read Cmd Dispatched 3 Write Cmd Dispatched P_CLK single separate Write Cmd Dispatched 4 Swap Cmd Dispatched P_CLK single separate Swap Cmd Dispatched 5 Set Dispatched P_CLK single separate Set Dispatched 6 Clear Cmd Dispatched P_CLK single separate Clear Cmd Dispatched 7 Add Cmd Dispatched P_CLK single separate Add Cmd Dispatched 8 Sub Cmd Dispatched P_CLK single separate Sub Cmd Dispatched 9 Incr Cmd Dispatched P_CLK single separate Incr Cmd Dispatched 10 Decr Cmd Dispatched P_CLK single separate Decr Cmd Dispatched 11 Ring Cmd Dispatched P_CLK single separate Ring Cmd Dispatched 12 Jour Cmd Dispatched P_CLK single separate Jour Cmd Dispatched 13 Deq Cmd Dispatched P_CLK single separate Deq Cmd Dispatched 14 Enq Cmd Dispatched P_CLK single separate Enq Cmd Dispatched 15 CSR Cmd Dispatched P_CLK single separate CSR Cmd Dispatched 16 WQDesc Cmd Dispatched P_CLK single separate WQDesc Cmd Dispatched 17 RQDesc Cmd Dispatched P_CLK single separate RQDesc Cmd Dispatched 18 FIFO Dequeue – CmdA0 Inlet Q P_CLK single separate FIFO Dequeue – CmdA0 Inlet Q 19 FIFO Enqueue – CmdA0 Inlet Q P_CLK single separate FIFO Enqueue – CmdA0 Inlet Q 20 FIFO Valid – CmdA0 Inlet Q P_CLK long separate FIFO Valid – CmdA0 Inlet Q 21 FIFO Full – CmdA1 Inlet Q P_CLK long separate FIFO Full – CmdA1 Inlet Q 22 FIFO Dequeue – CmdA1 Inlet Q P_CLK single separate FIFO Dequeue – CmdA1 Inlet Q 23 FIFO Enqueue – CmdA1 Inlet Q P_CLK single separate FIFO Enqueue – CmdA1 Inlet Q 24 FIFO Valid – CmdA1 Inlet Q P_CLK long separate FIFO Valid – CmdA1 Inlet Q 25 FIFO Full – CmdA1 Inlet Q P_CLK long separate FIFO Full – CmdA1 Inlet Q 26 FIFO Dequeue – Wr Cmd Q S_CLK single separate FIFO Dequeue – Wr Cmd Q 27 FIFO Enqueue – Wr Cmd Q P_CLK single separate FIFO Enqueue – Wr Cmd Q 28 FIFO Valid – Wr Cmd Q S_CLK long separate FIFO Valid – Wr Cmd Q 29 FIFO Full – Wr Cmd Q P_CLK long separate FIFO Full – Wr Cmd Q 30 FIFO Dequeue – Queue Cmd Q S_CLK single separate FIFO Dequeue – Queue Cmd Q 31 FIFO Enqueue – Queue Cmd Q P_CLK single separate FIFO Enqueue – Queue Cmd Q 32 FIFO Valid – Queue Cmd Q S_CLK long separate FIFO Valid – Queue Cmd Q Hardware Reference Manual Intel® IXP2800 Network Processor Performance Monitor Unit Table 181. 
SRAM CH0 PMU Event List (Sheet 2 of 2)

33  FIFO Full – Queue Cmd Q  P_CLK  long  separate  FIFO Full – Queue Cmd Q
34  FIFO Dequeue – Rd Cmd Q  S_CLK  single  separate  FIFO Dequeue – Rd Cmd Q
35  FIFO Enqueue – Rd Cmd Q  P_CLK  single  separate  FIFO Enqueue – Rd Cmd Q
36  FIFO Valid – Rd Cmd Q  S_CLK  long  separate  FIFO Valid – Rd Cmd Q
37  FIFO Full – Rd Cmd Q  P_CLK  long  separate  FIFO Full – Rd Cmd Q
38  FIFO Dequeue – Oref Cmd Q  S_CLK  single  separate  FIFO Dequeue – Oref Cmd Q
39  FIFO Enqueue – Oref Cmd Q  P_CLK  single  separate  FIFO Enqueue – Oref Cmd Q
40  FIFO Valid – Oref Cmd Q  S_CLK  long  separate  FIFO Valid – Oref Cmd Q
41  FIFO Full – Oref Cmd Q  P_CLK  long  separate  FIFO Full – Oref Cmd Q
42  FIFO Dequeue – SP0 Pull Data Q  S_CLK  single  separate  FIFO Dequeue – SP0 Pull Data Q
43  FIFO Enqueue – SP0 Pull Data Q  P_CLK  single  separate  FIFO Enqueue – SP0 Pull Data Q
44  FIFO Valid – SP0 Pull Data Q  S_CLK  long  separate  FIFO Valid – SP0 Pull Data Q
45  FIFO Full – SP0 Pull Data Q  P_CLK  long  separate  FIFO Full – SP0 Pull Data Q
46  FIFO Dequeue – SP1 Pull Data Q  S_CLK  single  separate  FIFO Dequeue – SP1 Pull Data Q
47  FIFO Enqueue – SP1 Pull Data Q  P_CLK  single  separate  FIFO Enqueue – SP1 Pull Data Q
48  FIFO Valid – SP1 Pull Data Q  S_CLK  long  separate  FIFO Valid – SP1 Pull Data Q
49  FIFO Full – SP1 Pull Data Q  P_CLK  long  separate  FIFO Full – SP1 Pull Data Q
50  FIFO Dequeue – Push ID/Data Q  P_CLK  single  separate  FIFO Dequeue – Push ID/Data Q
51  FIFO Enqueue – Push ID/Data Q  S_CLK  single  separate  FIFO Enqueue – Push ID/Data Q
52  FIFO Valid – Push ID/Data Q  P_CLK  long  separate  FIFO Valid – Push ID/Data Q
53  FIFO Full – Push ID/Data Q  S_CLK  long  separate  FIFO Full – Push ID/Data Q

11.4.6.28 DRAM DPLA Events

Target ID(010010) / Design Block #(0011)

Table 182. IXP2800 Network Processor Dram DPLA PMU Event List (Sheet 1 of 2)

Event Number  Event Name  Clock Domain  Single pulse/Long pulse  Burst  Description
0  d0_enq_id_wph  P_CLK  single  separate  Enqueue d0 cmd
1  d0_deq_id_wph  P_CLK  single  separate  Dequeue d0 cmd
2  dram_req_rph[0]  P_CLK  single  separate  d0 has a valid req
3  next_d0_full_wph  P_CLK  single  separate  d0 FIFO hit the full threshold
4  d1_enq_id_wph  P_CLK  single  separate  Enqueue d1 cmd
5  d1_deq_id_wph  P_CLK  single  separate  Dequeue d1 cmd
6  dram_req_rph[1]  P_CLK  single  separate  d1 has a valid req
7  next_d1_full_wph  P_CLK  single  separate  d1 FIFO hit the full threshold
8  d2_enq_id_wph  P_CLK  single  separate  Enqueue d2 cmd

Table 182. IXP2800 Network Processor Dram DPLA PMU Event List (Sheet 2 of 2)

9  d2_deq_id_wph  P_CLK  single  separate  Dequeue d2 cmd
10  dram_req_rph[2]  P_CLK  single  separate  d2 has a valid req
11  next_d2_full_wph  P_CLK  single  separate  d2 FIFO hit the full threshold
12  cr0_enq_id_wph  P_CLK  single  separate  Enqueue cr0 cmd
13  cr0_deq_id_wph  P_CLK  single  separate  Dequeue cr0 cmd
14  dram_req_rph[3]  P_CLK  single  separate  cr0 has a valid req
15  next_cr0_full_wph  P_CLK  single  separate  cr0 FIFO hit the full threshold
16  cr1_enq_id_wph  P_CLK  single  separate  Enqueue cr1 cmd
17  cr1_deq_id_wph  P_CLK  single  separate  Dequeue cr1 cmd
18  dram_req_rph[4]  P_CLK  single  separate  cr1 has a valid req
19  next_cr1_full_wph  P_CLK  single  separate  cr1 FIFO hit the full threshold
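As an illustration only (not part of the event list), the enqueue, dequeue, and full-threshold events of Table 182 can be turned into simple derived statistics. The C sketch below assumes hypothetical pmu_read_counter() and pmu_cycles_elapsed() accessors, and assumes the d0 events have already been routed to the counter IDs passed in; both are placeholders for platform-specific plumbing.

#include <stdint.h>

/* Hypothetical accessors; counter routing and reading are platform-specific. */
extern uint64_t pmu_read_counter(unsigned counter_id);
extern uint64_t pmu_cycles_elapsed(void);   /* P_CLK cycles in the sample window */

struct dpla_fifo_stats {
    double   enq_per_kcycle;   /* command enqueues per 1000 P_CLK cycles */
    double   deq_per_kcycle;   /* command dequeues per 1000 P_CLK cycles */
    uint64_t full_hits;        /* times the FIFO hit its full threshold  */
};

/* Combine d0_enq_id_wph, d0_deq_id_wph and next_d0_full_wph style counts
 * into rates over the sampling window. */
static struct dpla_fifo_stats dpla_fifo_stats(unsigned enq_ctr, unsigned deq_ctr,
                                              unsigned full_ctr)
{
    double cycles = (double)pmu_cycles_elapsed();
    struct dpla_fifo_stats s;
    s.enq_per_kcycle = 1000.0 * (double)pmu_read_counter(enq_ctr) / cycles;
    s.deq_per_kcycle = 1000.0 * (double)pmu_read_counter(deq_ctr) / cycles;
    s.full_hits      = pmu_read_counter(full_ctr);
    return s;
}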
11.4.6.29 DRAM DPSA Events

Target ID(010011) / Design Block #(0011)

Table 183. IXP2800 Network Processor Dram DPSA PMU Event List (Sheet 1 of 2)

Event Number  Event Name  Clock Domain  Single pulse/Long pulse  Burst  Description
0  d0_enq_id_wph  P_CLK  single  separate  Enqueue d0 cmd/data
1  d0_deq_id_wph  P_CLK  single  separate  Dequeue d0 cmd/data
2  dram_req_rph[0]  P_CLK  single  separate  d0 has a valid req (not empty)
3  next_d0_full_wph  P_CLK  single  separate  d0 FIFO hit the full threshold
4  d1_enq_id_wph  P_CLK  single  separate  Enqueue d1 cmd/data
5  d1_deq_id_wph  P_CLK  single  separate  Dequeue d1 cmd/data
6  dram_req_rph[1]  P_CLK  single  separate  d1 has a valid req
7  next_d1_full_wph  P_CLK  single  separate  d1 FIFO hit the full threshold
8  d2_enq_id_wph  P_CLK  single  separate  Enqueue d2 cmd/data
9  d2_deq_id_wph  P_CLK  single  separate  Dequeue d2 cmd/data
10  dram_req_rph[2]  P_CLK  single  separate  d2 has a valid req
11  next_d2_full_wph  P_CLK  single  separate  d2 FIFO hit the full threshold
12  cr0_enq_id_wph  P_CLK  single  separate  Enqueue cr0 cmd/data
13  cr0_deq_id_wph  P_CLK  single  separate  Dequeue cr0 cmd/data
14  dram_req_rph[3]  P_CLK  single  separate  cr0 has a valid req
15  next_cr0_full_wph  P_CLK  single  separate  cr0 FIFO hit the full threshold
16  cr1_enq_id_wph  P_CLK  single  separate  Enqueue cr1 cmd/data

Table 183. IXP2800 Network Processor Dram DPSA PMU Event List (Sheet 2 of 2)

17  cr1_deq_id_wph  P_CLK  single  separate  Dequeue cr1 cmd/data
18  dram_req_rph[4]  P_CLK  single  separate  cr1 has a valid req
19  next_cr1_full_wph  P_CLK  single  separate  cr1 FIFO hit the full threshold

11.4.6.30 IXP2800 Network Processor DRAM CH2 Events

Target ID(010100) / Design Block #(0011)

Table 184. IXP2800 Network Processor Dram CH2 PMU Event List (Sheet 1 of 5)

Event Number  Event Name  Clock Domain  Single pulse/Long pulse  Burst  Description
0  DATA_ERROR_SYNC_RPH  P_CLK  single  separate  Indicates that a data error has been detected on the RDRAM read data from the RMC. The signal asserts for both correctable and uncorrectable errors.
1  GET_PULL_DATA_SYNC_RPH  P_CLK  single  separate  Asserts when the RMC is accepting RDRAM write data from the d_app_unit block.
2  TAKE_PUSH_DATA_SYNC_RPH  P_CLK  single  separate  Asserts when the RMC is driving RDRAM read data to the d_push_bus_if block.
3  START_SYNC_RPH  P_CLK  single  separate  Input to RMC, asserted to request memory transaction and deasserted when the RMC is ready to accept a command (i.e., when RMC asserts GETC_SYNC_RPH).
4  GETC_SYNC_RPH  P_CLK  single  separate  Output from RMC, indicates the RMC is ready to accept a command for the RDRAM channel.
5  reserved
6  reserved
7  dps_push_ctrl_fifo_full_rph  P_CLK  single  separate  Active when the push_control FIFO is nearly full, i.e., > 6 entries.
8  push_ctrl_fifo_enq_rph  P_CLK  single  separate  Active when enqueueing push control and status data to the FIFOs in d_push_bus_if.
9  DPS_ENQ_PUSH_DATA_RPH  P_CLK  single  separate  Active when enqueueing data from the RMC into the push data FIFO.
10  valid_push_data_rp  P_CLK  single  separate  Is (data_valid AND !create_databackup), where data_valid indicates data available in the d_push_bus_if data FIFO; create_databackup asserts when the push arbiter FIFO (in dp_unit block) gets nearly full. When it asserts, it prevents dequeueing from the d_push_bus_if data FIFO.
11  push_ctrl_fifo_empty_rph  P_CLK  single  separate  Active when the push control FIFO is empty.
12  deq_push_data_wph  P_CLK  single  separate  Asserts to dequeue from the data FIFO in the d_push_bus_if block.
13  deq_csr_data_wph  P_CLK  single  separate  Pulses active when reading from a CSR instead of DRAM.

Table 184. IXP2800 Network Processor Dram CH2 PMU Event List (Sheet 2 of 5)

14  deq_push_ctrl_wph  P_CLK  single  separate  Active when dequeueing from the push control FIFO; occurs on the last cycle of a burst or on the only cycle of a single transfer.
15  d_push_ctrl_fsm/single_xfer_wph  P_CLK  single  separate  Active if the data about to be transferred from the d_push data FIFO to the dp_unit FIFO is length 0, i.e., a single 8-byte transfer.
16  d_push_ctrl_fsm/data_128_bit_alligned  P_CLK  single  separate  Active if the data about to be transferred from the d_push data FIFO to the dp_unit FIFO is quadword (128-bit) aligned.
17  perf_data_fifo_full  P_CLK  single  separate  Asserts when the data FIFO in the d_push_bus_if block has > 4 entries. (Data from the RDRAM is enqueued into this FIFO; dequeued data is written to the push bus arbiter FIFO in the dp_unit block.)
18  reserved
19  reserved
20  DPL_RMW_BANK3_READ_DATA_AVAIL_RPH  P_CLK  single  separate  Indicates that the read data for a read-modify-write operation is available in the d_pull_bus_if block. This signal is deasserted when the data and command
21  DPL_RMW_BANK2_READ_DATA_AVAIL_RPH  P_CLK  single  separate  Indicates that the read data for a read-modify-write operation is available in the d_pull_bus_if block. This signal is deasserted when the data and command
22  DPL_RMW_BANK1_READ_DATA_AVAIL_RPH  P_CLK  single  separate  Indicates that the read data for a read-modify-write operation is available in the d_pull_bus_if block. This signal is deasserted when the data and command
23  DPL_RMW_BANK0_READ_DATA_AVAIL_RPH  P_CLK  single  separate  Indicates that the read data for a read-modify-write operation is available in the d_pull_bus_if block. This signal is deasserted when the data and command
24  addr_128bit_alligned  P_CLK  single  separate  Indicates that bit 3 of the DRAM command's address at the head of the pull control FIFO (i.e., about to be dequeued) is low. This command is for the pull data which is about to be enqueued into a pull data bank FIFO.
25  b3_empty_rph  P_CLK  single  separate  Indicates that the pull data's bank FIFO is empty.
26  b2_empty_rph  P_CLK  single  separate  Indicates that the pull data's bank FIFO is empty.
27  b1_empty_rph  P_CLK  single  separate  Indicates that the pull data's bank FIFO is empty.
28  b0_empty_rph  P_CLK  single  separate  Indicates that the pull data's bank FIFO is empty.
29  b3_full_rph  P_CLK  single  separate  Indicates that the pull data's bank FIFO has > 0xf entries in it, i.e., is full.
30  b2_full_rph  P_CLK  single  separate  Indicates that the pull data's bank FIFO has > 0xf entries in it, i.e., is full.
31  b1_full_rph  P_CLK  single  separate  Indicates that the pull data's bank FIFO has > 0xf entries in it, i.e., is full.
32  b0_full_rph  P_CLK  single  separate  Indicates that the pull data's bank FIFO has > 0xf entries in it, i.e., is full.

Table 184. IXP2800 Network Processor Dram CH2 PMU Event List (Sheet 3 of 5)

33  DAP_DEQ_B3_DATA_RPH  P_CLK  single  separate  Indicates pull data and command are being dequeued from the data and command bank FIFOs to the RMC (the command and data FIFOs used in tandem for pulls to supply the address and data respectively).
34  DAP_DEQ_B2_DATA_RPH  P_CLK  single  separate  Indicates pull data and command are being dequeued from the data and command bank FIFOs to the RMC (the command and data FIFOs used in tandem for pulls to supply the address and data respectively).
35  DAP_DEQ_B1_DATA_RPH  P_CLK  single  separate  Indicates pull data and command are being dequeued from the data and command bank FIFOs to the RMC (the command and data FIFOs used in tandem for pulls to supply the address and data respectively).
36  DAP_DEQ_B0_DATA_RPH  P_CLK  single  separate  Indicates pull data and command are being dequeued from the data and command bank FIFOs to the RMC (the command and data FIFOs used in tandem for pulls to supply the address and data respectively).
37  csr_wr_data_avail  P_CLK  single  separate  Indicates that CSR write data is ready to be latched into the CSR.
38  bank3_enq_wph  P_CLK  single  separate  Indicates pull data is being enqueued to a bank FIFO in the pull bus interface block.
39  bank2_enq_wph  P_CLK  single  separate  Indicates pull data is being enqueued to a bank FIFO in the pull bus interface block.
40  bank1_enq_wph  P_CLK  single  separate  Indicates pull data is being enqueued to a bank FIFO in the pull bus interface block.
41  bank0_enq_wph  P_CLK  single  separate  Indicates pull data is being enqueued to a bank FIFO in the pull bus interface block.
42  reserved
43  DCB_BANK3_CMD_AVAL_RDH  D_CLK  single  separate  Indicates that this bank FIFO has a command available.
44  DCB_BANK2_CMD_AVAL_RDH  D_CLK  single  separate  Indicates that this bank FIFO has a command available.
45  DCB_BANK1_CMD_AVAL_RDH  D_CLK  single  separate  Indicates that this bank FIFO has a command available.
46  DCB_BANK0_CMD_AVAL_RDH  D_CLK  single  separate  Indicates that this bank FIFO has a command available.
47  DAP_CSR_READ_CMD_TAKEN_WDH  D_CLK  single  separate  Indicates dequeueing of a CSR read command and clears the CSR read request signal coming out of d_command_bus_if.
48  DAP_BANK3_CMD_DEQ_WDH  D_CLK  single  separate  Active to dequeue a DRAM command from bank N's FIFO, generated by d_app block.
49  DAP_BANK2_CMD_DEQ_WDH  D_CLK  single  separate  Active to dequeue a DRAM command from bank N's FIFO, generated by d_app block.
50  DAP_BANK1_CMD_DEQ_WDH  D_CLK  single  separate  Active to dequeue a DRAM command from bank N's FIFO, generated by d_app block.
51  DAP_BANK0_CMD_DEQ_WDH  D_CLK  single  separate  Active to dequeue a DRAM command from bank N's FIFO, generated by d_app block.
52  split_cmd_wph  P_CLK  single  separate  Active if the command will cross a 128-byte boundary and thus be split across channels.
53  reserved
54  reserved
55  reserved
56  reserved

Table 184. IXP2800 Network Processor Dram CH2 PMU Event List (Sheet 4 of 5)

57  reserved
58  reserved
59  deq_split_cmd_fifo_wph  P_CLK  single  separate  Active when dequeueing from the split inlet FIFO.
60  deq_inlet_fifo1_wph  P_CLK  single  separate  Active when dequeueing from the inlet FIFO.
61  deq_inlet_fifo_wph  P_CLK  single  separate  Active when dequeueing from either the inlet or split-inlet FIFO.
62  DCB_PULL_CTRL_AVAL_WPH  P_CLK  single  separate  Indicates the pull control FIFO has >= 1 entry.
63  inlet_cmd_aval_rph  P_CLK  single  separate  Indicates a command is available in the non-split inlet FIFO.
64  split_fifo_not_empty  P_CLK  single  separate  Indicates a command is available in the "split inlet" FIFO (split refers to a command being split across channels).
65  bank3_pull_ok_wph  P_CLK  single  separate  Indicates that this bank's data FIFO has enough room to accommodate the size of the next pull command in the inlet FIFO.
66  bank2_pull_ok_wph  P_CLK  single  separate  Indicates that this bank's data FIFO has enough room to accommodate the size of the next pull command in the inlet FIFO.
67  bank1_pull_ok_wph  P_CLK  single  separate  Indicates that this bank's data FIFO has enough room to accommodate the size of the next pull command in the inlet FIFO.
68  bank0_pull_ok_wph  P_CLK  single  separate  Indicates that this bank's data FIFO has enough room to accommodate the size of the next pull command in the inlet FIFO.
69  csr_q_full_wph  P_CLK  single  separate  Indicates that a CSR access is in process.
70  DXDP_CMD_Q_FULL_RPH  P_CLK  single  separate  Indicates the command inlet FIFO contains > 8 entries.
71  pull_ctrl_fifo_full  P_CLK  single  separate  Indicates that there are > 6 outstanding pull requests.
72  bank3_cmd_q_full_rph  P_CLK  single  separate  Indicates the bank command FIFO contains > 6 entries.
73  bank2_cmd_q_full_rph  P_CLK  single  separate  Indicates the bank command FIFO contains > 6 entries.
74  bank1_cmd_q_full_rph  P_CLK  single  separate  Indicates the bank command FIFO contains > 6 entries.
75  bank0_cmd_q_full_rph  P_CLK  single  separate  Indicates the bank command FIFO contains > 6 entries.
76  valid_write_req_wph  P_CLK  single  separate  Indicates a DRAM write is being passed from the inlet FIFO to a bank FIFO. The DRAM write may be: DRAM RBUF read, DRAM write, or CSR write.
77  csr_q_full_en_wph  P_CLK  single  separate  Pulses at both the start of a CSR read/write and at the completion of a CSR read/write.
78  push_rmw_wr_cmd_wph  P_CLK  single  separate  Indicates the command being passed from the inlet FIFO to a bank FIFO is a read-modify-write.
79  bank3_enq_wph  P_CLK  single  separate  Indicates this channel is enqueueing a DRAM command for bank3.
68 bank0_pull_ok_wph P_CLK single separate Indicates that this bank's data FIFO has enough room to accommodate the size of the next pull command in the inlet FIFO. 69 csr_q_full_wph P_CLK single separate Indicates that a CSR access is in process. 70 DXDP_CMD_Q_FULL_RPH P_CLK single separate Indicates the command inlet FIFO contains > 8 entries. 71 pull_ctrl_fifo_full P_CLK single separate Indicates that there are > 6 outstanding pull requests. 72 bank3_cmd_q_full_rph P_CLK single separate Indicates the bank command FIFO contains > 6 entries. 73 bank2_cmd_q_full_rph P_CLK single separate Indicates the bank command FIFO contains > 6 entries. 74 bank1_cmd_q_full_rph P_CLK single separate Indicates the bank command FIFO contains > 6 entries. 75 bank0_cmd_q_full_rph P_CLK single separate Indicates the bank command FIFO contains > 6 entries. 76 valid_write_req_wph P_CLK single separate Indicates a DRAM write is being passed from the inlet FIFO to a bank FIFO. The DRAM write may be: DRAM RBUF read, DRAM write, or CSR write. 77 csr_q_full_en_wph P_CLK single separate Pulses at both the start of a CSR read/write and at the completion of a CSR read/write. 78 push_rmw_wr_cmd_wph P_CLK single separate Indicates the command being passed from the inlet FIFO to a bank FIFO is a read-modify-write. 79 bank3_enq_wph P_CLK single separate Indicates this channel is enqueueing a DRAM command for bank3. 428 Hardware Reference Manual Intel® IXP2800 Network Processor Performance Monitor Unit Table 184. IXP2800 Network Processor Dram CH2 PMU Event List (Sheet 5 of 5) 80 bank2_enq_wph P_CLK single separate Indicates this channel is enqueueing a DRAM command for bank2. 81 bank1_enq_wph P_CLK single separate Indicates this channel is enqueueing a DRAM command for bank1. 82 bank0_enq_wph P_CLK single separate Indicates this channel is enqueueing a DRAM command for bank0. 83 push2split_cmd_fifo_wph P_CLK single separate Indicates this channel is enqueueing a DRAM command which is split between two channels. 84 push2inlet_fifo_wph P_CLK single separate Indicates this channel is enqueueing a DRAM command which fits entirely in this channel. 85 valid_dram_cmd_wph P_CLK single separate Indicates the command bus' target ID is DRAM. 86-127 reserved 11.4.6.31 IXP2800 Network Processor DRAM CH1 Events Target ID(010101) / Design Block #(0011) Table 185. IXP2800 Network Processor Dram CH1 PMU Event List Event Number Event Name Clock Domain Pulse/ Level Burst Description Note: 1. All the Dram Channels have same event lists. Please reference Channel 2 Event lists. 11.4.6.32 IXP2800 Network Processor DRAM CH0 Events Target ID(010110) / Design Block #(0011) Table 186. IXP2800 Network Processor Dram CH0 PMU Event List Event Number Event Name Clock Domain Pulse/ Level Burst Description Note: 1. All the Dram Channels have same event lists. Please reference Channel 2 Event lists Hardware Reference Manual 429 Intel® IXP2800 Network Processor Performance Monitor Unit 430 Hardware Reference Manual