Cortex-A™ Series Programmer’s Guide

Version: 2.0

Copyright © 2011 ARM. All rights reserved.
ARM DEN0013B (ID082411)

Cortex-A Series
Programmer’s Guide
Copyright © 2011 ARM. All rights reserved.
Release Information
The following changes have been made to this book.

Change history

Date             Issue   Confidentiality    Change
25 March 2011    A       Non-Confidential   First release
10 August 2011   B       Non-Confidential   Second release. Virtualization chapter added.
                                            Updated to include the Cortex-A15 processor and LPAE.
                                            Corrected and revised throughout.

Proprietary Notice
This Cortex-A Series Programmer’s Guide is protected by copyright and the practice or implementation of the
information herein may be protected by one or more patents or pending applications. No part of this Cortex-A Series
Programmer’s Guide may be reproduced in any form by any means without the express prior written permission of
ARM. No license, express or implied, by estoppel or otherwise to any intellectual property rights is granted by
this Cortex-A Series Programmer’s Guide.
Your access to the information in this Cortex-A Series Programmer’s Guide is conditional upon your acceptance that
you will not use or permit others to use the information for the purposes of determining whether implementations of the
information herein infringe any third party patents.
This Cortex-A Series Programmer’s Guide is provided “as is”. ARM makes no representations or warranties, either
express or implied, including but not limited to warranties of merchantability, fitness for a particular purpose, or
non-infringement, that the content of this Cortex-A Series Programmer’s Guide is suitable for any particular purpose or
that any practice or implementation of the contents of the Cortex-A Series Programmer’s Guide will not infringe any
third party patents, copyrights, trade secrets, or other rights.
This Cortex-A Series Programmer’s Guide may include technical inaccuracies or typographical errors.
To the extent not prohibited by law, in no event will ARM be liable for any damages, including without limitation any
direct loss, lost revenue, lost profits or data, special, indirect, consequential, incidental or punitive damages, however
caused and regardless of the theory of liability, arising out of or related to any furnishing, practicing, modifying or any
use of this Programmer’s Guide, even if ARM has been advised of the possibility of such damages. The information
provided herein is subject to U.S. export control laws, including the U.S. Export Administration Act and its associated
regulations, and may be subject to export or import regulations in other countries. You agree to comply fully with all
laws and regulations of the United States and other countries (“Export Laws”) to assure that neither the information
herein, nor any direct products thereof are: (i) exported, directly or indirectly, in violation of Export Laws, either to any
countries that are subject to U.S. export restrictions or to any end user who has been prohibited from participating in
U.S. export transactions by any federal agency of the U.S. government; or (ii) intended to be used for any purpose
prohibited by Export Laws, including, without limitation, nuclear, chemical, or biological weapons proliferation.
Words and logos marked with ® or TM are registered trademarks or trademarks of ARM Limited, except as otherwise
stated below in this proprietary notice. Other brands and names mentioned herein may be the trademarks of their
respective owners.
Copyright © 2011 ARM Limited
110 Fulbourn Road Cambridge, CB1 9NJ, England
This document is Non-Confidential but any disclosure by you is subject to you providing notice to and the
acceptance by the recipient of, the conditions set out above.
In this document, where the term ARM is used to refer to the company it means “ARM or any of its subsidiaries as
appropriate”.
Web Address
http://www.arm.com


Contents
Cortex-A Series Programmer’s Guide

Preface
             References .......................................................... x
             Typographical conventions ........................................... xi
             Feedback on this book ............................................... xii
             Terms and abbreviations ............................................. xiii

Chapter 1    Introduction
             1.1   History ....................................................... 1-3
             1.2   System-on-Chip (SoC) .......................................... 1-4
             1.3   Embedded systems .............................................. 1-5

Chapter 2    The ARM Architecture
             2.1   Architecture versions ......................................... 2-3
             2.2   Architecture history and extensions ........................... 2-4
             2.3   Key points of the ARM Cortex-A series architecture ............ 2-8
             2.4   Processors and pipelines ...................................... 2-9

Chapter 3    Tools, Operating Systems and Boards
             3.1   Linux distributions ........................................... 3-2
             3.2   Useful tools .................................................. 3-6
             3.3   Software toolchains for ARM processors ........................ 3-8
             3.4   ARM DS-5 ...................................................... 3-11
             3.5   Example platforms ............................................. 3-13

Chapter 4    ARM Registers, Modes and Instruction Sets
             4.1   Instruction sets .............................................. 4-2
             4.2   Modes ......................................................... 4-3
             4.3   Registers ..................................................... 4-4
             4.4   Instruction pipelines ......................................... 4-7
             4.5   Branch prediction ............................................. 4-10

Chapter 5    Introduction to Assembly Language
             5.1   Comparison with other assembly languages ...................... 5-2
             5.2   Instruction sets .............................................. 5-4
             5.3   ARM tools assembly language ................................... 5-5
             5.4   Introduction to the GNU Assembler ............................. 5-7
             5.5   Interworking .................................................. 5-11
             5.6   Identifying assembly code ..................................... 5-12

Chapter 6    ARM/Thumb Unified Assembly Language Instructions
             6.1   Instruction set basics ........................................ 6-2
             6.2   Data processing operations .................................... 6-6
             6.3   Multiplication operations ..................................... 6-9
             6.4   Memory instructions ........................................... 6-10
             6.5   Branches ...................................................... 6-13
             6.6   Integer SIMD instructions ..................................... 6-14
             6.7   Saturating arithmetic ......................................... 6-18
             6.8   Miscellaneous instructions .................................... 6-19

Chapter 7    Caches
             7.1   Why do caches help? ........................................... 7-3
             7.2   Cache drawbacks ............................................... 7-4
             7.3   Memory hierarchy .............................................. 7-5
             7.4   Cache terminology ............................................. 7-6
             7.5   Cache architecture ............................................ 7-7
             7.6   Cache controller .............................................. 7-8
             7.7   Direct mapped caches .......................................... 7-9
             7.8   Set associative caches ........................................ 7-11
             7.9   A real-life example ........................................... 7-12
             7.10  Virtual and physical tags and indexes ......................... 7-13
             7.11  Cache policies ................................................ 7-14
             7.12  Allocation policy ............................................. 7-15
             7.13  Replacement policy ............................................ 7-16
             7.14  Write policy .................................................. 7-17
             7.15  Write and Fetch buffers ....................................... 7-18
             7.16  Cache performance and hit rate ................................ 7-19
             7.17  Invalidating and cleaning cache memory ........................ 7-20
             7.18  Cache lockdown ................................................ 7-21
             7.19  Level 2 cache controller ...................................... 7-22
             7.20  Point of coherency and unification ............................ 7-23
             7.21  Parity and ECC in caches ...................................... 7-24
             7.22  Tightly coupled memory ........................................ 7-25

Chapter 8    Memory Management Unit
             8.1   Virtual memory ................................................ 8-3
             8.2   Level 1 page tables ........................................... 8-4
             8.3   Level 2 page tables ........................................... 8-7
             8.4   The Translation Lookaside Buffer .............................. 8-9
             8.5   TLB coherency ................................................. 8-10
             8.6   Choice of page sizes .......................................... 8-11
             8.7   Memory attributes ............................................. 8-12
             8.8   Multi-tasking and OS usage of page tables ..................... 8-15
             8.9   Linux use of page tables ...................................... 8-18
             8.10  The Cortex-A15 MMU and Large Physical Address Extensions ..... 8-21

Chapter 9    Memory Ordering
             9.1   ARM memory ordering model ..................................... 9-4
             9.2   Memory barriers ............................................... 9-6
             9.3   Cache coherency implications .................................. 9-12

Chapter 10   Exception Handling
             10.1  Types of exception ............................................ 10-2
             10.2  Entering an exception handler ................................. 10-4
             10.3  Exit from an exception handler ................................ 10-5
             10.4  Exception mode summary ........................................ 10-6
             10.5  Vector table .................................................. 10-8
             10.6  Distinction between FIQ and IRQ ............................... 10-9
             10.7  Return instruction ............................................ 10-10
             10.8  Privilege model in ARMv7-A Virtualization Extensions ......... 10-11

Chapter 11   Interrupt Handling
             11.1  External interrupt requests ................................... 11-2
             11.2  Generic Interrupt Controller .................................. 11-5

Chapter 12   Other Exception Handlers
             12.1  Abort handler ................................................. 12-2
             12.2  Undefined instruction handling ................................ 12-4
             12.3  SVC exception handling ........................................ 12-5
             12.4  Linux exception program flow .................................. 12-6

Chapter 13   Boot Code
             13.1  Booting a bare-metal system ................................... 13-2
             13.2  Configuration ................................................. 13-6
             13.3  Booting Linux ................................................. 13-7

Chapter 14   Porting
             14.1  Endianness .................................................... 14-2
             14.2  Alignment ..................................................... 14-6
             14.3  Miscellaneous C porting issues ................................ 14-8
             14.4  Porting ARM assembly code to ARMv7 ............................ 14-11
             14.5  Porting ARM code to Thumb ..................................... 14-12

Chapter 15   Application Binary Interfaces
             15.1  Procedure Call Standard ....................................... 15-2
             15.2  Mixing C and assembly code .................................... 15-7

Chapter 16   Profiling
             16.1  Profiler output ............................................... 16-3

Chapter 17   Optimizing Code to Run on ARM Processors
             17.1  Compiler optimizations ........................................ 17-3
             17.2  ARM memory system optimization ................................ 17-8
             17.3  Source code modifications ..................................... 17-14
             17.4  Cortex-A9 micro-architecture optimizations .................... 17-19

Chapter 18   Floating-Point
             18.1  Floating-point basics and the IEEE-754 standard ............... 18-2
             18.2  VFP support in GCC ............................................ 18-9
             18.3  VFP support in the ARM Compiler ............................... 18-10
             18.4  VFP support in Linux .......................................... 18-11
             18.5  Floating-point optimization ................................... 18-12

Chapter 19   Introducing NEON
             19.1  SIMD .......................................................... 19-2
             19.2  NEON architecture overview .................................... 19-4
             19.3  NEON comparisons with other SIMD solutions .................... 19-11

Chapter 20   Writing NEON Code
             20.1  NEON C Compiler and assembler ................................. 20-2
             20.2  Optimizing NEON assembler code ................................ 20-6
             20.3  NEON power saving ............................................. 20-9

Chapter 21   Power Management
             21.1  Power and clocking ............................................ 21-2

Chapter 22   Introduction to Multi-processing
             22.1  Multi-processing ARM systems .................................. 22-3
             22.2  Symmetric multi-processing .................................... 22-5
             22.3  Asymmetric multi-processing ................................... 22-7

Chapter 23   SMP Architectural Considerations
             23.1  Cache coherency ............................................... 23-2
             23.2  TLB and cache maintenance broadcast ........................... 23-4
             23.3  Handling interrupts in an SMP system .......................... 23-5
             23.4  Exclusive accesses ............................................ 23-6
             23.5  Booting SMP systems ........................................... 23-9
             23.6  Private memory region ......................................... 23-11

Chapter 24   Parallelizing Software
             24.1  Decomposition methods ......................................... 24-2
             24.2  Threading models .............................................. 24-4
             24.3  Threading libraries ........................................... 24-5
             24.4  Synchronization mechanisms in the Linux kernel ................ 24-8

Chapter 25   Issues with Parallelizing Software
             25.1  Thread safety and reentrancy .................................. 25-2
             25.2  Performance issues ............................................ 25-3
             25.3  Profiling in SMP systems ...................................... 25-5

Chapter 26   Security
             26.1  TrustZone hardware architecture ............................... 26-2

Chapter 27   Virtualization
             27.1  ARMv7-A Virtualization Extensions ............................. 27-3
             27.2  Hypervisor exception model .................................... 27-5
             27.3  Relationship between virtualization and ARM Security Extensions  27-6

Chapter 28   Debug
             28.1  ARM debug hardware ............................................ 28-2
             28.2  ARM trace hardware ............................................ 28-3
             28.3  Debug monitor ................................................. 28-6
             28.4  Debugging Linux applications .................................. 28-7
             28.5  ARM tools supporting debug and trace .......................... 28-8

Appendix A   Instruction Summary
             A.1   Instruction Summary ........................................... A-2

Appendix B   NEON and VFP Instruction Summary
             B.1   NEON general data processing instructions ..................... B-6
             B.2   NEON shift instructions ....................................... B-12
             B.3   NEON logical and compare operations ........................... B-16
             B.4   NEON arithmetic instructions .................................. B-22
             B.5   NEON multiply instructions .................................... B-30
             B.6   NEON load and store element and structure instructions ....... B-33
             B.7   VFP instructions .............................................. B-39
             B.8   NEON and VFP pseudo-instructions .............................. B-45

Appendix C   Building Linux for ARM Systems
             C.1   Building the Linux kernel ..................................... C-2
             C.2   Creating the Linux filesystem ................................. C-6
             C.3   Putting it together ........................................... C-8

Preface

This book provides an introduction to ARM technology for programmers using ARM Cortex-A
series processors that conform to the ARMv7-A architecture. The “v7” refers to version 7 of
the architecture, while the “A” indicates the architecture profile that describes Application
processors. This includes the Cortex-A5, Cortex-A8, Cortex-A9 and Cortex-A15 processors. The
book complements rather than replaces other ARM documentation that is available for Cortex-A
series processors, such as the ARM Technical Reference Manuals (TRMs) for the processors
themselves, documentation for individual devices or boards and, of course, most importantly, the
ARM Architecture Reference Manual (or the “ARM ARM”).
Although much of the book is also applicable to other ARM processors, we do not explicitly cover
processors that implement older versions of the architecture. The Cortex-R series and M-series
processors are mentioned but not described. Our intention is to provide an approachable
introduction to the ARM architecture, covering the feature set in detail and providing practical
advice on writing both C and assembly language programs to run efficiently on a Cortex-A series
processor. We assume familiarity with the C language and some knowledge of microprocessor
architectures, although no ARM-specific background is needed. We hope that the text will be well
suited to programmers who have a desktop PC or x86 background and are taking their first steps
into the ARM-based world.
The first dozen chapters of the book cover the basic features of the ARM Cortex-A series
processors. An introduction to the fundamentals of the ARM architecture and some background on
individual processors is provided in Chapter 2. In Chapter 3, we briefly consider some of the tools
and platforms available to those getting started with ARM programming. Chapters 4, 5 and 6
provide a brisk introduction to ARM assembly language programming, covering the various
registers, modes and assembly language instructions. We then switch our focus to the memory
system and look at Caches, Memory Management and Memory Ordering in Chapters 7, 8 and 9.
Dealing with interrupts and other exceptions is described in Chapters 10 to 12.

The remaining chapters of the book provide more advanced programming information.
Chapter 13 provides an overview of boot code. Chapter 14 looks at issues with porting C and
assembly code to the ARMv7 architecture, from other architectures and from older versions of
the ARM architecture. Chapter 15 covers the Application Binary Interface, knowledge of which
is useful to both C and assembly language programmers. Profiling and optimizing code are
covered in Chapters 16 and 17. Many of the techniques presented are not specific to the ARM
architecture, but we also provide some processor-specific hints. We look at floating-point and
the ARM Advanced SIMD extensions (NEON) in Chapters 18-20. These chapters are only an
introduction to the relevant topics; it would take significantly longer to cover all of the powerful
capabilities of NEON and how to apply them to common signal processing algorithms.
Power management is an important part of ARM programming and is covered in Chapter 21.
Chapters 22-25 cover the area of multi-processing. We take a detailed look at how this is
implemented by ARM and how you can write code to take advantage of it. The final chapters
of the book provide brief coverage of the ARM Security Extensions (TrustZone®) in
Chapter 26, the ARM Virtualization Extensions (Chapter 27) and the powerful hardware debug
features available to programmers (Chapter 28). Appendices A and B give a summary of the
available ARM, NEON and VFP instructions, and Appendix C gives step-by-step instructions
for configuring and building Linux for ARM systems.


References
Cohen, D. “On Holy Wars and a Plea for Peace”, USC/ISI, IEN 137, April 1980,
http://www.ietf.org/rfc/ien/ien137.txt.
Furber, Steve. “ARM System-on-chip Architecture”, 2nd edition, Addison-Wesley, 2000, ISBN:
9780201675191.
Hohl, William. “ARM Assembly Language: Fundamentals and Techniques”, CRC Press, 2009,
ISBN: 9781439806104.
Sloss, Andrew N.; Symes, Dominic; Wright, Chris. “ARM System Developer’s Guide:
Designing and Optimizing System Software”, Morgan Kaufmann, 2004, ISBN:
9781558608740.
Yiu, Joseph. “The Definitive Guide to the ARM Cortex-M3”, 2nd edition, Newnes, 2009, ISBN:
9780750685344.
ANSI/IEEE Std 754-1985, “IEEE Standard for Binary Floating-Point Arithmetic”.
IEEE Std 754-2008, “IEEE Standard for Floating-Point Arithmetic”.
ANSI/IEEE Std 1003.1-1990, “Standard for Information Technology - Portable Operating
System Interface (POSIX) Base Specifications, Issue 7”.
ANSI/IEEE Std 1149.1-2001, “IEEE Standard Test Access Port and Boundary-Scan
Architecture”.
The ARM Architecture Reference Manual (known as the ARM ARM) is a must-read for any
serious ARM programmer. It is available (after registration) from the ARM website. It fully
describes the ARMv7 instruction set architecture, programmer’s model, system registers, debug
features and memory model. It forms a detailed specification to which all implementations of
ARM processors must adhere.
References to the ARM Architecture Reference Manual in this document are to:
ARM Architecture Reference Manual - ARMv7-A and ARMv7-R edition (ARM DDI 0406).
Note
In the event of a contradiction between this book and the ARM ARM, the ARM ARM is
definitive and must take precedence.
ARM Generic Interrupt Controller Architecture Specification (ARM IHI 0048).
ARM Compiler Toolchain Assembler Reference (DUI 0489).
The individual processor Technical Reference Manuals provide a detailed description of the
processor behavior. They can be obtained from the ARM website documentation area,
http://infocenter.arm.com/help/index.jsp.


Typographical conventions
This book uses the following typographical conventions:
italic             Highlights important notes, introduces special terminology, denotes
                   internal cross-references, and citations.

bold               Used for terms in descriptive lists, where appropriate.

monospace          Denotes text that you can enter at the keyboard, such as commands, file
                   and program names, instruction names, parameters and source code.

monospace italic   Denotes arguments to monospace text where the argument is to be
                   replaced by a specific value.

< and >            Enclose replaceable terms for assembler syntax where they appear in code
                   or code fragments. For example:
                   MRC p15, 0, <Rd>, <CRn>, <CRm>, <Opcode_2>

“term”             We use quotation marks to identify unfamiliar or configuration-specific
                   terms when they are first used. For example: “flat mapping”.


Feedback on this book
We have tried to ensure that the Cortex-A Series Programmer’s Guide is both easy to read, and
still covers the material in enough depth to provide the comprehensive introduction to using the
processors that we originally intended.
If you have any comments on this book, don’t understand our explanations, think something is
missing or could be better explained, or think that it is incorrect, send an e-mail to
errata@arm.com. Give:
•   the title: The Cortex-A Series Programmer’s Guide
•   the number, ARM DEN0013B
•   the page number(s) to which your comments apply
•   what you think needs to be changed.
ARM also welcomes general suggestions for additions and improvements.


Terms and abbreviations
Terms used in this document are defined here.

AAPCS                 ARM Architecture Procedure Call Standard.
ABI                   Application Binary Interface.
ACP                   Accelerator Coherency Port.
AHB                   Advanced High-Performance Bus.
AMBA®                 Advanced Microcontroller Bus Architecture.
AMP                   Asymmetric Multi-Processing.
APB                   Advanced Peripheral Bus.
ARM ARM               The ARM Architecture Reference Manual.
ASIC                  Application Specific Integrated Circuit.
APSR                  Application Program Status Register.
ASID                  Address Space ID.
ATPCS                 ARM Thumb® Procedure Call Standard.
AXI                   Advanced eXtensible Interface.
BE8                   Byte Invariant Big-Endian Mode.
BSP                   Board Support Package.
BTAC                  Branch Target Address Cache.
BTB                   Branch Target Buffer.
CISC                  Complex Instruction Set Computer.
CP15                  Coprocessor 15, the System control coprocessor.
CPSR                  Current Program Status Register.
DAP                   Debug Access Port.
DBX                   Direct Bytecode Execution.
DDR                   Double Data Rate (SDRAM).
DMA                   Direct Memory Access.
DMB                   Data Memory Barrier.
DS-5™                 The ARM Development Studio.
DSB                   Data Synchronization Barrier.
DSP                   Digital Signal Processing.
DSTREAM®              An ARM debug and trace unit.
DVFS                  Dynamic Voltage/Frequency Scaling.
EABI                  Embedded ABI.
ECC                   Error Correcting Code.
ECT                   Embedded Cross Trigger.
ETB                   Embedded Trace Buffer™.
ETM                   Embedded Trace Macrocell™.
FIQ                   An interrupt type (formerly fast interrupt).
FPSCR                 Floating-Point Status and Control Register.
GCC                   GNU Compiler Collection.
GIC                   Generic Interrupt Controller.
GIF                   Graphics Interchange Format.
GPIO                  General Purpose Input/Output.
Gprof                 GNU profiler.
Harvard architecture  Architecture with physically separate storage and signal pathways
                      for instructions and data.
IDE                   Integrated development environment.
IPA                   Intermediate Physical Address.
IRQ                   Interrupt Request (normally external interrupts).
ISA                   Instruction Set Architecture.
ISB                   Instruction Synchronization Barrier.
ISR                   Interrupt Service Routine.
Jazelle™              The ARM bytecode acceleration technology.
JIT                   Just In Time.
L1/L2                 Level 1/Level 2.
LPAE                  Large Physical Address Extension.
LSB                   Least Significant Bit.
MESI                  A cache coherency protocol with four states: Modified, Exclusive,
                      Shared and Invalid.
MMU                   Memory Management Unit.
MPU                   Memory Protection Unit.
MSB                   Most Significant Bit.
NEON™                 The ARM Advanced SIMD Extensions.
NMI                   Non-Maskable Interrupt.
Oprofile              A Linux system profiler.
QEMU                  A processor emulator.
PCI                   Peripheral Component Interconnect. A computer bus standard.
PIPT                  Physically Indexed, Physically Tagged.
PLE                   Preload Engine.
PMU                   Performance Monitor Unit.
PoC                   Point of Coherency.
PoU                   Point of Unification.
PPI                   Private Peripheral Interrupt.
PSR                   Program Status Register.
PTE                   Page Table Entry.
RCT                   Runtime Compilation Target.
RISC                  Reduced Instruction Set Computer.
RVCT                  RealView® Compilation Tools (the “ARM Compiler”).
SCU                   Snoop Control Unit.
SGI                   Software Generated Interrupt.
SIMD                  Single Instruction, Multiple Data.
SiP                   System in Package.
SMP                   Symmetric Multi-Processing.
SoC                   System on Chip.
SP                    Stack Pointer.
SPI                   Shared Peripheral Interrupt.
SPSR                  Saved Program Status Register.
Streamline            A graphical performance analysis tool.
SVC                   Supervisor Call. (Previously SWI.)
SWI                   Software Interrupt.
SYS                   System Mode.
TAP                   Test Access Port (JTAG Interface).
TCM                   Tightly Coupled Memory.
TDMI®                 Thumb, Debug, Multiplier, ICE.
TEX                   Type Extension.
Thumb®                An instruction set extension to ARM.
Thumb-2               A technology extending the Thumb instruction set to support both
                      16-bit and 32-bit instructions.
TLB                   Translation Lookaside Buffer.
TLS                   Thread Local Storage.
TrustZone             The ARM security extension.
TTB                   Translation Table Base.
UAL                   Unified Assembly Language.
UART                  Universal Asynchronous Receiver/Transmitter.
UEFI                  Unified Extensible Firmware Interface.
U-Boot                A Linux bootloader.
USR                   User mode, a non-privileged processor mode.
VFP                   The ARM floating-point instruction set. Before ARMv7, the VFP
                      extension was called the Vector Floating-Point architecture, and
                      was used for vector operations.
VIC                   Vectored Interrupt Controller.
VIPT                  Virtually Indexed, Physically Tagged.
VMID                  Virtual Machine ID.
VMSA                  Virtual Memory Systems Architecture.
XN                    Execute Never.

Chapter 1
Introduction

ARM processors are everywhere. More than 10 billion ARM-based devices had been manufactured
by the end of 2008 and at the time of writing (early 2011), it is estimated that around one quarter
of electronic products contain one or more ARM processors. By the end of 2010 over 20 billion
ARM processors had been shipped. It is likely that readers of this book own products containing
ARM-based devices – a mobile phone, personal computer, television or car. It might come as a
surprise to programmers more used to the personal computer to learn that the x86 architecture
occupies a much smaller (but still highly lucrative) position in terms of total microprocessor
shipments, with around three billion devices.
The ARM architecture has advanced significantly since the first ARM1 silicon in 1985. The ARM
processor is not a single processor, but a whole family of processors which share common
instruction sets and programmer’s models and have some degree of backward compatibility.
The purpose of this book is to bring together information from a wide variety of sources to provide
a single guide for programmers who want to develop applications for the latest Cortex-A series of
processors. We will cover hardware concepts such as caches and Memory Management Units, but
only where this is valuable to the application writer. The book is intended to provide information
that will be useful to both assembly language and C programmers. We will look at how complex
operating systems, such as Linux, make use of ARM features, and how to take full advantage of
the many advanced capabilities of the ARM processor, in particular writing software for
multi-processing and using the SIMD capabilities of the device.
This is not an introductory level book. We assume knowledge of the C programming language and
microprocessors, but not any ARM-specific background. In the allotted space, we cannot hope to
cover every topic in detail. In some chapters, we suggest further reading (referring either to books
or websites) that can give a deeper level of background to the topic in hand, but in this book we will
focus on the ARM-specific detail. We do not assume the use of any particular tool chain. We
will mention both GNU and ARM tools in the course of the book. Let’s begin, however, with a
brief look at the history of ARM.


1.1 History
The first ARM processor was designed within Acorn Computers Limited by a team led by
Sophie Wilson and Steve Furber, with the first silicon (which worked first time!) produced in
April 1985. This ARM1 was quickly replaced by the ARM2 (which added multiplier hardware),
which was used in real systems, including Acorn’s Archimedes personal computer.
ARM Limited was formed in Cambridge, England in November 1990, as Advanced RISC
Machines Ltd. It was a joint venture between Apple Computers, Acorn Computers and VLSI
Technology and has outlived two of its parents. The original 12 employees came mainly from
the team within Acorn Computers. One reason for spinning ARM off as a separate company was
that the processor had been selected by Apple Computers for use in its Newton product.
The new company quickly decided that the best way forward for their technology was to license
their Intellectual Property (IP). Instead of designing, manufacturing and selling the chips
themselves, they would sell rights to their designs to semiconductor companies. These
companies would design the ARM processor into their own products, in a partnership model.
This IP licensing business is how ARM continues to operate today. ARM was quickly able to
sign up licensees with Sharp, Texas Instruments and Samsung among prominent early
customers. In 1998, ARM Holdings floated on the London Stock Exchange and Nasdaq. At the
time of writing, ARM has nearly 2000 employees and has expanded somewhat from its original
remit of processor design. ARM also licenses “Physical IP” – libraries of cells (NAND gates,
RAM and so forth), graphics and video accelerators and software development products such as
compilers, debuggers, boards and application software.


1.2 System-on-Chip (SoC)
Chip designers today can produce chips with many millions of transistors. Designing and
verifying such complex circuits has become an extremely difficult task. It is increasingly rare
for all of the parts of such systems to be designed by a single company. In response to this, ARM
Limited and other semiconductor IP companies design and verify components (so-called IP
blocks or processors). These blocks, which include microprocessors, DSPs, 3D graphics and
video controllers, along with many other functions, are licensed by semiconductor companies
who use them in their own designs.
The semiconductor companies take these blocks and integrate many other parts of a particular
system onto the chip, to form a System-on-Chip (SoC). The architects of such devices must
select the appropriate processor(s), memory controllers, on-chip memory, peripherals, bus
interconnect and other logic (perhaps including analog or radio frequency components), in order
to produce a system.
The term Application Specific Integrated Circuit (ASIC) is one that we will also use in the book.
This is an IC design that is specific to a particular application. An individual ASIC might well
contain an ARM processor, memory and so forth. Clearly there is a large overlap with devices
which can be termed SoCs. The term SoC usually refers to a device with a higher degree of
integration, including many of the parts of the system in a single device, possibly including
analog, mixed-signal or radio frequency circuits.
The large semiconductor companies investing tens of millions of dollars to create these devices
will typically also make a large investment in software to run on their platform. It would be
uncommon to produce a complex system with a powerful processor without at least having
ported one or more operating systems to it and written device drivers for peripherals.
Of course, powerful operating systems like Linux require significant amounts of memory to run,
more than is usually possible on a single silicon device. The term System-on-Chip is therefore
not always entirely accurate, as the device does not always contain the whole system.
Apart from the issue of silicon area, it is also often the case that many useful parts of a system
require specialist silicon manufacturing processes that preclude them from being placed on the
same die. An extension of the SoC that addresses this to some extent is the concept of
System-in-Package (SiP) that combines a number of individual chips within a single physical
package. Also widely seen is package-on-package stacking. The package used for the SoC chip
contains connections on both the bottom (for connection to a PCB) and top (for connection to a
separate package that might contain a flash memory or a large SDRAM device).
This book is not targeted at any particular SoC device and does not replace the documentation
for the individual product you intend to use for your application. It is important to be aware of,
and able to distinguish between, the specifications of the processor and the behavior (for
example, physical memory maps, peripherals and other features) that is specific to the device
you are using.


1.3 Embedded systems
An embedded system is conventionally defined as a piece of computer hardware running
software designed to perform a specific task. Examples of such systems might be TV set-top
boxes, smartcards, routers, disk drives, printers, automobile engine management systems, MP3
players or photocopiers. These contrast with what is generally considered as a computer system,
that is, one that runs a wide range of general purpose software and possesses input and output
devices such as a keyboard, and a graphical display of some kind.
This distinction is becoming increasingly blurred. Consider the cellular or mobile phone. A
basic model might just perform the task of making phone calls, but a smartphone can run a
complex operating system to which many thousands of applications are available for download.
Embedded systems can contain very simple 8-bit microprocessors, such as an Intel 8051 or a
PIC microcontroller, or some of the more complex 32- or 64-bit processors, such as the ARM
family that forms the subject matter of this book. They need some Random Access Memory
(RAM) and some form of Read Only Memory (ROM) or other non-volatile storage to hold the
program(s) to be executed by the system. Systems will almost always have additional
peripherals, relating to the actual function of the device – typically including Universal
Asynchronous Receiver/Transmitters (UARTs), interrupt controllers, timers, General Purpose
I/O (GPIO) signals, but also potentially quite complex blocks such as Digital Signal Processing
(DSP) or Direct Memory Access (DMA) controllers.
Software running on such systems is typically grouped into two separate parts, the Operating
System (OS) and applications that run on top of the OS. A wide range of operating systems are
in use, ranging from simple kernels, to complex Real-Time Operating Systems (RTOS), to
full-featured complex operating systems, of the kind that might be found on a desktop computer.
Microsoft Windows or Linux are familiar examples of the latter. In this book, we will
concentrate mainly on examples from Linux. The source code for Linux is readily available for
inspection by the reader and is likely to be familiar to many programmers. Nevertheless, lessons
learned from Linux are equally applicable to other operating systems.
Applications running in an embedded system take advantage of the services that the OS
provides, and so do not generally need to be aware of low-level details of the hardware
implementation, or to worry about interactions with other applications that are running on the
system at the same time.
There are many constraints on embedded systems that can make programming them rather more
difficult than writing an application for a general purpose processor.
Memory footprint
    In many systems, to minimize cost (and power), memory size can be limited. The
    programmer might be forced to consider the size of the program and how to
    reduce memory usage while it runs.

Real-time behavior
    A feature of many systems is that there are deadlines to respond to external
    events. This might be a “hard” requirement (a car braking system must respond
    within a certain time) or a “soft” requirement (audio processing must complete
    within a certain time-frame to avoid a poor user experience, but failure to do so
    under rare circumstances might not render the system worthless).

Power
    In many embedded systems the power source is a battery, and programmers and
    hardware designers must take great care to minimize the total energy usage of
    the system. This can be done, for example, by slowing the clock, reducing supply
    voltage or switching off the processor when there is no work to be done.

Cost
    Reducing the bill of materials can be a significant constraint on system design.

Time to market
    In competitive markets, the time to develop a working product can significantly
    impact the success of that product.


Chapter 2
The ARM Architecture

As described in Chapter 1 of this book, ARM does not manufacture silicon devices. Instead, ARM
creates microprocessor designs, which are licensed to semiconductor companies and OEMs, who
integrate them into System-on-Chip devices.
To ensure compatibility between implementations, ARM defines architecture specifications which
define how compliant products must behave. Processors implementing the ARM architecture
conform to a particular version of the architecture. There might be multiple processors with
different internal implementations and micro-architectures, different cycle timings and clock
speeds which conform to the same version of the architecture.
The programmer must distinguish between behaviors which are specific to the following:
Architecture        This defines behavior common to a set, or family, of processor designs
                    and is defined in the appropriate ARM Architecture Reference Manual
                    (ARM ARM). It covers instruction sets, registers, exception handling
                    and other programmer’s model features. The architecture defines
                    behavior that is visible to the programmer, for example, which registers
                    are available, and what individual assembly language instructions
                    actually do.

Micro-architecture  This defines how the visible behavior specified by the architecture is
                    implemented. This could include the number of pipeline stages, for
                    example. It can still have some programmer-visible effects, such as how
                    long a particular instruction takes to execute, or the number of stall
                    cycles after which the result is available.

Processor           A processor is an individual implementation of a micro-architecture. In
                    theory, there could be multiple processors which implement the same
                    micro-architecture, but in practice, each processor has unique
                    micro-architectural characteristics. A processor might be licensed and
                    manufactured by many companies. It might, therefore, have been
                    integrated into a wide range of different devices and systems, with a
                    correspondingly wide range of memory maps, peripherals, and other
                    implementation-specific features. Processors are documented in
                    Technical Reference Manuals, available on the ARM website.

Core                We use this term to refer to a separate logical execution unit inside a
                    multi-core processor.

Individual systems  A System-on-Chip (SoC) contains one or more processors and typically
                    also memory and peripherals. The device could be part of a system which
                    contains one or more additional processors, memory, and peripherals.
                    Documentation is available, not from ARM, but from the supplier of the
                    individual SoC or board.


2.1 Architecture versions
Periodically, new versions of the architecture are announced by ARM. These add new features
or make changes to existing behaviors. Such changes are typically backwards compatible,
meaning that user code which ran on older versions of the architecture will continue to run
correctly on new versions. Of course, code written to take advantage of new features will not
run on older processors that lack these features.
In all versions of the architecture, some system features and behaviors are left as
implementation-defined. For example, the architecture does not define cycle timings for
individual instructions or cache sizes. These are determined by the individual
micro-architecture.
Each architecture version might also define one or more optional extensions. These might or
might not be implemented in a particular processor. For example, in the ARMv7
architecture, the Advanced SIMD technology is available as an optional extension, and we
describe this at length in Chapter 19 Introducing NEON.
The ARMv7 architecture also has the concept of “profiles”. These are variants of the
architecture describing processors targeting different markets and usages.
The profiles are as follows:
A       The Application profile defines an architecture aimed at high performance
        processors, supporting a virtual memory system using a Memory Management
        Unit (MMU) and therefore capable of running complex operating systems.
        Support for the ARM and Thumb instruction sets is provided.

R       The Real-time profile defines an architecture aimed at systems that need
        deterministic timing and low interrupt latency, and which do not need support
        for a virtual memory system and MMU, but instead use a simpler Memory
        Protection Unit (MPU).

M       The Microcontroller profile defines an architecture aimed at lower
        cost/performance systems, where low-latency interrupt processing is vital. It
        uses a different exception handling model from the other profiles and supports
        only a variant of the Thumb instruction set.

Throughout this book, our focus will be on version 7 of the architecture (ARMv7), particularly
ARMv7-A, the Application profile. This is the newest version of the architecture at the time of
writing (2011). It is implemented by the latest high performance processors, such as the
Cortex-A5, Cortex-A8, Cortex-A9, and Cortex-A15 processors, and also by processors from
Marvell and Qualcomm, among others. We will, where appropriate, point out differences
between ARMv7 and older versions of the architecture.


2.2 Architecture history and extensions
In this section, we look briefly at the development of the architecture through previous versions.
Readers unfamiliar with the ARM architecture shouldn’t worry if parts of this description use
terms they don’t know, as we will describe all of these topics later in the text.
The ARM architecture changed relatively little from the first test silicon in the mid-1980s
through to the first ARM6 and ARM7 devices of the early 1990s. The first version of the
architecture was implemented only by the ARM1. Version 2 added multiply and
multiply-accumulate instructions and support for coprocessors, plus some further innovations.
These early processors supported only 26 bits of address space. Version 3 of the architecture
separated the program counter and program status registers and added several new modes,
enabling support for 32 bits of address space. Version 4 added support for halfword load and
store operations and an additional kernel-level privilege mode.
The ARMv4T architecture, which introduced the Thumb (16-bit) instruction set, was
implemented by the ARM7TDMI® and ARM9TDMI® processors, products which have shipped
in their billions. The ARMv5TE architecture added improvements for DSP-type operations,
saturated arithmetic, and ARM/Thumb interworking. ARMv6 made a number of
enhancements, including support for unaligned memory access, significant changes to the
memory architecture, multi-processor support, and some support for SIMD operations
operating on bytes and halfwords within the 32-bit general purpose registers. It also provided a
number of optional extensions, notably Thumb-2 and Security Extensions (TrustZone).
Thumb-2 extends Thumb to be a variable length (16-bit and 32-bit) instruction set. The
ARMv7-A architecture makes the Thumb-2 extensions mandatory and adds the Advanced
SIMD extensions (NEON), described in Chapter 19 and Chapter 20.
A brief note on the naming of processors might be useful for readers. For a number of years,
ARM adopted a sequential numbering system for processors with ARM9 following ARM8,
which came after ARM7. Various numbers and letters were appended to the base family to
denote different variants. For example, the ARM7TDMI processor has T for Thumb, D for
Debug, M for a fast multiplier and I for EmbeddedICE. For the ARMv7 architecture, ARM
Limited adopted the brand name Cortex for many of its processors, with a supplementary letter
indicating which of the three profiles (A, R, or M) the processor supports. Figure 2-1 on
page 2-5 shows how different versions of the architecture correspond to different processor
implementations. The figure is not comprehensive and does not include all architecture versions
or processor implementations.


[Figure 2-1 shows architecture versions and example processor implementations. Architecture
v4/v4T: ARM7TDMI, ARM920T, StrongARM. Architecture v5: ARM926EJ-S, ARM946E-S,
XScale. Architecture v6: ARM1136J-S, ARM1176JZ-S, ARM1156T2-S, plus ARMv6-M:
Cortex-M0. Architecture v7: ARMv7-A: Cortex-A5, Cortex-A8, Cortex-A9; ARMv7-R:
Cortex-R4; ARMv7-M: Cortex-M3; ARMv7E-M: Cortex-M4.]
Figure 2-1 Architecture and processors

In Figure 2-2, we show the development of the architecture over time, illustrating additions to
the architecture at each new version. Almost all architecture changes are backwards-compatible,
meaning unprivileged software written for the ARMv4T architecture can still be used on
ARMv7 processors.

[Figure 2-2 summarizes the additions made at each architecture version.
v4T: Thumb instruction set; halfword and signed halfword/byte support; System mode.
v5: improved ARM/Thumb interworking; CLZ; saturated arithmetic; DSP
multiply-accumulate instructions. Extensions: Jazelle (v5TEJ).
v6: SIMD instructions; multi-processing; v6 memory architecture; unaligned data support.
Extensions: Thumb-2 (v6T2), TrustZone (v6Z), Multiprocessor (v6K), Thumb only (v6-M).
v7: Thumb-2 technology; NEON; TrustZone. Profiles: v7-A (Applications) with NEON;
v7-R (Real-time) with hardware divide and NEON; v7-M (Microcontroller) with hardware
divide, Thumb only.]
Figure 2-2 Architecture history


Individual chapters of this book will cover these architecture topics in greater detail, but here
we will briefly introduce a number of architecture elements.
2.2.1 DSP multiply-accumulate and saturated arithmetic instructions
These instructions, added in the ARMv5TE architecture, improve the capability for digital
signal processing and multimedia software and are denoted by the letter E. The new instructions
provide many variations of signed multiply-accumulate, saturated add and subtract, and count
leading zeros and are present in all later versions of the architecture. In many cases, this made
it possible to remove a simple separate DSP from the system.

2.2.2 Jazelle
Jazelle-DBX (Direct Bytecode eXecution) enables a subset of Java bytecodes to be executed
directly within hardware as a third execution state (and instruction set). Support for this is
denoted by the J in the ARMv5TEJ architecture. Support for this state is mandatory from
ARMv6, although a specific ARM processor can optionally implement actual Jazelle hardware
acceleration, or handle the bytecodes through software emulation. The Cortex-A5, Cortex-A9,
and Cortex-A15 processors offer configurable support for Jazelle.
Jazelle-DBX is best suited to providing high performance Java in very memory limited systems
(for example, feature phone or low-cost embedded use). In today’s systems, it is mainly used for
backwards compatibility.

2.2.3 Thumb Execution Environment (ThumbEE)
This is also described as Jazelle-RCT (Runtime Compilation Target). It involves small changes
to the Thumb instruction set that make it a better target for code generated at runtime in
controlled environments (for example, by managed languages like Java, Dalvik, C#, Python or
Perl). The feature set includes automatic null pointer checks on loads and stores and instructions
to check array bounds, plus special instructions to call a handler. These are small sections of
critical code, used to implement a specific feature of a high level language. These changes come
from re-purposing a handful of opcodes.
ThumbEE is designed to be used by high-performance just-in-time or ahead-of-time compilers,
where it can reduce the code size of recompiled code. Compilation of managed code is outside
the scope of this document.

2.2.4 Thumb-2
Thumb-2 technology was added in ARMv6T2. This technology extended the original 16-bit
Thumb instruction set to support 32-bit instructions. The combined 16-bit and 32-bit Thumb
instruction set achieves similar code density to the original Thumb instruction set, but with
performance similar to the 32-bit ARM instruction set. The resulting Thumb instruction set
provides virtually all the features of the ARM instruction set, plus some additional capabilities.

2.2.5 Security Extensions (TrustZone)
The TrustZone extensions were added in ARMv6Z and are present in the ARMv7-A profile
covered in this book. TrustZone provides two virtual processors with rigorously enforced
hardware access control between the two. This means that the processor provides two “worlds”,
Secure and Normal, with each world operating independently of the other in a way which
prevents information leakage from the secure world to the non-secure and which stops
non-trusted code running in the secure world. This is described in more detail in Chapter 26
Security.


2.2.6 VFP
Before ARMv7, the VFP extension was called the Vector Floating-Point Architecture, and was
used for vector operations. VFP is an extension which implements single-precision and,
optionally, double-precision floating-point arithmetic, compliant with the ANSI/IEEE Standard
for Floating-Point Arithmetic (IEEE 754).

2.2.7 Advanced SIMD (NEON)
The ARM NEON technology provides an implementation of the Advanced SIMD instruction
set, with separate register files (shared with VFP). Some implementations have a separate
NEON pipeline back-end. It supports 8-, 16-, 32- and 64-bit integer and single-precision (32-bit)
floating-point data, which can be operated on as vectors in 64-bit and 128-bit registers.


2.3 Key points of the ARM Cortex-A series architecture
Here we summarize a number of key points common to the Cortex-A family of devices.

•   32-bit RISC processor, with 16 × 32-bit visible registers with mode-based register
    banking.
•   Modified Harvard architecture (separate, concurrent access to instructions and data).
•   Load/store architecture.
•   Thumb-2 technology as standard.
•   VFP and NEON options, which are expected to become standard in the general purpose
    applications processor space.
•   Backward compatibility with code from previous ARM processors.
•   Full 4GB virtual and physical address spaces, with no restrictions imposed by the
    architecture.
•   Efficient hardware page table walking for virtual to physical address translation.
•   Virtual memory with page sizes of 4KB, 64KB, 1MB and 16MB. Cacheability and access
    permissions can be set on a per-page basis.
•   Big-endian and little-endian support.
•   Unaligned access support for load/store instructions with 8-, 16- and 32-bit integer data
    sizes.
•   SMP support on MPCore™ variants, with full data coherency from the L1 cache level.
    Automatic cache and TLB maintenance propagation provides high efficiency SMP
    operation.
•   Physically Indexed, Physically Tagged (PIPT) data caches.


2.4 Processors and pipelines
In this section, we briefly look at some ARM processors and identify which processor
implements which architecture version. We then take a slightly more detailed look at some of
the individual processors which implement architecture version v7-A, which forms the main
focus of this book. Some terminology will be used in this chapter which may be unfamiliar to
the first-time user of ARM processors and which will not be explained until later in the book.
Table 2-1 indicates the architecture version implemented by a number of older ARM processors.
Table 2-1 Older ARM processors and architectures

Architecture version        Applications processor          Embedded processor
v4T                         ARM720T™, ARM920T™, ARM922T™    ARM7TDMI
v5TE                        -                               ARM946E-S™, ARM966E-S™, ARM968E-S
v5TEJ                       ARM926EJ-S™                     -
v6K                         ARM1136J(F)-S™, ARM11 MPCore    -
v6T2                        -                               ARM1156T2-S™
v6K + Security Extensions   ARM1176JZ(F)-S™                 -

Table 2-2 shows the Cortex family of processors.
Table 2-2 Cortex processors and architecture versions

v7-A (Applications)         v7-R (Real Time)    v6-M/v7-M (Microcontroller)
Cortex-A5 (Single/MP)       Cortex-R4           Cortex-M0 (ARMv6-M)
Cortex-A8                                       Cortex-M1™ (ARMv6-M)
Cortex-A9 (Single/MP)                           Cortex-M3™ (ARMv7-M)
Cortex-A15 (MP)                                 Cortex-M4(F) (ARMv7E-M)

In the next sections, we’ll take a closer look at each of the processors which implement the
ARMv7-A architecture.
2.4.1 The Cortex-A5 processor
The Cortex-A5 processor supports all ARMv7-A architectural features, including the TrustZone
Security Extensions and the NEON multimedia processing engine. It is extremely area and
power efficient, but has lower maximum performance than other Cortex-A series processors.
Both single and multi-core versions of the Cortex-A5 processor are available.


[Figure 2-3 shows the Cortex-A5 processor block diagram: the data processing unit (DPU),
prefetch unit and branch predictor (PFU), instruction and data micro-TLBs, main translation
lookaside buffer (TLB), instruction cache unit (ICU), data cache unit (DCU) with data store
buffer (STB), CP15, debug logic with APB and embedded trace macrocell (ETM) interfaces,
and the bus interface unit (BIU) with AXI interface.]

Figure 2-3 The Cortex-A5 processor

The Cortex-A5 processor shown in Figure 2-3 has a single-issue, 8-stage pipeline. It can
dual-issue branches in some circumstances and contains sophisticated branch prediction logic
to reduce penalties associated with pipeline refills. Both NEON and floating-point hardware
support are optional. The Cortex-A5 processor VFP implements VFPv4, which adds both the
half-precision extensions and the Fused Multiply Add instructions to the features of VFPv3.
Support for half-precision was optional in VFPv3. It supports the ARM and Thumb instruction
sets plus the Jazelle-DBX and Jazelle-RCT technology. The size of the level 1 instruction and
data caches is configurable (by the hardware implementer) from 4KB to 64KB.
2.4.2 The Cortex-A8 processor
The Cortex-A8 processor was the first to implement the ARMv7-A architecture. It is available
in a number of different devices, including the S5PC100 from Samsung, the OMAP3530 from
Texas Instruments and the i.MX515 from Freescale. A wide range of device performances are
available, with some giving clock speeds of more than 1GHz.
The Cortex-A8 processor has a considerably more complex micro-architecture compared with
previous ARM processors. Its integer processor has dual symmetric 13-stage instruction
pipelines, with in-order issue of instructions. The NEON pipeline has an additional 10 pipeline
stages, supporting both integer and floating-point 64/128-bit SIMD. VFPv3 floating-point is
supported, as is Jazelle-RCT.
Figure 2-4 on page 2-11 is a block diagram showing the internal structure of the Cortex-A8
processor, including the pipelines.


[Figure 2-4 shows the Cortex-A8 integer and NEON pipelines: instruction fetch and decode
feeding the integer ALU, MUL and load/store pipes and the architectural register file, with
branch mispredict and replay penalty paths; the NEON instruction decode, NEON register file
and NEON execution pipes (integer ALU, MUL and shift, FP ADD, FP MUL, IEEE FP and
load/store permute); the L1 and L2 cache interfaces and BIU pipeline to the L3 memory
system; and the Embedded Trace Macrocell with external trace port.]

Figure 2-4 The Cortex-A8 processor integer and NEON pipelines

The separate instruction and data level 1 caches are 16KB or 32KB in size. They are
supplemented by an integrated, unified level 2 cache, which can be up to 1MB in size, with a
16-word line length. The level 1 data cache and level 2 cache both have a 128-bit wide data
interface to the processor. The level 1 data cache is virtually indexed, but physically tagged,
while level 2 uses physical addresses for both index and tags. Data used by NEON is, by default,
not allocated to L1 (although NEON can read and write data that is already in the L1 data cache).
2.4.3 The Cortex-A9 processor
The Cortex-A9 MPCore processor and the Cortex-A9 uniprocessor provide higher performance
than the Cortex-A5 or Cortex-A8 processors, with clock speeds in excess of 1GHz and
performance of 2.5DMIPS/MHz. The ARM, Thumb, Thumb-2, TrustZone, Jazelle-RCT and
DBX technologies are all supported.
The level 1 cache system provides hardware support for cache coherency for between one and
four processors for multi-core software. A level 2 cache is optionally connected outside of the
processor. ARM supplies a level 2 cache controller (PL310/L2C-310) which supports caches of
up to 8MB in size. The processor also contains an integrated interrupt controller, an
implementation of the ARM Generic Interrupt Controller (GIC) architecture specification. This
can be configured to provide support for up to 224 interrupt sources.


[Figure 2-5 shows a Cortex-A9 single core: the instruction cache with branch prediction and
fast loop mode feeds a prediction queue and instruction queue into the dual-instruction decode
stage, register rename stage (with virtual-to-physical register pool) and 3+1 dispatch stage,
issuing out of order with speculation to the ALU/MUL, ALU, FPU/NEON and address
pipelines, with an out-of-order write-back stage; the memory system comprises the load-store
unit, micro-TLB, MMU, data cache and auto-prefetcher; a profiling monitor block, program
trace unit and CoreSight debug access port complete the design.]

Figure 2-5 Block diagram of Cortex-A9 single core

Devices containing the Cortex-A9 processor include nVidia’s dual-core Tegra-2, the
SPEAr1300 from ST and TI’s OMAP4 platform.
2.4.4 The Cortex-A15 processor
The Cortex-A15 MPCore processor is currently the highest performance available ARM
processor. It is application compatible with the other processors described in this book. The
Cortex-A15 MPCore processor introduces some new capabilities, including support for full
hardware virtualization and the Large Physical Address Extension (LPAE), which enables
addressing of up to 1TB of memory. In this book, we will describe the LPAE extension and
provide an introduction to virtualization, but as the Cortex-A15 MPCore processor will not be
encountered by most readers for some time, we do not provide detailed coverage throughout the
text.


[Figure 2-6 shows the Cortex-A15 MPCore block diagram: up to four integer CPUs, each with
virtualization support, 40-bit physical addressing, an FPU/NEON data engine and L1 caches
with ECC, connected through the Snoop Control Unit (SCU) and L2 cache (providing direct
cache transfers, snoop filtering, accelerator coherence, error correction and private
peripherals) to a 128-bit AMBA 4 advanced coherent bus interface, under the CoreSight
multicore debug and trace architecture and generic interrupt control and distribution.]

Figure 2-6 Cortex-A15 MPCore block diagram

Snoop Control Unit (SCU)
The SCU is responsible for managing the interconnect, arbitration,
communication, cache-to-cache and system memory transfers, cache coherence
and other capabilities for the processor.
Accelerator Coherence Port
This AMBA 4 AXI compatible slave interface on the SCU provides an
interconnect point for masters which need to be interfaced directly with the
Cortex-A15 processor.
Generic Interrupt Controller
This handles inter-processor communication and the routing and prioritization of
system interrupts. Supporting up to 224 independent interrupts, under software
control, each interrupt can be distributed across the processors, hardware
prioritized, and routed between the operating system and TrustZone software
management layer.
The Cortex-A15 MPCore processor has the following features:

•   an out-of-order superscalar pipeline

•   32KB L1 instruction and 32KB L1 data caches


•   tightly-coupled low-latency level 2 cache (up to 4MB in size)

•   improved floating-point and NEON media performance

•   full hardware virtualization

•   Large Physical Address Extension (LPAE) addressing up to 1TB of memory

•   error correction capability for fault-tolerance and soft-fault recovery

•   multicore 1-4X SMP within a single processor cluster

•   multiple coherent multi-core processor clusters through AMBA 4 technology

•   AMBA 4 Cache Coherent Interconnect (CCI), allowing full cache coherency between
    multiple Cortex-A15 MPCore processors.

2.4.5 Qualcomm Scorpion
ARM is not the only company which designs processors compliant with the ARMv7-A
instruction set architecture. In 2005, Qualcomm Inc. announced that it was creating its own
implementation under license from ARM, with the name Scorpion. The Scorpion processor is
available as part of Qualcomm’s Snapdragon platform, which contains the features necessary to
implement netbooks, smartphones or other mobile internet devices.
Relatively little information has been made publicly available by Qualcomm, although it has
been mentioned that Scorpion has a number of similarities with the Cortex-A8 processor. It is
an implementation of ARMv7-A, is superscalar and dual issue and has support for both VFP
and NEON (called the VeNum media processing engine in Qualcomm press releases). There are
a number of differences, however. Scorpion can process 128 bits of data in parallel in its NEON
implementation. Scorpion has a 13-stage load/store pipeline and two integer pipelines. One of
these is 10 stages long and can execute only simple arithmetic instructions (for example adds or
subtracts), while the other is 12 stages and can execute all data processing operations, including
multiplies. Scorpion also has a 23-stage floating-point/SIMD pipeline, and VFPv3 operations
are pipelined.
We will not specifically mention Scorpion again in this text. However, as the processor
conforms to the ARMv7-A architecture specification, most of the information presented here
will apply also to Scorpion.

2.4.6 Marvell Sheeva
Marvell is another company which designs and sells processors based on the ARM Architecture.
At the time of writing, Marvell has four families of ARM processors, the Armada 100, Armada
500, Armada 600, and Armada 1000. Marvell has designed a number of ARM processor
implementations, ranging from the Sheeva PJ1 (ARMv5 compatible) to Sheeva PJ4 (ARMv7
compatible). The latter is used in the Armada 500 and Armada 600 family devices.
The Marvell devices do not support the NEON SIMD instruction set, but instead use the
Wireless MMX2 technology, acquired from Intel. The Armada 510 contains 32KB I and D
caches plus an integrated 512KB level 2 cache and support for VFPv3. The Armada 610 is built
on a “low power” silicon process, has a smaller (256KB) level 2 cache and can be clocked
at a slightly slower rate than the Armada 510. We will not specifically mention these processors
again in this book.


Chapter 3
Tools, Operating Systems and Boards

ARM processors can be found in a very wide range of devices, running a correspondingly wide
range of software. Many readers will have ready access to appropriate hardware, tools and
operating systems, but before we proceed to look at the underlying architecture, it might be useful
to some readers to present an overview of some of these readily available compilation tools,
ARM-based hardware and Linux operating system distributions.
In this chapter, we will provide a brief mention of a number of interesting commercially available
development boards. We will provide some information about the Linux Operating System and
some useful associated tools. However, information about open source software and off-the-shelf
boards is likely to change rapidly.


Tools, Operating Systems and Boards

3.1 Linux distributions
Linux is a Unix-like operating system kernel, originally developed by Linus Torvalds, who
continues to maintain the official kernel. It is open source, distributed under the GNU General
Public License, widely used and available on a large number of different processor architectures.
A number of free Linux distributions exist for ARM processors, including Debian, Ubuntu,
Fedora and Gentoo.
You can obtain pre-built Linux images, or read the Linux on ARM Wiki at
http://www.linux-arm.org/.
In Appendix C, we will look at how to build Linux for your ARM device. Before doing that, we
will briefly look at the basics of Linux for ARM systems.

3.1.1 Linux for ARM systems
Support for the ARM architecture has been included in the standard Linux kernel for many
years. Development of this port is ongoing, with significant input from ARM to provide kernel
support for new processors and architecture versions. The ARM Embedded Linux distribution
only includes the kernel. The filesystem and U-Boot bootloader are available through Linaro.
It might seem strange to some readers that a book about the Cortex-A series of processors
contains information about Linux. There are several reasons for this. Linux source code is
available to all readers and represents a huge learning resource. In addition, it is easy to
program for, and there are many useful resources with existing code and explanations. Many
readers will be familiar with Linux, as it can be run on most processor architectures. By
explaining how Linux features like virtual memory, multi-tasking, shared libraries and so forth
are implemented in Linux for ARM systems, readers will be able to apply their understanding
to other operating systems commonly used on ARM processors. The scalability of Linux is
another factor – it can run on the most powerful ARM processors, and its derivative uClinux is
also commonly used on much smaller processors, including the Cortex-M3 or ARM7TDMI
processors. It can run on both the ARM and Thumb ISAs, in little- or big-endian and with or
without a memory management unit.
Linux makes large amounts of system and kernel information available to user applications by
using virtual filesystems. These virtual files mean that we don’t have to know how to program
the kernel to access many hardware features. An example is /proc/cpuinfo. Reading this file on
a Cortex-A8 processor might give an output like that in Example 3-1. This lets code determine
useful information about the system it is running on, without having to directly interact with the
hardware.
Example 3-1 Output of /proc/cpuinfo on the Cortex-A8 processor

Processor       : ARMv7 Processor rev 7 (v7l)
BogoMIPS        : 499.92
Features        : swp half thumb fastmult vfp edsp neon vfpv3
CPU implementer : 0x41
CPU architecture: 7
CPU variant     : 0x1
CPU part        : 0xc08
CPU revision    : 7
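A script can use the Features line to adapt to the processor it is running on. Below is a minimal sketch; because /proc/cpuinfo only exists on a running Linux system, it parses the sample Features line from Example 3-1 rather than reading the file (the grep shown in the comment is what you would use on a real target):

```shell
# Detect NEON support from the cpuinfo Features line.
# On a real ARM Linux target you would obtain the line with:
#   features=$(grep '^Features' /proc/cpuinfo)
# Here we use the sample line from Example 3-1.
features="swp half thumb fastmult vfp edsp neon vfpv3"

case " $features " in
  *" neon "*) echo "NEON available" ;;      # prints "NEON available" for this sample
  *)          echo "NEON not available" ;;
esac
```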


In this book, we can merely scratch the surface of what there is to be said about Linux
development. What we hope to do here is to show some ways in which programming for an
embedded ARM architecture based system differs from a desktop x86 environment and to give
some pointers to useful tools, which the reader might care to investigate further.
3.1.2 Linaro
Linaro is a non-profit organization which works on a range of open source software running on
ARM processors, including kernel-related tools, software and middleware. It is a
collaborative effort between a number of technology companies to provide engineering help and
resources to the open source community. Linaro does not produce a Linux distribution, nor is it
tied to any particular distribution or board. Instead, Linaro works to produce software and tools
which interact directly with the ARM processor, to provide a common software platform for use
by board support package developers. Its focus is on tools to help you write and debug code, on
low-level software which interacts with the underlying hardware and on key pieces of
middleware. Linaro engineers work on the kernel and tools, graphics and multimedia and power
management. Linaro provides patches to upstream projects and makes monthly source tree
tarballs available, with an integrated build every six months to consolidate the work.
See http://www.linaro.org/ for more information about Linaro.

3.1.3 Linux terminology
Here, we define some terms which we will use when describing how the Linux kernel interacts
with the underlying ARM architecture:
Process

A process is the kernel's view of an executing unprivileged application. The same
application (for example, /bin/bash) can be running in several simultaneous
instances in the system – and each of these instances will be a separate process.
The process has resources associated with it, such as a memory map and file
descriptors. A process can consist of one or more threads.

Thread

A thread is a context of software execution within a process. It is the entity which
is scheduled by the kernel, and actually executes the instructions that make up the
application. A process can consist of multiple threads, each executing with their
own program counter, stack pointer and register set – all existing within the
memory map and operating on the file descriptors held by the process as a whole.
In a multi-processor system, threads inside the same process can execute
concurrently on separate processors. Different threads within the same process
can be configured to have different scheduling priorities.
There are also threads executing inside the kernel, to manage various tasks
asynchronously, such as file cache management, or watchdog tickling (which is
not as exciting as it sounds).

Scheduler

This is a vital part of the kernel which has a list of all the current threads. It knows
which threads are ready to be run and which are currently not able to run. It
dynamically calculates priority levels for each thread and schedules the highest
priority thread to be run next. It is called after an interrupt has been handled. The
scheduler is also explicitly called by the kernel via the schedule() function, for
example, when an application executing a system call needs to sleep. The system
will have a timer based interrupt which results in the scheduler being called at
regular intervals. This enables the OS to implement time-division multiplexing,
where many threads share the processor, each running for a certain amount of
time, giving the user the illusion that many applications are running
simultaneously.


System calls

Linux applications run in user (unprivileged) mode. Many parts of the system are
not directly accessible in User mode. For example, the kernel might prevent User
mode programs from accessing peripherals, kernel memory space and the
memory space of other User mode programs. Access to some features of the
system control coprocessor (CP15) is not permitted in User mode. The kernel
provides an interface (via the SVC instruction) which permits an application to call
kernel services. Execution is transferred to the kernel through the SVC exception
handler, which returns to the user application when the system call is complete.

Libraries

Linux applications are, with very few exceptions, not loaded as complete
pre-built binaries. Instead, the application relies on external support code linked
from files called shared libraries. This has the advantage of saving memory space,
in that the library only needs to be loaded into RAM once and is more likely to be
in the cache as it can be used by other applications. Also, updates to the library
do not require every application to be rebuilt. However, this dynamic loading
means that the library code must not rely on being in a particular location in
memory.

Files

These are essentially blocks of data which are referred to using a pathname
attached to them. Device nodes have pathnames like files, but instead of being
linked to blocks of data, they are linked to device drivers which handle real I/O
devices like an LCD display, disk drive or mouse. When an application opens,
reads from or writes to a device, control is passed to specific routines in the kernel
that handle that device.
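The shared libraries described above can be inspected with the ldd tool, which lists the libraries the dynamic loader would map in for a given binary. A quick sketch (the exact libraries listed vary from system to system):

```shell
# List the shared libraries that /bin/sh depends on.
# On a glibc-based distribution the output includes entries such as
# libc.so.6 and the dynamic loader itself.
ldd /bin/sh
```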

3.1.4 Embedded Linux
Linux-based systems are used all the way from servers via the desktop, through mobile devices
down to high-performance micro-controllers in the form of uClinux for processors lacking an
MMU. However, while the kernel source code base is the same, different priorities and
constraints mean that there can be some fundamental differences between the Linux running on
your desktop and the one running in your set-top box, as well as between the development
methodologies used.
In a desktop system, a form of bootloader executes from ROM – be it BIOS or UEFI. This has
support for mass-storage devices and can then load a second-stage loader (for example GRUB)
from a CD, a hard drive or even a USB memory stick. From this point on, everything is loaded
from a general-purpose mass storage device.
In an embedded device, the initial bootloader is likely to load a kernel directly from on-board
flash into RAM and execute it. In severely memory constrained systems, it might have a kernel
built to “execute in place” (XiP), where all of the read-only portions of the kernel remain in
ROM, and only the writable portions use RAM. Unless the system has a hard drive (or for fault
tolerance reasons), the root filesystem on the device is likely to be located in flash. This can be
a read-only filesystem, with portions that need to be writable overlaid by tmpfs mounts, or it can
be a read-write filesystem. In both cases, the storage space available is likely to be significantly
less than in a typical desktop computer. For this reason, they might use software components
such as uClibc and BusyBox to reduce the overall storage space required for the base system. A
general desktop Linux distribution is usually supplied preinstalled with a lot of software that you
might find useful at some point. In a system with limited storage space, this is not really optimal.
Instead, you want to be able to select exactly the components you need to achieve what you want
with your system. Various specific embedded Linux distributions exist to make this easier.
In addition, embedded systems often have lower performance than general purpose computers.
In this situation, development can be significantly sped up by compiling software for the
target device on a faster desktop computer and then moving it across – a process known as
cross-compiling.


3.1.5 Board Support Package
Getting Linux to run on a particular platform requires a Board Support Package (BSP). We can
divide the platform-specific code into a number of areas:

•   Architecture-specific code. This is found in the arch/arm/ directory and forms part of the
    kernel porting effort carried out by the ARM Linux maintainers.

•   Processor-specific code. This is found in arch/arm/mm/ and arch/arm/include/asm/. This
    takes care of MMU and cache functions (page table setup, TLB and cache invalidation,
    memory barriers, etc.). On SMP processors, spinlock code will be enabled.

•   Generic device drivers. These are found under drivers/.

•   Platform-specific code. This is placed in arch/arm/mach-*/, and is the code most
    likely to be altered by people porting to a new board containing a processor with existing
    Linux support. The code will define the physical memory map, interrupt numbers,
    location of devices and any initialization code specific to that board.


3.2 Useful tools
Let’s take a brief look at some available tools which can be useful to developers of ARM
architecture based Linux systems. These are all extensively documented elsewhere. In this
section, we merely point out that these tools can be useful, and provide short descriptions of
their purpose and function.

3.2.1 QEMU
QEMU is a fast, open source machine emulator. It was originally developed by Fabrice Bellard
and is available for a number of architectures, including ARM. It can run operating systems and
applications made for one machine (for example, an ARM processor) on a different machine,
such as a PC or Mac. It uses dynamic translation of instructions and can achieve useful levels
of performance, enabling it to boot complex operating systems like Linux, without the need for
any target hardware.

3.2.2 BusyBox
BusyBox is a piece of software which provides many standard Unix tools, in a very small
executable, which is ideal for many embedded systems and could be considered to be a de facto
standard. It includes most of the Unix tools which can be found in the GNU Core Utilities, with
less commonly used command switches removed, and many other useful tools including init,
dhclient, wget and tftp.
BusyBox calls itself the “Swiss Army Knife of Embedded Linux” – a reference to the large
number of tools packed into a small package. BusyBox is a single binary executable which
combines many applications. This reduces the overheads introduced by the executable file
format and enables code to be shared between multiple applications without needing to be part
of a library.
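The single-binary approach can be illustrated with a small shell sketch: one executable examines the name it was invoked by (as BusyBox does with argv[0] in C) and behaves as the corresponding applet. The /tmp/multitool path and the applet names here are invented for the illustration:

```shell
# One "binary" that dispatches on its invocation name,
# mimicking BusyBox's argv[0] trick.
cat > /tmp/multitool <<'EOF'
#!/bin/sh
case "$(basename "$0")" in
  hello) echo "hello world" ;;
  upper) tr 'a-z' 'A-Z' ;;
  *)     echo "unknown applet: $0" ;;
esac
EOF
chmod +x /tmp/multitool

# Install "applets" as symbolic links to the one executable.
ln -sf /tmp/multitool /tmp/hello
ln -sf /tmp/multitool /tmp/upper

/tmp/hello               # prints "hello world"
echo abc | /tmp/upper    # prints "ABC"
```

Real BusyBox works the same way: each applet name is a link to the single busybox executable, which dispatches internally.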

3.2.3 Scratchbox
If your development experience has been limited to writing code for personal computers, you
may not be familiar with cross-compiling. The general principle is to use one system (the host)
to compile software which runs on some other system (the target).
The target is a different architecture to the host and so the host cannot natively run the resulting
image. For example, you might have a powerful desktop x86 machine and want to develop code
for a small battery-powered ARM based device which has no keyboard. Using the desktop
machine will make code development simpler and compilation faster. There are some
difficulties with this process. Some build environments will try to run programs during
compilation, and of course this is not possible. In addition, tools which, during the build process,
try to discover information about the machine (for software portability reasons) do not work
correctly when cross-compiling.
Scratchbox is a cross-compilation toolkit which solves these problems and gives the necessary
tools to cross-compile a complete Linux distribution. It can use either QEMU or a target board
to execute the cross-compiled binaries it produces.

3.2.4 U-Boot
“Das U-Boot” (the Universal Bootloader) is a bootloader that can easily be ported to new
processors or boards. It provides serial console output, which makes it easy to debug,
and is designed to be small and reliable. In an x86 system, we have BIOS code which initializes
the processor and system and then loads an intermediate loader such as GRUB or syslinux,
which then in turn loads and starts the kernel. U-Boot essentially covers both functions.


3.2.5 UEFI and Tianocore
The Unified Extensible Firmware Interface (UEFI) is the specification of an interface to
hand off control of a system from the pre-boot environment to an operating system, such as
Windows or Linux. Its modular design permits flexibility in the functionality provided in the
pre-boot environment and eases porting to new hardware. The UEFI forum is a non-profit
collaborative trade organization formed to promote and manage the UEFI standard.
UEFI is processor architecture independent and the Tianocore EFI Development Kit 2 (EDK2)
is available under a BSD license. It contains UEFI support for ARM platforms, including ARM
Versatile Express boards and the BeagleBoard (see BeagleBoard on page 3-13).
See http://www.uefi.org and http://sourceforge.net/apps/mediawiki/tianocore for more
information.


3.3 Software toolchains for ARM processors
There are a wide variety of compilation and debug tools available for ARM processors. In this
section, we will focus on two toolchains, the GNU toolchain which includes the GNU Compiler
(gcc), and the ARM Compiler toolchain which includes the armcc compiler tool.
Figure 3-1 shows how the various components of a software toolchain interact to produce an
executable image.

[Figure 3-1 shows the toolchain flow: C files (.c) pass through the C compiler (gcc or armcc)
and assembly files (.s) through the assembler (gas or armasm) to produce object files (.o),
which the linker combines with libraries, under the control of a linker script or scatter file,
to produce an executable image.]

Figure 3-1 Using a software toolchain to produce an image

3.3.1 GNU toolchain
The GNU toolchain is a collection of programming tools from the GNU project used both to
develop the Linux kernel and to develop applications (and indeed other operating systems). Like
Linux, the GNU tools are available on a large number of processor architectures and are actively
developed to make use of the latest features incorporated in ARM processors.
The toolchain includes the following components:

•   GNU make

•   GNU Compiler Collection (GCC)

•   GNU binutils (linker, assembler (gas), etc.)

•   GNU Debugger (GDB)

•   GNU build system (autotools)

•   GNU C library (glibc or eglibc).


Although glibc is available on all GNU/Linux host systems, provides portability and wide
compliance with standards, and is performance optimized, it is quite large for some embedded
systems (approaching 2MB in size). Other libraries may be preferred in smaller systems. For
example, uClibc provides most features, is around 400KB in size, and produces significantly
smaller application binaries.
Prebuilt versions of GNU toolchains
If you are using a complete Linux distribution on your target platform, and you are not
cross-compiling, you can install the toolchain packages using the standard package manager.
For example, on a Debian-based distribution such as Ubuntu you can use the command:
sudo apt-get install gcc g++

Additional required packages such as binutils will also be pulled in by this command, or you
can add them explicitly on the command line. In fact, if g++ is specified this way, gcc is
automatically pulled in. This toolchain will then be accessible in the way you would expect in
a normal Linux system, by just calling gcc, g++, as, or similar.
If you are cross-compiling, you will need to install a suitable cross-compilation toolchain. The
cross-compilation toolchain consists not only of the GNU Compiler Collection (GCC) but also the
GNU C library (glibc), which is necessary for building applications (but not the kernel).
Ubuntu distributions from Maverick (10.10) onwards include specific packages for this. These
can be installed using the command:
sudo apt-get install g++-arm-linux-gnueabi

The resulting toolchain will be able to build Linux kernels, applications and libraries for the
same Ubuntu version that is used on the target platform. It will, however, have a prefix added to
all of the individual tool commands in order to avoid problems distinguishing it from the native
tools for the workstation. For example, the cross-compiling gcc will be accessible as
arm-linux-gnueabi-gcc.
If your workstation uses an older Ubuntu distribution, an alternative Linux distribution or even
Windows, another toolchain must be used. CodeSourcery provide pre-built toolchains for both
Linux and Windows from http://www.codesourcery.com. The GNU/Linux version of this
toolchain can be used to build the Linux kernel. It can also build applications and libraries,
providing that the basic C library used on the target is compatible with the one used by the
toolchain. As with the Ubuntu toolchain, a prefix is added to the tool commands. For the
CodeSourcery GNU/Linux toolchain, the prefix is arm-none-linux-gnueabi, so the C compiler
is called arm-none-linux-gnueabi-gcc.
Source code distributions of cross-compilation toolchains can also be downloaded from
http://www.linaro.org.
3.3.2 ARM Compiler toolchain
The ARM Compiler toolchain can be used to build programs from C, C++, or ARM assembly
language source. It generates optimized code for the 32-bit ARM and variable length (16-bit and
32-bit) Thumb instruction sets, and supports full ISO standard C and C++. It also supports the
NEON SIMD instruction set with the vectorizing NEON compiler.
The ARM Compiler toolchain comprises the following components:
armcc

The ARM and Thumb compiler. This compiles your C and C++ code. It supports
inline and embedded assemblers, and also includes the NEON vectorizing
compiler, invoked using the command:
armcc --vectorize

armasm

The ARM and Thumb assembler. This assembles ARM and Thumb assembly
language sources.

armlink

The linker. This combines the contents of one or more object files with selected
parts of one or more object libraries to produce an executable program.

armar

The librarian. This enables sets of ELF format object files to be collected together
and maintained in libraries. You can pass such a library to the linker in place of
several ELF files. You can also use the library for distribution to a third party for
further application development.

fromelf

The image conversion utility. This can also generate textual information about the
input image, such as disassembly and its code and data size.

C libraries

The ARM C libraries provide:
•	an implementation of the library features as defined in the C and C++ standards
•	extensions specific to the ARM Compiler, such as _fisatty(), __heapstats(), and __heapvalid()
•	GNU extensions
•	common nonstandard extensions to many C libraries
•	POSIX extended functionality
•	functions standardized by POSIX.

C++ libraries
The ARM C++ libraries provide:
•	helper functions when compiling C++
•	additional C++ functions not supported by the Rogue Wave library.

Rogue Wave C++ libraries
The Rogue Wave library provides an implementation of the standard C++ library.

3.4 ARM DS-5
ARM DS-5 is a professional software development solution for Linux, Android and bare-metal
embedded systems based on ARM-based hardware platforms. DS-5 covers all the stages in
development, from boot code and kernel porting to application debug. See
http://www.arm.com/products/tools/software-tools/ds-5/index.php.
ARM DS-5 features an application and kernel space graphical debugger with trace, system-wide
performance analyzer, real-time system simulator, and compiler. These features are included in
an Eclipse-based IDE.

Figure 3-2 DS-5 Debugger

A full list of the hardware platforms that are supported by DS-5 is available from
http://www.arm.com/products/tools/software-tools/ds-5/supported-platforms.php.

ARM DS-5 includes the following components:

•	Eclipse-based IDE combines software development with the compilation technology of the DS-5 tools. Tools include a powerful C/C++ editor, project manager and integrated productivity utilities such as the Remote System Explorer (RSE), SSH and Telnet terminals.
•	DS-5 Compilation Tools. Both GCC and the ARM Compiler are provided. See ARM Compiler toolchain on page 3-9 for more information about the ARM Compiler.
•	Real-time simulation model of a complete ARM Cortex-A8 processor-based device and several Linux-based example projects that can run on this model. Typical simulation speeds are above 250 MHz.

•	The DS-5 Debugger, shown in Figure 3-2 on page 3-11, together with a supported debug target, enables debugging of kernel space and application programs, with complete control over the flow of program execution to quickly isolate and correct errors. It provides comprehensive and intuitive views, including synchronized source and disassembly, call stack, memory, registers, expressions, variables, threads, breakpoints, and trace.
•	DS-5 Streamline, a system-wide software profiling and performance analysis tool for ARM-based Linux and Android platforms. DS-5 Streamline supports SMP configurations, native Android applications and libraries. Streamline requires only a standard TCP/IP network connection to the target to acquire and analyze system-wide performance data from Linux and Android systems, making it an affordable way to optimize software from the early stages of the development cycle. See DS-5 Streamline on page 16-4 for more information.

3.5 Example platforms
In this section we’ll mention a few widely available, off-the-shelf ARM-based platforms which
are suitable for use by students or hobbyists for Linux development. This list is likely to become
outdated quickly, as newer and better boards are frequently announced.

3.5.1 BeagleBoard
The BeagleBoard is a readily available, inexpensive board which provides performance levels
similar to that of a laptop from a single fan-less board, powered through a USB connection. It
contains the OMAP 3530 device from Texas Instruments, which includes a Cortex-A8
processor with a 256KB level 2 cache, clocked at 720MHz. The board provides a wide range of
connection options, including DVI-D for monitors, S-Video for televisions, stereo audio and
compatibility with a wide range of USB devices, while code and data can be provided through
an MMC+/SD interface. It is highly extensible and the design information is freely available. It
is intended for use by the Open Source community and not to form a part of any commercial
product.

3.5.2 Pandora
The Pandora device also uses OMAP3530 (a Cortex-A8 processor clocked at 600MHz). It has
controls typically found on a gaming console and in fact, looks like a typical handheld gaming
device, with an 800x480 LCD.

3.5.3 nVidia Tegra 200 series developer board
This board is intended for smartbook or netbook development and contains nVidia’s Tegra2
high-performance dual-core implementation of a Cortex-A9 processor running at 1GHz, along
with 1GB DDR2 and a wide range of standard laptop peripherals. It is a small 10cm square
board that includes 2x mini-PCI-E slots, onboard Ethernet, 3xUSB, SDcard, HDMI and analog
VGA. nVidia provides BSP support for WindowsCE, Android and Linux. The performance
exceeds many low-cost x86 platforms, at much lower power.

3.5.4 ST Ericsson STE MOP500
This has a dual-core ARM Cortex-A9 processor, based on the U8500 chip design with 256MB
of memory and the Mali-400 GPU.

3.5.5 Gumstix
This derives its name from the fact that the board is the same size as a stick of chewing gum.
The Gumstix Overo uses the OMAP3503 device from TI, containing a Cortex-A8 processor
clocked at 600MHz and runs Linux 2.6 with the BusyBox utilities and OpenEmbedded build
environment.

3.5.6 PandaBoard
PandaBoard is a single-board computer based on the Texas Instruments OMAP4430 device,
including a dual-core 1GHz ARM Cortex-A9 processor, a 3D Accelerator video processor and
1GB of DDR2 RAM. Its features include Ethernet and Bluetooth, plus DVI and HDMI interfaces.


Chapter 4
ARM Registers, Modes and Instruction Sets

In this chapter, we will introduce the fundamental features of ARM processors, including details of
registers, modes and instruction sets. We will also touch on some details of processor
implementation features including instruction pipelines and branch prediction.
ARM is a 32-bit processor architecture. It is a load/store architecture, meaning that data-processing
instructions operate on values in registers rather than external memory. Only load and store
instructions access memory. Internal registers are also 32 bits. Throughout the book, when we refer
to a word, we mean 32 bits. A doubleword is therefore 64 bits and a halfword is 16 bits wide.
Individual processor implementations do not necessarily have 32-bit width for all blocks and
interconnections. For example, we might have 64-bit wide paths for instruction fetches or for data
load and store operations.
Processors which implement the ARMv7-A architecture do not have a memory map which is fixed
by the architecture. The processor has access to a 4GB address space addressed as bytes and
memory and peripherals can be mapped freely within that space. We will describe memory further,
in Chapter 7 and Chapter 8, when we look at the caches and Memory Management Unit (MMU).

4.1 Instruction sets
Historically, most ARM processors support more than one instruction set:
•	ARM – a full 32-bit instruction set
•	Thumb – a 16-bit compressed subset of the full ARM instruction set, with better code density (but reduced performance compared with ARM code).

The processor can switch back and forth between these two instruction sets, under program
control.
Newer ARM processors, such as the Cortex-A series covered in this book, implement Thumb-2
technology, which extends the Thumb instruction set. This gives a mixture of 32-bit and 16-bit
instructions which gives approximately the code density of the original Thumb instruction set
with the performance of the original ARM instruction set. For this reason, most code developed
for Cortex-A series processors will use the Thumb instruction set.

4.2 Modes
The ARM architecture has seven processor modes. There are six privileged modes and a
non-privileged User mode. In this latter mode, there are limitations on certain operations, such
as MMU access. Table 4-1 summarizes the available modes. Note that modes are associated
with exception events, which are described further in Chapter 10 Exception Handling.
Table 4-1 ARM processor modes

Mode               Mode encoding in the PSRs   Function
Supervisor (SVC)   10011                       Entered on reset or when a Supervisor Call instruction (SVC) is executed
FIQ                10001                       Entered on a fast interrupt exception
IRQ                10010                       Entered on a normal interrupt exception
Abort (ABT)        10111                       Entered on a memory access violation
Undef (UND)        11011                       Entered when an undefined instruction is executed
System (SYS)       11111                       Privileged mode, which uses the same registers as User mode
User (USR)         10000                       Unprivileged mode in which most applications run

There is an extra mode (Secure Monitor), which we will describe when we look at the ARM
Security Extensions, in Chapter 26.

4.3 Registers
The ARM architecture has a number of registers, as shown in Figure 4-1.

[Diagram: R0-R12, R13 (SP), R14 (LR), R15 (PC) and the CPSR, with banked copies of R13, R14 and an SPSR for each of the FIQ, IRQ, ABT, SVC, UND and MON modes. FIQ mode additionally banks R8-R12, so it shares only R0-R7, R15 and the CPSR with User mode; the other exception modes share R0-R12, R15 and the CPSR.]

Figure 4-1 The ARM register set

There are a number of general purpose registers. In addition, there is R15, the program counter,
and six program status registers, which contain flags, modes etc. Many of these registers are
banked and not visible to the processor except in specific processor modes. These banked-out
registers are automatically switched in and out when a different processor mode is entered.
So, for example, if the processor is in IRQ mode, we can see R0, R1 … R12 (the same registers
we can see in User mode), plus R13_IRQ and R14_IRQ (registers visible only while we are in
IRQ mode) and R15 (the program counter, PC). R13_USR and R14_USR are not directly
visible. We do not normally need to specify the mode in the register name in the way we have
just done. If we (for example) refer to R13 in a line of code, the processor will access the R13
register of the mode we are currently in.
At any given moment, the programmer has access to 16 registers (R0-R15) and the Current
Program Status Register (CPSR). R15 is hard-wired to be the program counter and holds the
current program address (actually, it always points eight bytes ahead of the instruction that is
executing in ARM state and four bytes ahead of the current instruction in Thumb state). We can
write to R15 to change the flow of the program. R14 is the link register, which holds a return
address for a function or exception (although it can occasionally be used as a general purpose
register when not holding either of these values). R13, by convention is used as a stack pointer.
R0-R12 are general purpose registers. Some 16-bit Thumb instructions have limitations on
which registers they can access – the accessible subset is called the low registers and comprises
R0-R7. Figure 4-2 on page 4-5 shows the subset of registers visible to general data processing
instructions.
[Diagram: R0-R7 are the Low Registers and R8-R12 the High Registers; together they form the general purpose registers. These are followed by R13 (SP), the stack pointer; R14 (LR), the link register; R15 (PC), the program counter; and the CPSR, the Current Program Status Register.]

Figure 4-2 Programmer visible registers for user code

The reset values of R0-R14 are unpredictable. R13, the stack pointer, must be initialized (for
each mode) by boot code before software makes use of the stack. The AAPCS/AEABI (see
Chapter 15 Application Binary Interfaces) specifies how software should use the general
purpose registers in order to interoperate between different toolchains or programming
languages.
Implementations which support the Virtualization Extensions have additional registers available
in Hypervisor (Hyp) mode, which are not shown in Figure 4-1 on page 4-4. Hyp mode has
access to its own versions of R13 (SP) and SPSR. It uses the User mode link register as well as
a dedicated new register (ELR). We'll discuss this in Chapter 27 Virtualization.
4.3.1 Program Status Registers
The program status registers form an additional set of banked registers. Six are used as Saved
Program Status Registers (SPSR) and save a copy of the pre-exception CPSR when switching
modes upon an exception. These are not accessible from system or User modes. So, for example,
in User mode, we can see only CPSR. In FIQ mode, we can see CPSR and SPSR_FIQ, but have
no direct access to SPSR_IRQ, SPSR_ABT, etc.
The ARM Architecture Reference Manual describes how program status is reported in the
32-bit Application Program Status Register (APSR), with other status and control bits (system
level information) remaining in the CPSR. In the ARMv7-A architecture covered in this book,
the APSR is in fact the same register as the CPSR, despite the fact that they have two separate
names. The APSR must be used only to access the N, Z, C, V, Q, and GE[3:0] bits. These bits
are not normally accessed directly, but instead set by condition code setting instructions and
tested by instructions which are executed conditionally. The renaming is therefore an attempt
to clean up the mixed-access CPSR of the older ARM architectures. Figure 4-3 shows the make-up
of the CPSR.

[Diagram: CPSR bit assignments - N (bit 31), Z (30), C (29), V (28), Q (27), IT[1:0] (26:25), J (24), Reserved (23:20), GE[3:0] (19:16), IT[7:2] (15:10), E (9), A (8), I (7), F (6), T (5) and M[4:0] (4:0).]

Figure 4-3 CPSR bits

The individual bits represent the following:
•	N – Negative result from ALU
•	Z – Zero result from ALU
•	C – ALU operation Carry out
•	V – ALU operation oVerflowed
•	Q – cumulative saturation (also described as “sticky”)
•	J – indicates if processor is in Jazelle state
•	GE[3:0] – used by some SIMD instructions
•	IT[7:2] – If-Then conditional execution of Thumb-2 instruction groups
•	E bit controls load/store endianness
•	A bit disables imprecise data aborts
•	I bit disables IRQ
•	F bit disables FIQ
•	T bit – T = 1 indicates processor in Thumb state
•	M[4:0] – specifies the processor mode (FIQ, IRQ, etc. as described in Table 4-1 on page 4-3).

The processor can change between modes using instructions which directly write to the CPSR
mode bits (not possible when in User mode). More commonly, the processor changes mode as
a result of exception events.
We will consider these bits in more detail in Chapter 6 and Chapter 10.

4.4 Instruction pipelines
All modern processors use an instruction pipeline, as a way to increase instruction throughput.
The basic concept is that the execution of an instruction is broken down into a series of
independent steps. Each instruction moves from one step to another, over a number of clock
cycles. Each pipeline stage handles a part of the process of executing an instruction, so that on
any given clock cycle, a number of different instructions can be in different stages of the
pipeline. The total time to execute an individual instruction does not change much compared
with a non-pipelined implementation, but the overall throughput is significantly raised. The
overall speed of the processor is then governed by the speed of the slowest step, which is
significantly less than the time needed to perform all steps. A non-pipelined architecture is
inefficient because some blocks within the processor will be idle most of the time during the
instruction execution.

[Diagram: three instructions passing through Fetch, Decode and Execute stages on successive clock cycles, each instruction entering the pipeline one cycle after the one before it.]

Figure 4-4 Pipeline instruction flow

The classic pipeline comprises three stages – Fetch, Decode and Execute as shown in
Figure 4-4. More generally, an instruction pipeline might be divided into the following broad
definitions:
•	Instruction prefetch (deciding from which locations in memory instructions are to be fetched, and performing associated bus accesses).
•	Instruction fetch (reading instructions to be executed from the memory system).
•	Instruction decode (working out what instruction is to be executed and generating appropriate control signals for the datapaths).
•	Register fetch (providing the correct register values to act upon).
•	Issue (issuing the instruction to the appropriate execute unit).
•	Execute (the actual ALU or multiplier operation, for example).
•	Memory access (performing data loads or stores).
•	Register write-back (updating processor registers with the results).

In individual processor implementations, some of these steps can be combined into a single
pipeline stage, or some steps can be spread over several cycles. A longer pipeline means
fewer logic gates in the critical path between each pipeline stage, which permits a higher
clock frequency and therefore faster execution. However, there are typically many dependencies between instructions. If an
instruction depends on the result of a previous instruction, the control logic might need to insert
a stall (or bubble) into the pipeline until the dependency is resolved. Additional logic is needed
to detect and resolve such dependencies (for example, forwarding logic, which feeds the output
of a pipeline stage back to earlier pipeline stages). This makes processors with longer pipelines
significantly more complex to design and validate. More importantly, it makes the processor
larger and therefore more expensive.
In general, the ARM architecture tries to hide pipeline effects from the programmer. This means
that the programmer can determine the pipeline structure only by reading the processor manual.
Some pipeline artifacts are still present, however. For example, the program counter register
(R15) points two instructions ahead of the instruction that is currently executing in ARM state,
a legacy of the three stage pipeline of the original ARM1 processor.
A further drawback of a long pipeline is that sometimes the sequential execution of instructions
from memory will be interrupted. This can happen as a result of execution of a branch
instruction, or by an exception event (such as an interrupt). When this happens, the processor
cannot determine the correct location from which the next instruction should be fetched until
the branch is resolved. In typical code, many branch instructions are conditional (as a result of
loops or if statements). Therefore, whether or not the branch will be taken cannot be determined
at the time the instruction is fetched. If we fetch instructions which follow a branch and the
branch is taken, the pipeline must be flushed and a new set of instructions from the branch
destination must be fetched from memory instead. As pipelines get longer, the cost of this
“branch penalty” becomes higher.
Cortex-A series processors have branch prediction logic which aims to reduce the effect of the
branch penalty. In essence, the processor guesses whether a branch will be taken or not and
fetches instructions either from the instructions immediately following the branch (if the
prediction is that the conditional branch will not be taken), or from the target instruction of the
branch (if the prediction is that the branch will be taken). If the prediction is correct, the branch
does not flush the pipeline. If the prediction is wrong, the pipeline must be flushed and
instructions from the correct location fetched to refill it. We will look at this in more detail in
Branch prediction on page 4-10.
4.4.1 Multi-issue pipelines
A refinement of the processor pipeline is that we can duplicate logic within pipeline stages. In
the ARM11 processor family, for example, there are three parallel back-end pipelines – an ALU
pipeline, a load/store pipeline and a multiply pipeline. Instructions can be issued into any of
these pipelines. A logical development of this idea is to have multiple instances of the execute
hardware – for example two ALU pipelines. We can then issue more than one instruction per
cycle into these parallel pipelines – an example of instruction level parallelism. Such a processor
is said to be superscalar. The Cortex-A8, Cortex-A9, and Cortex-A15 processors are
superscalar processors – they can potentially decode and issue more than one instruction in a
single clock cycle. The Cortex-A5 processor is more limited and can only dual-issue certain
combinations of instructions – for example, a branch and a data-processing instruction can be
issued in the same cycle. The instructions are still issued from a sequential stream of instructions
in memory. Extra hardware logic is required to check for dependencies between instructions, as,
for example, in the case where one instruction must wait for the result of the other.
The processor pipeline is too complex for the programmer to take care of all pipeline effects and
dependencies.
Out-of-order execution provides scope for increasing pipeline efficiency. If instructions are
processed sequentially, one instruction is completely retired before the next is dealt with. In
out-of order processing, multiple memory accesses can be outstanding at once, and can
complete in a different order from their original program order.
Often, an instruction must be stalled due to a dependency (for example, the need to use a result
from a previous instruction). We can execute following instructions which do not share this
dependency, provided that logical hazards between instructions are rigorously respected. The
Cortex-A9 processor achieves very high levels of efficiency and instruction throughput using
this technique. It can be considered to have a pipeline of variable length, as the pipeline length
depends upon which back-end execution pipeline an instruction uses. It can execute instructions
speculatively and can sustain two instructions per clock, but has the ability to issue up to four
instructions on an individual clock cycle. This can improve performance if the pipeline has
become unblocked having previously been stalled for some reason.
4.4.2 Register renaming
The Cortex-A9 processor has an interesting micro-architectural implementation which makes
use of a register renaming scheme. The set of registers which form a standard part of the ARM
architecture are visible to the programmer, but the hardware implementation of the processor
actually has a much larger pool of physical registers, with logic to dynamically map the
programmer visible registers to the physical ones. Figure 4-5 shows the separate pools of
architectural and physical registers.
Consider the case where code writes the value of a register to external memory and shortly
thereafter reads the value of a different memory location into the same register. This might cause
a pipeline stall in previous processors, even though in this particular case, there is no actual data
dependency. Register renaming avoids this problem by ensuring that the two instances of R0 are
renamed to different physical registers, removing the dependency. This permits a compiler or
assembler programmer to reuse registers without the need to consider architectural penalties for
reusing registers when there are no inter-instruction dependencies. Importantly, it also allows
out-of-order execution of write-after-write and write-after-read sequences. (A write-after-write
hazard could occur when we write values to the same register in two separate instructions. The
processor must ensure that an instruction which comes after the two writes sees the result of the
later instruction.)

[Diagram: the architectural registers (R0, R1, ... LR_USR, plus the CPSR) are dynamically mapped onto a larger pool of physical registers (P0, P1, P2, P3, ...) and flag registers (Flag 0, Flag 1, ...).]

Figure 4-5 Register renaming

To avoid dependencies between instructions related to flag setting and comparisons, the APSR
flags also use a similar technique.

4.5 Branch prediction
As we have seen, branch prediction logic is an important factor in achieving high throughput in
Cortex-A series processors. With no branch prediction, we would have to wait until a
conditional branch executes before we could determine where to fetch the next instruction from.
The first time that a conditional jump instruction is fetched, there is little information on which
to base a prediction about the address of the next instruction. Older ARM processors used static
branch prediction. This is the simplest branch prediction method as it needs no prior information
about the branch. We speculate that backward branches will be taken, and forward branches will
not. A backward branch has a target address that is lower than its own address. This can easily
be recognized in hardware as the branch offset is encoded as a two’s complement number. We
can therefore look at a single opcode bit to determine the branch direction. This technique can
give reasonable prediction accuracy owing to the prevalence of loops in code, which almost
always end with backward-pointing branches that are taken more often than not. Due to the
pipeline length of Cortex-A series processors, we get better performance by using more
complex branch prediction schemes, which give better prediction accuracy. This comes with a
small price, as additional logic is required.
Dynamic prediction hardware can further reduce the average branch penalty by making use of
history information about whether conditional branches were taken or not taken on previous
execution. A Branch Target Address Cache (BTAC), also called Branch Target Buffer (BTB) in
the Cortex-A8 processor, is a cache which holds information about previous branch instruction
execution. It enables the hardware to speculate on whether a conditional branch will or will not
be taken.
The processor must still evaluate the condition code attached to a branch instruction. If the
branch prediction hardware predicts correctly, the pipeline does not need to be stalled. If the
branch prediction hardware speculation was wrong, the processor will flush the pipeline and
refill it.

4.5.1 Return stack
Readers who are not at all familiar with ARM assembly language may want to omit this section
until they have read Chapter 5 and Chapter 6.
The description in Branch prediction looked at strategies the processor can use to predict
whether branches are taken or not. For most branch instructions, the target address is fixed (and
encoded in the instruction). However, there is a class of branches where the branch target
destination cannot be determined by looking at the instruction. For example, if we perform a
data processing operation which modifies the PC (for example, MOV, ADD or SUB) we must wait for
the ALU to evaluate the result before we can know the branch target. Similarly if we load the
PC from memory, using an LDR, LDM or POP instruction, we cannot know the target address until
the load completes.
Such branches (often called indirect branches) cannot, in general, be predicted in hardware.
There is, however, one common case that can usefully be optimized, using a last-in-first-out
stack in the pre-fetch hardware (the return stack). Whenever a function call (BL or BLX)
instruction is executed, we enter the address of the following instruction into this stack.
Whenever we encounter an instruction which can be recognized as being a function return
instructions (BX LR, or a stack pop which contains the PC in its register list), we can speculatively
pop an entry from the stack and start fetching instructions from that address. When the return
instruction actually executes, the hardware compares the address generated by the instruction
with that predicted by the stack. If there is a mismatch, the pipeline is flushed and we restart
from the correct location.

ARM DEN0013B
ID082411

Copyright © 2011 ARM. All rights reserved.
Non-Confidential

4-10

ARM Registers, Modes and Instruction Sets

The return stack is of a fixed size (eight entries in the Cortex-A8 or Cortex-A9 processors, for
example). If a particular code sequence contains a large number of nested function calls, the
return stack can predict only the first eight function returns. The effect of this is likely to be very
small, as most functions do not invoke eight levels of nested functions.
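For example, a call/return pair like the following hypothetical sketch is predicted well, because BL pushes the return address onto the return stack and BX LR is recognized as a function return:

        BL   double_it      ; call: the return address is pushed onto the return stack
        MOV  r4, r0         ; execution resumes here after the predicted return

double_it
        ADD  r0, r0, r0     ; hypothetical leaf function body
        BX   LR             ; return: the target is predicted by popping the return stack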
4.5.2 Programmer’s view
For the majority of application level programmers, branch prediction is a part of the hardware
implementation which can safely be ignored. However, knowledge of the processor behavior
with branches can be useful when writing highly optimized code. The hardware performance
monitor counters can generate information about the numbers of branches correctly or
incorrectly predicted. This hardware is described further in Chapter 17.
Branch prediction logic is disabled at reset. Part of the boot code sequence will typically be to
set the Z bit in the CP15:SCTLR, System Control Register, which enables branch prediction.
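As a sketch (assuming an ARMv7-A processor, where Z is bit [11] of the SCTLR), the boot code might contain a read-modify-write sequence such as:

        MRC  p15, 0, r0, c1, c0, 0    ; read the System Control Register into r0
        ORR  r0, r0, #(1 << 11)       ; set the Z bit to enable branch prediction
        MCR  p15, 0, r0, c1, c0, 0    ; write the modified value back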
There is one other situation where the programmer might need to take care. When moving or
modifying code at an address from which code has already been executed in the system, it might
be necessary (and is always prudent) to remove stale entries from the branch history logic by
using the CP15 instruction which invalidates all entries.
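For example, on ARMv7-A processors the entire branch predictor array can be invalidated with the BPIALL operation; the barriers shown here are a conservative sketch to ensure the invalidation takes effect before execution continues:

        MOV  r0, #0                   ; value should be zero for BPIALL
        MCR  p15, 0, r0, c7, c5, 6    ; BPIALL: invalidate all branch predictor entries
        DSB                           ; ensure completion of the maintenance operation
        ISB                           ; flush the pipeline so later fetches see the effect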


Chapter 5
Introduction to Assembly Language

Assembly language is a human-readable representation of machine code. There is in general a
one-to-one relationship between assembly language instructions (mnemonics) and the actual
binary opcode executed by the processor. The purpose of this chapter is not to teach assembly
language programming. We describe the ARM and Thumb instruction sets, highlighting features
and idiosyncrasies that differentiate them from other microprocessor families.
Many programmers writing at the application level will have little need to code in assembly
language. However, knowledge of assembly code can be useful in cases where highly optimized
code is required, when writing JIT compilers, or where low level use of features not directly
available in C is needed. It might be required for portions of boot code, device drivers or when
performing OS development. Finally, it can be useful to be able to read assembly code when
debugging C, and particularly, to understand the mapping between assembly instructions and C
statements.
Programmers seeking a more detailed description of ARM assembly language should also refer to
the ARM Compiler Toolchain Assembler Reference.
The ARM architecture supports implementations across a very wide range of performance points.
Its simplicity leads to very small implementations, and this enables very low power consumption.
Implementation size, performance, and very low power consumption are key attributes of the ARM
architecture.


5.1 Comparison with other assembly languages
All processors have basic data processing instructions which permit them to perform arithmetic
operations (such as ADD) and logical bit manipulation (for example AND). They also need to
transfer program execution from one part of the program to another, in order to support loops
and conditional statements. Processors always have instructions to read and write external
memory, too.
The ARM instruction set is generally considered to be simple, logical and efficient. It has
features not found in other processors, while at the same time lacking operations found in some
other processors. For example, it cannot perform data processing operations directly on
memory. To increment a value in a memory location, the value must be loaded to an ARM
register, the register incremented and a third instruction is required to write the updated value
back to memory. The Instruction Set Architecture (ISA) includes instructions that combine a
shift with an arithmetic or logical operation, auto-increment and auto-decrement addressing
modes for optimized program loops, Load, and Store Multiple instructions which allow efficient
stack and heap operations, plus block copying capability and conditional execution of almost all
instructions.
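For instance, the increment described above needs a three-instruction load/modify/store sequence (the register and addressing choices here are illustrative):

        LDR  r1, [r0]       ; load the value from the address held in r0
        ADD  r1, r1, #1     ; increment it in the register
        STR  r1, [r0]       ; store the updated value back to memory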
As many readers will already be familiar with one or more assembly languages, it might be
useful to compare some code sequences, showing the x86, 68K and ARM instructions to
perform equivalent tasks.
Like the x86 (but unlike the 68K), ARM instructions typically have a two or three operand
format, with the first operand in most cases specifying the destination for the result (LDM and
store instructions, for example, being an exception to this rule). The 68K, by contrast, places the
destination as the last operand. For ARM instructions, there are generally no restrictions on
which registers can be used as operands. Example 5-1 and Example 5-2 give a flavor of the
differences between the different assembly languages.
Example 5-1 Instructions to add 100 to a value in a register

x86:    add     eax, 100
68K:    ADD     #100, D0
ARM:    add     r0, r0, #100

Example 5-2 Load a register with a 32-bit value from a register pointer

x86:    mov     eax, DWORD PTR [ebx]
68K:    MOVE.L  (A0), D0
ARM:    ldr     r0, [r1]

An ARM processor is a Reduced Instruction Set Computer (RISC) processor. Complex
Instruction Set Computer (CISC) processors, like the x86, have a rich instruction set capable of
doing complex things with a single instruction. Such processors often have significant amounts
of internal logic which decode machine instructions to sequences of internal operations
(microcode). RISC architectures, in contrast, have a smaller number of more general purpose
instructions, which might be executed with significantly fewer transistors, making the silicon
cheaper and more power efficient. Like other RISC architectures, ARM processors have a large
number of general-purpose registers, and many instructions execute in a single cycle. Addressing
modes are simple, with all load/store addresses determined from just register contents and
instruction fields.


5.2 Instruction sets
As described in Chapter 4, many ARM processors are able to execute two or even three different
instruction sets, while some (for example, the Cortex-M3 processor) do not in fact execute the
original ARM instruction set. There are at least two instruction sets that ARM processors can
use.
ARM (32-bit instructions)
This is the original ARM instruction set.
Thumb

The Thumb instruction set was first added in the ARM7TDMI processor and
contained only 16-bit instructions, which gave much smaller programs (memory
footprint can be a major concern in smaller embedded systems) at the cost of
some performance. Recent processors, including those in the Cortex-A series,
support Thumb-2 technology, which extends the Thumb instruction set to provide
a mix of 16-bit and 32-bit instructions. This gives the best of both worlds,
performance similar to that of ARM, with code size similar to that of Thumb. Due
to its size and performance advantages, it is increasingly common for all code to
be compiled or assembled to take advantage of Thumb-2 technology.

The currently used instruction set is indicated by the CPSR T bit and the processor is said to be
in ARM state or Thumb state. Code has to be explicitly compiled or assembled to one state or
the other. An explicit instruction is used to change between instruction sets. Calling functions
which are compiled for a different state is known as inter-working. We’ll take a more detailed
look at this in Interworking on page 5-11.
For Thumb assembly code, there is often a choice of 16-bit and 32-bit instruction encodings,
with the 16-bit versions being generated by default. The .W (32-bit) and .N (16-bit) width
specifiers can be used to force a particular encoding (if such an encoding exists).
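For example (an illustrative sketch), the same addition can be forced to either encoding, provided the encoding exists:

        ADDS.N  r0, r0, #1    ; force the 16-bit Thumb encoding
        ADDS.W  r0, r0, #1    ; force the 32-bit Thumb-2 encoding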


5.3 ARM tools assembly language
The Unified Assembly Language (UAL) format now used by ARM tools enables the same
canonical syntax to be used for both ARM and Thumb instruction sets. The assembler syntax of
ARM tools is not identical to that used by the GNU Assembler, particularly for preprocessing
and pseudo-instructions which do not map directly to opcodes. In the next chapter, we will look
at the individual assembly language instructions in a little more detail. Before doing that, we
take a look at the basic syntax used to specify instructions and registers. Assembly language
examples in this book use both UAL and GNU Assembly syntax.
UAL gives the ability to write assembler code which can be assembled to run on all ARM
processors. In the past, it was necessary to write code explicitly for ARM or Thumb state. Using
UAL the same code can be assembled for different instruction sets at the time of assembly, not
at the time the code is written. This can be done either through the use of command-line switches
or inline directives. Legacy code will still assemble correctly.
The format of assembly language instructions consists of a number of fields. These comprise
the actual opcode or an assembler directive or pseudo-instruction, plus (optionally) fields for
labels, operands and comments. Each field is delimited by a space or tab, with commas being
used to separate operands and a semicolon marking the start of the comment field on a line.
Entire lines can be marked as comment with an asterisk. Instructions, pseudo-instructions and
directives can be written in either lower-case, or upper-case (the convention used in this book),
but cases cannot be mixed. Symbol names are case-sensitive.

5.3.1 ARM assembly language syntax
ARM assembly language source files consist of a sequence of statements, one per line.
Each statement has three optional parts, ordered as follows:
label instruction ; comment

A label lets you identify the address of this instruction. This can then be used as a target for
branch instructions or for load and store instructions.
Everything on the line after the ; symbol is treated as a comment and ignored (unless it is inside
a string). C style comment delimiters “/*” and “*/” can also be used.
The instruction can be either an assembly instruction, or an assembler directive. These are
pseudo-instructions that tell the assembler itself to do something. These are required, amongst
other things, to control sections and alignment, or create data.
5.3.2 Label
A label is required to start in the first character of a line. If the line does not have a label, a space
or tab delimiter is needed to start the line. If there is a label, the assembler makes the label equal
to the address in the object file of the corresponding instruction. Labels can then be used as the
target for branches or for loads and stores.
Example 5-3 A simple example showing use of a label

Loop    MUL  R5, R5, R1
        SUBS R1, R1, #1
        BNE  Loop


In Example 5-3 on page 5-5, Loop is a label and the conditional branch instruction (BNE Loop) will
be assembled in a way which makes the offset encoded in the branch instruction point to the
address of the MUL instruction which is associated with the label Loop.
5.3.3 Directives
Most lines will normally have an actual assembly language instruction, to be converted by the
tool into its binary equivalent, but can also be a directive which tells the assembler to do
something. It can also be a pseudo-instruction (one which will be converted into one or more
real instructions by the assembler). We’ll look at the actual instructions available in hardware in
the next chapter and focus mainly on the assembler directives here. These perform a wide range
of tasks. They can be used to place code or data at a particular address in memory, create
references to other programs and so forth.
The DEFINE CONSTANT (DCD, DCB, DCW) directives let us place data into a piece of code. This can be
expressed numerically (in decimal, hex, binary) or as ASCII characters. It can be a single item
or a comma separated list. DCB is for byte sized data, DCD can be used for word sized data, and
DCW for half-word sized data items.
For example, we might have:
MESSAGE DCB “Hello World!”,0

This will produce a series of bytes corresponding to the ASCII characters in the string, with a 0
termination. MESSAGE is a label which we can use to get the address of this data. Similarly, we
might have data items expressed in hex:
Masks DCD 0x100, 0x80, 0x40, 0x20, 0x10

The EQU pseudo-instruction lets us assign names to address or data values. For example:
CtrlD EQU 4
TUBE EQU 0x30000000

We can then use these labels in other instructions, as parts of expressions to be evaluated. EQU
does not actually cause anything to be placed in the program executable – it merely equates a
name to a value, for use in other instructions, in the symbol table for the assembler. It is
convenient to use such names to make code easier to read, but also so that if we change the
address or value of something in a piece of code, we need only modify the original definition,
rather than having to change all of the references to it individually. It is usual to group together
EQU definitions, often at the start of a program or function, or in separate include files.
The AREA directive is used to tell the assembler how to group together code or
data into logical sections for later placement by the linker. For example, exception vectors might
need to be placed at a fixed address. The assembler keeps track of where each instruction or
piece of data is located in memory and the AREA directive can be used to modify that.
The ALIGN directive lets you align the current location to a specified boundary. It usually does
this by padding (where necessary) with zeros or NOP instructions, although it is also possible to
specify a pad value with the directive. The default behavior is to set the current location to the
next word (four byte) boundary, but larger boundary sizes and offsets from that boundary can
also be specified. This can be required to meet alignment requirements of certain instructions
(for example LDRD and STRD doubleword memory transfers), or to align with cache boundaries.
END is used to denote the end of the assembly language source program. Failure to use the END
directive will result in an error being returned. INCLUDE tells the assembler to include the contents

of another file into the current file. Include files can be used as an easy mechanism for sharing
definitions between related files.
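Bringing these directives together, a small hypothetical armasm source file might look like the following sketch (the section and label names are illustrative):

        AREA    Example, CODE, READONLY   ; group this code into a section named Example
CtrlD   EQU     4                         ; equate a name to a value
start   MOV     r0, #CtrlD                ; the constant can then be used as an immediate
MESSAGE DCB     "Hello World!",0          ; byte data: a NUL-terminated string
        ALIGN                             ; pad to the next word boundary
Masks   DCD     0x100, 0x80               ; word-sized data items
        END                               ; end of the assembly source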


5.4 Introduction to the GNU Assembler
The GNU Assembler, part of the GNU tools, is used to convert assembly language source code
into binary object files. The assembler is extensively documented in the GNU Assembler
Manual, which can be found online at http://sourceware.org/binutils/docs/as/index.html or
(if you have GNU tools installed on your system) in the gnutools/doc sub-directory.
What follows is a brief description, intended to highlight differences in syntax between the GNU
Assembler and standard ARM assembly language, and to provide enough information to allow
programmers to get started with the tools.
The names of GNU tool components will have prefixes indicating the target options selected,
including operating system. An example would be arm-none-eabi-gcc, which might be used for
bare metal systems using the ARM EABI (described in Chapter 20 Writing NEON Code).

5.4.1 Invoking the GNU Assembler
You can assemble the contents of an ARM assembly language source file by running the
arm-none-eabi-as program.
arm-none-eabi-as -g -o filename.o filename.s

The option -g requests the assembler to include debug information in the output file.
When all of your source files have been assembled into binary object files (with the extension
.o), you use the GNU Linker to create the final executable in ELF format.
This is done by executing:
arm-none-eabi-ld -o filename.elf filename.o

For more complex programs, where there are many separate source files, it is more common to
use a utility like make to control the build process.
You can use the debugger provided by either arm-none-eabi-gdb or arm-none-eabi-insight to run
the executable files on your host, as an alternative to a real target processor.
5.4.2 GNU Assembly language syntax
The GNU Assembler can target many different processor architectures and is not ARM specific.
This means that its syntax is somewhat different from other ARM assemblers, such as the ARM
toolchain. The GNU Assembler uses the same syntax for all of the many processor architectures
that it supports.
Assembly language source files consist of a sequence of statements, one per line.
Each statement has three optional parts, ordered as follows:
label: instruction @ comment

A label lets you identify the address of this instruction. This can then be used as a target for
branch instructions or for load and store instructions. A label can be a letter followed
(optionally) by a sequence of alphanumeric characters, followed by a colon.
Everything on the line after the @ symbol is treated as a comment and ignored (unless it is inside
a string). C style comment delimiters “/*” and “*/” can also be used.
The instruction can be either an ARM assembly instruction, or an assembler directive. These
are pseudo-instructions that tell the assembler itself to do something. These are required,
amongst other things, to control sections and alignment, or create data.


At link time, an entry point can be specified on the command line if one has not been explicitly
provided in the source code.
5.4.3 Sections
An executable program with code will have at least one section, which by convention will be
called .text. Data can be included in a .data section.
Directives with the same names enable you to specify which of the two sections should hold
what follows in the source file. Executable code should appear in a .text section and read/write
data in the .data section. Read-only constants can appear in a .rodata section. Zero-initialized
data will appear in .bss. The Block Started by Symbol (bss) segment defines the
space for uninitialized static data.
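A hypothetical GNU assembly source file using these sections might be laid out as follows (names are illustrative):

        .data
count:  .word   0                 @ initialized read/write data
        .section .rodata
msg:    .asciz  "Hello"           @ read-only constant data
        .bss
buffer: .space  64                @ zero-initialized space
        .text
        .global _start
_start: ldr     r0, =count        @ executable code belongs in the .text section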

5.4.4 Assembler directives
This is a key area of difference between GNU tools and other assemblers.
All assembler directives begin with a period “.”. A full list of these is described in the GNU
documentation. Here, we give a subset of commonly used directives.
.align

This causes the assembler to pad the binary with bytes of zero value, in data
sections, or NOP instructions in code, ensuring the next location will be on a word
boundary.

.ascii “string”

Insert the string literal into the object file exactly as specified, without a NUL
character to terminate. Multiple strings can be specified using commas as
separators.
.asciz

Does the same as .ascii, but this time additionally followed by a NUL character
(a byte with the value 0 (zero)).

.byte expression, .hword expression, .word expression

Inserts a byte, halfword, or word value into the object file. Multiple values can be
specified using commas as separators. The synonyms .2byte and .4byte can also
be used.
.data

Causes the following statements to be placed in the data section of the final
executable.

.end

Marks the end of this source code file.

.equ symbol, expression

Sets the value of symbol to expression. The “=” symbol and .set have the same
effect.
.extern symbol

Indicates to the assembler (and more importantly, to anyone reading the code) that
symbol is defined in another source code file.
.global symbol

Tells the assembler that symbol is to be made globally visible to other source files
and to the linker.


.include “filename”

Inserts the contents of filename into the current source file and is typically used
to include header files containing shared definitions.
.text

This switches the destination of following statements into the text section of the
final output object file. Assembly instructions must always be in the text section.
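As a short sketch, several of these directives might be combined as follows (the names are illustrative):

        .equ    SIZE, 3           @ symbolic constant; “=” or .set would also work
        .global table             @ make the label visible to other files and the linker
table:  .byte   1, 2, 3           @ three byte values
        .align                    @ pad to the next word boundary
        .word   table             @ a word holding the address of table
        .ascii  "OK"              @ string without NUL termination
        .asciz  "done"            @ string with NUL termination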

For reference, Table 5-1 shows common assembler directives for the GNU and ARM tools side by
side. Not all directives are listed, and in some cases there is not a 100% correspondence between
them.

Table 5-1 Comparison of syntax

GNU Assembler   armasm          Description
@               ;               Comment
#&              #0x             An immediate hex value
.if             IFDEF, IF       Conditional (not 100% equivalent)
.else           ELSE
.elseif         ELSEIF
.endif          ENDIF
.ltorg          LTORG
|               :OR:            OR
&               :AND:           AND
<<              :SHL:           Shift Left
>>              :SHR:           Shift Right
.macro          MACRO           Start macro definition
.endm           ENDM            End macro definition
.include        INCLUDE         GNU Assembler needs “file”
.word           DCD             A data word
.short          DCW
.long           DCD
.byte           DCB
.req            RN
.global         IMPORT, EXPORT
.equ            EQU


5.4.5 Expressions
Assembly instructions and assembler directives often require an integer operand. In the
assembler, this is represented as an expression to be evaluated. Typically, this will be an integer
number specified in decimal, hexadecimal (with a 0x prefix) or binary (with a 0b prefix) or as
an ASCII character surrounded by quotes.
In addition, standard mathematical and logical expressions can be evaluated by the assembler
to generate a constant value. These can utilize labels and other pre-defined values. These
expressions produce either absolute or relative values. Absolute values are
position-independent and constant. Relative values are specified relative to some linker-defined
address, determined when the executable image is produced – an example might be some offset
from the start of the .data section of the program.
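For example (an illustrative sketch), both kinds of expression can appear in a source file:

        .equ    WIDTH, 8
        .equ    PIXELS, WIDTH * WIDTH   @ absolute: the assembler evaluates this to 64
start:  mov     r0, #PIXELS             @ used as an immediate operand
end:
        .word   end - start             @ difference of two labels: the code size in bytes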

5.4.6 GNU tools naming conventions
Registers are named in GCC as follows:
•	General registers: R0 - R15
•	Stack pointer register: SP (R13)
•	Frame pointer register: FP (R11)
•	Link register: LR (R14)
•	Program counter: PC (R15)
•	Status register flags (x = C current or S saved): xPSR, xPSR_all, xPSR_f, xPSR_x,
	xPSR_ctl, xPSR_fs, xPSR_fx, xPSR_cs, xPSR_cf, xPSR_cx and so on.

Note
In Chapter 15 Application Binary Interfaces we will see how all of the registers are assigned a
role within the procedure call standard and that the GNU Assembler lets us refer to the registers
using their PCS names. See Table 15-1 on page 15-2.


5.5 Interworking
When the processor executes ARM instructions, it is said to be operating in ARM state. When
it is operating in Thumb state, it is executing Thumb instructions. A processor in a particular
state can only sensibly execute instructions from that instruction set. We must make sure that
the processor does not receive instructions of the wrong instruction set.
Each instruction set includes instructions to change processor state. ARM and Thumb code can
be mixed, if the code conforms to the requirements of the ARM and Thumb Procedure Call
Standards (described in Chapter 15). Compiler generated code will always do so, but assembly
language programmers must take care to follow the specified rules.
Selection of processor state is controlled by the T bit in the current program status register.
When T is 1, the processor is in Thumb state. When T is 0, the processor is in ARM state.
However, when the T bit is modified, it is also necessary to flush the instruction pipeline (to
avoid problems with instructions being decoded in one state and then executed in another).
Special instructions are used to accomplish this. These are BX (Branch with eXchange) and BLX
(Branch and Link with eXchange). LDR of PC and POP/LDM of PC also have this behavior. In addition
to changing the processor state with these instructions, assembly programmers must also use the
appropriate directive to tell the assembler to generate code for the appropriate state.
The BX or BLX instruction branches to an address contained in the specified register, or an offset
specified in the opcode. The value of bit [0] of the branch target address determines whether
execution continues in ARM state or Thumb state. Because ARM instructions are aligned to a word
boundary and Thumb instructions to a halfword boundary, neither instruction set uses bit [0] to
form an address. This bit can therefore safely be used to indicate whether the BX or BLX instruction
should change the state to ARM (address bit [0] = 0) or Thumb (address bit [0] = 1). An
unconditional BL label will be converted to BLX label at link time, where required, if the instruction
set of the caller differs from the instruction set of the code at label.
A typical use of these instructions is when a call from one function to another is made using the
BL or BLX instruction, and a return from that function is made using the BX LR instruction.

Alternatively, we can have a non-leaf function, which pushes the link register onto the stack on
entry and pops the stored link register from the stack into the program counter, on exit. Here,
instead of using the BX LR instruction to return, we instead have a memory load. Memory load
instructions which modify the PC might also change the processor state depending upon the
value of bit [0] of the loaded address.
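The two return idioms can be sketched as follows (hypothetical functions, armasm-style syntax):

leaf    ADD   r0, r0, r1          ; a leaf function: no registers saved
        BX    LR                  ; return: bit [0] of LR selects ARM or Thumb state

caller  PUSH  {r4, lr}            ; a non-leaf function saves the link register on entry
        BL    leaf                ; call: LR is updated with the return address
        POP   {r4, pc}            ; return via a load: bit [0] of the loaded value selects the state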


5.6 Identifying assembly code
When faced with a piece of assembly language source code, it can be useful to be able to quickly
determine which instruction set will be used and which kind of assembler it is targeted at.
Older ARM Assembly language code can have three (or even four) operand instructions present
(for example, ADD R0, R1, R2) or conditional execution of non-branch instructions (for example,
ADDNE R0, R0, #1). Filenames will typically be .s or .S.
Code targeted at the newer UAL will contain the directive .syntax unified but will otherwise
appear similar to traditional ARM assembly language. The pound (or hash) symbol # can be
omitted in front of immediate operands. Conditional instruction sequences must be preceded
immediately by the IT instruction (described in Chapter 6). Such code assembles either to
fixed-size 32-bit (ARM) instructions, or mixed-size (16-/32-bit) Thumb instructions, depending
on the presence of the directives .code, .thumb or .arm.
You can, on occasion, encounter code written in 16-bit Thumb assembly language. This can
contain directives like .code 16, .thumb or .thumb_func but will not specify .syntax unified. It
uses two operands for most instructions, although ADD and SUB can sometimes have three. Only
branches can be executed conditionally.
All GCC inline assembler code (in .c, .h, .cpp, .cxx, .c++ files and so on) can build for Thumb or
ARM, depending on the GCC configuration and command-line switches (-marm or -mthumb).


Chapter 6
ARM/Thumb Unified Assembly Language Instructions

This chapter is a general introduction to ARM/Thumb assembly language; we do not aim to provide
detailed coverage of every instruction. As mentioned in the previous chapter, instructions can
broadly be placed in one of a number of classes:
•	data operations (ALU operations like ADD)
•	memory operations (loads and stores to memory)
•	branches (for loops, goto, conditional code and other program flow control)
•	DSP (operations on packed data, saturated mathematics and other special instructions
	targeting codecs)
•	miscellaneous (coprocessor, debug, mode changes and so forth).

We’ll take a brief look at each of those in turn. Before we do that, let us examine capabilities which
are common to different instruction classes.


6.1 Instruction set basics
There are a number of features common to all parts of the instruction set.

6.1.1 Constant values
ARM or Thumb assembly language instructions have a length of only 16 or 32 bits. This
presents something of a problem. It means that we cannot encode an arbitrary 32-bit value
within the opcode.
Constant values encoded in an instruction can be one of the following in Thumb:
•	a constant that can be produced by rotating an 8-bit value by any even number of bits
	within a 32-bit word
•	a constant of the form 0x00XY00XY
•	a constant of the form 0xXY00XY00
•	a constant of the form 0xXYXYXYXY.

Where XY is a hexadecimal number in the range 0x00 to 0xFF.
In the ARM instruction set, as opcode bits are used to specify condition codes, the instruction
itself and the registers to be used, only 12 bits are available to specify an immediate value. We
have to be somewhat creative in how these 12 bits are used. Rather than enabling a constant in the
range –2048 to +2047 to be specified, the 12 bits are instead divided into an 8-bit constant and a
4-bit rotate value. The rotate value enables the 8-bit constant to be rotated right by a number of
places from 0 to 30, in steps of 2 (that is, 0, 2, 4, 6, 8 and so on).
So, we can have immediate values like 0x23 or 0xFF. And we can produce other useful immediate
values (for example, addresses of peripherals or blocks of memory). For example, 0x23000000
can be produced by expressing it as 0x23 ROR 8. But many other constants, like 0x3FF, cannot be
produced within a single instruction. For these values, you must either construct them in
multiple instructions, or load them from memory. Programmers do not typically concern
themselves with this, except where the assembler gives an error complaining about an invalid
constant. Instead, we can use assembly language pseudo-instructions to generate the required
constant.
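As a sketch, 0x3FF could be built from two immediates that do fit the 8-bit-plus-rotation scheme:

        MOV  r0, #0x3FC           ; 0xFF rotated right by 30: a valid immediate
        ORR  r0, r0, #0x3         ; r0 now holds 0x3FF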
The MOVW instruction (move wide), will move a 16-bit constant into a register, while zeroing the
top 16 bits of the target register. MOVT (move top) will move a 16-bit constant into the top half of
a given register, without changing the bottom 16 bits. This permits a MOV32 pseudo-instruction
which is able to construct any 32-bit constant. The assembler provides some further help here.
The prefixes :upper16: and :lower16: allow you to extract the corresponding half from a 32-bit
constant:
MOVW R0, #:lower16:label
MOVT R0, #:upper16:label

Although this needs two instructions, it does not require any extra space to store the constant,
and there is no need to read a data item from memory.
We can also use the pseudo-instructions LDR Rn, =<constant> or LDR Rn, =label. (This was the only
option for older processors which lacked MOVW and MOVT.) The assembler will then use the best
sequence to generate the constant in the specified register (one of MOV, MVN or an LDR from a literal
pool). A literal pool is an area of constant data held within the code section, typically after the
end of a function and before the start of another. If it is necessary to manually control literal pool
placement, this can be done with an assembler directive – LTORG for armasm, or .ltorg when
using GNU tools. The register loaded could be the program counter, which would cause a

ARM DEN0013B
ID082411

Copyright © 2011 ARM. All rights reserved.
Non-Confidential

6-2

ARM/Thumb Unified Assembly Language Instructions

branch. This can be useful for absolute addressing or for references outside the current section;
obviously this will result in position-dependent code. The value of the constant can be
determined either by the assembler, or by the linker.
ARM tools also provide the related pseudo-instruction ADR Rn, label. This uses a PC-relative
ADD or SUB, to place the address of the label into the specified register, using a single instruction.
If the address is too far away to be generated this way, the ADRL pseudo-instruction is used. This
requires two instructions, which gives a better range. This can be used to generate addresses for
position-independent code, but only within the same code section.
6.1.2	Conditional execution
A feature of the ARM instruction set is that nearly all instructions are conditional. On most other
architectures, only branches/jumps can be executed conditionally. This can be useful in avoiding
conditional branches in small if/then/else constructs or for compound comparisons.
As an example of this, consider code to find the smaller of two values, in registers R0 and R1
and place the result in R2. This is shown in Example 6-1. The suffix LT indicates that the
instruction should be executed only if the most recent flag-setting instruction returned “less
than”; GE means “greater than or equal”.
Example 6-1 Example code showing branches (GNU)

@ Code using branches
        CMP     R0, R1
        BLT     .Lsmaller
        MOV     R2, R1
        B       .Lend
.Lsmaller:
        MOV     R2, R0
.Lend:
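The branch-free alternative replaces the branches with conditional moves, MOVGE and MOVLT. The following Python model (illustrative only; the flag computation follows the architectural definition of CMP as a subtraction) shows how the LT and GE conditions derive from the N and V flags:

```python
def cmp_flags(a, b):
    """Model the N, Z, C, V flags set by CMP a, b on 32-bit values."""
    res = (a - b) & 0xFFFFFFFF
    n = res >> 31                                    # negative result
    z = int(res == 0)                                # zero result
    c = int((a & 0xFFFFFFFF) >= (b & 0xFFFFFFFF))    # no borrow
    v = ((a ^ b) & (a ^ res)) >> 31 & 1              # signed overflow
    return n, z, c, v

def smaller(r0, r1):
    """CMP R0, R1 / MOVGE R2, R1 / MOVLT R2, R0."""
    n, z, c, v = cmp_flags(r0, r1)
    return r1 if n == v else r0    # GE: N == V; LT: N != V

assert smaller(3, 7) == 3
assert smaller(7, 3) == 3
```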

An operation of the form R0 = R1 >> 2 is done as MOV R0, R1, LSR #2. Equally, it is common to combine shifts with ADD, SUB or other instructions.
For example, to multiply R0 by 5, we might write:
ADD R0, R0, R0, LSL #2

A left shift of n places is a multiply by 2 to the power of n, so this makes R0 = R0 + (4 * R0). A right shift provides the corresponding divide operation, although ASR rounds negative values differently than division in C would.
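The rounding difference is easy to demonstrate. A Python sketch (illustrative) of a 32-bit arithmetic shift right:

```python
def asr(x, shift, bits=32):
    """Arithmetic shift right of a two's-complement value."""
    mask = (1 << bits) - 1
    x &= mask
    if x >> (bits - 1):           # sign bit set: treat as negative
        x -= 1 << bits
    return (x >> shift) & mask    # Python's >> floors, like ASR

minus7 = (-7) & 0xFFFFFFFF
assert asr(minus7, 1) == 0xFFFFFFFC   # -4: ASR rounds towards minus infinity
assert int(-7 / 2) == -3              # C division rounds towards zero
```

This is why compilers emit a small adjustment when implementing signed division by a power of two with ASR.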
Apart from multiply and divide, another common use for shifted operands is array index
look-up. Consider the case where R1 points to the base element of an array of int (32-bit)
values and R2 is the index which points to the nth element in that array. We can obtain the array
value with a single load instruction which uses the calculation R1 + (R2 * 4) to get the
appropriate address. Example 6-3 on page 6-8 provides examples of differing operand 2 types
used in ARM instructions.


Example 6-3 Examples of different ARM instructions showing a variety of operand 2 types

add R0, R1, #1            @ R0 = R1 + 1
add R0, R1, R2            @ R0 = R1 + R2
add R0, R1, R2, LSL #4    @ R0 = R1 + (R2 << 4)
add R0, R1, R2, LSL R3    @ R0 = R1 + (R2 << R3)

The CP15 Main ID Register (MIDR) is read using:

MRC p15, 0, <Rt>, c0, c0, 0

The result, placed in register Rt, tells software which processor it is running on. For an ARM
Cortex processor the interpretation of the results is as follows:

•	Bits [31:24] – implementer, will be 0x41 for an ARM designed processor.

•	Bits [23:20] – variant, shows the revision number of the processor.

•	Bits [19:16] – architecture, will be 0xF for ARM architecture v7.

•	Bits [15:4] – part number (for example, 0xC08 for the Cortex-A8 processor).

•	Bits [3:0] – revision, shows the patch revision number of the processor.
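The fields described above can be unpacked with shifts and masks. In this Python sketch, the example MIDR value is hypothetical, constructed from the field layout above (an "r1p2" Cortex-A8):

```python
def decode_midr(midr):
    """Unpack the Main ID Register fields described above."""
    return {
        "implementer":  (midr >> 24) & 0xFF,
        "variant":      (midr >> 20) & 0xF,
        "architecture": (midr >> 16) & 0xF,
        "part_number":  (midr >> 4) & 0xFFF,
        "revision":     midr & 0xF,
    }

fields = decode_midr(0x411FC082)         # hypothetical example value
assert fields["implementer"] == 0x41     # ARM
assert fields["part_number"] == 0xC08    # Cortex-A8
assert (fields["variant"], fields["revision"]) == (1, 2)   # r1p2
```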

6.8.3	SVC
The SVC (supervisor call) instruction, when executed, causes a supervisor call exception. This is
described further in Chapter 10 Exception Handling. The instruction includes a 24-bit (ARM)
or 8-bit (Thumb) number value, which can be examined by the SVC handler code. Through the
SVC mechanism, an operating system can specify a set of privileged operations which
applications running in User mode can request. This instruction was originally called SWI
(Software Interrupt).

6.8.4	PSR modification
Several instructions allow the PSR to be written to, or read from:
•	MRS transfers the CPSR or SPSR value to a general purpose register; MSR transfers a general purpose register to the CPSR or SPSR. Either the whole status register, or just part of it, can be updated. In User mode, all bits can be read, but only the condition flags (_f) are permitted to be modified.

•	The Change Processor State (CPS) instruction can be used to directly modify the mode and interrupt enable/disable (I/F) bits in the CPSR in a privileged mode. See Figure 4-3 on page 4-6.

•	SETEND modifies a single CPSR bit, the E (Endian) bit. This can be used in systems with mixed endian data to temporarily switch between little- and big-endian memory access.
6.8.5	Bit manipulation
There are instructions which allow bit manipulation of values in registers:


•	The Bit Field Insert (BFI) instruction allows a series of adjacent bits from one register (specified by supplying a width value and LSB position) to be placed into another.

•	The Bit Field Clear (BFC) instruction allows adjacent bits within a register to be cleared.

•	The SBFX and UBFX instructions (Signed and Unsigned Bit Field Extract) copy adjacent bits from one register to the least significant bits of a second register, and sign extend or zero extend the value to 32 bits.

•	The RBIT instruction reverses the order of all bits within a register.
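These operations can be sketched in Python as shift-and-mask arithmetic (an illustrative model, not a simulator):

```python
def ubfx(rn, lsb, width):
    """UBFX: extract 'width' bits at 'lsb', zero-extended."""
    return (rn >> lsb) & ((1 << width) - 1)

def sbfx(rn, lsb, width):
    """SBFX: extract 'width' bits at 'lsb', sign-extended to 32 bits."""
    field = ubfx(rn, lsb, width)
    if field >> (width - 1):              # top bit of the field set
        field -= 1 << width
    return field & 0xFFFFFFFF

def bfi(rd, rn, lsb, width):
    """BFI: insert the bottom 'width' bits of rn into rd at 'lsb'."""
    mask = ((1 << width) - 1) << lsb
    return (rd & ~mask & 0xFFFFFFFF) | ((rn << lsb) & mask)

def rbit(rn):
    """RBIT: reverse the order of all 32 bits."""
    return int(format(rn & 0xFFFFFFFF, '032b')[::-1], 2)

assert ubfx(0x12345678, 8, 8) == 0x56
assert sbfx(0x0000F000, 12, 4) == 0xFFFFFFFF   # 0xF sign-extends to -1
assert bfi(0xFFFFFFFF, 0, 8, 8) == 0xFFFF00FF  # BFC is BFI of zeros
assert rbit(0x80000000) == 1
```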


6.8.6	Cache preload
Cache preloading is described further in Chapter 17 Optimizing Code to Run on ARM
Processors. Two instructions are provided, PLD (data cache preload) and PLI (instruction cache
preload). Both instructions act as hints to the memory system that an access to the specified
address is likely to occur soon. Implementations that do not support these operations will treat
a preload as a NOP, but all of the Cortex-A family processors described in this book are able to
preload the cache.

6.8.7	Byte reversal
Instructions to reverse byte order can be useful for dealing with quantities of the opposite
endianness or other data re-ordering operations.
•	The REV instruction reverses the bytes in a word.

•	REV16 reverses the bytes in each halfword of a register.

•	REVSH reverses the bottom two bytes, and sign extends the result to 32 bits.
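A Python model of the three operations (illustrative):

```python
def rev(x):
    """REV: reverse the four bytes of a word."""
    x &= 0xFFFFFFFF
    return ((x >> 24) | ((x >> 8) & 0x0000FF00) |
            ((x << 8) & 0x00FF0000) | ((x << 24) & 0xFF000000))

def rev16(x):
    """REV16: reverse the bytes within each halfword."""
    x &= 0xFFFFFFFF
    return ((x >> 8) & 0x00FF00FF) | ((x << 8) & 0xFF00FF00)

def revsh(x):
    """REVSH: byte-reverse the bottom halfword, then sign-extend."""
    h = ((x >> 8) & 0xFF) | ((x & 0xFF) << 8)
    return (h - 0x10000 if h & 0x8000 else h) & 0xFFFFFFFF

assert rev(0x12345678) == 0x78563412
assert rev16(0x12345678) == 0x34127856
assert revsh(0x00001280) == 0xFFFF8012   # 0x8012 sign-extended
```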

Figure 6-7 illustrates the operation of the REV instruction, showing how four bytes within a
register have their ordering within a word reversed.

Figure 6-7 Operation of the REV instruction

6.8.8	Other instructions
A few other instructions are available:


•	The breakpoint instruction (BKPT) will either cause a prefetch abort or cause the processor to enter debug state (depending on whether the processor is configured for monitor or halt mode debug). This instruction is used by debuggers and does not form part of application code.

•	Wait For Interrupt (WFI) puts the processor into standby mode, which is described further in Chapter 21 Power Management. The processor stops execution until woken by an interrupt or debug event. Note that if WFI is executed with interrupts disabled, an interrupt will still wake the processor, but no interrupt exception is taken; the processor proceeds to the instruction after the WFI. In older ARM processors, WFI was implemented as a CP15 operation. WFE (Wait for Event) is also described in Chapter 21.

•	A NOP instruction (no-operation) does nothing. It might or might not take time to execute, so NOP should not be used to insert timing delays into code. It is intended as padding, although it is usually better to use an assembler alignment directive for that purpose.


Chapter 7
Caches

The word cache derives from the French verb cacher, “to hide”. A cache is a hidden storage place.
The application of this word to a processor is obvious – a cache is a place where the processor can
store instructions and data, hidden from the programmer and system. In many cases, it would be
true to say that the cache is transparent to, or hidden from, the programmer. But very often, as we
shall see, it is important to understand the operation of the cache in detail.
When the ARM architecture was first developed, the clock speed of the processor and the access
speeds of memory were broadly similar. Today’s processors are much more complicated and can
be clocked orders of magnitude faster. However, the frequency of the external bus and of memory
devices has not scaled to the same extent. It is possible to implement small blocks of on-chip
SRAM which can operate at the same speed as the processor, but such RAM is very expensive in
comparison to standard DRAM blocks, which can have thousands of times more capacity. In many
ARM-based systems, access to external memory will take tens or even hundreds of cycles.
Essentially, a cache is a small, fast block of memory which (conceptually at least) sits between the
processor core and main memory. It holds copies of items in main memory. Accesses to the cache
memory happen significantly faster than those to main memory. As the cache holds only a subset
of the contents of main memory, it must store both the address of the item in main memory and the
associated data. Whenever the processor wants to read or write a particular address, it will first look
for it in the cache. Should it find the address in the cache, it will access the data in the cache, rather
than having to perform an access to main memory. This significantly increases the potential
performance of the system, by reducing the effect of slow external memory access times. An access
to an external off-chip memory can require hundreds of processor cycles. It also reduces the power
consumption of the system, by avoiding the need to drive external devices.


Cache sizes are small relative to the overall memory used in the system. Larger caches make for
more expensive chips. In addition, making an internal processor cache larger can potentially
limit the maximum speed of the processor. Significant research has gone into identifying how
hardware can determine what it should keep in the cache. Efficient use of this limited resource
is a key part of writing efficient applications to run on a processor.
So, we can use on-chip SRAM to implement caches, which hold temporary copies of
instructions and data from main memory. Code and data have the properties of temporal and
spatial locality. This means that programs tend to re-use the same addresses over time (temporal
locality) and tend to use addresses which are near to each other (spatial locality). Code, for
instance, can contain loops, meaning that the same code gets executed repeatedly or a function
can be called multiple times. Data accesses (for example, to the stack) can be limited to small
regions of memory. Even when the memory used over short time periods is not close together, the same data is often re-used. It is this fact, that processor accesses to RAM exhibit such locality and are not truly random, that enables caches to be successful.
Access ordering rules obey the “weakly ordered” model. A read from a location followed by a
write to the same location by the same processor guarantees that the read returns the value that
was at the address before the write occurred.
We will also look at the write buffer. This is a block which decouples writes being done by the
processor (when executing store instructions, for example) from the external memory bus. The
processor places the address, control and data values associated with the store into a set of
FIFOs. This is the write buffer. Like the cache, it sits between the processor core and main
memory. This enables the processor to move on and execute the next instructions without having
to stop and wait for the slow main memory to actually complete the write operation.


7.1	Why do caches help?
Caches speed things up, as we have seen, because program execution is not random. Programs
access the same sets of data repeatedly and execute the same sets of instructions repeatedly. By
moving code or data into faster memory when it is first accessed, following accesses to that code
or data become much faster. The initial access which provided the data to the cache is no faster
than normal. It is the subsequent accesses to the cached values which are faster, and it is from this that the performance increase derives. The processor hardware will check all instruction fetches
and data reads or writes in the cache, although obviously we need to mark some parts of memory
(those containing peripheral devices, for example) as non-cacheable. As the cache holds only a
subset of main memory, we need a way to determine (quickly) whether the address we are
looking for is in the cache.


7.2	Cache drawbacks
It may seem that caches and write buffers are automatically a benefit, as they speed up program
execution. However, they also add some problems which are not present in an uncached
processor. One such drawback is that program execution time can become non-deterministic.
What this means is that, because the cache is small and holds only a subset of main memory, it
fills rapidly as a program executes. When the cache is full, existing code or data is replaced, to
make room for new items. So, at any given time, it is not normally possible to be certain whether
or not a particular instruction or data item is to be found in the cache.
This means that the execution time of a particular piece of code can vary significantly. This can
be something of a problem in hard real-time systems where strongly deterministic behavior is
needed.
Furthermore, as we shall see, we need a way to control how different parts of memory are
accessed by the cache and write buffer. In some cases, we want the processor to read an external
device, such as a peripheral. It would not be sensible to use a cached value of a timer peripheral,
for example. Sometimes we want the processor to stop and wait for a store to complete. So
caches and write buffers give the programmer some extra work to do.
Sometimes, we need to think about the fact that data in the cache and data in external memory
may not be the same. This is referred to as coherency. This can be a particular problem when we
have multiple processors or memory agents like an external DMA. We will consider such
coherency issues in greater detail later in the book.


7.3	Memory hierarchy
A memory hierarchy in computer science refers to a hierarchy of memory types, with
faster/smaller memories closer to the processor and slower/larger memory further away. In most
systems, we can have secondary storage, such as disk drives and primary storage such as flash,
SRAM and DRAM. In embedded systems, we typically further sub-divide this into on-chip and
off-chip memory. Memory which is on the same chip (or at least in the same package) as the
processor will typically be much faster.
A cache can be included at any level in the hierarchy and should improve system performance
where there is an access time difference between parts of the memory system.
In ARM-based systems, we typically have level 1 (L1) caches, which are connected directly to
the processor logic that fetches instructions and handles load and store instructions. These are
Harvard caches (that is, there are separate caches for instructions and for data) in all but the
lowest performing members of the ARM family and effectively appear as part of the processor.
Over the years, the size of L1 caches has increased, due to SRAM size and speed improvements.
At the time of writing, 16KB or 32KB cache sizes are most common, as these are the largest
RAM sizes capable of providing single cycle access at a processor speed of 1GHz or more.
Many ARM systems have, in addition, a level 2 (L2) cache. This is larger than the L1 cache
(typically 256KB, 512KB or 1MB), but slower and unified (holding both instructions and data).
It can be inside the processor itself, as in the Cortex-A8 and Cortex-A15 processors, or be
implemented as an external block, placed between the processor and the rest of the memory
system. The ARM PL310 is an example of such an external L2 cache controller block.
In addition, we have processors which can be implemented in multi-processor clusters in which
each processor has its own cache. Such systems require mechanisms to maintain coherency
between caches, so that when one processor changes a memory location, that change is made
visible to other processors which share that memory. We describe this in more detail when we
look at multi-processing.


7.4	Cache terminology
A brief summary of terms used in the description may be helpful:
•	A line refers to the smallest loadable unit of a cache, a block of contiguous words from main memory.

•	The index is the part of a memory address which determines in which line(s) of the cache the address can be found.

•	A way is a subdivision of a cache, each way being of equal size and indexed in the same fashion. The lines associated with a particular index value, one from each cache way, together form a set.

•	The tag is the part of a memory address stored within the cache which identifies the main memory address associated with a line of data.

Figure 7-1 Cache terminology


7.5	Cache architecture
In a von Neumann architecture, there is a single cache used for instruction and data (a unified
cache). A modified Harvard architecture has separate instruction and data buses and therefore
there are two caches, an instruction cache (I-cache) and a data cache (D-cache). In many ARM
systems, we can have distinct instruction and data level 1 caches backed by a unified level 2
cache.
Let’s consider how a cache memory is constructed. The cache needs to hold an address, some
data and some status information. The address tells the cache where the information came from
in main memory and is known as a tag. The total cache size is a measure of the amount of data
it can hold; the RAMs used to hold tag values are not included in the calculation. The tag does,
however, take up physical space on the silicon. It would be inefficient to hold one word of data
for each tag address, so we typically store a line of data – normally 8 words for the Cortex-A5
and Cortex-A9 processors or 16 words for the Cortex-A8 and Cortex-A15 processors with each
address. This means that the bottom few bits of the address are not required to be stored in the
tag – we need to record the address of a line, not of each byte within the line, so the five or six
least significant bits will always be 0.
Associated with each line of data are one or more status bits. Typically, we will have a valid bit,
which marks the line as containing data that can be used. (This means that the address tag
represents some real value.) We will also have one or more dirty bits which mark whether the
cache line (or part of it) holds data which is not the same as (newer than) the contents of main
memory. We will treat this in more detail later in the chapter.


7.6	Cache controller
The cache controller is a hardware block which has the task of managing the cache memory, in a way which is
(largely) invisible to the program. It automatically writes code or data from main memory into
the cache. It takes read and write memory requests from the processor and performs the
necessary actions to the cache memory and/or the external memory.
When it receives a request from the processor it must check to see whether the requested address
is to be found in the cache. This is known as a cache look-up. It does this by comparing a subset
of the address bits of the request with tag values associated with lines in the cache. If there is a
match (a hit) and the line is marked valid then the read or write will happen using the cache
memory.
If there is no match with the cache tags or the tag is not valid, we have a cache miss and the
request must be passed to the next level of the memory hierarchy – an L2 cache, or external
memory. It can also cause a cache linefill. A cache linefill causes the contents of a piece of main
memory to be copied into the cache. At the same time, the requested data or instructions are
streamed to the processor. This process happens transparently and is not directly visible to the
programmer.
The processor need not wait for the linefill to complete before using the data. The cache
controller will typically access the critical word within the cache line first. For example, if we
perform a load instruction which misses in the cache and triggers a cache linefill, the first read
to external memory will be that of the actual address supplied by the load instruction. This
critical data is supplied to the processor pipeline, while the cache hardware and external bus
interface then read the rest of the cache line, in the background.


7.7	Direct mapped caches
We now look at various different ways of implementing caches used in ARM processors,
starting with the simplest, a direct mapped cache.

Figure 7-2 Direct mapped cache operation

In a direct mapped cache, each location in main memory maps to a single location in the cache.
As main memory is many times larger than the cache, many addresses map to the same cache
location. Figure 7-2 shows a small cache, with four words per line and four lines. This means
that the cache controller will use two bits of the address (bits 3:2) as the offset to select a word
within the line and two bits of the address (bits 5:4) as the index to select one of the four
available lines. The remaining bits of the address (bits 31:6) will be stored as a tag value.
To look up a particular address in the cache, the hardware extracts the index bits from the address and reads the tag value associated with that line in the cache. If the stored tag matches the tag portion of the address and the valid bit indicates that the line contains valid data, it has a hit. It can then extract the data
value from the relevant word of the cache line, using the offset and byte portion of the address.
If the line contains valid data, but does not generate a hit (that is, the tag shows that the cache
holds a different address in main memory) then the cache line is removed and is replaced by data
from the requested address.
It should be clear that all main memory addresses with the same value of bits [5:4] will map to
the same line in the cache. Only one of those lines can be in the cache at any given time. This
means that we can easily get a problem called thrashing. Consider a loop which repeatedly
accesses address 0x40 and 0x80, as in Figure 7-2. When we first read address 0x40, it will not be
in the cache and so a linefill takes place putting the data from 0x40 to 0x4F into the cache. When
we then read address 0x80, it will not be in the cache and so a linefill takes place putting the data
from 0x80 to 0x8F into the cache – and in the process we lose the data from address 0x40 to 0x4F
from the cache. The same thing will happen on each iteration of the loop and our software will


perform poorly. Direct mapped caches are therefore not typically used in the main caches of
ARM processors, but we do see them in some places – for example in the branch target address
cache of the ARM1136 processor.
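The thrashing described above can be demonstrated with a toy cache model. This Python sketch (illustrative only; the round-robin victim choice is an assumption) uses the same geometry as Figure 7-2, four lines of 16 bytes:

```python
class ToyCache:
    """Four 16-byte lines, split into 'ways' ways, with round-robin
    replacement within a set."""
    def __init__(self, ways):
        self.sets = 4 // ways
        self.ways = ways
        self.tags = [[None] * ways for _ in range(self.sets)]
        self.victim = [0] * self.sets
        self.misses = 0

    def access(self, addr):
        index = (addr // 16) % self.sets
        tag = addr // (16 * self.sets)
        if tag in self.tags[index]:
            return                         # hit
        self.misses += 1                   # miss: linefill, evict a victim
        self.tags[index][self.victim[index]] = tag
        self.victim[index] = (self.victim[index] + 1) % self.ways

for ways in (1, 2):
    cache = ToyCache(ways)
    for _ in range(10):                    # loop touching 0x40 then 0x80
        cache.access(0x40)
        cache.access(0x80)
    print(f"{ways}-way: {cache.misses} misses")   # 20 misses, then only 2
```

In the direct mapped (1-way) case every access misses, because 0x40 and 0x80 keep evicting each other; with two ways both lines fit and only the first two accesses miss.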
Processors can have hardware optimizations for situations where the whole cache line is being
written to. This is a condition which can take a significant proportion of total cycle time in some
systems. For example, this can happen when memcpy()- or memset()-like functions which
perform block copies or zero initialization of large blocks are executed. In such cases, there is
no benefit in first reading the data values which will be over-written.
Cache allocate policies act as a hint to the processor; they do not guarantee that a piece of memory will be read into the cache and, as a result, programmers should not rely on this behavior.


7.8	Set associative caches
The main cache(s) of ARM processors are always implemented using a set associative cache.
This significantly reduces the likelihood of the cache thrashing seen with direct mapped caches,
thus improving program execution speed and giving more deterministic execution. It comes at
the cost of increased hardware complexity and a slight increase in power (because multiple tags
are compared on each cycle).
With this kind of cache organization, we divide the cache into a number of equally-sized pieces,
called ways. The index field of the address continues to be used to select a particular line, but
now it points to an individual line in each way. Commonly, there are two or four ways, but some
ARM implementations have used higher numbers (for example, the ARM920T processor has a
64-way cache). Level 2 cache implementations (such as the ARM PL310) can have larger
numbers of ways (higher associativity) due to their much larger size. The cache lines with the
same index value are said to belong to a set. To check for a hit, we must look at each of the tags
in the set.

Figure 7-3 A 2-way set-associative cache

In Figure 7-3, a cache with 2-ways is shown. Data from address 0 (or 0x40, or 0x80) may be
found in line 0 of either (but not both) of the two cache ways.
Increasing the associativity of the cache reduces the probability of thrashing. The ideal case is
a fully associative cache, where any main memory location can map anywhere within the cache.
However, building such a cache is impractical for anything other than very small caches (for
example, those associated with MMU TLBs – see Chapter 8). In practice, performance
improvements are minimal for level 1 caches above 4-way associativity, with 8-way or 16-way
associativity being more useful for larger level 2 caches.


7.9	A real-life example
Before we move on to look at write buffers, let’s consider an example which is more realistic
than those shown in the previous two diagrams. Figure 7-4 is a 4-way set associative 32KB data
cache, with an 8-word cache line length. This kind of cache structure might be found on the
Cortex-A9 or Cortex-A5 processors.
The cache line length is eight words (32 bytes) and we have 4-ways. 32KB divided by 4, divided
by 32 gives us a figure of 256 lines in each way. This means that we need eight bits to index
a line within a way (bits [12:5]). We need to use bits [4:2] of the address to select from the eight
words within the line. The remaining bits [31:13] will be used as a tag.
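The address decomposition for this geometry can be checked with a few shifts and masks (Python, illustrative; the example address is arbitrary):

```python
LINE_BYTES = 32                 # 8 words of 4 bytes
WAYS = 4
CACHE_BYTES = 32 * 1024

lines_per_way = CACHE_BYTES // WAYS // LINE_BYTES
assert lines_per_way == 256     # hence 8 index bits

def decompose(addr):
    """Split an address as this cache does:
    byte [1:0], word [4:2], index [12:5], tag [31:13]."""
    return {
        "byte":  addr & 0x3,
        "word":  (addr >> 2) & 0x7,
        "index": (addr >> 5) & 0xFF,
        "tag":   (addr >> 13) & 0x7FFFF,
    }

fields = decompose(0x80001234)  # arbitrary example address
assert fields["index"] == 0x91 and fields["tag"] == 0x40000
```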

Figure 7-4 A 32KB 4-way set associative cache


7.10	Virtual and physical tags and indexes
This section assumes some knowledge of the address translation process. Readers unfamiliar
with virtual addressing may wish to revisit this section after reading Chapter 8.
In A real-life example on page 7-12, we were a little imprecise about exactly
which address is used to perform cache lookups. Early ARM processors (for example, the
ARM720T or ARM926EJ-S processors) used virtual addresses to provide both the index and
tag values. This has the advantage that the processor can do a cache look-up without the need
for a virtual to physical address translation. The drawback is that changing the virtual to physical
mappings in the system means that the cache must first be cleaned and invalidated and this can
have a significant performance impact. We will go into more detail about these terms in
Invalidating and cleaning cache memory on page 7-20.
ARM11 family processors use a different cache tag scheme. Here, the cache index is still
derived from a virtual address, but the tag is taken from the physical address. The advantage of
a physical tagging scheme is that changes in virtual to physical mappings do not now require
the cache to be invalidated. This can have significant benefits for complex multi-tasking
operating systems which can frequently modify page table mappings. Using a virtual index has
some hardware advantages. It means that the cache hardware can read the tag value from the
appropriate line in each way in parallel without actually performing the virtual to physical
address translation, giving a fast cache response. Such a cache is often described as Virtually
Indexed, Physically Tagged (VIPT). The Cortex-A8 processor uses a VIPT implementation in
its instruction cache, but not its data cache.
However, there is a drawback to a VIPT implementation. For a 4-way set associative 32KB or
64KB cache, bits [12] and/or [13] of the address are needed to select the index. If 4KB pages
are used in the MMU, bits [13:12] of the virtual address may not be equal to bits [13:12] of the
physical address. There is therefore scope for potential cache coherence problems if multiple
virtual address mappings point to the same physical address. This is resolved by placing certain
restrictions on such multiple mappings which kernel page table software must obey. This is
described as a “page coloring” issue and exists on other processor architectures for the same
reasons.
This problem is avoided by using a Physically Indexed, Physically Tagged (PIPT) cache
implementation, shown in Figure 7-4 on page 7-12. The Cortex-A series of processors
described in this book use such a scheme for their data caches. It means that page coloring issues
are avoided, but at the cost of hardware complexity.

ARM DEN0013B
ID082411

Copyright © 2011 ARM. All rights reserved.
Non-Confidential

7-13

Caches

7.11 Cache policies
There are a number of different choices which can be made in cache operation. We need to
consider what causes a line from external memory to be placed into the cache (allocation
policy). We need to look at how the controller decides which line within a set associative cache
to use for the incoming data (replacement policy). And we need to control what happens when
the processor performs a write which hits or misses in the cache (write policy).

7.12 Allocation policy
When the processor does a cache look-up and the address it wants is not in the cache, it must
determine whether or not to perform a cache linefill and copy that address from memory.

• A read allocate policy allocates a cache line only on a read. If a write performed by the processor misses in the cache, the cache is not affected and the write goes to main memory.

• A write allocate policy allocates a cache line for either a read or write which misses in the cache (and so might more accurately be called a read-write cache allocate policy). For both memory reads and memory writes which miss in the cache, a cache linefill is performed. This is typically used in combination with a write-back write policy on current ARM processors, as we shall see in Write policy on page 7-17.

7.13 Replacement policy
When there is a cache miss, the cache controller must select one of the cache lines in the set for
the incoming data. The cache line selected is called the victim. If the victim contains valid, dirty
data, the contents of that line must be written to main memory before new data can be written
to the victim cache line. This is called eviction.
The replacement policy is what controls the victim selection process. The index bits of the
address are used to select the set of cache lines, and the replacement policy selects the specific
cache line from that set which is to be replaced.
• Round-robin or cyclic replacement means that we have a counter (the victim counter) which increments through the available ways and wraps back to 0 when it reaches the maximum number of ways.

• Pseudo-random replacement randomly selects the next cache line in a set to replace. The victim counter is incremented in a pseudo-random fashion and can point to any line in the set.

Most ARM processors support both policies.
A round-robin replacement policy is generally more predictable, but can suffer from poor
performance in certain rare use cases and for this reason, the pseudo-random policy is often
preferred.

7.14 Write policy
When the processor executes a store instruction, a cache lookup on the address(es) to be written
is performed. For a cache hit on a write, there are two choices.

• Write-through. With this policy, writes are performed to both the cache and main memory, so the two are kept coherent. Because there are more writes to main memory, a write-through policy is slower than a write-back policy if the write buffer fills, and is therefore less commonly used in today's systems (although it can be useful for debug).

• Write-back. In this case, writes are performed only to the cache, and not to main memory. This means that cache lines and main memory can contain different data. The cache line holds newer data, and main memory contains older data (said to be stale). To mark these lines, each line of the cache has an associated dirty bit (or bits). When a write happens which updates the cache, but not main memory, the dirty bit is set. If the cache later evicts a cache line whose dirty bit is set (a dirty line), it writes the line out to main memory. Using a write-back cache policy can significantly reduce traffic to slow external memory and therefore improve performance and save power. However, if there are other agents in the system which can access memory at the same time as the processor, we may need to worry about coherency issues. This is described in more detail later.

7.15 Write and Fetch buffers
A write buffer is a hardware block inside the processor (but sometimes in other parts of the
system as well), implemented using a number of FIFOs. It accepts address, data and control
values associated with processor writes to memory. When the processor executes a store
instruction, it may place the relevant details (the location to write to, the data to be written, the
transaction size and so forth) into the buffer. The processor does not have to wait for the write
to be completed to main memory. It can proceed with executing the next instructions. The write
buffer itself will drain the writes accepted from the processor, to the memory system.
A write buffer can increase the performance of the system. It does this by freeing the processor
from having to wait for stores to complete. In effect, provided there is space in the write buffer,
the write buffer is a way to hide latency. If the number of writes is low or well spaced, the write
buffer will not become full. If the processor generates writes faster than they can be drained to
memory, the write buffer will eventually fill and there will be little performance benefit.
There is a potential hazard to be considered when thinking about write buffers. What happens
if a write goes into a write buffer and then we read from that address? This is called a
read-after-write hazard. Different processors handle this problem in different ways. A simple
solution is to stall the processor on a read to the main memory system until all pending writes
have completed. A more sophisticated approach is to snoop into the write buffer and detect the
existence of a potential hazard. Hardware can then resolve the hazard either by stalling the read
until the relevant write has completed, or it can read the value directly from the write buffer.
Some write buffers support write merging (also called write combining). They can take multiple
writes (for example, a stream of writes to adjacent bytes) and merge them into one single burst.
This can reduce the write traffic to external memory and therefore boost performance.
It will be obvious to the experienced programmer that sometimes the behavior of the write
buffer is not what we want. When accessing a peripheral, we might want the processor to stop
and wait for the write to complete before proceeding to the next step. Sometimes we really want
a stream of bytes to be written and we don't want the stores to be combined. In ARM memory
ordering model on page 9-4, we'll look at memory types supported by the ARM architecture and
how to use these to control how the caches and write buffers are used for particular devices or
parts of the memory map.
Similar components, called fetch buffers, can be used for reads in some systems. In particular,
processors typically contain prefetch buffers which read instructions from memory ahead of
them actually being inserted into the pipeline. In general, such buffers are transparent to the
programmer. We will consider some possible hazards associated with this when we look at
memory ordering rules.

7.16 Cache performance and hit rate
The hit rate is defined as the number of cache hits divided by the number of memory requests
made to the cache during a specified time (normally calculated as a percentage). Similarly, the
miss rate is the number of total cache misses divided by the total number of memory requests
made to the cache. One may also calculate the number of hits or misses on reads and/or writes
only.
Clearly, a higher hit rate will generally result in higher performance. It is not really possible to
quote example figures for typical software; the hit rate is very dependent upon the size of the
critical parts of the code or data operated on and, of course, the size of the cache.
There are some simple rules which can be followed to give better performance. The most
obvious of these is to enable caches and write buffers and to use them wherever possible
(typically for all parts of the memory system which contain code and more generally for RAM
and ROM, but not peripherals). Performance will be considerably increased in Cortex-A series
processors if instruction memory is cached.
Placing frequently accessed data together in memory can be helpful. Fetching a data value in
memory involves fetching a whole cache line; if none of the other words in the cache line will
be used, there will be little or no performance gain. Smaller code may cache better than larger
code and this can sometimes give (seemingly) paradoxical results. For example, a piece of C
code may fit entirely within cache when compiled for Thumb (or for the smallest size) but not
when compiled for ARM (or for maximum performance) and as a consequence can actually run
faster than the more “optimized” version. We describe cache considerations in much more detail
in Chapter 17 Optimizing Code to Run on ARM Processors.

7.17 Invalidating and cleaning cache memory
Cleaning and/or invalidation can be required when the contents of external memory have been
changed and the programmer wishes to remove stale data from the cache. It can also be required
after MMU related activity such as changing access permissions, cache policies, or virtual to
physical address mappings.
The word flush is often used in descriptions of clean/invalidate operations. We avoid the term
in this text, as it is not used in a consistent fashion on different microprocessor architectures.
ARM generally uses only the terms clean and invalidate.
• Invalidation of a cache (or cache line) means to clear it of data. This is done by clearing the valid bit of one or more cache lines. The cache always needs to be invalidated after reset. This can be done automatically by hardware, but may need to be done explicitly by the programmer (for example, the Cortex-A9 processor does not do this automatically). If the cache might contain dirty data, it is generally incorrect to invalidate it. Any updated data in the cache from writes to write-back cacheable regions would be lost by simple invalidation.

• Cleaning a cache (or cache line) means to write the contents of dirty cache lines out to main memory and clear the dirty bit(s) in the cache line. This makes the contents of the cache line and main memory coherent with each other. Clearly, this is only applicable for data caches in which a write-back policy is used.

Copying code from one location to another (or other forms of “self-modifying” code) may
require the programmer to clean and/or invalidate the cache. The memory copy code will use
load and store instructions and these will operate on the data side of the processor. If the data
cache is using a write-back policy for the area to which code is written, it will be necessary to
clean that data from the cache before the code can be executed. This ensures that the instructions
stored as data go out into main memory and are then available for the instruction fetch logic. In
addition, if the area to which code is written was previously used for some other program, the
instruction cache could contain stale code (from before main memory was re-written).
Therefore, it may also be necessary to invalidate the instruction cache before branching to the
newly copied code.
The commands to clean and/or invalidate the cache are CP15 operations. They are available
only to privileged code and cannot be executed in User mode. In systems where the TrustZone
Security Extensions are in use, there can be hardware limitations applied to non-secure usage of
some of these operations.
CP15 instructions exist which will clean, invalidate, or clean and invalidate level 1 data or
instruction caches. Invalidation without cleaning is safe only when it is known that the cache
cannot contain dirty data – for example a Harvard instruction cache. The programmer can
perform the operation on the entire cache, or just on individual lines. These individual lines can
be specified either by giving a virtual address to be cleaned and/or invalidated, or by specifying a
line number in a particular set, in cases where the hardware structure is known. The same
operations can be performed on the L2 or outer caches and we will look at this in Level 2 cache
controller on page 7-22.
Of course, these operations will be accessed through kernel code – in Linux, you will use the
__clear_cache() function, which you can find in arch/arm/mm/cache-v7.S. Equivalent functions
exist in other operating systems – Google Android has cacheflush(), for example.

A common situation where cleaning or invalidation can be required is DMA (Direct Memory
Access). When it is required to make changes made by the processor visible to external memory,
so that it can be read by a DMA controller, it might be necessary to clean the cache. When
external memory is written by a DMA controller and it is necessary to make those changes
visible to the processor, the affected addresses will need to be invalidated from the cache.

7.18 Cache lockdown
One of the problems with caches is their unpredictability. It is not generally possible to be
certain that a particular piece of code or data is within the cache. This can be problematic
because sometimes (particularly in systems with hard real-time requirements), variable and
unpredictable execution times cannot be tolerated.
Cache lockdown enables the programmer to place critical code and/or data into cache memory
and protect it from eviction, thus avoiding any cache miss penalty. A typical usage scenario
might be for code and data of a critical interrupt handler. As cache memory used for lockdown
is unavailable for other parts of main memory, use of lockdown effectively reduces the usable
cache size. Note that locked down code or data can still be invalidated, and the locked down area
remains reserved (immune from replacement) even when it no longer holds valid data. This can
make use of lockdown impractical in systems which frequently invalidate the cache.
The smallest lockable unit for the cache is typically quite large; normally one or more cache
ways. In a 4-way set associative cache, for example, lockdown of a single way therefore uses a
quarter of the available cache space.

7.19 Level 2 cache controller
At the start of this chapter, we briefly described the partitioning of the memory system and
explained how many systems have a multi-layer cache hierarchy. The Cortex-A5 and Cortex-A9
processors do not have an integrated level 2 cache. Instead, the system designer can opt to
connect the ARM L2 cache controller (PL310) outside of the processor or MPCore instance.
This cache controller can support a cache of up to 8MB in size, with a set associativity of between
four and sixteen ways. The size and associativity are fixed by the SoC designer. The level 2
cache can be shared between multiple processors (or indeed between the processor and other
agents, for example a graphics processor). It is possible to lockdown cache data on a per-master
per-way basis, enabling management of cache sharing between multiple components.

7.19.1 Level 2 cache maintenance
We saw in Virtual and physical tags and indexes on page 7-13 how the programmer may need
the ability to clean and/or invalidate some or all of a cache. This can be done by writing to
memory-mapped registers within the L2 cache controller in the case where the cache is external
to the processor (as with the Cortex-A5 and Cortex-A9 processors), or through CP15, where the
level 2 cache is implemented inside the processor (as with the Cortex-A8 processor). Where
such operations are performed by having the processor perform memory-mapped writes, the
processor needs a way of determining when the operation is complete. It does this by polling a
further memory-mapped register within the L2 cache.
The PL310 Level 2 cache controller operates only on physical addresses. Therefore, to perform
cache maintenance operations, it may be necessary for the program to perform a virtual to
physical address translation. The PL310 provides a “cache sync” operation which forces the
system to wait for pending operations to complete.

7.20 Point of coherency and unification
For set/way based clean or invalidate operations, the point to which the operation applies is
essentially the next level of cache. For operations which use a virtual address, the architecture
defines two conceptual points.
• Point of Coherency (PoC). For a particular address, the PoC is the point at which all blocks (for example, processors, DSPs, or DMA engines) which can access memory are guaranteed to see the same copy of a memory location. Typically, this will be the main external system memory.

• Point of Unification (PoU). The PoU for a processor is the point at which the instruction and data caches and the page table walks of the processor are guaranteed to see the same copy of a memory location. For example, a unified level 2 cache would be the point of unification in a system with Harvard level 1 caches and a TLB to cache page table entries. Readers unfamiliar with the terms page table walk or TLB will find these described in Chapter 8. If no external cache is present, main memory would be the PoU.

In the case of an MPCore (an example of an Inner Shareable shareability domain, using the
terminology from Chapter 9 Memory Ordering), the PoU is where instruction and data caches
and page table walks of all the processors within the MPCore cluster are guaranteed to see the
same copy of a memory location.
Knowledge of the PoU enables self-modifying code to ensure future instruction fetches are
correctly made from the modified version of the code. It can do this by using a two-stage
process:
• clean the relevant data cache entries (by address)
• invalidate instruction cache entries (by address).
In addition, the use of memory barriers will be required here; the ARM Architecture Reference
Manual provides detailed examples of the necessary code sequences.
Similarly, we can use the clean data cache entry and invalidate TLB operations to ensure that all
writes to the page tables are visible to the MMU.
7.20.1 Exclusive cache mode
The Cortex-A series processors can be connected to a level 2 cache controller which supports
an exclusive cache mode. This makes the processor data cache and the L2 cache exclusive. At
any specific time, an address can be cached in either an L1 data cache or in the L2 cache, but not
both. This increases the usable space and efficiency of the L2 cache connected to the processor.
In practice, exclusive mode is not widely used. It can be difficult to correctly invalidate lines in
multi-processor systems, for example.

7.21 Parity and ECC in caches
So-called soft errors are increasingly a concern in today’s systems. Smaller transistor
geometries and lower voltages give circuits an increased sensitivity to perturbation by cosmic
rays and other background radiation, alpha particles from silicon packages or from electrical
noise. This is particularly true for memory devices which rely on storing small amounts of
charge and which also occupy large proportions of total silicon area. In some systems,
mean-time-between-failure could be measured in seconds if appropriate protection against soft
errors was not employed.
The ARM architecture provides support for parity and Error Correcting Code (ECC) in the
caches (and tightly coupled memories). Parity means that we have an additional bit which marks
whether the number of bits with the value one is even (or odd, depending upon the scheme
chosen). This provides a simple check against single bit errors. An ECC scheme enables
detection of multiple bit failures and possible recovery from soft errors, but recovery
calculations can take several cycles. Implementing a processor which is tolerant of level 1 cache
RAM accesses taking multiple clock cycles significantly complicates the processor design.
ECC is therefore more commonly used only on blocks of memory (for example, the Level 2
cache), outside the processor.
Parity is checked on reads and writes, and can be implemented on both tag and data RAMs.
Parity mismatch generates a prefetch or data abort exception, and the fault status/address
registers are updated appropriately.

7.22 Tightly coupled memory
An alternative to using cache lockdown is the so-called Tightly Coupled Memory (TCM). TCM
is a block of fast memory located next to the processor core. It appears as part of the memory
system within the address map and gives similar response times to a cache. It can be used to hold
instructions or data required for real-time code which needs deterministic behavior. It can be
regarded as fast scratchpad memory local to the processor. The contents of TCM are not
initialized by the processor. It is the responsibility of the programmer to copy required code and
data into the tightly coupled memory before enabling use of the TCM. The programmer may
also have to tell the processor at which physical address to locate the TCMs within the memory
map.
As TCM is more suited to systems running simpler operating systems and where determinism
is a key requirement, it is not supported in Cortex-A series processors and so is not described
further here.


Chapter 8
Memory Management Unit

This chapter describes the main concepts of virtual memory systems and memory management. An
important function of a Memory Management Unit (MMU) is to allow us to manage tasks as
independent programs running in their own private virtual memory space. A key feature of such a
virtual memory system is address relocation, which is the translation of the (virtual) address issued
by the processor core to a different (physical) address in main memory. The translation is done by
the MMU hardware and is transparent to the application (ignoring for the moment any performance
issues).
In multi-tasking embedded systems, we typically need a way to partition the memory map and
assign permissions and memory attributes to these regions of memory. In more advanced systems,
running more complex operating systems, like Linux, we need even greater control over the
memory system. More advanced operating systems typically need to use a hardware-based MMU.
The MMU enables tasks or applications to be written in a way which requires them to have no
knowledge of the physical memory map of the system, or about other programs which may be
running at the same time. This makes programming of applications much simpler, as it enables us
to use the same virtual memory address space for each. This virtual address space is separate from
the actual physical map of memory in the system. The MMU translates addresses of code and data
from the virtual view of memory to the physical addresses in the real system. Applications are
written, compiled and linked to run in the virtual memory space. Virtual addresses are those used
by the programmer, compiler and linker when placing code in memory. Physical addresses are
those used by the actual hardware system. It is the responsibility of the operating system to program
the MMU to translate between these two views of memory. Figure 8-1 on page 8-2 shows an
example system, illustrating the virtual and physical views of memory.

[Figure: an example virtual memory map (vectors page, kernel, heap, dynamic libraries, data,
ZI data, code) translated by the MMU onto a physical memory map containing peripherals,
ROM and RAM]

Figure 8-1 Virtual and physical memory

The ARM MMU carries out this translation of virtual addresses into physical addresses. In
addition, it controls memory access permissions and memory ordering, cache policies etc. for
each region of memory. When the MMU is disabled, all virtual addresses map one-to-one to the
same physical address (a “flat mapping”). If the MMU cannot translate an address, it generates
an abort exception on the processor and provides information to the processor about what the
problem was.

8.1 Virtual memory
The MMU enables multiple programs to be run simultaneously at the same virtual address while
being actually stored at different physical addresses. In fact, the MMU allows us to build
systems with multiple virtual address maps. Each task can have its own virtual memory map.
The OS kernel places code and data for each application in physical memory, but the application
itself does not need to know where that is.
The key feature of the MMU hardware is address translation. It does this using page tables.
These contain a series of entries, each of which describes the physical address translation for
part of the memory map. Page table entries are organized by virtual address. In addition to
describing the translation of that virtual page to a physical page, they also provide access
permissions and memory attributes for that page.
Note
In the ARM architecture, the concept referred to in generic computer terminology as page tables
has a more specific meaning. The ARM architecture uses multilevel page tables, and defines
translation tables as a generic term for all of these. It then reserves the use of the term page
tables to describe second level translation tables when using the short-descriptor format (not
making use of LPAE).
As this book is an introduction to the ARM architecture for people who may be unfamiliar with
it, this document will use the generic terminology wherever possible.
Addresses generated by the processor core are virtual addresses. The MMU essentially replaces
the most significant bits of this virtual address with some other value, to generate the physical
address (effectively defining a base address of a piece of memory). The lower bits are the same
in both addresses (effectively defining an offset in physical memory from that base address).
The portion of memory which is represented by these lower bits is known as a page. The ARM
MMU supports a multi-level page table architecture with two levels of page table: level 1 (L1)
and level 2 (L2). We will describe the meaning of level 1 and level 2 in a moment. A single set
of page tables is used to give the translations and memory attributes which apply to instruction
fetches and to data reads or writes. The process in which the MMU accesses page tables to
translate addresses is known as page table walking.

8.2 Level 1 page tables
Let’s take a look at the process by which a virtual address is translated to a physical address
using level 1 page table entries on an ARM processor. The first step is to locate the page table
entry associated with the virtual address.
There is usually a single level 1 page table (sometimes called a master page table). It can contain
two basic types of page table entry. The L1 page table divides the full 4GB address space into
4096 equally sized 1MB sections. The L1 page table therefore contains 4096 entries, each entry
being word sized. Each entry can either hold a pointer to the base address of a level 2 page table
or a page table entry for translating a 1MB section. If the page table entry translates a 1MB
section, it gives the base address of the 1MB section in physical memory.
To locate the relevant entry in the page table, we take the top 12 bits of the virtual address and
use those to index to one of the 4096 words within the page table. This is illustrated in
Figure 8-2.
The base address of the L1 page table is known as the Translation Table Base Address and is
held within a register in CP15 c2. It must be aligned to a 16KB boundary.

[Figure: bits [31:14] of the Translation Table Base Address are combined with bits [31:20] of
the virtual address (placed at bits [13:2], with bits [1:0] zero) to form the First Level
Descriptor Address]

Figure 8-2 Finding the address of the level 1 page table entry

To take a simple example, we place our L1 page table at address 0x12300000. The processor core
issues virtual address 0x00100000. The top 12 bits [31:20] define which 1MB of virtual address
space is being accessed. In this case 0x001, so we need to read table entry [1]. Each entry is one
word (4 bytes). To get the offset into the table we must multiply the entry number by entry size:
0x001 * 4 = Address offset of 0x004

The address of the entry is 0x12300000 + 0x004 = 0x12300004.
So, upon receiving this virtual address from the processor, the MMU will read the word from
address 0x12300004. That word is an L1 page table entry. Figure 8-3 on page 8-5 shows the
format of an L1 page table entry.

An L1 page table entry can be one of four possible types:
• A fault entry that generates an abort exception. This can be either a prefetch or data abort, depending on the type of memory access. This effectively indicates virtual addresses which are unmapped.
• A 1MB section translation entry.
• An entry that points to an L2 page table. This enables a 1MB piece of memory to be further sub-divided into smaller pages.
• A 16MB supersection. This is a special kind of 1MB section entry, which requires 16 entries in the page table.

The least significant two bits [1:0] in the entry define which one of these the entry contains (with
bit [18] being used to differentiate between a section and supersection).

[Figure: bits [1:0] of the entry select its type – fault (00, remaining bits ignored), page table
(01, giving the level 2 descriptor base address plus P bit and Domain field), section (10 with
bit [18] clear, giving the section base address plus nG, S, APX, TEX, AP, P, Domain, XN, C
and B fields), supersection (10 with bit [18] set, giving the supersection base address plus the
same attribute fields), and reserved (11)]

Figure 8-3 Level 1 page table entry format

It can be seen that the page table entry for a section (or supersection) contains the physical base
address used to translate the virtual address. You can also observe that many other bits are given
in the page table entry, including the Access Permissions (AP) and Cacheable (C) or Bufferable
(B) types, which we will examine in Memory attributes on page 8-12. This is all of the
information required to access the corresponding physical address and in these cases, the MMU
does not need to look beyond the L1 table.
Figure 8-4 on page 8-6 summarizes the translation process for an address translated by a section
entry in the L1 page table.

[Figure: the Translation Table Base Address and virtual address bits [31:20] form the First
Level Descriptor Address, which is used to read a section entry from the level 1 table; the
Section Base Address from that entry supplies bits [31:20] of the physical address, while bits
[19:0] come directly from the virtual address]

Figure 8-4 Generating a physical address from a level 1 page table entry

In a page table entry for a 1MB section of memory, the upper 12 bits of the page table entry
replace the upper 12 bits of the virtual address when generating the physical address, as
Figure 8-3 on page 8-5 shows.
A supersection is a 16MB piece of memory, which must have both its virtual and physical base
address aligned to a 16MB boundary. As L1 page table entries each describe 1MB, we need 16
consecutive, identical entries within the table to mark a supersection. In Choice of page sizes on
page 8-11, we describe why supersections can be useful.


8.3	Level 2 page tables
An L2 page table has 256 word-sized entries, requires 1KB of memory space and must be
aligned to a 1KB boundary. Each entry translates a 4KB block of virtual memory to a 4KB block
in physical memory. A page table entry can give the base address of either a 4KB or 64KB page.
There are three types of entry used in L2 page tables, identified by the value in the two least
significant bits of the entry:
•	A large page entry points to a 64KB page.
•	A small page entry points to a 4KB page.
•	A fault page entry generates an abort exception if accessed.
Figure 8-5 shows the format of L2 page table entries.

[Bit-field diagram not reproduced: fault, large page and small page descriptor layouts, including the base address fields and the nG, S, APX, TEX, AP, C, B and XN attribute bits.]

Figure 8-5 Format of a level 2 page table entry

As with the L1 page table entry, a physical address is given, along with other information about
the page. Type extension (TEX), Shareable (S), and Access Permission (AP, APX) bits are used
to specify the attributes necessary for the ARMv7 memory model. The C and B bits control
(along with TEX) the cache policies for the memory governed by the page table entry. The nG
bit defines the page as being global (applies to all processes) or non-global (used by a specific
process). We will describe all of these bits in more detail in Memory attributes on page 8-12.

[Diagram not reproduced: the coarse page table base address from the L1 entry is combined with VA bits [19:12] to give the address of the L2 descriptor.]

Figure 8-6 Generating the address of the level 2 page table entry


In Figure 8-6 on page 8-7 we see how the address of the L2 page table entry that we need is
calculated by taking the (1KB aligned) base address of the level 2 page table (given by the level
1 page table entry) and using 8 bits of the virtual address (bits [19:12]) to index within the 256
entries in the L2 page table.
Figure 8-7 summarizes the address translation process when using two layers of page tables.
Bits [31:20] of the virtual address are used to index into the 4096-entry L1 page table, whose
base address is given by the CP15 TTB register. The L1 page table entry points to an L2 page
table, which contains 256 entries. Bits [19:12] of the virtual address are used to select one of
those entries, which then gives the base address of the page. The final physical address is
generated by combining that base address with the remaining bits of the virtual address.

[Diagram not reproduced: VA bits [31:20] index the L1 table at the translation table base; the resulting L2 table base address is indexed by VA bits [19:12]; the small page base address from the L2 descriptor is combined with VA bits [11:0] to give the physical address.]

Figure 8-7 Summary of generation of physical address using the L2 page table entry


8.4	The Translation Lookaside Buffer
We have seen that a single memory request from the processor core can result in a total of three
memory requests to external memory – one for the L1 page table walk, a second access
for the L2 page table walk and finally the original request from the processor. This would seem
ruinous for system performance. Fortunately, page table walks are
relatively uncommon events in the majority of systems, due to another part of the MMU.
The Translation Lookaside Buffer (TLB) is a cache of page translations within the MMU. On a
memory access, the MMU first checks whether the translation is cached in the TLB. If the
requested translation is available, we have a TLB hit and the TLB provides the physical address
immediately. If the TLB does not have a valid translation for that address, we have a TLB miss
and an external page table walk is required. The newly loaded translation can then be cached in
the TLB for possible reuse.
The exact structure of the TLB differs between implementations of the ARM processors. What
follows is a description of a typical system, but individual implementations may vary from this.
There are one or more micro-TLBs, which are situated close to the instruction and data caches.
Addresses with entries which hit in the micro-TLB require no additional memory look-up and
no cycle penalty. However, the micro-TLB has only a small number of mappings (typically eight
on the instruction side and eight on the data side). This is backed by a larger main TLB (typically
64 entries), but there may be some penalty associated with accesses which miss in the
micro-TLB but which hit in the main TLB. Figure 8-8 shows how each TLB entry contains
physical and virtual addresses, but also attributes (such as memory type, cache policies and
access permissions) and potentially an ASID value (described in Address Space ID on
page 8-15).
The TLB is like other caches and so has a TLB line replacement policy and victim pointer, but
this is effectively transparent to the programmer (although see the next section on maintaining
TLB coherency). If the page table entry is a valid one, the virtual address, physical address and
other attributes for the whole page or section are stored as a TLB entry. If the page table entry
is not valid, the TLB will not be updated. The ARM architecture requires that only valid page
table descriptors are cached within the TLB.

[Diagram not reproduced: each TLB entry holds a VA tag, an ASID and the descriptor (attributes and PA), compared against the incoming VA and current ASID.]

Figure 8-8 Illustration of TLB structure


8.5	TLB coherency
When the operating system changes page table entries, it is possible that the TLB could contain
stale translation information. The OS should therefore take steps to invalidate TLB entries.
There are several CP15 operations available, which allow a global invalidate of the TLB or
removal of specific entries. As speculative instruction fetches and data reads may cause page
table walks, it is essential to invalidate the TLB when a valid page table entry is changed. Invalid
page table entries cannot be cached in the TLB, and so can be changed without invalidation.
The Linux kernel has a number of functions which use these CP15 operations, including
flush_tlb_all() and flush_tlb_range(). Such functions are not typically required by device

driver or application code.
Many processors provide support for locking individual entries into the TLB. This can be useful
in systems which require a deterministic response time for a particular piece of code (for
example, an interrupt service routine), where cycle counts must not be affected by occasional
page table walks. Locking is less commonly used in v7-A architecture profile devices, where
this kind of deterministic behavior is not typically required.


8.6	Choice of page sizes
The choice of page size is essentially controlled by the operating system, but it is worth being
aware of the considerations involved. Smaller page sizes allow finer-grained control of a
block of memory and can reduce the amount of unused memory in a page. If a task
needs 7KB of data space, there is less unused space if it is allocated two 4KB pages rather than
a 64KB page or a 1MB section. Smaller page sizes also allow more precise control
over permissions, cache properties and so forth.
However, with increased page sizes, each entry in the TLB holds a reference to a larger piece of
memory. It is therefore more likely that a TLB hit will occur on any access and so there will be
fewer page table walks to slow external memory. For this reason, 16MB supersections can be
used with large pieces of memory which do not require detailed mapping. In addition, each L2
page table requires 1KB of memory. In some systems, memory usage by page tables may be
important.


8.7	Memory attributes
We have seen how page table entries allow the MMU hardware to translate virtual to physical
addresses. However, they also specify a number of attributes associated with each page,
including access permissions, memory type and cache policies.

8.7.1	Memory Access Permissions
The Access Permission (AP and APX) bits in the page table entry give the access permission
for a page. See Table 8-1.
An access which does not have the necessary permission (or which faults) will be aborted. On
a data access, this will result in a precise data abort exception. On an instruction fetch, the access
will be marked as aborted and if the instruction is not subsequently flushed before execution, a
prefetch abort exception will be taken. Faults generated by an external access will not, in
general, be precise.
Information about the address of the faulting location and the reason for the fault is stored in
CP15 (the fault address and fault status registers). The abort handler can then take appropriate
action – for example, modifying page tables to remedy the problem and then returning to the
application to re-try the access. Alternatively, the application which generated the abort may
have a problem and need to be terminated. In Linux use of page tables on page 8-18, we shall
see how Linux uses faults and aborts as a mechanism to manage use of memory by applications.
Table 8-1 Summary of Access Permission encodings

APX  AP  Privileged  Unprivileged  Description
0    00  No access   No access     Permission fault
0    01  Read/Write  No access     Privileged access only
0    10  Read/Write  Read          No user-mode write
0    11  Read/Write  Read/Write    Full access
1    00  -           -             Reserved
1    01  Read        No access     Privileged read only
1    10  Read        Read          Read only
1    11  -           -             Reserved

In addition, there are two bits in the CP15 control register that act to override the access
permissions from the page table entry. These are the System (S) bit and the ROM (R) bit, and
their use is deprecated in ARMv6 and later versions of the ARM architecture. Setting the S bit
changes all pages with “no access” permission to allow read access for privileged modes. Setting
the R bit changes all pages with “no access” permission to allow read access. These bits can be
used to provide access to large blocks of memory without the need to change many page table entries.
8.7.2	Memory attributes
ARM architecture versions 4 and 5 enabled the programmer to specify the memory access
behavior of pages by configuring whether the cache and write buffer could be used for that
location. This simple scheme is inadequate for today’s more complex systems and processors,
where we can have multiple levels of caches, hardware managed coherency between multiple

processors sharing memory and/or processors which can speculatively fetch both instructions
and data. The new memory attributes added to the ARM architecture in ARMv6 and extended
in the ARMv7 architecture are designed to meet these needs.
Table 8-2 shows how the TEX, C and B bits within the page table entry are used to set the
memory attributes of a page and also the cache policies to be used. The meaning of memory
attributes is described in Chapter 9, while the cache policies were described in Chapter 7.
The final entry within the table needs further explanation. For normal cacheable memory, the
two least significant bits of the TEX field are used to provide the outer cache policy (perhaps
for level 2 or level 3 caches) while the C and B bits give the inner cache policy (for level 1 and
any other cache which is to be treated as inner cache). This enables us to specify different cache
policies for both the inner and outer cache. For the Cortex-A15 and Cortex-A8 processors, inner
cache properties set by the page table entry apply to both L1 and L2 caches. On some older
processors, outer cache may support write allocate, while the L1 cache may not. Such
processors should still behave correctly when running code which requests this cache policy, of
course.
Table 8-2 Memory type and cacheable properties encoding in page table entry

TEX  C  B  Description                                          Memory Type
000  0  0  Strongly-ordered                                     Strongly-ordered
000  0  1  Shareable device                                     Device
000  1  0  Outer and Inner write-through, no allocate on write  Normal
000  1  1  Outer and Inner write-back, no allocate on write     Normal
001  0  0  Outer and Inner non-cacheable                        Normal
001  -  -  Reserved                                             -
010  0  0  Non-shareable device                                 Device
010  -  -  Reserved                                             -
011  -  -  Reserved                                             -
1XX  Y  Y  Cached memory (XX = outer policy, YY = inner policy) Normal

It is clear that five bits to define memory types gives 32 possibilities. It is unlikely that a kernel
will wish to use all 32 possibilities. In practice only a handful will actually be needed. Therefore
the hardware provides CP15 registers (the “primary region remap registers”) which allow us to
remap the interpretation of TEX, C and B bits, so that TEX[0], C and B are used to describe
memory types. This means that up to eight different mappings from Table 8-2 can be used and
a particular value of TEX[0], C and B associated with the mapping. In this mode of operation,
the TEX[2:1] bits are freed up and can be used in an OS defined way. In Emulation of dirty and
accessed bits on page 8-18, we will see why an OS might wish to have “spare” bits within a page
table entry that it is able to modify freely. This mode is selected through the TRE bit (TEX
Remap) in the CP15 control register.
8.7.3	Domains
The ARM architecture has an unusual feature which enables regions of memory to be tagged
with a domain ID. There are 16 domain IDs provided by the hardware, and CP15 c3 contains the
Domain Access Control Register (DACR), which holds a set of 2-bit permissions for each
domain number. This enables each domain to be marked as no-access, manager mode or client
mode. No-access causes an abort on any access to a page in this domain, irrespective of page
permissions. Manager mode ignores all page permissions and enables full access. Client mode
uses the permissions of the pages tagged with the domain.
The use of domains is deprecated in the ARMv7 architecture, but in order for access permissions
to be enforced, it is still necessary to assign a domain number to a section and to ensure that the
permission bits for that domain are set to client.


8.8	Multi-tasking and OS usage of page tables
In most systems using Cortex-A series processors, we will have a number of applications or
tasks running concurrently. Each task will have its own unique page tables residing in physical
memory. Typically, much of the memory system is organized so that the virtual-to-physical
address mapping is fixed, with page table entries that never change. This typically is used to
contain operating system code and data, and also the page tables used by individual tasks.
Whenever an application is started, the operating system will allocate it a set of page table
entries which map both the actual code and data used by the application to physical memory. If
the application needs to map in code or extra data space (for example, through a malloc() call),
the kernel can subsequently modify these tables. When a task completes and the application is
no longer running, the kernel can remove any associated page table entries and re-use the space
for a new application. In this way, multiple tasks can be resident in physical memory. Upon a
task switch, the kernel switches to the page table entries of the next thread to be run. In
addition, the dormant tasks are completely protected from the running task; the
MMU does not allow the running task to access the code or data of the kernel or of other user
privilege tasks.
When the page table entries are changed, an access by code to a particular virtual address can
now translate to a different location in physical memory. This can give rise to several possible
problems. The ARM architecture has some key features to mitigate the performance
impact of these changes.
Older ARM processors (from the ARM7 and ARM9 family) have cache tags which store virtual
addresses. When page table mappings are changed, the caches can contain invalid data from the
old page table mapping. To ensure memory coherency, the caches would need to be cleaned and
invalidated. This can have a significant performance impact, as often instructions and data from
a location which has just been invalidated would then need to be re-fetched from external
memory. However, all Cortex-A series processors use physically tagged caches. This means that
no coherency problems are created by changing page table entries.
In addition, the TLB may also have cached old page table entries and these will need to be
invalidated. We will describe ways in which this can be avoided in Linux use of page tables on
page 8-18.

8.8.1	Address Space ID
When we described the page table bits in Level 2 page tables on page 8-7 we noted a bit called
nG (non-global). If the nG bit is set for a particular page, it means that the page is associated
with a specific application and is not global. This means that when the MMU performs a
translation, it uses both the virtual address and an ASID value.
The ASID is a number assigned by the OS to each individual task. This value is in the range
0-255 and the value for the current task is written in the ASID register (accessed via CP15 c13).
When a page table walk occurs and the TLB is updated and the entry is marked as non-global,
the ASID value will be stored in the TLB entry in addition to the normal translation information.
Subsequent TLB look-ups will only match on that entry if the current ASID matches with the
ASID that is stored in the entry. This means that we can have multiple valid TLB entries for a
particular page (marked as non-global), but with different ASID values. This significantly
reduces the software overhead of context switches, as it avoids the need to flush the on-chip
TLBs. The ASID forms part of a larger (32-bit) process ID register that can be used in task aware
debugging.
Figure 8-9 on page 8-16 illustrates this. Here, we have multiple applications (A, B and C), each
of which is linked to run from virtual address 0. Each application is located in a separate address
space in physical memory. There is an ASID value associated with each application, and this
means that we can have multiple entries within the TLB at any particular time which are
valid for virtual address 0. Only the entry which matches the current ASID will be used for
translation, if the mapping is marked as non-global.

[Diagram not reproduced: tasks A, B and C are each linked at virtual address 0 but occupy separate regions of physical memory; their TLB entries for VA 0 are distinguished by ASID values, alongside global entries.]

Figure 8-9 ASIDs in TLB mapping same virtual address

8.8.2	Page table Base Registers 0 and 1
A further potential difficulty associated with managing multiple applications with their
individual page tables is that there may need to be multiple copies of the L1 page table, one for
each application. Each of these will be 16KB in size. Most of the entries will be identical in each
of the tables, as typically only one region of memory will be task-specific, with the kernel space
being unchanged in each case. Furthermore, if there is a need to modify a global page table
entry, the change will be needed in each of the tables.


To help reduce the effect of these problems, a second page table base register can be used. CP15
contains two page table base registers, TTBR0 and TTBR1. A control register (the TTB Control
Register) is used to program a value in the range 0 to 7. This value (denoted by N) tells the MMU
how many of the upper bits of the virtual address it should check to determine which of the two
TTB registers to use.
When N is 0 (the default), all virtual addresses are mapped using TTBR0. With N in the range
1-7, the hardware looks at the most significant bits of the virtual address. If the N most
significant bits are all zero, TTBR0 is used, otherwise TTBR1 is used.
For example, if N is set to 7, any address in the bottom 32MB of memory will use TTBR0 and
the rest of memory will use TTBR1. This means that the application-specific page table pointed
to by TTBR0 will contain only 32 entries (128 bytes). The global mappings are in the table
pointed to by TTBR1 and only one table needs to be maintained.
When these features are used, a context switch will typically require the operating system to
change the TTBR0 and ASID values, using CP15 instructions. However, as these are two
separate, non-atomic operations, some care is needed to avoid problems associated with
speculative accesses occurring using the new value of one register together with the older value
of the other. OS programmers making use of these features should become familiar with the
sequences recommended for this purpose in the ARM Architecture Reference Manual.
8.8.3	The Fast Context Switch Extension
The Fast Context Switch Extension (FCSE) is a deprecated feature which was added to the
ARMv4 architecture. It enabled multiple independent tasks to run in a fixed, overlapping area
at the bottom of the virtual memory space without the need to clean the cache or TLB on a
context switch. It does this by modifying virtual addresses by substituting a process ID value
into the top seven bits of the virtual address (but only if that address lies within the bottom
32MB of memory). Some ARM documentation distinguishes Modified Virtual Addresses
(MVA) from Virtual Addresses (VA). This distinction is useful only when the FCSE is used.
Since it is deprecated, we will not discuss the FCSE any further.


8.9	Linux use of page tables
We will start with a brief look at how Linux uses page tables and then move on to how this maps
to the ARM architecture implementation of page tables. The reader may find it helpful to refer
to the source code of the Linux kernel, in particular the ARM specific code associated with page
tables which can be found in the following file:
arch/arm/include/asm/pgtable.h

Some parts of the following description rely on a basic understanding of the Linux kernel,
exception handling and/or the memory ordering concepts explained in Chapter 9. The first-time
reader may therefore wish to skip over this section.
8.9.1	Levels of page tables in Linux
Linux uses a three level page table structure, but this can be implemented in a way which uses
only two levels of page tables – Page Global Directory (PGD) and Page Table Entry (PTE). The
intermediate third level Page Middle Directory (PMD) is defined as being of size 1 and folds
back into the PGD.
The ARCH include file /asm-arm/page.h contains the definition:
#define __pmd(x) ((pmd_t) { (x) } )

This tells the kernel that PMD has just one entry, effectively bypassing it.
8.9.2	Emulation of dirty and accessed bits
Linux makes use of a bit associated with each page which marks whether the page is dirty (the
page has been modified by the application and may therefore need to be written from memory
to disk when the page is no longer needed).
This is not directly supported in hardware by ARM processors, and must therefore be
implemented by kernel software. When a page is first created, it is marked as read-only. The first
write to such a page (a clean page) will cause an MMU permission fault and the kernel data abort
handler will be called. The Linux memory management code will mark the page as dirty, if the
page should indeed be writable, using the function handle_pte_fault(). The page table entry is
modified to allow both reads and writes, and the corresponding TLB entry is invalidated to
ensure that the MMU now uses the newly modified entry. The abort handler then returns to re-try
the faulting access in the application.
A similar technique is used to emulate the accessed bit, which shows when a page has been
accessed. Read and Write accesses to the page generate an abort. The accessed bit is then set
within the handler, the access permissions are changed and we then return. This is all done
transparently to the application which actually accessed the memory. The kernel makes use of
these bits when swapping pages in and out of memory – it is preferred to swap out pages which
have not been used recently and of course we have to ensure that pages which have been written
to have their new contents copied to the backing store.
On older processors, it was necessary for Linux to maintain two hardware PTE tables – one that
was actually used by the processor and another one holding Linux state information. The
extensions to the TEX field mean that there are spare bits within the page table entry, available
for use by the operating system, potentially removing the need for two tables. A detailed
description of this is not useful to most programmers. Essentially, the three TEX bits plus the C
and B bits give five bits in each PTE to describe the memory type, enabling 32 possible types of
memory. Most operating systems need only five or six different types. Therefore we can use a
feature called TEX Remap which enables us to specify which settings we will actually use and
encode those within three bits of the PTE, leaving two bits free.


8.9.3	Kernel MMU initialization
Linux creates an initial set of page tables when it boots. The kernel execution starts in
arch/arm/kernel/head.S with the MMU and both caches disabled. After determining the type of
processor it is running on, the function __create_page_tables() is called. This sets up an initial
set of MMU tables used only during this initialization. It finds the physical address of RAM in
the system and determines where to place the initial page table. This is immediately below the
kernel entry point. (Recall that an L1 page table is 16KB, so must start 16KB below.) Mappings
are created using 1MB sections, with no L2 page tables. After these initial page tables are set
up, we then branch to a function which invalidates the caches and TLB. The next step is to
enable the MMU. When the MMU is enabled, we are now working with virtual rather than
physical addresses. The code in __enable_mmu() enables the caches, branch predictors and then
the MMU. It is important that this code is in a location which has identical virtual and physical
addresses. If the virtual address and physical address of the section which holds this code are
not identical, the behavior can be unpredictable.
These page table mappings will allow the position dependent code of kernel startup to run.
These mappings will later be overwritten (albeit with identical entries) by the function
paging_init(). This is called from start_kernel() via setup_arch() and creates the master L1
page table. The mdesc->map_io() function creates these mappings and must be modified when
porting Linux to a new system. The master L1 page table (init_mm) also maps the system RAM.
The virtual address range PAGE_OFFSET to (PAGE_OFFSET + RAMsize) is mapped to the
physical address range of the RAM in the system.

8.9.4	Memory allocation (vmalloc)
Applications running under Linux may need to make system calls for memory allocation.
kmalloc() allocates memory which is contiguous in physical memory, while vmalloc() allocates
memory which is contiguous only in virtual memory space.
When a process calls vmalloc(), a range of virtual addresses is allocated and this can cause a
change to the L1 page table. Processes which were created before this can have L1 tables which
do not contain the newly allocated region. The master L1 page table is updated, but the L1
tables of all existing processes are not modified at the same time. Instead,
when a process whose page table does not contain the new mapping tries to access the vmalloc
region, an abort occurs and the kernel handler copies the new mapping from the master table
into the table for that process.

8.9.5	Protecting the bottom of memory
Linux enables us to define the lowest virtual address for a user space mapping. This enables the
exception vector table (if it is placed at address 0) to be protected. It also permits a safe test for
null pointers. If we try to load data from address 0, we know something has gone wrong!
Typically, the first 4KB page of memory is protected.
#define FIRST_USER_ADDRESS PAGE_SIZE

8.9.6	Linux use of page table entries and ASIDs
Linux will typically make use only of 4KB pages. 1MB sections can be used for linear kernel
mapping and I/O regions. A total of six different page types are used.

•	Pages which contain code (applications or shareable libraries) will, by default, be marked
as read-only and can be mapped by multiple applications.

•	Data pages are writable but can have the XN bit set, to trap erroneous execution of data.

•	User mmap() files (and pages which hold stack or heap) will be marked as “Normal,
cacheable”, as will pages for kernel modules, kernel linear mapping, and vmalloc() space.
Pages which hold heap or stack are typically allocated as a result of a page fault. New heap
pages are mapped to a read-only zero-initialized page. They are allocated on the first write
to the page. The page table entries themselves are accessed through the kernel linear
mapping.

•	Device mapped pages are created as a result of a call to the function mmap() for a /dev file,
with the device driver code responsible for the memory mapping. mmap() enables driver
code to associate a range of user-space addresses with Device memory.

•	The page which contains the exception vectors will be read-only for User mode code and
will be Normal, cacheable.

•	Static and dynamic I/O mappings will be marked as Device memory, with no User mode
access.

Strongly-ordered memory is not used directly by the kernel, by default, but can be selected for
specific tasks such as power management code.
ASIDs are dynamically allocated and are not guaranteed to be constant during the lifetime of a
process. As the ASID register provides only eight bits of ASID space and we can have more
than 256 processes, Linux has a scheme for allocating ASIDs. For a new process, we increment
the last ASID value used. When the last value is reached, we have to take some action: the TLB
is flushed (across all processors in an SMP system), and the value in the top 24 bits of the context
ID register, which can be considered a “generation” number, is incremented. Stepping to a
new generation means that all ASID values from the previous generation are now invalid and
ASID numbering is restarted. On a context switch, processes which use an older generation
of the context ID value are assigned a new ASID.
The ASID numbering scheme is global across processors. Individual threads within a process
have the same memory map and therefore share an ASID, but can run independently on separate
processors in an MP (multi-processing) system. We will consider multi-processing in much
greater detail in Chapter 22.


8.10    The Cortex-A15 MMU and Large Physical Address Extensions
The Cortex-A15 processor implements the ARMv7-A architecture Large Physical Address
Extension (LPAE), which introduces a number of new features:
•	Address mapping for a 40-bit physical address space, extending the range of accessible physical addresses from 4GB up to 1024GB (a terabyte), with a granularity of 4KB. A new page table entry format, the “long-descriptor format”, is added. The existing VMSAv7 short-descriptor format page tables are still supported.

•	Support is added for hierarchical permissions.

•	A contiguous page hint bit for page table entries. This is set to show that the entry is one of 16 contiguous entries which point to a contiguous output address range. If set, the TLB need cache only one translation for the group of 16 pages.

•	A new access permission setting, “privileged execute never” (PXN). This marks a page as containing code that can be executed only in a non-privileged (User) mode. This setting is also introduced in the legacy descriptor format. There is also a privileged execute setting (PX) which means that code can be executed only in privileged mode.

•	The ASID is stored in the TTBR. This allows atomic ASID changes, which reduce the overhead of context switches.

•	Simplified fault status encoding.

The Cortex-A15 processor MMU also supports virtualization. This provides a second stage of
address translation when running virtual machines. The first stage of this translation produces
an Intermediate Physical Address (IPA) and the second stage then produces the physical
address. TLB entries may also have an associated virtual machine ID (VMID), in addition to an
ASID. This will be covered in more detail in Chapter 27 Virtualization. It is possible to disable
the stage 2 MMU and have a flat mapping from IPA to PA.


Chapter 9
Memory Ordering

Older implementations of the ARM architecture (for example, ARM7TDMI) execute all
instructions in program order and each instruction is completely executed before the next
instruction is started.
Newer microprocessors employ a number of optimizations which relate to the way memory
accesses are performed. As we have seen, the speed of execution of instructions by the processor
is significantly higher than the speed of external memory. Caches and write buffers are used to hide
the latency associated with this difference in speed. One potential effect of this is to re-order
memory accesses. The order in which load and store instructions are executed by the processor will
not necessarily be the same as the order in which the accesses are seen by external devices.


Program order of instructions:

	STR R12, [R1]		@ Access 1
	LDR R0, [SP], #4	@ Access 2
	LDR R2, [R3, #8]	@ Access 3

Instruction execution timeline:
1.	Access 2 causes a cache lookup which misses.
2.	Access 3 causes a cache lookup which hits.
3.	Access 1 goes into the write buffer.
4.	Access 3 returns data into an ARM register.
5.	The cache linefill triggered by Access 2 returns data.
6.	The memory store triggered by Access 1 is performed.

Figure 9-1 Memory ordering example

In Figure 9-1, we have three instructions listed in program order. The first instruction performs a write to external memory, which in this example misses in the cache (Access 1). It is followed in program order by two reads, one which misses in the cache (Access 2) and one which hits in the cache (Access 3). Both of the read accesses can potentially complete before the write buffer completes the write associated with Access 1. Hit-under-miss behavior in the cache means that a load which hits in the cache (like Access 3) can complete before a load earlier in the program which missed in the cache (like Access 2).
In the main, the hardware still preserves, for the programmer, the illusion that instructions execute in the order they were written. There are generally only a few cases where the programmer has to worry about such effects. For example, when modifying CP15 registers, or when copying or otherwise changing code in memory, it may be necessary for the programmer to explicitly make the processor wait for such operations to complete.
For very high performance processors, such as the Cortex-A9 or Cortex-A15 processors, which
support speculative data accesses, multi-issuing of instructions, cache coherency protocols and
out-of-order execution in order to make further performance gains, there are even greater
possibilities for re-ordering. In general, the effects of this re-ordering are invisible to the
programmer, in a single processor system. The processor hardware takes care of many possible
hazards for us. It will ensure that address dependencies are respected and ensure the correct
value is returned by a read, allowing for potential modifications caused by earlier writes.
However, in cases where we have multiple processors which communicate through Shareable
memory (or share data in other ways), memory ordering considerations become more important.
In general, we are most likely to care about exact memory ordering at points where multiple
execution threads must be synchronized.
Processors which conform to the ARMv7-A architecture employ a weakly-ordered model of
memory. Reads and writes to Normal memory can be re-ordered by hardware, with such
re-ordering being subject only to address dependencies and explicit memory barrier
instructions. In cases where we need stronger ordering rules to be observed, we must
communicate this to the processor through the memory type attribute of the page table entry
which describes that memory. Enforcing ordering rules on the processor limits the possible
hardware optimizations and therefore reduces performance and increases power consumption.
The programmer therefore needs to understand when to apply such ordering constraints.


In this chapter, we will consider the memory ordering model of the ARM architecture. We will
look at the different types of memory (and other memory attributes) which are assigned to pages
using page table entries. We will look at the barrier instructions in ARM assembly language (and
accessible through C intrinsics). Finally we will look at the support for coherency between SMP
clusters and the concept of coherency domains, as well as considering some common cases
where memory ordering problems can be encountered.


9.1    ARM memory ordering model
Three memory types are defined in the ARM architecture. All regions of memory are configured as one of these three types:
•	Strongly-ordered
•	Device
•	Normal.
In addition, for Normal and Device memory, it is possible to specify whether the memory is Shareable (accessed by other agents) or not. For Normal memory, inner and outer cacheable properties can be specified.

9.1.1    Strongly-ordered and Device memory
Accesses to Strongly-ordered and Device memory have the same memory-ordering model. Access rules for this memory are as follows:

•	The number and size of accesses will be preserved. Accesses will be atomic, and will not be interrupted part way through.

•	Both read and write accesses can have side-effects on the system. Accesses are never cached. Speculative accesses will never be performed.

•	Accesses cannot be unaligned.

•	The order of accesses arriving at Device memory is guaranteed to correspond to the program order of instructions which access Strongly-ordered or Device memory. This guarantee applies only to accesses within the same peripheral or block of memory. The size of such a block is implementation defined, but has a minimum size of 1KB.

•	In the ARMv7 architecture, the processor can re-order Normal memory accesses around Strongly-ordered or Device memory accesses.

The only difference between Device and Strongly-ordered memory is that:

•	a write to Strongly-ordered memory can complete only when it reaches the peripheral or memory component accessed by the write

•	a write to Device memory is permitted to complete before it reaches the peripheral or memory component accessed by the write.

System peripherals will almost always be mapped as Device memory.
Regions of Device memory type can further be described using the Shareable attribute.
On some ARMv6 processors, the Shareable attribute of Device accesses is used to determine
which memory interface will be used for the access, with memory accesses to areas marked as
Device, Non-Shareable performed using a dedicated interface, the private peripheral port. This
mechanism is not used on ARMv7 processors.
Note
These memory ordering rules provide guarantees only about “explicit” memory accesses (those
caused by load and store instructions). The architecture does not provide similar guarantees
about the ordering of instruction fetches or page table walks with respect to such explicit
memory accesses.


9.1.2    Normal memory
Normal memory is used to describe most parts of the memory system. All ROM and RAM devices are considered to be Normal memory. All code to be executed by the processor must be in Normal memory. Code is not architecturally permitted to be in a region of memory which is marked as Device or Strongly-ordered.
The properties of Normal memory are as follows:

•	The processor can repeat read and some write accesses.

•	The processor can pre-fetch or speculatively access additional memory locations, with no side-effects (if permitted by MMU access permission settings). The processor will not perform speculative writes, however.

•	Unaligned accesses can be performed.

•	Multiple accesses can be merged by processor hardware into a smaller number of accesses of a larger size. Multiple byte writes could be merged into a single double-word write, for example.

Regions of Normal memory must also have cacheability attributes described (see Chapter 7 for
details of the supported cache policies). The ARM architecture supports cacheability attributes
for Normal memory for two levels of cache, the inner and outer cache. The mapping between
these levels of cache and the implemented physical levels of cache is implementation defined.
Inner refers to the innermost caches, and always includes the processor level 1 cache. An
implementation might not have any outer cache, or it can apply the outer cacheability attribute
to L2 and/or L3 cache. For example, in a system containing a Cortex-A9 processor and the
PL310 L2 Cache controller, the PL310 is considered to be the outer cache. The Cortex-A8 L2
cache can be configured to use either inner or outer cache policy.
Normal memory must also be specified either as Shareable or Non-Shareable. A region of
Normal memory with the Non-Shareable attribute is one which is used only by this processor.
There is no requirement for the processor to make accesses to this location coherent with other
processors. If other processors do share this memory, any coherency issues must be handled in
software. For example, this can be done by having individual processors perform cache
maintenance and barrier operations.
A region with the Shareable attribute set is one which can be accessed by other processors or
masters in the system. Data accesses to memory in this region by other processors within the
same “shareability domain” are coherent. This means that the programmer does not need to take
care of the effects of data or unified caches. In situations where cache coherency is not
maintained between processors for a region of shared memory, the programmer would have to
explicitly manage coherency themselves.
The ARMv7 architecture enables the programmer to specify Shareable memory as “inner
Shareable” or “outer Shareable” (this latter case means that the location is both inner and outer
Shareable).
The outer Shareable attribute enables the definition of systems containing clusters of processors.
Within a cluster, the data caches of the processors are coherent for all data accesses which have
the inner Shareable attribute. Between clusters, the caches are not coherent for data accesses
which only have the inner Shareable attribute, but are for data accesses which have the outer
Shareable attribute. Each cluster is said to be in a different shareability domain for the inner
Shareable attribute and in the same shareability domain for the outer Shareable attribute. Such
clusters are not supported in the Cortex-A5, Cortex-A8 or Cortex-A9 processors.


9.2    Memory barriers
A memory barrier is an instruction which requires the processor to apply an ordering constraint
between memory operations which occur before and after the memory barrier instruction in the
program. Such instructions may also be known as “memory fences” in other architectures (for
example, the x86).
The term memory barrier can also be used to refer to a compiler instruction which prevents the
compiler from scheduling data access instructions across the barrier when performing
optimizations. For example in GCC, we can use the inline assembler memory “clobber”, to
indicate that the instruction changes memory and therefore the optimizer cannot re-order
memory accesses across the barrier. The syntax is as follows:
asm volatile("" ::: "memory");

ARM RVCT includes a similar intrinsic, called __schedule_barrier().
Here, however, we are looking at hardware memory barriers, provided through dedicated ARM
assembly language instructions. As we have seen, processor optimizations such as caches, write
buffers and out-of-order execution can result in memory operations occurring in an order
different from that specified in the executing code. Normally, this re-ordering is invisible to the
programmer and application developers do not normally need to worry about memory barriers.
However, there are cases where we may need to take care of such ordering issues, for example
in device drivers or when we have multiple observers of the data which need to be synchronized.
The ARM architecture specifies memory barrier instructions, which allow the programmer to
force the processor to wait for memory accesses to complete. These instructions are available in
both ARM and Thumb code, in both user and privileged modes. In older versions of the
architecture, these were performed using CP15 operations in ARM code only. Use of these is
now deprecated, although preserved for compatibility.
Let’s start by looking at the practical effect of these instructions in a uni-processor system. Note
that this description is a simplified version of that given in the ARM Architecture Reference
Manual, what is written here is intended to introduce the usage of these instructions. The term
explicit access is used to describe a data access resulting from a load or store instruction in the
program. It does not include instruction fetches.
Data Synchronization Barrier (DSB)
	This instruction forces the processor to wait for all pending explicit data accesses to complete before any further instructions can execute. There is no effect on pre-fetching of instructions.
Data Memory Barrier (DMB)
	This instruction ensures that all memory accesses in program order before the barrier are observed in the system before any explicit memory accesses that appear in program order after the barrier. It does not affect the ordering of any other instructions executing on the processor, or of instruction fetches.
Instruction Synchronization Barrier (ISB)
	This instruction flushes the pipeline and prefetch buffers in the processor, so that all instructions following the ISB are fetched from cache or memory after the ISB has completed. This ensures that the effects of context-altering operations (for example, CP15 or ASID changes, or TLB or branch predictor operations) executed before the ISB instruction are visible to any instructions fetched after the ISB. An ISB does not in itself cause synchronization between data and instruction caches, but is required as a part of such an operation.

Several options can be specified with the DMB or DSB instructions, to indicate the type of access and the shareability domain the barrier should apply to, as follows:
SY	This is the default and means that the barrier applies to the full system.
ST	A barrier which waits only for stores to complete.
ISH	A barrier which applies only to the inner Shareable domain.
ISHST	A barrier which combines the above two (that is, it waits only for stores to complete, and applies only to the inner Shareable domain).
NSH	A barrier which applies only out to the point of unification (see Chapter 7).
NSHST	A barrier which waits only for stores to complete, and only out to the point of unification.
OSH	A barrier operation which applies only to the outer Shareable domain.
OSHST	A barrier operation which waits only for stores to complete, and applies only to the outer Shareable domain.

To make sense of this, we need to use a more general definition of the DMB and DSB operations in a multi-processor system. The word “processor” (or agent) in the following text does not necessarily mean a processor; it could also refer to a DSP, a DMA controller, a hardware accelerator or any other block that accesses shared memory.
The DMB instruction has the effect of enforcing memory access ordering within a shareability
domain. All processors within the shareability domain are guaranteed to observe all explicit
memory accesses before the DMB instruction, before they observe any of the explicit memory
accesses after it.
The DSB instruction has the same effect as the DMB, but in addition to this, it also synchronizes the
memory accesses with the full instruction stream, not just other memory accesses. This means
that when a DSB is issued, execution will stall until all outstanding explicit memory accesses have
completed. When all outstanding reads have completed and the write buffer is drained,
execution resumes as normal.
It may be easier to appreciate the effect of the barriers by considering an example. Consider the
case of a Cortex-A9 MPCore containing four processors. These processors operate as an SMP
cluster and form a single shareability domain. When a single processor within the cluster
executes a DMB instruction, that processor will ensure that all data memory accesses in program
order before the barrier complete, before any explicit memory accesses that appear in
program-order after the barrier. This way, it can be guaranteed that all processors within the
cluster will see the accesses on either side of that barrier in the same order as the processor that
performs them. If the DMB ISH variant is used, the same is not guaranteed for external observers
such as DMA controllers or DSPs.


9.2.1    Memory barrier use example
Consider the case where we have two processors A and B and two addresses in Normal memory
(Addr1 and Addr2) held in processor registers. Each processor executes two instructions as shown
in Example 9-1:
Example 9-1 Code example showing memory ordering issues

Processor A:
STR R0, [Addr1]
LDR R1, [Addr2]

Processor B:
STR R2, [Addr2]
LDR R3, [Addr1]

Here, there is no ordering requirement and we can make no statement about the order in which
any of the transactions occur. The addresses Addr1 and Addr2 are independent and there is no
requirement on either processor to execute the load and store in the order written in the program,
or to care about the activity of the other processor.
There are therefore four possible legal outcomes of this piece of code, with four different sets
of values from memory ending up in processor A R1 and processor B R3:
•	A gets the “old” value, B gets the “old” value.
•	A gets the “old” value, B gets the “new” value.
•	A gets the “new” value, B gets the “old” value.
•	A gets the “new” value, B gets the “new” value.

If we were to involve a third processor, C, we should also note that there is no requirement that
it would observe either of the stores in the same order as either of the other processors. It is
perfectly permissible for both A and B to see an old value in Addr1 and Addr2, but for C to see
the new values.
So, let’s consider the case where the code on B looks for a flag being set by A and then reads
memory – for example if we are passing a message from A to B. We might now have code
similar to that shown in Example 9-2:
Example 9-2 Possible ordering hazard with postbox

Processor A:
STR R0, [Msg] @ write some new data into postbox
STR R1, [Flag] @ new data is ready to read

Processor B:
Poll_loop:
LDR R1, [Flag]
CMP R1,#0 @ is the flag set yet?
BEQ Poll_loop
LDR R0, [Msg] @ read new data.


Again, this might not behave in the way that is expected. There is no reason why processor B is
not allowed to perform the read from [Msg] before the read from [Flag]. This is normal, weakly
ordered memory and the processor has no knowledge about a possible dependency between the
two. The programmer must explicitly enforce the dependency by inserting a memory barrier. In
this example, we actually need two memory barriers. Processor A needs a DMB between the two
store operations, to make sure they happen in the programmer specified order. Processor B
needs a DMB before the LDR R0, [Msg] to be sure that the message is not read until the flag is set.
9.2.2    Avoiding deadlocks with a barrier
Here is another case which can cause a deadlock if barrier instructions are not used. Consider a
situation where one processor (A) writes to an address and then polls for an acknowledge value
from another processor (B).
Example 9-3 shows the type of code which can cause a problem.
Example 9-3 Deadlock

Processor A:
STR R0, [Addr] @ write some data
Poll_loop:
LDR R1, [Flag]
CMP R1,#0 @ is the flag set yet?
BEQ Poll_loop

Processor B:
Poll_loop2:
LDR R1, [Addr]
CMP R1,#0 @ has the data arrived yet?
BEQ Poll_loop2
STR R0, [Flag]

The ARMv7 architecture without multiprocessing extensions does not strictly require processor
A’s store to [Addr] to ever complete (it could be sitting in a write buffer while the memory system
is kept busy reading the flag), so both processors could potentially deadlock, each waiting for
the other. Inserting a DSB after the STR of processor A forces its store to be observed by
processor B before processor A will read from Flag. Processors which implement the
multiprocessing extensions, like the Cortex-A5 MPCore and Cortex-A9 MPCore processors, are
required to complete accesses in a finite time (that is, their write buffers must drain) and so the
barrier instruction is not strictly required. However, most programmers prefer not to think too
hard about whether a barrier is needed and are advised to include one anyway!
9.2.3    WFE and WFI interaction with barriers
The WFE (Wait For Event) and WFI (Wait For Interrupt) instructions, described further in
Chapter 21 Power Management, allow us to stop execution and enter a low-power state. If we
need to ensure that all memory accesses prior to executing WFI or WFE have been completed (and
made visible to other processors), we must insert a DSB instruction.
A further consideration relates to usage of WFE and SEV (Send Event) in an MP system. These
instructions allow us to reduce the power consumption associated with a lock acquire loop (a
spinlock). A processor which is attempting to acquire a mutex may find that some other
processor already has the lock. Instead of having the processor repeatedly poll the lock, we can
suspend execution and enter a low-power state, using the WFE instruction. We wake either when


an interrupt or other asynchronous exception is recognized, or another processor sends an event
(with the SEV instruction). The processor that had the lock will use the SEV instruction to wake-up
other processors in the WFE state after the lock has been released. For the purposes of memory
barrier instructions, the event signal is not treated as an explicit memory access. We therefore
need to take care that the update to memory which releases the lock is actually visible to other
processors before the SEV instruction is executed. This requires the use of a DSB. DMB is not
sufficient as it only affects the ordering of memory accesses without synchronizing them to a
particular instruction, whereas DSB will prevent the SEV from executing until all preceding
memory accesses have been seen by other processors.
9.2.4    Linux use of barriers
In this chapter, we have looked at memory barrier instructions. In this section, we will look at
the implications of barriers in multi-core systems and take a much more detailed look at SMP
operation.
Barriers are needed to enforce ordering of memory operations. Most programmers will not need
to understand, or explicitly use memory barriers. This is because they are already included
within kernel locking and scheduling primitives. Nevertheless, writers of device drivers or those
seeking an understanding of kernel operation may find a detailed description useful.
Both the compiler and processor micro-architecture optimizations permit the order of instructions and associated memory operations to be changed. Sometimes, however, we wish to enforce a specified order of execution of memory operations. For example, we might write to a memory mapped peripheral register. This write can have side-effects elsewhere in the system. Memory operations before or after this operation in our program can be re-ordered around it, as they operate on different locations. In some cases, however, we wish to ensure that all operations complete before this peripheral write completes. Or, we may want to make sure that the peripheral write completes before any further memory operations are started. Linux provides some functions to do this, as follows:
•	We instruct the compiler that re-ordering is not permitted for a particular memory operation. This is done with the barrier() function call. This controls only the compiler code generation and optimization and has no effect on hardware re-ordering.

•	We call a memory barrier function which maps to ARM processor instructions that perform the memory barrier operations. These enforce a particular hardware ordering. The available barriers are as follows (in a Linux kernel compiled with Cortex-A SMP support):

	—	the read memory barrier rmb() function ensures that any read that appears before the barrier is completed before the execution of any read that appears after the barrier

	—	the write memory barrier wmb() function ensures that any write that appears before the barrier is completed before the execution of any write that appears after the barrier

	—	the memory barrier mb() function ensures that any memory access that appears before the barrier is completed before the execution of any memory access that appears after the barrier.

	There are corresponding SMP versions of these barriers, called smp_mb(), smp_rmb() and smp_wmb(). These are used to enforce ordering on Normal cacheable memory between processors inside the same SMP processor (for example, the processors inside a Cortex-A9 MPCore). They cannot be used with devices and do not work for Normal non-cacheable memory.

For these memory barriers, it is almost always the case that a pair of barriers is required. For
further information, see http://www.kernel.org/doc/Documentation/memory-barriers.txt.


The SMP specific barriers smp_mb(), smp_rmb() and smp_wmb() are therefore not supersets of the barriers listed above, but rather subsets, for resolving ordering issues only between processors within an SMP system. When the kernel is compiled without CONFIG_SMP, each invocation of these is expanded to a barrier() statement.
All of Linux's locking primitives include any needed barrier.


9.3    Cache coherency implications
The caches are largely invisible to the application programmer. However they can become
visible when there is a breakdown in the coherency of the caches, when memory locations are
changed elsewhere in the system or when memory updates made from the application code must
be made visible to other parts of the system.
A system containing an external DMA device and a processor provides a simple example of
possible problems. There are two situations in which a breakdown of coherency can occur. If
the DMA reads data from main memory while newer data is held in the processor’s cache, the
DMA will read the old data. Similarly, if a DMA writes data to main memory and stale data is
present in the processor’s cache, the processor can continue to use the old data.
Therefore dirty data which is in the ARM data cache must be explicitly cleaned before the DMA
starts. Similarly, if the DMA is copying data to be read by the ARM, it must be certain that the
ARM data cache does not contain stale data (the cache will not be updated by the DMA writing
memory and this may need the ARM to clean and/or invalidate the affected memory areas from
the cache(s) before starting the DMA). As all ARMv7-A processors can do speculative memory
accesses, it will also be necessary to invalidate after using the DMA.

9.3.1    Issues with copying code
Boot code, kernel code or JIT compilers can copy programs from one location to another, or
modify code in memory. There is no hardware mechanism to maintain coherency between
instruction and data caches. The programmer must invalidate stale data from the instruction
cache by invalidating the affected areas, and ensure that the data written has actually reached
the main memory. Specific code sequences including instruction barriers are needed if the
processor is then intended to branch to the modified code.

9.3.2    Compiler re-ordering optimizations
It is important to understand that memory barrier instructions apply only to hardware
re-ordering of memory accesses. Inserting a hardware memory barrier instruction may not have
any direct effect on compiler re-ordering of operations. The volatile type qualifier in C tells the
compiler that the variable can be changed by something other than the code that is accessing it.
This is often used for C language access to memory mapped I/O, allowing such devices to be
safely accessed through a pointer to a volatile variable. The C standard does not provide rules
relating to the use of volatile in systems with multiple processors. So, although we can be sure
that volatile loads and stores will happen in program-specified order with respect to each other,
there are no such guarantees about re-ordering of accesses relative to non-volatile loads or
stores. This means that volatile does not provide a shortcut for implementing mutexes.

ARM DEN0013B
ID082411

Copyright © 2011 ARM. All rights reserved.
Non-Confidential

9-12

Chapter 10
Exception Handling

In this chapter, we look at how ARM processors respond to exceptions – also known as traps or
interrupts in other architectures. All microprocessors need to respond to external asynchronous
events, such as a button being pressed, or a clock reaching a certain value. Normally, there is
specialized hardware which activates input lines to the processor. This causes the microprocessor
to temporarily stop the current program sequence and execute a special handler routine. The speed
with which a processor can respond to such events may be a critical issue in system design. Indeed
in many embedded systems, there is no “main” program as such – all of the functions of the system
are handled by code which runs from interrupts, and assigning priorities to these is a key area of
design. Rather than have the processor constantly poll the flags from different parts of the
system to see if there is something to be done, we instead allow the system to tell the processor
that something needs to happen, by generating an interrupt. Complex systems have many
interrupt sources with different levels of priority and requirements for nested interrupt handling
(where a higher priority interrupt can interrupt a lower priority one).
In normal program execution, the program counter increments through the address space, with
branches in the program modifying the flow of execution (for example, for function calls, loops,
and conditional code). When an exception occurs, this sequence is interrupted.
In addition to responding to external interrupts, there are a number of other things which can cause
the processor to take an exception, both external (reset, external aborts from the memory system)
and internal (MMU generated aborts or OS calls using the SVC instruction). Dealing with interrupts
and exceptions causes the ARM processor to switch between modes and copy some registers into
others. Readers new to the ARM architecture may wish to refresh their understanding of the modes
and registers previously described, before continuing with this chapter.
We start by introducing exceptions and see how the ARM processor handles each of the different
types and what they are used for. We then look in more detail at interrupts and describe mechanisms
of interrupt handling on ARM and standard interrupt handling schemes.

10.1 Types of exception
As we have already seen, the A and R profiles of the architecture support seven processor
modes: six privileged modes called FIQ, IRQ, Supervisor, Abort, Undefined and System, and
the non-privileged User mode. Hyp mode and Monitor mode can be added to this list where the
Virtualization and Security Extensions, respectively, are implemented. The current mode can
change under software control or when processing an exception. However, the unprivileged
User mode can switch to another mode only by generating an exception.
An exception is any condition that needs to halt normal execution and instead run software
associated with that exception type, known as an exception handler.
When an exception occurs, the processor saves the current status and the return address, enters
a specific mode and possibly disables hardware interrupts. Execution is then forced from a fixed
memory address called an exception vector. This happens automatically and is not under direct
control of the programmer.
The following types of exception exist:
Interrupts

There are two types of interrupts provided on ARMv7-A processors, called IRQ
and FIQ.
FIQ is higher priority than IRQ. FIQ also has some potential speed advantages
owing to its position in the vector table and the higher number of banked registers
available in FIQ mode. This potentially saves processor clock cycles on pushing
registers to the stack within the handler. Both of these kinds of exception are
typically associated with input pins on the processor – external hardware asserts
an interrupt request line and the corresponding exception type is raised when the
current instruction finishes executing, assuming that the interrupt is not disabled.

Aborts

Aborts can be generated either on instruction fetches (prefetch aborts) or data
accesses (data aborts). They can come from the external memory system giving
an error response on a memory access (indicating perhaps that the specified
address does not correspond to real memory in the system). Alternatively, the
abort can be generated by the Memory Management Unit (MMU) of the
processor. An operating system can use MMU aborts to dynamically allocate
memory to applications. An instruction can be marked within the pipeline as
aborted when it is fetched. The prefetch abort exception is taken only if the
processor then actually tries to execute it. The exception takes place before the
instruction actually executes. If the pipeline is flushed before the aborted
instruction reaches the execute stage of the pipeline, the abort exception will not
occur. A data abort exception happens when a load or store instruction executes
and is considered to happen after the data read or write has been attempted. The
ARMv7 architecture distinguishes between precise and imprecise aborts. Aborts
generated by the MMU are always precise. The architecture does not require
particular classes of externally aborted accesses to be precise.
For example, on a particular processor implementation, it may be the case that an
external abort reported on a page table walk is treated as precise, but this is not
required to be the case for all processors. For precise aborts, the abort handler can
be certain which instruction caused the abort and that no further instructions were
executed after that instruction. This is in contrast to an imprecise abort, which
results when the external memory system reports an error on an access. In this
case, the abort handler cannot determine which instruction caused the problem (or
further instructions may have executed after the one which generated the abort).
For example, if a buffered write receives an error response from the external
memory system, further instructions will have been executed after the store. This

means that it will be impossible for the abort handler to fix the problem and return
to the application. All it can do is to kill the application which caused the problem.
Device probing therefore needs special handling, as externally reported aborts on
reads to non-existent areas will generate imprecise, asynchronous aborts even
when such memory is marked as Strongly-ordered, or Device. Generation of
imprecise aborts is controlled by the CPSR A bit. If the A bit is set, imprecise
aborts from the external memory system will be recognized by the processor, but
no abort exception will be generated immediately. Instead, the processor keeps
the abort pending until the A bit is cleared and takes an exception at that time.
This bit is set by default on reset, and certain other exception types. Kernel code
will typically ensure (through the use of a barrier instruction) that pending
imprecise aborts are recognized against the correct application. If a thread has to
be killed due to an imprecise abort, it needs to be the correct one!
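The A bit can be manipulated directly with the CPS (Change Processor State)
instruction; a sketch:

```
        CPSIE   a        ; clear the CPSR A bit: pending and new asynchronous
                         ; (imprecise) aborts can now generate exceptions
        ; code executed here has imprecise aborts recognized immediately
        CPSID   a        ; set the CPSR A bit: imprecise aborts are held pending
```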
Reset

All processors have a reset input and will take the reset exception immediately
after they have been reset. It is the highest priority exception and cannot be
masked.

Exceptional instructions
There are two classes of instruction which can cause exceptions on the ARM. The
first is the Supervisor Call (SVC), previously known as Software Interrupt (SWI).
This is typically used to provide a mechanism by which User mode programs can
pass control to privileged, kernel code in the OS to perform OS-level tasks. The
second is an undefined instruction. The architecture defines certain bit-patterns as
corresponding to undefined opcodes. Trying to execute one of these causes an
Undefined Instruction exception to be taken. In addition, executing coprocessor
instructions for which there is no corresponding coprocessor hardware will also
cause this trap to happen. Some instructions can be executed only in a privileged
mode and executing these from User mode will cause an undefined instruction
exception.
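For example, a User mode program might request an OS service through SVC, with
the handler recovering the service number from the instruction itself. This sketch
assumes ARM (not Thumb) state, and the handler label and service number are
hypothetical:

```
        MOV     r0, #42                ; hypothetical argument for the OS
        SVC     #0x10                  ; trap to Supervisor mode; 0x10 is the
                                       ; 24-bit comment field read by the handler

svc_handler:
        LDR     r12, [lr, #-4]         ; load the SVC instruction that trapped
        BIC     r12, r12, #0xFF000000  ; keep the 24-bit immediate (0x10 here)
```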
When an exception occurs, code execution passes to an area of memory called the vector table.
Within the table just one word is allocated to each of the various exception types and this will
usually contain a branch instruction to the actual exception handler.
You can write the exception handlers in either ARM or Thumb code. The CP15 SCTLR.TE bit
is used to specify whether exception handlers will use ARM or Thumb. When handling
exceptions, the prior mode, state, and registers of the processor must be preserved so that the
program can be resumed after the exception has been handled.

10.2 Entering an exception handler
When an exception occurs, the ARM processor automatically does the following things:
•	Preserves the address of the next instruction in the Link Register (LR) of the new mode.
•	Copies the CPSR to the SPSR, one of the banked registers specific to each (non-User) mode
of operation.
•	Modifies the CPSR mode bits to a mode associated with the exception type. The other
CPSR mode bits are set to values determined by bits in the CP15 System Control Register.
The T bit is set to the value given by the CP15 TE bit. The J bit is cleared and the E bit
(Endianness) is set to the value of the EE (Exception Endianness) bit. This enables
exceptions to always run in ARM or Thumb state and in little- or big-endian, irrespective
of the state the processor was in before the exception.
•	Forces the PC to point to the relevant instruction from the exception vector table.

It will almost always be necessary for the exception handler software to save registers onto the
stack immediately upon exception entry. (FIQ mode has more banked registers, so a simple
handler may be written in a way that needs no stack usage.)
A special assembly language instruction, SRS (Store Return State), is provided to assist with
saving the necessary registers. It pushes the LR and SPSR onto the stack of any mode; the stack
to use is specified by the instruction operand.
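
A sketch of an IRQ handler prologue using SRS, switching to Supervisor mode so the rest of the handler runs on the SVC stack (the handler name is hypothetical):

```
irq_handler:
        SUB     lr, lr, #4        ; adjust the return address for IRQ (see Table 10-1)
        SRSDB   sp!, #0x13        ; push LR_irq and SPSR_irq onto the Supervisor stack
        CPS     #0x13             ; switch to Supervisor mode
        PUSH    {r0-r3, r12}      ; save the AAPCS corruptible registers
```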

10.3 Exit from an exception handler
Exiting from an exception and returning to the main program is always done by executing a
special instruction. This has the effect of doing both of the following at the same time:
•	restore the CPSR from the saved SPSR
•	set the PC to the value of (LR - offset), where the offset value is fixed for a particular
exception type, shown in Table 10-1 on page 10-6.

This can be accomplished by using an instruction like SUBS PC, LR, #offset if the link register
was not pushed on to the stack at the start of the handler. Otherwise, it will be necessary to pop
the values to be restored from the stack. Again, there is a special assembly language instruction
provided to assist with this. The Return From Exception (RFE) instruction pops the link register
and SPSR off the current mode stack.
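
For a handler that saved its state with SRS on entry, a matching exit sketch is:

```
        POP     {r0-r3, r12}      ; restore the registers saved on entry
        RFEIA   sp!               ; pop LR and SPSR: restores PC and CPSR together
```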

10.4 Exception mode summary
Table 10-1 summarizes different exceptions and the associated mode. This table contains a great
deal of information and we’ll take a moment to look at each row and column in turn.
The CPSR column indicates the setting of the CPSR I and F bits, used to disable IRQ and FIQ
respectively.
Table 10-1 Summary of exception behavior

Vector    Exception       Mode        Event                   CPSR      Return instruction
address
------------------------------------------------------------------------------------------------
0x0       Reset           Supervisor  Reset input asserted    F=1, I=1  Not applicable

0x4       Undefined       Undefined   Executing undefined     I=1       MOVS PC, LR (if emulating
          instruction                 instruction                       the instruction)
                                                                        SUBS PC, LR, #4 (if
                                                                        re-executing after, for
                                                                        example, enabling VFP)

0x8       Supervisor call Supervisor  SVC instruction         I=1       MOVS PC, LR

0xC       Prefetch Abort  Abort       Instruction fetch from  I=1       SUBS PC, LR, #4
                                      invalid address

0x10      Data Abort      Abort       Data read/write to      I=1       SUBS PC, LR, #8 (if retry
                                      invalid address                   of the aborting instruction
                                                                        is wanted)

0x14      Hypervisor      HYP         Hypervisor entry        -         ERET
          entry (a)

0x18      Interrupt       IRQ         IRQ input asserted      I=1       SUBS PC, LR, #4

0x1C      Fast Interrupt  FIQ         FIQ input asserted      F=1, I=1  SUBS PC, LR, #4

a. The Hypervisor entry exception (described in Chapter 27 Virtualization) is available only in
processors that implement the Virtualization Extensions and is unused in other processors.

The suggested return instructions in Table 10-1 assume that we wish to retry the aborted
instruction in the case of an abort, but not to retry an instruction which caused an undefined
instruction exception.
10.4.1 Exception priorities
As some of the exception types can occur simultaneously, the processor assigns a fixed priority
to each exception, as shown in Table 10-1. The Undefined Instruction, Prefetch Abort and
Supervisor Call exceptions are due to execution of an instruction (there are specific bit patterns
for undefined and SVC opcodes) and so can never happen together; they therefore have the same
priority.

It is important to distinguish between prioritization of exceptions, which happens when multiple
exceptions occur at the same time, and prioritization of the actual exception handler code. You
will notice that Table 10-1 on page 10-6 contains a column showing how FIQ and IRQ are
automatically disabled by some exceptions. (All exceptions disable IRQ; only FIQ and Reset
disable FIQ.) This is done by the processor automatically setting the CPSR I (IRQ) and F (FIQ) bits.
So, an FIQ exception can interrupt an abort handler or IRQ exception. In the case of a data abort
and FIQ occurring simultaneously, the data abort (which has higher priority) is taken first. This
lets the processor record the return address for the data abort. But as FIQ is not disabled by data
abort, we then take the FIQ exception immediately. At the end of the FIQ we return back to the
data abort handler.
More than one exception can potentially be generated at the same time, but some combinations
are mutually exclusive. A prefetch abort marks an instruction as invalid and so cannot occur at
the same time as an undefined instruction or SVC (and of course, an SVC instruction cannot also
be an undefined instruction). These instructions cannot cause any memory access and therefore
cannot cause a data abort. The architecture does not define when asynchronous exceptions, FIQ,
IRQ and/or imprecise aborts must be taken, but the fact that taking an IRQ or data abort
exception does not disable FIQ exceptions means that FIQ execution will be prioritized over
IRQ and/or asynchronous abort handling.

10.5 Vector table
The first column in the table gives the vector address within the vector table associated with the
particular type of exception. This is a table of instructions that the ARM processor goes to when
an exception is raised. These instructions are located in a specific place in memory. The normal
vector base address is 0x00000000, but most ARM processors allow the vector base address to be
moved to 0xFFFF0000. All Cortex-A series processors permit this, and it is the default address
selected by the Linux kernel.
You will notice that there is a single word address associated with each exception type.
Therefore, only a single instruction can be placed in the vector table for each exception
(although, in theory, two 16-bit Thumb instructions could be used). FIQ is different, as we shall
see in Distinction between FIQ and IRQ on page 10-9. Therefore, the vector table entry almost
always contains one of the various forms of branches.
B
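
A minimal vector table sketch might therefore look like this (the handler names are hypothetical; a small FIQ handler can simply be placed at 0x1C, needing no branch at all):

```
vectors:
        B       reset_handler       ; 0x00 Reset
        B       undef_handler       ; 0x04 Undefined instruction
        B       svc_handler         ; 0x08 Supervisor call
        B       pabort_handler      ; 0x0C Prefetch abort
        B       dabort_handler      ; 0x10 Data abort
        NOP                         ; 0x14 unused (Hyp entry where implemented)
        B       irq_handler         ; 0x18 IRQ
fiq_handler:                        ; 0x1C FIQ: handler starts here in-line
```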
