ARM® Cortex®-A Series
Programmer’s Guide for ARMv8-A
Version: 1.0

Copyright © 2015 ARM. All rights reserved.
Release Information
The following changes have been made to this book.
Change history

Date             Issue    Confidentiality     Change
24 March 2015             Non-Confidential    First release

Proprietary Notice
This document is protected by copyright and other related rights and the practice or implementation of the information
contained in this document may be protected by one or more patents or pending patent applications. No part of this
document may be reproduced in any form by any means without the express prior written permission of ARM. No
license, express or implied, by estoppel or otherwise to any intellectual property rights is granted by this document
unless specifically stated.
Your access to the information in this document is conditional upon your acceptance that you will not use or permit
others to use the information for the purposes of determining whether implementations infringe any third party patents.
THIS DOCUMENT IS PROVIDED “AS IS”. ARM PROVIDES NO REPRESENTATIONS AND NO
WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, INCLUDING, WITHOUT LIMITATION, THE IMPLIED
WARRANTIES OF MERCHANTABILITY, SATISFACTORY QUALITY, NON-INFRINGEMENT OR FITNESS
FOR A PARTICULAR PURPOSE WITH RESPECT TO THE DOCUMENT. For the avoidance of doubt, ARM makes
no representation with respect to, and has undertaken no analysis to identify or understand the scope and content of,
third party patents, copyrights, trade secrets, or other rights.
This document may include technical inaccuracies or typographical errors.
TO THE EXTENT NOT PROHIBITED BY LAW, IN NO EVENT WILL ARM BE LIABLE FOR ANY DAMAGES,
INCLUDING WITHOUT LIMITATION ANY DIRECT, INDIRECT, SPECIAL, INCIDENTAL, PUNITIVE, OR
CONSEQUENTIAL DAMAGES, HOWEVER CAUSED AND REGARDLESS OF THE THEORY OF LIABILITY,
ARISING OUT OF ANY USE OF THIS DOCUMENT, EVEN IF ARM HAS BEEN ADVISED OF THE
POSSIBILITY OF SUCH DAMAGES.
This document consists solely of commercial items. You shall be responsible for ensuring that any use, duplication or
disclosure of this document complies fully with any relevant export laws and regulations to assure that this document
or any portion thereof is not exported, directly or indirectly, in violation of such export laws. Use of the word “partner”
in reference to ARM’s customers is not intended to create or refer to any partnership relationship with any other
company. ARM may make changes to this document at any time and without notice.
If any of the provisions contained in these terms conflict with any of the provisions of any signed written agreement
covering this document with ARM, then the signed written agreement prevails over and supersedes the conflicting
provisions of these terms. This document may be translated into other languages for convenience, and you agree that if
there is any conflict between the English version of this document and any translation, the terms of the English version
of the Agreement shall prevail.
Words and logos marked with ® or ™ are registered trademarks or trademarks of ARM Limited or its affiliates in the
EU and/or elsewhere. All rights reserved. Other brands and names mentioned in this document may be the trademarks
of their respective owners. Please follow ARM’s trademark usage guidelines at
http://www.arm.com/about/trademark-usage-guidelines.php
Copyright © 2015, ARM Limited or its affiliates. All rights reserved.
ARM Limited. Company 02557590 registered in England.
110 Fulbourn Road, Cambridge, England CB1 9NJ.
Confidentiality Status
This document is Non-Confidential. The right to use, copy and disclose this document may be subject to license
restrictions in accordance with the terms of the agreement entered into by ARM and the party that ARM delivered this
document to.
Product Status
The information in this document is final, that is for a developed product.

ARM DEN0024A
ID050815

Web Address
http://www.arm.com

Contents
ARM Cortex-A Series Programmer’s Guide for ARMv8-A

Preface
    Glossary
    References
    Feedback on this book

Chapter 1     Introduction
    1.1     How to use this book

Chapter 2     ARMv8-A Architecture and Processors
    2.1     ARMv8-A
    2.2     ARMv8-A Processor properties

Chapter 3     Fundamentals of ARMv8
    3.1     Execution states
    3.2     Changing Exception levels
    3.3     Changing execution state

Chapter 4     ARMv8 Registers
    4.1     AArch64 special registers
    4.2     Processor state
    4.3     System registers
    4.4     Endianness
    4.5     Changing execution state (again)
    4.6     NEON and floating-point registers

Chapter 5     An Introduction to the ARMv8 Instruction Sets
    5.1     The ARMv8 instruction sets
    5.2     C/C++ inline assembly
    5.3     Switching between the instruction sets

Chapter 6     The A64 instruction set
    6.1     Instruction mnemonics
    6.2     Data processing instructions
    6.3     Memory access instructions
    6.4     Flow control
    6.5     System control and other instructions

Chapter 7     AArch64 Floating-point and NEON
    7.1     New features for NEON and Floating-point in AArch64
    7.2     NEON and Floating-Point architecture
    7.3     AArch64 NEON instruction format
    7.4     NEON coding alternatives

Chapter 8     Porting to A64
    8.1     Alignment
    8.2     Data types
    8.3     Issues when porting code from a 32-bit to 64-bit environment
    8.4     Recommendations for new C code

Chapter 9     The ABI for ARM 64-bit Architecture
    9.1

Chapter 10    AArch64 Exception Handling
    10.1    Exception handling registers
    10.2    Synchronous and asynchronous exceptions
    10.3    Changes to execution state and Exception level caused by exceptions
    10.4    AArch64 exception table
    10.5    Interrupt handling
    10.6    The Generic Interrupt Controller

Chapter 11    Caches
    11.1    Cache terminology
    11.2    Cache controller
    11.3    Cache policies
    11.4    Point of coherency and unification
    11.5    Cache maintenance
    11.6    Cache discovery

Chapter 12    The Memory Management Unit
    12.1    The Translation Lookaside Buffer
    12.2    Separation of kernel and application Virtual Address spaces
    12.3    Translating a Virtual Address to a Physical Address
    12.4    Translation tables in ARMv8-A
    12.5    Translation table configuration
    12.6    Translations at EL2 and EL3
    12.7    Access permissions
    12.8    Operating system use of translation table descriptors
    12.9    Security and the MMU
    12.10   Context switching
    12.11   Kernel access with user permissions

Chapter 13    Memory Ordering
    13.1    Memory types
    13.2    Barriers
    13.3    Memory attributes

Chapter 14    Multi-core processors
    14.1    Multi-processing systems
    14.2    Cache coherency
    14.3    Multi-core cache coherency within a cluster
    14.4    Bus protocol and the Cache Coherent Interconnect

Chapter 15    Power Management
    15.1    Idle management
    15.2    Dynamic voltage and frequency scaling
    15.3    Assembly language power instructions
    15.4    Power State Coordination Interface

Chapter 16    big.LITTLE Technology
    16.1    Structure of a big.LITTLE system
    16.2    Software execution models in big.LITTLE
    16.3    big.LITTLE MP

Chapter 17    Security
    17.1    TrustZone hardware architecture
    17.2    Switching security worlds through interrupts
    17.3    Security in multi-core systems
    17.4    Switching between Secure and Non-secure state

Chapter 18    Debug
    18.1    ARM debug hardware
    18.2    ARM trace hardware
    18.3    DS-5 debug and trace

Chapter 19    ARMv8 Models
    19.1    ARM Fast Models
    19.2    ARMv8-A Foundation Platform
    19.3    The Base Platform FVP

Preface

In 2013, ARM released its 64-bit ARMv8 architecture, the first major change to the ARM
architecture since ARMv7 in 2007, and the most fundamental and far-reaching change since the
original ARM architecture was created.
Development of the architecture has continued for some years. Early versions were being used
before the Cortex-A Series Programmer’s Guide for ARMv7-A was first released. The first of
the Programmer’s Guide series from ARM, it post-dated the introduction of the 32-bit ARMv7
architecture by some years. Almost immediately there were requests for a version to cover the
ARMv8 architecture. It was intended from the outset that a guide to ARMv8 should be available
as soon as possible.
This book was started when the first versions of the ARMv8 architecture were being tested and
codified. As always, moving from a system that is known and understood to something new and
unknown can present a number of problems. The engineers who supplied information for the
present book are, by and large, the same engineers who supplied the information for the original
Cortex-A Series Programmer’s Guide. This book has been made richer by their observations and
insights as they use the new architecture and solve the problems it presents.
The Programmer’s Guides are meant to complement, rather than replace, other ARM
documentation available, such as the Technical Reference Manuals (TRMs) for the processors
themselves, documentation for individual devices or boards or, most importantly, the ARM
Architecture Reference Manual (the ARM ARM). They are intended to provide a gentle
introduction to the ARM architecture, and cover all the main concepts that you need to know
about, in an easy to read format, with examples of actual code in both C and assembly language,
and with hints and tips for writing your own code.
It might be argued that if you are an application developer, you do not need to know what goes
on inside a processor. ARM Application processors can easily be regarded as black boxes which
simply run your code when you say go. Instead, this book provides a single guide, bringing
together information from a wide variety of sources, for those programmers who get the system
to the point where application developers can run applications, such as those involved in ASIC
verification, or those working on boot code and device drivers.
During bring-up of a new board or System-on-Chip (SoC), engineers may have to investigate
issues with the hardware. Memory system behavior is among the most common places for these
to manifest, for example, deadlocks where the processor cannot make forward progress because
of memory system lock. Debugging these problems requires an understanding of the operation
and effect of cache or MMU use. This is different from debugging a failing piece of code.
In a similar vein, system architects (usually hardware engineers) make choices early in the
design about the implementation of DMA, frame buffers and other parts of the memory system
where an understanding of data flow between agents is required. It is difficult to make
sensible decisions about these if you do not understand when a cache will help you and when
it gets in the way, or how the OS will use the MMU. Similar considerations apply in many other
places.
This is not an introductory level book, nor is it a purely technical description of the architecture
and processors that merely states the facts with little or no explanation of ‘how’ and ‘why’.
ARM and all who have collaborated on this book hope it successfully navigates between the two
extremes, while attempting to explain some of the more intricate aspects of the architecture.

Glossary
Abbreviations and terms used in this document are defined here.

AAPCS

ARM Architecture Procedure Call Standard.

AArch32 state

The ARM 32-bit execution state that uses 32-bit general-purpose registers,
and a 32-bit Program Counter (PC), Stack Pointer (SP), and Link Register
(LR). AArch32 execution state provides a choice of two instruction sets,
A32 and T32, previously called the ARM and Thumb instruction sets.

AArch64 state

The ARM 64-bit execution state that uses 64-bit general-purpose registers,
and a 64-bit Program Counter (PC), Stack Pointer (SP), and Exception
Link Registers (ELR). AArch64 execution state provides a single
instruction set, A64.

ABI

Application Binary Interface.

ACE

AXI Coherency Extensions.

AES

Advanced Encryption Standard.

AMBA®

Advanced Microcontroller Bus Architecture.

AMP

Asymmetric Multi-Processing.

ARM ARM

The ARM Architecture Reference Manual.

ASIC

Application Specific Integrated Circuit.

ASID

Address Space ID.

AXI

Advanced eXtensible Interface.

BE8

Byte Invariant Big-Endian Mode.

BTAC

Branch Target Address Cache.

BTB

Branch Target Buffer.

CCI

Cache Coherent Interconnect.

CHI

Coherent Hub Interface.

CP15

Coprocessor 15 for AArch32 and ARMv7-A: the System control coprocessor.

DAP

Debug Access Port.

DMA

Direct Memory Access.

DMB

Data Memory Barrier.

DS-5™

The ARM Development Studio.

DSB

Data Synchronization Barrier.

DSP

Digital Signal Processing.

DSTREAM

An ARM debug and trace unit.

DVFS

Dynamic Voltage/Frequency Scaling.

EABI

Embedded ABI.

ECC

Error Correcting Code.

ECT

Embedded Cross Trigger.

EL0

Exception level used to execute user applications.

EL1

Exception level normally used to run operating systems.

EL2

Hypervisor Exception level. In the Normal world, or Non-Secure state,
this is used to execute hypervisor code.

EL3

Secure Monitor Exception level. This is used to execute the code that
guards transitions between the Secure and Normal worlds.

ETB

Embedded Trace Buffer™.

ETM

Embedded Trace Macrocell™.

Execution state

The operational state of the processor, either 64-bit (AArch64) or 32-bit
(AArch32).

FIQ

An interrupt type (formerly fast interrupt).

FPSCR

Floating-Point Status and Control Register.

GCC

GNU Compiler Collection.

GIC

Generic Interrupt Controller.

Harvard architecture

Architecture with physically separate storage and signal pathways for
instructions and data.

HCR

Hyp Configuration Register.

HMP

Heterogeneous Multi-Processing.

IMPLEMENTATION DEFINED

Some properties of the processor are defined by the manufacturer.

IPA

Intermediate Physical Address.

IRQ

Interrupt Request, normally for external interrupts.

ISA

Instruction Set Architecture.

ISB

Instruction Synchronization Barrier.

ISR

Interrupt Service Routine.

Jazelle™

The ARM bytecode acceleration technology.

LLP64

Indicates the size in bits of basic C data types. Under LLP64, int and long
data types are 32 bits; pointers and long long are 64 bits.

LP64

Indicates the size in bits of basic C data types. Under LP64 int types are
32 bits, all others are 64 bits.

LPAE

Large Physical Address Extension.

LSB

Least Significant Bit.

MESI

A cache coherency protocol with four states that are Modified, Exclusive,
Shared and Invalid.

MMU

Memory Management Unit.

MOESI

A cache coherency protocol with five states that are Modified, Owned,
Exclusive, Shared and Invalid.

Monitor mode

When EL3 is using AArch32, the PE mode in which the Secure Monitor
must execute. This mode guards transitions between the Secure and
Normal worlds.

MPU

Memory Protection Unit.

NEON™

The ARM Advanced SIMD Extensions.

NIC

Network InterConnect.

Normal world

The execution environment when the processor is in the Non-secure state.

PCS

Procedure Call Standard.

PIPT

Physically Indexed, Physically Tagged.

PoC

Point of Coherency.

PoU

Point of Unification.

PSR

Program Status Register.

SCU

Snoop Control Unit.

Secure world

The execution environment when the processor is in the Secure state.

SIMD

Single Instruction, Multiple Data.

SMC

Secure Monitor Call. An ARM assembler instruction that causes an
exception that is taken synchronously to EL3.

SMC32

32-bit SMC calling convention.

SMC64

64-bit SMC calling convention.

SMC Function Identifier

A 32-bit integer which identifies which function is being invoked by this
SMC call. Passed in R0 or W0 to every SMC call.

SMMU

System MMU.

SMP

Symmetric Multi-Processing.

SoC

System on Chip.

SP

Stack Pointer.

SPSR

Saved Program Status Register.

Streamline

A graphical performance analysis tool.

SVC

Supervisor Call instruction.

SYS

System Mode.

Thumb®

An instruction set extension to ARM.

Thumb-2

A technology extending the Thumb instruction set to support both 16-bit
and 32-bit instructions.

TLB

Translation Lookaside Buffer.

TrustedOS

This is the operating system running in the Secure World. It supports the
execution of trusted applications in Secure EL0. When EL3 is using
AArch64 it executes in Secure EL1. When EL3 is using AArch32 it
executes in Secure EL3 modes other than Monitor mode.

TrustZone®

The ARM security extension.

TTB

Translation Table Base.

TTBR

Translation Table Base Register.

UART

Universal Asynchronous Receiver/Transmitter.

UEFI

Unified Extensible Firmware Interface.

U-Boot

A Linux Bootloader.

UNK

Unknown.

UNKNOWN

Values in a register cannot be known before they are reset.

UNPREDICTABLE

The value taken cannot be predicted.

USR

User mode, a non-privileged processor mode.

VFP

The ARM floating-point instruction set. Before ARMv7, the VFP
extension was called the Vector Floating-Point architecture, and was used
for vector operations.

VIPT

Virtually Indexed, Physically Tagged.

VMID

Virtual Machine Identifier.

XN

Execute Never.

References
ANSI/IEEE Std 754-1985, “IEEE Standard for Binary Floating-Point Arithmetic”.
ANSI/IEEE Std 754-2008, “IEEE Standard for Floating-Point Arithmetic”.
ANSI/IEEE Std 1003.1-1990, “Standard for Information Technology - Portable Operating
System Interface (POSIX) Base Specifications, Issue 7”.
ANSI/IEEE Std 1149.1-2001, “IEEE Standard Test Access Port and Boundary-Scan
Architecture”.
The ARMv8 Architecture Reference Manual, known as the ARM ARM, fully describes the
ARMv8 instruction set architecture, programmer’s model, system registers, debug features and
memory model. It forms a detailed specification to which all implementations of ARM
processors must adhere.
References to the ARM Architecture Reference Manual in this document are to:
ARM® Architecture Reference Manual - ARMv8, for ARMv8-A architecture profile (ARM DDI
0487).
Note
In the event of a contradiction between this book and the ARM ARM, the ARM ARM is
definitive and must take precedence. In most instances, however, the ARM ARM and the
Cortex-A Series Programmer’s Guide for ARMv8-A cover two separate world views. The most
likely scenario is that this book describes something in a way that does not cover all
architecturally permitted behaviors, or simply rewords an abstract concept in more practical
terms.
ARM® Cortex®-A Series Programmer’s Guide for ARMv7-A (DEN 0013).
ARM® NEON™ Programmer’s Guide (DEN 0018).
ARM® Cortex®-A53 MPCore Processor Technical Reference Manual (DDI 0500).
ARM® Cortex®-A57 MPCore Processor Technical Reference Manual (DDI 0488).
ARM® Generic Interrupt Controller Architecture Specification (ARM IHI 0048).
ARM® Compiler armasm Reference Guide v6.01 (DUI 0802).
ARM® Compiler Software Development Guide v5.05 (DUI 0471).
ARM® C Language Extensions (IHI 0053).
ELF for the ARM® Architecture (ARM IHI 0044).
The individual processor Technical Reference Manuals provide a detailed description of the
processor behavior. They can be obtained from the ARM website documentation area
http://infocenter.arm.com.
Connected community
The ARM Connected Community makes it easier to design using ARM processors and IP. It is
an interactive platform containing information, discussions and blogs which help you to develop
an ARM-based design efficiently, in collaboration with ARM engineers and our 1200+
ecosystem Partners and enthusiasts. Visitors also use the community to find new companies to
work with from the many ARM Partners who first introduced their products and services in their
dedicated area. You can join the Connected Community on http://community.arm.com.

Feedback on this book
ARM hopes you find the Cortex-A Series Programmer’s Guide for ARMv8-A easy to read, yet
detailed enough to provide a comprehensive introduction to using the processors.
If you have any comments on this book, do not understand our explanations, think something is
missing, or think that it is incorrect, send an e-mail to errata@arm.com. Give:
• The title.
• The number, ARM DEN0024A.
• The page number(s) to which your comments apply.
• What you think needs to be changed.
ARM also welcomes general suggestions for additions and improvements.

Chapter 1
Introduction

ARMv8-A is the latest generation of the ARM architecture that is targeted at the Applications
Profile. In this book, the name ARMv8 is used to describe the overall architecture, which now
includes both 32-bit and 64-bit execution states. ARMv8 introduces the ability to execute with
64-bit wide registers, while providing mechanisms for backwards compatibility that enable
existing ARMv7 software to be executed.
AArch64 is the name used to describe the 64-bit execution state of the ARMv8 architecture.
AArch32 describes the 32-bit execution state of the ARMv8 architecture, which is almost
identical to ARMv7. GNU and Linux documentation (except for Red Hat and Fedora
distributions) sometimes refers to AArch64 as ARM64.
Because many of the concepts of the ARMv8-A architecture are shared with the ARMv7-A
architecture, the details of all those concepts are not covered here. For a general introduction to
the ARMv7-A architecture, refer to the ARM® Cortex®-A Series Programmer’s Guide for
ARMv7-A. That guide can also help you to familiarize yourself with some of the concepts
discussed in this volume. Like most versions of the ARM architecture, the ARMv8-A architecture
profile is backwards compatible with earlier iterations, so there is a certain amount of
overlap between the way the ARMv8 architecture and previous architectures function. The
general principles of the ARMv7 architecture are covered only where needed to explain the
differences between the ARMv8 and earlier ARMv7 architectures.
Cortex-A series processors now include both ARMv8-A and ARMv7-A implementations:

• The Cortex-A5, Cortex-A7, Cortex-A8, Cortex-A9, Cortex-A15, and Cortex-A17
  processors all implement the ARMv7-A architecture.

• The Cortex-A53 and Cortex-A57 processors implement the ARMv8-A architecture.

ARMv8 processors still support software written for the ARMv7-A processors, with some
exceptions. This means, for example, that 32-bit code written for the ARMv7 Cortex-A series
processors also runs on ARMv8 processors such as the Cortex-A57, but only when the ARMv8
processor is in the AArch32 execution state. The A64 64-bit instruction set, by contrast, runs
only on ARMv8 processors and does not run on ARMv7 processors.
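
As a rough illustration of the split between the two execution states, source code can select a
path at compile time depending on whether it is being built for AArch64 or AArch32. The sketch
below assumes a GCC- or Clang-compatible compiler, which predefines __aarch64__ for AArch64
targets and __arm__ for 32-bit ARM targets. The function names build_target and pointer_bits
are hypothetical and chosen only for this example.

```c
#include <assert.h>
#include <limits.h>

/* Assumption: 32-bit ARM builds use 32-bit pointers. This check is only
 * compiled when the compiler reports a 32-bit ARM target. */
#if defined(__arm__)
static_assert(sizeof(void *) == 4, "32-bit ARM builds are expected to use 32-bit pointers");
#endif

/* Hypothetical helper: returns a label for the instruction set this file
 * was compiled for. The macros are predefined by GCC and Clang. */
static const char *build_target(void)
{
#if defined(__aarch64__)
    return "AArch64 (A64)";        /* runs only on ARMv8 processors */
#elif defined(__arm__)
    return "AArch32 (A32/T32)";    /* runs on ARMv7, and on ARMv8 in AArch32 state */
#else
    return "non-ARM target";
#endif
}

/* Hypothetical helper: width of a pointer, in bits, on the build target. */
static unsigned pointer_bits(void)
{
    return (unsigned)(sizeof(void *) * CHAR_BIT);
}
```

Calling build_target() from a test program is one simple way to confirm which execution state a
binary was built for before trying to run it on a given processor.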
Some knowledge of the C programming language and microprocessors is assumed of the
readers of this book. There are pointers to further reading, referring to books and websites that
can give you a deeper level of background to the subject matter.

The change from 32-bit to 64-bit
There are several performance gains derived from moving to a 64-bit processor.
•

The A64 instruction set provides some significant performance benefits, including a
larger register pool. The additional registers and the ARM Architecture Procedure Call
Standard (AAPCS) provide a performance boost when you must pass more than four
registers in a function call. On ARMv7, this would require using the stack, whereas in
AArch64 up to eight parameters can be passed in registers.

•

Wider integer registers enable code that operates on 64-bit data to work more efficiently.
A 32-bit processor might require several operations to perform an arithmetic operation on
64-bit data. A 64-bit processor might be able to perform the same task in a single
operation, typically at the same speed required by the same processor to perform a 32-bit
operation. Therefore, code that performs many 64-bit sized operations is significantly
faster.

•

64-bit operation enables applications to use a larger virtual address space. While the Large
Physical Address Extension (LPAE) extends the physical address space of a 32-bit
processor to 40-bit, it does not extend the virtual address space. This means that even with
LPAE, a single application is limited to a 32-bit (4GB) address space. This is because
some of this address space is reserved for the operating system.

•

Software running on a 32-bit architecture might need to map some data in or out of
memory while executing. Having a larger address space, with 64-bit pointers, avoids this
problem. However, using 64-bit pointers does incur some cost. The same piece of code
typically uses more memory when running with 64-bit pointers than with 32-bit pointers.
Each pointer is stored in memory and requires eight bytes instead of four. This might
sound trivial, but can add up to a significant penalty. Furthermore, the increased usage of
memory space associated with a move to 64 bits can cause a drop in the number of
accesses that hit in the cache. This in turn can reduce performance.
The larger virtual address space also enables memory-mapping larger files. This is the
mapping of the file contents into the memory map of a thread. This can occur even though
the physical RAM might not be large enough to contain the whole file.
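
As a rough illustration of the point about wider integer registers, the following C sketch models how a 32-bit core has to build one 64-bit addition from two 32-bit additions and a carry, while a 64-bit core can perform it as a single operation. The u64_pair type and all function names are hypothetical and exist only for this example. A real compiler for a 32-bit ARM target would typically emit an add and add-with-carry instruction pair rather than the explicit comparison shown here.

```c
#include <assert.h>
#include <stdint.h>

/* A 64-bit value held as two 32-bit halves, the way a 32-bit core
 * holds it in a register pair. Hypothetical type for illustration. */
typedef struct {
    uint32_t lo;
    uint32_t hi;
} u64_pair;

/* Split a 64-bit value into its low and high words. */
static u64_pair u64_to_pair(uint64_t v)
{
    u64_pair p = { (uint32_t)v, (uint32_t)(v >> 32) };
    return p;
}

/* Join the two halves back into a 64-bit value. */
static uint64_t pair_to_u64(u64_pair p)
{
    return ((uint64_t)p.hi << 32) | p.lo;
}

/* 64-bit add using only 32-bit operations: add the low words,
 * detect the carry out, then add the high words plus the carry. */
static u64_pair add64_on_32bit(u64_pair a, u64_pair b)
{
    u64_pair r;
    r.lo = a.lo + b.lo;                 /* wraps modulo 2^32 */
    uint32_t carry = (r.lo < a.lo);     /* 1 if the low add overflowed */
    r.hi = a.hi + b.hi + carry;
    return r;
}

/* The same addition with native 64-bit arithmetic. */
static uint64_t add64_native(uint64_t a, uint64_t b)
{
    return a + b;
}
```

Both routines give the same result for any inputs. The difference is the number of dependent operations needed to reach it.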

ARM DEN0024A
ID050815


1.1

How to use this book
This book provides a single guide for programmers who want to use the Cortex-A series
processors that implement the ARMv8 architecture. The guide brings together information from
a wide variety of sources that is useful to both ARM assembly language and C programmers. It
is meant to complement rather than replace other ARM documentation available for ARMv8
processors. For specific information, other documents include the ARM Technical
Reference Manuals (TRMs) for the processors themselves, documentation for individual
devices or boards and, most importantly, the ARM Architecture Reference Manual - ARMv8, for
ARMv8-A architecture profile (the ARM ARM).
This book is not written at an introductory level. It assumes some knowledge of the C
programming language and microprocessors. Hardware concepts such as caches and Memory
Management Units are covered, but only where this knowledge is valuable to the application
writer. The book looks at the way operating systems utilize ARMv8 features, and how to take
full advantage of the capabilities of the ARMv8 processors. Some chapters contain pointers to
additional reading. We also refer to books and web sites that can give a deeper level of
background to the subject matter, but often the main focus is the ARM-specific detail. No
assumptions are made on the use of any particular toolchain, and both GNU and ARM tools are
mentioned throughout the book.
If you are new to the ARMv8 architecture, Chapter 2 ARMv8-A Architecture and Processors
describes the previous 32-bit ARM architectures, introduces ARMv8, and describes some of the
properties of the ARMv8 processors. Next, Chapter 3 Fundamentals of ARMv8 describes the
building blocks of the architecture in the form of Exception levels and Execution states.
Chapter 4 ARMv8 Registers then describes the registers available to you in the ARMv8
architecture.
One of the most significant changes introduced in the ARMv8 architecture is the addition of a
64-bit instruction set, which complements the existing 32-bit architecture. Chapter 5 An
Introduction to the ARMv8 Instruction Sets describes the differences between the Instruction Set
Architecture (ISA) of ARMv7 (A32), and that of the A64 instruction set. Chapter 6 The A64
instruction set looks at the Instruction Set and its use in more detail. In addition to a new
instruction set for general operation, ARMv8 also has a changed NEON and floating-point
instruction set. Chapter 7 AArch64 Floating-point and NEON describes the changes in ARMv8
to ARM Advanced SIMD (NEON) and floating-point instructions. For a more detailed guide to
NEON and its capabilities at ARMv7, refer to the ARM® NEON™ Programmer’s Guide.
Chapter 8 Porting to A64 of this book covers the problems you might encounter when porting
code from other architectures, or previous ARM architectures to ARMv8. Chapter 9 The ABI
for ARM 64-bit Architecture describes the Application Binary Interface (ABI) for the ARM
architecture specification. The ABI is a specification for all the programming behavior of an
ARM target, which governs the form your 64-bit code takes. Chapter 10 AArch64 Exception
Handling describes the exception handling behavior of ARMv8 in AArch64 state.
Following this, the focus moves to the internal architecture of the processor. Chapter 11 Caches
describes the design of caches and how the use of caches can improve performance.
An important motivating factor behind ARMv8 and moving to a 64-bit architecture is
potentially enabling access to larger address space than is possible using just 32 bits. Chapter 12
The Memory Management Unit describes how the MMU converts virtual memory addresses to
physical addresses.
Chapter 13 Memory Ordering describes the weakly-ordered model of memory in the ARMv8
architecture. Generally, this means that the order of memory accesses is not required to be the
same as the program order for load and store operations. Only some programmers must be aware
of memory ordering issues. If your code interacts directly with the hardware or with code
executing on other cores, directly loads or writes instructions to be executed, or modifies page
tables, then you might have to think about ordering and barriers. This also applies if you are
implementing your own synchronization functions or lock-free algorithms.
Chapter 14 Multi-core processors describes how the ARMv8-A architecture supports systems
with multiple cores. Systems that use the ARMv8 processors are almost always implemented in
such a way. Chapter 15 Power Management describes the hardware features that ARM cores
use to reduce power consumption. A further aspect of power management, applied to multi-core and
multi-cluster systems, is covered in Chapter 16 big.LITTLE Technology. This chapter describes
how big.LITTLE technology from ARM couples together an energy efficient LITTLE core with
a high performance big core, to provide a system with high performance and power efficiency.
Chapter 17 Security describes how the ARMv8 processors can create a Secure, or trusted system
that protects assets such as passwords or credit card details from unauthorized copying or
damage. The main part of the book then concludes with Chapter 18 Debug describing the
standard debug and trace features available in the Cortex-A53 and Cortex-A57 processors.


Chapter 2
ARMv8-A Architecture and Processors

The ARM architecture dates back to 1985, but it has not stayed static. On the contrary, it has
developed massively since the early ARM cores, adding features and capabilities at each step:
ARMv4 and earlier
These early processors used only the ARM 32-bit instruction set.
ARMv4T

The ARMv4T architecture added the Thumb 16-bit instruction set to the ARM
32-bit instruction set. This was the first widely licensed architecture. It was
implemented by the ARM7TDMI® and ARM9TDMI® processors.

ARMv5TE

The ARMv5TE architecture added improvements for DSP-type operations,
saturated arithmetic, and for ARM and Thumb interworking. The ARM926EJ-S®
implements this architecture.
ARMv6

ARMv6 made several enhancements, including support for unaligned memory
accesses, significant changes to the memory architecture and for multi-processor
support. Additionally, some support for SIMD operations operating on bytes or
halfwords within the 32-bit registers was included. The ARM1136JF-S®
implements this architecture. The ARMv6 architecture also provided some
optional extensions, notably Thumb-2 and Security Extensions (TrustZone®).
Thumb-2 extends Thumb to be a mixed length 16-bit and 32-bit instruction set.

ARMv7-A

The ARMv7-A architecture makes the Thumb-2 extensions mandatory and adds
the Advanced SIMD extensions (NEON). Before ARMv7, all cores conformed to
essentially the same architecture or feature set. To help address an increasing
range of differing applications, ARM introduced a set of architecture profiles:
•

ARMv7-A provides all the features necessary to support a platform
Operating System such as Linux.

•

ARMv7-R provides predictable real-time high-performance.

•

ARMv7-M is targeted at deeply-embedded microcontrollers.

An M profile was also added to the ARMv6 architecture to enable features
for the older architecture. The ARMv6-M profile is used by low-cost
microprocessors with low power consumption.

2.1

ARMv8-A
The ARMv8-A architecture is the latest generation ARM architecture targeted at the
Applications Profile. The name ARMv8 is used to describe the overall architecture, which now
includes both 32-bit execution and 64-bit execution. It introduces the ability to perform
execution with 64-bit wide registers, while preserving backwards compatibility with existing
ARMv7 software.

[Figure 2-1: block diagram. Labels shown: v5 with VFPv2; Thumb-2, TrustZone, and SIMD; VFPv3/v4 and NEON; key feature, ARMv7-A compatibility; AArch32 with the A32 and T32 ISAs, scalar FP (SP and DP), and Advanced SIMD (SP float); AArch64 with the A64 ISA, scalar FP (SP and DP), and Advanced SIMD (SP and DP float); Crypto.]

Figure 2-1 Development of the ARMv8 architecture

The ARMv8-A architecture introduces a number of changes, which enable significantly higher
performance processor implementations to be designed.
Large physical address
This enables the processor to access beyond 4GB of physical memory.
64-bit virtual addressing
This enables virtual memory beyond the 4GB limit. This is important for modern
desktop and server software using memory mapped file I/O or sparse addressing.
Automatic event signaling
This enables power-efficient, high-performance spinlocks.
Larger register files
Thirty-one 64-bit general-purpose registers increase performance and reduce
stack use.
Efficient 64-bit immediate generation
There is less need for literal pools.
Large PC-relative addressing range
A +/-4GB addressing range for efficient data addressing within shared libraries
and position-independent executables.
Additional 16KB and 64KB translation granules
This reduces Translation Lookaside Buffer (TLB) miss rates and depth of page
walks.
New exception model
This reduces OS and hypervisor software complexity.
Efficient cache management
User space cache operations improve dynamic code generation efficiency. Fast
Data cache clear using a Data Cache Zero instruction.
Hardware-accelerated cryptography
Provides 3× to 10× better software encryption performance. This is useful for
small granule decryption and encryption that is too small to offload to a hardware
accelerator efficiently, for example HTTPS.
Load-Acquire, Store-Release instructions
Designed for C++11, C11, Java memory models. They improve performance of
thread-safe code by eliminating explicit memory barrier instructions.
NEON double-precision floating-point advanced SIMD
This enables SIMD vectorization to be applied to a much wider set of algorithms,
for example, scientific computing, High Performance Computing (HPC) and
supercomputers.
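
To make the Load-Acquire, Store-Release entry more concrete, the following is a minimal C11 sketch of the publish and consume pattern that these instructions are designed to support. The names payload, ready, publish, and try_consume are hypothetical. The assumption, not stated in this guide, is that a C11 compiler targeting AArch64 would typically implement the release store and acquire load with store-release and load-acquire instructions instead of separate barrier instructions.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

static int payload;          /* ordinary data, written before publication */
static atomic_bool ready;    /* publication flag, initially false */

/* Producer: write the data, then publish it with release ordering so
 * that the write to payload is visible before the flag is seen set. */
static void publish(int value)
{
    payload = value;
    atomic_store_explicit(&ready, true, memory_order_release);
}

/* Consumer: check the flag with acquire ordering. Returns true and
 * fills *out only when the data has been published. */
static bool try_consume(int *out)
{
    if (!atomic_load_explicit(&ready, memory_order_acquire))
        return false;
    *out = payload;
    return true;
}
```

In a real program, publish and try_consume would run on different threads. The acquire and release pair is what makes the plain access to payload safe without an explicit barrier.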


2.2

ARMv8-A Processor properties
Table 2-1 compares the properties of the processor implementations from ARM that support the
ARMv8-A architecture.
Table 2-1 Comparison of ARMv8-A processors

Processor                         Cortex-A53                    Cortex-A57
Release date                      July 2014                     January 2015
Typical clock speed               2GHz on 28nm                  1.5 to 2.5GHz on 20nm
Execution order                   In-order                      Out of order, speculative issue, superscalar
Cores                             1 to 4                        1 to 4
Integer peak throughput           2.3MIPS/MHz                   4.1 to 4.76MIPS/MHz (a)
Floating-point Unit               Yes                           Yes
Half-precision                    Yes                           Yes
Hardware Divide                   Yes                           Yes
Fused Multiply Accumulate         Yes                           Yes
Pipeline stages                   8                             15+
Return stack entries
Generic Interrupt Controller      External                      External
AMBA interface                    64-bit I/F AMBA 4 (supports   128-bit I/F AMBA 4 (supports
                                  AMBA 4 and AMBA 5)            AMBA 4 and AMBA 5)
L1 Cache size (Instruction)       8KB to 64KB                   48KB
L1 Cache structure (Instruction)  2-way set associative         3-way set associative
L1 Cache size (Data)              8KB to 64KB                   32KB
L1 Cache structure (Data)         4-way set associative         2-way set associative
L2 Cache                          Optional                      Integrated
L2 Cache size                     128KB to 2MB                  512KB to 2MB
L2 Cache structure                16-way set associative        16-way set associative
Main TLB entries                  512                           1024
uTLB entries                                                    48 I-side, 32 D-side

a. IMPLEMENTATION DEFINED


2.2.1

ARMv8 processors
This section describes each of the processors that implement the ARMv8-A architecture. It only
gives a general description in each case. For more specific information on each processor, see
Table 2-1 on page 2-5.
The Cortex-A53 processor
The Cortex-A53 processor is a mid-range, low-power processor with between one and four
cores in a single cluster, each with an L1 cache subsystem, an optional integrated GICv3/4
interface, and an optional L2 cache controller.
The Cortex-A53 processor is an extremely power efficient processor capable of supporting
32-bit and 64-bit code. It delivers significantly higher performance than the highly successful
Cortex-A7 processor. It is capable of deployment as a standalone applications processor, or
paired with the Cortex-A57 processor in a big.LITTLE configuration for optimum performance,
scalability, and energy efficiency.

[Figure 2-2: block diagram of the Cortex-A53 processor. Blocks shown: ARM CoreSight Multicore Debug and Trace; Generic Interrupt Controller; up to four cores, each with a Data Processing Unit, NEON Data Engine with crypto extension, Floating-point unit, Level 1 Instruction Cache, Level 1 Data Cache with ECC, Memory Management Unit, and Performance Monitor Unit; SCU; ACP; Integrated Level 2 Cache with ECC; AMBA 4 ACE or AMBA 5 CHI Coherent Bus Interface.]

Figure 2-2 Cortex-A53 processor

The Cortex-A53 processor has the following features:

•

In-order, eight stage pipeline.

•

Lower power consumption from the use of hierarchical clock gating, power domains, and
advanced retention modes.

•

Increased dual-issue capability from duplication of execution resources and dual
instruction decoders.

•

Power-optimized L2 cache design delivers lower latency and balances performance with
efficiency.

The Cortex-A57 processor
The Cortex-A57 processor is targeted at mobile and enterprise computing applications,
including compute intensive 64-bit applications such as high end computer, tablet, and server
products. It can be used with the Cortex-A53 processor in an ARM big.LITTLE configuration,
for scalable performance and more efficient energy use.
The Cortex-A57 processor features cache coherent interoperability with other processors,
including the ARM Mali™ family of Graphics Processing Units (GPUs) for GPU compute and
provides optional reliability and scalability features for high-performance enterprise
applications. It provides significantly more performance than the ARMv7 Cortex-A15
processor, at a higher level of power efficiency. The inclusion of cryptography extensions
improves performance on cryptography algorithms by 10 times over the previous generation of
processors.

[Figure 2-3: block diagram of the Cortex-A57 processor. Blocks shown: ARM CoreSight Multicore Debug and Trace; Generic Interrupt Controller; up to four cores (0 to 3), each with a NEON Data Engine with crypto extension, Floating-point unit, Level 1 Instruction Cache, Level 1 Data Cache with ECC, Memory Protection Unit, and Performance Monitor Unit; SCU; ACP; Integrated Level 2 Cache with ECC; AMBA 4 ACE or AMBA 5 CHI Coherent Bus Interface.]

Figure 2-3 Cortex-A57 processor core

The Cortex-A57 processor fully implements the ARMv8-A architecture. It enables multi-core
operation with between one and four cores multi-processing within a single cluster. Multiple
coherent SMP clusters are possible, through AMBA5 CHI or AMBA 4 ACE technology. Debug
and trace are available through CoreSight technology.
The Cortex-A57 processor has the following features:

•

Out-of-order, 15+ stage pipeline.

•

Power-saving features include way-prediction, tag-reduction, and cache-lookup
suppression.

•

Increased peak instruction throughput through duplication of execution resources.
Power-optimized instruction decode with localized decoding, 3-wide decode bandwidth.

•

Performance optimized L2 cache design enables more than one core in the cluster to
access the L2 at the same time.

Chapter 3
Fundamentals of ARMv8

In ARMv8, execution occurs at one of four Exception levels. In AArch64, the Exception level
determines the level of privilege, in a similar way to the privilege levels defined in ARMv7,
so execution at ELn corresponds to privilege PLn. An Exception level with a larger value of n
than another one is described as being at a higher Exception level. An Exception level with a
smaller value of n than another is described as being at a lower Exception level.
Exception levels provide a logical separation of software execution privilege that applies across
all operating states of the ARMv8 architecture. It is similar to, and supports the concept of,
hierarchical protection domains common in computer science.
The following is a typical example of what software runs at each Exception level:

EL0     Normal user applications.
EL1     Operating system kernel, typically described as privileged.
EL2     Hypervisor.
EL3     Low-level firmware, including the Secure Monitor.

[Figure 3-1: the Normal world, with Applications at EL0, Kernels at EL1, the Hypervisor at EL2, and the Secure monitor at EL3.]

Figure 3-1 Exception levels

In general, a piece of software, such as an application, the kernel of an operating system, or a
hypervisor, occupies a single Exception level. An exception to this rule is in-kernel hypervisors
such as KVM, which operate across both EL2 and EL1.
ARMv8-A provides two security states, Secure and Non-secure. The Non-secure state is also
referred to as the Normal World. This enables an Operating System (OS) to run in parallel with
a trusted OS on the same hardware, and provides protection against certain software attacks and
hardware attacks. ARM TrustZone technology enables the system to be partitioned between the
Normal and Secure worlds. As with the ARMv7-A architecture, the Secure monitor acts as a
gateway for moving between the Normal and Secure worlds.

[Figure 3-2: the Normal world, with Applications at EL0, Guest OSs at EL1, and the Hypervisor at EL2, beside the Secure world, with Secure firmware and a Trusted OS and no Hypervisor. The Secure monitor is at EL3.]

Figure 3-2 ARMv8 Exception levels in the Normal and Secure worlds

ARMv8-A also provides support for virtualization, though only in the Normal world. This
means that hypervisor, or Virtual Machine Manager (VMM) code can run on the system and
host multiple guest operating systems. Each of the guest operating systems is, essentially,
running on a virtual machine. Each OS is then unaware that it is sharing time on the system with
other guest operating systems.


The Normal world (which corresponds to the Non-secure state) has the following privileged
components:
Guest OS kernels
Such kernels include Linux or Windows running in Non-secure EL1. When
running under a hypervisor, the rich OS kernels can be running as a guest or host
depending on the hypervisor model.
Hypervisor
This runs at EL2, which is always Non-secure. The hypervisor, when present and
enabled, provides virtualization services to rich OS kernels.
The Secure world has the following privileged components:
Secure firmware
On an application processor, this firmware must be the first thing that runs at boot
time. It provides several services, including platform initialization, the
installation of the trusted OS, and routing of Secure monitor calls.
Trusted OS
Trusted OS provides Secure services to the Normal world and provides a runtime
environment for executing Secure or trusted applications.
The Secure monitor in the ARMv8 architecture is at a higher Exception level and is more
privileged than all other levels. This provides a logical model of software privilege.
Figure 3-2 on page 3-2 shows that a Secure version of EL2 is not available.


3.1

Execution states
The ARMv8 architecture defines two Execution States, AArch64 and AArch32. Each state is
used to describe execution using 64-bit wide general-purpose registers or 32-bit wide
general-purpose registers, respectively. While ARMv8 AArch32 retains the ARMv7 definitions
of privilege, in AArch64, privilege level is determined by the Exception level. Therefore,
execution at ELn corresponds to privilege PLn.
When in AArch64 state, the processor executes the A64 instruction set. When in AArch32 state,
the processor can execute either the A32 (called ARM in earlier versions of the architecture) or
the T32 (Thumb) instruction set.
The following diagrams show the organization of the Exception levels in AArch64 and
AArch32.
In AArch64:

[Figure 3-3: the Normal world, with Applications at EL0, Guest OSs at EL1, and the Hypervisor at EL2, beside the Secure world, with Secure firmware and a Trusted OS and no Hypervisor. The Secure monitor is at EL3.]

Figure 3-3 Exception levels in AArch64

In AArch32:

[Figure 3-4: the Normal world, with Applications at EL0, Guest OSs at EL1, and the Hypervisor at EL2, beside the Secure world, with Secure firmware and no EL2. The Trusted kernel operates at EL3, with the Secure monitor.]

Figure 3-4 Exception levels in AArch32

In AArch32 state, Trusted OS software executes in Secure EL3, and in AArch64 state it
primarily executes in Secure EL1.


3.2

Changing Exception levels
In the ARMv7 architecture, the processor mode can change under privileged software control
or automatically when taking an exception. When an exception occurs, the core saves the
current execution state and the return address, enters the required mode, and possibly disables
hardware interrupts.
This is summarized in the following table. Applications operate at the lowest level of privilege,
PL0, previously unprivileged mode. Operating systems run at PL1, and the Hypervisor in a
system with the Virtualization extensions at PL2. The Secure monitor, which acts as a gateway
for moving between the Secure and Non-secure (Normal) worlds, also operates at PL1.
Table 3-1 ARMv7 processor modes

Mode              Function                                            Security state    Privilege level
User (USR)        Unprivileged mode in which most applications run    Both              PL0
FIQ               Entered on an FIQ interrupt exception               Both              PL1
IRQ               Entered on an IRQ interrupt exception               Both              PL1
Supervisor (SVC)  Entered on reset or when a Supervisor Call          Both              PL1
                  instruction (SVC) is executed
Monitor (MON)     Entered when the SMC instruction (Secure Monitor    Secure only       PL1
                  Call) is executed or when the processor takes an
                  exception which is configured for secure handling.
                  Provided to support switching between Secure and
                  Non-secure states.
Abort (ABT)       Entered on a memory access exception                Both              PL1
Undef (UND)       Entered when an undefined instruction is executed   Both              PL1
System (SYS)      Privileged mode, sharing the register view with     Both              PL1
                  User mode
Hyp (HYP)         Entered by the Hypervisor Call and Hyp Trap         Non-secure only   PL2
                  exceptions.

[Figure 3-5: in the Non-secure state, User mode is at PL0, the System (SYS), Supervisor (SVC), FIQ, IRQ, Undef (UND), and Abort (ABT) modes are at PL1, and Hyp mode is at PL2. In the Secure state, User mode is at PL0, and the same PL1 modes plus Monitor mode (MON) are at Secure PL1.]

Figure 3-5 ARMv7 privilege levels

In ARMv8, the AArch32 processor modes are mapped onto the Exception levels as in Figure 3-6. As in
ARMv7 (AArch32), when an exception is taken, the processor changes to the Exception level
(mode) that supports the handling of the exception.

[Figure 3-6: User mode maps to EL0. The SVC, ABT, IRQ, FIQ, UND, and SYS modes map to EL1. Hyp mode maps to EL2. Mon mode maps to EL3. The Normal world contains Applications, Guest OSs, and the Hypervisor. The Secure world contains Secure firmware and a Trusted OS, with no Hypervisor. The Secure monitor is at EL3.]

Figure 3-6 AArch32 processor modes

Movement between Exception levels follows these rules:

•

Moves to a higher Exception level, such as from EL0 to EL1, indicate increased software
execution privilege.

•

An exception cannot be taken to a lower Exception level.

•

There is no exception handling at level EL0; exceptions must be handled at a higher
Exception level.

•

An exception causes a change of program flow. Execution of an exception handler starts,
at an Exception level higher than EL0, from a defined vector that relates to the exception
taken. Exceptions include:
- Interrupts such as IRQ and FIQ.
- Memory system aborts.
- Undefined instructions.
- System calls. These permit unprivileged software to make a system call to an
operating system.
- Secure monitor or hypervisor traps.

•

Ending exception handling and returning to the previous Exception level is performed by
executing the ERET instruction.

•

Returning from an exception can stay at the same Exception level or enter a lower
Exception level. It cannot move to a higher Exception level.

•

The security state does not change with a change of Exception level, except when returning
from EL3 to a Non-secure state. See Switching between Secure and Non-secure state on
page 17-8.
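
The direction rules in this list can be summarized as two small predicates. The following C sketch is only a model of those rules. The exception_level type and both function names are hypothetical, and the model deliberately ignores security state and which Exception levels a given processor implements.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { EL0 = 0, EL1 = 1, EL2 = 2, EL3 = 3 } exception_level;

/* An exception is taken to the same or a higher Exception level,
 * and never to EL0, because EL0 has no exception handling. */
static bool can_take_exception(exception_level from, exception_level to)
{
    return to != EL0 && to >= from;
}

/* An exception return (ERET) stays at the same Exception level or
 * moves to a lower one. Handlers never run at EL0, so a return
 * cannot start there. */
static bool can_return_to(exception_level from, exception_level to)
{
    return from != EL0 && to <= from;
}
```

For example, can_take_exception(EL0, EL1) holds, which is the system call case, while can_take_exception(EL2, EL1) and can_return_to(EL1, EL2) do not.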


3.3

Changing execution state
There are times when you must change the execution state of your system. This could be, for
example, if you are running a 64-bit operating system, and want to run a 32-bit application at
EL0. To do this, the system must change to AArch32.
When the application has completed or execution returns to the OS, the system can switch back
to AArch64. Figure 3-7 on page 3-9 shows that you cannot do it the other way around. An
AArch32 operating system cannot host a 64-bit application.
To change between execution states at the same Exception level, you have to switch to a higher
Exception level then return to the original Exception level. For example, you might have 32-bit
and 64-bit applications running under a 64-bit OS. In this case, the 32-bit application can
execute and generate a Supervisor Call (SVC) instruction, or receive an interrupt, causing a
switch to EL1 and AArch64. (See Exception handling instructions on page 6-21.) The OS can
then do a task switch and return to EL0 in AArch64. Practically speaking, this means that you
cannot have a mixed 32-bit and 64-bit application, because there is no direct way of calling
between them.
You can only change execution state by changing Exception level. Taking an exception might
change from AArch32 to AArch64, and returning from an exception might change from AArch64
to AArch32.
Code at EL3 cannot take an exception to a higher Exception level, so cannot change execution
state, except by going through a reset.
The following is a summary of some of the points when changing between AArch64 and
AArch32 execution states:
•

Both AArch64 and AArch32 execution states have Exception levels that are generally
similar, but there are some differences between Secure and Non-secure operation. The
execution state the processor is in when the exception is generated can limit the Exception
levels available to the other execution state.

•

Changing to AArch32 requires going from a higher to a lower Exception level. This is the
result of exiting an exception handler by executing the ERET instruction. See Exception
handling instructions on page 6-21.

•

Changing to AArch64 requires going from a lower to a higher Exception level. The
exception can be the result of an instruction execution or an external signal.

•

If, when taking an exception or returning from an exception, the Exception level remains
the same, the execution state cannot change.

•

Where an ARMv8 processor operates in AArch32 execution state at a particular
Exception level, it uses the same exception model as in ARMv7 for exceptions taken to
that Exception level. In the AArch64 execution state, it uses the exception handling model
described in Chapter 10 AArch64 Exception Handling.

Interworking between the two states is therefore performed at the level of the Secure monitor,
hypervisor or operating system. A hypervisor or operating system executing in AArch64 state
can support AArch32 operation at lower privilege levels. This means that an OS running in
AArch64 can host both AArch32 and AArch64 applications. Similarly, an AArch64 hypervisor
can host both AArch32 and AArch64 guest operating systems. However, a 32-bit operating
system cannot host a 64-bit application and a 32-bit hypervisor cannot host a 64-bit guest
operating system.
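
The hosting rule described above can be written as a one-line check. The following C sketch is an illustrative model only. The exec_state type and the function names are hypothetical, and the second function simply restates the rule that the execution state cannot change unless the Exception level changes.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { AARCH32, AARCH64 } exec_state;

/* Software at a lower Exception level can use AArch64 only if the
 * software hosting it uses AArch64. AArch32 software can be hosted
 * by software in either execution state. */
static bool can_host(exec_state host, exec_state hosted)
{
    return host == AARCH64 || hosted == AARCH32;
}

/* The execution state can change only when an exception entry or
 * return also changes the Exception level (0 to 3). */
static bool state_change_allowed(unsigned from_el, unsigned to_el)
{
    assert(from_el <= 3 && to_el <= 3);
    return from_el != to_el;
}
```

With this model, can_host(AARCH64, AARCH32) is true, which is the case of a 64-bit OS running a 32-bit application, and can_host(AARCH32, AARCH64) is false.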


[Figure 3-7: applications at EL0, operating systems at EL1, and the Hypervisor at EL2. Annotations: an AArch64 OS can host a mix of AArch64 and AArch32 applications; an AArch64 hypervisor can host an AArch64 and AArch32 OS; an AArch32 OS cannot host an AArch64 application; an AArch32 hypervisor cannot host an AArch64 OS.]

Figure 3-7 Moving between AArch32 and AArch64

For the highest implemented Exception level (EL3 on the Cortex-A53 and Cortex-A57
processors), the execution state used when taking an exception to that level is
fixed. The execution state can only be changed by resetting the processor. For EL2 and EL1, it
is controlled by System registers. See System registers on page 4-7.


Chapter 4
ARMv8 Registers

The AArch64 execution state provides 31 × 64-bit general-purpose registers accessible at all
times and in all Exception levels.
Each register is 64 bits wide and they are generally referred to as registers X0-X30.

[Figure 4-1: the general-purpose registers X0/W0 to X30/W30, available at EL0, EL1, EL2, and EL3. X29 is labeled as the frame pointer and X30 as the procedure link register.]

Figure 4-1 AArch64 general-purpose registers

Each AArch64 64-bit general-purpose register (X0-X30) also has a 32-bit (W0-W30) form.

[Figure 4-2: Wn occupies the lower 32 bits, bits 31 to 0, of the 64-bit register Xn.]

Figure 4-2 64-bit register with W and X access.

The 32-bit W register forms the lower half of the corresponding 64-bit X register. That is, W0
maps onto the lower word of X0, and W1 maps onto the lower word of X1.
Reads from W registers disregard the higher 32 bits of the corresponding X register and leave
them unchanged. Writes to W registers set the higher 32 bits of the X register to zero. That is,
writing 0xFFFFFFFF into W0 sets X0 to 0x00000000FFFFFFFF.
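
The W and X register behavior described above can be modeled in C with fixed-width integer types. This is a sketch of the rule only, not of any hardware interface, and the function names read_w and write_w are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Reading Wn: only the low 32 bits of Xn are seen. Xn is unchanged. */
static uint32_t read_w(uint64_t x)
{
    return (uint32_t)x;
}

/* Writing Wn: the low 32 bits take the new value and the high
 * 32 bits of Xn become zero. Returns the new contents of Xn. */
static uint64_t write_w(uint64_t x_old, uint32_t value)
{
    (void)x_old;               /* the previous contents do not survive */
    return (uint64_t)value;    /* zero-extend to 64 bits */
}
```

For instance, write_w applied to an X register holding all ones, with the value 0xFFFFFFFF, yields 0x00000000FFFFFFFF, matching the example in the text.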


4.1

AArch64 special registers
In addition to the 31 core registers, there are also several special registers.

[Figure 4-3: special registers. XZR/WZR is the zero register and PC is the program counter. The stack pointers are SP_EL0, SP_EL1, SP_EL2, and SP_EL3, one for each of EL0 to EL3. The Program Status Registers are SPSR_EL1, SPSR_EL2, and SPSR_EL3. The Exception Link Registers are ELR_EL1, ELR_EL2, and ELR_EL3.]

Figure 4-3 AArch64 special registers

Note
There is no register called X31 or W31. Many instructions are encoded such that the number 31
represents the zero register, ZR (WZR/XZR). There is also a restricted group of instructions
where one or more of the arguments are encoded such that number 31 represents the Stack
Pointer (SP).
When accessing the zero register, all writes are ignored and all reads return 0. Note that the
64-bit form of the SP register does not use an X prefix.
Table 4-1 Special registers in AArch64

Name    Size       Description
WZR     32 bits    Zero register
XZR     64 bits    Zero register
WSP     32 bits    Current stack pointer
SP      64 bits    Current stack pointer
PC      64 bits    Program counter

In the ARMv8 architecture, when executing in AArch64, the exception return state is held in the
following dedicated registers for each Exception level:
•

Exception Link Register (ELR).

•

Saved Program Status Register (SPSR).

There is a dedicated SP per Exception level, but it is not used to hold return state.
Table 4-2 Special registers by Exception level

                                        EL0       EL1         EL2         EL3
Stack Pointer (SP)                      SP_EL0    SP_EL1      SP_EL2      SP_EL3
Exception Link Register (ELR)                     ELR_EL1     ELR_EL2     ELR_EL3
Saved Program Status Register (SPSR)              SPSR_EL1    SPSR_EL2    SPSR_EL3

4.1.1

Zero register
The zero register reads as zero when used as a source register and discards the result when used
as a destination register. You can use the zero register in most, but not all, instructions.
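The zero register behaviour can be pictured with a toy register file. This is a sketch under stated assumptions: the RegisterFile class is hypothetical, and it models only those instruction encodings in which register number 31 means the zero register rather than the stack pointer.

```python
class RegisterFile:
    """Toy model: X0-X30 are real registers, index 31 acts as the zero register."""

    def __init__(self):
        self.x = [0] * 31

    def read(self, index):
        # Reading register 31 always returns 0
        return 0 if index == 31 else self.x[index]

    def write(self, index, value):
        # Writes to register 31 are discarded
        if index != 31:
            self.x[index] = value & 0xFFFFFFFFFFFFFFFF

regs = RegisterFile()
regs.write(31, 1234)                 # ignored
regs.write(0, 42)
print(regs.read(31), regs.read(0))   # 0 42
```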

4.1.2

Stack pointer
In the ARMv8 architecture, the choice of stack pointer to use is separated to some extent from
the Exception level. By default, taking an exception selects the stack pointer for the target
Exception level, SP_ELn. For example, taking an exception to EL1 selects SP_EL1. Each
Exception level has its own stack pointer, SP_EL0, SP_EL1, SP_EL2, and SP_EL3.
When in AArch64 at an Exception level other than EL0, the processor can use either:
• A dedicated 64-bit stack pointer associated with that Exception level (SP_ELn).
• The stack pointer associated with EL0 (SP_EL0).

EL0 can only ever access SP_EL0.
Table 4-3 AArch64 Stack pointer options
Exception level | Options
EL0 | EL0t
EL1 | EL1t, EL1h
EL2 | EL2t, EL2h
EL3 | EL3t, EL3h

The t suffix indicates that the SP_EL0 stack pointer is selected. The h suffix indicates that the
SP_ELn stack pointer is selected.
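The t and h naming in Table 4-3 can be expressed as a small lookup. The function name stack_pointer_mode and its arguments are hypothetical. The spsel argument stands for the Stack Pointer selector, where 0 selects SP_EL0 and 1 selects SP_ELn.

```python
def stack_pointer_mode(el, spsel):
    """Return (mode name, selected stack pointer) for an Exception level and selector.

    spsel = 0 selects SP_EL0 (t suffix); spsel = 1 selects SP_ELn (h suffix).
    """
    if el not in (0, 1, 2, 3) or spsel not in (0, 1):
        raise ValueError("el must be 0-3 and spsel must be 0 or 1")
    if el == 0 or spsel == 0:
        return f"EL{el}t", "SP_EL0"      # EL0 can only ever use SP_EL0
    return f"EL{el}h", f"SP_EL{el}"

print(stack_pointer_mode(1, 1))   # ('EL1h', 'SP_EL1')
print(stack_pointer_mode(2, 0))   # ('EL2t', 'SP_EL0')
```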
The SP cannot be referenced by most instructions. However, some forms of arithmetic
instructions, for example, the ADD instruction, can read and write to the current stack pointer to
adjust the stack pointer in a function. For example:
ADD SP, SP, #0x10    // Move SP to 0x10 bytes above its current value

4.1.3

Program Counter
One feature of the original ARMv7 instruction set was the use of R15, the Program Counter
(PC) as a general-purpose register. The PC enabled some clever programming tricks, but it
introduced complications for compilers and the design of complex pipelines. Removing direct
access to the PC in ARMv8 makes return prediction easier and simplifies the ABI specification.
The PC is never accessible as a named register. Its use is implicit in certain instructions such as
PC-relative load and address generation. The PC cannot be specified as the destination of a data
processing instruction or load instruction.

4.1.4

Exception Link Register (ELR)
The Exception Link Register holds the exception return address.


4.1.5

Saved Program Status Register
When taking an exception, the processor state is stored in the relevant Saved Program Status
Register (SPSR), in a similar way to the CPSR in ARMv7. The SPSR holds the value of PSTATE
before taking an exception and is used to restore the value of PSTATE when executing an
exception return.

[Figure: 32-bit SPSR layout, bits 31 to 0, showing the fields N, Z, C, V, SS, IL, D, A, I, F, and M[3:0].]
Figure 4-4 SPSR

The individual bits represent the following values for AArch64:
N        Negative result (N flag).
Z        Zero result (Z flag).
C        Carry out (C flag).
V        Overflow (V flag).
SS       Software Step. Indicates whether software step was enabled when an exception
         was taken.
IL       Illegal Execution State bit. Shows the value of PSTATE.IL immediately before
         the exception was taken.
D        Process state Debug mask. Indicates whether debug exceptions from watchpoint,
         breakpoint, and software step debug events that are targeted at the Exception level
         the exception occurred in were masked or not.
A        SError (System Error) mask bit.
I        IRQ mask bit.
F        FIQ mask bit.
M[4]     Execution state that the exception was taken from. A value of 0 indicates
         AArch64.
M[3:0]   Mode or Exception level that an exception was taken from.

In ARMv8, the SPSR written to depends on the Exception level that the exception is taken to.
If the exception is taken to EL1, then SPSR_EL1 is used. If the exception is taken to EL2, then
SPSR_EL2 is used, and if the exception is taken to EL3, SPSR_EL3 is used. The core populates
the SPSR when taking an exception.
Note
The register pairs ELR_ELn and SPSR_ELn that are associated with an Exception level retain
their state during execution at a lower Exception level.
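As an illustration of the SPSR fields listed above, the following sketch decodes a 32-bit SPSR value. The bit positions are an assumption taken from the ARMv8-A Architecture Reference Manual, because the figure does not survive in this text, and the function name decode_spsr is hypothetical.

```python
# Bit positions assumed from the ARMv8-A Architecture Reference Manual
SPSR_FLAGS = {"N": 31, "Z": 30, "C": 29, "V": 28, "SS": 21, "IL": 20,
              "D": 9, "A": 8, "I": 7, "F": 6}

def decode_spsr(value):
    """Split an AArch64 SPSR value into named fields."""
    fields = {name: (value >> bit) & 1 for name, bit in SPSR_FLAGS.items()}
    fields["M[4]"] = (value >> 4) & 1
    fields["M[3:0]"] = value & 0xF
    if fields["M[4]"] == 0:                  # exception was taken from AArch64
        el = (value >> 2) & 0x3              # M[3:2] holds the Exception level
        fields["from"] = f"EL{el}" + ("h" if value & 1 else "t")
    return fields

print(decode_spsr(0x600003C5)["from"])   # EL1h
```

In this example value, the Z and C flags and the D, A, I, and F mask bits are set, and M[3:0] is 0b0101.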


4.2

Processor state
AArch64 does not have a direct equivalent of the ARMv7 Current Program Status Register
(CPSR). In AArch64, the components of the traditional CPSR are supplied as fields that can be
made accessible independently. These are referred to collectively as Processor State (PSTATE).
The Processor State, or PSTATE fields, for AArch64 have the following definitions:
Table 4-4 PSTATE field definitions
Name | Description
N | Negative condition flag.
Z | Zero condition flag.
C | Carry condition flag.
V | oVerflow condition flag.
D | Debug mask bit.
A | SError mask bit.
I | IRQ mask bit.
F | FIQ mask bit.
SS | Software Step bit.
IL | Illegal execution state bit.
EL (2) | Exception level.
nRW | Execution state: 0 = 64-bit, 1 = 32-bit.
SP | Stack Pointer selector: 0 = SP_EL0, 1 = SP_ELn.

In AArch64, you return from an exception by executing the ERET instruction. This copies
SPSR_ELn into PSTATE, which restores the ALU flags, execution state, and Exception level.
The processor then branches to the address held in ELR_ELn, and execution continues from there.
The PSTATE.{N, Z, C, V} fields can be accessed at EL0. All other PSTATE fields can be accessed
at EL1 or higher and are UNDEFINED at EL0.
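The roles of SPSR_ELn and ELR_ELn on exception entry and return can be sketched as follows. This is a deliberately simplified model and not the architectural pseudocode: the dictionary layout, the function names, and the addresses are assumptions, and details such as interrupt masking and stack pointer selection are left out.

```python
def take_exception(state, target_el, return_address, vector_address):
    """Save PSTATE and the return address in the target level's SPSR and ELR."""
    state[f"SPSR_EL{target_el}"] = dict(state["PSTATE"])
    state[f"ELR_EL{target_el}"] = return_address
    state["PSTATE"]["EL"] = target_el
    state["PC"] = vector_address

def eret(state):
    """Restore PSTATE from SPSR_ELn and branch to ELR_ELn (n = current level)."""
    n = state["PSTATE"]["EL"]
    state["PC"] = state[f"ELR_EL{n}"]
    state["PSTATE"] = dict(state[f"SPSR_EL{n}"])

state = {"PSTATE": {"EL": 0, "NZCV": 0b0110}, "PC": 0x1000}
take_exception(state, 1, return_address=0x1004, vector_address=0x80000)
eret(state)
print(state["PSTATE"]["EL"], hex(state["PC"]))   # 0 0x1004
```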


4.3

System registers
In AArch64, system configuration is controlled through system registers, and accessed using
MSR and MRS instructions. This contrasts with ARMv7-A, where such registers were typically
accessed through coprocessor 15 (CP15) operations. The name of a register tells you the lowest
Exception level that it can be accessed from.
For example:
• TTBR0_EL1 is accessible from EL1, EL2, and EL3.
• TTBR0_EL2 is accessible from EL2 and EL3.

Registers that have the suffix _ELn have a separate, banked copy in some or all of the levels,
though usually not EL0. Few system registers are accessible from EL0, although the Cache Type
Register (CTR_EL0) is an example of one that can be accessible.
Code to access system registers takes the following form:
MRS x0, TTBR0_EL1    // Move TTBR0_EL1 into x0
MSR TTBR0_EL1, x0    // Move x0 into TTBR0_EL1
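The naming rule described above, where the _ELn suffix gives the lowest Exception level that can access a register, can be sketched as a small helper. The function names are hypothetical. The sketch encodes only the naming convention. Whether an access succeeds on real hardware can also depend on trap and enable controls, such as the SCTLR_EL1.UCT bit that governs EL0 access to CTR_EL0.

```python
def lowest_exception_level(register_name):
    """Return the n in a register name ending in _ELn, for example TTBR0_EL1 -> 1."""
    base, sep, suffix = register_name.rpartition("_EL")
    if not sep or suffix not in ("0", "1", "2", "3"):
        raise ValueError(f"{register_name} has no _ELn suffix")
    return int(suffix)

def accessible_from(register_name):
    """List the Exception levels at or above the one named in the suffix."""
    return [f"EL{el}" for el in range(lowest_exception_level(register_name), 4)]

print(accessible_from("TTBR0_EL1"))   # ['EL1', 'EL2', 'EL3']
print(accessible_from("TTBR0_EL2"))   # ['EL2', 'EL3']
```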

Previous versions of the ARM architecture have used coprocessors for system configuration.
However, AArch64 does not include support for coprocessors. Table 4-5 lists only the system
registers mentioned in this book.
For a complete list, see Appendix J of the ARM Architecture Reference Manual - ARMv8, for
ARMv8-A architecture profile.
The table shows the Exception levels that have separate copies of each register. For example,
separate Auxiliary Control Registers (ACTLRs) exist as ACTLR_EL1, ACTLR_EL2 and
ACTLR_EL3.
Table 4-5 System registers
Name | Register | Description | Allowed values of n
ACTLR_ELn | Auxiliary Control Register | Controls processor-specific features. | 1, 2, 3
CCSIDR_ELn | Current Cache Size ID Register | Provides information about the architecture of the currently selected cache. See Cache discovery on page 11-18. |
CLIDR_ELn | Cache Level ID Register | The type of cache, or caches, implemented at each level. The Level of Coherency and Level of Unification for the cache hierarchy. See Cache maintenance on page 11-13. | 1, 2, 3
CNTFRQ_ELn | Counter-timer Frequency Register | Reports the frequency of the system timer. See Timers on page 14-5. |
CNTPCT_ELn | Counter-timer Physical Count Register | Holds the 64-bit current count value. See Timers on page 14-5. |
CNTKCTL_ELn | Counter-timer Kernel Control Register | Controls the generation of an event stream from the virtual counter. Also controls access from EL0 to the physical counter, virtual counter, EL1 physical timers, and the virtual timer. See Timers on page 14-5. |
CNTP_CVAL_ELn | Counter-timer Physical Timer Compare Value Register | Holds the compare value for the EL1 physical timer. See Timers on page 14-5. |
CPACR_ELn | Coprocessor Access Control Register | Controls access to Trace, floating-point, and NEON functionality. See ISB in more detail on page 13-9. |
CSSELR_ELn | Cache Size Selection Register | Selects the current Cache Size ID Register, CCSIDR_EL1, by specifying the required cache level and the cache type, either instruction or data cache. See Cache discovery on page 11-18. |
CNTP_CTL_ELn | Counter-timer Physical Control Register | Control register for the EL1 physical timer. See Timers on page 14-5. |
CTR_ELn | Cache Type Register | Information about the architecture of the integrated caches. See Cache discovery on page 11-18. |
DCZID_ELn | Data Cache Zero ID Register | Indicates the block size written with byte values of 0 by the Data Cache Zero by Virtual Address (DC ZVA) system instruction. See Cache discovery on page 11-18. |
ELR_ELn | Exception Link Register | Holds the exception return address. | 1, 2, 3
ESR_ELn | Exception Syndrome Register | Includes information about the reasons for the exception. See The Exception Syndrome Register on page 10-9. | 1, 2, 3
FAR_ELn | Fault Address Register | Holds the virtual faulting address. See Handling synchronous exceptions on page 10-7. | 1, 2, 3
FPCR | Floating-point Control Register | Controls floating-point extension behavior. The fields in this register map to the equivalent fields in the AArch32 FPSCR. See New features for NEON and Floating-point in AArch64 on page 7-2. |
FPSR | Floating-point Status Register | Provides floating-point system status information. The fields in this register map to the equivalent fields in the AArch32 FPSCR. See New features for NEON and Floating-point in AArch64 on page 7-2. |
HCR_ELn | Hypervisor Configuration Register | Controls virtualization settings and trapping of exceptions to EL2. See Exception handling on page 18-8. |
MAIR_ELn | Memory Attribute Indirection Register | Provides the memory attribute encodings corresponding to the possible values in a Long-descriptor format translation table entry for stage 1 translations at ELn. See Memory types on page 13-3. | 1, 2, 3
MIDR_ELn | Main ID Register | The type of processor the code is running on (part number and revision). |
MPIDR_ELn | Multiprocessor Affinity Register | The processor and cluster IDs, in multi-core or cluster systems. See Determining which core the code is running on on page 14-3. |
SCR_ELn | Secure Configuration Register | Controls Secure state and trapping of exceptions to EL3. See Handling synchronous exceptions on page 10-7. |
SCTLR_ELn | System Control Register | Controls architectural features, for example the MMU, caches and alignment checking. | 1, 2, 3
SPSR_ELn | Saved Program Status Register | Holds the saved processor state when an exception is taken to this mode or Exception level. | abt, fiq, irq, und, 1, 2, 3
TCR_ELn | Translation Control Register | Determines which of the Translation Table Base Registers define the base address for a translation table walk required for the stage 1 translation of a memory access from ELn. Also controls the translation table format and holds cacheability and shareability information. See Separation of kernel and application Virtual Address spaces on page 12-7. | 1, 2, 3
TPIDR_ELn | User Read/Write Thread ID Register | Provides a location where software executing at ELn can store thread identifying information, for OS management purposes. See Context switching on page 12-27. | 0, 1, 2, 3
TPIDRRO_ELn | User Read-Only Thread ID Register | Provides a location where software executing at EL1 or higher can store thread identifying information. This information is visible to software executing at EL0, for OS management purposes. See Context switching on page 12-27. |
TTBR0_ELn | Translation Table Base Register 0 | Holds the base address of translation table 0, and information about the memory it occupies. This is one of the translation tables for the stage 1 translation of memory accesses at ELn. See Separation of kernel and application Virtual Address spaces on page 12-7. | 1, 2, 3
TTBR1_ELn | Translation Table Base Register 1 | Holds the base address of translation table 1, and information about the memory it occupies. This is one of the translation tables for the stage 1 translation of memory accesses at EL0 and EL1. See Separation of kernel and application Virtual Address spaces on page 12-7. |
VBAR_ELn | Vector Base Address Register | Holds the exception base address for any exception that is taken to ELn. See AArch64 exception table on page 10-12. | 1, 2, 3
VTCR_ELn | Virtualization Translation Control Register | Controls the translation table walks required for the stage 2 translation of memory accesses from Non-secure EL0 and EL1. Also holds cacheability and shareability information for the accesses. See Translations at EL2 and EL3 on page 12-20. |
VTTBR_ELn | Virtualization Translation Table Base Register | Holds the base address of the translation table for the stage 2 translation of memory accesses from Non-secure EL0 and EL1. See Memory translation on page 18-3. |

4.3.1

The system control register
The System Control Register (SCTLR) is a register that controls standard memory, system
facilities and provides status information for functions that are implemented in the core.


[Figure: bit layout, bits 31 to 0, of SCTLR_EL1, showing the fields UCI, EE, E0E, WXN, nTWE, nTWI, UCT, DZE, I, UMA, SED, ITD, CP15BEN, SA0, SA, C, A, and M, and of SCTLR_EL2 and SCTLR_EL3, showing a smaller set of fields that includes WXN, I, SA, C, A, and M.]
Figure 4-5 SCTLR bit assignments

Not all bits are available above EL1. The individual bits represent the following:
UCI      When set, enables EL0 access in AArch64 for DC CVAU, DC CIVAC, DC CVAC, and
         IC IVAU instructions. See Cache maintenance on page 11-13.
EE       Exception endianness. See Endianness on page 4-12.
         0    Little endian.
         1    Big endian.
E0E      Endianness of explicit data accesses at EL0. The possible values of this bit are:
         0    Explicit data accesses at EL0 are little-endian.
         1    Explicit data accesses at EL0 are big-endian.
WXN      Write permission implies XN (eXecute Never). See Access permissions on
         page 12-23.
         0    Regions with write permission are not forced to XN.
         1    Regions with write permission are forced to XN.
nTWE     Not trap WFE. A value of 1 means that WFE instructions are executed as normal.
nTWI     Not trap WFI. A value of 1 means that WFI instructions are executed as normal.
UCT      When set, enables EL0 access in AArch64 to the CTR_EL0 register.
DZE      Access to DC ZVA instruction at EL0. See Cache maintenance on page 11-13.
         0    Execution prohibited.
         1    Execution allowed.
I        Instruction cache enable. This is an enable bit for instruction caches at EL0 and
         EL1. Instruction accesses to cacheable Normal memory are cached.
UMA      User Mask Access. Controls access to interrupt masks from EL0, when EL0 is
         using AArch64.
SED      SETEND Disable. Disables SETEND instructions at EL0 using AArch32.
         0    SETEND instructions are enabled.
         1    The SETEND instruction is disabled.
ITD      IT Disable. The possible values of this bit are:
         0    The IT instruction is available.
         1    The IT instruction is treated as a 16-bit instruction. Only another 16-bit
              instruction, or the first half of a 32-bit instruction, can follow. This
              depends upon the implementation.
CP15BEN  CP15 barrier enable. If implemented, it is an enable bit for the AArch32 CP15
         DMB, DSB, and ISB barrier operations.
SA0      Stack Alignment Check Enable for EL0.
SA       Stack Alignment Check Enable.
C        Data cache enable. This is an enable bit for data caches at EL0 and EL1. Data
         accesses to cacheable Normal memory are cached.
A        Alignment check enable bit.
M        Enable the MMU.

Accessing the SCTLR
To access the SCTLR_ELn, use:
MRS Xt, SCTLR_ELn    // Read SCTLR_ELn into Xt
MSR SCTLR_ELn, Xt    // Write Xt to SCTLR_ELn

For example:
Example 4-1 Setting bits in the SCTLR

MRS X0, SCTLR_EL1        // Read System Control Register configuration data
ORR X0, X0, #(1 << 2)    // Set [C] bit and enable data caching
ORR X0, X0, #(1 << 12)   // Set [I] bit and enable instruction caching
MSR SCTLR_EL1, X0        // Write System Control Register configuration data

Note
The caches in the processor must be invalidated before caching of data and instructions is
enabled in any of the Exception levels.
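The read-modify-write pattern of Example 4-1 can be checked on a host machine with a short model. This sketch does not touch a real register. The constant and function names are hypothetical, and the bit positions are the ones used in Example 4-1.

```python
SCTLR_C = 1 << 2     # data cache enable, bit 2 as in Example 4-1
SCTLR_I = 1 << 12    # instruction cache enable, bit 12 as in Example 4-1

def enable_caches(sctlr):
    """Return the SCTLR value with the C and I bits set and all other bits unchanged."""
    return sctlr | SCTLR_C | SCTLR_I

print(hex(enable_caches(0x0)))   # 0x1004
```

Using OR preserves every bit that the code does not own, which is why the assembly reads the register before writing it back.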


4.4

Endianness
There are two basic ways of viewing bytes in memory, either as Little-Endian (LE) or
Big-Endian (BE). On big-endian machines, the most significant byte of an object in memory is
stored at the lowest address, that is the address closest to zero. On little-endian machines, the
least significant byte is stored at the lowest address. The term byte-ordering can also be used
rather than endianness.
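The two byte orders can be demonstrated with the 32-bit value 0x12345678. The helper name bytes_in_memory is hypothetical. It returns the bytes in order of increasing address.

```python
def bytes_in_memory(value, big_endian=False):
    """Return the four bytes of a 32-bit value in order of increasing address."""
    order = "big" if big_endian else "little"
    return list(value.to_bytes(4, byteorder=order))

print([hex(b) for b in bytes_in_memory(0x12345678)])
# ['0x78', '0x56', '0x34', '0x12']
print([hex(b) for b in bytes_in_memory(0x12345678, big_endian=True)])
# ['0x12', '0x34', '0x56', '0x78']
```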

[Figure: byte layout of the 32-bit value 0x12345678 in memory, shown in little-endian and in big-endian order.]
Figure 4-6

Data endianness is controlled independently for each Exception level. For EL3, EL2 and
EL1, the EE bit of the relevant SCTLR_ELn register sets the endianness. An additional bit at EL1,
SCTLR_EL1.E0E, controls the data endianness for EL0. In the AArch64 execution state, data
accesses can be LE or BE, while instruction fetches are always LE.
Whether a processor supports both LE and BE depends upon the implementation of the
processor. If only little-endianness is supported, then the EE and E0E bits are always 0.
Similarly, if only big-endianness is supported, then the EE and E0E bits are at a static 1 value.
When using AArch32, having the CPSR.E bit have a different value to the equivalent System
Control register EE bit when in EL1, EL2, or EL3 is now deprecated. The use of the ARMv7
SETEND instruction is also deprecated. It is possible to cause the Undef exception to be taken upon
executing a SETEND instruction, by setting the SCTLR.SED bit.


4.5

Changing execution state (again)
In Changing execution state on page 3-8, we described the change between AArch64 and
AArch32 in terms of Exception levels. Now we consider the change from the point of view of
the registers.
On entry to an Exception level using AArch64 from an Exception level using AArch32:
• The values of the upper 32 bits of registers that were accessible to any lower Exception
  level using AArch32 execution are UNKNOWN.
• The registers that are not accessible during AArch32 execution retain the state that they
  had before AArch32 execution.
• On exception entry to EL3, when EL2 has been using AArch32, the values of the upper
  32 bits of the ELR_EL2 are UNKNOWN.
• AArch64 Stack Pointers (SPs) and Exception Link Registers (ELRs) associated with an
  Exception level that is not accessible during AArch32 execution, at that Exception level,
  retain the state that they had before AArch32 execution. This applies to the following
  registers:
  — SP_EL0.
  — SP_EL1.
  — SP_EL2.
  — ELR_EL1.

In general, application programmers write applications for either AArch32 or AArch64. It is
only the OS that must take account of the two execution states and the switch between them.
4.5.1

Registers at AArch32
Being virtually identical to ARMv7 means AArch32 must match ARMv7 privilege levels. It
also means that AArch32 only deals with ARMv7 32-bit general-purpose registers. Therefore,
there must be some correspondence between the ARMv8 architecture, and the view of it
provided by the AArch32 execution state.
Remember that in the ARMv7 architecture there are sixteen 32-bit general-purpose registers
(R0-R15) for software use. Fifteen of them (R0-R14) can be used for general-purpose data
storage. The remaining register, R15, is the program counter (PC) whose value is altered as the
core executes instructions. Software can also access the CPSR. The saved copy of the CPSR
from the previously executed mode is the SPSR. On taking an exception, the CPSR is copied
to the SPSR of the mode to which the exception is taken.
Which of these registers is accessed, and where, depends upon the processor mode the software
is executing in and the register itself. This is called banking, and the shaded registers in
Figure 4-7 on page 4-14 are banked. They use physically distinct storage and are usually
accessible only when the processor is executing in that particular mode.


[Figure: the ARMv7 registers R0-R12, R13 (sp), R14 (lr), R15 (pc), and the (A/C)PSR as seen in User, Sys, FIQ, IRQ, ABT, SVC, UND, MON, and HYP modes. Banked copies shown include R8_fiq to R12_fiq, a per-mode SP and LR (for example SP_irq and LR_irq), a per-mode SPSR, and ELR_hyp.]
Figure 4-7 The ARMv7 register set showing banked registers

Banking is used in ARMv7 to reduce the latency for exceptions. However, this also means that
of a considerable number of possible registers, fewer than half can be used at any one time.
In contrast, the AArch64 execution state has 31 × 64-bit general-purpose registers accessible at
all times and in all Exception levels. A change in execution state between AArch64 and
AArch32 means that the AArch64 registers must necessarily map onto the AArch32 (ARMv7)
register set. This mapping is shown in Figure 4-8 on page 4-15.
The upper 32 bits of the AArch64 registers are inaccessible when executing in AArch32. If the
processor is operating in AArch32 state, it uses the 32-bit W registers, which are equivalent to
the 32-bit ARMv7 registers.
AArch32 maps the banked registers to AArch64 registers that would otherwise be inaccessible.
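The mapping of banked AArch32 registers onto otherwise unused AArch64 registers can be written as a lookup table. The table below is an assumption taken from the ARMv8-A Architecture Reference Manual rather than from the text of this guide, and the names BANKED_TO_W and aarch64_view are hypothetical.

```python
# Mapping assumed from the ARMv8-A Architecture Reference Manual
BANKED_TO_W = {
    ("R13", "hyp"): 15,
    ("R14", "irq"): 16, ("R13", "irq"): 17,
    ("R14", "svc"): 18, ("R13", "svc"): 19,
    ("R14", "abt"): 20, ("R13", "abt"): 21,
    ("R14", "und"): 22, ("R13", "und"): 23,
    ("R8", "fiq"): 24, ("R9", "fiq"): 25, ("R10", "fiq"): 26,
    ("R11", "fiq"): 27, ("R12", "fiq"): 28,
    ("R13", "fiq"): 29, ("R14", "fiq"): 30,
}
MODES = {"usr", "sys", "fiq", "irq", "abt", "svc", "und", "hyp"}

def aarch64_view(register, mode="usr"):
    """Return the W register that holds an AArch32 register in a given mode."""
    if mode not in MODES:
        raise ValueError(f"mode {mode} is not covered by this sketch")
    if (register, mode) in BANKED_TO_W:
        return f"W{BANKED_TO_W[(register, mode)]}"
    number = int(register[1:])
    if not 0 <= number <= 14:
        raise ValueError("R15 (the PC) is not mapped to a general-purpose register")
    return f"W{number}"          # unbanked registers map directly: Rn -> Wn

print(aarch64_view("R13", "irq"))   # W17
print(aarch64_view("R8", "fiq"))    # W24
```

Monitor mode is left out of the sketch on purpose, because its banked registers are not part of this general-purpose mapping.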


[Figure: the AArch32 register view of Figure 4-7, with each register labelled by the AArch64 W register that holds it. Labels recoverable from the figure include W8 to W14 for R8 to R14, W24 to W28 for the FIQ copies of R8 to R12, and W15 to W23, W29, and W30 for the banked SP and LR registers. The saved status registers are shown as SPSR_fiq, SPSR_irq, SPSR_abt, SPSR_EL1, SPSR_und, SPSR_EL3, and SPSR_EL2, with ELR_EL2 for HYP mode. Some registers are marked as inaccessible from AArch64.]
Figure 4-8 AArch64 to AArch32 register mapping

The SPSR and ELR_Hyp registers in AArch32 are additional registers that are accessible using
system instructions only. They are not mapped into the general-purpose register space of the
AArch64 architecture. Some of these registers are mapped between AArch32 and AArch64:
• SPSR_svc maps to SPSR_EL1.
• SPSR_hyp maps to SPSR_EL2.
• ELR_hyp maps to ELR_EL2.

The following registers are only used during AArch32 execution. However, they retain their
state during execution at EL1 using AArch64, even though they are inaccessible during
AArch64 execution at that Exception level.
• SPSR_abt.
• SPSR_und.
• SPSR_irq.
• SPSR_fiq.

The SPSR registers are only accessible during AArch64 execution at higher Exception levels
for context switching.
Again, if an exception is taken to an Exception level in AArch64 from an Exception level in
AArch32, the top 32 bits of the AArch64 ELR_ELn are all zero.


4.5.2

PSTATE at AArch32
In AArch64, the different components of the traditional CPSR are presented as Processor State
(PSTATE) fields that can be made accessible independently. At AArch32, there are extra fields
corresponding to the ARMv7 CPSR bits.

[Figure: 32-bit CPSR layout in AArch32, bits 31 to 0, showing the fields N, Z, C, V, Q, IT[7:2], E, A, I, F, T, M, and M[3:0].]
Figure 4-9 CPSR bit assignments in AArch32

This gives the following additional PSTATE bits, which are accessible only at AArch32:
Table 4-6 PSTATE bit definitions
Name | Description
Q | Cumulative saturation (sticky) flag.
GE (4) | Greater than or Equal flags.
IT (8) | If-Then execution bits.
J | J bit.
T | T32 bit.
E | Endianness bit.
M | Mode field.

4.6

NEON and floating-point registers
In addition to the general-purpose registers, ARMv8 also has 32 128-bit floating-point registers
labeled V0-V31. The 32 registers are used to hold floating-point operands for scalar
floating-point instructions and both scalar and vector operands for NEON operations. NEON
and floating-point registers are also covered in Chapter 7 AArch64 Floating-point and NEON.

4.6.1

Floating-point register organization in AArch64
In NEON and floating-point instructions that operate on scalar data, the floating-point and
NEON registers behave similarly to the main general-purpose integer registers. Therefore, only
the lower bits are accessed, with the unused high bits ignored on a read and set to zero on a write.
The qualified names for scalar floating-point and NEON names indicate the number of
significant bits as follows, where n is a register number 0-31.
Table 4-7 Operand name for differently sized floats
Precision | Size (bits) | Name
Half | 16 | Hn
Single | 32 | Sn
Double | 64 | Dn

[Figure: for each register from 0 to 31, Dn occupies bits [63:0], Sn occupies bits [31:0], and Hn occupies bits [15:0]. The remaining upper bits are unused.]
Figure 4-10 Arrangement of floating-point values

Note
16-bit floating-point is supported, but only as a format to be converted from or to. It is not
supported for data processing operations.
The floating-point ADD instruction takes an F prefix, and the register names specify the float size:
FADD Sd, Sn, Sm    // Single-precision
FADD Dd, Dn, Dm    // Double-precision

The half-precision floating-point instructions are for converting between different sizes:
FCVT Sd, Hn    // half-precision to single-precision
FCVT Dd, Hn    // half-precision to double-precision
FCVT Hd, Sn    // single-precision to half-precision
FCVT Hd, Dn    // double-precision to half-precision

4.6.2

Scalar register sizes
In AArch64, the mapping for the integer scalars has changed from what is used in ARMv7-A to
the mapping shown in Figure 4-11:
[Figure: for each register from 0 to 31, Qn occupies all 128 bits, Dn occupies bits [63:0], Sn occupies bits [31:0], Hn occupies bits [15:0], and Bn occupies bits [7:0]. The remaining upper bits are unused.]
Figure 4-11 Arrangement of ARMv8 registers when holding scalar values

In Figure 4-11 S0 is the bottom half of D0, which is the bottom half of Q0. S1 is the bottom half
of D1, which is the bottom half of Q1, and so on. This eliminates many of the problems
compilers have in auto-vectorizing high-level code.

• The bottom 64-bits of each of the Q registers can also be viewed as D0-D31, 32 64-bit
  wide registers for floating-point and NEON use.
• The bottom 32-bits of each of the Q registers can also be viewed as S0-S31, 32 32-bit wide
  registers for floating-point and NEON use.
• The bottom 16-bits of each of the S registers can also be viewed as H0-H31, 32 16-bit
  wide registers for floating-point and NEON use.
• The bottom 8-bits of each of the H registers can also be viewed as B0-B31, 32 8-bit wide
  registers for NEON use.


Note
Only the bottom bits of each register set are used in each case. The rest of the register space is
ignored when read, and filled with zeros when written.
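The scalar views described in this section can be modelled by masking a 128-bit value. This is an illustrative sketch in which a Python integer stands in for a V register, and the names VIEW_BITS, read_view, and write_view are hypothetical.

```python
VIEW_BITS = {"B": 8, "H": 16, "S": 32, "D": 64, "Q": 128}

def read_view(v_value, view):
    """Read Bn, Hn, Sn, Dn, or Qn: only the low bits of the 128-bit register are used."""
    return v_value & ((1 << VIEW_BITS[view]) - 1)

def write_view(value, view):
    """Write Bn, Hn, Sn, Dn, or Qn: return the new 128-bit value, upper bits zeroed."""
    return value & ((1 << VIEW_BITS[view]) - 1)

v0 = 0x00112233445566778899AABBCCDDEEFF
print(hex(read_view(v0, "S")))            # 0xccddeeff
print(hex(write_view(0x3F800000, "S")))   # 0x3f800000, bits [127:32] are zero
```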
A consequence of this mapping is that if a program executing in AArch64 is interpreting D or
S registers from AArch32 execution, then the program must unpack the D or S registers from
the V registers before using them.
For the scalar ADD instruction:
ADD Vd, Vn, Vm

If the size was, for example, 32 bits, the instruction would be:
ADD Sd, Sn, Sm

Table 4-8 Operand name for differently sized scalars
Word size | Size (bits) | Name
Byte | 8 | Bn
Halfword | 16 | Hn
Word | 32 | Sn
Doubleword | 64 | Dn
Quadword | 128 | Qn

4.6.3

Vector register sizes
Vectors can be 64-bits wide with one or more elements or 128-bits wide with two or more
elements as shown in Figure 4-12:
[Figure: a 128-bit vector, for example V0, can be viewed as V0.2D, V0.4S, V0.8H, or V0.16B. A 64-bit vector, for example V31, occupies bits [63:0] and can be viewed as V31.1D, V31.2S, V31.4H, or V31.8B, with bits [127:64] unused.]
Figure 4-12 Vector sizes

For the vector ADD instruction:
ADD Vd.T, Vn.T, Vm.T


For a vector of 32-bit elements with 4 lanes, the instruction becomes:
ADD Vd.4S, Vn.4S, Vm.4S

Table 4-9 Operand names for different size vectors
Name | Shape
Vn.8B | 8 lanes, each containing an 8-bit element
Vn.16B | 16 lanes, each containing an 8-bit element
Vn.4H | 4 lanes, each containing a 16-bit element
Vn.8H | 8 lanes, each containing a 16-bit element
Vn.2S | 2 lanes, each containing a 32-bit element
Vn.4S | 4 lanes, each containing a 32-bit element
Vn.1D | 1 lane containing a 64-bit element
Vn.2D | 2 lanes, each containing a 64-bit element

When these registers are used in a specific instruction form, the names must be further qualified
to indicate the data shape. More specifically, this means the data element size and the number
of elements or lanes held within them.
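The shape qualifiers in Table 4-9 can be made concrete with a sketch that parses an arrangement such as 4S and performs a lane-wise add in the style of ADD Vd.T, Vn.T, Vm.T. The function names are hypothetical, and Python integers stand in for the vector registers.

```python
ELEMENT_BITS = {"B": 8, "H": 16, "S": 32, "D": 64}
VALID_SHAPES = {"8B", "16B", "4H", "8H", "2S", "4S", "1D", "2D"}

def parse_shape(shape):
    """Return (lanes, element_bits, vector_bits) for an arrangement such as '4S'."""
    if shape not in VALID_SHAPES:
        raise ValueError(f"unsupported arrangement: {shape}")
    lanes, bits = int(shape[:-1]), ELEMENT_BITS[shape[-1]]
    return lanes, bits, lanes * bits

def split_lanes(v_value, shape):
    """Split a register value into lanes, lane 0 being the least significant element."""
    lanes, bits, _ = parse_shape(shape)
    return [(v_value >> (i * bits)) & ((1 << bits) - 1) for i in range(lanes)]

def vector_add(a, b, shape):
    """Lane-wise add with wrap-around inside each lane."""
    _, bits, _ = parse_shape(shape)
    return [(x + y) & ((1 << bits) - 1)
            for x, y in zip(split_lanes(a, shape), split_lanes(b, shape))]

print(parse_shape("4S"))                                        # (4, 32, 128)
print(split_lanes(0x00000004000000030000000200000001, "4S"))    # [1, 2, 3, 4]
```

A carry out of one lane does not spill into the next, which is the difference between a vector add and a single wide integer add.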
4.6.4

NEON in AArch32 execution state
In AArch32, the smaller registers are packed into larger ones (D0 and D1 are combined to form
Q0, for instance). This introduces some tricky loop-carried dependencies which can reduce the
ability of the compiler to vectorize loop structures.
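The AArch32 packing can be written as simple index arithmetic. This is a sketch with hypothetical function names. It reflects the ARMv7 arrangement, in which each Q register overlays a pair of D registers and each of D0-D15 overlays a pair of S registers.

```python
def q_to_d(n):
    """AArch32: Qn is formed from D(2n) (low half) and D(2n+1) (high half)."""
    if not 0 <= n <= 15:
        raise ValueError("AArch32 has Q0-Q15")
    return (2 * n, 2 * n + 1)

def d_to_s(n):
    """AArch32: Dn is formed from S(2n) and S(2n+1). Only D0-D15 overlap S registers."""
    if not 0 <= n <= 15:
        raise ValueError("only D0-D15 overlap the S registers")
    return (2 * n, 2 * n + 1)

print(q_to_d(1))   # (2, 3): Q1 is D2 and D3
print(d_to_s(2))   # (4, 5): D2 is S4 and S5
```

In AArch64, by contrast, Sn is simply the low 32 bits of Dn, which is the low half of Qn, so no such index arithmetic is needed.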

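[Figure: in the ARMv7 arrangement, S0 lies within D0, which lies within Q0, and S4 lies within D2, which lies within Q1. Each Q register is 128 bits wide.]
Figure 4-13 Arrangement of ARMv7 SIMD registers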

The floating-point and Advanced SIMD registers in AArch32 are mapped into the AArch64 FP
and SIMD registers. This is done to allow the floating-point and NEON registers of an
application or a virtual machine to be interpreted (and, as necessary, modified) by a higher level
of system software, for example, the OS or the Hypervisor.
The AArch64 V16-V31 FP and NEON registers are not accessible from AArch32. As with the
general-purpose registers, during execution in an Exception level using AArch32 these registers
retain their state from the previous execution using AArch64.


Chapter 5
An Introduction to the ARMv8 Instruction Sets

One of the most significant changes introduced in the ARMv8 architecture is the addition of a
64-bit instruction set. This set complements the existing 32-bit instruction set architecture. This
addition provides access to 64-bit wide integer registers and data operations, and the ability to
use 64-bit sized pointers to memory. The new instructions are known as A64 and execute in the
AArch64 execution state. ARMv8 also includes the original ARM instruction set, now called
A32, and the Thumb (T32) instruction set. Both A32 and T32 execute in AArch32 state, and
provide backward compatibility with ARMv7.
Although ARMv8-A provides backward compatibility with the 32-bit ARM Architectures, the
A64 instruction set is separate and distinct from the older ISA and is encoded differently. A64
adds some additional capabilities while also removing other features that would potentially limit
the speed or energy efficiency of high performance implementations. The ARMv8 architecture
includes some enhancements to the 32-bit instruction sets (A32 and T32) as well. However,
code that makes use of such features is not compatible with older ARMv7 implementations.
Instruction opcodes in the A64 instruction set, however, are still 32 bits long, not 64 bits.
Programmers seeking a more detailed description of A64 assembly language can also refer to
the ARM® Compiler armasm Reference Guide v6.01.


5.1

The ARMv8 instruction sets
The new A64 instruction set is similar to the existing A32 instruction set. Instructions are 32 bits
wide and have similar syntax.
The instruction sets use a generic naming convention within the ARMv8 architecture, so that
the original 32-bit instruction set states are now called:
A32

When in AArch32 state, the instruction set is largely compatible with ARMv7,
though there are differences. See the ARMv8-A Architecture Reference Manual. It
also provides some new instructions to align with some of the features that are
introduced in the A64 instruction set.

T32

The Thumb instruction set was first included in the ARM7TDMI processor and
originally contained only 16-bit instructions. 16-bit instructions gave much
smaller programs at the cost of some performance. ARMv7 processors, including
those in the Cortex-A series, support Thumb-2 technology, which extends the
Thumb instruction set to provide a mix of 16-bit and 32-bit instructions. This
gives performance similar to that of ARM, while retaining the reduced code size.
Because of its size and performance advantages, it is increasingly common for all
32-bit code to be compiled or assembled to take advantage of Thumb-2
technology.

A new instruction set has been introduced that the core can use when in AArch64 state. In
keeping with the naming convention, and reflecting the 64-bit operation, this instruction set is
called:
A64

A64 provides similar functionality to the A32 and T32 instruction sets in
AArch32 or ARMv7. The design of the new A64 instruction set allowed several
improvements:
A consistent encoding scheme
The late addition of some instructions in A32 resulted in some
inconsistency in the encoding scheme. For example, LDR and STR
support for halfwords is encoded slightly differently to the mainstream
byte and word transfer instructions. The result of this is that the
addressing modes are slightly different.
Wide range of constants
A64 instructions provide a huge range of options for constants, each
tailored to the requirements of specific instruction types.
• Arithmetic instructions generally accept a 12-bit immediate
constant.
• Logical instructions generally accept a 32-bit or 64-bit constant,
which has some constraints in its encoding.
• MOV instructions accept a 16-bit immediate, which can be shifted
to any 16-bit boundary.
• Address generation instructions are geared to addresses aligned
to a 4KB page size.
There are slightly more complex rules for constants that are used in bit
manipulation instructions. However, bitfield manipulation instructions
can address any contiguous sequence of bits, in either the source or
destination operand.
A64 provides flexible constants, but encoding them, even determining
whether a particular constant can be legally encoded in a particular
context, can be non-trivial.
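To give a feel for these encodability checks, the following C sketch tests two of the simpler rules listed above: the 12-bit arithmetic immediate (which, as described later for the comparison instructions, can optionally be shifted left by 12 bits) and the 16-bit MOV immediate placed on a 16-bit boundary. The function names are hypothetical, and the more involved rule for logical (bitmask) immediates is deliberately left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if v fits a 12-bit unsigned immediate, either unshifted or
   shifted left by 12 bits (all other bits must be zero). */
static bool fits_arith_imm(uint64_t v)
{
    return (v & ~UINT64_C(0xFFF)) == 0 ||
           (v & ~(UINT64_C(0xFFF) << 12)) == 0;
}

/* True if v is a single 16-bit chunk located at bit 0, 16, 32 or 48,
   with every other bit zero. */
static bool fits_mov16_imm(uint64_t v)
{
    for (int shift = 0; shift < 64; shift += 16) {
        if ((v & ~(UINT64_C(0xFFFF) << shift)) == 0)
            return true;
    }
    return false;
}
```

A constant such as 0x10001 fails both checks, so a compiler or assembler has to build it with more than one instruction or load it from a literal pool.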


Data types are easier
A64 deals naturally with 64-bit signed and unsigned data types in that
it offers more concise and efficient ways of manipulating 64-bit
integers. This can be advantageous for all languages which provide
64-bit integers such as C or Java.
Long offsets
A64 instructions generally provide longer offsets, both for PC-relative
branches and for offset addressing.
The increased branch range makes it easier to manage inter-section
jumps. Dynamically generated code is generally placed on the heap so
it can, in practice, be located anywhere. The runtime system finds it
much easier to manage this with increased branch ranges, and fewer
fix-ups are required.
The need for literal pools (blocks of literal data embedded in the code
stream) has long been a feature of ARM instruction sets. This still
exists in A64. However, the larger PC-relative load offset helps
considerably with the management of literal pools, making it possible
to use one per compilation unit. This removes the need to manufacture
locations for multiple pools in long code sequences.
Pointers
Pointers are 64-bit in AArch64, which allows larger amounts of virtual
memory to be addressed and gives more freedom for address mapping.
However, using 64-bit pointers does incur some costs. The same piece
of code typically uses more memory when running with 64-bit pointers
than with 32-bit pointers. Each pointer is stored in memory and
requires eight bytes instead of four. This might sound trivial, but can
add up to a significant penalty. Additionally, the increased use of
memory space that is associated with a move to 64 bits can cause a
drop in the number of accesses that hit in cache. This drop in cache hits
can reduce performance.
Some languages, such as Java, can be implemented with compressed
pointers to circumvent the performance issue.
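The idea behind compressed pointers can be sketched in C as follows. This is only an illustration under stated assumptions: the type cref_t, the two node structures and the helper functions are hypothetical, and the scheme (a 32-bit byte offset from a heap base, with 0 reserved for null) is not claimed to match what any particular language runtime does.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical compressed reference: a 32-bit byte offset from the heap
   base. Offset 0 is reserved to mean "null", so this sketch assumes that
   no object is ever placed at the heap base itself. */
typedef uint32_t cref_t;

/* With a full pointer, this node needs 16 bytes on an LP64 system. */
struct node_wide    { struct node_wide *next; uint32_t value; };

/* With a compressed reference, the node is two 32-bit fields. */
struct node_compact { cref_t next; uint32_t value; };

static cref_t compress(const unsigned char *heap_base, const void *p)
{
    return p ? (cref_t)((const unsigned char *)p - heap_base) : 0;
}

static void *decompress(unsigned char *heap_base, cref_t ref)
{
    return ref ? heap_base + ref : NULL;
}
```

The cost is an extra add on each dereference, traded against smaller objects and better cache occupancy.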
Conditional constructs are used instead of IT blocks
IT blocks are a useful feature of T32, enabling efficient sequences that
avoid the need for short forward branches around unexecuted
instructions. However, they are sometimes difficult for hardware to
handle efficiently. A64 removes these blocks and replaces them with
conditional instructions such as CSEL (Conditional Select) and CINC
(Conditional Increment). These conditional constructs are more
straightforward and easier to handle without special cases.
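As a rough model, the effect of these two instructions can be written in C as shown below. The helper names are hypothetical, and the assembly in the comments is indicative only. A compiler targeting A64 is free to choose other sequences.

```c
#include <stdbool.h>
#include <stdint.h>

/* CSEL Xd, Xn, Xm, cond : Xd = cond ? Xn : Xm */
static int64_t csel(bool cond, int64_t xn, int64_t xm)
{
    return cond ? xn : xm;
}

/* CINC Xd, Xn, cond : Xd = cond ? Xn + 1 : Xn */
static uint64_t cinc(bool cond, uint64_t xn)
{
    return cond ? xn + 1 : xn;
}

/* Branch-free maximum, in the spirit of: CMP a, b ; CSEL result, a, b, GT */
static int64_t max_i64(int64_t a, int64_t b)
{
    return csel(a > b, a, b);
}

/* Count non-zero elements, in the spirit of: CMP x, #0 ; CINC n, n, NE */
static uint64_t count_nonzero(const int64_t *values, int length)
{
    uint64_t n = 0;
    for (int i = 0; i < length; i++)
        n = cinc(values[i] != 0, n);
    return n;
}
```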
Shift and rotate behavior is more intuitive
The A32 or T32 shift and rotate behavior does not always map easily
to the behavior expected by high-level languages.
ARMv7 provides a barrel shifter that can be used as part of data
processing instructions. However, specifying the type of shift and the
amount to shift requires a certain number of opcode bits, which could
be used elsewhere.
A64 therefore removes options that were rarely used, and
instead adds new explicit instructions to carry out more complicated
shift operations.


Code generation
When generating code, both statically and dynamically, for common
arithmetic functions, A32 and T32 often require different instructions,
or instruction sequences. This is to cope with different data types.
These operations in A64 are much more consistent so it is much easier
to generate common sequences for simple operations on differently
sized data types.
For example, in T32 the same instruction can have different encodings
depending on what registers are used (either a low register or a high
register).
The A64 instruction set encodings are much more regular and
rationalized. Consequently, an assembler for A64 typically requires
fewer lines of code than an assembler for T32.
Fixed-length instructions
All A64 instructions are the same length, unlike T32, which is a
variable-length instruction set. This makes management and tracking
of generated code sequences easier, particularly affecting dynamic
code generators.
Three operands map better
A32, in general, preserves a true three-operand structure for
data-processing operations. T32, on the other hand, contains a
significant number of two-operand instruction formats, which make it
slightly less flexible when generating code. A64 sticks to a consistent
three-operand syntax, which further contributes to the regularity and
homogeneity of the instruction set for the benefit of compilers.
5.1.1

Distinguishing between 32-bit and 64-bit A64 instructions
Most integer instructions in the A64 instruction set have two forms, which operate on either
32-bit or 64-bit values within the 64-bit general-purpose register file.
When looking at the register name that the instruction uses:
• If the register name starts with X, it is a 64-bit value.
• If the register name starts with W, it is a 32-bit value.

Where a 32-bit instruction form is selected, the following facts hold true:
• Right shifts and rotates inject at bit 31, instead of bit 63.
• The condition flags, where set by the instruction, are computed from the lower 32 bits.
• Writes to the W register set bits [63:32] of the X register to zero.

This distinction applies even when the results of a 32-bit instruction form would be
indistinguishable from the lower 32 bits computed by the equivalent 64-bit instruction form. For
example, a 32-bit bitwise ORR could be performed using a 64-bit ORR and simply ignoring the top
32 bits of the result. The A64 instruction set includes separate 32 and 64-bit forms of the ORR
instruction.
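The difference between the two forms can be modelled in C by treating each register as a 64-bit value. This is a sketch only, and the function names are hypothetical: the _w helpers mimic the 32-bit form (operate on the low 32 bits, then clear bits [63:32] of the destination) and the _x helpers mimic the 64-bit form.

```c
#include <stdint.h>

/* ADD Wd, Wn, Wm : computed on the low 32 bits; bits [63:32] of Xd are zero. */
static uint64_t add_w(uint64_t xn, uint64_t xm)
{
    return (uint64_t)(uint32_t)((uint32_t)xn + (uint32_t)xm);
}

/* ADD Xd, Xn, Xm : full 64-bit addition. */
static uint64_t add_x(uint64_t xn, uint64_t xm)
{
    return xn + xm;
}

/* LSR Wd, Wn, #shift : zeros are injected at bit 31. */
static uint64_t lsr_w(uint64_t xn, unsigned shift)
{
    return (uint64_t)((uint32_t)xn >> (shift & 31));
}

/* LSR Xd, Xn, #shift : zeros are injected at bit 63. */
static uint64_t lsr_x(uint64_t xn, unsigned shift)
{
    return xn >> (shift & 63);
}
```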
The C and C++ LP64 and LLP64 data models are expected to be the most commonly used on
AArch64. They both define the frequently used int, short, and char types to be 32 bits or less.
Because the instruction set preserves this semantic information, implementations can exploit
it, for example, to avoid expending energy or cycles to compute, forward, and store
the unused upper 32 bits of such data types. Implementations are free to exploit this freedom in
whatever way they choose to save energy.

So the new A64 instruction set provides distinct sign and zero-extend instructions. Additionally,
the A64 instruction set makes it possible to extend and shift the final source register of an ADD,
SUB, CMN, or CMP instruction and the index register of a Load or Store instruction. This results in
efficient implementation of array index calculations involving a 64-bit array pointer and 32-bit
array index.
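The address arithmetic that such an extended-register operand performs can be sketched in C as follows. The function names are hypothetical. The instruction forms in the comments show the kind of operand being modelled, with a shift of 2 corresponding to an array of 4-byte elements.

```c
#include <stdint.h>

/* Models ADD Xd, Xn, Wm, SXTW #shift :
   Xd = Xn + (sign_extend(Wm) << shift) */
static uint64_t add_sxtw(uint64_t base, int32_t index, unsigned shift)
{
    return base + ((uint64_t)(int64_t)index << shift);
}

/* Models ADD Xd, Xn, Wm, UXTW #shift :
   Xd = Xn + (zero_extend(Wm) << shift) */
static uint64_t add_uxtw(uint64_t base, uint32_t index, unsigned shift)
{
    return base + ((uint64_t)index << shift);
}
```

For example, with a 64-bit base address of 0x1000 and a signed 32-bit index of -1 into an array of 4-byte elements, add_sxtw yields 0xFFC, the address of the element just before the base.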
5.1.2

Addressing
When the processor can store 64-bit values in a single register, it becomes much simpler to
access large amounts of memory within a program. A single thread executing on a 32-bit core
is limited to accessing 4GB of address space. Large parts of that addressable space are reserved
for use by the OS kernel, library code, peripherals, and more. As a result, lack of space means
that the program might need to map some data in or out of memory while executing. Having a
larger address space, with 64-bit pointers, avoids this problem. It also makes techniques such as
memory-mapped files more attractive and convenient to use. The file contents are mapped into
the memory map of a thread, even though the physical RAM might not be large enough to
contain the whole file.
Other improvements to addressing include the following:
Exclusive accesses
Exclusive load-store of a byte, halfword, word and doubleword. Exclusive access
to a pair of doublewords permits atomic updates of a pair of pointers, for example
circular list inserts. All exclusive accesses must be naturally aligned, and
exclusive pair access must be aligned to twice the data size, that is, 128 bits for a
pair of 64-bit values.
Increased PC-relative offset addressing
PC-relative literal loads have an offset range of ±1MB. Compared to the
PC-relative loads of A32, this reduces the number of literal pools, and increases
sharing of literal data between functions. In turn, this reduces I-cache and TLB
pollution.
Most conditional branches have a range of ±1MB, expected to be sufficient for
the majority of conditional branches that take place within a single function.
Unconditional branches, including branch and link, have a range of ±128MB,
expected to be sufficient to span the static code segment of most executable load
modules and shared objects, without needing linker-inserted veneers.
Note
Veneers are small pieces of code that are automatically inserted by the linker, for
example, when it detects that a branch target is out of range. The veneer becomes
an intermediate target of the original branch with the veneer itself then being a
branch to the target address.
The linker can reuse a veneer generated for a previous call, for other calls to the
same function if it is in range from both calls. Occasionally, such veneers can be
a performance factor.
If you have a loop that calls multiple functions through veneers, you will get
many pipeline flushes and therefore sub-optimal performance. Placing related
code together in memory can avoid this.
PC-relative load and store and address generation with a range of ±4GB can be
performed inline using only two instructions, that is, without the need to load an
offset from a literal pool.
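The branch ranges quoted above can be turned into a simple reachability check, of the kind a linker might apply before deciding whether a veneer is needed. This is a sketch with hypothetical function names. It assumes that the offset is measured from the address of the branch instruction itself and that the positive limit is exclusive.

```c
#include <stdbool.h>
#include <stdint.h>

static const int64_t ONE_MB = INT64_C(1) << 20;

/* True if the signed distance from 'from' to 'to' lies in [-limit, +limit). */
static bool in_range(uint64_t from, uint64_t to, int64_t limit)
{
    int64_t offset = (int64_t)(to - from);
    return offset >= -limit && offset < limit;
}

/* Most conditional branches: a range of about +/-1MB. */
static bool cond_branch_reaches(uint64_t from, uint64_t to)
{
    return in_range(from, to, 1 * ONE_MB);
}

/* Unconditional branch and branch with link: a range of about +/-128MB. */
static bool branch_reaches(uint64_t from, uint64_t to)
{
    return in_range(from, to, 128 * ONE_MB);
}

/* A veneer is only needed when the direct branch cannot reach the target. */
static bool needs_veneer(uint64_t from, uint64_t to)
{
    return !branch_reaches(from, to);
}
```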


Unaligned address support
Except for exclusive and ordered accesses, all loads and stores support the use of
unaligned addresses when accessing normal memory. This simplifies porting
code to A64.
Bulk transfers
The LDM, STM, PUSH, and POP instructions do not exist in A64. Bulk transfers can be
constructed using the LDP and STP instructions. These instructions load and store
a pair of independent registers from consecutive memory locations.
The LDNP and STNP instructions provide a streaming or non-temporal hint, that the
data does not need to be retained in caches.
The PRFM, or prefetch memory instructions enable targeting of a prefetch to a
specific cache level.
Load/Store
All Load/Store instructions now support consistent addressing modes. This
makes it much easier, for example, to treat char, short, int and long long in the
same way when loading and storing quantities from memory.
The floating-point and NEON registers now support the same addressing modes
as the core registers, making it easier to use the two register banks
interchangeably.
Alignment checking
When executing in AArch64, additional alignment checking is performed on
instruction fetches and on loads or stores using the stack pointer, enabling
misalignment checking of the PC or the current SP.
This approach is preferable to forcing the correct alignment of the PC or SP,
because a misalignment of the PC or SP commonly indicates a software error,
such as corruption of an address in software.
There are a number of types of alignment checking:
• Program Counter alignment checking generates an exception associated
with instruction fetch whenever an attempt is made to execute an
instruction fetched with a misaligned PC in AArch64.
A misaligned PC is defined to be one where bits [1:0] of the PC are not 00.
A PC misalignment is identified in the exception syndrome register
associated with the target Exception level.
When the exception is handled using AArch64, the associated exception
link register holds the entire PC in its misaligned form, as does the Fault
Address Register, FAR_ELn, for the Exception level in which the exception
is taken.
PC alignment checking is performed in AArch64, and in AArch32 as part
of Data Abort exception handling.

• Stack Pointer (SP) alignment checking generates an exception associated
with data memory access whenever a load or store using a misaligned stack
pointer as a base address in AArch64 is attempted.
A misaligned stack pointer is one where bits [3:0] of the stack pointer, used
as the base address of the calculation, are not 0000. The stack pointer must
be 16-byte aligned whenever it is used as a base address.
Stack pointer alignment checking is only performed in AArch64, and can
be enabled independently for each Exception level:
— EL0 and EL1 are controlled by two separate bits in SCTLR_EL1.
— EL2 is controlled by a bit in SCTLR_EL2.
— EL3 is controlled by a bit in SCTLR_EL3.

5.1.3

Registers
The A64 64-bit register bank helps reduce register pressure in most applications.
The A64 Procedure Call Standard (PCS) passes up to eight parameters in registers (X0-X7). In
contrast, A32 and T32 pass only four arguments in registers, with any excess being passed on
the stack.
The PCS also defines a dedicated Frame Pointer (FP), which makes debugging and call-graph
profiling easier by making it possible to reliably unwind the stack. Refer to Chapter 9 The ABI
for ARM 64-bit Architecture for further information.
A consequence of adopting 64-bit wide integer registers is the varying widths of variables used
by programming languages. A number of standard models are currently in use, which differ
mainly in the size defined for integers, longs, and pointers:
Table 5-1 Variable width (in bits)
Type        ILP32   LP64    LLP64
char        8       8       8
short       16      16      16
int         32      32      32
long        32      64      32
long long   64      64      64
size_t      32      64      64
pointer     32      64      64

64-bit Linux implementations use LP64 and this is supported by the A64 Procedure Call
Standard. Other PCS variants are defined that can be used by other operating systems.
Zero register
The zero register (WZR/XZR) is used for a few encoding tricks. For example,
there is no plain multiply encoding, just multiply-add. The instruction MUL W0, W1,
W2 is identical to MADD W0, W1, W2, WZR which uses the zero register. Not all
instructions can use the XZR/WZR. As we mentioned in Chapter 4, the zero
register shares the same encoding as the stack pointer. This means that, for some
arguments, for a very limited number of instructions, WZR/XZR is not available,
but WSP/SP is used instead.
Example 5-1 Using the Zero register to write a zero to memory

In A32:
mov r0, #0
str r0, [...]

In A64 using the zero register:
str wzr, [...]

No need for a spare register. Or write 16 bytes of zeros using:
stp xzr, xzr, [...] etc

A convenient side-effect of the zero register is that there are many NOP instructions
with large immediate fields. For example, ADR XZR, #<imm> alone gives you 21 bits
of data in an instruction with no other side effects. This is very useful for JIT
compilers, where code can be patched at runtime.
Stack pointer
The Stack Pointer (SP) cannot be referenced by most instructions. Some forms of
arithmetic instructions can read or write the current stack pointer. This might be
done to adjust the stack pointer in a function prologue or epilogue. For example:
ADD SP, SP, #256        // SP = SP + 256

Program counter
The current Program Counter (PC) cannot be referred to by number as if part of
the general register file and therefore cannot be used as the source or destination
of arithmetic instructions, or as the base, index or transfer register of load and
store instructions.
The only instructions that read the PC are those whose function it is to compute a
PC-relative address (ADR, ADRP, literal load, and direct branches), and the
branch-and-link instructions that store a return address in the link register (BL and
BLR). The only way to modify the program counter is using branch, exception
generation and exception return instructions.
Where the PC is read by an instruction to compute a PC-relative address, then its
value is the address of that instruction. Unlike A32 and T32, there is no implied
offset of 4 or 8 bytes.
FP and NEON registers
The most significant update to the NEON registers is that NEON now has 32
16-byte registers, instead of the 16 registers it had before. The simpler mapping
scheme between the different register sizes in the floating-point and NEON
register bank makes these registers much easier to use. The mapping is easier for
compilers and optimizers to model and analyze.
Register indexed addressing
The A64 instruction set provides additional addressing modes with respect to
A32, allowing a 64-bit index register to be added to the 64-bit base register, with
optional scaling of the index by the access size. Additionally, it provides sign or
zero-extension of a 32-bit value within an index register, again with optional
scaling.


5.2

C/C++ inline assembly
In this section, we briefly cover how to include assembly code within C or C++ language
modules.
The asm keyword can incorporate inline GCC syntax assembly code into a function. For
example:
#include <stdio.h>

int add(int i, int j)
{
    int res = 0;
    asm (
        // Use `%w[name]` to operate on W registers (as in this case).
        // You can use `%x[name]` for X registers too, but this is the
        // default.
        "ADD %w[result], %w[input_i], %w[input_j]"
        : [result] "=r" (res)
        : [input_i] "r" (i), [input_j] "r" (j)
    );
    return res;
}

int main(void)
{
    int a = 1;
    int b = 2;
    int c = 0;
    c = add(a, b);
    printf("Result of %d + %d = %d\n", a, b, c);
    return 0;
}

The general form of an asm inline assembly statement is:
asm(code [: output_operand_list [: input_operand_list [: clobber_list]]]);

where:
code is the assembly code. In our example, this is
"ADD %w[result], %w[input_i], %w[input_j]".
output_operand_list is an optional list of output operands, separated by commas. Each operand
consists of a symbolic name in square brackets, a constraint string, and a C expression in
parentheses. In this example, there is a single output operand: [result] "=r" (res).
input_operand_list is an optional list of input operands, separated by commas. Input operands
use the same syntax as output operands. In this example, there are two input operands: [input_i]
"r" (i) and [input_j] "r" (j).
clobber_list is an optional list of clobbered registers, or other values. In our example, this is
omitted.
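A clobber list becomes necessary as soon as the assembly code changes state that is not listed as an output. The following sketch is a variation on the example above, not taken from it: it uses ADDS, which sets the condition flags, and therefore names "cc" in the clobber list. The function name add_set_flags is hypothetical, and the plain C fallback exists only so that the fragment also builds on hosts that are not AArch64.

```c
/* Adds two integers. On AArch64 this uses ADDS, which updates the NZCV
   flags, so the compiler is told about it through the "cc" clobber. */
static int add_set_flags(int i, int j)
{
#if defined(__aarch64__)
    int res;
    /* __asm__ is the GCC spelling of asm that also works in strict ISO modes. */
    __asm__ (
        "ADDS %w[result], %w[input_i], %w[input_j]"
        : [result] "=r" (res)
        : [input_i] "r" (i), [input_j] "r" (j)
        : "cc"                      /* clobber_list: condition flags */
    );
    return res;
#else
    return i + j;                   /* portable fallback for other hosts */
#endif
}
```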
When calling functions between C/C++ and assembly code, you must follow the AAPCS64
rules.
For further information, see:
https://gcc.gnu.org/onlinedocs/gcc/Using-Assembly-Language-with-C.html#Using-Assembly-Language-with-C


5.3

Switching between the instruction sets
It is not possible to use code from the two execution states within a single application. There is
no interworking between A64 and A32 or T32 instruction sets in ARMv8 as there is between
A32 and T32 instruction sets. Code written in A64 for the ARMv8 processors cannot run on
ARMv7 Cortex-A series processors. However, code written for ARMv7-A processors can run
on ARMv8 processors in the AArch32 execution state. This is summarized in Figure 5-1.

[Figure: T32 (mixed 16 and 32-bit instructions, 32-bit general purpose registers) and A32
(32-bit instructions, 32-bit general purpose registers) switch between each other using BX, BLX,
MOV PC, and LDR PC, or on exception entry or return. A64 (32-bit instructions, 32 and 64-bit
general purpose registers) is entered and left only through exception entry and exception return.]
Figure 5-1 Switching between instruction sets


Chapter 6
The A64 instruction set

Many programmers writing at the application level do not need to write code in assembly
language. However, assembly code can be useful in cases where highly optimized code is
required. This is the case when writing compilers, or where use of low level features not
directly available in C is needed. It might be required for portions of boot code, device drivers,
or when developing operating systems. Finally, it can be useful to be able to read assembly code
when debugging C, and particularly, to understand the mapping between assembly instructions
and C statements.


6.1

Instruction mnemonics
The A64 assembly language overloads instruction mnemonics, and distinguishes between the
different forms of an instruction based on the operand register names. For example, the ADD
instructions below all have different encodings, but you only have to remember one mnemonic,
and the assembler automatically chooses the correct encoding based on the operands.
ADD W0, W1, W2              // add 32-bit registers
ADD X0, X1, X2              // add 64-bit registers
ADD X0, X1, W2, SXTW        // add sign extended 32-bit register to 64-bit extended
                            // register
ADD X0, X1, #42             // add immediate to 64-bit register
ADD V0.8H, V1.8H, V2.8H     // NEON 16-bit add, in each of 8 lanes


6.2

Data processing instructions
These are the fundamental arithmetic and logical operations of the processor and operate on
values in the general-purpose registers, or a register and an immediate value. Multiply and
divide instructions on page 6-4 can be considered special cases of these instructions.
Data processing instructions mostly use one destination register and two source operands. The
general format can be considered to be the instruction, followed by the operands, as follows:
Instruction Rd, Rn, Operand2

The second operand might be a register, a modified register, or an immediate value. The use of
R indicates that it can be either an X or a W register.
The data processing operations include:
• Arithmetic and logical operations.
• Move and shift operations.
• Instructions for sign and zero extension.
• Bit and bitfield manipulation.
• Conditional comparison and data processing.

6.2.1

Arithmetic and logical operations
Table 6-1 shows some of the available arithmetic and logical operations.
Table 6-1 Arithmetic and logical operations
Type          Instructions
Arithmetic    ADD, SUB, ADC, SBC, NEG
Logical       AND, BIC, ORR, ORN, EOR, EON
Comparison    CMP, CMN, TST
Move          MOV, MVN

Some instructions also have an S suffix, indicating that the instruction sets flags. Of the
instructions in Table 6-1, this includes ADDS, SUBS, ADCS, SBCS, ANDS, and BICS. There are other flag
setting instructions, notably CMP, CMN and TST, but these do not take an S suffix.
The operations ADC and SBC perform additions and subtractions that also use the carry condition
flag as an input.
ADC{S}: Rd = Rn + Rm + C
SBC{S}: Rd = Rn - Rm - 1 + C
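A typical use of the carry input is arithmetic on values wider than a register. The following C sketch models a 128-bit addition and subtraction built from 64-bit halves, mirroring an ADDS followed by ADC, and a SUBS followed by SBC. The u128 type and function names are hypothetical, and the carry is computed explicitly because C has no direct access to the C flag.

```c
#include <stdint.h>

/* Hypothetical 128-bit value held as two 64-bit registers. */
typedef struct { uint64_t lo, hi; } u128;

/* Models: ADDS Xlo, Alo, Blo ; ADC Xhi, Ahi, Bhi */
static u128 add128(u128 a, u128 b)
{
    u128 r;
    r.lo = a.lo + b.lo;
    uint64_t carry = r.lo < a.lo;     /* C flag after ADDS: unsigned overflow */
    r.hi = a.hi + b.hi + carry;       /* ADC: Rd = Rn + Rm + C */
    return r;
}

/* Models: SUBS Xlo, Alo, Blo ; SBC Xhi, Ahi, Bhi */
static u128 sub128(u128 a, u128 b)
{
    u128 r;
    r.lo = a.lo - b.lo;
    uint64_t carry = a.lo >= b.lo;    /* C flag after SUBS: 1 means no borrow */
    r.hi = a.hi - b.hi - 1 + carry;   /* SBC: Rd = Rn - Rm - 1 + C */
    return r;
}
```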

Example 6-1 Arithmetic instructions

ADD W0, W1, W2, LSL #3      // W0 = W1 + (W2 << 3)
SUBS X0, X4, X3, ASR #2     // X0 = X4 - (X3 >> 2), set flags
MOV X0, X1                  // Copy X1 to X0
CMP W3, W4                  // Set flags based on W3 - W4
ADD W0, W5, #27             // W0 = W5 + 27


The logical operations are essentially the same as the corresponding boolean operators operating
on individual bits of the register.
The BIC (Bitwise bit Clear) instruction performs an AND of the first source register (the
register that follows the destination register) with the inverted value of the second operand.
For example, to clear bit [11] of register X0, use:
MOV X1, #0x800
BIC X0, X0, X1
ORN and EON perform an OR or EOR respectively with a bitwise-NOT of the second operand.
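In C terms, these three instructions correspond to the following expressions. The function names are hypothetical and simply mirror the mnemonics.

```c
#include <stdint.h>

/* BIC Xd, Xn, Xm : Xd = Xn AND NOT Xm */
static uint64_t bic(uint64_t xn, uint64_t xm) { return xn & ~xm; }

/* ORN Xd, Xn, Xm : Xd = Xn OR NOT Xm */
static uint64_t orn(uint64_t xn, uint64_t xm) { return xn | ~xm; }

/* EON Xd, Xn, Xm : Xd = Xn EOR NOT Xm */
static uint64_t eon(uint64_t xn, uint64_t xm) { return xn ^ ~xm; }
```

With the operands from the example above, bic(x0, 0x800) clears bit [11] of x0 and leaves every other bit unchanged.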

The comparison instructions only modify the flags and have no other effect. The range of
immediate values for these instructions is 12 bits, and this value can be optionally shifted 12 bits
to the left.
6.2.2

Multiply and divide instructions
The multiply instructions provided are broadly similar to those in ARMv7-A, but with the
ability to perform 64-bit multiplies in a single instruction.
Table 6-2 Multiplication operations in assembly language
Opcode    Description

Multiply instructions
MADD      Multiply add
MNEG      Multiply negate
MSUB      Multiply subtract
MUL       Multiply
SMADDL    Signed multiply-add long
SMNEGL    Signed multiply-negate long
SMSUBL    Signed multiply-subtract long
SMULH     Signed multiply returning high half
SMULL     Signed multiply long
UMADDL    Unsigned multiply-add long
UMNEGL    Unsigned multiply-negate long
UMSUBL    Unsigned multiply-subtract long
UMULH     Unsigned multiply returning high half
UMULL     Unsigned multiply long

Divide instructions
SDIV      Signed divide
UDIV      Unsigned divide

There are multiply instructions that operate on 32-bit or 64-bit values and return a result of the
same size as the operands. For example, two 64-bit registers can be multiplied to produce a
64-bit result with the MUL instruction.

MUL X0, X1, X2          // X0 = X1 * X2

There is also the ability to add or subtract an accumulator value in a third source register, using
the MADD or MSUB instructions.
The MNEG instruction can be used to negate the result, for example:
MNEG X0, X1, X2         // X0 = -(X1 * X2)

Additionally, there are a range of multiply instructions that produce a long result, that is,
multiplying two 32-bit numbers and generating a 64-bit result. There are both signed and
unsigned variants of these long multiplies (UMULL, SMULL). There are also options to accumulate
a value from another register (UMADDL, SMADDL) or to negate (UMNEGL, SMNEGL).
The 32-bit and 64-bit multiplies, with optional accumulation, give a result of the same size
as the operands:
• 32 ± (32 × 32) gives a 32-bit result.
• 64 ± (64 × 64) gives a 64-bit result.
• ± (32 × 32) gives a 32-bit result.
• ± (64 × 64) gives a 64-bit result.
Widening multiplies, both signed and unsigned, with optional accumulation give a single 64-bit
result:
• 64 ± (32 × 32) gives a 64-bit result.
• ± (32 × 32) gives a 64-bit result.
A 64 × 64 to 128-bit multiply requires a sequence of two instructions to generate a pair of 64-bit
result registers:
• ± (64 × 64) gives the lower 64 bits of the result [63:0].
• (64 × 64) gives the higher 64 bits of the result [127:64].

Note
The list contains no 32 × 64 options. You cannot directly multiply a 32-bit W register by a 64-bit
X register.
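The result sizes listed above can be illustrated with a small C model. The function names are hypothetical and follow the mnemonics. The high half of the 64 × 64 product is computed from 32-bit pieces so that the sketch does not depend on a compiler-specific 128-bit integer type.

```c
#include <stdint.h>

/* SMULL Xd, Wn, Wm : signed 32 x 32 giving a 64-bit result. */
static int64_t smull(int32_t wn, int32_t wm)
{
    return (int64_t)wn * (int64_t)wm;
}

/* UMULL Xd, Wn, Wm : unsigned 32 x 32 giving a 64-bit result. */
static uint64_t umull(uint32_t wn, uint32_t wm)
{
    return (uint64_t)wn * (uint64_t)wm;
}

/* MUL Xd, Xn, Xm : lower 64 bits [63:0] of the 64 x 64 product. */
static uint64_t mul64(uint64_t xn, uint64_t xm)
{
    return xn * xm;
}

/* UMULH Xd, Xn, Xm : upper 64 bits [127:64] of the unsigned 64 x 64 product. */
static uint64_t umulh(uint64_t xn, uint64_t xm)
{
    uint64_t n_lo = (uint32_t)xn, n_hi = xn >> 32;
    uint64_t m_lo = (uint32_t)xm, m_hi = xm >> 32;
    uint64_t p0 = n_lo * m_lo;          /* contributes to bits [63:0]   */
    uint64_t p1 = n_lo * m_hi;          /* contributes to bits [95:32]  */
    uint64_t p2 = n_hi * m_lo;          /* contributes to bits [95:32]  */
    uint64_t p3 = n_hi * m_hi;          /* contributes to bits [127:64] */
    /* Carry out of bits [63:32] when the middle terms are summed. */
    uint64_t mid = (p0 >> 32) + (uint32_t)p1 + (uint32_t)p2;
    return p3 + (p1 >> 32) + (p2 >> 32) + (mid >> 32);
}
```

A full 128-bit product is then the pair (umulh(a, b), mul64(a, b)), matching the two-instruction sequence described above.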
The ARMv8-A architecture has support for signed and unsigned division of 32-bit and 64-bit
sized values. For example:
UDIV W0, W1, W2         // W0 = W1 / W2 (unsigned, 32-bit divide)
SDIV X0, X1, X2         // X0 = X1 / X2 (signed, 64-bit divide)

Overflow and divide-by-zero are not trapped:
• Any integer division by zero returns zero.
• Overflow can only occur in SDIV:
— INT_MIN / -1 returns INT_MIN, where INT_MIN is the smallest negative number that
can be encoded in the registers used for the operation. The result is always rounded
towards zero, as in most C/C++ dialects.
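Because C leaves both of these cases undefined, code that needs the A64 results on any host has to test for them explicitly. The following sketch models the behavior described above. The function names are hypothetical.

```c
#include <stdint.h>

/* Models UDIV Wd, Wn, Wm : division by zero returns zero. */
static uint32_t udiv32(uint32_t n, uint32_t d)
{
    return d == 0 ? 0 : n / d;
}

/* Models SDIV Xd, Xn, Xm : division by zero returns zero, and the one
   overflowing case, INT64_MIN / -1, returns INT64_MIN. */
static int64_t sdiv64(int64_t n, int64_t d)
{
    if (d == 0)
        return 0;
    if (n == INT64_MIN && d == -1)
        return INT64_MIN;
    return n / d;                   /* C also rounds towards zero */
}
```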


6.2.3

Shift operations
The following instructions are specifically for shifting:
• Logical Shift Left (LSL). The LSL instruction performs multiplication by a power of 2.
• Logical Shift Right (LSR). The LSR instruction performs unsigned division by a power of 2.
• Arithmetic Shift Right (ASR). The ASR instruction performs division by a power of 2,
preserving the sign bit.
• Rotate right (ROR). The ROR instruction performs a bitwise rotation, wrapping the bits
rotated from the LSB into the MSB.
Table 6-3 Shift and move operations
Instruction    Description

Shift
ASR            Arithmetic shift right
LSL            Logical shift left
LSR            Logical shift right
ROR            Rotate right

Move
MOV            Move
MVN            Bitwise NOT

[Figure: four panels showing a register being shifted.
LSL Logical shift left: bits shifted out are lost. Multiplication by 2^n, where n is the shift amount.
LSR Logical shift right: bits shifted out are lost. Unsigned division by 2^n, where n is the shift amount.
ASR Arithmetic shift right: bits shifted out are lost. Division by 2^n, where n is the shift amount,
preserving the sign bit.
ROR Rotate right: bit rotate with wrap around from LSB to MSB.]
Figure 6-1 Shift operations

The register that is specified for a shift can be 32-bit or 64-bit. The amount to be shifted can be
specified either as an immediate, that is up to register size minus one, or by a register where the
value is taken only from the bottom five (modulo-32) or six (modulo-64) bits.
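The modulo behavior of register-specified shift amounts can be modelled in C as shown below. This is a sketch with hypothetical function names. Masking the amount is also what keeps the C code well defined, because shifting by the full register width or more is undefined in C.

```c
#include <stdint.h>

/* LSL Wd, Wn, Wm : shift amount is taken modulo 32. */
static uint32_t lsl_w(uint32_t x, uint32_t amount)
{
    return x << (amount & 31);
}

/* LSR Wd, Wn, Wm : shift amount is taken modulo 32. */
static uint32_t lsr_w(uint32_t x, uint32_t amount)
{
    return x >> (amount & 31);
}

/* ASR Wd, Wn, Wm : sign bit is replicated into the vacated positions.
   Written without relying on how C shifts negative values. */
static int32_t asr_w(int32_t x, uint32_t amount)
{
    amount &= 31;
    return x < 0 ? ~(~x >> amount) : x >> amount;
}

/* ROR Xd, Xn, Xm : rotate amount is taken modulo 64. */
static uint64_t ror_x(uint64_t x, uint64_t amount)
{
    amount &= 63;
    return (x >> amount) | (x << ((64 - amount) & 63));
}
```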


6.2.4

Bitfield and byte manipulation instructions
There are instructions that extend a byte, halfword, or word to register size, which can be either
X or W. These instructions exist in both signed (SXTB, SXTH, SXTW) and unsigned (UXTB, UXTH)
variants and are aliases to the appropriate bitfield manipulation instruction.
Both the signed and unsigned variants of these instructions extend a byte, halfword, or word
(although only SXTW operates on a word) to register size. The source is always a W register. The
destination register is either an X or a W register, except for SXTW which must be an X register.
For example:
SXTB X0, W1     // Sign extend the least significant byte of register W1
                // from 8 bits to 64 bits by repeating the leftmost bit of the
                // byte.

Bitfield instructions are similar to those that exist in ARMv7 and include Bit Field Insert (BFI),
and signed and unsigned Bit Field Extract ((S/U)BFX). There are extra bitfield instructions too,
such as BFXIL (Bit Field Extract and Insert Low), UBFIZ (Unsigned Bit Field Insert in Zero), and
SBFIZ (Signed Bit Field Insert in Zero).

0 0 0 0 0 0 0 0 1 0 1 0 1 1 0 0 0 1 0 0 1 1 1 0 0 1 1 1 0 1 0 0    (W0 before)

BFI W0, W0, #9, #6      ;Bit field insert

0 0 0 0 0 0 0 0 1 0 1 0 1 1 0 0 0 1 1 0 1 0 0 0 0 1 1 1 0 1 0 0    (W0 after BFI)

UBFX W1, W0, #18, #7    ;Bit field extract (zero extend)

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1 1    (W1 after UBFX)

BFC W1, WZR, #3, #4     ;Bit field clear

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1    (W1 after BFC)

Figure 6-2 Bit manipulation instructions
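The operations in Figure 6-2 can be reproduced with masks and shifts in C. The sketch below uses hypothetical function names that follow the mnemonics and assumes 32-bit (W register) operands, with lsb + width not exceeding 32. Starting from 0x00AC4E74, the bit pattern in the first row of the figure, it produces the same intermediate values as the figure.

```c
#include <stdint.h>

/* Mask with the low 'width' bits set (width in 1..32). */
static uint32_t low_mask(unsigned width)
{
    return width >= 32 ? UINT32_MAX : (UINT32_C(1) << width) - 1;
}

/* UBFX Wd, Wn, #lsb, #width : extract a field and zero extend it. */
static uint32_t ubfx(uint32_t wn, unsigned lsb, unsigned width)
{
    return (wn >> lsb) & low_mask(width);
}

/* BFI Wd, Wn, #lsb, #width : copy the low 'width' bits of Wn into Wd
   at position 'lsb', leaving the other bits of Wd unchanged. */
static uint32_t bfi(uint32_t wd, uint32_t wn, unsigned lsb, unsigned width)
{
    uint32_t field = low_mask(width) << lsb;
    return (wd & ~field) | ((wn << lsb) & field);
}

/* Bit field clear: zero 'width' bits of Wd starting at 'lsb'. */
static uint32_t bfc(uint32_t wd, unsigned lsb, unsigned width)
{
    return bfi(wd, 0, lsb, width);
}
```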

Note
There are also BFM, UBFM, and SBFM instructions. These are Bit Field Move instructions, which are
new for ARMv8. However, the instructions do not need to be used explicitly, as aliases are
provided for all cases. These aliases are the bitfield operations already described: [SU]XT[BHWX],
ASR/LSL/LSR immediate, BFI, BFXIL, SBFIZ, SBFX, UBFIZ, and UBFX.
If you are familiar with the ARMv7 architecture, you might recognize the other bit manipulation
instruction:
• CLZ Count leading zero bits in a register.
Similarly, you might recognize the same byte manipulation instructions:
• RBIT Reverse all bits.
• REV Reverse the byte order of a register.
• REV16 Reverse the byte order of each halfword in a register.
Figure 6-3 REV16 instruction
• REV32 Reverse the byte order of each word in a register.
Figure 6-4 REV32 instruction

These operations can be performed on either word (32-bit) or doubleword (64-bit) sized
registers, except for REV32, which applies only to 64-bit registers.
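The effect of these instructions can be modelled in portable C. The following is a sketch only:
the function names are hypothetical, and a compiler targeting A64 would normally emit the single
instruction named in each comment rather than the shifts and loops shown here.

```c
#include <stdint.h>

/* REV Wd, Wn: reverse the four bytes of a word. */
uint32_t rev_w(uint32_t x)
{
    return (x >> 24) | ((x >> 8) & 0x0000FF00u) |
           ((x << 8) & 0x00FF0000u) | (x << 24);
}

/* REV Xd, Xn: reverse all eight bytes of a doubleword. */
uint64_t rev_x(uint64_t x)
{
    return ((uint64_t)rev_w((uint32_t)x) << 32) | rev_w((uint32_t)(x >> 32));
}

/* REV16 Xd, Xn: swap the two bytes inside each 16-bit halfword. */
uint64_t rev16_x(uint64_t x)
{
    const uint64_t mask = 0x00FF00FF00FF00FFull;
    return ((x & mask) << 8) | ((x >> 8) & mask);
}

/* REV32 Xd, Xn: reverse the bytes inside each 32-bit word. */
uint64_t rev32_x(uint64_t x)
{
    return ((uint64_t)rev_w((uint32_t)(x >> 32)) << 32) | rev_w((uint32_t)x);
}

/* RBIT Wd, Wn: reverse the order of all 32 bits. */
uint32_t rbit_w(uint32_t x)
{
    uint32_t r = 0;
    for (int i = 0; i < 32; i++) {
        r = (r << 1) | (x & 1u);
        x >>= 1;
    }
    return r;
}

/* CLZ Wd, Wn: count leading zero bits; returns 32 for an input of zero. */
unsigned clz_w(uint32_t x)
{
    unsigned n = 0;
    for (uint32_t bit = 0x80000000u; bit != 0 && (x & bit) == 0; bit >>= 1)
        n++;
    return n;
}
```

For the input 0x0123456789ABCDEF, rev16_x returns 0x23016745AB89EFCD and rev32_x returns
0x67452301EFCDAB89, which matches the halfword and word groupings described above.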

6.2.5

Conditional instructions
The A64 instruction set does not support conditional execution for every instruction. Predicated
execution of instructions does not offer sufficient benefit to justify its significant use of opcode
space.
Processor state on page 4-6 describes the four status flags: Negative (N), Zero (Z), Carry (C),
and Overflow (V). Table 6-4 indicates the value of these bits for flag setting operations.
Table 6-4 Condition flag

Flag  Name      Description
N     Negative  Set to the same value as bit[31] of a 32-bit result (bit[63] of a 64-bit result).
                For a 32-bit signed integer, bit[31] being set indicates that the value is
                negative.
Z     Zero      Set to 1 if the result is zero, otherwise it is set to 0.
C     Carry     Set to the carry-out value from the result, or to the value of the last bit
                shifted out from a shift operation.
V     Overflow  Set to 1 if signed overflow or underflow occurred, otherwise it is set to 0.

The C flag is set if the result of an unsigned operation overflows the result register.
The V flag operates in the same way as the C flag, but for signed operations.
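As a rough illustration of how the four flags are derived, the following C sketch models the
32-bit flag-setting add and subtract operations. The type nzcv_t and the functions adds32 and
subs32 are names invented for this example. The sketch relies on one architectural detail that is
not spelled out above: a subtraction is performed as a + NOT(b) + 1, so after SUBS or CMP the C
flag is 1 when no borrow occurred.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct { bool n, z, c, v; } nzcv_t;

/* Model of ADDS Wd, Wn, Wm. */
nzcv_t adds32(uint32_t a, uint32_t b, uint32_t *result)
{
    uint64_t wide = (uint64_t)a + (uint64_t)b;
    uint32_t r = (uint32_t)wide;
    nzcv_t f;
    f.n = (r >> 31) != 0;                        /* bit[31] of the result */
    f.z = (r == 0);
    f.c = (wide >> 32) != 0;                     /* unsigned carry out */
    f.v = ((~(a ^ b) & (a ^ r)) >> 31) != 0;     /* same-sign inputs, different-sign result */
    *result = r;
    return f;
}

/* Model of SUBS Wd, Wn, Wm (CMP is SUBS with the result discarded). */
nzcv_t subs32(uint32_t a, uint32_t b, uint32_t *result)
{
    uint64_t wide = (uint64_t)a + (uint64_t)(uint32_t)~b + 1u;
    uint32_t r = (uint32_t)wide;
    nzcv_t f;
    f.n = (r >> 31) != 0;
    f.z = (r == 0);
    f.c = (wide >> 32) != 0;                     /* 1 means no borrow */
    f.v = (((a ^ b) & (a ^ r)) >> 31) != 0;      /* different-sign inputs, sign of a lost */
    *result = r;
    return f;
}
```

Adding 1 to 0xFFFFFFFF sets Z and C (unsigned overflow), whereas adding 1 to 0x7FFFFFFF sets N
and V (signed overflow) and leaves C clear.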

Note
The condition flags (NZCV) and the condition codes are the same as in A32 and T32. However,
A64 adds NV (0b1111), though it behaves the same as its complement, AL (0b1110). This differs
from A32, which did not assign any meaning to 0b1111.

Table 6-5 Condition codes

Code  Encoding  Meaning (when set by CMP)               Meaning (when set by FCMP)                    Condition flags
EQ    0b0000    Equal to.                               Equal to.                                     Z==1
NE    0b0001    Not equal to.                           Unordered, or not equal to.                   Z==0
CS    0b0010    Carry set (identical to HS).            Greater than, equal to, or unordered          C==1
                                                        (identical to HS).
HS    0b0010    Greater than, equal to (unsigned)       Greater than, equal to, or unordered          C==1
                (identical to CS).                      (identical to CS).
CC    0b0011    Carry clear (identical to LO).          Less than (identical to LO).                  C==0
LO    0b0011    Unsigned less than (identical to CC).   Less than (identical to CC).                  C==0
MI    0b0100    Minus, Negative.                        Less than.                                    N==1
PL    0b0101    Positive or zero.                       Greater than, equal to, or unordered.         N==0
VS    0b0110    Signed overflow.                        Unordered. (At least one argument was NaN).   V==1
VC    0b0111    No signed overflow.                     Not unordered. (No argument was NaN).         V==0
HI    0b1000    Greater than (unsigned).                Greater than or unordered.                    (C==1) && (Z==0)
LS    0b1001    Less than or equal to (unsigned).       Less than or equal to.                        (C==0) || (Z==1)
GE    0b1010    Greater than or equal to (signed).      Greater than or equal to.                     N==V
LT    0b1011    Less than (signed).                     Less than or unordered.                       N!=V
GT    0b1100    Greater than (signed).                  Greater than.                                 (Z==0) && (N==V)
LE    0b1101    Less than or equal to (signed).         Less than, equal to or unordered.             (Z==1) || (N!=V)
AL    0b1110    Always executed.                        Default. Always executed.                     Any
NV    0b1111    Always executed.                        Always executed.                              Any
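The condition codes in Table 6-5 can be evaluated mechanically from the flags. The C sketch below
is illustrative only: cond_holds and fcmp_flags are hypothetical names, and the flag values that
fcmp_flags produces for each floating-point outcome are taken from the architecture reference
rather than from this guide. It shows why, for example, LT means "less than or unordered" after a
floating-point comparison while MI means strictly "less than".

```c
#include <assert.h>
#include <stdbool.h>

typedef struct { bool n, z, c, v; } nzcv_t;

/* Evaluate a 4-bit condition code against the flags. The upper three bits select a
 * base test and the lowest bit inverts it, except for 0b1110 and 0b1111, which both
 * always pass. */
bool cond_holds(unsigned cond, nzcv_t f)
{
    assert(cond <= 0xF);
    bool r;
    switch (cond >> 1) {
    case 0:  r = f.z;                    break;  /* EQ / NE */
    case 1:  r = f.c;                    break;  /* CS,HS / CC,LO */
    case 2:  r = f.n;                    break;  /* MI / PL */
    case 3:  r = f.v;                    break;  /* VS / VC */
    case 4:  r = f.c && !f.z;            break;  /* HI / LS */
    case 5:  r = (f.n == f.v);           break;  /* GE / LT */
    case 6:  r = (f.n == f.v) && !f.z;   break;  /* GT / LE */
    default: return true;                        /* AL / NV */
    }
    return (cond & 1u) ? !r : r;
}

/* Flags produced by a floating-point compare of a with b. */
nzcv_t fcmp_flags(double a, double b)
{
    nzcv_t f = { false, false, false, false };
    if (a != a || b != b) {          /* at least one NaN: unordered */
        f.c = true;
        f.v = true;
    } else if (a == b) {
        f.z = true;
        f.c = true;
    } else if (a < b) {
        f.n = true;
    } else {
        f.c = true;
    }
    return f;
}
```

For an unordered comparison the flags are C=1 and V=1, so VS, HI, LT and NE all pass while MI and
GE fail, in agreement with the FCMP column of the table.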

There is a small set of conditional data processing instructions. These instructions are
unconditionally executed but use the condition flags as an extra input to the instruction. This set
has been provided to replace common usage of conditional execution in ARM code.
The instruction types which read the condition flags are:
Add/subtract with carry
The traditional ARM instructions, for example, for multi-precision arithmetic and
checksums.

Conditional select with optional increment, negate, or invert
Conditionally select between one source register and a second incremented,
negated, inverted, or unmodified source register.
These are the most common uses of single conditional instructions in A32 and
T32. Typical uses include conditional counting or calculating the absolute value
of a signed quantity.
Conditional operations
The A64 instruction set enables conditional execution of only program flow control branch
instructions. This is in contrast to A32 and T32 where most instructions can be predicated with
a condition code. The conditional data processing operations can be summarized as follows:
Conditional select (move)
• CSEL Select between two registers based on a condition. Unconditional
  instructions, followed by a conditional select, can replace short conditional
  sequences.
• CSINC Select between two registers based on a condition. Return the first
  source register or the second source register incremented by one.
• CSINV Select between two registers based on a condition. Return the first
  source register or the inverted second source register.
• CSNEG Select between two registers based on a condition. Return the first
  source register or the negated second source register.

Conditional set
Conditionally select between 0 and 1 (CSET) or 0 and -1 (CSETM). Used, for
example, to set the condition flags as a boolean value or mask in a general
register.
Conditional compare
(CCMP and CCMN) Sets the condition flags to the result of a comparison if the original
condition is true. If not true, the condition flags are set to a specified condition
flag state. The conditional compare instruction is very useful for expressing
nested or compound comparisons.
Note
Conditional select and conditional compare are also available for floating-point registers using
the FCSEL and FCCMP instructions.
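To make the conditional compare concrete, the sketch below models how a compound test such as
(a == 0 && b == 5) can be evaluated with one CMP followed by one CCMP and a single final branch.
The function names are invented for the example, and the flags are packed in the same 4-bit
layout as the CCMP immediate (N=8, Z=4, C=2, V=1), which is an assumption taken from the
instruction encoding rather than from the text above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Flags produced by CMP Wn, Wm, packed as a 4-bit NZCV value. */
unsigned cmp_flags(uint32_t a, uint32_t b)
{
    uint32_t r = a - b;
    unsigned n = r >> 31;
    unsigned z = (r == 0);
    unsigned c = (a >= b);                          /* 1 means no borrow */
    unsigned v = (((a ^ b) & (a ^ r)) >> 31) & 1u;
    return (n << 3) | (z << 2) | (c << 1) | v;
}

/* Model of CCMP Wn, Wm, #nzcv, cond: if the condition held, compare as CMP would,
 * otherwise load the flags with the immediate. */
unsigned ccmp_flags(bool cond_true, uint32_t a, uint32_t b, unsigned nzcv_imm)
{
    assert(nzcv_imm <= 0xF);
    return cond_true ? cmp_flags(a, b) : nzcv_imm;
}

/* if (a == 0 && b == 5), expressed as:
 *     CMP  W0, #0
 *     CCMP W1, #5, #0, EQ    // compare b with 5 only if a == 0, else force Z = 0
 *     B.EQ taken
 */
bool both_match(uint32_t a, uint32_t b)
{
    unsigned flags = cmp_flags(a, 0);
    bool eq = (flags & 4u) != 0;
    flags = ccmp_flags(eq, b, 5, 0x0);
    return (flags & 4u) != 0;
}
```

Because the fallback immediate clears Z, a failed first comparison makes the final EQ test fail
without a second branch.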
For example:
CSINC X0, X1, X0, NE

// Set the return register X0 to X1 if Zero flag clear,
// else increment X0
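The conditional select family has simple C equivalents. The following sketch uses hypothetical
function names and takes the already-evaluated condition as a bool. It also shows how the
single-source forms (CSET, CSETM, CINC, CNEG) can be expressed through the base instructions with
the condition inverted, which is how the architecture defines those aliases.

```c
#include <stdbool.h>
#include <stdint.h>

/* Base instructions: Xd = cond ? Xn : f(Xm). */
uint64_t csel(bool cond, uint64_t n, uint64_t m)  { return cond ? n : m; }
uint64_t csinc(bool cond, uint64_t n, uint64_t m) { return cond ? n : m + 1; }
uint64_t csinv(bool cond, uint64_t n, uint64_t m) { return cond ? n : ~m; }
uint64_t csneg(bool cond, uint64_t n, uint64_t m) { return cond ? n : (uint64_t)0 - m; }

/* Aliases, written in terms of the base instructions with the condition inverted. */
uint64_t cset(bool cond)              { return csinc(!cond, 0, 0); }  /* CSINC Xd, XZR, XZR */
uint64_t csetm(bool cond)             { return csinv(!cond, 0, 0); }  /* CSINV Xd, XZR, XZR */
uint64_t cinc(bool cond, uint64_t n)  { return csinc(!cond, n, n); }  /* CSINC Xd, Xn, Xn */
uint64_t cneg(bool cond, uint64_t n)  { return csneg(!cond, n, n); }  /* CSNEG Xd, Xn, Xn */

/* Branch-free absolute value:
 *     CMP  X0, #0
 *     CNEG X0, X0, LT
 */
int64_t abs64(int64_t x)
{
    return (int64_t)cneg(x < 0, (uint64_t)x);
}
```

For example, csinc(true, 10, 20) returns 10 and csinc(false, 10, 20) returns 21, which mirrors
the CSINC X0, X1, X0, NE example above.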

Some aliases to the example instructions are provided, where either the zero register is used, or
the same register is used for both source registers of the instruction.
For example:
CINC X0, X0, LS     // If less than or same (LS) then X0 = X0 + 1
CSET W0, EQ         // If the previous comparison was equal (Z=1) then W0 = 1,
                    // else W0 = 0
CSETM X0, NE        // If not equal then X0 = -1, else X0 = 0

This class of instructions provides a powerful way to avoid the use of branches or conditionally
executed instructions. Compilers, or assembly programmers, might adopt a technique of
performing the operations for both branches of an if-then-else statement. Then the correct result
is selected at the end.
For example, consider the simple C code:
if (i == 0)
    r = r + 2;
else
    r = r - 1;

This might produce code similar to:
CMP w0, #0              // if (i == 0)
SUB w2, w1, #1          // r = r - 1
ADD w1, w1, #2          // r = r + 2
CSEL w1, w1, w2, EQ     // select between the two results

6.3

Memory access instructions
As with all prior ARM processors, the ARMv8 architecture is a Load/Store architecture. This
means that no data processing instruction operates directly on data in memory. The data must
first be loaded into registers, modified, and then stored to memory. The program must specify
an address, the size of data to be transferred, and a source or destination register. There are
additional Load and Store instructions which provide further options, such as non-temporal
Load/Store, Load/Store exclusives, and Acquire/Release.
Memory instructions can access Normal memory in an unaligned fashion (see Chapter 13
Memory Ordering). This is not supported by exclusive accesses, load acquire or store release
variants. If unaligned accesses are not desired, they can be configured to be faulted.

6.3.1

Load instruction format
The general form of a Load instruction is as follows:
LDR Rt, <addr>

For loads into integer registers, you can choose a size to load. For example, to load a size smaller
than the specified register value, append one of the following suffixes to the LDR instruction:
• LDRB (8-bit, zero extended).
• LDRSB (8-bit, sign extended).
• LDRH (16-bit, zero extended).
• LDRSH (16-bit, sign extended).
• LDRSW (32-bit, sign extended).
There are also unscaled-offset forms such as LDUR (see Specifying the address for a Load
or Store instruction on page 6-14). Programmers will not normally need to use the LDUR form
explicitly, because most assemblers can select the appropriate version based on the offset used.
You do not need to specify a zero-extended load to an X register, because writing a W register
effectively zero extends to the entire register width.
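The difference between the zero-extending and sign-extending loads can be sketched in C. The
function names below are invented for this example, the byte buffer stands in for memory, and
multi-byte values are assembled in little-endian order, which is an assumption about the data
layout. The casts through int8_t, int16_t and int32_t assume the usual two's complement
conversion.

```c
#include <stdint.h>

/* LDRB Wt: load one byte, zero extend to 32 bits. */
uint32_t ldrb(const uint8_t *mem)    { return mem[0]; }

/* LDRSB Wt: load one byte, sign extend to 32 bits. */
uint32_t ldrsb_w(const uint8_t *mem) { return (uint32_t)(int32_t)(int8_t)mem[0]; }

/* LDRSB Xt: load one byte, sign extend to 64 bits. */
uint64_t ldrsb_x(const uint8_t *mem) { return (uint64_t)(int64_t)(int8_t)mem[0]; }

/* LDRH Wt: load a halfword, zero extend to 32 bits. */
uint32_t ldrh(const uint8_t *mem)
{
    return (uint32_t)mem[0] | ((uint32_t)mem[1] << 8);
}

/* LDRSH Wt: load a halfword, sign extend to 32 bits. */
uint32_t ldrsh_w(const uint8_t *mem) { return (uint32_t)(int32_t)(int16_t)ldrh(mem); }

/* LDRSW Xt: load a word, sign extend to 64 bits. */
uint64_t ldrsw(const uint8_t *mem)
{
    uint32_t w = (uint32_t)mem[0] | ((uint32_t)mem[1] << 8) |
                 ((uint32_t)mem[2] << 16) | ((uint32_t)mem[3] << 24);
    return (uint64_t)(int64_t)(int32_t)w;
}

/* A write to Wt clears bits [63:32] of the corresponding X register. */
uint64_t x_after_w_write(uint32_t w) { return (uint64_t)w; }
```

With the byte 0x8A in memory, ldrb yields 0x0000008A, ldrsb_w yields 0xFFFFFF8A (so the full X
register reads 0x00000000FFFFFF8A), and ldrsb_x yields 0xFFFFFFFFFFFFFF8A.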

Figure 6-5 Load instructions (LDRSB W4, LDRSB X4, and LDRB W4 each load one byte from memory,
with sign or zero extension of the upper bits of the destination register)

6.3.2

Store instruction format
Similarly, the general form of a Store instruction is as follows:
STR Rn, <addr>

There are also unscaled-offset forms such as STUR (see Specifying the address for a Load
or Store instruction on page 6-14). Programmers will not normally need to use the STUR form
explicitly, as most assemblers can select the appropriate version based on the offset used.
The size to be stored might be smaller than the register. You specify this by adding a B or H
suffix to the STR. It is always the least significant part of the register that is stored in such a case.
6.3.3

Floating-point and NEON scalar loads and stores
Load and Store instructions can also access floating-point/NEON registers. Here, the size is
determined only by the register being loaded or stored, which can be any of the B, H, S, D, or
Q registers. This information is summarized in Table 6-6, and Table 6-7.
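For Load instructions:
Table 6-6 Memory bits read by Load instructions

Load    Xt    Wt    Qt    Dt    St    Ht    Bt
LDR     64    32    128   64    32    16    8
LDP     128   64    256   128   64    -     -
LDRB    -     8     -     -     -     -     -
LDRH    -     16    -     -     -     -     -
LDRSB   8     8     -     -     -     -     -
LDRSH   16    16    -     -     -     -     -
LDRSW   32    -     -     -     -     -     -
LDPSW   64    -     -     -     -     -     -

For Store instructions:
Table 6-7 Memory bits written by Store instructions

Store   Xt    Wt    Qt    Dt    St    Ht    Bt
STR     64    32    128   64    32    16    8
STP     128   64    256   128   64    -     -
STRB    -     8     -     -     -     -     -
STRH    -     16    -     -     -     -     -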

No sign-extension options are available for loads into FP/SIMD registers. Addresses for such
loads are still specified using the general-purpose registers.
For example:
LDR D0, [X0, X1]

Loads register D0 with the doubleword at the memory address pointed to by X0 plus X1.

Note
Floating-point and scalar NEON Loads and Stores use the same addressing modes as integer
Loads and Stores.

6.3.4

Specifying the address for a Load or Store instruction
The addressing modes available to A64 are similar to those in A32 and T32. There are some
additional restrictions as well as some new features, but the addressing modes available to A64
will not be surprising to someone familiar with A32 or T32.
In A64, the base register of an address operand must always be an X register. However, several
instructions support zero-extension or sign-extension so that a 32-bit offset can be provided as
a W register.
Offset modes
Offset addressing modes add an immediate value or an optionally-modified register value to a
64-bit base register to generate an address.
Table 6-8 Offset addressing modes

Example instruction            Description
LDR X0, [X1]                   Load from the address in X1
LDR X0, [X1, #8]               Load from address X1 + 8
LDR X0, [X1, X2]               Load from address X1 + X2
LDR X0, [X1, X2, LSL #3]       Load from address X1 + (X2 << 3)
LDR X0, [X1, W2, SXTW]         Load from address X1 + sign_extend(W2)
LDR X0, [X1, W2, SXTW #3]      Load from address X1 + (sign_extend(W2) << 3)

Typically, when specifying a shift or extension option, the shift amount can be either 0 (the
default) or log2 of the access size in bytes (so that Rn << <shift> multiplies Rn by the access
size). This supports common array-indexing operations.
// A C example showing accesses that a compiler is likely to generate.
#include <stdint.h>

void example_dup(int32_t a[], int32_t length) {
    int32_t first = a[0];                   // LDR W3, [X0]
    for (int32_t i = 1; i < length; i++) {
        a[i] = first;                       // STR W3, [X0, W2, SXTW #2]
    }
}
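The address arithmetic behind the register-offset forms in Table 6-8 can be written out in C.
This is a sketch only: the function names are hypothetical, and the UXTW form is included because
the text mentions zero-extension even though the table does not show it.

```c
#include <assert.h>
#include <stdint.h>

/* [Xn, Xm, LSL #shift] */
uint64_t addr_lsl(uint64_t base, uint64_t index, unsigned shift)
{
    assert(shift <= 4);
    return base + (index << shift);
}

/* [Xn, Wm, SXTW #shift]: the 32-bit index is sign extended before scaling. */
uint64_t addr_sxtw(uint64_t base, uint32_t index, unsigned shift)
{
    assert(shift <= 4);
    uint64_t extended = (uint64_t)(int64_t)(int32_t)index;
    return base + (extended << shift);
}

/* [Xn, Wm, UXTW #shift]: the 32-bit index is zero extended before scaling. */
uint64_t addr_uxtw(uint64_t base, uint32_t index, unsigned shift)
{
    assert(shift <= 4);
    return base + ((uint64_t)index << shift);
}
```

With a base of 0x1000, a W index of 0xFFFFFFFF and a shift of 2, the SXTW form treats the index
as -1 and addresses 0xFFC (the previous int32_t element), while the UXTW form treats it as
4294967295 and addresses 0x400000FFC.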

Index modes
Index modes are similar to offset modes, but they also update the base register. The syntax is the
same as in A32 and T32, but the set of operations is more restrictive. Usually, only immediate
offsets can be provided for index modes.


There are two variants: pre-index modes which apply the offset before accessing the memory,
and post-index modes which apply the offset after accessing the memory.
Table 6-9 Index addressing modes

Example instruction          Description
LDR X0, [X1, #8]!            Pre-index: Update X1 first (to X1 + #8), then load from the new address
LDR X0, [X1], #8             Post-index: Load from the unmodified address in X1 first, then update X1
                             (to X1 + #8)
STP X0, X1, [SP, #-16]!      Push X0 and X1 to the stack.
LDP X0, X1, [SP], #16        Pop X0 and X1 off the stack.

These options map cleanly onto some common C operations:
// A C example showing accesses that a compiler is likely to generate.
void example_strcpy(char * dst, const char * src)
{
    char c;
    do {
        c = *(src++);           // LDRB W2, [X1], #1
        *(dst++) = c;           // STRB W2, [X0], #1
    } while (c != '\0');
}
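The ordering difference between the two index modes can also be shown with a small C model, in
which a pointer variable plays the role of the base register. The names ldr_pre_index and
ldr_post_index are hypothetical, and the sketch assumes the byte offset keeps the pointer
suitably aligned.

```c
#include <stddef.h>
#include <stdint.h>

/* Model of LDR Xt, [Xn, #imm]! : update the base first, then load from the new address. */
uint64_t ldr_pre_index(const uint64_t **base, ptrdiff_t offset_bytes)
{
    *base = (const uint64_t *)((const unsigned char *)*base + offset_bytes);
    return **base;
}

/* Model of LDR Xt, [Xn], #imm : load from the unmodified base, then update it. */
uint64_t ldr_post_index(const uint64_t **base, ptrdiff_t offset_bytes)
{
    uint64_t value = **base;
    *base = (const uint64_t *)((const unsigned char *)*base + offset_bytes);
    return value;
}
```

Starting with the base pointing at the first element of { 10, 20, 30 }, a post-index load with
offset 8 returns 10 and leaves the base on the second element. A following pre-index load with
offset 8 moves the base to the third element and returns 30.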

PC-relative modes (load-literal)
A64 adds another addressing mode specifically for accessing literal pools. Literal pools are
blocks of data encoded in an instruction stream. The pools are not executed, but their data can
be accessed from surrounding code using PC-relative memory addresses. Literal pools are often
used to encode constant values that do not fit into a simple move-immediate instruction.
In A32 and T32, the PC can be read like a general-purpose register, so a literal pool can be
accessed simply by specifying PC as the base register.
In A64, PC is not generally accessible, but instead there is a special addressing mode (for load
instructions only) that accesses a PC-relative address. This special-purpose addressing mode
also has a much greater range than the PC-relative loads in A32 and T32 could achieve, so literal
pools can be positioned more sparsely.
Table 6-10

Example instruction     Description
LDR W0, <label>         Load 4 bytes from <label> into W0
LDR X0, <label>         Load 8 bytes from <label> into X0
LDRSW X0, <label>       Load 4 bytes from <label> and sign-extend into X0
LDR S0, <label>         Load 4 bytes from <label> into S0
LDR D0, <label>         Load 8 bytes from <label> into D0
LDR Q0, <label>         Load 16 bytes from <label> into Q0

Note
<label> must be 4-byte-aligned for all variants.


6.3.5

Accessing multiple memory locations
A64 does not include the Load Multiple (LDM) or Store Multiple (STM) instructions that are
available to A32 and T32 code.
In A64 code, there are the Load Pair (LDP) and Store Pair (STP) instructions. Unlike the A32 LDRD
and STRD instructions, any two integer registers can be read or written. Data is read or written to
or from adjacent memory locations. The addressing mode options provided for these
instructions are more restrictive than for other memory access instructions. LDP and STP
instructions can only use a base register with a scaled 7-bit signed immediate value, with
optional pre- or post-increment. Unaligned accesses are possible for LDP and STP, unlike the
32-bit LDRD and STRD.
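The "scaled 7-bit signed immediate" restriction can be expressed as a short check. In this
sketch the function name ldp_offset_encodable is hypothetical and access_bytes is the size of one
register of the pair (4 for W or S, 8 for X or D, 16 for Q). The byte offset must be a multiple
of that size, and the scaled value must fit in the range -64 to 63.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Returns true if `offset` (in bytes) can be encoded in an LDP or STP whose
 * registers are each `access_bytes` wide. */
bool ldp_offset_encodable(int32_t offset, unsigned access_bytes)
{
    assert(access_bytes == 4 || access_bytes == 8 || access_bytes == 16);
    int32_t size = (int32_t)access_bytes;
    if (offset % size != 0)
        return false;                   /* the immediate is scaled by the register size */
    int32_t scaled = offset / size;
    return scaled >= -64 && scaled <= 63;
}
```

For X registers this gives a reachable range of -512 to +504 bytes in steps of 8, so the usual
stack push with an offset of -16 is encodable, whereas an offset of 512 or 12 is not.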
Table 6-11 Register Load/Store pair

Load and Store pair          Description
LDP W3, W7, [X0]             Loads word at address X0 into W3 and word at address X0 + 4 into W7.
                             See Figure 6-6.
LDP X8, X2, [X0, #0x10]!     Loads doubleword at address X0 + 0x10 into X8 and the doubleword at
                             address X0 + 0x10 + 8 into X2 and adds 0x10 to X0. See Figure 6-7.
LDPSW X3, X4, [X0]           Loads word at address X0 into X3 and word at address X0 + 4 into X4,
                             and sign extends both to doubleword size.
LDP D8, D2, [X11], #0x10     Loads doubleword at address X11 into D8 and the doubleword at address
                             X11 + 8 into D2 and adds 0x10 to X11.
STP X9, X8, [X4]             Stores the doubleword in X9 to address X4 and stores the doubleword in
                             X8 to address X4 + 8.

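Figure 6-6 LDP W3, W7, [X0] (the 4 bytes at X0 occupy bits [31:0] and the 4 bytes at X0 + 4
occupy bits [63:32] of the 8 bytes accessed)

Figure 6-7 LDP X8, X2, [X0, #0x10]! (the 8 bytes at X0 + 0x10 occupy bits [63:0] and the 8 bytes
at X0 + 0x10 + 8 occupy bits [127:64] of the 16 bytes accessed)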

6.3.6

Unprivileged access
The A64 LDTR and STTR instructions perform an unprivileged Load or Store (see LDTR and STTR in
ARMv8-A Architecture Reference Manual):
• At EL0, EL2 or EL3, they behave as normal Loads or Stores.
• When executed at EL1, they behave as if they had been executed at privilege level EL0.
These instructions are equivalent to the A32 LDRT and STRT instructions.

6.3.7

Prefetching memory
Prefetch from Memory (PRFM) enables code to provide a hint to the memory system that data
from a particular address will be used by the program soon. The effect of this hint is
IMPLEMENTATION DEFINED, but typically, it results in data or instructions being loaded into one
of the caches.
The instruction syntax is:
PRFM <prfop>, <addr> | label

Where prfop is a concatenation of the following options:
Type      PLD or PST (prefetch for load or store).
Target    L1, L2, or L3 (which cache to target).
Policy    KEEP or STRM (keep in cache, or streaming data).

For example, PLDL1KEEP.
These instructions are similar to the A32 PLD and PLI instructions.
6.3.8

Non-temporal load and store pair
A new concept in ARMv8 is the non-temporal load and store. These are the LDNP and STNP
instructions that perform a read or write of a pair of register values. They also give a hint to the
memory system that caching is not useful for this data. The hint does not prohibit memory
system activity such as caching of the address, preload, or gathering. However, it indicates that
caching is unlikely to increase performance. A typical use case might be streaming data, but take
note that effective use of these instructions requires an approach specific to the
microarchitecture.
Non-temporal loads and stores relax the memory ordering requirements.
For example:
LDR X0, [X3]
LDNP X2, X1, [X0]       // X0 may not be loaded when the instruction executes!

In the above case, the LDNP instruction might be observed before the preceding LDR instruction,
which can result in reading from an uncertain address in X0.
To correct the above, you need an explicit load barrier:
LDR X0, [X3]
DMB NSHLD
LDNP X2, X1, [X0]


6.3.9

Memory access atomicity
An aligned memory access, using a single general-purpose register, is guaranteed to be atomic.
Load pair and store pair instructions to a pair of general-purpose registers, using an aligned
memory address are guaranteed to appear as two individual atomic accesses. Unaligned
accesses are not atomic, as they typically require two separate accesses. Additionally,
floating-point and SIMD memory accesses are not guaranteed to be atomic.

6.3.10

Memory barrier and fence instructions
Both ARMv7 and ARMv8 provide support for different barrier operations. These are described
in more detail in Chapter 13 Memory Ordering:
• Data Memory Barrier (DMB). This forces all earlier-in-program-order memory accesses to
  become globally visible before any subsequent accesses.
• Data Synchronization Barrier (DSB). All pending loads and stores, cache maintenance
  instructions, and all TLB maintenance instructions, are completed before program
  execution continues. A DSB behaves like a DMB, but with additional properties.
• Instruction Synchronization Barrier (ISB). This instruction flushes the CPU pipeline and
  prefetch buffers, causing instructions after the ISB to be fetched (or re-fetched) from
  cache or memory.

ARMv8 introduces one-sided fences, which are associated with the Release Consistency model.
These are called Load-Acquire (LDAR) and Store-Release (STLR) and are address-based
synchronization primitives. (See One-way barriers on page 13-8.) The two operations can be
paired to form a full fence. Only base register addressing is supported for these instructions, no
offsets or other kinds of indexed addressing are provided.
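From C, these one-sided fences are usually reached through the C11 atomics rather than written by
hand. The sketch below is an illustration under assumptions: the names payload, ready, publish
and try_consume are invented, and the comments describe what compilers targeting AArch64
typically emit for acquire and release accesses, which is not guaranteed by the C standard.

```c
#include <stdatomic.h>
#include <stdbool.h>

static int payload;          /* ordinary data, written before the flag is released */
static atomic_bool ready;    /* zero-initialized, so it starts out false */

/* Producer: write the data, then set the flag with release semantics
 * (typically a Store-Release such as STLRB on AArch64). */
void publish(int value)
{
    payload = value;
    atomic_store_explicit(&ready, true, memory_order_release);
}

/* Consumer: read the flag with acquire semantics (typically a Load-Acquire such
 * as LDARB), and only then read the data. */
bool try_consume(int *out)
{
    if (!atomic_load_explicit(&ready, memory_order_acquire))
        return false;
    *out = payload;
    return true;
}
```

If try_consume observes the flag as true, the acquire load guarantees that the earlier write to
payload made before the release store is also visible, without a full DMB on either side.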
6.3.11

Synchronization primitives
ARMv7-A and ARMv8-A architectures both provide support for exclusive memory accesses.
In A64, this is the Load/Store exclusive (LDXR/STXR) pair.
The LDXR instruction loads a value from a memory address and attempts to silently claim an
exclusive lock on the address. The Store-Exclusive instruction then writes a new value to that
location only if the lock was successfully obtained and held. The LDXR/STXR pairing is used to
construct standard synchronization primitives such as spinlocks. A paired set of LDXP and STXP
instructions is provided, to allow code to atomically update a location that spans two registers.
Byte, halfword, word, and doubleword options are available. Like the Load Acquire/Store
Release pairing, only base register addressing, without any offsets, is supported.
The CLREX instruction clears the monitors, but unlike in ARMv7, exception entry or return also
clears the monitor. The monitor might also be cleared spuriously, for example by cache evictions
or other reasons not directly related to the application. Software must avoid having any explicit
memory accesses, system control register updates, or cache maintenance instructions between
paired LDXR and STXR instructions.
There is also an exclusive pair of Load Acquire/Store Release instructions called LDAXR and
STLXR. See Synchronization on page 14-6.
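A spinlock of the kind mentioned above can be sketched with C11 atomics. This is not the
implementation used by any particular operating system: spinlock_t and the three functions are
names chosen for the example. On an ARMv8.0 target a compiler typically expands the
compare-exchange into a Load-Exclusive/Store-Exclusive retry loop with acquire semantics, and the
unlock into a Store-Release, but the exact instructions are the compiler's choice.

```c
#include <stdatomic.h>
#include <stdbool.h>

typedef struct { atomic_uint locked; } spinlock_t;

/* Try to take the lock once: atomically change 0 (free) to 1 (held). */
bool spin_trylock(spinlock_t *lock)
{
    unsigned expected = 0;
    return atomic_compare_exchange_strong_explicit(&lock->locked, &expected, 1u,
                                                   memory_order_acquire,
                                                   memory_order_relaxed);
}

/* Spin until the lock is acquired. */
void spin_lock(spinlock_t *lock)
{
    while (!spin_trylock(lock)) {
        /* busy-wait; real code might execute a WFE or YIELD hint here */
    }
}

/* Release the lock so that writes made while holding it become visible first. */
void spin_unlock(spinlock_t *lock)
{
    atomic_store_explicit(&lock->locked, 0u, memory_order_release);
}
```

The lock word must be initialized to 0 before first use, for example with atomic_init.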


6.4

Flow control
The A64 instruction set provides a number of different kinds of branch instructions (see
Table 6-12). For simple relative branches, that is those to an offset from the current address, the
B instruction is used. Unconditional simple relative branches can branch backward or forward
up to 128MB from the current program counter location. Conditional simple relative branches,
where a condition code is appended to the B, have a smaller range of ±1MB.
Calls to subroutines, where it is necessary for the return address to be stored in the link register
(X30), use the BL instruction. This does not have a conditional version. BL behaves as a B
instruction with the additional effect of storing the return address, which is the address of the
instruction after the BL, in register X30.
Table 6-12 Branch instructions

Branch instructions
B (offset)              Program relative branch forward or back 128MB.
                        A conditional version, for example B.EQ, has a 1MB range.
BL (offset)             As B but store the return address in X30, and hint to branch prediction
                        logic that this is a function call.
BR Xn                   Absolute branch to address in Xn.
BLR Xn                  As BR but store the return address in X30, and hint to branch prediction
                        logic that this is a function call.
RET {Xn}                As BR, but hint to branch prediction logic that this is a function return.
                        Returns to the address in X30 by default, but a different register can be
                        specified.

Conditional branch instructions
CBZ Rt, label           Compare and branch if zero. If Rt is zero, branch forward or back up to
                        1MB.
CBNZ Rt, label          Compare and branch if non-zero. If Rt is not zero, branch forward or back
                        up to 1MB.
TBZ Rt, bit, label      Test and branch if zero. Branch forward or back up to 32kB.
TBNZ Rt, bit, label     Test and branch if non-zero. Branch forward or back up to 32kB.

In addition to these PC-relative instructions, the A64 instruction set includes two absolute
branches. The BR Xn instruction performs an absolute branch to the address in Xn while BLR Xn has
the same effect, but also stores the return address in X30 (the link register). The RET instruction
behaves like BR Xn, but it hints to branch prediction logic that it is a function return. RET branches
to the address in X30 by default, though other registers can be specified.
The A64 instruction set includes some special conditional branches. These allow improved code
density in some cases because an explicit comparison is not necessary.
• CBZ Rt, label       // Compare and branch if zero
• CBNZ Rt, label      // Compare and branch if not zero

These instructions compare the source register, either 32-bit or 64-bit, with zero and then
conditionally perform a branch. The branch offset has a range of ± 1MB. These instructions do
not read or write the condition code flags (NZCV).
There are two similar test and branch instructions:
• TBZ Rt, bit, label      // Test and branch if the tested bit of Rt is zero
• TBNZ Rt, bit, label     // Test and branch if the tested bit of Rt is not zero

These instructions test the bit in the source register at the bit position specified by the immediate
and conditionally branch depending on whether the bit is set or clear. The branch offset has a
range of ±32kB. As with CBZ/CBNZ, these instructions do not read or write the condition code
flags (NZCV).
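The three branch ranges quoted in this section follow from the width of the signed immediate in
each encoding, scaled by the 4-byte instruction size. The sketch below assumes immediate widths
of 26 bits for B and BL, 19 bits for B.cond, CBZ and CBNZ, and 14 bits for TBZ and TBNZ. These
widths come from the instruction encodings rather than from this guide, and the enum and function
names are invented for the example.

```c
#include <stdbool.h>
#include <stdint.h>

enum branch_kind {
    BRANCH_B_BL,         /* B, BL: 26-bit immediate, +/-128MB */
    BRANCH_COND_CBZ,     /* B.cond, CBZ, CBNZ: 19-bit immediate, +/-1MB */
    BRANCH_TBZ           /* TBZ, TBNZ: 14-bit immediate, +/-32kB */
};

/* Returns true if a branch of the given kind can reach `byte_offset` from its own address. */
bool branch_in_range(enum branch_kind kind, int64_t byte_offset)
{
    int imm_bits = (kind == BRANCH_B_BL) ? 26 : (kind == BRANCH_COND_CBZ) ? 19 : 14;

    /* 2^(imm_bits - 1) instructions of 4 bytes each gives 2^(imm_bits + 1) bytes. */
    int64_t limit = (int64_t)1 << (imm_bits + 1);

    if (byte_offset % 4 != 0)
        return false;                       /* targets are instruction aligned */
    return byte_offset >= -limit && byte_offset < limit;
}
```

An assembler or linker performs an equivalent check and reports an error, or inserts a veneer,
when a target is out of range. In this model a TBZ can reach +32764 bytes but not +32768.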


6.5

System control and other instructions
The A64 instruction set contains instructions that relate to:
• Exception handling.
• System register access.
• Debug.
• Hint instructions, which in many systems have power management applications.

6.5.1

Exception handling instructions
There are three exception handling instructions whose purpose it is to cause an exception to be
taken. These are used to make a call to code that runs in a higher Exception level in the OS
(EL1), the Hypervisor (EL2), or Secure Monitor (EL3):
• SVC #imm16      // Supervisor call, allows application program to call the kernel
                  // (EL1).
• HVC #imm16      // Hypervisor call, allows OS code to call hypervisor (EL2).
• SMC #imm16      // Secure Monitor call, allows OS or hypervisor to call Secure
                  // Monitor (EL3).

The immediate value is made available to the handler in the Exception Syndrome Register. This
is a change from ARMv7, where the immediate value had to be determined by reading the
opcode of the calling instruction. See Chapter 10 AArch64 Exception Handling for further
information.
To return from an exception, use the ERET instruction. This instruction restores processor state
by copying SPSR_ELn to PSTATE and branches to the saved return address in ELR_ELn.
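A handler can recover the immediate from the syndrome value with a couple of shifts and masks.
The following C sketch assumes the field layout given in the ARMv8-A Architecture Reference
Manual: the Exception Class in bits [31:26], and the 16-bit immediate in the low 16 bits of the
syndrome for SVC, HVC and SMC. The function and constant names are illustrative, and reading the
register itself (with MRS) is left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Exception Class values for calls made from AArch64 state. */
enum {
    ESR_EC_SVC_A64 = 0x15,
    ESR_EC_HVC_A64 = 0x16,
    ESR_EC_SMC_A64 = 0x17
};

/* Exception Class: bits [31:26] of the syndrome value. */
unsigned esr_exception_class(uint64_t esr)
{
    return (unsigned)((esr >> 26) & 0x3Fu);
}

/* For SVC, HVC and SMC, the low 16 bits hold the instruction's #imm16. */
uint16_t esr_call_immediate(uint64_t esr)
{
    return (uint16_t)(esr & 0xFFFFu);
}

bool esr_is_svc_from_a64(uint64_t esr)
{
    return esr_exception_class(esr) == ESR_EC_SVC_A64;
}
```

For a syndrome value of 0x56001234, the exception class is 0x15 (an SVC from AArch64) and the
immediate is 0x1234.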
6.5.2

System register access
Two instructions are provided for system register access:
• MRS Xt, <system register>       // This copies a system register into a general
                                  // purpose register

  For example:
  MRS X4, ELR_EL1                 // Copies ELR_EL1 to X4

• MSR <system register>, Xt       // This copies a general-purpose register into a
                                  // system register

  For example:
  MSR SPSR_EL1, X0                // Copies X0 to SPSR_EL1

Individual fields of PSTATE can also be accessed with MSR or MRS. For example, to select the
Stack Pointer associated with EL0 or the current Exception level:
• MSR SPSel, #imm                 // A value of 0 or 1 in this register is used to select
                                  // between using EL0 stack pointer or the current exception
                                  // level stack pointer

There are special forms of these instructions that can be used to clear or set individual exception
mask bits (see Saved Process Status Register on page 4-5):
• MSR DAIFClr, #imm4
• MSR DAIFSet, #imm4

See System registers on page 4-7.
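The effect of the DAIFSet and DAIFClr forms on the four mask bits can be modelled as plain bit
operations. In this sketch the 4-bit immediate is assumed to use the order D, A, I, F from bit 3
down to bit 0, as in the architecture reference. The macro and function names are invented for
the example, and the model operates on an ordinary variable rather than on PSTATE.

```c
#include <assert.h>

#define DAIF_D 0x8u   /* Debug exception mask */
#define DAIF_A 0x4u   /* SError (asynchronous abort) mask */
#define DAIF_I 0x2u   /* IRQ mask */
#define DAIF_F 0x1u   /* FIQ mask */

/* Model of MSR DAIFSet, #imm4: set (mask) the selected exceptions. */
unsigned daif_set(unsigned daif, unsigned imm4)
{
    assert(imm4 <= 0xFu);
    return (daif | imm4) & 0xFu;
}

/* Model of MSR DAIFClr, #imm4: clear (unmask) the selected exceptions. */
unsigned daif_clr(unsigned daif, unsigned imm4)
{
    assert(imm4 <= 0xFu);
    return daif & ~imm4 & 0xFu;
}
```

For example, MSR DAIFSet, #2 masks IRQs only, and MSR DAIFClr, #3 unmasks both IRQ and FIQ while
leaving the D and A bits unchanged.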


6.5.3

Debug instructions
There are two debug-related instructions:
• BRK #imm16      // Enters monitor mode debug, where there is on-chip debug monitor
                  // code
• HLT #imm16      // Enters halt mode debug, where external debug hardware is connected

For information on debugging, see Chapter 18 Debug.
6.5.4

Hint instructions
HINT instructions can legally be treated as a NOP, but they can have implementation-specific
effects:
• NOP         // No operation - not guaranteed to take time to execute
• YIELD       // Hint that the current thread is performing a task that
              // can be swapped out
• WFE         // Wait for Event
• WFI         // Wait for Interrupt
• SEV         // Send Event
• SEVL        // Send Event Local

These concepts are also covered in Chapter 14 Multi-core processors and Chapter 15 Power
Management.
6.5.5

NEON instructions
The NEON instruction set also has several enhancements, some of which are quite significant.
Chapter 7 AArch64 Floating-point and NEON describes these in more detail.
Changes to NEON in A64 include:
• Support for double precision floating-point, enabling C code using double precision
  floating-point to be vectorized.
• New instructions to operate on scalar data stored in NEON registers.
• New instructions to insert and extract vector elements.
• New instructions for type conversion and saturating integer arithmetic.
• New instructions for normalization of floating-point values.
• New cross-lane instructions for vector reduction, summation, and taking the minimum or
  maximum value.
• Instructions to perform actions such as compare, add, find absolute value, and negate have
  been extended to be able to operate on 64-bit integer elements.

6.5.6

Floating-point instructions
A64 provides a similar set of floating-point instructions to those of the ARMv7-A VFPv4
extension, which provides single and double precision mathematical operations on scalar
floating-point values. There are a number of changes and new features:
• Floating-point comparisons set the condition flags (NZCV) directly. In A64 there is no
  need to explicitly transfer the comparison results from floating-point to integer flags.
• Instructions have been added relating to the IEEE754-2008 standard, for example to
  calculate the minimum and maximum of a pair of numbers.
• A rounding mode can now be explicitly specified when converting from floating-point to
  integer formats. It is no longer necessary to set the global FPCR flags when simple
  conversions are required in a particular rounding mode. Some of these options are also
  available to ARMv8 A32 and T32.
• Instructions have been added to support conversions between 64-bit integers and
  floating-point formats.
• In A64, floating-point operations involving integer types work directly on integer
  registers. There is no need to manually transfer integer values between floating-point and
  integer registers for conversion operations.

6.5.7

Cryptographic instructions
An optional extension for ARMv8 adds cryptographic instructions that significantly improve
performance on tasks such as AES encryption and SHA1 and SHA256 hashing.


Chapter 7
AArch64 Floating-point and NEON

The ARM Advanced SIMD architecture, its associated implementations, and supporting
software, are commonly referred to as NEON technology. There are NEON instruction sets for
both AArch32 (equivalent to the ARMv7 NEON instructions) and for AArch64. Both can be
used to significantly accelerate repetitive operations on large data sets. This can be useful in
applications such as media codecs.
The NEON architecture for AArch64 uses 32 × 128-bit registers, twice as many as for ARMv7.
These are the same registers used by the floating-point instructions. All compiled code and
subroutines conform to the EABI, which specifies which registers can be corrupted and which
registers must be preserved within a particular subroutine. The compiler is free to use any
NEON/VFP registers for floating-point values or NEON data at any point in the code.
Both floating-point and NEON are required in all standard ARMv8 implementations. However,
implementations targeting specialized markets may support the following combinations:

• No NEON or floating-point.
• Full floating-point and SIMD support with exception trapping.
• Full floating-point and SIMD support without exception trapping.

7.1

New features for NEON and Floating-point in AArch64
AArch64 NEON is based upon the existing AArch32 NEON, with the following changes:
• There are now thirty-two 128-bit registers, rather than the 16 available for ARMv7.
• Smaller registers are no longer packed into larger registers, but are mapped one-to-one to
  the lower-order bits of the 128-bit register. A single precision floating-point value uses the
  lower 32 bits, while a double precision value uses the lower 64 bits of the 128-bit register.
  See NEON and Floating-Point architecture on page 7-4.
• The V prefix present in ARMv7-A NEON instructions has been removed.
• Writes of 64 bits or less to a vector register result in the higher bits being zeroed.
• In AArch64, there are no SIMD or saturating arithmetic instructions which operate on the
  general-purpose registers. Such operations use the NEON registers.
• New lane insert and extract instructions have been added to support the new register
  packing scheme.
• Additional instructions are provided for generating or consuming the top 64 bits of a
  128-bit vector register. Data-processing instructions, which would generate more than one
  result register (widening to a 256-bit vector), or consume two sources (narrowing to a
  128-bit vector), have been split into separate instructions.
• A new set of vector reduction operations provide across-lane sum, minimum and
  maximum.
• Some existing instructions have been extended to support 64-bit integer values. For
  example, comparison, addition, absolute value and negate, including saturating versions.
• Saturating instructions have been extended to include Unsigned Accumulate into Signed,
  and Signed into Unsigned Accumulate.
• Support is provided in AArch64 NEON for double-precision floating-point and full
  IEEE754 operation including rounding modes, denormalized numbers, and NaN
  handling.

Floating-point has been enhanced in AArch64 with the following changes:
•

The V prefix present in ARMv7-A floating-point instructions has been replaced with an F.

•

Support for both single-precision (32-bit) and double-precision (64-bit) floating-point
vector data types and arithmetic as defined by the IEEE 754 floating-point standard,
honoring the FPCR Rounding Mode field, the Default NaN control, the Flush-to-Zero
control, and (where supported by the implementation) the Exception trap enable bits.

•

Load/Store addressing modes for FP/NEON registers are identical to integer Load/Stores,
including the ability to Load or Store a pair of floating-point registers.

•

Floating-point FCSEL and FCCMP (Conditional Select and Conditional Compare) instructions,
equivalent to the integer CSEL and CCMP, have been added.
Floating-point FCMP, FCMPE, FCCMP, and FCCMPE set the PSTATE.{N, Z, C, V} flags based on the
result of the floating-point comparison. Unlike ARMv7, where a floating-point comparison
updates flags held in the floating-point status register, they do not modify condition flags
in the Floating-Point Status Register (FPSR).

•


All floating-point Multiply-Add and Multiply-Subtract instructions are fused.


Fused multiply was introduced in VFPv4 and means that the result of the multiply is not
rounded before being used in the addition. In earlier ARM floating-point architectures, a
Multiply Accumulate operation would perform rounding of both the intermediate result
and final results, which could potentially cause a small loss of precision.
•

Additional conversion operations are provided, for example, between 64-bit integer and
floating-point and between half-precision and double-precision.
Convert float to integer (FCVTxU, FCVTxS) instructions encode a directed rounding mode:
— Towards zero.
— Towards +∞.
— Towards –∞.
— Nearest with ties to even.
— Nearest with ties to away.

•

Round float to nearest integer in floating-point format (FRINTx) has been added, with the
same directed rounding modes, as well as rounding according to the ambient rounding
mode.

•

A new double to single precision Down-Convert instruction with inexact rounding to odd,
suitable for ongoing down-conversion to half-precision with correct rounding (FCVTXN).

•

FMINNM and FMAXNM instructions have been added which implement the IEEE754-2008
minNum() and maxNum() operations. These return the numerical value if one of the operands
is a quiet NaN.
•


Instructions to accelerate floating-point vector normalization have been added (FRECPX,
FMULX).


7.2

NEON and Floating-Point architecture
The contents of the NEON registers are vectors of elements of the same data type. A vector is
divided into lanes and each lane contains a data value called an element.
The number of lanes in a NEON vector depends on the size of the vector and the data elements
in the vector.
Usually, each NEON instruction results in n operations occurring in parallel, where n is the
number of lanes that the input vectors are divided into. There cannot be a carry or overflow from
one lane to another. Ordering of elements in the vector is from the least significant bit. This
means that element 0 uses the least significant bits of the register.
NEON and floating-point instructions operate on elements of the following types:
•

32-bit single precision and 64-bit double precision floating-point.
Note
16-bit floating-point is supported, but only as a format to be converted from or to. It is not
supported for data processing operations.

•

8-bit, 16-bit, 32-bit, or 64-bit unsigned and signed integers.

•

8-bit and 16-bit polynomials.
The polynomial type is for code, such as error correction, that uses power-of-two finite
fields or simple polynomials over {0,1}. Normal ARM integer code typically uses a
lookup table for finite field arithmetic. AArch64 NEON provides instructions to use large
lookup tables.
Polynomial operations are hard to synthesize out of other operations, so it is useful having
a basic multiply operation from which other, larger operations can be synthesized.

The NEON unit views the register file as:
32 × 128-bit quadword registers, V0-V31, each of which can be viewed as in Figure 7-1:

Figure 7-1 Divisions of the V register (the 128-bit NEON register viewed as 2 x 64-bit lanes,
4 x 32-bit lanes, 8 x 16-bit lanes, or 16 x 8-bit lanes, with lane 0 in the least significant bits)

Thirty-two 64-bit D, or doubleword, registers, D0-D31, each of which can be viewed as in
Figure 7-2 on page 7-5:


Figure 7-2 Divisions of the D register (the low 64 bits of the register viewed as 1 x 64-bit lane,
2 x 32-bit lanes, 4 x 16-bit lanes, or 8 x 8-bit lanes)

All of these registers are accessible at any time. Software does not have to explicitly switch
between them because the instruction used determines the appropriate view.
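
The lane views described above can be sketched in portable C. This is only an illustrative model, not NEON code: the vreg_t type and the lane_count, lane_get and add_16b helpers are hypothetical names introduced here. It shows that the lane count is 128 divided by the element size, that element 0 sits in the least significant bits, and that a lane-wise add does not carry into the neighboring lane.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of one 128-bit V register as two 64-bit halves. */
typedef struct {
    uint64_t lo;  /* bits 63:0   */
    uint64_t hi;  /* bits 127:64 */
} vreg_t;

/* Number of lanes when the register holds elements of `bits` bits each. */
unsigned lane_count(unsigned bits)
{
    assert(bits == 8 || bits == 16 || bits == 32 || bits == 64);
    return 128u / bits;
}

/* Read element `idx`; element 0 occupies the least significant bits. */
uint64_t lane_get(vreg_t v, unsigned bits, unsigned idx)
{
    assert(idx < lane_count(bits));
    unsigned bitpos = idx * bits;                      /* 0..127 */
    uint64_t half = (bitpos < 64) ? v.lo : v.hi;
    uint64_t mask = (bits == 64) ? UINT64_MAX
                                 : ((UINT64_C(1) << bits) - 1);
    return (half >> (bitpos % 64)) & mask;
}

/* Lane-wise add of sixteen 8-bit elements. The carry out of a lane is
   discarded, so it never reaches the neighboring lane. */
vreg_t add_16b(vreg_t a, vreg_t b)
{
    vreg_t r = { 0, 0 };
    for (unsigned i = 0; i < 16; i++) {
        uint64_t sum = (lane_get(a, 8, i) + lane_get(b, 8, i)) & 0xFF;
        if (i < 8)
            r.lo |= sum << (8 * i);
        else
            r.hi |= sum << (8 * (i - 8));
    }
    return r;
}
```

With 0xFF in lane 0 of one operand and 0x01 in lane 0 of the other, add_16b leaves lane 1 at zero, whereas an ordinary 64-bit addition of the same bit patterns would produce 0x100.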


7.2.1

Floating-point
In AArch64 the floating-point unit views the NEON register file as:
•
32 × 64-bit D registers D0-D31. The D registers are called double-precision registers and
contain double-precision floating-point values.
•
32 × 32-bit S registers S0-S31. The S registers are called single-precision registers and
contain single-precision floating-point values.
•
32 × 16-bit H registers H0-H31. The H registers are called half-precision registers and
contain half-precision floating-point values.
•
A combination of registers from the above views.

Figure 7-3 Floating-point register divisions (a 64-bit double-precision, 32-bit single-precision,
or 16-bit half-precision value occupies the low-order bits of the 128-bit NEON register)

7.2.2

Scalar data and NEON
Scalar data refers to a single value instead of a vector containing multiple values. Some NEON
instructions use a scalar operand. A scalar inside a register is accessed by index into the vector
of values.
The general array notation to access individual elements of a vector is:
Vd.Ts[index1], Vn.Ts[index2]

where:
Vd is the destination register.
Vn is the first source register.
Ts is the size specifier for the element.
index is the element index.

As in the following example:
INS V0.S[1], V1.S[0]


Figure 7-4 Inserting an element into a vector (INS V0.S[1], V1.S[0])

In the MOV V0.B[3], W0 instruction, the least significant byte of register W0 is copied into the
fourth byte in register V0.

Figure 7-5 Moving a scalar to a lane (MOV V0.B[3], W0): bits 7:0 of ARM register W0 are copied
to bits 31:24 of NEON register V0

NEON scalars can be 8-bit, 16-bit, 32-bit, or 64-bit values. Other than multiply instructions,
instructions that access scalars can access any element in the register file.
Multiply instructions only allow 16-bit or 32-bit scalars, and can only access the first 128 scalars
in the register file:
•
16-bit scalars are restricted to registers Vn.H[x], with 0 ≤ n ≤ 15.
•
32-bit scalars are restricted to registers Vn.S[x].
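
The two element moves shown in this section, INS V0.S[1], V1.S[0] and MOV V0.B[3], W0, can be modeled in portable C as follows. This is a sketch under stated assumptions rather than NEON code: vreg_t, lane_get, lane_set, ins_s and mov_b_from_w are hypothetical names used only to show which bits are read and which are replaced.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of a 128-bit V register as two 64-bit halves. */
typedef struct { uint64_t lo, hi; } vreg_t;

static uint64_t lane_mask(unsigned bits)
{
    return (bits == 64) ? UINT64_MAX : ((UINT64_C(1) << bits) - 1);
}

/* Read element idx of width bits; element 0 is in the low-order bits. */
static uint64_t lane_get(vreg_t v, unsigned bits, unsigned idx)
{
    assert(bits == 8 || bits == 16 || bits == 32 || bits == 64);
    assert(idx < 128u / bits);
    unsigned bitpos = idx * bits;
    uint64_t half = (bitpos < 64) ? v.lo : v.hi;
    return (half >> (bitpos % 64)) & lane_mask(bits);
}

/* Replace element idx; every other element is left unchanged. */
static vreg_t lane_set(vreg_t v, unsigned bits, unsigned idx, uint64_t value)
{
    assert(bits == 8 || bits == 16 || bits == 32 || bits == 64);
    assert(idx < 128u / bits);
    unsigned bitpos = idx * bits;
    unsigned shift = bitpos % 64;
    uint64_t *half = (bitpos < 64) ? &v.lo : &v.hi;
    *half = (*half & ~(lane_mask(bits) << shift))
          | ((value & lane_mask(bits)) << shift);
    return v;
}

/* Model of INS Vd.S[di], Vn.S[si]: copy one 32-bit element. */
vreg_t ins_s(vreg_t vd, unsigned di, vreg_t vn, unsigned si)
{
    return lane_set(vd, 32, di, lane_get(vn, 32, si));
}

/* Model of MOV Vd.B[i], Wn: copy the least significant byte of Wn. */
vreg_t mov_b_from_w(vreg_t vd, unsigned i, uint32_t w)
{
    return lane_set(vd, 8, i, w & 0xFFu);
}
```

Calling mov_b_from_w with index 3 writes bits 31:24 of the destination, which is the fourth byte described above.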
7.2.3

Floating-point parameters
Floating-point values are passed to (and returned from) functions using the floating-point
registers. Both integer (general-purpose) and floating-point registers can be used at the same
time. This means that the floating-point parameters are passed in the floating-point H, S or D
registers and other parameters are passed in integer X or W registers. The AArch64 Procedure
Call Standard mandates hardware floating-point wherever floating-point arithmetic is required,
so there is no software floating-point linkage in AArch64 state.
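
As a brief illustration of the mixed use of integer and floating-point argument registers, consider the hypothetical function below (weighted_sum is a name introduced here, not one from the procedure call standard). Under AAPCS64 the integer and floating-point arguments are allocated from separate register sequences, so an externally visible call would be expected to pass count in W0, x in S0 and y in D1, and to return the result in D0.

```c
#include <assert.h>

/* Hypothetical example. Expected AAPCS64 argument allocation:
 *   count -> W0 (first integer argument register)
 *   x     -> S0 (first floating-point argument register)
 *   y     -> D1 (second floating-point argument register)
 *   result returned in D0
 */
double weighted_sum(int count, float x, double y)
{
    assert(count >= 0);
    return count * ((double)x + y);
}
```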
A detailed list of instructions is given in the ARMv8-A Architecture Reference Manual, but the
main floating-point data processing operations are listed here to show the kind of things that can
be done:
Table 7-1

Instruction                   Description
FABS Sd, Sn                   Calculates the absolute value.
FNEG Sd, Sn                   Negates the value.
FSQRT Sd, Sn                  Calculates the square root.
FADD Sd, Sn, Sm               Adds values.
FSUB Sd, Sn, Sm               Subtracts values.
FDIV Sd, Sn, Sm               Divides one value by another.
FMUL Sd, Sn, Sm               Multiplies two values.
FNMUL Sd, Sn, Sm              Multiplies and negates: Sd = -(Sn × Sm).
FMADD Sd, Sn, Sm, Sa          Multiplies and adds (fused): Sd = Sa + (Sn × Sm).
FMSUB Sd, Sn, Sm, Sa          Multiplies, negates the product and adds (fused): Sd = Sa - (Sn × Sm).
FNMADD Sd, Sn, Sm, Sa         Multiplies, negates the product and subtracts (fused): Sd = -(Sn × Sm) - Sa.
FNMSUB Sd, Sn, Sm, Sa         Multiplies and subtracts (fused): Sd = (Sn × Sm) - Sa.
FRINTy Sd, Sn                 Rounds to an integral in floating-point format (where y
                              is one of a number of rounding mode options).
FCMP Sn, Sm                   Performs a floating-point compare.
FCCMP Sn, Sm, #uimm4, cond    Performs a floating-point conditional compare.
FCSEL Sd, Sn, Sm, cond        Floating-point conditional select: if (cond) Sd = Sn else Sd = Sm.
FCVTtyS Rd, Sn                Converts a floating-point value to a signed integer value (ty
                              specifies the type of rounding).
SCVTF Sd, Rn                  Converts a signed integer value to a floating-point value.


7.3

AArch64 NEON instruction format
A number of changes have been made in the syntax of NEON and floating-point instructions to
harmonize with the AArch64 core integer and scalar floating-point instruction set syntax. The
instruction mnemonics are based closely on ARMv7 NEON.
•

The V prefix of ARMv7 NEON instructions has been removed.
Some mnemonics have been renamed where the removal of the V prefix caused a clash
with the ARM core instruction set mnemonics.
This means, for example, that there are now instructions with the same name which do the
same thing, and can be ARM core instructions or NEON instructions, depending on
the syntax of the instruction, for example:
ADD W0, W1, W2{, shift #amount}

and
ADD X0, X1, X2{, shift #amount}

are A64 base instructions.
ADD D0, D1, D2

is a scalar instruction that performs a 64-bit integer addition in the D registers shared by
NEON and floating-point (the floating-point addition is FADD), and
ADD V0.4H, V1.4H, V2.4H

is a NEON vector instruction.
•

An S, U, F or P prefix has been added to indicate Signed, Unsigned, Floating-point, or
Polynomial (only one of these) data types. This prefix indicates the data type of the
operation. For example:
PMULL V0.8H, V1.8B, V2.8B

•

The vector organization (element size and number of lanes) is described by the register
qualifiers. For example:
ADD Vd.T, Vn.T, Vm.T

where Vd, Vn and Vm are the register names and T is the subdivision of the register to be
used. For this example, T is the arrangement specifier and is one of 8B, 16B, 4H, 8H, 2S, 4S
or 2D. Any of these can be used, depending on whether 64, 32, 16 or 8-bit data is used, and
whether 64 bits or 128 bits of the register are used.
To add 2 × 64-bit lanes, use:
ADD V0.2D, V1.2D, V2.2D

•

As in ARMv7, some NEON data processing instructions are available in Normal, Long,
Wide, Narrow and Saturating variants. Long, Wide and Narrow variants are shown by a
suffix:
—

Normal instructions can operate on any vector types, and produce result vectors the
same size, and usually the same type, as the operand vectors.

—

Long or Lengthening instructions operate on doubleword vector operands and
produce a quadword vector result. The result elements are twice the width of the
operands. Long instructions are specified using an L appended to the instruction.
For example:
SADDL V0.4S, V1.4H, V2.4H

Figure 7-6 on page 7-10 shows this, with input operands being promoted before the
operation.


Figure 7-6 NEON long instructions (the elements of V1.4H and V2.4H are promoted and
combined into V0.4S)

—

Wide or Widening instructions operate on a quadword vector operand and a
doubleword vector operand, producing a quadword vector result. The result elements
and the first operand elements are twice the width of the second operand elements. Wide
instructions have a W appended to the instruction. For example:
SADDW V0.4S, V1.4S, V2.4H

Figure 7-7 shows this, with the input doubleword operand being promoted before
the operation.

Figure 7-7 NEON wide instructions (the elements of V2.4H are promoted and added to
V1.4S to produce V0.4S)

—

Narrow or Narrowing instructions operate on quadword vector operands, and
produce a doubleword vector result. The result elements are usually half the width
of the operand elements. Narrow instructions are specified using an N appended to
the instruction. For example:
SUBHN V0.4H, V1.4S, V2.4S

Figure 7-8 on page 7-11 shows this, with the results being narrowed after the
operation.


Figure 7-8 NEON narrow instructions (V1.4S and V2.4S are combined and the result is
narrowed into V0.4H)

•

Signed and unsigned saturating variants (identified by an SQ or UQ prefix) are available
for a number of instructions, as with SQADD and UQADD. If a result would exceed the
maximum or minimum values of the datatype, saturating instructions return that
maximum or minimum value. The saturation limits depend on the datatype of the
instruction.
Table 7-2 Saturation ranges

Data type                    Saturation range of x
Signed byte (S8)             -2^7 <= x < 2^7
Signed halfword (S16)        -2^15 <= x < 2^15
Signed word (S32)            -2^31 <= x < 2^31
Signed doubleword (S64)      -2^63 <= x < 2^63
Unsigned byte (U8)           0 <= x < 2^8
Unsigned halfword (U16)      0 <= x < 2^16
Unsigned word (U32)          0 <= x < 2^32
Unsigned doubleword (U64)    0 <= x < 2^64

•

The ARMv7 P prefix for pairwise operations is now a suffix in ARMv8, as in ADDP.
Pairwise instructions operate on adjacent pairs of elements in doubleword or quadword
operands. For example:
ADDP V0.4S, V1.4S, V2.4S


Figure 7-9 Pairwise operation (adjacent pairs of elements in V1.4S and V2.4S are added to
produce V0.4S)

•

A V suffix has been added for across-all-lanes (whole register) operations, as in ADDV.
For example:
ADDV S0, V1.4S

Figure 7-10 Across all lanes operation (all lanes of Vn.4S are combined into a single
scalar result)

•

A 2 suffix, known as the second and upper half specifier, has been added for the new
widening, narrowing or lengthening second part instructions. If present, it causes the
operation to be performed on the upper 64 bits of the registers holding the narrower
elements:
—

Widening instructions with a 2 suffix get their input data from the high numbered
lanes of the vector that contains the narrower values, and write the expanded results
to the 128-bit destination. For example:
SADDW2 V0.2D, V1.2D, V2.4S

Figure 7-11 SADDW2 (the upper two lanes of V2.4S are promoted and added to V1.2D to
produce V0.2D)

—

Narrowing instructions with a 2 suffix get their input data from the 128-bit source
operands and insert their narrowed results into the high numbered lanes of the
128-bit destination, leaving the lower lanes unchanged. For example:
XTN2 V0.4S, V1.2D

Figure 7-12 XTN2 (the elements of V1.2D are narrowed into the upper two lanes of V0.4S)

—

Lengthening instructions with a 2 suffix get their input data from the high numbered
lanes of the 128-bit source vectors and write the lengthened results to the 128-bit
destination. For example:
SADDL2 V0.2D, V1.4S, V2.4S

Figure 7-13 SADDL2 (the upper two lanes of V1.4S and V2.4S are promoted and added to
produce V0.2D)

•


Comparison instructions now use the condition code names to indicate what the condition
is and whether (if it applies) the condition is signed or unsigned, for example, CMGT and
CMHI, CMGE and CMHS.
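
The instruction variants described in this section can be illustrated with a scalar C model. The sketch below is not NEON code and all function names are hypothetical; it models one lane of SQADD and UQADD on bytes, the lengthening add on halfwords (the 4H-to-4S arrangement, with the upper flag selecting the lanes that a 2-suffixed form would read), the pairwise ADDP on four words, and the across-lanes ADDV.

```c
#include <assert.h>
#include <stdint.h>

/* One lane of SQADD on signed bytes: clamp to the S8 range of Table 7-2. */
int8_t sqadd_s8(int8_t a, int8_t b)
{
    int sum = (int)a + (int)b;            /* exact in a wider type */
    if (sum > INT8_MAX) return INT8_MAX;
    if (sum < INT8_MIN) return INT8_MIN;
    return (int8_t)sum;
}

/* One lane of UQADD on unsigned bytes: clamp to the U8 range. */
uint8_t uqadd_u8(uint8_t a, uint8_t b)
{
    unsigned sum = (unsigned)a + (unsigned)b;
    return (sum > UINT8_MAX) ? (uint8_t)UINT8_MAX : (uint8_t)sum;
}

/* Lengthening add: each 16-bit element is promoted to 32 bits before the
   addition. upper == 0 reads lanes 0..3 (as SADDL would), upper != 0 reads
   lanes 4..7 (as the SADDL2 form would). */
void saddl_4s(const int16_t n[8], const int16_t m[8], int upper, int32_t d[4])
{
    assert(n && m && d);
    const int base = upper ? 4 : 0;
    for (int i = 0; i < 4; i++)
        d[i] = (int32_t)n[base + i] + (int32_t)m[base + i];
}

/* Pairwise add in the 4S arrangement: adjacent pairs of n fill the low half
   of the result, adjacent pairs of m fill the high half. */
void addp_4s(const uint32_t n[4], const uint32_t m[4], uint32_t d[4])
{
    assert(n && m && d);
    const uint32_t r0 = n[0] + n[1], r1 = n[2] + n[3];
    const uint32_t r2 = m[0] + m[1], r3 = m[2] + m[3];
    d[0] = r0; d[1] = r1; d[2] = r2; d[3] = r3;  /* d may alias n or m */
}

/* Across-lanes add in the 4S arrangement: one scalar, modulo 2^32. */
uint32_t addv_4s(const uint32_t n[4])
{
    assert(n);
    return n[0] + n[1] + n[2] + n[3];
}
```

For example, sqadd_s8(100, 100) saturates to 127 instead of wrapping, and the lengthening add of two halfwords holding 30000 gives 60000, which would not fit in a 16-bit result.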


7.4

NEON coding alternatives
NEON code may be written in a number of ways, which are briefly listed here (see the ARM
NEON Programmers Guide for details): intrinsics, automatic vectorization of C code,
libraries, and assembly language written directly.
Intrinsics are C or C++ pseudo-function calls that the compiler replaces with the appropriate
NEON instructions. This allows you to use the data types and operations available in the NEON
implementation, while allowing the compiler to handle instruction scheduling and register
allocation. These intrinsics are defined in the ARM C Language Extensions document.
Auto-vectorization is controlled with the -fvectorize option in ARM Compiler 6, but is enabled
automatically at higher optimization levels (-O2 and above). Auto-vectorization is disabled at
-O0 even if you specify -fvectorize. Therefore, you would use the following to enable
auto-vectorization at -O1:
armclang --target=armv8a-arm-none-eabi -fvectorize -O1 -c file.c
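
As an illustration of the kind of loop an auto-vectorizer can usually handle, the sketch below uses a simple counted loop over arrays. The function name scale_add is hypothetical. The restrict qualifiers are an assumption of this example: they promise the compiler that the arrays do not overlap, which removes one common obstacle to vectorization. Whether a given compiler actually vectorizes the loop depends on the options and the target.

```c
#include <assert.h>
#include <stddef.h>

/* dst[i] = a[i] * k + b[i] for n elements.
   Each iteration is independent of the others, and restrict states that
   dst, a and b do not alias, so several iterations can be done at once. */
void scale_add(float *restrict dst, const float *restrict a,
               const float *restrict b, float k, size_t n)
{
    assert(dst && a && b);
    for (size_t i = 0; i < n; i++)
        dst[i] = a[i] * k + b[i];
}
```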

There are various libraries available which can use NEON code. The exact status of such
libraries changes over time and so current support is not covered in this guide.
Although it is technically possible to optimize NEON assembly by hand, this can be very
difficult because the pipeline and memory access timings have complex inter-dependencies.
Instead of hand assembly, ARM strongly recommends the use of intrinsics:
•

It is easier to write code using intrinsics than using assembly mnemonics.

•

Intrinsics provide good portability for cross-platform development.

•

There is no need to worry about pipeline and memory access timings.

•

For most cases, the result is good performance.
If you are not an experienced assembly language programmer, intrinsics can often achieve
better performance than assembly. Intrinsics provide almost as much control as writing
assembly language, but leave the allocation of registers to the compiler, so that you can
focus on the algorithms. This leads to more maintainable source code than using assembly
language.
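
A minimal sketch of the intrinsics approach is shown below. It assumes a toolchain that provides arm_neon.h and defines the ACLE macro __ARM_NEON when NEON is available; on any other target the same function falls back to plain C, so the source stays portable. The function name add_f32 is hypothetical. The intrinsics used (vld1q_f32, vaddq_f32 and vst1q_f32) load, add and store four single-precision lanes at a time.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* dst[i] = a[i] + b[i] for n elements. */
void add_f32(float *dst, const float *a, const float *b, size_t n)
{
    assert(dst && a && b);
    size_t i = 0;
#if defined(__ARM_NEON)
    /* Process four 32-bit lanes per iteration in 128-bit registers. */
    for (; i + 4 <= n; i += 4) {
        float32x4_t va = vld1q_f32(a + i);
        float32x4_t vb = vld1q_f32(b + i);
        vst1q_f32(dst + i, vaddq_f32(va, vb));
    }
#endif
    /* Scalar loop: handles the tail, or everything when NEON is absent. */
    for (; i < n; i++)
        dst[i] = a[i] + b[i];
}
```

Register allocation and instruction scheduling are left to the compiler, which is the main difference from writing the equivalent loop in assembly language.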


Chapter 8
Porting to A64

This chapter is not intended to be an exhaustive guide to writing portable code for all
systems. It covers the main areas that application engineers should know about when
porting code to ARM-specific machines. There are some significant differences that you should
be aware of when moving code to the A64 instruction set in AArch64 from the A32 and T32
instruction sets:
•

Most instructions in the A32 instruction set can be executed conditionally. That is, it is
possible to append a condition code to the instruction and have the instruction execute (or
not) based on the outcome of a previous flag setting instruction. Although this enables
programming tricks to reduce code size and cycle count, this significantly complicates the
design of high performance processors with out-of-order execution.
The necessary bits reserved in the opcode field to denote the predication could usefully be
put to other purposes (for example, providing the space for selecting from a larger pool of
general-purpose registers). In A64 code therefore, only a small set of instructions can be
executed conditionally, while some comparison and selection operations depend upon a
condition. See Conditional instructions on page 6-8.


•

Many A64 instructions can apply an arbitrary constant shift to the source register or
registers limited only by the size of the operand. In addition, A64 provides
extended-register forms which can be very useful. Explicit instructions are required to
handle more complicated cases such as variable shifts. T32 is also more restrictive than
A32, so in some ways A64 is a continuation of the same principles. The flexible Operand2
of A32 does not exist as such in A64, but individual instruction classes have their own
options.

•

There are some changes to the available addressing modes for load and store instructions.
The offset, pre-index and post-index forms from A32 and T32 are still available in A64.
There is a new, PC-relative addressing mode, as the PC cannot be accessed in the same
way as a general-purpose register. A64 loads can shift the register inline (though not with
as much flexibility as in A32), and they can use some of the extend modes too (so you can
have a 32-bit array index, for example).
•

A64 removes all multiple memory access instructions (Load or Store Multiple) from
previous ARM architectures, which were able to read or write an arbitrary list of registers
from memory. Load Pair (LDP) and Store Pair (STP) instructions, which can operate on any
two registers, should be used instead. PUSH and POP have also been removed.

•

ARMv8 adds load and store instructions that include a unidirectional memory barrier:
load-acquire and store-release. These are available in ARMv8 A32 and T32 as well as
A64. A load-acquire instruction requires that any subsequent memory accesses (in
program order) are only visible after the load-acquire. A store-release ensures that all
earlier memory accesses are visible before the store-release becomes visible. See
Memory barrier and fence instructions on page 6-18.

•

AArch64 does not support the concept of coprocessors, including CP15. New system
instructions allow access to the registers that are accessed via CP15 coprocessor instructions
in AArch32.

•

The CPSR does not exist in AArch64 as a single register. Instead, PSTATE fields (such as
NZCV) can be accessed using special-purpose registers.
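
The load-acquire and store-release ordering described in the list above is normally reached from C through the C11 atomics in stdatomic.h rather than through assembly. The following is a minimal sketch with hypothetical names (payload, ready, publish and try_consume). On AArch64, compilers typically implement the release store and the acquire load with store-release and load-acquire instructions, but the exact instructions chosen are up to the compiler.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

static int payload;           /* ordinary data, written before the flag */
static atomic_bool ready;     /* static storage, so initially false */

/* Producer: the release store makes the earlier write to payload visible
   to any thread that later observes ready == true with an acquire load. */
void publish(int value)
{
    payload = value;
    atomic_store_explicit(&ready, true, memory_order_release);
}

/* Consumer: the read of payload cannot be reordered before the acquire
   load of the flag. Returns false if nothing has been published yet. */
bool try_consume(int *out)
{
    assert(out);
    if (!atomic_load_explicit(&ready, memory_order_acquire))
        return false;
    *out = payload;
    return true;
}
```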

For many applications, porting code from older versions of the ARM Architecture, or other
processor architectures, to A64 means simply recompiling the source code. However, there are
a number of areas where C code is not fully portable.
The similarity between A64 and A32/T32 is illustrated in the following example. The three
sequences below show a simple C function and the output code, first in T32 and then in A64.
The correspondence between the two is very easy to see.
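//C code
int bar(int val);           // defined elsewhere

int foo(int val)
{
    int newval = bar(val);
    return val + newval;
}

//T32
foo:
    sub sp, sp, #8          // make room for two 32-bit registers
    strd r4, r14, [sp]      // save r4 and the link register
    mov r4, r0              // keep val across the call
    bl bar
    add r0, r0, r4          // newval + val
    ldrd r4, r14, [sp]
    add sp, sp, #8
    bx lr

//A64
foo:
    sub sp, sp, #16         // make room for two 64-bit registers
    stp x19, x30, [sp]      // save x19 and the link register
    mov w19, w0             // keep val across the call
    bl bar
    add w0, w0, w19         // newval + val
    ldp x19, x30, [sp]
    add sp, sp, #16
    ret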

The general-purpose functionality provided by A64 has evolved from that found in A32 and
T32, so porting code between the two is fairly straightforward. Translating A32 assembly code
to A64 is also generally straightforward. Most instructions map easily between these instruction
sets and many sequences become simpler in A64.


8.1

Alignment
Data and code must be aligned to appropriate boundaries. The alignment of accesses can affect
performance on ARM cores and can represent a portability problem when moving code from an
earlier architecture to ARMv8-A. It is worth being aware of alignment issues for performance
reasons, or when porting code that makes assumptions about pointers or 32-bit and 64-bit
integer variables.
Previous versions of the ARM Compiler assembler syntax provide the ALIGN n directive, where n
specifies the alignment boundary in bytes. For example, the directive ALIGN 128 aligns addresses
to 128-byte boundaries.
The GNU assembler syntax (ARM Compiler 6 syntax) provides the .balign n directive, which
uses the same format as ALIGN.
Note
GNU syntax assembly also provides the .align n directive. However, the format of n varies from
system to system. The .balign directive provides the same alignment functionality as .align
with a consistent behavior across all architectures.
You should convert all instances of ALIGN n to .balign n when moving from the older compilers
to ARM Compiler 6.


8.2

Data types
In many programming environments for C and C-derived languages on 64-bit machines, int
variables are still 32 bits wide, but long integers and pointers are 64 bits wide. These are
described as having an LP64 data model. This chapter assumes LP64, though other data models
are available, see Table 5-1 on page 5-7.
The ARM ABI defines a number of basic data types for LP64. Some of these can vary between
architectures, and are included in the following:
Table 8-1 Basic data types

Type                  A32                  A64                  Description
int                   32-bit               32-bit               integer
long                  32-bit               64-bit               integer
short                 16-bit               16-bit               integer
char                  8-bit                8-bit                byte
long long             64-bit               64-bit               integer
float                 32-bit               32-bit               single-precision IEEE floating-point
double                64-bit               64-bit               double-precision IEEE floating-point
bool                  8-bit                8-bit                Boolean
wchar_t (a)           16-bit unsigned      16-bit unsigned      short (compiler dependent)
                      32-bit unsigned      32-bit unsigned      int (compiler dependent)
void* pointer         32-bit               64-bit               addresses to data or code
enumerated types      32-bit               32-bit (b)           signed or unsigned integer
bit fields            not larger than their natural container size

ABI defined extension types
__int128/__uint128    128-bit              128-bit              signed/unsigned quadword
__f16                 16-bit               16-bit               half precision

a. Environment-dependent. In GNU-based systems (such as Linux) this type is always 32-bit.
b. If the set of values in an enumerated type cannot be represented using either int or unsigned int as a
container type, and the language permits extended enumeration sets, then a long long or unsigned long
long container may be used.

When comparing AArch64 with previous versions of the ARM architecture, 64-bit data types
can typically be handled more efficiently, because of 64-bit general-purpose registers and
operations. An int is still 32-bit, which can be handled efficiently through the available 32-bit
view of the general-purpose registers (W registers). Pointers, however, are 64-bit addresses to
data or code. The ARM ABI defines char to be unsigned by default. This is also true for previous
versions of the architecture.
Porting is simplified if your code does not manipulate pointers in non-portable ways, such as
cases of casting to or from non-pointer types or performing pointer arithmetic. This means you
have never stored a pointer in an int variable (with the possible exception of intptr_t and
uintptr_t) and have never cast a pointer to an int. For more information on this, see Issues when
porting code from a 32-bit to 64-bit environment on page 8-8.


Among other effects, the change in the size of long and pointer types changes the size, and
possibly the alignment, of structures and parameter lists. Use the int32_t and int64_t types
from stdint.h in cases where storage size matters. Note that size_t and ssize_t are both
64-bit in AAPCS64-LP64.
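
The effect on structure layout can be seen in a short C sketch. The structure and function names below are hypothetical. The first structure uses long and a pointer, so its size follows the data model (12 bytes under a typical ILP32 ABI, 24 bytes under LP64 once padding is included). The second uses only fixed-width types from stdint.h, so its size is the same in both environments.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Size depends on the data model, because long and void* change width. */
struct record_native {
    long  id;
    void *next;
    int   flags;
};

/* Size is fixed: 8 + 8 + 4 + 4 bytes, with the padding made explicit. */
struct record_fixed {
    int64_t  id;
    uint64_t next_offset;   /* an offset instead of a raw pointer */
    int32_t  flags;
    int32_t  reserved;
};

/* Checked at compile time on every target. */
static_assert(sizeof(struct record_fixed) == 24,
              "record_fixed must be 24 bytes in every data model");

/* Reports the size of the model-dependent layout on the current target. */
size_t native_record_size(void)
{
    return sizeof(struct record_native);
}
```

Data written to files or sent over a network is a typical case where the fixed-width form is the safer choice.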
For performance reasons, the compiler tries to align data on natural size boundaries. Most
compilers try to optimize the layout of global data within a compilation module.
AArch64 provides support for 16, 32, 64 and 128-bit data unaligned accesses, where the address
used is not a multiple of the quantity to be loaded or stored. However, exclusive load or store
and load-acquire or store-release instructions can only access aligned addresses. This means that
variables used to construct semaphores and other locking mechanisms must typically be
aligned.
Note
Under normal circumstances all variables should be aligned. Unaligned accesses are still less
efficient on average than aligned accesses in most cases.
Unaligned accesses are never guaranteed to be atomic with respect to other CPUs or bus masters
in the system.
The only major exception to this rule is access to packed data structures -- this can save
significant effort when marshaling data to/from the outside world, via files or network
connection etc.
Unaligned accesses might have a performance impact when compared with aligned accesses.
Data aligned on a natural size boundary is accessed more efficiently and unaligned accesses
might cost additional bus or cache cycles. The packed attribute
(__attribute__((packed, aligned(1)))) should be used to warn the compiler of potential
unaligned accesses, for example when manually casting pointers pointing to different data types.
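
A short sketch of both techniques follows, assuming a GCC-compatible compiler that accepts the attribute syntax quoted above. The wire_header structure and the load_u32_unaligned helper are hypothetical names. The packed structure tells the compiler that its members might not be naturally aligned, and the memcpy-based helper reads a 32-bit value from an arbitrary address without casting the pointer to a wider type.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Packed layout: no padding, so length starts at offset 1. */
struct __attribute__((packed, aligned(1))) wire_header {
    uint8_t  type;
    uint32_t length;     /* not naturally aligned */
    uint16_t checksum;
};

/* Read a 32-bit value from any address. memcpy makes no alignment
   assumption; compilers commonly reduce it to a single load where the
   target permits unaligned access. */
uint32_t load_u32_unaligned(const void *p)
{
    uint32_t v;
    assert(p);
    memcpy(&v, p, sizeof v);
    return v;
}
```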
8.2.1

Assembly code
Many A32 assembly instructions can be easily replaced with similar A64 instructions.
Unfortunately there is no automated mechanism. However, much can be fairly simply
translated. The following table shows the close match in many areas between the A32/T32 and
A64 instruction sets.
Table 8-2 Instructions that are similar for A32 and A64

A32                         A64
ADD Rd,Rn,#7                ADD Wd,Wn,#7
ADDS Rd,Rn,Rm,LSL #2        ADDS Wd,Wn,Wm,LSL #2
B label                     B label
BFI Rd,Rn,#lsb,#wid         BFI Wd,Wn,#lsb,#wid
BL label                    BL label
CBZ Rn,label                CBZ Wn,label
CLZ Rd,Rm                   CLZ Wd,Wm
LDR Rt,[Rn,#imm]            LDR Wt,[Xn,#imm]
LDR Rt,[Rn,#imm]!           LDR Wt,[Xn,#imm]!
MOV Rd,#imm                 MOV Wd,#imm
MUL Rd,Rn,Rm                MUL Wd,Wn,Wm
RBIT Rd,Rm                  RBIT Wd,Wm

However, there are differences in many areas that require rewrites. The following tables show
some of these.
Table 8-3 Instructions that differ between A32 and A64

A32                         A64
LDM/STM and PUSH/POP instructions are replaced with LDP/STP (Load/Store Pair)
PUSH {r0-r1}                STP X0, X1, [SP, #-16]!
POP {r0-r1}                 LDP X0, X1, [SP], #16
LDMIA r0, {r1, r2}          LDP X1, X2, [X0], #8
STMIA r0, {r1, r2}          STP X1, X2, [X0], #8
MLA                         MADD
MOV pc, lr                  RET
BX lr                       RET
MOVW                        MOVZ
MOVT                        MOVK

Note
The 64-bit procedure call standard (AAPCS64) requires 128-bit (16-byte) stack alignment.
Table 8-4 shows how the CPSR is replaced by named fields within PSTATE.
Table 8-4 Use of named fields

CPSR is replaced with a set of separate registers and fields

Operation         A32                       A64
Disable IRQ       MRS R0, CPSR              MSR DAIFSET, #IRQ_bit
                  ORR R0, R0, #IRQ_Bit
                  MSR CPSR_c, R0
                  CPSID i
ALU Flags         MRS R0, CPSR              MRS X0, NZCV
                  MSR CPSR_f, R0            MSR NZCV, X0
Set Endianness    SETEND BE                 SCTLR_ELn.EE controls ELn data endianness
                                            SCTLR_EL1.E0E controls EL0 data endianness
                                            MRS X0, SCTLR_EL1
                                            ORR X0, X0, #EE_bit
                                            MSR SCTLR_EL1, X0
                                            See Endianness on page 4-12.


The T32 conditional execution scheme compiles to the sequence as shown in the A32 column
of Table 8-4 on page 8-6. In A64, it makes use of the new conditional select instructions as
shown in the A64 column.
The difference between conditional execution in the two instruction sets (T32 and A64) is
illustrated by the following example:
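//C code
int gcd (int a, int b)
{
    while (a != b)
    {
        if (a > b)
        {
            a = a - b;
        }
        else
        {
            b = b - a;
        }
    }
    return a;
}

//A32
gcd:
    CMP   R0, R1            // compare a and b
    ITE   GT                // next two instructions are conditional
    SUBGT R0, R0, R1        // a > b:  a = a - b
    SUBLE R1, R1, R0        // a <= b: b = b - a
    BNE   gcd               // repeat while a != b
    BX    LR

//A64
gcd:
    SUBS  W2, W0, W1        // W2 = a - b, and set the flags
    CSEL  W0, W2, W0, gt    // a > b:  a = a - b
    CSNEG W1, W1, W2, gt    // a <= b: b = -(a - b) = b - a
    B.NE  gcd               // repeat while a != b
    RET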
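
To make the data flow of the A64 sequence easier to follow, the sketch below restates it in C, one statement per instruction. The function name gcd_csel is hypothetical, and the model assumes positive inputs. Both selections are made from the flags set by the single subtraction, and the loop test uses the same flags, so nothing in the loop body is skipped by a branch.

```c
#include <assert.h>

/* C restatement of the A64 loop; one line per instruction. */
int gcd_csel(int a, int b)
{
    assert(a > 0 && b > 0);
    int ne;
    do {
        int diff = a - b;        /* SUBS  W2, W0, W1 */
        int gt = (a > b);        /* gt condition from the SUBS flags */
        ne = (a != b);           /* ne condition from the same flags */
        a = gt ? diff : a;       /* CSEL  W0, W2, W0, gt */
        b = gt ? b : -diff;      /* CSNEG W1, W1, W2, gt */
    } while (ne);                /* B.NE  gcd */
    return a;                    /* RET */
}
```

On the final pass, where a equals b, the difference is zero, so the conditional negate writes zero to b. This is harmless because only a is returned.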


8.3

Issues when porting code from a 32-bit to 64-bit environment
There are some common problems that can arise when migrating C code to run in a 64-bit
environment. These are not specific to ARM.

•

Take care with pointers and integers, as they might not be of the same size. ARM
recommends using uintptr_t or intptr_t from stdint.h for handling pointer types as
integral values. Offsets used in pointer arithmetic should be declared as ptrdiff_t, as
using an int could produce an incorrect result.

•

A 64-bit system has a much larger potential memory reach and it is possible that a 32-bit
int might not be large enough to index all entries in an array.

•

Implicit type conversions in C expressions can have some unexpected effects. For example,
when a constant is used as a bit mask, take care to ensure that the constant has the same
type as the value being masked.

•

Take care when performing operations with data types of differing length or sign. For
example, when unsigned and signed 32-bit integers are mixed in an expression and the
result assigned to a signed long, it might be necessary to explicitly cast one of the
operands to its 64-bit type. This causes all of the other operands to be promoted to 64 bits,
too. Note that longs are typically 64-bit types on A64 (LP64).

8.3.1

Recompile or rewrite code
Any port inevitably requires an element of both re-compiling as well as rewriting code. The
objective in most cases is to maximize the former and minimize the latter.
The good news is that much code simply recompiles. However, exercise due caution as the size
of many fundamental types will have changed. Although well-written C code should not have
many dependencies on the size of individual types, it is likely that you will come across some.
So, best practice must be to enable all warnings and errors when recompiling and make sure you
take notice of any warnings issued by the compiler, even if the code appears to compile
error-free.
Pay very close attention to any explicit type casts in your code as these are often the source of
errors when the sizes of the underlying types change.
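As a minimal sketch of the casts that need attention, the helpers below carry a pointer as an integer using uintptr_t and measure a pointer difference using ptrdiff_t. The function names are hypothetical and are used only for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: hold a pointer value in an integer.
 * uintptr_t is wide enough for a pointer on both 32-bit and 64-bit
 * targets, whereas an int or a 32-bit type might not be. */
static uintptr_t pointer_to_integer(const void *p)
{
    return (uintptr_t)p;
}

/* Hypothetical helper: distance, in elements, between two pointers into
 * the same array. ptrdiff_t is the type of a pointer difference. */
static ptrdiff_t element_distance(const int *first, const int *last)
{
    return last - first;
}
```

Converting the integer back with (void *) yields the original pointer, which is not guaranteed when the intermediate type is narrower than a pointer.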

8.3.2

ARM Compiler 6 options for ARMv8-A
It is important to supply the correct options to the compiler to allow code generation for an
ARMv8-A target. The following options are available. Use:
--target

to generate code for the specified target.
The --target option is mandatory and has no default. You must always specify a target
architecture.
Syntax
--target=triple

where:
triple has the form architecture-vendor-OS-abi.

Supported targets are as follows:

aarch64-arm-none-eabi

The AArch64 state of the ARMv8-A architecture.
armv8a-arm-none-eabi

The AArch32 state of the ARMv8-A architecture.
armv7a-arm-none-eabi

The ARMv7-A architecture.
For example:
--target=armv8a-arm-none-eabi

Note
The --target option is an armclang option. For all of the other tools, such as armasm and
armlink, use the --cpu and --fpu options to specify target processors and architectures.
Use the -mcpu option to enable code generation for a specific ARM processor:
-mcpu=<name>+(no)crc      enable or disable crc instructions
-mcpu=<name>+(no)crypto   enable or disable cryptographic extension
-mcpu=<name>+(no)fp       enable or disable the floating point extension
-mcpu=<name>+(no)simd     enable or disable the NEON extension

where <name> is either cortex-a53 or cortex-a57.
Compiling code for AArch32 produces very similar code to compiling for ARMv7-A. Although
AArch32 has some new instructions (such as Load-Acquire and Store-Release), and the SWP
instruction has been removed, these are not instructions generally generated by a compiler.
Compiling with the +nosimd option avoids any use of NEON/floating-point instructions or
registers. This might be useful for systems in which the NEON unit is not powered up or for
particular code segments, for example reset code and exception handlers, in which it is
important to ensure that NEON/floating-point is not used. The default is for no cryptographic
extension, but with NEON.

8.4

Recommendations for new C code
•

Use sizeof() instead of a constant, for example:
(void**) calloc(4,100)

becomes
(void**) calloc(sizeof(void *), 100)

or better still
void *a;
(void**) calloc(sizeof(a), 100);

•

Where an explicit type is needed, use the types from stdint.h.

•

If you need to cast a pointer to an integer, use a type that is guaranteed to be able to hold
it, such as uintptr_t.
atype *bob; bob++ ; is however still preferred if you are not concerned with the actual
pointer's representation. Pointer arithmetic behaves appropriately for the underlying type.

•

Where data size and layout are important, take care when ordering structure members. For
example, the code:
struct { void *a; int b; int c; } bob;

is preferred over:
struct { int b; void *a; int c; } bob;

as in AAPCS64 the element a has 32 bits of padding inserted before it to keep it 64-bit
aligned.
•

Use size_t appropriately.

•

Use limits.h where appropriate; be careful when making assumptions about data types.

•

Use the appropriate functions/macros/built-ins for the type you are using.
For example, consider using long atol(char *) instead of int atoi(char *).

•

When using atomic operations, use the correct 64-bit functions to carry them out against
64-bit types.

•

Don't assume operations to different bitfields in the same structure are handled
independently - more bits can be read and written on a 64-bit platform than on a 32-bit
platform.

•

Postfix literals with L for long when they are 32-bit on 32-bit compiles and 64 bit on 64-bit
compiles. This makes sure that they match the long type:
long value = 1L << SOMANY;

For literals that are 64-bit on 32 and 64-bit compilers, postfix with LL or ULL.
•

Alternatively, you could use the macros provided by stdint.h in C99, (for example,
INT64_C and UINT64_C) which allow the definition of a literal without explicitly postfixing
using L and LL.
For example:
size_t value = UINT64_C(1) << SOMANY;
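The structure-ordering recommendation in the list above can be checked with sizeof and offsetof. This is a minimal sketch; the structure tags and the helper name are hypothetical, and the sizes in the comments assume an LP64 target with 64-bit pointers and 32-bit int.

```c
#include <stddef.h>

/* Pointer first: on LP64 the members pack into 16 bytes with no padding. */
struct pointer_first {
    void *a;
    int b;
    int c;
};

/* Pointer in the middle: on LP64, 4 bytes of padding precede a so that it
 * is 64-bit aligned, and 4 more bytes follow c, giving 24 bytes. */
struct pointer_middle {
    int b;
    void *a;
    int c;
};

/* Number of padding bytes inserted before member a of pointer_middle. */
static size_t padding_before_a(void)
{
    return offsetof(struct pointer_middle, a) - sizeof(int);
}
```

Comparing sizeof(struct pointer_first) with sizeof(struct pointer_middle) on the target shows the cost of the less favorable ordering.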

8.4.1

Explicit and implicit type conversions
The internal promotion and type conversion in C/C++ can cause some unexpected problems
when data types of different length and/or sign are mixed in expressions. In particular, it is
sometimes important to understand at what point conversions are made in the evaluation of an
expression.
For example:
int + long => long;
unsigned int + signed int => unsigned int
int64_t + uint32_t => int64_t

If the loss of sign conversion is carried out before the promotion to long then the result might be
incorrect when assigned to a signed long.
In cases where unsigned and signed 32-bit integers are mixed in an expression and the result
assigned to a signed 64-bit integer, cast one of the operands to its 64-bit type. This causes the
other operands to be promoted to 64 bits and no further conversion is required when the
expression is assigned. Another solution is to cast the entire expression so that sign extension
occurs on assignment. However, there is no one-size-fits-all solution for these problems. In
practice, the best way to fix them is to understand what the code is trying to do.
Consider this example, in which you would expect the result -1 for a:
long a;
int b;
unsigned int c;
b = -2;
c = 1;
a = b + c;

This gives a result of a = -1 (represented as 0xFFFFFFFF) for 32-bit longs, and a =
0x00000000FFFFFFFF (or 4 294 967 295 in decimal) for 64-bit longs. Clearly an unexpected and
very wrong result! This is because b is converted to unsigned int before the addition (to match
c), so the result of the addition is an unsigned int.
One possible solution is to cast to the longer type before the addition.
long a;
int b;
unsigned int c;
b = -2;
c = 1;
a = (long)b + c;

This gives a result of -1 (or 0xFFFFFFFFFFFFFFFF) in two’s complement representation, and is the
expected result. The calculation is carried out in 64-bit arithmetic and the conversion to signed
now gives the correct result.
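The two cases above can be reproduced on any host by using int64_t in place of long, so that the destination is 64 bits wide regardless of the data model. This is a sketch that assumes a 32-bit int; the function names are hypothetical.

```c
#include <stdint.h>

/* Mirrors "a = b + c": b is converted to unsigned int before the addition,
 * so the 32-bit unsigned sum is zero-extended when it is widened. */
static int64_t add_without_cast(int b, unsigned int c)
{
    return b + c;
}

/* Mirrors "a = (long)b + c": b is widened first, c is then converted to
 * the 64-bit signed type, and the addition is done in 64-bit arithmetic. */
static int64_t add_with_cast(int b, unsigned int c)
{
    return (int64_t)b + c;
}
```

With b = -2 and c = 1, the first function returns 4294967295 and the second returns -1.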
8.4.2

Bit manipulation operations
Take care to ensure that bitmasks are of the correct width. There is the possibility that implicit
type conversions in C expressions can have some unexpected effects. Consider the following
function for setting a specified bit in a 64-bit variable:
long SetBitN(long value, unsigned bitNum)
{
long mask;
mask = 1 << bitNum;
return value | mask;
}

This function works fine in a 32-bit environment and allows bits [31:0] to be set. To port it to a
64-bit system, you might think it sufficient to change the type of mask to allow bits [63:0] to be
set, as follows:
long long SetBitN(long long value, unsigned bitNum)
{
long long mask;
mask = 1 << bitNum;
return value | mask;
}

Again, this does not work correctly as the numeric literal 1 has int type. The exact behavior
depends on the configuration and assumptions of the individual compiler.
To make the code function correctly, you need to give the constant the same type as the mask:
long long SetBitN(long long value, unsigned bitNum)
{
long long mask;
mask = 1LL << bitNum;
return value | mask;
}

If you need an integer that is a particular size, use types such as uint32_t and the UINT32_C
family of macros, which are defined in stdint.h.
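A fixed-width variant of the same function might look like the following sketch, which uses uint64_t and UINT64_C so that the mask width does not depend on the size of long or long long. The function name is hypothetical.

```c
#include <stdint.h>

/* Set bit bit_num (0 to 63) in a 64-bit value. UINT64_C(1) gives the
 * constant a 64-bit unsigned type, so the shift is done at full width. */
static uint64_t set_bit_n(uint64_t value, unsigned bit_num)
{
    uint64_t mask = UINT64_C(1) << bit_num;
    return value | mask;
}
```

Using an unsigned type also avoids shifting a 1 into the sign bit when bit 63 is requested.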
8.4.3

Indexes
When using large arrays or objects in a 64-bit environment, be aware that an int might no longer
be large enough to index all entries. In particular, be careful when iterating over an array using
an int index.
static char array[BIG_NUMBER];
for (unsigned int index = 0; index != BIG_NUMBER; index++) ...

Since size_t is a 64-bit type and unsigned int is a 32-bit type, it is possible to define the size
of the object so that the loop never terminates.
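One way to avoid the problem is to give the index the same type as the size of the object. The sketch below uses size_t for both; the function name is hypothetical.

```c
#include <stddef.h>

/* Count the non-zero entries of an array. The index has type size_t, the
 * same type as the element count, so it can reach any valid count and the
 * loop always terminates. */
static size_t count_nonzero(const char *array, size_t count)
{
    size_t found = 0;
    for (size_t index = 0; index != count; index++) {
        if (array[index] != 0) {
            found++;
        }
    }
    return found;
}
```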

Chapter 9
The ABI for ARM 64-bit Architecture

The Application Binary Interface (ABI) for the ARM Architecture specifies fundamental rules
to which all executable native code modules must adhere so that they can work correctly
together. These fundamental rules are supplemented by additional rules for specific
programming languages (for example, C++). Individual operating systems or execution
environments (for example, Linux) may specify additional rules to meet their own specific
requirements, beyond those rules specified by the ARM ABI.
There are a number of components to the ABI for the AArch64 architecture:
Executable and Linkable Format (ELF)
ELF for the ARM 64-bit Architecture (AArch64) specifies the object and
executable format.
Procedure Call Standard (PCS)
Procedure Call Standard for the ARM 64-bit Architecture (AArch64) ABI release
specifies how subroutines can be separately written, compiled and assembled to
work together. It specifies the contract between a calling routine and a callee, or
between a routine and its execution environment, for example, the obligations
when calling a routine or stack layout.
DWARF

This is a widely used standardized debugging data format. AArch64 DWARF is
based on DWARF 3.0, but with some additional rules. See DWARF for the ARM
64-bit Architecture (AArch64) for details.

C and C++ libraries
ARM Compiler ARM C and C++ Libraries and Floating-Point Support User
Guide describes the ARM C and C++ libraries.

C++ ABI
C++ Application Binary Interface Standard for the ARM 64-bit Architecture
describes the generic C++ ABI.


9.1

Register use in the AArch64 Procedure Call Standard
It can be useful to have knowledge of the standards for register use. Understanding how
parameters are passed can help you to:
•
Write more efficient C code.
•
Understand disassembled code.
•
Write assembly code.
•
Call functions written in a different language.

9.1.1

Parameters in general-purpose registers
For the purposes of function calls, the general-purpose registers are divided into four groups:
Argument registers (X0-X7)
These are used to pass parameters to a function and to return a result. They can
be used as scratch registers or as caller-saved register variables that can hold
intermediate values within a function, between calls to other functions. The fact
that 8 registers are available for passing parameters reduces the need to spill
parameters to the stack when compared with AArch32.
Caller-saved temporary registers (X9-X15)
If the caller requires the values in any of these registers to be preserved across a
call to another function, the caller must save the affected registers in its own stack
frame. They can be modified by the called subroutine without the need to save
and restore them before returning to the caller.
Callee-saved registers (X19-X28)
These registers are saved in the callee frame. They can be modified by the called
subroutine as long as they are saved and restored before returning.
Registers with a special purpose (X8, X16-X18, X29, X30)
•

X8 is the indirect result register. This is used to pass the address location of
an indirect result, for example, where a function returns a large structure.

•

X16 and X17 are IP0 and IP1, intra-procedure-call temporary registers.
These can be used by call veneers and similar code, or as temporary
registers for intermediate values between subroutine calls. They are
corruptible by a function. Veneers are small pieces of code which are
automatically inserted by the linker, for example when the branch target is
out of range of the branch instruction.

•

X18 is the platform register and is reserved for the use of platform ABIs.
This is an additional temporary register on platforms that don't assign a
special meaning to it.

•

X29 is the frame pointer register (FP).

•

X30 is the link register (LR).

Figure 9-1 on page 9-4 shows the 64-bit X registers. For more information on registers, see
Chapter 4. For information on floating-point parameters, see Floating-point parameters on
page 7-7.
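As an illustration of the argument registers, consider a function with more than eight integer parameters. The comments in the sketch below state where AAPCS64 places each argument. The function name is hypothetical, and the code itself is ordinary portable C.

```c
/* Under AAPCS64, with 64-bit long:
 *   a..h are passed in X0..X7,
 *   i and j are passed on the stack,
 *   the result is returned in X0. */
static long sum10(long a, long b, long c, long d, long e,
                  long f, long g, long h, long i, long j)
{
    return a + b + c + d + e + f + g + h + i + j;
}
```

Keeping a function to eight or fewer integer parameters lets all of them travel in registers.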

[Figure 9-1: X0-X7 parameter and result registers; X8 (XR) indirect result location register; X9-X15 caller saved temporary registers; X16 (IP0) and X17 (IP1) intra-procedure call scratch registers; X18 (PR) platform register; X19-X28 callee saved registers; X29 (FP) frame pointer; X30 (LR) procedure link register.]

Figure 9-1 General-purpose register use in the ABI

9.1.2

Indirect result location
To reiterate, the X8 (XR) register is used to pass the indirect result location. Here is some code:
//test.c//
struct struct_A
{
int i0;
int i1;
double d0;
double d1;
} AA;
struct struct_A foo(int i0, int i1, double d0, double d1)
{
struct struct_A A1;
A1.i0 = i0;
A1.i1 = i1;
A1.d0 = d0;
A1.d1 = d1;

return A1;
}
void bar()
{
AA = foo(0, 1, 1.0, 2.0);
}

and that can be compiled using:
armclang --target=aarch64-arm-none-eabi -c test.c
fromelf -c test.o

Note
This code is compiled without optimization to demonstrate the mechanisms and principles
involved. It is possible that with optimization, the compiler might remove all of this.
foo//
SUB SP, SP, #0x30
STR W0, [SP, #0x2C]
STR W1, [SP, #0x28]
STR D0, [SP, #0x20]
STR D1, [SP, #0x18]
LDR W0, [SP, #0x2C]
STR W0, [SP, #0]
LDR W0, [SP, #0x28]
STR W0, [SP, #4]
LDR W0, [SP, #0x20]
STR W0, [SP, #8]
LDR W0, [SP, #0x18]
STR W0, [SP, #0x10]
LDR X9, [SP, #0x0]
STR X9, [X8, #0]
LDR X9, [SP, #8]
STR X9, [X8, #8]
LDR X9, [SP, #0x10]
STR X9, [X8, #0x10]
ADD SP, SP, #0x30
RET
bar//
STP X29, X30, [SP, #-0x10]!
MOV X29, SP
SUB SP, SP, #0x20
ADD X8, SP, #8
MOV W0, WZR
ORR W1, WZR, #1
FMOV D0, #1.00000000
FMOV D1, #2.00000000
BL foo
ADRP X8, {PC}, 0x78
ADD X8, X8, #0
LDR X9, [SP, #8]
STR X9, [X8, #0]
LDR X9, [SP, #0x10]
STR X9, [X8, #8]
LDR X9, [SP, #0x18]
STR X9, [X8, #0x10]
MOV SP, X29
LDP X29, X30, [SP], #0x10
RET

In this example, the structure contains more than 16 bytes. According to the AAPCS for
AArch64, the returned object is written to the memory pointed to by XR.
The generated code shows:

•

W0, W1, D0 and D1 are used to pass the integer and double parameters.

•

bar() makes space on the stack for the return structure value of foo() and puts sp into X8.

•

bar() passes X8, together with the parameters in W0, W1, D0 and D1 into foo() before foo()
takes the address for further operations.

•

foo() might corrupt X8, so bar() accesses the return structure using SP.

The advantage of using X8 (XR) is that it does not reduce the availability of registers for passing
the function parameters.
An AAPCS64 stack frame is shown in Figure 9-2. The frame pointer (X29) should point to the
previous frame pointer saved on stack, with the saved LR (X30) stored after it. The final frame
pointer in the chain should be set to 0. The Stack Pointer must always be aligned on a 16 byte
boundary. There can be some variation of the exact layout of a stack frame, particularly in the
case of variadic or frameless functions. Consult the AAPCS64 document for details.

[Figure 9-2: stack frames of a Caller and a Callee. Labels, from the top: LR'', FP'', FP', stack args area, SP' (Caller); local variables, callee save area, LR', FP', FP, stack args area, SP (Callee).]

Figure 9-2 Stack frame

Note
The AAPCS only specifies the FP, LR block layout and how these blocks are chained together.
Everything else in Figure 9-2 (including the precise location of the boundary between frames of
the two functions) is unspecified, and can be freely chosen by the compiler.
Figure 9-2 illustrates a frame that uses two callee-saved registers (X19 and X20) and one
temporary variable, with the following layout (number on left is offset from the FP in bytes):
40: (padding)
32: temp
24: X20
16: X19
8: LR'
0: FP'

The padding is necessary to maintain the 16 byte alignment of the Stack Pointer.
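The frame size in this layout follows from rounding the space needed up to the next multiple of 16 bytes. The sketch below shows that arithmetic; the function names are hypothetical.

```c
#include <stddef.h>

/* Round a byte count up to the next multiple of 16, the alignment that
 * the Stack Pointer must keep. */
static size_t round_up_to_16(size_t bytes)
{
    return (bytes + 15u) & ~(size_t)15u;
}

/* Frame size for a number of 8-byte slots. Five slots (FP', LR', X19, X20
 * and temp) need 40 bytes, which rounds up to 48. */
static size_t frame_bytes(size_t slots)
{
    return round_up_to_16(slots * 8u);
}
```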
function:
STP X29, X30, [SP, #-48]! // Push down stack pointer and store FP and LR
MOV X29, SP               // Set the frame pointer to the bottom of the new frame
STP X19, X20, [X29, #16]  // Save X19 and X20
:
:
Main body of code
:
:
LDP X19, X20, [X29, #16]  // Restore X19 and X20
LDP X29, X30, [SP], #48   // Restore FP' and LR' before setting the stack
                          // pointer to its original position
RET                       // Return to caller

9.1.3

Parameters in NEON and floating-point registers
The ARM 64-bit architecture also has thirty-two registers, v0-v31, which can be used by NEON
and floating-point operations. The name used to refer to the register changes indicating the size
of the access.
Note
Unlike in AArch32, in AArch64 the 128-bit and 64-bit views of a NEON and floating-point
register do not overlap multiple registers in a narrower view, so q1, d1 and s1 all refer to the
same entry in the register bank.

[Figure 9-3: v0-v7 parameter and result registers; v8-v15 callee must preserve lower 64 bits across subroutine calls; v16-v31 not preserved by callee, caller should preserve these before calls if the registers are in use.]

Figure 9-3 SIMD and floating-point registers in the ABI

•

V0-V7 are used to pass argument values into a subroutine and to return result values from
a function. They may also be used to hold intermediate values within a routine (but, in
general, only between subroutine calls).

•

V8-V15 must be preserved by a callee across subroutine calls.
Only the bottom 64 bits of each value stored in V8-V15 need to be preserved.

•

V16-V31 do not need to be preserved (or should be preserved by the caller).

Chapter 10
AArch64 Exception Handling

Strictly speaking, an interrupt is something that interrupts the flow of software execution.
However, in ARM terminology, that is actually an exception. Exceptions are conditions or
system events that require some action by privileged software (an exception handler) to ensure
smooth functioning of the system. There is an exception handler associated with each exception
type. Once the exception has been handled, privileged software prepares the core to resume
whatever it was doing before taking the exception.
The following types of exception exist:
Interrupts

There are two types of interrupts called IRQ and FIQ.
FIQ is higher priority than IRQ. Both of these kinds of exception are typically
associated with input pins on the core. External hardware asserts an interrupt
request line and the corresponding exception type is raised when the current
instruction finishes executing (although some instructions, those that can load
multiple values, can be interrupted), assuming that the interrupt is not disabled.
Both FIQ and IRQ are physical signals to the core, and when asserted, the core
takes the corresponding exception if it is currently enabled. On almost all
systems, various interrupt sources are connected using an interrupt controller. The
interrupt controller arbitrates and prioritizes interrupts, and in turn, provides a
serialized single signal that is then connected to the FIQ or IRQ signal of the core.
For more information see The Generic Interrupt Controller on page 10-17.
Because the occurrence of IRQ and FIQ interrupts is not directly related to the
software being executed by the core at any given time, they are classified as
asynchronous exceptions.

Aborts

Aborts can be generated either on failed instruction fetches (instruction aborts) or
failed data accesses (Data Aborts). They can come from the external memory
system giving an error response on a memory access (indicating perhaps that the
specified address does not correspond to real memory in the system).
Alternatively, the abort can be generated by the Memory Management Unit
(MMU) of the core. An OS can use MMU aborts to dynamically allocate memory
to applications.
An instruction can be marked within the pipeline as aborted, when it is fetched.
The instruction abort exception is taken only if the core then tries to execute it.
The exception takes place before the instruction executes. If the pipeline is
flushed before the aborted instruction reaches the execute stage of the pipeline,
the abort exception will not occur. A Data Abort exception happens as a result of
a load or store instruction and is considered to happen after the data read or write
has been attempted.
An abort is described as synchronous if it is generated as a result of execution or
attempted execution of the instruction stream, and where the return address
provides details of the instruction that caused it.
An asynchronous abort is not generated by executing instructions, while the
return address might not always provide details of what caused the abort. In
ARMv8-A, the instruction and Data Aborts are synchronous. The asynchronous
exceptions are IRQ/FIQ and System errors (SError). See Synchronous and
asynchronous exceptions on page 10-7.

Reset

Reset is treated as a special vector for the highest implemented Exception level.
This is the location of the instruction that the ARM processor jumps to when an
exception is raised. This vector uses an IMPLEMENTATION DEFINED address.
RVBAR_ELn contains this reset vector address, where n is the number of the
highest implemented Exception level.
All cores have a reset input and take the reset exception immediately after they
have been reset. It is the highest priority exception and cannot be masked. This
exception is used to execute code on the core to initialize it, after power-up.

Exception generating instructions
Execution of certain instructions can generate exceptions. Such instructions are
typically executed to request a service from software that runs at a higher
privilege level:
•

The Supervisor Call (SVC) instruction enables User mode programs to
request an OS service.

•

The Hypervisor Call (HVC) instruction enables the guest OS to request
hypervisor services.

•

The Secure monitor Call (SMC) instruction enables the Normal world to
request Secure world services.

If the resulting exception was generated as a result of an instruction fetch at EL0,
it is taken as an exception to EL1, unless the HCR_EL2.TGE bit is set in the
Non-secure state, in which case it is taken to EL2.
If the exception was generated as a result of an instruction fetch at any other
Exception level, the Exception level remains unchanged.
Earlier in the book, we saw that the ARMv8-A architecture has four Exception levels. Processor
execution can only move between Exception levels by taking, or returning from, an exception.
When the processor moves from a higher to a lower Exception level, the execution state can stay
the same, or it can switch from AArch64 to AArch32. Conversely, when moving from a lower
to a higher Exception level, the execution state can stay the same or switch from AArch32 to
AArch64.

[Figure 10-1: program flow; an exception occurs; the core branches to a higher level handler specified by the vector table; call to a lower level handler for the specified source; return from exception.]

Figure 10-1 Exception flow

Figure 10-1 shows schematically the program flow associated with an exception occurring
when running an application. The processor branches to a vector table which contains entries
for each exception type. The vector table contains dispatch code which typically identifies the
cause of the exception, and selects and calls the appropriate function to handle it. This code
completes execution and then returns to the high-level handler which then executes the ERET
instruction to return to the application.
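The select-and-call step of the dispatch code can be sketched in C as a table of handler functions indexed by cause. Everything in this sketch is hypothetical: the cause enumeration, the handler names and their return values are placeholders, and real dispatch code is normally entered from assembly in the vector table after the registers have been saved.

```c
#include <stddef.h>

/* Hypothetical exception causes identified by the dispatch code. */
enum exception_cause {
    CAUSE_SYNCHRONOUS,
    CAUSE_IRQ,
    CAUSE_FIQ,
    CAUSE_SERROR,
    CAUSE_COUNT
};

typedef int (*handler_fn)(void);

/* Placeholder handlers; each returns a distinct value for illustration. */
static int handle_synchronous(void) { return 1; }
static int handle_irq(void)         { return 2; }
static int handle_fiq(void)         { return 3; }
static int handle_serror(void)      { return 4; }

static const handler_fn handlers[CAUSE_COUNT] = {
    handle_synchronous,
    handle_irq,
    handle_fiq,
    handle_serror
};

/* Select and call the handler for a cause; returns -1 for an unknown cause. */
static int dispatch(enum exception_cause cause)
{
    if ((unsigned)cause >= (unsigned)CAUSE_COUNT) {
        return -1;
    }
    return handlers[cause]();
}
```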

10.1

Exception handling registers
Chapter 4 describes how the current state of the processor is stored within separate PSTATE fields.
If an exception is taken, the PSTATE information is saved in the Saved Program Status Register
(SPSR_ELn) which exists as SPSR_EL3, SPSR_EL2 and SPSR_EL1.

[Figure 10-2: SPSR bit layout, bits 31 to 0, showing the fields N, Z, C, V, SS, IL, D, A, I, F and M[3:0].]

Figure 10-2 When exceptions are taken from AArch64

[Figure 10-3: SPSR bit layout, bits 31 to 0, showing the fields N, Z, C, V, Q, IT[7:2], E, A, I, F, T, M and M[3:0].]

Figure 10-3 When exceptions are taken from AArch32

The SPSR.M field (bit 4) is used to record the execution state (0 indicates AArch64 and 1
indicates AArch32).
Table 10-1 PSTATE fields
PSTATE fields   Description
NZCV            Condition flags
Q               Cumulative saturation bit
DAIF            Exception mask bits
SPSel           SP selection (EL0 or ELn), not applicable to EL0
E               Data endianness (AArch32 only)
IL              Illegal flag
SS              Software stepping bit

The exception mask bits (DAIF) allow the exception events to be masked. The exception is
not taken when the bit is set.
D    Debug exceptions mask.
A    SError interrupt Process state mask, for example, asynchronous External Abort.
I    IRQ interrupt Process state mask.
F    FIQ interrupt Process state mask.

The SPSel field selects whether the current Exception level Stack Pointer or SP_EL0 should be
used. This can be done at any Exception level, except EL0. This is discussed later in the chapter.
The IL field, when set, causes execution of the next instruction to trigger an exception. It is used
in illegal execution returns, for example, trying to return to EL2 as AArch64 when it is
configured for AArch32.
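A handler that inspects a saved SPSR value for an exception taken from AArch64 might decode it as in the following sketch. The bit positions are an assumption taken from the ARMv8-A architecture definition of SPSR_ELn rather than from this chapter (N, Z, C, V in bits [31:28], SS in bit 21, IL in bit 20, D, A, I, F in bits [9:6], M[4] in bit 4 and M[3:0] in bits [3:0]). The structure and function names are hypothetical.

```c
#include <stdint.h>

/* Hypothetical decoded view of an SPSR_ELn value saved from AArch64. */
struct spsr_fields {
    unsigned nzcv;          /* bits [31:28], condition flags */
    unsigned ss;            /* bit 21, software stepping bit */
    unsigned il;            /* bit 20, illegal flag */
    unsigned daif;          /* bits [9:6], exception mask bits */
    unsigned from_aarch32;  /* bit 4, 0 = AArch64, 1 = AArch32 */
    unsigned mode;          /* bits [3:0], M[3:0] */
};

static struct spsr_fields decode_spsr(uint32_t spsr)
{
    struct spsr_fields f;
    f.nzcv = (spsr >> 28) & 0xFu;
    f.ss = (spsr >> 21) & 0x1u;
    f.il = (spsr >> 20) & 0x1u;
    f.daif = (spsr >> 6) & 0xFu;
    f.from_aarch32 = (spsr >> 4) & 0x1u;
    f.mode = spsr & 0xFu;
    return f;
}
```

For example, the value 0x600003C5 decodes to Z and C set, all four DAIF masks set, AArch64 state, and M[3:0] equal to 0x5.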

The Software Stepping (SS) bit is covered in Chapter 18 Debug. It is used by debuggers to
execute a single instruction and then take a debug exception on the following instruction.
Some of these separate fields (CurrentEL, DAIF, NZCV and so on) are copied into a compact
form in SPSR_ELn when taking an exception (and the other way around when returning).
When an event which causes an exception occurs, the processor hardware automatically
performs certain actions. The SPSR_ELn is updated, (where n is the Exception level where the
exception is taken), to store the PSTATE information required to correctly return at the end of the
exception. PSTATE is updated to reflect the new processor status (and this may mean that the
Exception level is raised, or it may stay the same). The return address to be used at the end of
the exception is stored in ELR_ELn.

[Figure 10-4: program flow in EL0; an exception occurs; processor state is saved on entry to the exception handler.]

Figure 10-4 Exception handling

Remember that the _ELn suffix on register names denotes that there are multiple copies of these
registers existing at different Exception levels. For example, SPSR_EL1 is a different physical
register to SPSR_EL2. Additionally, in the case of a synchronous or SError exception,
ESR_ELn is also updated with a value which indicates the cause of the exception.
The processor has to be told when to return from an exception by software. This is done by
executing the ERET instruction. This restores the pre-exception PSTATE from SPSR_ELn and
returns program execution back to the original location by restoring the PC from ELR_ELn.
We have seen how the SPSR records the necessary state information for an exception return. We
will now look at the link register(s) used to store the program address information. The
architecture provides separate link registers for function calls and for exception returns.
As we saw in Chapter 6 The A64 instruction set, register X30 is used (in conjunction with the
RET instruction) to return from subroutines. Its value is updated with the address of the
instruction to return back to, whenever we execute a branch with link instruction (BL or BLR.)
The ELR_ELn register is used to store the return address from an exception. The value in this
register (actually several registers, as we have seen) is automatically written upon entry to an
exception and is written to the PC as one of the effects of executing the ERET instruction used to
return from exceptions.
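The save and restore behavior described above can be modeled in a few lines of C. This is a toy software model only, under the assumption that a single target Exception level (EL1) is involved. The structure and function names are hypothetical, and real hardware performs these steps itself.

```c
#include <stdint.h>

/* Toy model of the state involved in taking an exception to EL1. */
struct core_model {
    uint64_t pc;
    uint32_t pstate;
    uint64_t elr_el1;
    uint32_t spsr_el1;
};

/* Exception entry: PSTATE is saved in SPSR_EL1, the return address in
 * ELR_EL1, then PSTATE and the PC are set for the handler. */
static void take_exception(struct core_model *core, uint64_t vector_address,
                           uint32_t handler_pstate, uint64_t return_address)
{
    core->spsr_el1 = core->pstate;
    core->elr_el1 = return_address;
    core->pstate = handler_pstate;
    core->pc = vector_address;
}

/* ERET: PSTATE is restored from SPSR_EL1 and the PC from ELR_EL1. */
static void exception_return(struct core_model *core)
{
    core->pstate = core->spsr_el1;
    core->pc = core->elr_el1;
}
```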

Note
When returning from an exception, you will see an error if the value in the SPSR conflicts with
the settings in the System Registers.
ELR_ELn contains the return address which is preferred for the specific exception type. For
some exceptions, this is the address of the next instruction after the one which generated the
exception. For example, when an SVC (system call) instruction is executed, we simply wish to
return to the following instruction in the application. In other cases, we may wish to re-execute
the instruction that generated the exception.
For asynchronous exceptions, the ELR_ELn points to the address of the first instruction that has
not been executed, or executed fully, as a result of taking the interrupt. Handler code is permitted
to modify the ELR_ELn if, for example, it was necessary to return to the instruction after
aborting a synchronous exception. The ARMv8-A model is significantly simpler than that used
in ARMv7-A, where for backward compatibility reasons, it was necessary to subtract 4 or 8
from the Link register value when returning from certain types of exception.
In addition to the SPSR and ELR registers, each Exception level has its own dedicated Stack
Pointer register. These are named SP_EL0, SP_EL1, SP_EL2 and SP_EL3. These registers are
used to point to a dedicated stack that can, for example, be used to store registers which are
corrupted by the exception handler, so that they can be restored to their original value before
returning to the original code.
Handler code may switch from using SP_ELn to SP_EL0. For example, it may be that SP_EL1
points to a piece of memory which holds a small stack that the kernel can guarantee to always
be valid. SP_EL0 might point to a kernel task stack which is larger, but not guaranteed to be safe
from overflow. This switching is controlled by writing to the [SPSel] bit, as shown in the
following code:
MSR SPSel, #0 // switch to SP_EL0
MSR SPSel, #1 // switch to SP_ELn

10.2

Synchronous and asynchronous exceptions
In AArch64, exceptions may be either synchronous, or asynchronous. An exception is described
as synchronous if it is generated as a result of execution or attempted execution of the instruction
stream, and where the return address provides details of the instruction that caused it. An
asynchronous exception is not generated by executing instructions, while the return address
might not always provide details of what caused the exception.
Sources of asynchronous exceptions are IRQ (normal priority interrupt), FIQ (fast interrupt) or
SError (System Error). System errors have a number of possible causes, the most common being
asynchronous Data Aborts (for example, an abort triggered by writeback of dirty data from a
cache line to external memory).
There are a number of sources of Synchronous exceptions:
•

Instruction aborts from the MMU. For example, by reading an instruction from a memory
location marked as Execute Never.

•

Data Aborts from the MMU. For example, Permission failure or alignment checking.

•

SP and PC alignment checking.

•

Synchronous external aborts. For example, an abort when reading translation table.

•

Unallocated instructions.

•

Debug exceptions.

10.2.1

Synchronous aborts
Synchronous exceptions can occur for a number of possible reasons:
• Aborts from the MMU. For example, permission failures or accesses to memory areas
  that generate an Access flag fault.
• SP and PC alignment checking.
• Unallocated instructions.
• Service Calls (SVCs, SMCs and HVCs).

Such exceptions may be part of the normal operation of the OS. For example, in Linux, when a
task wishes to request allocation of a new memory page, this is handled through the MMU abort
mechanism.
In the ARMv7-A architecture, the Prefetch Abort, Data Abort and Undefined Instruction
exceptions are separate items. In AArch64, all of these events generate a Synchronous abort.
The exception handler may then read the syndrome and FAR registers to obtain the necessary
information to distinguish between them (described in more detail later).
10.2.2

Handling synchronous exceptions
Registers are provided to supply information to exception handlers about the cause of a
synchronous exception. The Exception Syndrome Register (ESR_ELn) gives information about
the reasons for the exception. The Fault Address Register (FAR_ELn) holds the faulting virtual
address for all synchronous Instruction Aborts, Data Aborts and alignment faults.
The Exception Link Register (ELR_ELn) holds the address of the instruction which caused the
aborting data access (for Data Aborts). The FAR is generally updated after a memory fault, but is
also set in other circumstances, for example, by branching to a misaligned address.


If an exception is taken from an Exception level using AArch32 into an Exception level using
AArch64, and that exception writes the Fault Address Register associated with the target
Exception level, the top 32 bits of the FAR_ELn are all set to zero.
For systems which implement EL2 (Hypervisor) or EL3 (Secure Kernel), Synchronous
exceptions are normally taken in the current or a higher Exception level. Asynchronous
exceptions can (if required) be routed to a higher Exception level to be dealt with by a
Hypervisor or Secure kernel. The SCR_EL3 register specifies which exceptions are to be routed
to EL3 and similarly, HCR_EL2 specifies which exceptions are to be routed to EL2. There are
separate bits which allow individual control over routing of IRQ, FIQ and SError.
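
The routing decision for one interrupt type can be modelled in a few lines of C. This is a minimal sketch under simplifying assumptions: it only covers a physical IRQ that arrives while the core executes at Non-secure EL0 or EL1, and it ignores masking and the HCR_EL2.TGE control. The bit positions used (SCR_EL3.IRQ is bit 1, HCR_EL2.IMO is bit 4) are those of the ARMv8-A register descriptions; the function name irq_target_el is hypothetical.

```c
#include <stdint.h>

#define SCR_EL3_IRQ  (UINT64_C(1) << 1)  /* route physical IRQs to EL3 */
#define HCR_EL2_IMO  (UINT64_C(1) << 4)  /* route physical IRQs to EL2 */

/* Returns the Exception level that takes a physical IRQ arriving while
 * the core runs at Non-secure EL0 or EL1 (simplified model). */
static int irq_target_el(uint64_t scr_el3, uint64_t hcr_el2)
{
    if (scr_el3 & SCR_EL3_IRQ)
        return 3;   /* SCR_EL3 routing takes precedence */
    if (hcr_el2 & HCR_EL2_IMO)
        return 2;   /* otherwise the hypervisor can claim the interrupt */
    return 1;       /* default: handled by the kernel at EL1 */
}
```

The equivalent bits for FIQ (SCR_EL3.FIQ, HCR_EL2.FMO) and SError (SCR_EL3.EA, HCR_EL2.AMO) would follow the same pattern.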
10.2.3

System calls
Some instructions or system functions can only be carried out at a specific Exception level.
Code running at a lower Exception level sometimes needs to perform a privileged operation, for
example, when application code requests functionality from the kernel. One way to do this is by
using the SVC instruction. This allows applications to generate an exception. Parameters may be
passed in registers, or coded within the System call.

10.2.4

System calls to EL2/EL3
We saw earlier how SVC may be used to call from user applications at EL0 to the kernel at EL1.
The HVC and SMC system call instructions move the processor in a similar fashion to EL2 and EL3.
When the processor is executing at EL0 (Application), it cannot call directly into the hypervisor
(EL2) or Secure monitor (EL3). This is only possible from EL1 and above. Applications must
therefore use SVC to call into the kernel and allow the kernel to call into higher Exception levels
on their behalf.
From the OS kernel (EL1), software can call the hypervisor (EL2) with the HVC instruction, or
call the Secure monitor (EL3) with the SMC instruction. If the processor is implemented with
EL3, the ability to have EL2 trap SMC instructions from EL1 is provided. If there is no EL3, the
SMC instruction is unallocated and triggers an exception at the current Exception level.
Similarly, from hypervisor code (EL2), the program can call the Secure monitor (EL3) with the
SMC instruction. If you make an SVC call when in EL2 or EL3 it will still cause a synchronous
exception at the same Exception level, and the handler for that Exception level can decide how
to respond.
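
The call paths described in this section can be summarized as a small lookup. The following C sketch is illustrative only: it assumes that EL2 and EL3 are both implemented and that no SMC trapping is configured, and it returns -1 for combinations that this section does not describe. The names call_insn and call_target_el are hypothetical.

```c
typedef enum { CALL_SVC, CALL_HVC, CALL_SMC } call_insn;

/* Returns the Exception level that handles the call, or -1 where the
 * section above does not describe a call path for that combination. */
static int call_target_el(call_insn insn, int current_el)
{
    switch (insn) {
    case CALL_SVC:
        /* EL0 calls into the kernel; higher levels stay where they are. */
        return current_el == 0 ? 1 : current_el;
    case CALL_HVC:
        /* The kernel calls the hypervisor. */
        return current_el == 1 ? 2 : -1;
    case CALL_SMC:
        /* The kernel or the hypervisor calls the Secure monitor. */
        return (current_el == 1 || current_el == 2) ? 3 : -1;
    }
    return -1;
}
```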
10.2.5

Unallocated instructions
Unallocated instructions cause a Synchronous Abort in AArch64. This exception type is
generated when the processor executes one of the following:
• An instruction opcode that is not allocated.
• An instruction that requires a higher level of privilege than the current Exception level.
• An instruction that has been disabled.
• Any instruction when the PSTATE.IL field is set.

10.2.6

The Exception Syndrome Register
The Exception Syndrome Register, ESR_ELn, contains information which allows the exception
handler to determine the reason for the exception. It is updated only for synchronous exceptions
and SError. It is not updated for IRQ or FIQ as these interrupt handlers typically obtain status
information from registers in the Generic Interrupt Controller (GIC). (See The Generic
Interrupt Controller on page 10-17.) The bit coding for the register is:
• Bits [31:26] of ESR_ELn indicate the exception class which allows the handler to
  distinguish between the various possible exception causes (such as unallocated
  instruction, exceptions from MCR/MRC to CP15, exception from FP operation, SVC,
  HVC or SMC executed, Data Aborts, and alignment exceptions).
• Bit [25] indicates the length of the trapped instruction (0 for a 16-bit instruction or 1 for a
  32-bit instruction) and is also set for certain exception classes.
• Bits [24:0] form the Instruction Specific Syndrome (ISS) field containing information
  specific to that exception type. For example, when a system call instruction (SVC, HVC
  or SMC) is executed, the low 16 bits of the field contain the immediate value associated
  with the opcode, such as 0x1234 for SVC #0x1234.
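
A handler typically starts by splitting the syndrome value into these three fields. The following C sketch shows one way to do that on a value already read from ESR_ELn. The helper names are hypothetical; the exception class encodings listed are a small subset taken from the ARMv8-A architecture, and the 16-bit immediate applies to SVC, HVC and SMC taken from AArch64.

```c
#include <stdint.h>

/* Field extraction for a value read from ESR_ELn. */
static uint32_t esr_ec(uint32_t esr)  { return (esr >> 26) & 0x3Fu; } /* bits [31:26] */
static uint32_t esr_il(uint32_t esr)  { return (esr >> 25) & 0x1u;  } /* bit  [25]    */
static uint32_t esr_iss(uint32_t esr) { return esr & 0x01FFFFFFu;   } /* bits [24:0]  */

/* A few exception class (EC) encodings. */
enum {
    EC_SVC_AARCH64      = 0x15,
    EC_HVC_AARCH64      = 0x16,
    EC_SMC_AARCH64      = 0x17,
    EC_DATA_ABORT_LOWER = 0x24,  /* Data Abort from a lower Exception level */
    EC_DATA_ABORT_SAME  = 0x25   /* Data Abort from the current Exception level */
};

/* For SVC, HVC and SMC from AArch64, ISS[15:0] holds the immediate. */
static uint32_t esr_call_imm16(uint32_t esr) { return esr_iss(esr) & 0xFFFFu; }
```

For example, the value 0x56001234 decodes as exception class 0x15 (SVC from AArch64), a 32-bit instruction, and an immediate of 0x1234.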


10.3

Changes to execution state and Exception level caused by exceptions
When an exception is taken, the processor may change execution state (from AArch32 to
AArch64) or stay in the same execution state. For example, an external source may generate an
IRQ (interrupt) exception while executing an application running in AArch32 mode and then
execute the IRQ handler within the OS Kernel running in AArch64 mode.
The SPSR includes the execution state and Exception level to return to. This is
automatically set by the processor when an exception is taken. However, the execution state for
exceptions in each Exception level is controlled as follows:
• The reset execution state of the highest Exception level (not necessarily EL3) is typically
  determined by a hardware configuration input. This is not fixed, however, because the
  RMR_ELn register can change the execution state (register width) of the highest
  Exception level at run-time (causing a soft reset).
  Remember that EL3 is associated with Secure monitor code. The monitor is a small
  trusted piece of code that always runs in a specific state.
• For EL2 and EL1, the execution state is controlled by the SCR_EL3.RW and
  HCR_EL2.RW bits. The SCR_EL3.RW bit is programmed in EL3 (Secure monitor) and
  sets the state of the next lower level (EL2). The HCR_EL2.RW bit may be programmed
  in EL2 or EL3, and sets the state of EL1/0.
• You never take an exception to EL0 (remembering that EL0 is the lowest privilege level,
  used for application code).

Consider an application running in EL0, which is interrupted by an IRQ as in Figure 10-5. The
Kernel IRQ handler runs at EL1. The processor determines which execution state to set when it
takes the IRQ exception. It does this by looking at the RW bit of the control register for the
Exception level above the one that the exception is being handled in. So, in the example, where
the exception is taken in EL1, it is HCR_EL2.RW which controls the execution state for the
handler.
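
The way the RW bits select the handler's execution state can be sketched in C as follows. This is a simplified model, not a description of any particular implementation: it assumes that EL2 and EL3 are both implemented and that EL3 runs in AArch64. The bit positions (SCR_EL3.RW is bit 10, HCR_EL2.RW is bit 31) are those of the ARMv8-A register descriptions; the function name is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define SCR_EL3_RW  (UINT64_C(1) << 10)  /* 1: the next lower level (EL2) is AArch64 */
#define HCR_EL2_RW  (UINT64_C(1) << 31)  /* 1: EL1 is AArch64 */

/* Returns true if a handler at handler_el (1, 2 or 3) runs in AArch64. */
static bool handler_is_aarch64(int handler_el, uint64_t scr_el3, uint64_t hcr_el2)
{
    bool el2_aarch64 = (scr_el3 & SCR_EL3_RW) != 0;

    if (handler_el >= 3)
        return true;             /* assumption: EL3 is AArch64 */
    if (handler_el == 2)
        return el2_aarch64;      /* controlled from the level above */
    /* EL1 can only be AArch64 if EL2 is also AArch64. */
    return el2_aarch64 && (hcr_el2 & HCR_EL2_RW) != 0;
}
```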

Figure 10-5 Exception to EL1 (application code at EL0; entry to and exit from the Kernel IRQ handler at EL1)

We must now consider what Exception level an exception is taken at. Again, when an exception
is taken, the Exception level may stay the same, or it can get higher. Exceptions are never taken
to EL0, as we have already seen.
Synchronous exceptions are normally taken in the current or a higher Exception level. However,
asynchronous exceptions can be routed to a higher Exception level. For secure code, SCR_EL3
specifies which exceptions are to be routed to EL3. For hypervisor code, HCR_EL2 specifies
exceptions to be routed to EL2.


In both cases, there are separate bits to control routing of IRQ, FIQ and SError. The processor
only takes the exception into the Exception level to which it is routed. The Exception level can
never go down by taking an exception. Interrupts are always masked at the Exception level
where the interrupt is taken.
When taking an exception from AArch32 to AArch64, there are some special considerations.
AArch64 handler code may require access to AArch32 registers and the architecture therefore
defines mappings to allow access to AArch32 registers.
AArch32 registers R0 to R12 are accessed as X0 to X12. The banked versions of the SP and LR
in the various AArch32 modes are accessed through X13 to X23, while the banked R8 to R12
FIQ registers are accessed as X24 to X28, with the FIQ mode SP and LR in X29 and X30.
Bits [63:32] of these registers are not available in AArch32 state and contain either 0 or the
last value written in AArch64. There is no architectural guarantee on which value it is. It is
therefore usual to access registers as W registers.
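
The register mapping can be written down as a lookup table, which is sometimes convenient in an AArch64 handler that inspects AArch32 state. The sketch below lists the AArch32 register visible through each of X0 to X30, following the mapping defined by the ARMv8-A architecture. The table and function names are hypothetical, and the helper aarch32_value simply discards the upper 32 bits in the same way that a W register access does.

```c
#include <stdint.h>
#include <string.h>

/* AArch32 register seen through X0..X30 from AArch64 state. */
static const char *const aarch32_view[31] = {
    "R0", "R1", "R2", "R3", "R4", "R5", "R6", "R7",
    "R8", "R9", "R10", "R11", "R12",
    "SP_usr", "LR_usr", "SP_hyp",
    "LR_irq", "SP_irq", "LR_svc", "SP_svc",
    "LR_abt", "SP_abt", "LR_und", "SP_und",
    "R8_fiq", "R9_fiq", "R10_fiq", "R11_fiq", "R12_fiq",
    "SP_fiq", "LR_fiq"
};

/* Returns the X register index for an AArch32 register name, or -1. */
static int x_index_for(const char *name)
{
    for (int i = 0; i < 31; i++) {
        if (strcmp(aarch32_view[i], name) == 0)
            return i;
    }
    return -1;
}

/* Only bits [31:0] are meaningful for AArch32 state. */
static uint32_t aarch32_value(uint64_t x_reg) { return (uint32_t)x_reg; }
```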


10.4

AArch64 exception table
When an exception occurs, the processor must execute handler code which corresponds to the
exception. The location in memory where the handler is stored is called the exception vector. In
the ARM architecture, exception vectors are stored in a table, called the exception vector table.
Each Exception level has its own vector table, that is, there is one for each of EL3, EL2 and EL1.
The table contains instructions to be executed, rather than a set of addresses. Vectors for
individual exceptions are located at fixed offsets from the beginning of the table. The virtual
address of each table base is set by the Vector Base Address Registers VBAR_EL3,
VBAR_EL2 and VBAR_EL1.
Each entry in the vector table is 32 instructions (128 bytes) long. This in itself represents a
significant change compared to ARMv7, where each entry was 4 bytes. This spacing of the ARMv7
vector table meant that each entry would almost always be some form of branch to the actual
exception handler elsewhere in memory. In AArch64, the vectors are spaced more widely, so that
the top-level handler can be written directly in the vector table.
Table 10-2 shows one of the vector tables. The base address is given by VBAR_ELn and then
each entry has a defined offset from this base address. Each table has 16 entries, with each entry
being 128 bytes (32 instructions) in size. The table effectively consists of 4 sets of 4 entries.
Which entry is used depends upon a number of factors:
• The type of exception (SError, FIQ, IRQ or Synchronous)
• If the exception is being taken at the same Exception level, the Stack Pointer to be used
  (SP0 or SPx)
• If the exception is being taken from a lower Exception level, the execution state of the next
  lower level (AArch64 or AArch32)
Table 10-2 Vector table offsets from vector table base address

Address             Exception type    Description
VBAR_ELn + 0x000    Synchronous       Current EL with SP0
         + 0x080    IRQ/vIRQ
         + 0x100    FIQ/vFIQ
         + 0x180    SError/vSError
         + 0x200    Synchronous       Current EL with SPx
         + 0x280    IRQ/vIRQ
         + 0x300    FIQ/vFIQ
         + 0x380    SError/vSError
         + 0x400    Synchronous       Lower EL using AArch64
         + 0x480    IRQ/vIRQ
         + 0x500    FIQ/vFIQ
         + 0x580    SError/vSError
         + 0x600    Synchronous       Lower EL using AArch32
         + 0x680    IRQ/vIRQ
         + 0x700    FIQ/vFIQ
         + 0x780    SError/vSError

Considering an example might make this easier to understand.
If kernel code is executing at EL1 and an IRQ interrupt is signaled, an IRQ exception occurs.
This particular interrupt is not associated with the hypervisor or secure environment and is
therefore handled within the kernel, also at EL1. The SPSel bit is set, so you are using SP_EL1.
Execution therefore starts from address VBAR_EL1 + 0x280.
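
Because the table is regular (four groups of four entries, 0x80 bytes apart), the address of any vector can be computed rather than looked up. The following C sketch illustrates the arithmetic; the enumeration and function names are hypothetical.

```c
#include <stdint.h>

/* Exception type: selects the entry within a group of four. */
typedef enum { VEC_SYNC = 0, VEC_IRQ = 1, VEC_FIQ = 2, VEC_SERROR = 3 } vec_type;

/* Where the exception came from: selects the group. */
typedef enum {
    FROM_CURRENT_SP0 = 0,  /* current EL, using SP_EL0 */
    FROM_CURRENT_SPX = 1,  /* current EL, using SP_ELn */
    FROM_LOWER_A64   = 2,  /* lower EL using AArch64 */
    FROM_LOWER_A32   = 3   /* lower EL using AArch32 */
} vec_source;

/* Each group is 0x200 bytes; each entry within a group is 0x80 bytes. */
static uint64_t vector_address(uint64_t vbar, vec_source src, vec_type type)
{
    return vbar + (uint64_t)src * 0x200u + (uint64_t)type * 0x80u;
}
```

For an IRQ taken from the current Exception level while using SP_ELn, vector_address(vbar, FROM_CURRENT_SPX, VEC_IRQ) gives vbar + 0x280, matching the example above.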
In the absence of LDR PC, [PC, #offset] in the ARMv8-A architecture, you must use more
instructions to enable the destination to be read from a table of registers. The choice of spacing
of the vectors is designed to avoid cache pollution for typical sized instruction cache lines from
vectors that are not being used. The Reset Address is a completely separate address, which is
IMPLEMENTATION DEFINED, and is typically set by hardwired configuration within the core. This
address is visible in the RVBAR_EL1/2/3 register.
Having a separate exception vector for each exception, either from the current Exception level
or from the lower Exception level, gives the flexibility for the OS or hypervisor to determine the
AArch64 and AArch32 state of the lower Exception levels. The SP_ELn is used for exceptions
generated from lower levels. However, the software can switch to use SP_EL0 inside the
handler. When you use this mechanism, it facilitates access to the values from the thread in the
handler.


10.5

Interrupt handling
ARM commonly uses interrupt to mean interrupt signal. On ARM A-profile and R-profile
processors, that means an external IRQ or FIQ interrupt signal. The architecture does not specify
how these signals are used. FIQ is often reserved for secure interrupt sources. In earlier
architecture versions, FIQ and IRQ were used to denote high and standard interrupt priority, but
this is not the case in ARMv8-A.
When the processor takes an exception to AArch64 execution state, all of the PSTATE interrupt
masks are set automatically. This means that further exceptions are disabled. If software is to
support nested exceptions, for example, to allow a higher priority interrupt to interrupt the
handling of a lower priority source, then software needs to explicitly re-enable interrupts
with the following instruction:
MSR DAIFClr, #imm

This immediate value is in fact a 4-bit field, as in addition to PSTATE.I (for IRQ) and
PSTATE.F (for FIQ) there are also masks for:
• PSTATE.A (for SError)
• PSTATE.D (for Debug)
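
The effect of the 4-bit immediate can be modelled in C. In the immediate, D, A, I and F occupy bits 3, 2, 1 and 0; in PSTATE as saved in SPSR_ELn, the same masks occupy bits 9 to 6. The sketch below operates on a plain integer that stands in for PSTATE, so it only illustrates the bit manipulation; on real hardware the change is made with MSR DAIFSet or MSR DAIFClr. The macro and function names are hypothetical.

```c
#include <stdint.h>

/* Bits of the immediate used with MSR DAIFSet / MSR DAIFClr. */
#define DAIF_IMM_D 0x8u  /* Debug  */
#define DAIF_IMM_A 0x4u  /* SError */
#define DAIF_IMM_I 0x2u  /* IRQ    */
#define DAIF_IMM_F 0x1u  /* FIQ    */

/* In PSTATE (and SPSR_ELn) the D, A, I and F masks are bits [9:6]. */
#define PSTATE_DAIF_SHIFT 6

/* Model of MSR DAIFClr, #imm: unmask the selected exceptions. */
static uint32_t daif_clr(uint32_t pstate, uint32_t imm)
{
    return pstate & ~((imm & 0xFu) << PSTATE_DAIF_SHIFT);
}

/* Model of MSR DAIFSet, #imm: mask the selected exceptions. */
static uint32_t daif_set(uint32_t pstate, uint32_t imm)
{
    return pstate | ((imm & 0xFu) << PSTATE_DAIF_SHIFT);
}
```

A nested interrupt handler that only wants to re-enable IRQs would use an immediate of 2 (DAIF_IMM_I), leaving the other three masks set.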

Figure 10-6 Interrupt handler in C code (the assembly IRQ handler saves the corruptible registers, calls a C subroutine that identifies the interrupt source, clears it and handles the interrupt, then restores the corruptible registers)

An example assembly language IRQ handler might look like this:
IRQ_Handler
    // Stack all corruptible registers
    STP X0, X1, [SP, #-16]!    // SP = SP - 16
    ...
    STP X2, X3, [SP, #-16]!    // SP = SP - 16
    // unlike in ARMv7, there is no STM instruction and so
    // we may need several STP instructions
    BL read_irq_source         // a function to work out why we took an interrupt
                               // and clear the request
    BL C_irq_handler           // the C interrupt handler
    // restore from stack the corruptible registers
    LDP X2, X3, [SP], #16      // SP = SP + 16
    LDP X0, X1, [SP], #16      // SP = SP + 16
    ...
    ERET

However, from a performance point of view, the following sequence might be preferable:
IRQ_Handler
    SUB SP, SP, #              // SP = SP -
    STP X0, X1, [SP]           // Store X0 and X1 at the base of the frame
    STP X2, X3, [SP, #16]      // Store X2 and X3 at the base of the frame + 16 bytes
    ...                        // more register storing
    ...

    // Interrupt handling
    BL read_irq_source         // a function to work out why we took an interrupt
                               // and clear the request
    BL C_irq_handler           // the C interrupt handler
    // restore from stack the corruptible registers
    LDP X0, X1, [SP]           // Load X0 and X1 at the base of the frame
    LDP X2, X3, [SP, #16]      // Load X2 and X3 at the base of the frame + 16 bytes
    ...                        // more register loading
    ADD SP, SP, #              // Restore SP at its original value
    ...
    ERET

Figure 10-7 Handling nested interrupts (the assembly IRQ handler saves the corruptible registers, SPSR_EL1 and ELR_EL1, enables interrupts and calls the C subroutine; on return it disables interrupts and restores ELR_EL1, SPSR_EL1 and the corruptible registers)


The nested handler requires a little extra code. It must preserve on the stack the contents of
SPSR_EL1 and ELR_EL1. We must also re-enable IRQs after determining (and clearing) the
interrupt source. However (unlike in ARMv7-A), as the link register for subroutine calls is
different to the link register for exceptions, we avoid having to do anything special with LR or
modes.


10.6

The Generic Interrupt Controller
ARM provides a standard interrupt controller which can be used for ARMv8-A systems. The
programming interface to this interrupt controller is defined in the GIC Architecture. There are
multiple versions of the GIC Architecture Specification. This document concentrates on
version 2 (GICv2). ARMv8-A processors are typically connected to a GIC, for example the
GIC-400 or GIC-500. The Generic Interrupt Controller (GIC) supports routing of software
generated, private and shared peripheral interrupts between cores in a multi-core system.
The GIC architecture provides registers that can be used to manage interrupt sources and
behavior and (in multi-core systems) for routing interrupts to individual cores. It enables
software to mask, enable and disable interrupts from individual sources, to prioritize (in
hardware) individual sources and to generate software interrupts. The GIC accepts interrupts
asserted at the system level and can signal them to each core it is connected to, potentially
resulting in an IRQ or FIQ exception being taken.
From a software perspective, a GIC has two major functional blocks:
Distributor
To which all interrupt sources in the system are connected. The Distributor has
registers to control the properties of individual interrupts such as priority, state,
security, routing information and enable status. The Distributor determines which
interrupt is to be forwarded to a core, through the attached CPU interface.
CPU Interface
Through which a core receives an interrupt. The CPU interface hosts registers to
mask, identify and control states of interrupts forwarded to that core. There is a
separate CPU interface for each core in the system.
Interrupts are identified in the software by a number, called an interrupt ID. An interrupt ID
uniquely corresponds to an interrupt source. Software can use the interrupt ID to identify the
source of interrupt and to invoke the corresponding handler to service the interrupt. The exact
interrupt ID presented to the software is determined by the system design.
Interrupts can be of a number of different types:
Software Generated Interrupt (SGI)
This is generated explicitly by software by writing to a dedicated Distributor
register, the Software Generated Interrupt Register (GICD_SGIR). It is most
commonly used for inter-core communication. SGIs can be targeted at all, or a
selected group of cores in the system. Interrupt IDs 0-15 are reserved for this. The
interrupt ID used for a given interrupt is set by the software that generated it.
Private Peripheral Interrupt (PPI)
This is a peripheral interrupt that is private to a single core. Interrupt IDs 16-31 are
reserved for this. These identify interrupt sources private to the core, and are
independent of the same source on another core, for example, a per-core timer.
Shared Peripheral Interrupt (SPI)
This is generated by a peripheral that the GIC can route to more than one core.
Interrupt numbers 32-1019 are used for this. SPIs are used to signal interrupts
from various peripherals accessible across the whole system.
Locality-specific Peripheral Interrupt (LPI)
These are message-based interrupts that are routed to a particular core. LPIs are
not supported in GICv2 or GICv1.
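
Software that receives an interrupt ID can classify it by range. The following C sketch follows the GICv2 numbering, in which IDs 1020 to 1023 are reserved for special purposes (1023 being the spurious interrupt ID), so 1019 is treated as the last SPI. The type and function names are hypothetical.

```c
typedef enum {
    INTID_SGI,      /* 0-15:      Software Generated Interrupt */
    INTID_PPI,      /* 16-31:     Private Peripheral Interrupt */
    INTID_SPI,      /* 32-1019:   Shared Peripheral Interrupt  */
    INTID_SPECIAL,  /* 1020-1023: reserved, for example spurious */
    INTID_INVALID   /* outside the GICv2 ID space */
} intid_class;

static intid_class classify_intid(unsigned id)
{
    if (id <= 15)   return INTID_SGI;
    if (id <= 31)   return INTID_PPI;
    if (id <= 1019) return INTID_SPI;
    if (id <= 1023) return INTID_SPECIAL;
    return INTID_INVALID;
}
```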


Interrupts can either be edge-triggered (considered to be asserted when the GIC detects a rising
edge on the relevant input, and to remain asserted until cleared) or level-sensitive (considered
to be asserted only when the relevant input to the GIC is HIGH).
An interrupt can be in a number of different states:
• Inactive – this means that the interrupt is not currently asserted.
• Pending – this means that the interrupt source has been asserted, but is waiting to be
  handled by a core. Pending interrupts are candidates to be forwarded to the CPU interface
  and then later on to the core.
• Active – this means that the interrupt has been acknowledged by a core and is
  currently being serviced.
• Active and pending – this describes the situation where a core is servicing the interrupt
  and the GIC also has a pending interrupt from the same source.

The priority and list of cores to which an interrupt can be delivered are all configured in the
Distributor. An interrupt asserted to the Distributor by a peripheral is in the Pending state (or
Active and Pending if it was already Active). The Distributor determines the highest priority
pending interrupt that can be delivered to a core and forwards that to the CPU interface of the
core. At the CPU interface, the interrupt is in turn signaled to the core, at which point the core
takes the FIQ or IRQ exception.
The core executes the exception handler in response. The handler must query the interrupt ID
from a CPU interface register and begin servicing the interrupt source. When finished, the
handler must write to a CPU interface register to report the end of processing.
For a given interrupt the typical sequence is:
• Inactive -> Pending
  When the interrupt is asserted by the peripheral.
• Pending -> Active
  When the handler acknowledges the interrupt.
• Active -> Inactive
  When the handler has finished dealing with the interrupt.
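
The state transitions can be expressed as a small state machine. The C sketch below is a simplified model for a single interrupt: it treats each assertion as a discrete event and ignores the differences between edge-triggered and level-sensitive inputs. The type and function names are hypothetical.

```c
typedef enum {
    IRQ_INACTIVE,
    IRQ_PENDING,
    IRQ_ACTIVE,
    IRQ_ACTIVE_PENDING
} irq_state;

/* The peripheral asserts the interrupt. */
static irq_state on_assert(irq_state s)
{
    switch (s) {
    case IRQ_INACTIVE: return IRQ_PENDING;
    case IRQ_ACTIVE:   return IRQ_ACTIVE_PENDING;
    default:           return s;   /* already pending */
    }
}

/* The handler acknowledges the interrupt. */
static irq_state on_acknowledge(irq_state s)
{
    return s == IRQ_PENDING ? IRQ_ACTIVE : s;
}

/* The handler signals the end of interrupt processing. */
static irq_state on_end_of_interrupt(irq_state s)
{
    switch (s) {
    case IRQ_ACTIVE:         return IRQ_INACTIVE;
    case IRQ_ACTIVE_PENDING: return IRQ_PENDING;
    default:                 return s;
    }
}
```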

The Distributor provides registers which report the current state of the different interrupt IDs.
In multi-core/multi-processor systems, a single GIC can be shared by multiple cores (up to eight
in GICv2). The GIC provides registers to control which core, or cores, a SPI is targeted at. This
mechanism enables the operating system to share and distribute interrupts across cores and
coordinate activities.
More detailed information on GIC behavior can be found in the TRMs for the individual
processor types and in the ARM Generic Interrupt Controller Architecture specification.
10.6.1

Configuration
The GIC is accessed as a memory-mapped peripheral. All cores can access the common
Distributor, but the CPU interface is banked, that is, each core uses the same address to access
its own private CPU interface. It is not possible for a core to access the CPU interface of another
core.


The Distributor hosts a number of registers that you can use to configure the properties of
individual interrupts. These configurable properties are:
• An interrupt priority (GICD_IPRIORITYR). The Distributor uses this to determine
  which interrupt is next forwarded to the CPU interface.
• An interrupt configuration (GICD_ICFGR). This determines if an interrupt is
  level-sensitive or edge-triggered. Not applicable to SGIs.
• An interrupt target (GICD_ITARGETSR). This determines a list of cores to which an
  interrupt can be forwarded. Only applicable to SPIs.
• Interrupt enable or disable status (GICD_ISENABLER and
  GICD_ICENABLER). Only those interrupts that are enabled in the Distributor are
  eligible to be forwarded when they become pending.
• Interrupt security (GICD_IGROUPR). This determines whether the interrupt is allocated to
  Secure or Normal world software.
• An interrupt state.

The Distributor also provides priority masking by which interrupts below a certain priority are
prevented from reaching the core. The Distributor uses this when determining whether a pending
interrupt can be forwarded to a particular core.
The CPU interface on each core helps with fine-tuning interrupt control and handling on that
core.
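
Each of these per-interrupt properties lives in a bank of Distributor registers indexed by interrupt ID. The following C sketch computes where the field for a given ID is found. The register offsets are those of the GICv2 Distributor register map (relative to the Distributor base address); the structure and function names are hypothetical.

```c
#include <stdint.h>

/* GICv2 Distributor register bank offsets. */
enum {
    GICD_ISENABLER  = 0x100,  /* 1 bit per interrupt  */
    GICD_ICENABLER  = 0x180,  /* 1 bit per interrupt  */
    GICD_IPRIORITYR = 0x400,  /* 8 bits per interrupt */
    GICD_ITARGETSR  = 0x800,  /* 8 bits per interrupt */
    GICD_ICFGR      = 0xC00   /* 2 bits per interrupt */
};

/* Location of a field: 32-bit register offset, bit position and width. */
typedef struct { uint32_t offset; unsigned shift; unsigned width; } gicd_field;

static gicd_field gicd_enable_field(unsigned id)
{
    gicd_field f = { GICD_ISENABLER + 4u * (id / 32u), id % 32u, 1u };
    return f;
}

static gicd_field gicd_priority_field(unsigned id)
{
    gicd_field f = { GICD_IPRIORITYR + 4u * (id / 4u), 8u * (id % 4u), 8u };
    return f;
}

static gicd_field gicd_target_field(unsigned id)
{
    gicd_field f = { GICD_ITARGETSR + 4u * (id / 4u), 8u * (id % 4u), 8u };
    return f;
}

static gicd_field gicd_config_field(unsigned id)
{
    gicd_field f = { GICD_ICFGR + 4u * (id / 16u), 2u * (id % 16u), 2u };
    return f;
}
```

For interrupt ID 34, for example, the enable bit is bit 2 of the register at offset 0x104, and the priority byte is bits [23:16] of the register at offset 0x420.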
10.6.2

Initialization
Both the Distributor and the CPU interfaces are disabled at reset. The GIC must be initialized
after reset before it can deliver interrupts to the core.
In the Distributor, software must configure the priority, target and security of individual
interrupts, and enable them. The Distributor must subsequently be enabled through its control
register (GICD_CTLR). For each CPU interface, software must program the priority mask and
preemption settings.
Each CPU interface block itself must be enabled through its control register (GICC_CTLR).
This prepares the GIC to deliver interrupts to the core.
Before interrupts are expected in the core, software prepares the core to take interrupts by setting
a valid interrupt vector in the vector table, clearing interrupt mask bits in PSTATE, and setting
the routing controls.
The entire interrupt mechanism in the system can be disabled by disabling the Distributor.
Interrupt delivery to an individual core can be disabled by disabling its CPU interface.
Individual interrupts can also be disabled (or enabled) in the distributor.
For an interrupt to reach the core, the individual interrupt, Distributor and CPU interface must
all be enabled. The interrupt also needs to be of sufficient priority, that is, higher than the core's
priority mask.
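
These conditions can be collected into a single predicate. The C sketch below is a simplified model: it assumes the interrupt is already pending and ignores preemption by a currently active interrupt. Note that in the GIC a lower priority value means a higher priority, so an interrupt passes the mask only when its priority value is numerically lower than the value in the priority mask register (GICC_PMR). The structure and function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool    distributor_enabled;    /* enable bit in GICD_CTLR */
    bool    cpu_interface_enabled;  /* enable bit in GICC_CTLR */
    bool    interrupt_enabled;      /* bit set through GICD_ISENABLER */
    uint8_t priority;               /* lower value = higher priority */
    uint8_t priority_mask;          /* value programmed in GICC_PMR */
} gic_irq_view;

/* Returns true if a pending interrupt can be signaled to the core. */
static bool gic_can_signal(const gic_irq_view *v)
{
    return v->distributor_enabled &&
           v->cpu_interface_enabled &&
           v->interrupt_enabled &&
           v->priority < v->priority_mask;
}
```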

10.6.3

Interrupt handling
When the core takes an interrupt, it jumps to the top-level interrupt vector obtained from the
vector table and begins execution.
The top-level interrupt handler reads the Interrupt Acknowledge Register from the CPU
Interface block to obtain the interrupt ID.


As well as returning the interrupt ID, the read causes the interrupt to be marked as active in the
Distributor. Once the interrupt ID is known (identifying the interrupt source), the top-level
handler can now dispatch a device-specific handler to service the interrupt.
When the device-specific handler finishes execution, the top-level handler writes the same
interrupt ID to the End of Interrupt (EoI) register in the CPU Interface block, indicating the end
of interrupt processing.
Apart from removing the active status, which makes the final interrupt status either Inactive, or
Pending (if the state was Active and Pending), this enables the CPU Interface to forward more
pending interrupts to the core. This concludes the processing of a single interrupt.
It is possible for there to be more than one interrupt waiting to be serviced on the same core, but
the CPU Interface can signal only one interrupt at a time. The top-level interrupt handler could
repeat the above sequence until it reads the special interrupt ID value 1023, indicating that there
are no more interrupts pending at this core. This special interrupt ID is called the spurious
interrupt ID.
The spurious interrupt ID is a reserved value, and cannot be assigned to any device in the
system. When the top-level handler has read the spurious interrupt ID it can complete its
execution, and prepare the core to resume the task it was doing before taking the interrupt.
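
The acknowledge, dispatch and end-of-interrupt loop can be sketched in C. To keep the example self-contained, the CPU interface is replaced here by a small software stand-in (sim_cpu_if) that returns a fixed list of pending IDs followed by the spurious ID; on real hardware the two accessor functions would instead read the Interrupt Acknowledge Register (GICC_IAR) and write the End of Interrupt Register (GICC_EOIR). All names in the sketch are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define GIC_SPURIOUS_ID 1023u
#define SIM_MAX 8u

/* Software stand-in for a CPU interface (count must not exceed SIM_MAX). */
typedef struct {
    uint32_t pending[SIM_MAX];  /* IDs that the interface will present */
    unsigned count, next;
    uint32_t eoi_log[SIM_MAX];  /* values written to the EOI register  */
    unsigned eoi_count;
} sim_cpu_if;

/* Reading the acknowledge register returns the next ID, or 1023. */
static uint32_t sim_read_iar(sim_cpu_if *g)
{
    return (g->next < g->count) ? g->pending[g->next++] : GIC_SPURIOUS_ID;
}

static void sim_write_eoir(sim_cpu_if *g, uint32_t value)
{
    if (g->eoi_count < SIM_MAX)
        g->eoi_log[g->eoi_count++] = value;
}

/* Top-level handler: service interrupts until the spurious ID is read. */
static unsigned service_pending(sim_cpu_if *g, void (*handler)(uint32_t id))
{
    unsigned handled = 0;
    for (;;) {
        uint32_t iar = sim_read_iar(g);
        uint32_t id = iar & 0x3FFu;     /* interrupt ID is bits [9:0] */
        if (id == GIC_SPURIOUS_ID)
            break;                      /* nothing left to service */
        if (handler != NULL)
            handler(id);                /* device-specific handler */
        sim_write_eoir(g, iar);         /* write back the value that was read */
        handled++;
    }
    return handled;
}
```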
A Generic Interrupt Controller (GIC) generally manages input from multiple interrupt sources
and distributes them to IRQ or FIQ requests.


Chapter 11
Caches

When the ARM architecture was first developed, the clock speed of the processor and the access
speeds of memory were broadly similar. Processor cores today are much more complicated and
can be clocked orders of magnitude faster. However, the frequency of external buses and of
memory devices has not scaled to the same extent. It is possible to implement small blocks of
on-chip SRAM that can operate at the same speeds as the core, but such RAM is very expensive
in comparison to standard DRAM blocks, which can have thousands of times more capacity. In
many ARM processor-based systems, access to external memory takes tens or even hundreds of
core cycles.
A cache is a small, fast block of memory that sits between the core and main memory. It holds
copies of items in main memory. Accesses to the cache memory occur significantly faster than
those to main memory. Whenever the core reads or writes a particular address, it first looks for
it in the cache. If it finds the address in the cache, it uses the data in the cache, rather than
performing an access to main memory. This significantly increases the potential performance of
the system, by reducing the effect of slow external memory access times. It also reduces the
power consumption of the system, by avoiding the need to drive external signals.


Figure 11-1 A basic cache arrangement (labels: Cluster, Core 0, Core 1, Internal Level 1 Cache, Internal Level 2 Cache, External Level 3 Cache, Bus, Main Memory)

Processors that implement the ARMv8-A Architecture are usually implemented with two or
more levels of cache. This typically means that the processor has small L1 Instruction and Data
caches for each core. The Cortex-A53 and Cortex-A57 processors, for example, normally have
a small L1 Instruction and Data cache for each core and a larger, unified L2 cache, which is
shared between multiple cores in a cluster. Additionally, there can be an L3 cache as an
external hardware block, shared between clusters.
The initial access that provided the data to the cache is no faster than normal. It is any
subsequent accesses to the cached values that are faster, and it is from this that the performance
increase derives. The core hardware checks all instruction fetches and data reads or writes in the
cache, although you must mark some parts of memory, such as those containing peripheral
devices, as non-cacheable. Because the cache holds only a subset of main memory,
you require a way to determine quickly whether the address you are looking for is in the cache.
Occasionally, data and instructions in the cache and data in external memory might not be the
same; this is because the processor can update the cache contents, which have not yet been
written back to main memory. Alternatively, an agent might update main memory after a core
has taken its own copy. This is a problem with coherency, which is described in Chapter 14. This
can be a particular problem when you have multiple cores or memory agents such as an external
DMA controller.


11.1

Cache terminology
In a von Neumann architecture, a single cache is used for instruction and data (a unified cache).
A modified Harvard architecture has separate instruction and data buses and therefore there are
two caches, an instruction cache (I-cache) and a data cache (D-cache). In the ARMv8
processors, there are distinct instruction and data L1 caches backed by a unified L2 cache.
The cache is required to hold an address, some data and some status information.
The following is a brief summary of some of the terms used and a diagram illustrating the
fundamental structure of a cache:

Figure 11-2 Cache terminology (labels: 64-bit address split into Tag, Index and Offset; Tag RAM and Data RAM; Line, Set and Way)

•

The tag is the part of a memory address stored within the cache that identifies the main
memory address associated with a line of data.
The top bits of the 64-bit address tell the cache where the information came from in main
memory and are known as the tag. The total cache size is a measure of the amount of data
it can hold; the RAMs used to hold tag values are not included in the calculation.
The tag does, however, take up physical space in the cache.

•

It would be inefficient to hold one word of data for each tag address, so several locations
are typically grouped together under the same tag. This logical block is commonly known
as a cache line, and refers to the smallest loadable unit of a cache, a block of contiguous
words from main memory. A cache line is said to be valid when it contains cached data or
instructions, and invalid when it does not.
Associated with each line of data are one or more status bits. Typically, you have a valid
bit that marks the line as containing data that can be used. This means that the address tag
represents some real value. In a data cache, you might also have one or more dirty bits that
mark whether the cache line (or part of it) holds data that is not the same as (newer than)
the contents of main memory.

•

The index is the part of a memory address that determines in which lines of the cache the
address can be found.
The middle bits of the address, or index, identify the line. The index is used as the address
for the cache RAMs and does not require storage as a part of the tag. This is covered in more
detail later in this chapter.

•

A way is a subdivision of a cache, each way being of equal size and indexed in the same
fashion. A set consists of the cache lines from all ways sharing a particular index.

•

The bottom few bits of the address, called the offset, select a byte within the line and are
not required to be stored in the tag. You require the address of a whole line, not of each
byte within the line, so the five or six least significant bits of a line address are always 0.

11.1.1 Set associative caches and ways
The main caches of ARM cores are always implemented as set associative caches. This
significantly reduces the likelihood of the cache thrashing seen with direct mapped caches,
improving program execution speed and giving more deterministic execution. It comes at the
cost of increased hardware complexity and a slight increase in power, because multiple tags are
compared on each cycle.
With this kind of cache organization, the cache is divided into a number of equally-sized pieces,
called ways. A memory location can then map to a way rather than a line. The index field of the
address continues to be used to select a particular line, but now it points to an individual line in
each way. Commonly, there are two or four ways for an L1 Data cache. The Cortex-A57 has a
3-way L1 Instruction cache. It is common for an L2 cache to have 16 ways.
An external L3 cache implementation, such as the ARM CCN-504 Cache Coherent Network
(see Compute subsystems and mobile applications on page 14-18), can have a larger number of
ways, that is, higher associativity, because of its much larger size. The cache lines with the
same index value are said to belong to a set. To check for a hit, you must look at each of the tags
in the set.
In Figure 11-3, a 2-way cache is shown. Data from address 0x00, 0x40, or 0x80 might be found in
line 0 of either of the two cache ways, but not both.

[Diagram labels: Main memory addresses 0x0000.0000 to 0x0000.0090 in steps of 0x10; Cache way 0; Cache way 1]

Figure 11-3 A 2-way set-associative cache

Increasing the associativity of the cache reduces the probability of thrashing. The ideal case is
a fully associative cache, where any main memory location can map anywhere within the cache.
However, building such a cache is impractical for anything other than very small caches, for
example, those associated with MMU TLBs. In practice, performance improvements are
minimal above 8-way, with 16-way associativity being more useful for larger L2 caches.
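The effect of associativity on thrashing can be illustrated with a small Python model. This is a sketch, not a description of any ARM implementation: the function name count_misses, the LRU replacement policy, and the 16-byte line size (suggested by the address spacing in Figure 11-3) are assumptions made for illustration.

```python
from collections import OrderedDict

def count_misses(addresses, num_sets, ways, line_bytes=16):
    """Count misses in a set associative cache model.

    Assumptions: LRU replacement and a 16-byte line by default.
    """
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for addr in addresses:
        line = addr // line_bytes
        index, tag = line % num_sets, line // num_sets
        entries = sets[index]
        if tag in entries:
            entries.move_to_end(tag)         # hit: mark as most recently used
        else:
            misses += 1
            if len(entries) == ways:
                entries.popitem(last=False)  # evict the least recently used tag
            entries[tag] = True
    return misses

# Two addresses that share an index, accessed alternately.
pattern = [0x00, 0x80] * 4
direct_mapped = count_misses(pattern, num_sets=8, ways=1)  # 8 lines in total
two_way = count_misses(pattern, num_sets=4, ways=2)        # also 8 lines in total
```

In this model, both arrangements hold eight lines. The direct mapped arrangement misses on all eight accesses because each access evicts the other line, whereas the 2-way arrangement misses only on the first two.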

11.1.2 Cache tags and Physical Addresses
Each line has a tag associated with it which records the Physical Address in external memory
associated with that line. The size of a cache line is implementation defined. However, all the
cores should have the same cache line size because of the interconnect.
The Physical Address of the access is used to determine the location of data in cache. The least
significant bits are used to select the relevant item within a cache line. The middle bits are used
as an index to select a specific line within a cache set. The most significant bits identify the
remainder of the address and are used for comparison with the stored tag for that line. In
ARMv8, data caches are normally Physically Indexed, Physically Tagged (PIPT), but can also
be non-aliasing Virtually Indexed, Physically Tagged (VIPT).
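The division of a Physical Address into these three fields can be sketched in Python. The function name split_address and the default geometry (64-byte lines, 128 sets) are assumptions chosen for illustration; real values are implementation defined.

```python
def split_address(addr, line_bytes=64, num_sets=128):
    """Split a physical address into (tag, index, offset).

    Assumes line_bytes and num_sets are powers of two (hypothetical geometry).
    """
    offset_bits = line_bytes.bit_length() - 1   # log2(line size)
    index_bits = num_sets.bit_length() - 1      # log2(number of sets)
    offset = addr & (line_bytes - 1)            # least significant bits
    index = (addr >> offset_bits) & (num_sets - 1)  # middle bits
    tag = addr >> (offset_bits + index_bits)    # most significant bits
    return tag, index, offset
```

For example, with this geometry the addresses 0x0 and 0x2000 produce the same index (0) but different tags (0 and 1), so they compete for the same set.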
Each line in the cache includes:
•

A tag value from the associated Physical Address.

•

Valid bits to indicate whether the line exists in the cache, that is whether the tag is valid.
Valid bits can also be state bits for MESI state if the cache is coherent across multiple
cores.

•

Dirty data bits to indicate whether the data in the cache line is not coherent with external
memory.

ARM caches are set associative. This means that there are multiple possible cache locations, or
ways, for any given address. A set associative cache significantly reduces the likelihood of
cache thrashing and so improves program execution speed, but at the cost of increased hardware
complexity and a slight increase in power.
A simplified four-way set associative 32KB L1 cache (such as the data cache of the Cortex-A57
processor), with a 16-word (64 byte) cache line length, is shown in Figure 11-4:

[Diagram labels: Address fields Tag, Set (=Index), Word, and Byte; each cache line holds a Tag, data, V=valid bit, and D=dirty bit]

Figure 11-4 A 32KB 4-way set associative data cache
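The number of sets and the field widths for a cache with the stated parameters (32KB, four ways, 64-byte lines) follow from simple arithmetic. The helper below is a sketch; the name cache_geometry is hypothetical and the calculation assumes power-of-two sizes.

```python
def cache_geometry(size_bytes, ways, line_bytes):
    """Return (sets, offset_bits, index_bits) for a set associative cache.

    Assumes all three parameters give power-of-two results.
    """
    sets = size_bytes // (ways * line_bytes)
    offset_bits = line_bytes.bit_length() - 1   # bits selecting a byte in the line
    index_bits = sets.bit_length() - 1          # bits selecting the set
    return sets, offset_bits, index_bits

sets, offset_bits, index_bits = cache_geometry(32 * 1024, 4, 64)
```

Under these assumptions the cache has 128 sets, six offset bits, and seven index bits, so the tag starts at bit 13 of the address.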

11.1.3 Inclusive and exclusive caches
Consider a simple memory read, for example, LDR X0, [X1] in a single core processor.
•

If X1 points to a location in memory, which is marked as cacheable, then there is a cache
lookup in the L1 data cache.

•

If the address is found within the L1 cache, then data is read from the L1 cache and
returned to the core.

Figure 11-5 Found in the L1 cache

•

If the address is not found in the L1 cache, but is in the L2 cache, then the cache line is
loaded into the L1 cache from the L2 cache and the data is returned to the core. This can
cause a line to be evicted from the L1 to make room, but it might still be present in the
larger L2 cache.

Figure 11-6 Found in the L2 cache

•

If the address is not in either the L1 or L2 caches, data is loaded into both the L1 and L2
caches from external memory and supplied to the core. This can cause lines to be evicted.

Figure 11-7 Found in external memory

This is a rather simplistic view. For multi-core and multi-cluster systems, before performing a
load from external memory, the L1 or L2 caches of other cores within the cluster, or of other
clusters, might also be checked. In addition, this description does not consider L3 or system
caches.
This is an inclusive cache model, where the same data can be present in both the L1 and L2
caches. In an exclusive cache, data can be present in only one cache and an address cannot be
found in both the L1 and L2 caches at the same time.

11.2 Cache controller
The cache controller is a hardware block responsible for managing the cache memory, in a way
that is largely invisible to the program. It automatically writes code or data from main memory
into the cache. It takes read and write memory requests from the core and performs the necessary
actions to the cache memory or the external memory.
When it receives a request from the core, it must check to see whether the requested address is
to be found in the cache. This is known as a cache look-up. It does this by comparing a subset
of the address bits of the request with tag values associated with lines in the cache. If there is a
match, known as a hit, and the line is marked valid, then the read or write occurs using the cache
memory.
When the core requests instructions or data from a particular address, but there is no match with
the cache tags, or the tag is not valid, a cache miss results and the request must be passed to the
next level of the memory hierarchy, an L2 cache, or external memory. It can also cause a cache
linefill. A cache linefill causes the contents of a piece of main memory to be copied into the
cache. At the same time, the requested data or instructions are streamed to the core. This process
occurs transparently and is not directly visible to a software developer. The core need not wait
for the linefill to complete before using the data. The cache controller typically accesses the
critical word within the cache line first. For example, if you perform a load instruction that
misses in the cache and triggers a cache linefill, the core first retrieves that part of the cache line
that contains the requested data. This critical data is supplied to the core pipeline, while the
cache hardware and external bus interface then read the rest of the cache line, in the background.

11.3 Cache policies
The cache policies enable us to describe when a line should be allocated to the data cache and
what should happen when a store instruction is executed that hits in the data cache.
The cache allocation policies are:
Write allocation (WA)
A cache line is allocated on a write miss. This means that executing a store
instruction on the processor might cause a burst read to occur. There is a
linefill to obtain the data for the cache line, before the write is performed.
The cache contains the whole line, which is its smallest loadable unit, even
if you are only writing to a single byte within the line.
Read allocation (RA)
A cache line is allocated on a read miss.
The cache update policies are:
Write-back (WB)
A write updates the cache only and marks the cache line as dirty. External
memory is updated only when the line is evicted or explicitly cleaned.

[Diagram labels: Core; Reads; Writes; Data Cache; Linefills; Dirty Data (Eviction); Writes Miss; Write Buffer; External Write; Level 2 System; Write-Back Data Cache Mode]

Figure 11-8 Write-back

Write-through (WT)
A write updates both the cache and the external memory system. This does
not mark the cache line as dirty.

[Diagram labels: Core; Reads; Writes; Data Cache; Write Buffer; Write-Through Data Cache Mode]

Figure 11-9 Write-through

Data reads which hit in the cache behave the same in both WT and WB cache modes.
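The difference between the two update policies can be seen by counting the writes that reach external memory. The following Python sketch makes simplifying assumptions that are not part of the architecture: the function name external_writes is hypothetical, the cache never evicts a line, and every dirty line is cleaned once at the end.

```python
def external_writes(stores, policy, line_bytes=64):
    """Count writes reaching external memory for a list of store addresses.

    Simplifications: no evictions, and one final clean of all dirty lines.
    """
    dirty_lines = set()
    writes = 0
    for addr in stores:
        if policy == "WT":
            writes += 1                           # every store is written through
        elif policy == "WB":
            dirty_lines.add(addr // line_bytes)   # store only marks the line dirty
        else:
            raise ValueError("policy must be 'WT' or 'WB'")
    writes += len(dirty_lines)                    # final clean: one write per dirty line
    return writes

# Sixteen word stores that all fall within one 64-byte line.
stores = [0x1000 + 4 * i for i in range(16)]
wt_writes = external_writes(stores, "WT")
wb_writes = external_writes(stores, "WB")
```

In this model the write-through policy generates sixteen external writes, while the write-back policy generates a single line write when the line is cleaned.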

The cacheable properties of normal memory are specified separately as inner and outer
attributes. The divide between inner and outer is IMPLEMENTATION DEFINED and is covered in
greater detail in Chapter 13. Typically, inner attributes are used by the integrated caches, and
outer attributes are made available on the processor memory bus for use by external caches.

[Diagram labels: Cores; Level 1 Cache; Level 2 Cache; Level 3 Cache; Memory System; Inner Cacheable; Outer Cacheable]

Figure 11-10 Cacheable properties of memory

Normal memory can be speculatively accessed by the processor. This means that the processor
can load data into the cache automatically, without the programmer having explicitly
requested a specific address. This is covered in more detail in Chapter 13 Memory Ordering.
However, the programmer can also give the core an indication of which data is likely to be
used in the future. ARMv8-A provides preload hint instructions for this purpose. It is
IMPLEMENTATION DEFINED whether the caches support speculation and preload. The following
instructions are available:
•

AArch64: PRFM PLDL1KEEP, [Xm, #imm] ; This indicates a Prefetch for a load from Xm +
offset into the L1 cache as a temporal prefetch, which means that the data might be used
more than once.

•

AArch32: PLD Rm // Preload data from address in Rm to cache

More generally, the A64 instruction to prefetch memory has the following form:
PRFM <prfop>, addr

Where:

<prfop>     <type><target><policy> | #uimm5

<type>      PLD for prefetch for load
            PST for prefetch for store

<target>    L1 for L1 cache, L2 for L2 cache, L3 for L3 cache

<policy>    KEEP for retain or temporal prefetch means allocate in cache normally
            STRM for streaming or non-temporal prefetch means the memory is used only once

uimm5       Represents the hint encodings as a 5-bit immediate. These are optional.
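The relationship between a named hint and its 5-bit immediate can be sketched as follows. The bit layout used here (type in bits [4:3], target in bits [2:1], policy in bit [0]) is taken from the A64 instruction encoding rather than from this guide, and the function name prfop_to_uimm5 is hypothetical. Only the PLD and PST types listed above are modeled.

```python
# Field encodings, assuming the standard A64 PRFM layout.
_TYPE = {"PLD": 0b00, "PST": 0b10}
_TARGET = {"L1": 0b00, "L2": 0b01, "L3": 0b10}
_POLICY = {"KEEP": 0, "STRM": 1}

def prfop_to_uimm5(name):
    """Convert a hint name such as 'PLDL1KEEP' to its 5-bit immediate."""
    type_, target, policy = name[:3], name[3:5], name[5:]
    return (_TYPE[type_] << 3) | (_TARGET[target] << 1) | _POLICY[policy]
```

For example, PLDL1KEEP maps to 0 and PSTL1KEEP maps to 16 under this layout.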

11.4 Point of coherency and unification
For set-based and way-based clean and invalidate, the operation is performed on a specific level
of cache. For operations that use a Virtual Address, the architecture defines two points:
•

Point of Coherency (PoC). For a particular address, the PoC is the point at which all
observers, for example, cores, DSPs, or DMA engines, that can access memory, are
guaranteed to see the same copy of a memory location. Typically, this is the main external
system memory.

[Diagram labels: Core 0; Core 1; System control registers; Data Cache; Point of Coherency]

Figure 11-11 Point of Coherency

•

Point of Unification (PoU). The PoU for a core is the point at which the instruction and
data caches and translation table walks of the core are guaranteed to see the same copy of
a memory location. For example, a unified level 2 cache would be the point of unification
in a system with Harvard level 1 caches and a TLB for caching translation table entries.
If no external cache is present, main memory would be the Point of Unification.

[Diagram labels: Core; System control registers; Instruction Cache; Data Cache; TLB; Point of Unification]

Figure 11-12 Point of Unification

Knowledge of the PoU enables self-modifying code to ensure future instruction fetches are
correctly made from the modified version of the code. It can do this by using a two-stage
process:
•

Clean the relevant data cache entries by address.

•

Invalidate instruction cache entries by address.

The ARM architecture does not require the hardware to ensure coherency between instruction
caches and memory, even for locations of shared memory.

11.5 Cache maintenance
It is sometimes necessary for software to clean or invalidate a cache. This might be required
when the contents of external memory have been changed and it is necessary to remove stale
data from the cache. It can also be required after MMU-related activity such as changing access
permissions, cache policies, or virtual to Physical Address mappings, or when I and D-caches
must be synchronized for dynamically generated code such as JIT-compilers and dynamic
library loaders.
•

Invalidation of a cache or cache line means to clear it of data, by clearing the valid bit of
one or more cache lines. The cache must always be invalidated after reset as its contents
are undefined. This can also be viewed as a way of making changes in the memory domain
outside the cache visible to the user of the cache.

•

Cleaning a cache or cache line means writing the contents of cache lines that are marked
as dirty, out to the next level of cache, or to main memory, and clearing the dirty bits in
the cache line. This makes the contents of the cache line coherent with the next level of
the cache or memory system. This is only applicable for data caches in which a write-back
policy is used. This is also a way of making changes in the cache visible to the user of the
outer memory domain, but is only available for data cache.

•

Zero. This zeroes a block of memory within the cache, without the need to first of all read
its contents from the outer domain. This is only available for data cache.

For each of these operations, you can select which of the entries the operation should apply to:
•

All, means the entire cache and is not available for the data or unified cache

•

Modified Virtual Address (MVA), another name for VA, is the cache line that contains a
specific Virtual Address

•

Set or Way is a specific cache line selected by its position within the cache structure

AArch64 cache maintenance operations are performed using instructions which have the
following general form:
<cache> <operation>{, <Xt>}

A number of operations are available.
Table 11-1 Data cache, instruction cache, and unified cache operations

Cache  Operation  Description                                                    AArch32 Equivalent
DC     CISW       Clean and invalidate by Set/Way                                DCCISW
DC     CIVAC      Clean and Invalidate by Virtual Address to Point of Coherency  DCCIMVAC
DC     CSW        Clean by Set/Way                                               DCCSW
DC     CVAC       Clean by Virtual Address to Point of Coherency                 DCCMVAC
DC     CVAU       Clean by Virtual Address to Point of Unification               DCCMVAU
DC     ISW        Invalidate by Set/Way                                          DCISW
DC     IVAC       Invalidate by Virtual Address, to Point of Coherency           DCIMVAC
DC     ZVA        Cache zero by Virtual Address                                  -
IC     IALLUIS    Invalidate all, to Point of Unification, Inner Shareable       ICIALLUIS
IC     IALLU      Invalidate all, to Point of Unification                        ICIALLU
IC     IVAU       Invalidate by Virtual Address to Point of Unification          ICIMVAU


Those instructions that accept an address argument take a 64-bit register which holds the Virtual
Address to be maintained. No alignment restrictions apply for this address. Instructions that take
a Set/Way/Level argument take a 64-bit register whose lower 32 bits follow the format
described in the ARMv7 architecture. The AArch64 Data Cache invalidate instruction by
address, DC IVAC, requires write permission or else a permission fault is generated.
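As an illustration of the Set/Way/Level format, the following Python sketch builds the 32-bit operand for one cache line. It assumes the ARMv7 layout, in which the way number occupies the top bits, the set number starts at bit log2(line length in bytes), and the cache level minus one occupies bits [3:1]. The function name set_way_operand is hypothetical.

```python
def set_way_operand(level, way, set_index, ways, line_bytes):
    """Build a Set/Way/Level operand (assumed ARMv7 format).

    level is 1-based; ways and line_bytes describe the selected cache.
    """
    way_bits = (ways - 1).bit_length()          # ceil(log2(ways))
    line_shift = line_bytes.bit_length() - 1    # log2(line length in bytes)
    operand = (set_index << line_shift) | ((level - 1) << 1)
    if way_bits:                                # a 1-way cache has no way field
        operand |= way << (32 - way_bits)
    return operand
```

For example, way 3 and set 127 of a 4-way L1 cache with 64-byte lines gives 0xC0001FC0 under these assumptions.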
All instruction cache maintenance instructions can execute in any order relative to other
instruction cache maintenance instructions, data cache maintenance instructions, and loads and
stores, unless a DSB is executed between the instructions.
Data cache operations, other than DC ZVA, that specify an address are only guaranteed to execute
in program order relative to each other if they specify the same address. Those operations that
specify an address execute in program order relative to all maintenance operations that do not
specify an address.
Consider the following code sequence.
Example 11-1 Cache invalidate and clean to PoU

IC IVAU, X0   // Instruction Cache Invalidate by address to Point of Unification
DC CVAC, X0   // Data Cache Clean by address to Point of Coherency
IC IVAU, X1   // Might be out of order relative to the previous operations if
              // x0 and x1 differ

The first two instructions execute in order, as they refer to the same address. However, the final
instruction might be re-ordered relative to the previous operations, as it refers to a different
address.
Example 11-2 Cache invalidate to PoU

IC IVAU, X0   // I cache Invalidate by address to Point of Unification
IC IALLU      // I cache Invalidate All to Point of Unification
              // Operations execute in order

This only applies to issuing the instruction. Completion is only guaranteed after a DSB
instruction.
The ability to preload the data cache with zero values using the DC ZVA instruction is new in
ARMv8-A. Processors can operate significantly faster than external memory systems and it can
sometimes take a long time to load a cache line from memory.
Cache line zeroing behaves in a similar fashion to a prefetch, in that it is a way of hinting to the
processor that certain addresses are likely to be used in the future. However, a zeroing operation
can be much quicker as there is no need to wait for external memory accesses to complete.

Instead of getting the actual data from memory read into the cache, you get cache lines filled
with zeros. It enables hinting to the processor that the code completely overwrites the cache line
contents, so there is no need for an initial read.
Consider the case where you need a large temporary storage buffer or are initializing a new
structure. You could have code simply start using the memory, or you could write code that
prefetched it before using it. Both would use a lot of cycles and memory bandwidth in reading
the initial contents to the cache. By using a cache zero option, you could potentially save this
wasted bandwidth and execute the code faster.
The point at which a cache maintenance instruction takes place can be defined depending on
whether the instruction operates by VA or by Set/Way.
You can choose the scope, which can be either PoC or PoU, and for operations that can be
broadcast, see Chapter 14 Multi-core processors, you can select the Shareability.
The following example code illustrates a generic mechanism for cleaning the entire data or
unified cache to the PoC.
Example 11-3 Cleaning to Point of Coherency

        MRS X0, CLIDR_EL1
        AND W3, W0, #0x07000000   // Get 2 x Level of Coherence
        LSR W3, W3, #23
        CBZ W3, Finished
        MOV W10, #0               // W10 = 2 x cache level
        MOV W8, #1                // W8 = constant 0b1
Loop1:  ADD W2, W10, W10, LSR #1  // Calculate 3 x cache level
        LSR W1, W0, W2            // extract 3-bit cache type for this level
        AND W1, W1, #0x7
        CMP W1, #2
        B.LT Skip                 // No data or unified cache at this level
        MSR CSSELR_EL1, X10       // Select this cache level
        ISB                       // Synchronize change of CSSELR
        MRS X1, CCSIDR_EL1        // Read CCSIDR
        AND W2, W1, #7            // W2 = log2(linelen)-4
        ADD W2, W2, #4            // W2 = log2(linelen)
        UBFX W4, W1, #3, #10      // W4 = max way number, right aligned
        CLZ W5, W4                /* W5 = 32-log2(ways), bit position of way in DC
                                     operand */
        LSL W9, W4, W5            /* W9 = max way number, aligned to position in DC
                                     operand */
        LSL W16, W8, W5           // W16 = amount to decrement way number per iteration
Loop2:  UBFX W7, W1, #13, #15     // W7 = max set number, right aligned
        LSL W7, W7, W2            /* W7 = max set number, aligned to position in DC
                                     operand */
        LSL W17, W8, W2           // W17 = amount to decrement set number per iteration
Loop3:  ORR W11, W10, W9          // W11 = combine way number and cache number...
        ORR W11, W11, W7          // ... and set number for DC operand
        DC CSW, X11               // Do data cache clean by set and way
        SUBS W7, W7, W17          // Decrement set number
        B.GE Loop3
        SUBS X9, X9, X16          // Decrement way number
        B.GE Loop2
Skip:   ADD W10, W10, #2          // Increment 2 x cache level
        CMP W3, W10
        DSB SY                    /* Ensure completion of previous cache maintenance
                                     operation */
        B.GT Loop1
Finished:

Some points to note:
•

Under normal circumstances, cleaning or invalidating the entire cache is something that
only the firmware should be doing, as part of the core’s power-up or power-down
sequence. It can also take significant time, because the number of lines in the L2 cache can
be quite large and it is necessary to loop over them one by one.
Therefore this kind of clean is definitely for special occasions only!

•

Cache maintenance operations such as DC CSW are described in Cache maintenance on
page 11-13.

•

The caches must be disabled at the start of the sequence to prevent the allocation of new
lines mid-sequence. If the caches were exclusive, a line could migrate between levels.

•

In an SMP system, another core might be able to take dirty cache lines from the cache
mid-sequence, preventing them from reaching the PoC. Both the Cortex-A53 and
Cortex-A7 processors can do this.

•

If there is an EL3, then the caches must be invalidated from the Secure world as some of
the entries could be ‘secure dirty’ data which cannot be invalidated from the Normal
world. If left untouched, ‘secure dirty’ data can corrupt the memory system when it is
evicted because of normal cache use in the Secure or Normal worlds.

If software requires coherency between instruction execution and memory, it must manage this
coherency using the ISB and DSB memory barriers and cache maintenance instructions. The code
sequence shown in Example 11-4 can be used for this purpose.
Example 11-4 Cleaning a line of self-modifying code

/* Coherency example for data and instruction accesses within the same Inner
   Shareable domain. Enter this code with Wt containing a new 32-bit instruction,
   to be held in Cacheable space at a location pointed to by Xn. */
STR Wt, [Xn]
DC CVAU, Xn     // Clean data cache by VA to point of unification (PoU)
DSB ISH         // Ensure visibility of the data cleaned from cache
IC IVAU, Xn     // Invalidate instruction cache by VA to PoU
DSB ISH         // Ensure completion of the invalidations
ISB             // Synchronize the fetched instruction stream

This code sequence is only valid for an instruction sequence that fits into a single I or D-cache
line.
The following code cleans the data cache and invalidates the instruction cache by Virtual
Address for a region starting at the base address given in X0 and with the length given in X1.
Example 11-5 Cleaning by Virtual Address

    //
    // X0 = base address
    // X1 = length (we assume the length is not 0)
    //
    // Calculate end of the region
    ADD X1, X1, X0               // Base Address + Length
    //
    // Clean the data cache by MVA
    //
    MRS X2, CTR_EL0              // Read Cache Type Register
    // Get the minimum data cache line
    //
    UBFX X4, X2, #16, #4         // Extract DminLine (log2 of the cache line)
    MOV X3, #4                   // DminLine is the number of words (4 bytes)
    LSL X3, X3, X4               // X3 should contain the cache line
    SUB X4, X3, #1               // Get the mask for the cache line
    BIC X4, X0, X4               // Align the base address of the region
clean_data_cache:
    DC CVAU, X4                  // Clean data cache line by VA to PoU
    ADD X4, X4, X3               // Next cache line
    CMP X4, X1                   // Is X4 (current cache line) smaller than the end
                                 // of the region
    B.LT clean_data_cache        // while (address < end_address)
    DSB ISH                      // Ensure visibility of the data cleaned from cache
    //
    // Invalidate the instruction cache by VA
    //
    // Get the minimum instruction cache line (X2 contains ctr_el0)
    AND X2, X2, #0xF             // Extract IminLine (log2 of the cache line)
    MOV X3, #4                   // IminLine is the number of words (4 bytes)
    LSL X3, X3, X2               // X3 should contain the cache line
    SUB X4, X3, #1               // Get the mask for the cache line
    BIC X4, X0, X4               // Align the base address of the region
clean_instruction_cache:
    IC IVAU, X4                  // Invalidate instruction cache line by VA to PoU
    ADD X4, X4, X3               // Next cache line
    CMP X4, X1                   // Is X4 (current cache line) smaller than the end
                                 // of the region
    B.LT clean_instruction_cache // while (address < end_address)
    DSB ISH                      // Ensure completion of the invalidations
    ISB                          // Synchronize the fetched instruction stream
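The address arithmetic used by this example can be checked with a short Python model. It is a sketch only: the function names line_sizes and lines_in_region are hypothetical, and the CTR_EL0 value used in the usage lines is made up for illustration rather than read from a real core.

```python
def line_sizes(ctr_el0):
    """Return (instruction, data) minimum line sizes in bytes from CTR_EL0."""
    imin = 4 << (ctr_el0 & 0xF)           # IminLine, bits [3:0], log2 of words
    dmin = 4 << ((ctr_el0 >> 16) & 0xF)   # DminLine, bits [19:16], log2 of words
    return imin, dmin

def lines_in_region(base, length, line_bytes):
    """List the line-aligned addresses visited by the loop in the example."""
    start = base & ~(line_bytes - 1)      # same effect as the BIC instruction
    return list(range(start, base + length, line_bytes))

imin, dmin = line_sizes(0x00040004)       # hypothetical register value
visited = lines_in_region(0x1010, 0x80, dmin)
```

With 64-byte lines, a region of 0x80 bytes starting at 0x1010 touches three lines (0x1000, 0x1040, and 0x1080), which matches the number of iterations of the assembly loop.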

11.6 Cache discovery
Cache maintenance operations can be performed either by cache set, or way, or by Virtual
Address. Code that is platform-independent might need to know the size of a cache, the size of
the cache lines, numbers of sets and ways, and how many levels of cache there are in the system.
This requirement is most likely to arise for post-reset cache invalidation and zero operations.
All other operations on architectural caches are likely to be made on a PoC or PoU basis.
There are a number of system control registers that contain this information:
•

The number of cache levels present can be determined by having software read the Cache
Level ID Register (CLIDR_EL1).

•

The cache line size is given in the Cache Type Register (CTR_EL0).

•

If user code running at EL0 needs to access CTR_EL0, this can be enabled by setting the
UCT bit of the System Control Register (SCTLR/SCTLR_EL1).

Accesses from a privileged Exception level to two separate registers are required to
determine the number of sets and ways in a cache.
1.

Code must first write to the Cache Size Selection Register (CSSELR_EL1) to select which
cache you want the information for.

2.

Code then reads the Cache Size ID Register (CCSIDR/CCSIDR_EL1).
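Once the Cache Size ID Register has been read, its fields can be decoded as in the following Python sketch. It assumes the 32-bit field layout that Example 11-3 also relies on (LineSize in bits [2:0], Associativity in bits [12:3], NumSets in bits [27:13]). The function name decode_ccsidr and the sample register value are hypothetical.

```python
def decode_ccsidr(ccsidr):
    """Return (line_bytes, ways, sets) from a CCSIDR value.

    Assumes the 32-bit CCSIDR format used by Example 11-3.
    """
    line_bytes = 1 << ((ccsidr & 0x7) + 4)    # LineSize = log2(bytes) - 4
    ways = ((ccsidr >> 3) & 0x3FF) + 1        # Associativity = ways - 1
    sets = ((ccsidr >> 13) & 0x7FFF) + 1      # NumSets = sets - 1
    return line_bytes, ways, sets

line_bytes, ways, sets = decode_ccsidr(0xFE01A)   # made-up sample value
cache_size = line_bytes * ways * sets
```

The sample value decodes to 64-byte lines, four ways, and 128 sets, which is a 32KB cache.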

The Data cache Zero ID Register (DCZID_EL0) contains the block size to be zeroed for
Zero operations.

The [DZE] bit of the SCTLR/SCTLR_EL1 and the [TDZ] bit in the Hypervisor
Configuration Register (HCR/HCR_EL2) control which execution levels and which
worlds can access DCZID_EL0. CLIDR_EL1, CSSELR_EL1, and CCSIDR_EL1 are
only accessible via privileged code, that is, PL1 or higher in AArch32, or EL1 or higher
in AArch64.

If execution of the Data Cache Zero by Virtual Address (DC ZVA) instruction is prohibited
at an Exception level, as controlled for EL0 by the SCTLR_EL1.DZE bit, and for
Non-secure execution in EL1 and EL0 by the HCR_EL2.TDZ bit, then reading this
register returns a value that indicates that the instruction is not supported.
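A value read from DCZID_EL0 can be interpreted as in the following Python sketch. It assumes that bits [3:0] (BS) hold log2 of the block size in words and that bit [4] (DZP) indicates that DC ZVA is prohibited. The function name decode_dczid and the sample values are hypothetical.

```python
def decode_dczid(dczid):
    """Return (block_bytes, prohibited) from a DCZID_EL0 value."""
    bs = dczid & 0xF                    # BS, bits [3:0]: log2(block size in words)
    prohibited = bool(dczid & 0x10)     # DZP, bit [4]: DC ZVA not permitted
    return 4 << bs, prohibited
```

For example, a value of 0x4 describes a 64-byte zeroing block with DC ZVA permitted, while 0x14 describes the same block size with DC ZVA prohibited at the current Exception level.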

The CLIDR register is only aware of how many levels of cache are integrated into the
processor itself. It cannot provide information about any caches in the external memory
system.
For example, if only L1 and L2 are integrated, CLIDR/CLIDR_EL1 identifies two levels
of cache and the processor is unaware of any external L3 cache.
It might be necessary to take non-integrated caches into account when performing cache
maintenance, or in code that maintains coherency with integrated caches.
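The integrated cache levels reported by the CLIDR register can be decoded as in the following Python sketch. It assumes the layout used by Example 11-3: a 3-bit cache type field per level starting at bit 0, and the Level of Coherence in bits [26:24]. The function name decode_clidr and the sample value are hypothetical.

```python
CTYPE = {
    1: "instruction only",
    2: "data only",
    3: "separate instruction and data",
    4: "unified",
}

def decode_clidr(clidr):
    """Return (list of cache types per level, Level of Coherence)."""
    levels = []
    for level in range(7):
        ctype = (clidr >> (3 * level)) & 0x7
        if ctype == 0:                  # no cache at this level or beyond
            break
        levels.append(CTYPE.get(ctype, "reserved"))
    loc = (clidr >> 24) & 0x7           # Level of Coherence, bits [26:24]
    return levels, loc

levels, loc = decode_clidr(0x02000023)  # made-up sample value
```

The sample value describes separate L1 instruction and data caches, a unified L2 cache, and a Level of Coherence of 2. Any external L3 cache would not appear in this result.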

[Diagram labels: CPU; Level 1 Cache; Level 2 Cache; Bus Interface Unit; Level 3 Cache; AMBA Interconnect; Level 4 System Cache]

Figure 11-13

In addition, in a big.LITTLE system, the described cache hierarchy can differ from core
to core, for example, the Cortex-A53 and Cortex-A57 processors have different
CTR.L1IP fields.

Chapter 12
The Memory Management Unit

An important function of the Memory Management Unit (MMU) is to enable the system to run
multiple tasks, as independent programs running in their own private virtual memory space.
They do not need any knowledge of the physical memory map of the system, that is, the
addresses that are actually used by the hardware, or about other programs that might execute at
the same time.
[Diagram labels: ARM Core; MMU; TLBs; Table Walk Unit; Caches; Memory; Translation tables]

Figure 12-1 The Memory Management Unit

You can use the same virtual memory address space for each program. You can also work with
a contiguous virtual memory map, even if the physical memory is fragmented. This Virtual
Address space is separate from the actual physical map of memory in the system. You can write,
compile, and link applications to run in the virtual memory space.
An example system, illustrating the virtual and physical views of memory, is shown in
Figure 12-2 on page 12-2. Different processors and devices in a single system might have
different virtual and Physical Address maps. The OS programs the MMU to translate between
these two views of memory.

[Diagram labels: Virtual memory from 0x00000000_00000000 to 0xFFFFFFFF_FFFFFFFF with User space and Kernel space; Physical memory with ROM, RAM, Peripherals, and Reserved regions]

Figure 12-2 Virtual and physical memory

To do this, the hardware in a virtual memory system must provide address translation, which is
the translation of the Virtual Address issued by the processor to a Physical Address in the main
memory.
Virtual Addresses are those used by you, and the compiler and linker, when placing code in
memory. Physical Addresses are those used by the actual hardware system.
The MMU uses the most significant bits of the Virtual Address to index entries in a translation
table and establish which block is being accessed. The MMU translates the Virtual Addresses
of code and data to the Physical Addresses in the actual system. The translation is carried out
automatically in hardware and is transparent to the application. In addition to address
translation, the MMU controls memory access permissions, memory ordering, and cache
policies for each region of memory.

[Diagram labels: Virtual memory with Kernel space from 0xFFFF0000_00000000 to 0xFFFFFFFF_FFFFFFFF (not available in EL2 or EL3), using the translation table at TTBR1_EL1, and User space from 0x00000000_00000000 to 0x0000FFFF_FFFFFFFF, using the translation table at TTBR0_EL1; Physical memory with ROM, RAM, Peripherals, and Reserved regions]

Figure 12-3 Address translation using translation tables

The MMU enables tasks or applications to be written in a way that requires them to have no
knowledge of the physical memory map of the system, or about other programs that might be
running simultaneously. This allows you to use the same virtual memory address space for each
program.
It also lets you work with a contiguous virtual memory map, even if the physical memory is
fragmented. This Virtual Address space is separate from the actual physical map of memory in
the system. Applications are written, compiled and linked to run in the virtual memory space.

12.1

The Translation Lookaside Buffer
The Translation Lookaside Buffer (TLB) is a cache of recently accessed page translations in the
MMU. For each memory access performed by the processor, the MMU checks whether the
translation is cached in the TLB. If the requested address translation causes a hit within the TLB,
the translation of the address is immediately available.
Each TLB entry typically contains not just physical and Virtual Addresses, but also attributes
such as memory type, cache policies, access permissions, the Address Space ID (ASID), and the
Virtual Machine ID (VMID). If the TLB does not contain a valid translation for the Virtual
Address issued by the processor, known as a TLB miss, an external translation table walk or
lookup is performed. Dedicated hardware within the MMU enables it to read the translation
tables in memory. The newly loaded translation can then be cached in the TLB for possible reuse
if the translation table walk does not result in a page fault. The exact structure of the TLB differs
between implementations of the ARM processors.
If the OS modifies translation entries that may have been cached in the TLB, it is then the
responsibility of the OS to invalidate these stale TLB entries.
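The hit, miss, and invalidate behavior described above can be modelled on a host machine. The following Python sketch is illustrative only: the class name ToyTLB, the dictionary that stands in for the translation tables in memory, and the fixed 4KB page size are all assumptions made for this example, and real TLB structures differ between implementations.

```python
class ToyTLB:
    """Toy model of a TLB: caches (asid, virtual page) -> physical page."""

    PAGE_SHIFT = 12  # assumption: 4KB pages, for illustration only

    def __init__(self, page_table):
        self.page_table = page_table  # stands in for the tables in memory
        self.entries = {}
        self.hits = 0
        self.misses = 0

    def translate(self, asid, va):
        key = (asid, va >> self.PAGE_SHIFT)
        if key in self.entries:
            self.hits += 1                      # TLB hit
        else:
            self.misses += 1                    # TLB miss: "walk" the table
            if key not in self.page_table:
                raise LookupError("translation fault")  # faults are not cached
            self.entries[key] = self.page_table[key]    # cache for reuse
        offset = va & ((1 << self.PAGE_SHIFT) - 1)
        return (self.entries[key] << self.PAGE_SHIFT) | offset

    def invalidate_asid(self, asid):
        """Drop every cached entry for one ASID."""
        self.entries = {k: v for k, v in self.entries.items() if k[0] != asid}
```

If the model's page_table is changed after an entry has been cached, translate() keeps returning the stale value until invalidate_asid() is called, which mirrors the OS responsibility described above.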
In A64 code, the TLBI instruction invalidates TLB entries. Its syntax is:
TLBI <type><level>{IS} {, <Xt>}

The following list gives some of the more common selections for the type field. A complete list
is given in Table 12-1 on page 12-5.
ALL        All TLB entries.
VMALL      All TLB entries. This is stage 1 for current guest OS.
VMALLS12   All TLB entries. This is stage 1 and 2 for current guest OS.
ASID       Entries that match ASID in Xt.
VA         Entry for Virtual Address and ASID specified in Xt.
VAA        Entries for Virtual Address specified in Xt, with any ASID.

Each Exception level, that is EL3, EL2, or EL1, has its own Virtual Address space that the
operation applies to. The IS field specifies that this is only for Inner Shareable entries.
Note
See Context switching on page 12-27 for information about ASIDs and Translation table
configuration on page 12-18 for more about the concept of shareability.
The <level> field simply specifies the Exception level Virtual Address space (can be 3, 2 or 1)
that the operation should apply to.

Table 12-1 TLB configuration instructions

TLB invalidate  Variant        Description
TLBI            ALLEn          TLB invalidate All, ELn.
                ALLEnIS        TLB invalidate All, ELn, Inner Shareable.
                ASIDE1         TLB invalidate by ASID, EL1.
                ASIDE1IS       TLB invalidate by ASID, EL1, Inner Shareable.
                IPAS2E1        TLB invalidate by IPA, Stage 2, EL1.
                IPAS2E1IS      TLB invalidate by IPA, Stage 2, EL1, Inner Shareable.
                IPAS2LE1IS     TLB invalidate by IPA, Stage 2, Last level, EL1, Inner Shareable.
                VAAE1          TLB invalidate by VA, All ASID, EL1.
                VAAE1IS        TLB invalidate by VA, All ASID, EL1, Inner Shareable.
                VAALE1IS       TLB invalidate for the Last level, by VA, All ASID, EL1, Inner Shareable.
                VAEn           TLB invalidate by VA, ELn.
                VAEnIS         TLB invalidate by VA, ELn, Inner Shareable.
                VALEn          TLB invalidate by VA, Last level, ELn.
                VALEnIS        TLB invalidate by VA, Last level, ELn, Inner Shareable.
                VMALLE1        TLB invalidate by VMID, All at stage 1, EL1.
                VMALLE1IS      TLB invalidate by VMID, EL1, Inner Shareable.
                VMALLS12E1     TLB invalidate by VMID, All at Stage 1 and 2, EL1.
                VMALLS12E1IS   TLB invalidate by VMID, All at Stage 1 and 2, EL1, Inner Shareable.

The following code example shows a sequence for writes to translation tables backed by inner
shareable memory:
<< Writes to Translation Tables >>
DSB ISHST       // ensure write has completed
TLBI ALLE1      // invalidate all TLB entries
DSB ISH         // ensure completion of TLB invalidation
ISB             // synchronize context and ensure that no instructions are
                // fetched using the old translation

See Barriers on page 13-6 for more information about the DSB and ISB barrier instructions shown
in the example.

For a change to a single entry, for example, use the instruction:
TLBI VAE1, X0

which invalidates an entry associated with the address specified in the register X0.
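The register operand of a by-VA invalidate is not the raw Virtual Address. As a hedged illustration, the following Python sketch builds the value that software would place in X0, assuming the operand layout given in the ARMv8-A Architecture Reference Manual for TLBI VAE1: the ASID in bits [63:48] and VA[55:12] in bits [43:0]. The function name tlbi_va_operand is hypothetical.

```python
def tlbi_va_operand(va, asid=0):
    """Build the Xt value for a by-VA TLBI such as TLBI VAE1, Xt.

    Assumed layout: Xt[63:48] = ASID, Xt[43:0] = VA[55:12].
    """
    if not 0 <= asid < (1 << 16):
        raise ValueError("ASID must fit in 16 bits")
    va_field = (va >> 12) & ((1 << 44) - 1)   # VA[55:12] -> Xt[43:0]
    return (asid << 48) | va_field
```

For example, tlbi_va_operand(0x1000) yields 1, because the low 12 bits of the address are not part of the operand.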
The TLB can hold a fixed number of entries. You can achieve best performance by minimizing
the number of external memory accesses caused by translation table traversal and obtaining a
high TLB hit rate. The ARMv8-A architecture provides a feature known as contiguous block
entries to efficiently use TLB space. Translation table block entries each contain a contiguous
bit. When set, this bit signals to the TLB that it can cache a single entry covering translations
for multiple blocks. A lookup can index anywhere into an address range covered by a
contiguous block. The TLB can therefore cache one entry for a defined range of addresses,
making it possible to store a larger range of Virtual Addresses within the TLB than is otherwise
possible.
To use a contiguous bit, the contiguous blocks must be adjacent, that is they must correspond to
a contiguous range of Virtual Addresses. They must start on an aligned boundary, have
consistent attributes, and point to a contiguous output address range at the same level of
translation. The required alignment is that VA[20:16] for a 4KB granule or VA[28:21] for a
64KB granule, are the same for all addresses. The following numbers of contiguous blocks are
required:
•  16 × 4KB adjacent blocks giving a 64KB entry with a 4KB granule.
•  32 × 32MB adjacent blocks giving a 1GB entry for L2 descriptors, or 128 × 16KB adjacent
   blocks giving a 2MB entry for L3 descriptors, when using a 16KB granule.
•  32 × 64KB adjacent blocks giving a 2MB entry with a 64KB granule.

If these conditions are not met, a programming error occurs, which can cause TLB aborts or
corrupted lookups. Possible examples of such an error include:
•  One or more of the table entries do not have the contiguous bit set.
•  The output of one of the entries points outside the aligned range.

In the ARMv8 architecture, incorrect use of the contiguous bit cannot be used to escape the
permission checks that apply to the EL0 and EL1 valid address space, and cannot erroneously
provide access to EL3 space.
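The arithmetic behind the contiguous run sizes, and the alignment and contiguity conditions, can be checked with a short host-side sketch. The names CONTIGUOUS_RUNS, contiguous_span, and run_is_valid are hypothetical, and the table levels assigned to the 4KB and 64KB granule cases are an assumption based on the entry sizes.

```python
KB, MB, GB = 1 << 10, 1 << 20, 1 << 30

# (granule, level) -> (number of adjacent entries, size of each entry).
# Level 3 for the 4KB and 64KB cases is an assumption for this sketch.
CONTIGUOUS_RUNS = {
    ("4KB", 3): (16, 4 * KB),
    ("16KB", 2): (32, 32 * MB),
    ("16KB", 3): (128, 16 * KB),
    ("64KB", 3): (32, 64 * KB),
}

def contiguous_span(granule, level):
    """Size of the address range covered by one contiguous run."""
    count, size = CONTIGUOUS_RUNS[(granule, level)]
    return count * size

def run_is_valid(granule, level, base_va, output_addrs):
    """Check alignment and contiguity of one candidate contiguous run."""
    count, size = CONTIGUOUS_RUNS[(granule, level)]
    span = count * size
    if base_va % span or len(output_addrs) != count:
        return False                      # misaligned start or wrong count
    first = output_addrs[0]
    if first % span:
        return False                      # output range is not aligned
    return all(pa == first + i * size for i, pa in enumerate(output_addrs))
```

A run whose base address is not a multiple of the span, or whose output addresses leave a gap, is rejected by run_is_valid(), which corresponds to the programming errors listed above.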

12.2

Separation of kernel and application Virtual Address spaces
Operating systems typically have a number of applications or tasks running concurrently. Each
of these has its own unique set of translation tables and the kernel switches from one to another
as part of the process of switching context between one task and another. However, much of the
memory system is used only by the kernel and has fixed virtual to Physical Address mappings
where the translation table entries rarely change. The ARMv8 architecture provides a number
of features to efficiently handle this requirement.
The table base addresses are specified in the Translation Table Base Registers (TTBR0_EL1)
and (TTBR1_EL1). The translation table pointed to by TTBR0 is selected when the upper bits
of the VA are all 0. TTBR1 is selected when the upper bits of the VA are all set to 1. You can
enable VA tagging to exclude the top eight bits from the check.
The Virtual Address from the processor of an instruction fetch or data access is 64 bits.
However, you must map both of the two regions defined above within a single 48-bit Physical
Address memory map.
EL2 and EL3 have a TTBR0, but no TTBR1. This means:
•  If EL2 is using AArch64, it can only use Virtual Addresses in the range 0x0 to
   0x0000FFFF_FFFFFFFF.
•  If EL3 is using AArch64, it can only use Virtual Addresses in the range 0x0 to
   0x0000FFFF_FFFFFFFF.

Figure 12-4 shows how the kernel space can be mapped to the most significant area of memory
and the Virtual Address space associated with each application mapped to the least significant
area of memory. However, both of these are mapped to a much smaller Physical Address space.
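[Figure: in the virtual address space, kernel space (0xFFFF0000_00000000 to
0xFFFFFFFF_FFFFFFFF) is translated through TTBR1 and application space (0x00000000_00000000 to
0x0000FFFF_FFFFFFFF) is translated through TTBR0. Addresses between the two ranges generate a
fault. Both regions map into a smaller physical address space.]

Figure 12-4 Kernel and application memory mapping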

The Translation Control Register TCR_EL1 defines the exact number of most significant bits
that are checked. TCR_EL1 contains the size fields T0SZ[5:0] and T1SZ[5:0]. The integer in
the field gives the number of the most significant bits that must be either all 0s or all 1s. There
are specified minimum and maximum values for these fields, which vary with granule size and
starting table level. Therefore, you must always use both spaces and at least two translation
tables are required in all systems. A simple bare metal system without an OS still requires a
small upper table that contains only fault entries.
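The selection between the two translation table base registers can be sketched as follows. This is a simplified host-side model: the function name select_ttbr is hypothetical, and VA tagging is deliberately ignored.

```python
def select_ttbr(va, t0sz, t1sz):
    """Return 'TTBR0', 'TTBR1', or None (fault) for a 64-bit VA.

    t0sz and t1sz play the role of TCR_EL1.T0SZ and TCR_EL1.T1SZ: the
    number of most significant bits that must be all 0s or all 1s.
    Tagged addresses are not modelled.
    """
    va &= (1 << 64) - 1
    top0 = va >> (64 - t0sz)          # the T0SZ most significant bits
    top1 = va >> (64 - t1sz)          # the T1SZ most significant bits
    if top0 == 0:
        return "TTBR0"
    if top1 == (1 << t1sz) - 1:
        return "TTBR1"
    return None                        # neither region: translation fault
```

With T0SZ and T1SZ both set to 16, this reproduces the 48-bit split shown in Figure 12-4. A larger T0SZ shrinks the lower region, so more addresses fall into the faulting gap.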
[Figure: TCR_EL1 bit layout, bits 63 to 0, showing the TBI0/TBI1, IPS, TG1, SH1, ORGN1, IRGN1,
T1SZ, TG0, SH0, ORGN0, IRGN0, and T0SZ fields.]

Figure 12-5 Translation table control configuration

TCR_EL1 controls other memory management features at EL1 and EL0. Figure 12-5 shows
only those fields that control address ranges and granule size.
[Figure: TCR_EL1 bit layout, bits 63 to 0, showing only the IPS (IPA size), TG1, T1SZ, TG0,
and T0SZ fields.]

Figure 12-6 Translation table control register

The Intermediate Physical Address Size (IPS) field controls the maximum output address size.
If translations specify output addresses outside this range, the access is faulted. For example,
000 selects 32 bits of Physical Address and 101 selects 48 bits. The two-bit Translation Granule
fields TG1 and TG0 give the granule size for kernel and user space respectively. The two fields
use different encodings: for TG0, 00=4KB, 01=64KB, 10=16KB; for TG1, 01=16KB, 10=4KB, 11=64KB.
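As an illustration, the following Python sketch decodes the address-range and granule fields from a raw TCR_EL1 value. The field positions and the TG0, TG1, and IPS encodings are taken from the ARMv8-A Architecture Reference Manual, in which TG0 and TG1 are encoded differently from each other. The helper names bits and decode_tcr_el1 are hypothetical.

```python
TG0_GRANULE = {0b00: "4KB", 0b01: "64KB", 0b10: "16KB"}
TG1_GRANULE = {0b01: "16KB", 0b10: "4KB", 0b11: "64KB"}
IPS_BITS = {0b000: 32, 0b001: 36, 0b010: 40, 0b011: 42, 0b100: 44, 0b101: 48}

def bits(value, high, low):
    """Extract value[high:low] as an unsigned integer."""
    return (value >> low) & ((1 << (high - low + 1)) - 1)

def decode_tcr_el1(tcr):
    """Decode the size, granule, and tagging fields of a TCR_EL1 value."""
    return {
        "T0SZ": bits(tcr, 5, 0),
        "TG0": TG0_GRANULE.get(bits(tcr, 15, 14), "reserved"),
        "T1SZ": bits(tcr, 21, 16),
        "TG1": TG1_GRANULE.get(bits(tcr, 31, 30), "reserved"),
        "IPS": IPS_BITS.get(bits(tcr, 34, 32), "reserved"),
        "TBI0": bits(tcr, 37, 37),
        "TBI1": bits(tcr, 38, 38),
    }
```

The cacheability and shareability fields are omitted to keep the sketch short.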
You can configure the level of translation table that is used for the first lookup. The full
translation process can require three or four levels of tables. You need not implement all levels.
The first level of lookup is, in effect, determined by the granule size and TCR_ELn.TxSZ fields.
You can configure it separately for TTBR0_EL1 and TTBR1_EL1.
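The relationship between granule size, TxSZ, and the first level of lookup can be approximated with simple arithmetic. The sketch below assumes that each level resolves (granule bits - 3) bits of address, because each table fills one granule with 8-byte descriptors. It ignores the architectural limits on TxSZ and is not a substitute for the rules in the Architecture Reference Manual. The name start_level is hypothetical.

```python
GRANULE_SHIFT = {"4KB": 12, "16KB": 14, "64KB": 16}

def start_level(granule, txsz):
    """Estimate the first lookup level for a stage 1 translation."""
    shift = GRANULE_SHIFT[granule]
    bits_per_level = shift - 3            # 8-byte descriptors per granule
    va_bits = 64 - txsz                   # size of the VA region
    levels = -(-(va_bits - shift) // bits_per_level)   # ceiling division
    return 4 - levels                     # the last level is always level 3
```

For a 64KB granule and a 42-bit region (TxSZ = 22), the estimate is level 2, which matches the example in Figure 12-7.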

12.3

Translating a Virtual Address to a Physical Address
When the processor issues a 64-bit Virtual Address for an instruction fetch, or data access, the
MMU hardware translates the Virtual Address to the corresponding Physical Address. For a
Virtual Address the top 16 bits [63:48] must be all 0s or 1s, otherwise the address triggers a fault.
The least significant bits are then used to give an offset within the selected section, so that the
MMU combines the Physical Address bits from the block table entry with the least significant
bits from the original address to produce the final address.
The architecture also supports tagged addresses. This is where the most significant eight bits of
the address are ignored (treated as not being part of the address). This means that the bits can be
used for something else, for example, recording information about a pointer.

[Figure: the virtual address from the core is split into fields. VA[63:42] selects TTBRx, which
holds the page table base address. VA[41:29] is the level 2 index into a level 2 page table with
8192 entries. The selected page table entry contains PA[47:29]. VA[28:0], the low bits of the
virtual address, form PA[28:0], the low bits of the physical address.]

Figure 12-7 Virtual to Physical Address translation for a 512MB block

Figure 12-7 shows a simple address translation involving only one level of look-up. It assumes
a 64KB granule with a 42-bit Virtual Address. The MMU translates a Virtual Address as follows:

1.  If VA[63:42] are all 1s, TTBR1 is used for the base address of the first page table. If
    VA[63:42] are all 0s, TTBR0 is used for the base address of the first page table.

2.  The page table contains 8192 64-bit page table entries, and is indexed using VA[41:29].
    The MMU reads the pertinent level 2 page table entry from the table.

3.  The MMU checks the page table entry for validity and whether or not the requested
    memory access is allowed. Assuming it is valid, the memory access is allowed.

4.  In Figure 12-7 on page 12-9, the page table entry refers to a 512MB page (it is a block
    descriptor).

5.  Bits [47:29] are taken from this page table entry and form bits [47:29] of the Physical
    Address.

6.  Because we have a 512MB page, bits [28:0] of the VA are taken to form PA[28:0]. See
    Effect of granule sizes on translation tables on page 12-15.

7.  The full PA[47:0] is returned, along with additional information from the page table entry.
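The single-level lookup described in these steps can be sketched in Python. The model is deliberately reduced: each table is a dictionary from index to a 64-bit entry, only the output address bits of the entry are interpreted, and permission checks are omitted. The function name translate_512mb_block and both table arguments are hypothetical.

```python
def translate_512mb_block(va, ttbr0_table, ttbr1_table):
    """64KB granule, 42-bit VA, level 2 block entries only."""
    top = va >> 42                                # VA[63:42]
    if top == 0:
        table = ttbr0_table
    elif top == (1 << 22) - 1:
        table = ttbr1_table
    else:
        raise ValueError("VA[63:42] must be all 0s or all 1s")
    index = (va >> 29) & 0x1FFF                   # VA[41:29], 8192 entries
    entry = table.get(index)
    if entry is None:
        raise LookupError("translation fault")
    pa_high = entry & (((1 << 19) - 1) << 29)     # PA[47:29] from the entry
    return pa_high | (va & ((1 << 29) - 1))       # PA[28:0] = VA[28:0]
```

Because the low 29 bits pass through unchanged, every address inside one 512MB block shares a single table entry.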

In practice, such a simple translation process severely limits how finely you can divide up your
address space. Instead of using only this first-level translation table, a first-level table entry can
also point to a second-level page table.
In this way, an OS can further divide a large section of virtual memory into smaller pages. For
a second-level table, the first-level descriptor contains the physical base address of the
second-level page table. The Physical Address that corresponds to the Virtual Address requested
by the processor, is found in the second-level descriptor.
Figure 12-8 shows an example of translation for a 64KB granule starting at stage 1, level 2 for
a normal 64KB page.

[Figure: the virtual address from the core is split into fields. The upper bits select TTBRx,
which holds the base address of the L2 page table. VA[41:29] is the level 2 index, and the
selected L2 entry holds the base address of the L3 page table. VA[28:16] is the level 3 index,
and the selected L3 page table entry contains PA[47:16]. VA[15:0], the low bits of the virtual
address, form PA[15:0], the low bits of the physical address.]

Figure 12-8 Virtual to Physical Address translation for a 64KB page

Each second-level table is associated with one or more first-level entries. You can have multiple
first-level descriptors that point to the same second-level table, which means you can alias
several virtual locations to the same Physical Address.
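A two-level lookup of this kind, including the aliasing effect, can be modelled with a short Python sketch. It assumes a 64KB granule and a 42-bit Virtual Address, treats bits [1:0] = 11 as marking a table descriptor at level 2 and a page descriptor at level 3, and omits TTBR selection and permission checks. The function name translate_64kb_page and the memory dictionary, which maps a physical base address to the table stored there, are hypothetical.

```python
def translate_64kb_page(va, l2_table, memory):
    """64KB granule, 42-bit VA: level 2 table entry, then level 3 page."""
    l2_index = (va >> 29) & 0x1FFF            # VA[41:29]
    l3_index = (va >> 16) & 0x1FFF            # VA[28:16]
    addr_mask = ((1 << 32) - 1) << 16         # bits [47:16] of a descriptor

    l2_entry = l2_table.get(l2_index, 0)
    if l2_entry & 0b11 != 0b11:
        raise LookupError("level 2 entry is not a table descriptor")
    l3_table = memory[l2_entry & addr_mask]   # base address of the L3 table

    l3_entry = l3_table.get(l3_index, 0)
    if l3_entry & 0b11 != 0b11:
        raise LookupError("level 3 entry is not a page descriptor")
    return (l3_entry & addr_mask) | (va & 0xFFFF)   # PA[47:16] | VA[15:0]
```

If two level 2 entries hold the same table address, the two Virtual Address ranges resolve through the same level 3 table and therefore alias the same Physical Addresses.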

Figure 12-8 on page 12-10 describes a situation where there are two levels of look-up. Again,
this assumes a 64KB granule and 42-bit Virtual Address space.

1.  If VA[63:42] are all 1s, TTBR1 is used for the base address of the first page table. If
    VA[63:42] are all 0s, TTBR0 is used for the base address of the first page table.

2.  The page table contains 8192 64-bit page table entries, and is indexed via VA[41:29]. The
    MMU reads the pertinent level 2 page table entry from the table.

3.  The MMU checks the level 2 page table entry for validity and whether or not the requested
    memory access is allowed. Assuming it is valid, the memory access is allowed.

4.  In Figure 12-8 on page 12-10, the level 2 page table entry refers to the address of the level
    3 page table (it is a table descriptor).

5.  Bits [47:16] are taken from the level 2 page table entry and form the base address of the
    level 3 page table.

6.  Bits [28:16] of the VA are used to index the level 3 page table entry. The MMU reads the
    pertinent level 3 page table entry from the table.

7.  The MMU checks the level 3 page table entry for validity and whether or not the requested
    memory access is allowed. Assuming it is valid, the memory access is allowed.

8.  In Figure 12-8 on page 12-10, the level 3 page table entry refers to a 64KB page (it is a
    page descriptor).

9.  Bits [47:16] are taken from the level 3 page table entry and used to form PA[47:16].

10. Because we have a 64KB page, VA[15:0] is taken to form PA[15:0].

11. The full PA[47:0] is returned, along with additional information from the page table
    entries.

12.3.1

Secure and Non-secure addresses
In theory the Secure and Non-secure Physical Address spaces are independent of each other, and
exist in parallel. A system could be designed to have two entirely separate memory systems.
However, most real systems treat Secure and Non-secure as an attribute for access control. The
Normal (Non-secure) world can only access the Non-secure Physical Address space. The
Secure world can access both Physical Address spaces. Again this is controlled through
translation tables.

[Figure: Secure EL1/EL0 translation tables map onto the Secure physical address space (Secure
peripherals, RAM holding Secure data, FLASH holding Secure code) and can also reach Non-secure
data. Non-secure EL1/EL0 translation tables map onto the Non-secure physical address space
(Non-secure peripherals, RAM holding Non-secure data, FLASH holding Non-secure code).]

Figure 12-9 Physical Address spaces

This also has cache coherency implications. For example, because Secure 0x8000 and
Non-secure 0x8000 are, technically speaking, different Physical Addresses, they could both be
in the cache at the same time.
In a system where Secure and Non-secure memory are in different locations, there would be no
problem. It is more likely that they would be in the same location. Ideally a memory system
would block Secure accesses to Non-secure memory and Non-secure accesses to Secure
memory. In practice most only block Non-secure access to Secure memory. Again, this means
you could end up with the same physical memory in the cache twice, Secure and Non-secure.
This is always a programming error. To avoid this the Secure world must always use Non-secure
accesses to Non-secure memory.
12.3.2

Configuring and enabling the MMU
Writes to the system registers controlling the MMU are context-changing events and there are
no ordering requirements between them. The results of these events are not guaranteed to be
seen until a context synchronization event (See Barriers on page 13-6).
MSR TTBR0_EL1, X0    // Set TTBR0
MSR TTBR1_EL1, X1    // Set TTBR1
MSR TCR_EL1, X2      // Set TCR
ISB                  // The ISB forces these changes to be seen before
                     // the MMU is enabled.
MRS X0, SCTLR_EL1    // Read System Control Register configuration data
ORR X0, X0, #1       // Set [M] bit and enable the MMU.
MSR SCTLR_EL1, X0    // Write System Control Register configuration data
ISB                  // The ISB forces these changes to be seen by the
                     // next instruction

This is separate from the requirement for flat mapping, which ensures that we know which
instruction executes directly after the write to SCTLR_EL1.M. If the result of the write is
visible, the next instruction is the one at VA+4, fetched using the new translation regime. If
the result is not yet visible, the next instruction is still the one at VA+4, but with VA = PA.
The ISB does not help here, because we cannot guarantee that it is the next instruction executed
unless the code is flat mapped.
12.3.3

Operation when the Memory Management Unit is disabled
When the stage 1 MMU is disabled, for Non-secure EL0 and EL1 accesses when the
HCR_EL2.DC bit is set to enable the data cache, the default memory type is Normal
Non-shareable, Inner Write-Back Read-Write Allocate, Outer Write-Back Read-Write Allocate.

12.4

Translation tables in ARMv8-A
The ARMv8-A architecture provides support for three different translation table formats:
•  ARMv8-A AArch64 Long Descriptor format.
•  ARMv7-A Long Descriptor format such as the Large Physical Address Extension (LPAE)
   to the ARMv7-A architecture, found in, for example, the ARM Cortex-A15 processor.
•  ARMv7-A Short Descriptor format.

In AArch32 state, you can use the existing ARMv7-A long and short descriptor formats to run
existing guest operating systems and existing application code without modification. The
ARMv7-A short descriptors can only be used at EL0 and EL1 stage 1 translations. They cannot
therefore be used by hypervisors or Secure monitor code.
Always use the ARMv8-A long descriptor format in AArch64 execution state. This is very
similar to the ARMv7-A long descriptor format with large Physical Address extensions. It uses
the same 64-bit long-descriptor format, but with some changes. It introduces a new level 0 table
index, which uses the same descriptor format as the level 1 table. There is added support for up
to 48-bit input and output addresses. The input Virtual Address now comes from a 64-bit
register. However, as the architecture does not support full 64-bit addressing, bits 63:48 of the
address must all be the same, that is, all 0s or all 1s, or the top eight bits can be used for VA
tagging.
AArch64 supports three different translation granules. These define the block size at the lowest
level of translation table and control the size of translation tables in use. Larger granule sizes
reduce the number of levels of page table required and this can become an important
consideration in systems using a hypervisor to provide virtualization.
The supported granule sizes are 4KB, 16KB, and 64KB, and it is IMPLEMENTATION DEFINED
which of the three are supported. Code that creates page tables is able to read the system register
ID_AA64MMFR0_EL1, to find out which are the supported sizes. The Cortex-A53 processor
supports all three sizes, but this is not the case for early versions of some processors, such as the
Cortex-A57, which did not support the 16KB granule size. The size is configurable for each
translation table within the Translation Control Register (TCR_EL1).
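As a hedged illustration of the check that page table code might perform, the following Python sketch decodes the granule support fields from a raw ID_AA64MMFR0_EL1 value. It assumes the ARMv8.0 field positions and encodings from the Architecture Reference Manual (TGran4 in bits [31:28], TGran64 in bits [27:24], TGran16 in bits [23:20]). Later architecture revisions define additional values that this sketch does not handle. The function name supported_granules is hypothetical.

```python
def supported_granules(id_aa64mmfr0_el1):
    """List the translation granules reported by ID_AA64MMFR0_EL1."""
    def field(low):
        return (id_aa64mmfr0_el1 >> low) & 0xF

    sizes = []
    if field(28) == 0x0:   # TGran4, bits [31:28]: 0b0000 = supported
        sizes.append("4KB")
    if field(20) == 0x1:   # TGran16, bits [23:20]: 0b0001 = supported
        sizes.append("16KB")
    if field(24) == 0x0:   # TGran64, bits [27:24]: 0b0000 = supported
        sizes.append("64KB")
    return sizes
```

Note that the sense of the TGran16 field is inverted relative to the other two: a value of zero means that the 16KB granule is not supported.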
12.4.1

AArch64 descriptor format
You can use the descriptor format in all levels of table, from level 0 to level 3. Level 0
descriptors can only output the address of a level 1 table. Level 3 descriptors cannot point to
another table and can only output block addresses. The format of the table is therefore slightly
different for level 3.
Figure 12-10 on page 12-15 shows that the table descriptor type is identified by bits 1:0 of the
entry and can refer to one of the following:
•  The address of a next level table, in which case memory can be further subdivided into
   smaller blocks.
•  The address of a variable sized block of memory.
•  Table entries, which can be marked Fault, or Invalid.

Descriptor type                         Bits [1:0]  Contents of bits [63:2]
Table descriptor (levels 0, 1, and 2)   11          Attributes, next level table address
Block entry (levels 1 and 2)            01          Upper attributes, output block address, lower attributes
Table entry (levels 1 and 2)            11          Upper attributes, output block address, lower attributes
Invalid entry (all levels)              X0          Ignored

Figure 12-10 A64 Table descriptor type

Note
For purposes of clarity, this diagram does not specify the width of bit fields.
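The classification by bits [1:0] can be expressed as a small Python function. This sketch follows the ARMv8-A Architecture Reference Manual in treating the value 11 at level 3 as a page descriptor, and in treating the value 01 as invalid at level 0 and level 3. The function name descriptor_type is hypothetical.

```python
def descriptor_type(entry, level):
    """Classify a 64-bit translation table entry by bits [1:0] and level."""
    low = entry & 0b11
    if low & 0b01 == 0:
        return "invalid"                  # bit 0 clear: X0
    if low == 0b11:
        return "page" if level == 3 else "table"
    # low == 0b01 from here on
    if level in (1, 2):
        return "block"
    return "invalid"                      # 01 is not valid at level 0 or 3
```

The level argument is needed because the same two-bit value has a different meaning at the last level of lookup.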

12.4.2

Effect of granule sizes on translation tables
The three different granule sizes can affect the number and size of translation tables required.
Note
In all cases, you can omit the first level of table if the VA input range is restricted to 42 bits.
Depending on the size of the possible VA range, there can be even fewer levels. With a 4KB
granule, for example, if TCR_EL1 is set so that low addresses span only 1GB, then levels 0
and 1 are not required and the translation starts at level 2, going down to level 3 for 4KB pages.
4KB

When you use a 4KB granule size, the hardware can use a 4-level look up process.
The 48-bit address has nine address bits per level translated, that is 512 entries
each, with the final 12 bits selecting a byte within the 4KB coming directly from
the original address.
Bits 47:39 of the Virtual Address index into the 512 entry L0 table. Each of these
table entries spans a 512GB range and points to an L1 table. Within that 512 entry
L1 table, bits 38:30 are used as index to select an entry and each entry points to
either a 1GB block or an L2 table. Bits 29:21 index into a 512 entry L2 table and
each entry points to a 2MB block or next table level. At the last level, bits 20:12
index into a 512 entry L3 table and each entry points to a 4KB block.

VA bits   Use                  Each entry contains
[47:39]   Level 0 table index  Pointer to L1 table (no block entry)
[38:30]   Level 1 table index  Pointer to L2 table, or base address of 1GB block (IPA)
[29:21]   Level 2 table index  Pointer to L3 table, or base address of 2MB block (IPA)
[20:12]   Level 3 table index  Base address of 4KB block (IPA)
[11:0]    Block offset         Forms PA[11:0]

Figure 12-11 4KB Granule

16KB

When you use a 16KB granule size, the hardware can use a 4-level look up
process. The 48-bit address has 11 address bits per level translated, that is 2048
entries each, with the final 14 bits selecting a byte within the 16KB coming directly
from the original address. The level 0 table contains only two entries. Bit 47 of
the Virtual Address selects a descriptor from the two entry L0 table. Each of these
table entries spans a 128TB range and points to an L1 table. Within that 2048
entry L1 table, bits 46:36 are used as an index to select an entry and each entry
points to an L2 table. Bits 35:25 index into a 2048 entry L2 table and each entry
points to a 32MB block or next table level. At the final translation stage, bits
24:14 index into a 2048 entry L3 table and each entry points to a 16KB block.
VA bits   Use                  Each entry contains
[47]      Level 0 table index  Pointer to L1 table (no block entry)
[46:36]   Level 1 table index  Pointer to L2 table
[35:25]   Level 2 table index  Pointer to L3 table, or base address of 32MB block (IPA)
[24:14]   Level 3 table index  Base address of 16KB block (IPA)
[13:0]    Block offset         Forms PA[13:0]

Figure 12-12 16KB Granule

64KB

When you use a 64KB granule size, the hardware can use a 3-level look up
process. The level 1 table contains only 64 entries.
Bits 47:42 of the Virtual Address select a descriptor from the 64 entry L1 table.
Each of these table entries spans a 4TB range and points to an L2 table. Within
that 8192 entry L2 table, bits 41:29 are used as index to select an entry and each
entry points to either a 512MB block or an L3 table. At the final translation stage,
bits 28:16 index into an 8192 entry L3 table and each entry points to a 64KB
block.
VA bits   Use                  Each entry contains
[47:42]   Level 1 table index  Pointer to L2 table (no block entry)
[41:29]   Level 2 table index  Pointer to L3 table, or base address of 512MB block (IPA)
[28:16]   Level 3 table index  Base address of 64KB block (IPA)
[15:0]    Block offset         Forms PA[15:0]

Figure 12-13 64KB Granule
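The index fields for all three granules can be extracted with one table-driven Python function. The bit ranges are those given in the 4KB, 16KB, and 64KB descriptions for a 48-bit address. The names LAYOUT and split_va are hypothetical.

```python
# Per granule: width of the page offset, and the VA bit range (high, low)
# that indexes each table level for a 48-bit address.
LAYOUT = {
    "4KB":  (12, {0: (47, 39), 1: (38, 30), 2: (29, 21), 3: (20, 12)}),
    "16KB": (14, {0: (47, 47), 1: (46, 36), 2: (35, 25), 3: (24, 14)}),
    "64KB": (16, {1: (47, 42), 2: (41, 29), 3: (28, 16)}),
}

def split_va(va, granule):
    """Return ({level: table index}, page offset) for a Virtual Address."""
    offset_bits, levels = LAYOUT[granule]
    indices = {
        level: (va >> low) & ((1 << (high - low + 1)) - 1)
        for level, (high, low) in levels.items()
    }
    return indices, va & ((1 << offset_bits) - 1)
```

The width of each range also gives the table size: nine bits select one of 512 entries, 11 bits one of 2048, and 13 bits one of 8192.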

12.4.3

Cache configuration
The MMU uses translation tables and translation registers to control which memory locations
are cacheable. The MMU controls the cache policy, memory attributes, and access permissions,
and provides Virtual to Physical Address translation.

[Figure: within a coherency group, each core has its own MMU, L1 I-cache, and L1 D-cache, and
the cores share an L2 cache and bus interface unit. AMBA interconnects link these to an L3 cache
(SRAM or DRAM), SRAM, an external L4 cache (memory card or disk drive), external DRAM, and APB
peripherals.]

Figure 12-14 Memory busses and caches

Software configuration is performed by system registers (some of which are listed in Chapter 4
ARMv8 Registers.)
In some designs, the external memory system might contain further implementation-specific
caches of external memories.
12.4.4

Cache policies
The MMU translation tables also define the cache policy for each block within the memory
system. Memory regions that are defined as Normal might be marked as cacheable or
non-cacheable. Bits [4:2] from the translation table entry refer to one of the eight memory
attribute encodings in the Memory Attribute Indirection Register (MAIR). The memory attribute
encodings then specify the cache policies to use when accessing that memory. These are hints
to the processor and it is IMPLEMENTATION DEFINED whether all cache policies are supported in
a particular implementation and which cache data is regarded as coherent. A memory region can
be defined in terms of its shareability property.
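The indirection through the MAIR can be illustrated with a short Python sketch. It assumes that the register holds eight 8-bit attribute fields, with field n in bits [8n+7:8n], as defined in the Architecture Reference Manual. The function name memory_attribute and the example register value are hypothetical.

```python
def memory_attribute(descriptor, mair):
    """Return the 8-bit MAIR attribute selected by descriptor bits [4:2]."""
    attr_index = (descriptor >> 2) & 0b111        # index 0 to 7
    return (mair >> (8 * attr_index)) & 0xFF

# Hypothetical MAIR value: field 0 = 0xFF, field 1 = 0x04, field 2 = 0x44.
EXAMPLE_MAIR = 0x00000000004404FF
```

With this example value, a descriptor whose bits [4:2] are 001 selects the attribute byte 0x04. Changing the register contents changes the attributes of every mapping that uses that index, without rewriting any translation table entry.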

12.5

Translation table configuration
In addition to storing individual translations within the TLB, you can configure the MMU to
store translation tables in cacheable memory. This usually provides much faster access to tables
than always reading from external memory. TCR_EL1 has additional fields that control this.
The additional fields specify the cacheability and shareability of translation tables for TTBR0
and TTBR1. The relevant fields are called SH0/1 Shareability, IRGN0/1 Inner Cacheable, and
ORGN0/1 Outer Cacheable. Table 12-2 shows the permitted settings for cacheability.
Table 12-2 Cacheability settings

IRGN/ORGN bits for
TTBR0/TTBR1          Cacheable Property
00                   Normal memory, Inner Non-cacheable
01                   Normal memory, Inner Write-Back Write-Allocate Cacheable
10                   Normal memory, Inner Write-Through Cacheable
11                   Normal memory, Inner Write-Back no Write-Allocate Cacheable

Table 12-3 shows the corresponding settings for the shareability of memory associated with
translation table walks. For a device or strongly-ordered memory region, the value is ignored.

Table 12-3 Memory shareability

SH0 bits[13:12]   Shareability
00                Non-shareable
01                UNPREDICTABLE
10                Outer shareable
11                Inner shareable

The attributes specified in the TCR_EL1 must be the same as those specified for the virtual
memory region in which the translation tables are stored. Caching the translation tables is the
normal default behavior.
12.5.1

Virtual Address tagging
The Translation Control Register, TCR_ELn, has an additional field called Top Byte Ignore
(TBI) that provides tagged addressing support. General-purpose registers are 64 bits wide, but
the most significant 16 bits of an address must be either 0xFFFF or 0x0000. Any attempt to use a
different bit value triggers a fault.
When tagged addressing support is enabled, the top eight bits, that is [63:56] of the Virtual
Address, are ignored by the processor. It internally uses bit [55] to sign-extend the address to
64-bit format. The top eight bits of a Virtual Address can then be used to pass data. These bits
are ignored for addressing and translation faults. The TCR_EL1 has separate enable bits for EL0
and EL1. ARM does not specify or mandate a specific use case for tagged addressing.
An example use case might be in support of object-oriented programming languages. As well
as having a pointer to an object, it might be necessary to keep a reference count that keeps track
of the number of references or pointers or handles that refer to the object, for example, so that
automatic garbage collection code can de-allocate objects that are no longer referenced. This
reference count can be stored as part of the tagged address, rather than in a separate table,
speeding up the process of creating or destroying objects.
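The following Python sketch illustrates how software might pack a small value, such as a reference count, into the top byte of a pointer, and how the address used for translation is recovered when the top byte is ignored. All three function names are hypothetical, and the sketch models only the arithmetic, not a real allocator.

```python
TAG_SHIFT = 56
ADDRESS_MASK = (1 << TAG_SHIFT) - 1

def set_tag(pointer, tag):
    """Place an 8-bit tag in bits [63:56] of a 64-bit pointer value."""
    if not 0 <= tag <= 0xFF:
        raise ValueError("tag must fit in eight bits")
    return (tag << TAG_SHIFT) | (pointer & ADDRESS_MASK)

def get_tag(pointer):
    """Read back the 8-bit tag from bits [63:56]."""
    return (pointer >> TAG_SHIFT) & 0xFF

def effective_address(pointer):
    """Address used for translation when the top byte is ignored:
    bits [63:56] are replaced by copies of bit [55]."""
    address = pointer & ADDRESS_MASK
    if address & (1 << 55):
        address |= 0xFF << TAG_SHIFT
    return address
```

One consequence of this design is that an eight-bit tag limits a packed reference count to 255, so software that uses tags in this way needs a fallback for larger counts.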

12.6 Translations at EL2 and EL3
The virtualization extensions to the ARMv8-A architecture introduce a second stage of
translation. When a hypervisor is present in the system, one or more guest operating systems
might be present. These continue to use TTBRn_EL1 as previously described and MMU
operation appears unchanged.
The hypervisor must perform some extra translation steps in a two stage process to share the
physical memory system between the different guest operating systems. In the first stage, a
Virtual Address (VA) is translated to an Intermediate Physical Address (IPA). This is usually
under OS control. A second stage, controlled by the hypervisor, then performs translation of the
IPA to the final Physical Address (PA).
The hypervisor and Secure monitor also have their own sets of stage 1 translation tables for
their own code and data, which map directly from VA to PA.
Note
The Architecture Reference Manual uses the term Translation Regimes to refer to these different
tables.
Figure 12-15 summarizes this two stage translation process.

[Figure: the guest OS translation tables (TTBRn_EL1) map the virtual memory map, under control
of the guest OS, to the physical memory map seen by the guest (IPA). The hypervisor stage 2
translation tables (VTTBR_EL2) map the IPA to the real physical memory map. The hypervisor
(TTBR0_EL2) and Secure monitor (TTBR0_EL3) translation tables map the virtual memory space
seen by the hypervisor and Secure monitor directly to the real physical memory map.]

Figure 12-15 Two stage translation process

The stage 2 translations, which convert an Intermediate Physical Address to a Physical Address,
use an extra set of tables under control of the hypervisor. These must be explicitly enabled by
writing to the Hypervisor Configuration Register, HCR_EL2. This process only applies to
Non-secure EL1/0 accesses.
The base address of this stage 2 translation table is specified in the Virtualization Translation
Table Base Register, VTTBR_EL2. It specifies a single contiguous address space at the bottom
of memory. The size of the supported address space is specified in the T0SZ[5:0] field of the
Virtualization Translation Control Register, VTCR_EL2.
The TG0 field of this register specifies the granule size, while the SL0 field controls the first
level of table lookup. Any access outside the defined address range causes a translation fault.
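
The two stage process can be illustrated with a small model. This is a sketch under simplifying
assumptions: each stage is represented by a flat Python dictionary from page number to page number
rather than by multi-level tables, a 4KB granule is assumed, and the names lookup, translate, and
TranslationFault are hypothetical.

```python
PAGE_SHIFT = 12                       # assumes a 4KB granule
PAGE_MASK = (1 << PAGE_SHIFT) - 1


class TranslationFault(Exception):
    """Raised when an address has no mapping at the given stage."""


def lookup(table, addr, stage):
    """Translate addr through a flat page-number map, faulting if unmapped."""
    page = addr >> PAGE_SHIFT
    if page not in table:
        raise TranslationFault(f"stage {stage} fault at {addr:#x}")
    return (table[page] << PAGE_SHIFT) | (addr & PAGE_MASK)


def translate(va, stage1, stage2=None):
    """VA -> IPA with the guest OS tables, then IPA -> PA with the hypervisor tables.

    When stage 2 is not enabled (stage2 is None), the IPA is used as the PA.
    """
    ipa = lookup(stage1, va, 1)
    if stage2 is None:
        return ipa
    return lookup(stage2, ipa, 2)
```

With a guest mapping of page 0x400 to page 0x80 and a hypervisor mapping of page 0x80 to page
0x12345, the Virtual Address 0x400ABC becomes IPA 0x80ABC and then PA 0x12345ABC. The guest
tables never see the final Physical Address.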

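[Figure: the stage 2 tables pointed to by VTTBR_EL2 cover the IPA range from
0x00000000_00000000 up to a maximum of 0x0000FFFF_FFFFFFFF. Accesses above this range
FAULT.]

Figure 12-16 Maximum IPA space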

The hypervisor (EL2) and Secure monitor (EL3) have their own stage 1 tables, which map directly
from Virtual to Physical Address space. The table base address is specified in TTBR0_EL2 and
TTBR0_EL3 respectively, enabling a single contiguous address space of variable size at the
bottom of memory. The TG0 field of TCR_EL2 or TCR_EL3 specifies the granule size, and the
T0SZ field sets the size of the address space, which in turn determines the first level of table
lookup. Any access outside the defined address range causes a translation fault.
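
The relationship between the size field and the address range can be sketched as follows. This
assumes the architectural convention that a 6-bit size field (T0SZ) selects a region of
2^(64 - T0SZ) bytes starting at address zero. The function names are hypothetical.

```python
def region_size(t0sz):
    """Size in bytes of the address space selected by a 6-bit size field."""
    if not 0 <= t0sz <= 63:
        raise ValueError("the size field is six bits wide")
    return 1 << (64 - t0sz)


def in_range(addr, t0sz):
    """True if addr lies inside the region. An access outside it would fault."""
    return 0 <= addr < region_size(t0sz)
```

A value of 16 gives the maximum 48-bit space, whose last valid address is 0x0000FFFF_FFFFFFFF.
A value of 25 gives a 512GB space.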

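[Figure: the hypervisor or Secure monitor tables pointed to by TTBR0_EL2 or TTBR0_EL3 cover the
Virtual Address range from 0x00000000_00000000 up to a maximum of 0x0000FFFF_FFFFFFFF.
Accesses above this range FAULT.]

Figure 12-17 Maximum Virtual Address space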

The Secure monitor EL3 has its own dedicated translation tables. The table base address is
specified in TTBR0_EL3 and configured via TCR_EL3. Translation tables are capable of
accessing both Secure and Non-secure Physical Addresses. TTBR0_EL3 is used only in Secure
monitor EL3 mode, not by the trusted kernel itself. When the transition to Secure world has
completed, the trusted kernel uses the EL1 translations, that is, the translation tables pointed to
by TTBR0_EL1 and TTBR1_EL1. As these registers are not banked in AArch64, Secure
monitor code must configure new tables for the Secure world and save and restore copies of
TTBR0_EL1 and TTBR1_EL1.
The EL1 translation regime behaves differently in Secure state, compared to its normal
operation in Non-secure state. The second stage of translation is disabled and the EL1
translation regime can point to either Secure or Non-secure Physical Addresses. There is no
virtualization in the Secure world, so the IPA is always the same as the final PA.
Entries in the TLB are tagged as Secure or Non-secure, so no TLB maintenance is required when
you transition between the Secure and Normal worlds.

12.7 Access permissions
Access permissions are controlled through translation table entries. Access permissions control
whether a region is readable or writeable, or both, and can be set separately for unprivileged
(EL0) accesses and for privileged (EL1, EL2, and EL3) accesses, as shown in Table 12-4.
Table 12-4 Access permissions
AP    Unprivileged (EL0)    Privileged (EL1/2/3)
00    No access             Read and write
01    Read and write        Read and write
10    No access             Read-only
11    Read-only             Read-only

The operating system kernel runs at Exception level EL1. It defines the translation table
mappings, which are used by the kernel itself and by the applications that run at EL0. The
distinction between unprivileged and privileged access permissions is required because the
kernel specifies different permissions for its own code and for applications. The hypervisor,
which runs at Exception level EL2, and the Secure monitor at EL3 only have translation schemes
for their own use, so there is no need for a privileged and unprivileged split in permissions.
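
The permission check for the EL1/EL0 split can be written as a small lookup. This is an
illustrative sketch that assumes the architectural AP[2:1] encodings 00, 01, 10, and 11 as laid
out in the dictionary below. The names AP_TABLE and access_allowed are hypothetical.

```python
# AP[2:1] value -> (unprivileged permission, privileged permission)
AP_TABLE = {
    0b00: ("none", "rw"),
    0b01: ("rw", "rw"),
    0b10: ("none", "ro"),
    0b11: ("ro", "ro"),
}


def access_allowed(ap, privileged, is_write):
    """Return True if a read or write is permitted for the given AP value."""
    unpriv, priv = AP_TABLE[ap]
    perm = priv if privileged else unpriv
    if perm == "none":
        return False
    return perm == "rw" or not is_write
```

For example, with AP = 00 a kernel write succeeds while any application access is refused, and
with AP = 11 both privilege levels can read but neither can write.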

Another kind of access permission is the executable attribute. Blocks can be marked as
executable or non-executable (Execute Never (XN)). You can set the attributes Unprivileged
Execute Never (UXN) and Privileged Execute Never (PXN) separately and use this to prevent,
for example, application code running with kernel privilege, or attempts to execute kernel code
while in an unprivileged state. Setting these attributes also prevents the processor from
performing speculative instruction fetches to the memory location, so that such fetches cannot
accidentally access locations that might be perturbed by the access, for example, a First In,
First Out (FIFO) queue in a peripheral. Therefore, device regions must always be marked as
Execute Never.

[Figure: memory map showing Peripherals, OS, and Application data regions, each labelled
Executable or Not executable.]

Figure 12-18 Device regions

You can configure the processor to treat writeable regions as Execute Never, using the following
bits within the SCTLR registers:
• SCTLR_EL1.WXN. Regions writable at EL0 are treated as XN at EL0 and EL1. Regions
  writable at EL1 are treated as XN at EL1.
• SCTLR_EL2.WXN and SCTLR_EL3.WXN. Regions writable at ELn are treated as XN at ELn.
• SCTLR.UWXN. Regions writable at EL0 are treated as XN at EL1. This is for AArch32
  only.

The SCTLR_ELn bits can be cached in a TLB entry. Therefore, changing the bit in the SCTLR
might not affect entries already in the TLBs. When modifying these bits, a TLB invalidate and
ISB sequence is necessary. See Barriers on page 13-6 for information about the ISB barrier.
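
The combined effect of the UXN, PXN, and SCTLR_EL1.WXN controls on the EL1/EL0 regime can be
modelled as shown below. This is a sketch of the rules as stated in this section only. The
function name and parameters are hypothetical, and the writable flags are assumed to have been
derived from the access permissions already.

```python
def executable(el, uxn, pxn, writable_el0, writable_el1, wxn):
    """Return True if an instruction fetch at Exception level el (0 or 1) is allowed.

    uxn and pxn are the descriptor bits, wxn models SCTLR_EL1.WXN.
    """
    if el == 0:
        if uxn:
            return False
        # With WXN set, a region writable at EL0 is treated as XN at EL0.
        return not (wxn and writable_el0)
    if el == 1:
        if pxn:
            return False
        # With WXN set, a region writable at EL0 or at EL1 is treated as XN at EL1.
        return not (wxn and (writable_el0 or writable_el1))
    raise ValueError("this sketch only models EL0 and EL1")
```

A kernel data page that is writable at EL1 is executable at EL1 only while WXN is clear, which is
why changing the bit must be followed by TLB invalidation before the new behavior can be relied on.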

12.8 Operating system use of translation table descriptors
Another memory attribute bit in the descriptor, the Access Flag (AF), indicates when a block
entry is used for the first time.
• AF = 0: This block entry has not yet been used.
• AF = 1: This block entry has been used.

Operating systems use the Access Flag to keep track of which pages are being used. Software
manages the flag. When the page is first created, its entry has AF set to 0. The first time the
page is accessed by code, if AF is 0, this triggers an MMU fault. The page fault handler records
that this page is now being used and manually sets the AF bit in the table entry. For example,
the Linux kernel uses the AF bit for PTE_AF on ARM64 (the Linux kernel name for AArch64), which
is used to check whether a page has ever been accessed. This influences some of the kernel
memory management choices. For example, when a page must be swapped out of memory, the kernel
is less likely to swap out pages that are being actively used.
Bits [58:55] of the descriptor are marked as Reserved for Software Use and can be used to record
OS-specific information in the translation tables. For example, the Linux kernel uses one of
these bits to mark an entry as clean or dirty. The dirty status records whether the page has been
written to. If the page is later swapped out of memory, a clean page can simply be discarded, but
a dirty page must have its contents saved first.
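
The access flag and dirty-bit bookkeeping described above can be sketched as follows. The model
assumes the Access Flag sits at bit [10] of the descriptor and picks bit [55] as the software
dirty bit. That choice, and all function names, are illustrative rather than a description of any
particular kernel.

```python
AF_BIT = 1 << 10         # Access Flag in the lower attributes
SW_DIRTY = 1 << 55       # one of the software-use bits [58:55], chosen for illustration


class AccessFlagFault(Exception):
    """Models the MMU fault taken on first use of an entry with AF = 0."""


def mmu_access(desc):
    """Model the hardware check: an entry with AF = 0 faults."""
    if not desc & AF_BIT:
        raise AccessFlagFault()


def touch(table, index, is_write):
    """Access an entry and perform the OS bookkeeping for it."""
    try:
        mmu_access(table[index])
    except AccessFlagFault:
        table[index] |= AF_BIT        # fault handler records first use
        mmu_access(table[index])      # the retried access now succeeds
    if is_write:
        # A real kernel would detect the write through a permission fault.
        # Here the write is passed in directly to keep the sketch short.
        table[index] |= SW_DIRTY


def must_write_back(desc):
    """A dirty page must be saved before swap-out. A clean page can be discarded."""
    return bool(desc & SW_DIRTY)
```

After a read, the entry has AF set but is still clean. Only after a write does must_write_back
report that the contents need saving.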
[Figure: layout of a block entry (upper attributes, output block address, lower attributes) and
of a table entry (table attributes, next level table address). In both, bits [58:55] are
reserved for software use.]

Figure 12-19 Translation table descriptors

See Chapter 13 Memory Ordering for information about other memory attributes that specify
the memory type and its cacheability and shareability properties.

12.9 Security and the MMU
The ARMv8-A architecture defines two security states, Secure and Non-secure. It also defines
two Physical Address spaces: Secure and Non-secure, such that the Normal world can only
access the Non-secure Physical Address space. The Secure world can access both the Secure and
Non-secure Physical Address spaces.
In Non-secure state, the NS bits and NSTable bits in translation tables are ignored. Only
Non-secure memory can be accessed. In Secure state, the NS bits and NSTable bits control
whether a Virtual Address translates to a Secure or Non-secure Physical Address. You can use
the SCR_EL3.SIF bit to control whether Secure instruction fetches can be made to Non-secure
physical memory, that is, to prevent the Secure world from executing from any Virtual Address
that translates to a Non-secure Physical Address.
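
The selection of the output Physical Address space can be summarized in a short model. This
sketch considers only the NS bit of the final descriptor and leaves out NSTable. The function
names and the block_ns_fetch parameter, which stands for the SCR_EL3 control described above,
are hypothetical.

```python
def output_address_space(secure_state, ns_bit):
    """Return the Physical Address space that a translation targets."""
    if not secure_state:
        return "non-secure"           # the NS bit is ignored in Non-secure state
    return "non-secure" if ns_bit else "secure"


def fetch_permitted(secure_state, ns_bit, block_ns_fetch):
    """Return True if an instruction fetch through this mapping is allowed."""
    space = output_address_space(secure_state, ns_bit)
    if secure_state and space == "non-secure" and block_ns_fetch:
        return False                  # Secure fetch from Non-secure memory is blocked
    return True
```

The control has no effect on Non-secure state, and no effect on Secure fetches that target the
Secure Physical Address space.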

12.10 Context switching
Processors that implement the ARMv8-A Architecture are typically used in systems running a
complex operating system with many applications or tasks that run concurrently. Each process
has its own unique translation tables residing in physical memory. When an application starts,
the operating system allocates it a set of translation table entries that map both the code and data
used by the application to physical memory. These tables can subsequently be modified by the
kernel, for example, to map in extra space, and are removed when the application is no longer
running.
There might therefore be multiple tasks present in the memory system. The kernel scheduler
periodically transfers execution from one task to another. This is called a context switch and
requires the kernel to save all execution state associated with the process and to restore the state
of the process to be run next. The kernel also switches translation table entries to those of the
next process to be run. The memory of the tasks that are not currently running is completely
protected from the task that is running.
Exactly what has to be saved and restored varies between different operating systems, but
typically a process context switch includes saving or restoring some or all of the following
elements:
• General-purpose registers X0-X30.
• Advanced SIMD and Floating-point registers V0-V31.
• Some status registers.
• TTBR0_EL1 and TTBR0.
• Thread Process ID (TPIDxxx) Registers.
• Address Space ID (ASID).

For EL0 and EL1, there are two translation tables. TTBR0_EL1 provides translations for the
bottom of Virtual Address space, which is typically application space and TTBR1_EL1 covers
the top of Virtual Address space, typically kernel space. This split means that the OS mappings
do not have to be replicated in the translation tables of each task.
Translation table entries contain a non-global (nG) bit. If the nG bit is set for a particular page,
it is associated with a specific task or application. If the bit is marked as 0, then the entry is
global and applies to all tasks.
For non-global entries, when the TLB is updated and the entry is marked as non-global, a value
is stored in the TLB entry in addition to the normal translation information. This value is called
the Address Space ID (ASID), which is a number assigned by the OS to each individual task.
Subsequent TLB look-ups only match on that entry if the current ASID matches with the ASID
that is stored in the entry. This permits multiple valid TLB entries to be present for a particular
page marked as non-global, but with different ASID values. In other words, we do not
necessarily need to flush the TLBs when we context switch.
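
The matching rule for non-global entries can be modelled with a toy TLB. This is a minimal
sketch: the Tlb class, its methods, and the page numbers are hypothetical, and a real TLB also
tags entries with other state that is not shown here.

```python
class Tlb:
    """Toy TLB whose non-global entries are tagged with an ASID."""

    def __init__(self):
        self.entries = []             # list of (virtual page, ASID or None, physical page)

    def fill(self, vpage, ppage, asid, non_global):
        """Record a translation. Global entries carry no ASID tag."""
        tag = asid if non_global else None
        self.entries.append((vpage, tag, ppage))

    def lookup(self, vpage, current_asid):
        """Return the physical page, or None on a miss (a table walk is then needed)."""
        for v, tag, p in self.entries:
            if v == vpage and (tag is None or tag == current_asid):
                return p
        return None
```

Two tasks can each have an entry for the same virtual page. Switching tasks only changes
current_asid, so neither entry needs to be flushed, while a global kernel entry matches whichever
ASID is current.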
In AArch64, this ASID value can be specified as either an 8-bit or 16-bit value, controlled by
the TCR_EL1.AS bit. The current ASID value is specified in either TTBR0_EL1 or
TTBR1_EL1. The TCR_EL1.A1 bit controls which TTBR holds the ASID, but it is normally
TTBR0_EL1, as this corresponds to application space.

Note
Having the current value of the ASID stored in the translation table register means that you can
atomically modify both the translation tables and the ASID in a single instruction. This
simplifies the process of changing the table and ASID when compared with the ARMv7-A
Architecture.
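
The reason a single register write can change both values is that they share one 64-bit register.
The sketch below assumes the ASID occupies bits [63:48] and the table base address the lower 48
bits. Alignment checks are omitted and the function names are hypothetical.

```python
ASID_SHIFT = 48
BASE_MASK = (1 << ASID_SHIFT) - 1     # lower 48 bits hold the table base address


def make_ttbr(table_base, asid):
    """Combine a translation table base address and an ASID into one register value."""
    if table_base & ~BASE_MASK:
        raise ValueError("table base must fit in 48 bits")
    if not 0 <= asid <= 0xFFFF:
        raise ValueError("ASID must fit in 16 bits")
    return (asid << ASID_SHIFT) | table_base


def ttbr_asid(ttbr):
    """Extract the ASID field."""
    return ttbr >> ASID_SHIFT


def ttbr_base(ttbr):
    """Extract the table base address field."""
    return ttbr & BASE_MASK
```

A context switch then amounts to computing make_ttbr for the next task and writing the result to
the translation table base register with one system register write.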
Additionally, the ARMv8-A Architecture provides Thread ID registers for use by operating
system software. These have no hardware significance and are typically used by threading
libraries as a base pointer to per-thread data. This is often referred to as Thread Local Storage
(TLS). For example, the pthreads library uses this feature. The registers are:
• User Read and Write Thread ID Register (TPIDR_EL0).
• User Read-Only Thread ID Register (TPIDRRO_EL0).
• Thread ID Register, privileged accesses only (TPIDR_EL1).

12.11 Kernel access with user permissions
There are instructions that allow code executing at EL1 (for example, an OS) to perform
memory accesses with EL0 or application permissions. This can be used, for example, to
de-reference pointers provided with system calls, and enables the OS to check that only data
accessible to the application is accessed. This can be achieved using the LDTR or STTR
instructions. When executed at EL1, these instructions perform the load or store as if executed
at EL0. At all other Exception levels, LDTR and STTR behave like regular LDR or STR instructions.
They come in the same size, signed, and unsigned variants as normal load and store instructions,
but with a smaller offset and restricted indexing options.
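
The effect of these instructions on the permission check can be modelled as shown below. This
sketch assumes the architectural AP[2:1] encodings, in which EL0 can read only with AP = 01 or
AP = 11 and privileged reads are always allowed. The function names are hypothetical.

```python
# AP[2:1] value -> whether an unprivileged (EL0) read is permitted
EL0_READABLE = {0b00: False, 0b01: True, 0b10: False, 0b11: True}


def access_is_privileged(current_el, unprivileged_insn):
    """LDTR/STTR at EL1 are checked as EL0. Elsewhere they act like LDR/STR."""
    if unprivileged_insn and current_el == 1:
        return False
    return current_el >= 1


def load_permitted(ap, current_el, unprivileged_insn):
    """Return True if a load with the given AP value passes the permission check."""
    if access_is_privileged(current_el, unprivileged_insn):
        return True                   # every AP encoding allows privileged reads
    return EL0_READABLE[ap]
```

A system call handler that reads a user-supplied pointer with an unprivileged load therefore
faults on kernel-only memory (AP = 00), exactly as the application itself would.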

Chapter 13
Memory Ordering

If your code interacts directly either with the hardware or with code executing on other cores,
or if it directly loads or writes instructions to be executed, or modifies page tables, you need to
be aware of memory ordering issues.
If you are an application developer, hardware interaction is probably through a device driver,
the interaction with other cores is through Pthreads or another multithreading API, and the
interaction with a paged memory system is through the operating system. In all of these cases,
the memory ordering issues are taken care of for you by the relevant code. However, if you are
writing the operating system kernel or device drivers, or implementing a hypervisor, JIT
compiler, or multithreading library, you must have a good understanding of the memory
ordering rules of the ARM Architecture. You must ensure that where your code requires explicit
ordering of memory accesses, you are able to achieve this through the correct use of barriers.
The ARMv8 architecture employs a weakly-ordered model of memory. In general terms, this
means that the order of memory accesses is not required to be the same as the program order for
load and store operations. The processor is able to re-order memory read operations with respect
to each other. Writes may also be re-ordered (for example, write combining). As a result,
hardware optimizations, such as the use of cache and write buffer, function in a way that
improves the performance of the processor: the required bandwidth between the processor and
external memory can be reduced and the long latencies associated with such external memory
accesses are hidden.
Reads and writes to Normal memory can be re-ordered by hardware, being subject only to data
dependencies and explicit memory barrier instructions. Certain situations require stronger
ordering rules. You can provide information to the core about this through the memory type
attribute of the translation table entry that describes that memory.

Very high performance systems might support techniques such as speculative memory reads,
multiple issuing of instructions, or out-of-order execution and these, along with other
techniques, offer further possibilities for hardware re-ordering of memory access:
Multiple issue of instructions
A processor might issue and execute multiple instructions per cycle, so that
instructions that are after each other in program order can be executed at the same
time.
Out-of-order execution
Many processors support out-of-order execution of non-dependent instructions.
Whenever an instruction is stalled while it waits for the result of a preceding
instruction, the processor can execute subsequent instructions that do not have a
dependency.
Speculation
When the processor encounters a conditional instruction, such as a branch, it can
speculatively begin to execute instructions before it knows for sure whether that
particular instruction must be executed or not. The result is, therefore, available
sooner if conditions resolve to show the speculation was correct.
Speculative loads
If a load instruction that reads from a cacheable location is speculatively
executed, this can result in a cache linefill and the potential eviction of an existing
cache line.
Load and store optimizations
As reads and writes to external memory can have a long latency, processors can
reduce the number of transfers by, for example, merging together a number of
stores into one larger transaction.
External memory systems
In many complex System on Chip (SoC) devices, there are a number of agents
capable of initiating transfers and multiple routes to the slave devices that are read
or written. Some of these devices, such as a DRAM controller, might be capable
of accepting simultaneous requests from different masters. Transactions can be
buffered or re-ordered by the interconnect. Accesses from different masters might
therefore take varying numbers of cycles to complete and might overtake each other.
Cache coherent multi-core processing
In a multi-core processor, hardware cache coherency can migrate cache lines
between cores. Different cores might therefore see updates to cached memory
locations in a different order to each other.
Optimizing compilers
An optimizing compiler can re-order instructions to hide latencies or make best
use of hardware features. It can often move a memory access forwards, to make
it earlier, and give it more time to complete before the value is required.
In a single core system, the effects of such re-ordering are generally transparent to the
programmer, as the individual processor can check for hazards and ensure that data
dependencies are respected. However, in cases where you have multiple cores that communicate
through shared memory, or share data in other ways, memory ordering considerations become
more important. This chapter discusses several topics that relate to Multiprocessing (MP)
operation and synchronization of multiple execution threads. It also discusses memory types
and rules defined by the architecture and how these are controlled.

13.1 Memory types
The ARMv8 architecture defines two mutually-exclusive memory types. All regions of memory
are configured as one or the other of these two types, which are Normal and Device. A third
memory type, Strongly Ordered, is part of the ARMv7 architecture. The differences between
this type and Device memory are few and it is therefore now omitted in ARMv8. (See Device
memory on page 13-4.)
In addition to the memory type, attributes also provide control over cacheability, shareability,
access, and execution permissions. Shareable and cache properties pertain only to Normal
memory. Device regions are always deemed to be non-cacheable and outer-shareable. For
cacheable locations, you can use attributes to indicate cache allocation policy to the processor.
The memory type is not directly encoded in the translation table entry. Instead, each block entry
specifies a 3-bit index into a table of memory types. This table is stored in the Memory Attribute
Indirection Register MAIR_ELn. This table has eight entries and each of those entries has eight
bits, as shown in Figure 13-1.
Although the translation table block entry itself does not directly contain the memory type
encoding, the TLB entry inside the processor usually stores this information for a specific entry.
Therefore, changes to MAIR_ELn might not be observed until after both an ISB instruction
barrier and a TLB invalidate operation.
[Figure: the MAIR_ELn register and the bit layout, bits [7:0], of an attribute entry that holds
the type encoding.]

Figure 13-1 Type encoding
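
The indirection through MAIR_ELn can be illustrated with a short model. This sketch assumes that
entry n occupies bits [8n+7:8n] of the register and that the 3-bit index is held in bits [4:2]
of a stage 1 block or page descriptor. The four encodings listed are commonly used values, and
the constant and function names are hypothetical.

```python
DEVICE_nGnRnE = 0x00
DEVICE_nGnRE = 0x04
NORMAL_NC = 0x44          # Normal memory, Inner and Outer Non-cacheable
NORMAL_WB = 0xFF          # Normal memory, Write-Back, Read- and Write-Allocate


def make_mair(attrs):
    """Pack up to eight 8-bit attribute encodings into one 64-bit value."""
    if len(attrs) > 8:
        raise ValueError("MAIR_ELn holds at most eight entries")
    value = 0
    for n, attr in enumerate(attrs):
        if not 0 <= attr <= 0xFF:
            raise ValueError("each entry is eight bits wide")
        value |= attr << (8 * n)
    return value


def attr_index(descriptor):
    """Extract the 3-bit attribute index from a descriptor."""
    return (descriptor >> 2) & 0b111


def memory_attr(mair, descriptor):
    """Return the 8-bit type encoding that a descriptor selects."""
    return (mair >> (8 * attr_index(descriptor))) & 0xFF
```

Changing one entry of the register changes the memory type of every mapping that uses that index,
which is why cached copies in the TLB must be invalidated afterwards.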

13.1.1 Normal memory
You can use Normal memory for all code and for most data regions in memory. Examples of
Normal memory include areas of RAM, Flash, or ROM in physical memory. This kind of
memory provides the highest processor performance as it is weakly ordered and has fewer
restrictions placed on the processor. The processor can re-order, repeat, and merge accesses to
Normal memory.
Furthermore, address locations that are marked as Normal can be accessed speculatively by the
processor, so that data or instructions can be read from memory without being explicitly
referenced in the program, or in advance of the actual execution of an explicit reference. Such
speculative accesses can occur as a result of branch prediction, speculative cache linefills,
out-of-order data loads, or other hardware optimizations.
For best performance, always mark application code and data as Normal. In circumstances
where an enforced memory ordering is required, you can achieve it through the use of explicit
barrier operations. Normal memory implements a weakly-ordered memory model. There is no
requirement for Normal accesses to complete in order with respect to either other Normal
accesses or Device accesses.
However, the processor must always handle hazards caused by address dependencies.
For example, consider the following simple code sequence:
STR X0, [X2]
LDR X1, [X2]

The processor always ensures that the value placed in X1 is the value that was written to the
address stored in X2.
This also applies to more complex dependencies.
Consider the following code:
ADD X4, X3, #3
ADD X5, X3, #2
STR X0, [X3]
STRB W1, [X4]
LDRH W2, [X5]

In this case, the accesses take place to addresses that overlap each other. The processor must
ensure that the memory is updated as if the STR and STRB occurred in order, so that the LDRH
returns the most up-to-date value. It would still be valid for the processor to merge the STR and
STRB into a single access that contained the latest, correct data to be written.
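
The result that the processor must produce for this sequence can be checked with a byte-level
model of memory. This is a sketch that assumes little-endian data accesses. The helper names
store, load, and run are hypothetical.

```python
def store(mem, addr, value, size):
    """Write size bytes of value at addr, little-endian."""
    mem[addr:addr + size] = value.to_bytes(size, "little")


def load(mem, addr, size):
    """Read size bytes at addr as a little-endian integer."""
    return int.from_bytes(mem[addr:addr + size], "little")


def run(x0, w1, x3=0):
    """Perform the three accesses in program order and return the value of W2."""
    mem = bytearray(16)
    store(mem, x3, x0, 8)               # STR  X0, [X3]
    store(mem, x3 + 3, w1 & 0xFF, 1)    # STRB W1, [X4], where X4 = X3 + 3
    return load(mem, x3 + 2, 2)         # LDRH W2, [X5], where X5 = X3 + 2
```

With X0 = 0x8877665544332211 and W1 = 0xAB, the halfword read back is 0xAB33: its low byte comes
from the STR and its high byte from the later STRB. Any merging or re-ordering that the hardware
performs must give the same value.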
13.1.2 Device memory
You can use Device memory for all memory regions where an access might have a side-effect.
For example, a read to a FIFO location or timer is not repeatable, as it returns different values
for each read. A write to a control register might trigger an interrupt. Device memory is
typically only used for peripherals in the system. The Device memory type imposes more
restrictions on the core.
Speculative data accesses cannot be performed to regions of memory marked as Device. There
is a single, uncommon exception to this. If NEON operations are used to read bytes from Device
memory, the processor might read bytes not explicitly referenced if they are within an aligned
16-byte block that contains one or more bytes that are explicitly referenced.
Trying to execute code from a region marked as Device is generally UNPREDICTABLE. The
implementation might either handle the instruction fetch as if it were to a memory location with
the Normal non-cacheable attribute, or it might take a permission fault.
There are four different types of device memory, to which different rules apply.
• Device-nGnRnE, the most restrictive (equivalent to Strongly Ordered memory in the ARMv7
  architecture).
• Device-nGnRE
• Device-nGRE
• Device-GRE, the least restrictive

The letter suffixes refer to the following three properties:
Gathering or non Gathering (G or nG)
This property determines whether multiple accesses can be merged into a single
bus transaction for this memory region. If the address is marked as non Gathering
(nG), then the number and size of accesses on the memory bus performed to that
location must exactly match the number and size of explicit accesses in the code.
If the address is marked as Gathering (G), then the processor can, for example,
merge two byte writes into a single half-word write.
For a region marked as Gathering, multiple memory accesses to the same
memory location can also be merged. For example, if the program reads the same
location twice, the core only needs to perform the read once and can return the
same result for both instructions. For reads from regions marked as non
Gathering, the data value must come from the end device. It cannot be snooped
from a write-buffer or other location.
Re-ordering (R or nR)
This determines whether accesses to the same device can be re-ordered with
respect to each other. If the address is marked as non Re-ordering (nR), then
accesses within the same block always appear on the bus in program order. The
size of this block is IMPLEMENTATION DEFINED. Where the size of this block is
large, it could span several table entries. In this case, the ordering rule is observed
with respect to any other accesses also marked as nR.
Early Write Acknowledgement (E or nE)
This determines whether an intermediate write buffer between the processor and
the slave device being accessed is allowed to send an acknowledgement of a write
completion. If the address is marked as non Early Write Acknowledgement (nE),
then the write response must come from the peripheral. If the address is marked
as Early Write Acknowledgement (E), then it is permissible for a buffer in the
interconnect logic to signal write acceptance, in advance of the write actually
being received by the end device. This is essentially a message to the external
memory system.
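
The naming scheme of the four Device types can be decoded mechanically. The following sketch
splits a type name into the three properties described above. The function name and the keys of
the returned dictionary are hypothetical.

```python
def device_properties(name):
    """Split a type such as 'Device-nGnRE' into its three permitted behaviors."""
    prefix, _, suffix = name.partition("-")
    if prefix != "Device":
        raise ValueError("not a Device memory type")
    props = {}
    for letter, key in (("G", "gathering"), ("R", "reordering"), ("E", "early_ack")):
        if suffix.startswith("n" + letter):
            props[key] = False        # the behavior is not permitted
            suffix = suffix[2:]
        elif suffix.startswith(letter):
            props[key] = True         # the behavior is permitted
            suffix = suffix[1:]
        else:
            raise ValueError("malformed Device memory type")
    if suffix:
        raise ValueError("unexpected trailing characters")
    return props
```

For example, Device-nGnRE forbids gathering and re-ordering but allows an intermediate buffer to
acknowledge writes early, whereas Device-nGnRnE forbids all three.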

13.2 Barriers
The ARM architecture includes barrier instructions to force access ordering and access
completion at a specific point. In some architectures, similar instructions are known as fences.
If you are writing code where ordering is important, see Appendix J7 Barrier Litmus Tests in the
ARM Architecture Reference Manual - ARMv8, for ARMv8-A architecture profile and Appendix
G Barrier Litmus Tests in the ARM Architecture Reference Manual ARMv7-A/R Edition, which
include many worked examples.
The ARM Architecture Reference Manual defines certain key words, in particular, the terms
observe and must be observed. In typical systems, this defines how the bus interface of a master,
for example, a core or GPU and the interconnect, must handle bus transactions. Only masters
are able to observe transfers. All bus transactions are initiated by a master. The order that a
master performs transactions in is not necessarily the same order that such transactions complete
at the slave device, because transactions might be re-ordered by the interconnect unless some
ordering is explicitly enforced.
A simple way to describe observability is to say that “I have observed your write when I can
read what you wrote and I have observed your read when I can no longer change the value you
read” where both I and you refer to cores or other masters in the system.
There are three types of barrier instruction provided by the architecture:
Instruction Synchronization Barrier (ISB)
This is used to guarantee that any subsequent instructions are fetched again, so
that privilege and access are checked with the current MMU configuration. It is
used to ensure any previously executed context-changing operations, such as
writes to system control registers, have completed by the time the ISB completes.
In hardware terms, this might mean that the instruction pipeline is flushed, for
example. Typical uses of this would be in memory management, cache control,
and context switching code, or where code is being moved about in memory.
Data Memory Barrier (DMB)
This prevents re-ordering of data access instructions across the barrier
instruction. All data accesses, that is, loads or stores, but not instruction fetches,
performed by this processor before the DMB, are visible to all other masters
within the specified shareability domain before any of the data accesses after the
DMB.
For example:
LDR x0, [x1]     // Must be seen by the memory system before the STR below.
DMB ISHLD
ADD x2, x2, #1   // May be executed before or after the memory system sees the LDR.
STR x3, [x4]     // Must be seen by the memory system after the LDR above.

It also ensures that any explicit preceding data or unified cache maintenance
operations have completed before any subsequent data accesses are executed.
DC CSW, x5       // Data clean by Set/way
LDR x0, [x1]     // Effect of data cache clean might not be seen by this instruction
DMB ISH
LDR x2, [x3]     // Effect of data cache clean will be seen by this instruction

Data Synchronization Barrier (DSB)
This enforces the same ordering as the Data Memory Barrier, but has the
additional effect of blocking execution of any further instructions, not just loads
or stores, or both, until synchronization is complete. This can be used to prevent
execution of a SEV instruction, for instance, that would signal to other cores that
an event occurred. It waits until all cache, TLB and branch predictor maintenance
operations issued by this processor have completed for the specified shareability
domain.
For example:
DC ISW, x5       // Operation must have completed before DSB can complete
STR x0, [x1]     // Access must have completed before DSB can complete
DSB ISH
ADD x2, x2, #3   // Cannot be executed until DSB completes

As you can see from the above examples, the DMB and DSB instructions take a parameter which
specifies the types of access to which the barrier applies, before or after, and the shareability
domain to which it applies.
The available options are listed in the table.
Table 13-1 Barrier parameters

