ARM® Cortex®-A Series
Programmer’s Guide for ARMv8-A

Version: 1.0

Copyright © 2015 ARM. All rights reserved.
ARM DEN0024A (ID050815)

Release Information
The following changes have been made to this book.
Change history

Date            Issue   Confidentiality     Change
24 March 2015   A       Non-Confidential    First release

Proprietary Notice
This document is protected by copyright and other related rights and the practice or implementation of the information
contained in this document may be protected by one or more patents or pending patent applications. No part of this
document may be reproduced in any form by any means without the express prior written permission of ARM. No
license, express or implied, by estoppel or otherwise to any intellectual property rights is granted by this document
unless specifically stated.
Your access to the information in this document is conditional upon your acceptance that you will not use or permit
others to use the information for the purposes of determining whether implementations infringe any third party patents.
THIS DOCUMENT IS PROVIDED “AS IS”. ARM PROVIDES NO REPRESENTATIONS AND NO
WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, INCLUDING, WITHOUT LIMITATION, THE IMPLIED
WARRANTIES OF MERCHANTABILITY, SATISFACTORY QUALITY, NON-INFRINGEMENT OR FITNESS
FOR A PARTICULAR PURPOSE WITH RESPECT TO THE DOCUMENT. For the avoidance of doubt, ARM makes
no representation with respect to, and has undertaken no analysis to identify or understand the scope and content of,
third party patents, copyrights, trade secrets, or other rights.
This document may include technical inaccuracies or typographical errors.
TO THE EXTENT NOT PROHIBITED BY LAW, IN NO EVENT WILL ARM BE LIABLE FOR ANY DAMAGES,
INCLUDING WITHOUT LIMITATION ANY DIRECT, INDIRECT, SPECIAL, INCIDENTAL, PUNITIVE, OR
CONSEQUENTIAL DAMAGES, HOWEVER CAUSED AND REGARDLESS OF THE THEORY OF LIABILITY,
ARISING OUT OF ANY USE OF THIS DOCUMENT, EVEN IF ARM HAS BEEN ADVISED OF THE
POSSIBILITY OF SUCH DAMAGES.
This document consists solely of commercial items. You shall be responsible for ensuring that any use, duplication or
disclosure of this document complies fully with any relevant export laws and regulations to assure that this document
or any portion thereof is not exported, directly or indirectly, in violation of such export laws. Use of the word “partner”
in reference to ARM’s customers is not intended to create or refer to any partnership relationship with any other
company. ARM may make changes to this document at any time and without notice.
If any of the provisions contained in these terms conflict with any of the provisions of any signed written agreement
covering this document with ARM, then the signed written agreement prevails over and supersedes the conflicting
provisions of these terms. This document may be translated into other languages for convenience, and you agree that if
there is any conflict between the English version of this document and any translation, the terms of the English version
of the Agreement shall prevail.
Words and logos marked with ® or ™ are registered trademarks or trademarks of ARM Limited or its affiliates in the
EU and/or elsewhere. All rights reserved. Other brands and names mentioned in this document may be the trademarks
of their respective owners. Please follow ARM’s trademark usage guidelines at
http://www.arm.com/about/trademark-usage-guidelines.php
Copyright © 2015, ARM Limited or its affiliates. All rights reserved.
ARM Limited. Company 02557590 registered in England.
110 Fulbourn Road, Cambridge, England CB1 9NJ.
Confidentiality Status
This document is Non-Confidential. The right to use, copy and disclose this document may be subject to license
restrictions in accordance with the terms of the agreement entered into by ARM and the party that ARM delivered this
document to.
Product Status
The information in this document is final, that is for a developed product.


Web Address
http://www.arm.com


Contents
ARM Cortex-A Series Programmer’s Guide for ARMv8-A

Preface
    Glossary
    References
    Feedback on this book

Chapter 1    Introduction
    1.1    How to use this book

Chapter 2    ARMv8-A Architecture and Processors
    2.1    ARMv8-A
    2.2    ARMv8-A Processor properties

Chapter 3    Fundamentals of ARMv8
    3.1    Execution states
    3.2    Changing Exception levels
    3.3    Changing execution state

Chapter 4    ARMv8 Registers
    4.1    AArch64 special registers
    4.2    Processor state
    4.3    System registers
    4.4    Endianness
    4.5    Changing execution state (again)
    4.6    NEON and floating-point registers

Chapter 5    An Introduction to the ARMv8 Instruction Sets
    5.1    The ARMv8 instruction sets
    5.2    C/C++ inline assembly
    5.3    Switching between the instruction sets

Chapter 6    The A64 instruction set
    6.1    Instruction mnemonics
    6.2    Data processing instructions
    6.3    Memory access instructions
    6.4    Flow control
    6.5    System control and other instructions

Chapter 7    AArch64 Floating-point and NEON
    7.1    New features for NEON and Floating-point in AArch64
    7.2    NEON and Floating-Point architecture
    7.3    AArch64 NEON instruction format
    7.4    NEON coding alternatives

Chapter 8    Porting to A64
    8.1    Alignment
    8.2    Data types
    8.3    Issues when porting code from a 32-bit to 64-bit environment
    8.4    Recommendations for new C code

Chapter 9    The ABI for ARM 64-bit Architecture
    9.1    Register use in the AArch64 Procedure Call Standard

Chapter 10   AArch64 Exception Handling
    10.1   Exception handling registers
    10.2   Synchronous and asynchronous exceptions
    10.3   Changes to execution state and Exception level caused by exceptions
    10.4   AArch64 exception table
    10.5   Interrupt handling
    10.6   The Generic Interrupt Controller

Chapter 11   Caches
    11.1   Cache terminology
    11.2   Cache controller
    11.3   Cache policies
    11.4   Point of coherency and unification
    11.5   Cache maintenance
    11.6   Cache discovery

Chapter 12   The Memory Management Unit
    12.1   The Translation Lookaside Buffer
    12.2   Separation of kernel and application Virtual Address spaces
    12.3   Translating a Virtual Address to a Physical Address
    12.4   Translation tables in ARMv8-A
    12.5   Translation table configuration
    12.6   Translations at EL2 and EL3
    12.7   Access permissions
    12.8   Operating system use of translation table descriptors
    12.9   Security and the MMU
    12.10  Context switching
    12.11  Kernel access with user permissions

Chapter 13   Memory Ordering
    13.1   Memory types
    13.2   Barriers
    13.3   Memory attributes

Chapter 14   Multi-core processors
    14.1   Multi-processing systems
    14.2   Cache coherency
    14.3   Multi-core cache coherency within a cluster
    14.4   Bus protocol and the Cache Coherent Interconnect

Chapter 15   Power Management
    15.1   Idle management
    15.2   Dynamic voltage and frequency scaling
    15.3   Assembly language power instructions
    15.4   Power State Coordination Interface

Chapter 16   big.LITTLE Technology
    16.1   Structure of a big.LITTLE system
    16.2   Software execution models in big.LITTLE
    16.3   big.LITTLE MP

Chapter 17   Security
    17.1   TrustZone hardware architecture
    17.2   Switching security worlds through interrupts
    17.3   Security in multi-core systems
    17.4   Switching between Secure and Non-secure state

Chapter 18   Debug
    18.1   ARM debug hardware
    18.2   ARM trace hardware
    18.3   DS-5 debug and trace

Chapter 19   ARMv8 Models
    19.1   ARM Fast Models
    19.2   ARMv8-A Foundation Platform
    19.3   The Base Platform FVP

Preface

In 2013, ARM released its 64-bit ARMv8 architecture, the first major change to the ARM
architecture since ARMv7 in 2007, and the most fundamental and far-reaching change since the
original ARM architecture was created.
Development of the architecture has continued for some years. Early versions were being used
before the Cortex-A Series Programmer’s Guide for ARMv7-A was first released. The first of
the Programmer’s Guide series from ARM, it post-dated the introduction of the 32-bit ARMv7
architecture by some years. Almost immediately there were requests for a version to cover the
ARMv8 architecture. It was intended from the outset that a guide to ARMv8 should be available
as soon as possible.
This book was started when the first versions of the ARMv8 architecture were being tested and
codified. As always, moving from a system that is known and understood to something new and
unknown can present a number of problems. The engineers who supplied information for the
present book are, by and large, the same engineers who supplied the information for the original
Cortex-A Series Programmer’s Guide. This book has been made richer by their observations and
insights as they use the new architecture and solve the problems it presents.
The Programmer’s Guides are meant to complement, rather than replace, other ARM
documentation available, such as the Technical Reference Manuals (TRMs) for the processors
themselves, documentation for individual devices or boards or, most importantly, the ARM
Architecture Reference Manual (the ARM ARM). They are intended to provide a gentle
introduction to the ARM architecture, and cover all the main concepts that you need to know
about, in an easy to read format, with examples of actual code in both C and assembly language,
and with hints and tips for writing your own code.
It might be argued that if you are an application developer, you do not need to know what goes
on inside a processor. ARM Application processors can easily be regarded as black boxes which
simply run your code when you say go. Instead, this book provides a single guide, bringing
together information from a wide variety of sources, for those programmers who get the system
to the point where application developers can run applications, such as those involved in ASIC
verification, or those working on boot code and device drivers.
During bring-up of a new board or System-on-Chip (SoC), engineers may have to investigate
issues with the hardware. Memory system behavior is among the most common places for these
to manifest, for example, deadlocks where the processor cannot make forward progress because
of memory system lock. Debugging these problems requires an understanding of the operation
and effect of cache or MMU use. This is different from debugging a failing piece of code.
In a similar vein, system architects (usually hardware engineers) make choices early in the
design about the implementation of DMA, frame buffers and other parts of the memory system
where an understanding of data flow between agents is required. It is difficult to
make sensible decisions in these areas if you do not understand when a cache will help you and when
it gets in the way, or how the OS will use the MMU. Similar considerations apply in many other
places.
This is not an introductory level book, nor is it a purely technical description of the architecture
and processors that merely states the facts with little or no explanation of ‘how’ and ‘why’.
ARM and all who have collaborated on this book hope it successfully navigates between the two
extremes, while attempting to explain some of the more intricate aspects of the architecture.


Glossary
Abbreviations and terms used in this document are defined here.


AAPCS

ARM Architecture Procedure Call Standard.

AArch32 state

The ARM 32-bit execution state that uses 32-bit general-purpose registers,
and a 32-bit Program Counter (PC), Stack Pointer (SP), and Link Register
(LR). AArch32 execution state provides a choice of two instruction sets,
A32 and T32, previously called the ARM and Thumb instruction sets.

AArch64 state

The ARM 64-bit execution state that uses 64-bit general-purpose registers,
and a 64-bit Program Counter (PC), Stack Pointer (SP), and Exception
Link Registers (ELR). AArch64 execution state provides a single
instruction set, A64.

ABI

Application Binary Interface.

ACE

AXI Coherency Extensions.

AES

Advanced Encryption Standard.

AMBA®

Advanced Microcontroller Bus Architecture.

AMP

Asymmetric Multi-Processing.

ARM ARM

The ARM Architecture Reference Manual.

ASIC

Application Specific Integrated Circuit.

ASID

Address Space ID.

AXI

Advanced eXtensible Interface.

BE8

Byte Invariant Big-Endian Mode.

BTAC

Branch Target Address Cache.

BTB

Branch Target Buffer.

CCI

Cache Coherent Interconnect.

CHI

Coherent Hub Interface.

CP15

Coprocessor 15, the System control coprocessor for AArch32 and ARMv7-A.

DAP

Debug Access Port.

DMA

Direct Memory Access.

DMB

Data Memory Barrier.

DS-5™

The ARM Development Studio.

DSB

Data Synchronization Barrier.

DSP

Digital Signal Processing.

DSTREAM

An ARM debug and trace unit.

DVFS

Dynamic Voltage/Frequency Scaling.

EABI

Embedded ABI.

ECC

Error Correcting Code.

ECT

Embedded Cross Trigger.

EL0

Exception level used to execute user applications.

EL1

Exception level normally used to run operating systems.

EL2

Hypervisor Exception level. In the Normal world, or Non-Secure state,
this is used to execute hypervisor code.

EL3

Secure Monitor Exception level. This is used to execute the code that
guards transitions between the Secure and Normal worlds.

ETB

Embedded Trace Buffer™.

ETM

Embedded Trace Macrocell™.

Execution state

The operational state of the processor, either 64-bit (AArch64) or 32-bit
(AArch32).

FIQ

An interrupt type (formerly fast interrupt).

FPSCR

Floating-Point Status and Control Register.

GCC

GNU Compiler Collection.

GIC

Generic Interrupt Controller.

Harvard architecture

Architecture with physically separate storage and signal pathways for
instructions and data.

HCR

Hyp Configuration Register.

HMP

Heterogeneous Multi-Processing.

IMPLEMENTATION DEFINED

Some properties of the processor are defined by the manufacturer.

IPA

Intermediate Physical Address.

IRQ

Interrupt Request, normally for external interrupts.

ISA

Instruction Set Architecture.

ISB

Instruction Synchronization Barrier.

ISR

Interrupt Service Routine.

Jazelle™

The ARM bytecode acceleration technology.

LLP64

Indicates the size in bits of basic C data types. Under LLP64, int and long
data types are 32 bits, while pointers and long long are 64 bits.

LP64

Indicates the size in bits of basic C data types. Under LP64, int is
32 bits, while long, long long and pointers are 64 bits.

LPAE

Large Physical Address Extension.

LSB

Least Significant Bit.

MESI

A cache coherency protocol with four states that are Modified, Exclusive,
Shared and Invalid.

MMU

Memory Management Unit.

MOESI

A cache coherency protocol with five states that are Modified, Owned,
Exclusive, Shared and Invalid.

Monitor mode

When EL3 is using AArch32, the PE mode in which the Secure Monitor
must execute. This mode guards transitions between the Secure and
Normal worlds.

MPU

Memory Protection Unit.

NEON™

The ARM Advanced SIMD Extensions.

NIC

Network InterConnect.

Normal world

The execution environment when the processor is in the Non-secure state.

PCS

Procedure Call Standard.

PIPT

Physically Indexed, Physically Tagged.

PoC

Point of Coherency.

PoU

Point of Unification.

PSR

Program Status Register.

SCU

Snoop Control Unit.

Secure world

The execution environment when the processor is in the Secure State.

SIMD

Single Instruction, Multiple Data.

SMC

Secure Monitor Call. An ARM assembler instruction that causes an
exception that is taken synchronously to EL3.

SMC32

32-bit SMC calling convention.

SMC64

64-bit SMC calling convention.

SMC Function Identifier

A 32-bit integer that identifies which function is being invoked by this
SMC call. It is passed in R0 or W0 to every SMC call.

SMMU

System MMU.

SMP

Symmetric Multi-Processing.

SoC

System on Chip.

SP

Stack Pointer.

SPSR

Saved Program Status Register.

Streamline

A graphical performance analysis tool.

SVC

Supervisor Call instruction.

SYS

System Mode.

Thumb®

An instruction set extension to ARM.

Thumb-2

A technology extending the Thumb instruction set to support both 16-bit
and 32-bit instructions.

TLB

Translation Lookaside Buffer.


TrustedOS

This is the operating system running in the Secure World. It supports the
execution of trusted applications in Secure EL0. When EL3 is using
AArch64 it executes in Secure EL1. When EL3 is using AArch32 it
executes in Secure EL3 modes other than Monitor mode.

TrustZone®

The ARM security extension.

TTB

Translation Table Base.

TTBR

Translation Table Base Register.

UART

Universal Asynchronous Receiver/Transmitter.

UEFI

Unified Extensible Firmware Interface.

U-Boot

A Linux Bootloader.

UNK

Unknown.

UNKNOWN

Values in a register cannot be known before they are reset.

UNPREDICTABLE

The value taken cannot be predicted.

USR

User mode, a non-privileged processor mode.

VFP

The ARM floating-point instruction set. Before ARMv7, the VFP
extension was called the Vector Floating-Point architecture, and was used
for vector operations.

VIPT

Virtually Indexed, Physically Tagged.

VMID

Virtual Machine Identifier.

XN

Execute Never.
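
The LP64 and LLP64 entries in this glossary can be checked against a given toolchain with a few lines of C. The following is a minimal sketch, not taken from the original guide: the type `type_widths` and the functions `measure_type_widths` and `data_model_name` are hypothetical names, and the ILP32 case is added here only for completeness because it is typical of 32-bit targets.

```c
#include <limits.h>
#include <stddef.h>

/* Widths, in bits, of the basic C types that distinguish the data models.
 * The struct and function names are hypothetical. */
typedef struct {
    size_t int_bits;
    size_t long_bits;
    size_t long_long_bits;
    size_t pointer_bits;
} type_widths;

/* Measure the widths for whatever target this file is compiled for. */
type_widths measure_type_widths(void)
{
    type_widths w;
    w.int_bits       = sizeof(int) * CHAR_BIT;
    w.long_bits      = sizeof(long) * CHAR_BIT;
    w.long_long_bits = sizeof(long long) * CHAR_BIT;
    w.pointer_bits   = sizeof(void *) * CHAR_BIT;
    return w;
}

/* Classify a set of widths using the glossary definitions. */
const char *data_model_name(type_widths w)
{
    if (w.int_bits != 32 || w.long_long_bits != 64)
        return "other";
    if (w.long_bits == 64 && w.pointer_bits == 64)
        return "LP64";
    if (w.long_bits == 32 && w.pointer_bits == 64)
        return "LLP64";
    if (w.long_bits == 32 && w.pointer_bits == 32)
        return "ILP32";   /* not in the glossary; typical of 32-bit targets */
    return "other";
}
```

Calling `data_model_name(measure_type_widths())` reports the model of the current build. A 64-bit Linux toolchain for AArch64 would normally be expected to report LP64, but the result depends entirely on the compiler and platform in use.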


References
ANSI/IEEE Std 754-1985, “IEEE Standard for Binary Floating-Point Arithmetic”.
ANSI/IEEE Std 754-2008, “IEEE Standard for Floating-Point Arithmetic”.
ANSI/IEEE Std 1003.1-1990, “Standard for Information Technology - Portable Operating
System Interface (POSIX) Base Specifications, Issue 7”.
ANSI/IEEE Std 1149.1-2001, “IEEE Standard Test Access Port and Boundary-Scan
Architecture”.
The ARMv8 Architecture Reference Manual, known as the ARM ARM, fully describes the
ARMv8 instruction set architecture, programmer’s model, system registers, debug features and
memory model. It forms a detailed specification to which all implementations of ARM
processors must adhere.
References to the ARM Architecture Reference Manual in this document are to:
ARM® Architecture Reference Manual - ARMv8, for ARMv8-A architecture profile (ARM DDI
0487).
Note
In the event of a contradiction between this book and the ARM ARM, the ARM ARM is
definitive and must take precedence. In most instances, however, the ARM ARM and the
Cortex-A Series Programmer’s Guide for ARMv8-A cover two separate world views. The most
likely scenario is that this book describes something in a way that does not cover all
architecturally permitted behaviors, or simply rewords an abstract concept in more practical
terms.
ARM® Cortex®-A Series Programmer’s Guide for ARMv7-A (DEN 0013).
ARM® NEON™ Programmer’s Guide (DEN 0018).
ARM® Cortex®-A53 MPCore Processor Technical Reference Manual (DDI 0500).
ARM® Cortex®-A57 MPCore Processor Technical Reference Manual (DDI 0488).
ARM® Generic Interrupt Controller Architecture Specification (ARM IHI 0048).
ARM® Compiler armasm Reference Guide v6.01 (DUI 0802).
ARM® Compiler Software Development Guide v5.05 (DUI 0471).
ARM® C Language Extensions (IHI 0053).
ELF for the ARM® Architecture (ARM IHI 0044).
The individual processor Technical Reference Manuals provide a detailed description of the
processor behavior. They can be obtained from the ARM website documentation area
http://infocenter.arm.com.
Connected community
The ARM Connected Community makes it easier to design using ARM processors and IP. It is
an interactive platform containing information, discussions and blogs which help you to develop
an ARM-based design efficiently, in collaboration with ARM engineers and our 1200+
ecosystem Partners and enthusiasts. Visitors also use the community to find new companies to
work with from the many ARM Partners who first introduced their products and services in their
dedicated area. You can join the Connected Community on http://community.arm.com.


Feedback on this book
ARM hopes you find the Cortex-A Series Programmer’s Guide for ARMv8-A easy to read, yet
detailed enough to provide a comprehensive introduction to using the processors.
If you have any comments on this book, don’t understand our explanations, think something is
missing, or think that it is incorrect, send an e-mail to errata@arm.com. Give:
• The title.
• The number, ARM DEN0024A.
• The page number(s) to which your comments apply.
• What you think needs to be changed.
ARM also welcomes general suggestions for additions and improvements.


Chapter 1
Introduction

ARMv8-A is the latest generation of the ARM architecture that is targeted at the Applications
Profile. In this book, the name ARMv8 is used to describe the overall architecture, which now
includes both 32-bit and 64-bit execution states. ARMv8 introduces the ability to
execute with 64-bit wide registers, while providing mechanisms for backwards
compatibility that enable existing ARMv7 software to be executed.
AArch64 is the name used to describe the 64-bit execution state of the ARMv8 architecture.
AArch32 describes the 32-bit execution state of the ARMv8 architecture, which is almost
identical to ARMv7. GNU and Linux documentation (except for Red Hat and Fedora
distributions) sometimes refers to AArch64 as ARM64.
Because many of the concepts of the ARMv8-A architecture are shared with the ARMv7-A
architecture, the details of all those concepts are not covered here. For a general introduction to
the ARMv7-A architecture, refer to the ARM® Cortex®-A Series Programmer’s Guide for ARMv7-A. This
guide can also help you to familiarize yourself with some of the concepts discussed in this
volume. Like most versions of the ARM architecture, the ARMv8-A architecture profile is
backwards compatible with earlier iterations, so there is a certain amount of
overlap between the way the ARMv8 architecture and previous architectures function. The
general principles of the ARMv7 architecture are covered only where needed to explain the differences
between the ARMv8 and earlier ARMv7 architectures.
Cortex-A series processors now include both ARMv8-A and ARMv7-A implementations:

• The Cortex-A5, Cortex-A7, Cortex-A8, Cortex-A9, Cortex-A15, and Cortex-A17
  processors all implement the ARMv7-A architecture.

• The Cortex-A53 and Cortex-A57 processors implement the ARMv8-A architecture.


ARMv8 processors still support software written for the ARMv7-A processors, with some
exceptions. This means, for example, that 32-bit code written for the ARMv7 Cortex-A series
processors also runs on ARMv8 processors such as the Cortex-A57, but only when the ARMv8
processor is in the AArch32 execution state. The reverse does not hold: code that uses the A64
64-bit instruction set does not run on ARMv7 processors, and runs only on ARMv8
processors.
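
As a rough illustration of this split, a C source file can check at compile time which instruction set it is being built for. The sketch below assumes a GCC or Clang toolchain, which predefine `__aarch64__` for A64 targets and `__arm__` for A32/T32 targets; other compilers may use different macros. The function names `target_isa` and `report_target` are hypothetical and are not part of any ARM library.

```c
#include <stdio.h>

/* Returns a label for the instruction set this translation unit was
 * compiled for. __aarch64__ and __arm__ are predefined by GCC and Clang;
 * treating them as available is an assumption about the toolchain. */
const char *target_isa(void)
{
#if defined(__aarch64__)
    return "A64 (AArch64 execution state)";
#elif defined(__arm__)
    return "A32/T32 (AArch32 execution state)";
#else
    return "non-ARM target";
#endif
}

/* Hypothetical helper: prints the label, for example at start-up. */
void report_target(void)
{
    printf("Compiled for: %s\n", target_isa());
}
```

The check is made when the code is compiled, not when it runs: a binary built for A64 cannot be loaded on an ARMv7 processor at all, so there is nothing to detect at run time in that direction.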
Some knowledge of the C programming language and microprocessors is assumed of the
readers of this book. There are pointers to further reading, referring to books and websites that
can give you a deeper level of background to the subject matter.

The change from 32-bit to 64-bit
There are several performance gains derived from moving to a 64-bit processor.
•

The A64 instruction set provides some significant performance benefits, including a
larger register pool. The additional registers and the ARM Architecture Procedure Call
Standard (AAPCS) provide a performance boost when you must pass more than four
registers in a function call. On ARMv7, this would require using the stack, whereas in
AArch64 up to eight parameters can be passed in registers.

•

Wider integer registers enable code that operates on 64-bit data to work more efficiently.
A 32-bit processor might require several operations to perform an arithmetic operation on
64-bit data. A 64-bit processor might be able to perform the same task in a single
operation, typically at the same speed at which that processor performs a 32-bit
operation. Code that performs many 64-bit operations can therefore be significantly
faster.

•

64-bit operation enables applications to use a larger virtual address space. While the Large
Physical Address Extension (LPAE) extends the physical address space of a 32-bit
processor to 40 bits, it does not extend the virtual address space. This means that even with
LPAE, a single application is limited to a 32-bit (4GB) address space, and in practice to
less than that, because some of this address space is reserved for the operating system.

•

Software running on a 32-bit architecture might need to map some data in or out of
memory while executing. Having a larger address space, with 64-bit pointers, avoids this
problem. However, using 64-bit pointers does incur some cost. The same piece of code
typically uses more memory when running with 64-bit pointers than with 32-bit pointers.
Each pointer is stored in memory and requires eight bytes instead of four. This might
sound trivial, but can add up to a significant penalty. Furthermore, the increased usage of
memory space associated with a move to 64-bits can cause a drop in the number of
accesses that hit in the cache. This in turn can reduce performance.
The larger virtual address space also enables memory-mapping larger files. This is the
mapping of the file contents into the memory map of a thread. This can occur even though
the physical RAM might not be large enough to contain the whole file.
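
As a rough illustration of the point about 64-bit arithmetic in the list above, the following C
sketch adds two 64-bit values using only 32-bit operations, which is approximately what has to
happen on a 32-bit core, and compares that with a single native 64-bit addition. The type
u64_pair and all function names are invented for this example and are not part of any ARM
library.

```c
#include <assert.h>
#include <stdint.h>

/* A 64-bit value held as two 32-bit halves, in the way a 32-bit core would
 * hold it in a pair of registers. The type name is illustrative only. */
typedef struct {
    uint32_t lo;
    uint32_t hi;
} u64_pair;

static_assert(sizeof(uint64_t) == 2 * sizeof(uint32_t),
              "a 64-bit value is expected to span two 32-bit words");

/* Split a 64-bit value into its low and high 32-bit words. */
u64_pair split64(uint64_t v)
{
    u64_pair p = { (uint32_t)v, (uint32_t)(v >> 32) };
    return p;
}

/* Rebuild a 64-bit value from its two 32-bit words. */
uint64_t join64(u64_pair p)
{
    return ((uint64_t)p.hi << 32) | p.lo;
}

/* Add using 32-bit operations only: add the low words, detect the carry,
 * then add the high words plus that carry. */
u64_pair add64_using_32bit_ops(u64_pair a, u64_pair b)
{
    u64_pair r;
    r.lo = a.lo + b.lo;                 /* wraps modulo 2^32 */
    uint32_t carry = (r.lo < a.lo);     /* 1 if the low-word add overflowed */
    r.hi = a.hi + b.hi + carry;
    return r;
}

/* The same addition when 64-bit integer registers are available. */
uint64_t add64_native(uint64_t a, uint64_t b)
{
    return a + b;
}
```

The 32-bit version needs an add, a carry computation, and a second add, whereas the native
version is one operation. Real compilers use add-with-carry instructions rather than an explicit
comparison, so this is a model of the extra work and not of the generated code.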

1.1 How to use this book
This book provides a single guide for programmers who want to use the Cortex-A series
processors that implement the ARMv8 architecture. The guide brings together information from
a wide variety of sources that is useful to both ARM assembly language and C programmers. It
is meant to complement rather than replace other ARM documentation available for ARMv8
processors. Other documents that provide specific information include the ARM Technical
Reference Manuals (TRMs) for the processors themselves, documentation for individual
devices or boards and, most importantly, the ARM Architecture Reference Manual - ARMv8, for
ARMv8-A architecture profile (the ARM ARM).
This book is not written at an introductory level. It assumes some knowledge of the C
programming language and microprocessors. Hardware concepts such as caches and Memory
Management Units are covered, but only where this knowledge is valuable to the application
writer. The book looks at the way operating systems utilize ARMv8 features, and how to take
full advantage of the capabilities of the ARMv8 processors. Some chapters contain pointers to
additional reading. We also refer to books and web sites that can give a deeper level of
background to the subject matter, but often the main focus is the ARM-specific detail. No
assumptions are made on the use of any particular toolchain, and both GNU and ARM tools are
mentioned throughout the book.
If you are new to the ARMv8 architecture, Chapter 2 ARMv8-A Architecture and Processors
describes the previous 32-bit ARM architectures, introduces ARMv8, and describes some of the
properties of the ARMv8 processors. Next, Chapter 3 Fundamentals of ARMv8 describes the
building blocks of the architecture in the form of Exception levels and Execution states.
Chapter 4 ARMv8 Registers then describes the registers available to you in the ARMv8
architecture.
One of the most significant changes introduced in the ARMv8 architecture is the addition of a
64-bit instruction set, which complements the existing 32-bit architecture. Chapter 5 An
Introduction to the ARMv8 Instruction Sets describes the differences between the Instruction Set
Architecture (ISA) of ARMv7 (A32), and that of the A64 instruction set. Chapter 6 The A64
instruction set looks at the Instruction Set and its use in more detail. In addition to a new
instruction set for general operation, ARMv8 also has a changed NEON and floating-point
instruction set. Chapter 7 AArch64 Floating-point and NEON describes the changes in ARMv8
to ARM Advanced SIMD (NEON) and floating-point instructions. For a more detailed guide to
NEON and its capabilities at ARMv7, refer to the ARM® NEON™ Programmer’s Guide.
Chapter 8 Porting to A64 of this book covers the problems you might encounter when porting
code from other architectures, or previous ARM architectures to ARMv8. Chapter 9 The ABI
for ARM 64-bit Architecture describes the Application Binary Interface (ABI) for the ARM
architecture specification. The ABI is a specification for all the programming behavior of an
ARM target, which governs the form your 64-bit code takes. Chapter 10 AArch64 Exception
Handling describes the exception handling behavior of ARMv8 in AArch64 state.
Following this, the focus moves to the internal architecture of the processor. Chapter 11 Caches
describes the design of caches and how the use of caches can improve performance.
An important motivating factor behind ARMv8 and moving to a 64-bit architecture is
potentially enabling access to larger address space than is possible using just 32 bits. Chapter 12
The Memory Management Unit describes how the MMU converts virtual memory addresses to
physical addresses.
Chapter 13 Memory Ordering describes the weakly-ordered model of memory in the ARMv8
architecture. Generally, this means that the order of memory accesses is not required to be the
same as the program order for load and store operations. Only some programmers must be aware
of memory ordering issues. If your code interacts directly with the hardware or with code
executing on other cores, directly loads or writes instructions to be executed, or modifies page
tables, then you might have to think about ordering and barriers. This also applies if you are
implementing your own synchronization functions or lock-free algorithms.
Chapter 14 Multi-core processors describes how the ARMv8-A architecture supports systems
with multiple cores. Systems that use the ARMv8 processors are almost always implemented in
such a way. Chapter 15 Power Management describes the hardware features that ARM cores
provide to reduce power use. A further aspect of power management, applied to multi-core and
multi-cluster systems is covered in Chapter 16 big.LITTLE Technology. This chapter describes
how big.LITTLE technology from ARM couples together an energy efficient LITTLE core with
a high performance big core, to provide a system with high performance and power efficiency.
Chapter 17 Security describes how the ARMv8 processors can create a Secure, or trusted system
that protects assets such as passwords or credit card details from unauthorized copying or
damage. The main part of the book then concludes with Chapter 18 Debug describing the
standard debug and trace features available in the Cortex-A53 and Cortex-A57 processors.

Chapter 2
ARMv8-A Architecture and Processors

The ARM architecture dates back to 1985, but it has not stayed static. On the contrary, it has
developed massively since the early ARM cores, adding features and capabilities at each step:
ARMv4 and earlier
These early processors used only the ARM 32-bit instruction set.
ARMv4T

The ARMv4T architecture added the Thumb 16-bit instruction set to the ARM
32-bit instruction set. This was the first widely licensed architecture. It was
implemented by the ARM7TDMI® and ARM9TDMI® processors.

ARMv5TE
The ARMv5TE architecture added improvements for DSP-type operations,
saturated arithmetic, and for ARM and Thumb interworking. The ARM926EJ-S®
implements this architecture.
ARMv6

ARMv6 made several enhancements, including support for unaligned memory
accesses, significant changes to the memory architecture and for multi-processor
support. Additionally, some support for SIMD operations operating on bytes or
halfwords within the 32-bit registers was included. The ARM1136JF-S®
implements this architecture. The ARMv6 architecture also provided some
optional extensions, notably Thumb-2 and Security Extensions (TrustZone®).
Thumb-2 extends Thumb to be a mixed length 16-bit and 32-bit instruction set.

ARMv7-A

The ARMv7-A architecture makes the Thumb-2 extensions mandatory and adds
the Advanced SIMD extensions (NEON). Before ARMv7, all cores conformed to
essentially the same architecture or feature set. To help address an increasing
range of differing applications, ARM introduced a set of architecture profiles:
•

ARMv7-A provides all the features necessary to support a platform
Operating System such as Linux.

•

ARMv7-R provides predictable real-time high-performance.

•

ARMv7-M is targeted at deeply-embedded microcontrollers.
An M profile was also added to the ARMv6 architecture to enable features
for the older architecture. The ARMv6-M profile is used by low-cost
microprocessors with low power consumption.

2.1 ARMv8-A
The ARMv8-A architecture is the latest generation ARM architecture targeted at the
Applications Profile. The name ARMv8 is used to describe the overall architecture, which now
includes both 32-bit execution and 64-bit execution. It introduces the ability to perform
execution with 64-bit wide registers, while preserving backwards compatibility with existing
ARMv7 software.

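[Diagram: the architecture versions v5, v6, v7, and v8, with labelled features VFPv2, Thumb-2,
TrustZone, SIMD, VFPv3/v4, and NEON. ARMv8 keeps ARMv7-A compatibility as a key feature.
AArch32 provides the A32 and T32 ISAs, scalar floating-point (SP and DP), Advanced SIMD (SP
float), and Crypto. AArch64 provides the A64 ISA, scalar floating-point (SP and DP), Advanced
SIMD (SP and DP float), and Crypto.]

Figure 2-1 Development of the ARMv8 architecture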

The ARMv8-A architecture introduces a number of changes, which enable significantly higher
performance processor implementations to be designed.
Large physical address
This enables the processor to access beyond 4GB of physical memory.
64-bit virtual addressing
This enables virtual memory beyond the 4GB limit. This is important for modern
desktop and server software using memory mapped file I/O or sparse addressing.
Automatic event signaling
This enables power-efficient, high-performance spinlocks.
Larger register files
Thirty-one 64-bit general-purpose registers increase performance and reduce
stack use.
Efficient 64-bit immediate generation
There is less need for literal pools.
Large PC-relative addressing range
A +/-4GB addressing range for efficient data addressing within shared libraries
and position-independent executables.
Additional 16KB and 64KB translation granules
This reduces Translation Lookaside Buffer (TLB) miss rates and depth of page
walks.
New exception model
This reduces OS and hypervisor software complexity.
Efficient cache management
User space cache operations improve dynamic code generation efficiency. Fast
Data cache clear using a Data Cache Zero instruction.
Hardware-accelerated cryptography
Provides 3× to 10× better software encryption performance. This is useful for
small granule decryption and encryption too small to offload to a hardware
accelerator efficiently, for example HTTPS.
Load-Acquire, Store-Release instructions
Designed for C++11, C11, Java memory models. They improve performance of
thread-safe code by eliminating explicit memory barrier instructions.
NEON double-precision floating-point advanced SIMD
This enables SIMD vectorization to be applied to a much wider set of algorithms,
for example, scientific computing, High Performance Computing (HPC) and
supercomputers.
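
To make the Load-Acquire, Store-Release item in the list above more concrete, the following is
a minimal C11 sketch of one thread publishing a value and another consuming it, using only
acquire and release ordering and no separate barrier. The names payload, ready, publish, and
try_consume are invented for this example. On AArch64, a compiler can implement the release
store and acquire load with Store-Release and Load-Acquire instructions, but the exact code
generated depends on the toolchain.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Ordinary, non-atomic data that the producer hands to the consumer. */
static int payload;

/* Flag that guards the payload. Static storage, so it starts as false. */
static atomic_bool ready;

/* Producer side: write the data, then set the flag with release ordering.
 * The release store keeps the payload write from being observed after it. */
void publish(int value)
{
    payload = value;
    atomic_store_explicit(&ready, true, memory_order_release);
}

/* Consumer side: read the flag with acquire ordering. If the flag is set,
 * the payload written before the release store is guaranteed to be visible. */
bool try_consume(int *out)
{
    assert(out != NULL);
    if (!atomic_load_explicit(&ready, memory_order_acquire)) {
        return false;   /* nothing published yet */
    }
    *out = payload;
    return true;
}
```

In a real program publish and try_consume would run on different threads. The ordering
guarantee comes from the pairing of the release store with the acquire load on the same flag.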

2.2 ARMv8-A Processor properties
Table 2-1 compares the properties of the processor implementations from ARM that support the
ARMv8-A architecture.
Table 2-1 Comparison of ARMv8-A processors

Processor                         Cortex-A53                    Cortex-A57
Release date                      July 2014                     January 2015
Typical clock speed               2GHz on 28nm                  1.5 to 2.5GHz on 20nm
Execution order                   In-order                      Out of order, speculative issue, superscalar
Cores                             1 to 4                        1 to 4
Integer Peak throughput           2.3MIPS/MHz                   4.1 to 4.76MIPS/MHz (a)
Floating-point Unit               Yes                           Yes
Half-precision                    Yes                           Yes
Hardware Divide                   Yes                           Yes
Fused Multiply Accumulate         Yes                           Yes
Pipeline stages                   8                             15+
Return stack entries              4                             8
Generic Interrupt Controller      External                      External
AMBA interface                    64-bit I/F AMBA 4             128-bit I/F AMBA 4
                                  (Supports AMBA 4 and AMBA 5)  (Supports AMBA 4 and AMBA 5)
L1 Cache size (Instruction)       8KB to 64KB                   48KB
L1 Cache structure (Instruction)  2-way set associative         3-way set associative
L1 Cache size (Data)              8KB to 64KB                   32KB
L1 Cache structure (Data)         4-way set associative         2-way set associative
L2 Cache                          Optional                      Integrated
L2 Cache size                     128KB to 2MB                  512KB to 2MB
L2 Cache structure                16-way set associative        16-way set associative
Main TLB entries                  512                           1024
uTLB entries                      10                            48 I-side, 32 D-side

a. IMPLEMENTATION DEFINED

2.2.1 ARMv8 processors
This section describes each of the processors that implement the ARMv8-A architecture. It only
gives a general description in each case. For more specific information on each processor, see
Table 2-1 on page 2-5.
The Cortex-A53 processor
The Cortex-A53 processor is a mid-range, low-power processor with between one and four
cores in a single cluster, each with an L1 cache subsystem, an optional integrated GICv3/4
interface, and an optional L2 cache controller.
The Cortex-A53 processor is an extremely power efficient processor capable of supporting
32-bit and 64-bit code. It delivers significantly higher performance than the highly successful
Cortex-A7 processor. It is capable of deployment as a standalone applications processor, or
paired with the Cortex-A57 processor in a big.LITTLE configuration for optimum performance,
scalability, and energy efficiency.

[Block diagram with the following labelled components: ARM CoreSight Multicore Debug and Trace,
Generic Interrupt Controller, cores 0 to 3 (NEON Data Engine with crypto extension,
Floating-point unit, Level 1 Instruction Cache, Level 1 Data Cache with ECC, Performance
Monitor Unit, Memory Management Unit, Data Processing Unit), SCU, ACP, Integrated Level 2
Cache with ECC, and AMBA 4 ACE or AMBA 5 CHI Coherent Bus Interface.]

Figure 2-2 Cortex-A53 processor

The Cortex-A53 processor has the following features:

•

In-order, eight stage pipeline.

•

Lower power consumption from the use of hierarchical clock gating, power domains, and
advanced retention modes.

•

Increased dual-issue capability from duplication of execution resources and dual
instruction decoders.

•

Power-optimized L2 cache design delivers lower latency and balances performance with
efficiency.

The Cortex-A57 processor
The Cortex-A57 processor is targeted at mobile and enterprise computing applications
including compute intensive 64-bit applications such as high end computer, tablet, and server
products. It can be used with the Cortex-A53 processor in an ARM big.LITTLE configuration,
for scalable performance and more efficient energy use.
The Cortex-A57 processor features cache coherent interoperability with other processors,
including the ARM Mali™ family of Graphics Processing Units (GPUs) for GPU compute and
provides optional reliability and scalability features for high-performance enterprise
applications. It provides significantly more performance than the ARMv7 Cortex-A15
processor, at a higher level of power efficiency. The inclusion of cryptography extensions
improves performance on cryptography algorithms by 10 times over the previous generation of
processors.

[Block diagram whose labelled components include: ARM CoreSight Multicore Debug and Trace,
Generic Interrupt Controller, cores 0 to 3 (NEON Data Engine with crypto extension,
Floating-point unit, Level 1 Instruction Cache, Level 1 Data Cache with ECC, Performance
Monitor Unit), SCU, ACP, Integrated Level 2 Cache with ECC, and AMBA 4 ACE or AMBA 5 CHI
Coherent Bus Interface.]

Figure 2-3 Cortex-A57 processor core

The Cortex-A57 processor fully implements the ARMv8-A architecture. It supports multi-core
operation, with between one and four cores in a single cluster. Multiple
coherent SMP clusters are possible, through AMBA 5 CHI or AMBA 4 ACE technology. Debug
and trace are available through CoreSight technology.
The Cortex-A57 processor has the following features:
•

Out-of-order, 15+ stage pipeline.

•

Power-saving features include way-prediction, tag-reduction, and cache-lookup
suppression.

•

Increased peak instruction throughput through duplication of execution resources.
Power-optimized instruction decode with localized decoding, 3-wide decode bandwidth.

•

Performance optimized L2 cache design enables more than one core in the cluster to
access the L2 at the same time.

Chapter 3
Fundamentals of ARMv8

In ARMv8, execution occurs at one of four Exception levels. In AArch64, the Exception level
determines the level of privilege, in a similar way to the privilege levels defined in ARMv7,
so execution at ELn corresponds to privilege PLn. An Exception level with a larger value of n
than another is described as being at a higher Exception level. An Exception level with a
smaller value of n than another is described as being at a lower Exception level.
Exception levels provide a logical separation of software execution privilege that applies across
all operating states of the ARMv8 architecture. It is similar to, and supports the concept of,
hierarchical protection domains common in computer science.
The following is a typical example of what software runs at each Exception level:

EL0    Normal user applications.
EL1    Operating system kernel, typically described as privileged.
EL2    Hypervisor.
EL3    Low-level firmware, including the Secure Monitor.

[Diagram: in the Normal world, applications run at EL0, kernels at EL1, and the hypervisor at
EL2. The Secure monitor is at EL3.]

Figure 3-1 Exception levels

In general, a piece of software, such as an application, the kernel of an operating system, or a
hypervisor, occupies a single Exception level. An exception to this rule is in-kernel hypervisors
such as KVM, which operate across both EL2 and EL1.
ARMv8-A provides two security states, Secure and Non-secure. The Non-secure state is also
referred to as the Normal World. This enables an Operating System (OS) to run in parallel with
a trusted OS on the same hardware, and provides protection against certain software attacks and
hardware attacks. ARM TrustZone technology enables the system to be partitioned between the
Normal and Secure worlds. As with the ARMv7-A architecture, the Secure monitor acts as a
gateway for moving between the Normal and Secure worlds.

[Diagram: the Normal world has applications at EL0, guest OSs at EL1, and a hypervisor at EL2.
The Secure world has Secure firmware and a Trusted OS, and no hypervisor. The Secure monitor
is at EL3.]

Figure 3-2 ARMv8 Exception levels in the Normal and Secure worlds

ARMv8-A also provides support for virtualization, though only in the Normal world. This
means that hypervisor, or Virtual Machine Manager (VMM) code can run on the system and
host multiple guest operating systems. Each of the guest operating systems is, essentially,
running on a virtual machine. Each OS is then unaware that it is sharing time on the system with
other guest operating systems.

The Normal world (which corresponds to the Non-secure state) has the following privileged
components:
Guest OS kernels
Such kernels include Linux or Windows running in Non-secure EL1. When
running under a hypervisor, the rich OS kernels can be running as a guest or host
depending on the hypervisor model.
Hypervisor
This runs at EL2, which is always Non-secure. The hypervisor, when present and
enabled, provides virtualization services to rich OS kernels.
The Secure world has the following privileged components:
Secure firmware
On an application processor, this firmware must be the first thing that runs at boot
time. It provides several services, including platform initialization, the
installation of the trusted OS, and routing of Secure monitor calls.
Trusted OS
Trusted OS provides Secure services to the Normal world and provides a runtime
environment for executing Secure or trusted applications.
The Secure monitor in the ARMv8 architecture is at a higher Exception level and is more
privileged than all other levels. This provides a logical model of software privilege.
Figure 3-2 on page 3-2 shows that a Secure version of EL2 is not available.

3.1 Execution states
The ARMv8 architecture defines two Execution States, AArch64 and AArch32. Each state is
used to describe execution using 64-bit wide general-purpose registers or 32-bit wide
general-purpose registers, respectively. While ARMv8 AArch32 retains the ARMv7 definitions
of privilege, in AArch64, privilege level is determined by the Exception level. Therefore,
execution at ELn corresponds to privilege PLn.
When in AArch64 state, the processor executes the A64 instruction set. When in AArch32 state,
the processor can execute either the A32 (called ARM in earlier versions of the architecture) or
the T32 (Thumb) instruction set.
The following diagrams show the organization of the Exception levels in AArch64 and
AArch32.
In AArch64:

[Diagram: the Normal world has applications at EL0, guest OSs at EL1, and a hypervisor at EL2.
The Secure world has Secure firmware and a Trusted OS, and no hypervisor. The Secure monitor
is at EL3.]

Figure 3-3 Exception levels in AArch64

In AArch32:

[Diagram: the Normal world has applications at EL0, guest OSs at EL1, and a hypervisor at EL2.
The Secure world has Secure firmware and a Trusted kernel that operates at EL3, and there is
no EL2 in the Secure world. The Secure monitor is at EL3.]

Figure 3-4 Exception levels in AArch32

In AArch32 state, Trusted OS software executes in Secure EL3, and in AArch64 state it
primarily executes in Secure EL1.

3.2 Changing Exception levels
In the ARMv7 architecture, the processor mode can change under privileged software control
or automatically when taking an exception. When an exception occurs, the core saves the
current execution state and the return address, enters the required mode, and possibly disables
hardware interrupts.
This is summarized in the following table. Applications operate at the lowest level of privilege,
PL0, previously unprivileged mode. Operating systems run at PL1, and the Hypervisor in a
system with the Virtualization extensions at PL2. The Secure monitor, which acts as a gateway
for moving between the Secure and Non-secure (Normal) worlds, also operates at PL1.
Table 3-1 ARMv7 processor modes

Mode              Function                                                    Security state   Privilege level
User (USR)        Unprivileged mode in which most applications run            Both             PL0
FIQ               Entered on an FIQ interrupt exception                       Both             PL1
IRQ               Entered on an IRQ interrupt exception                       Both             PL1
Supervisor (SVC)  Entered on reset or when a Supervisor Call instruction      Both             PL1
                  (SVC) is executed
Monitor (MON)     Entered when the SMC instruction (Secure Monitor Call) is   Secure only      PL1
                  executed or when the processor takes an exception which is
                  configured for secure handling. Provided to support
                  switching between Secure and Non-secure states.
Abort (ABT)       Entered on a memory access exception                        Both             PL1
Undef (UND)       Entered when an undefined instruction is executed           Both             PL1
System (SYS)      Privileged mode, sharing the register view with User mode   Both             PL1
Hyp (HYP)         Entered by the Hypervisor Call and Hyp Trap exceptions.     Non-secure only  PL2

[Diagram: ARMv7 modes grouped by security state and privilege level.
Non-secure state: PL0 has User mode. PL1 has System (SYS), Supervisor (SVC), FIQ, IRQ, Undef
(UND), and Abort (ABT) modes. PL2 has Hyp mode.
Secure state: PL0 has User mode. PL1 has System (SYS), Supervisor (SVC), FIQ, IRQ, Undef
(UND), and Abort (ABT) modes, and Monitor mode (MON).]

Figure 3-5 ARMv7 privilege levels

In AArch64, the processor modes are mapped onto the Exception levels as in Figure 3-6. As in
ARMv7 (AArch32) when an exception is taken, the processor changes to the Exception level
(mode) that supports the handling of the exception.

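[Diagram: the processor modes set against the Exception levels. User mode corresponds to EL0.
SVC, ABT, IRQ, FIQ, UND, and SYS modes correspond to EL1. Hyp mode corresponds to EL2. Mon
mode corresponds to EL3. The Normal world has applications, guest OSs, and a hypervisor. The
Secure world has Secure firmware and a Trusted OS, and no hypervisor. The Secure monitor is at
EL3.]

Figure 3-6 AArch32 processor modes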
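
The correspondence between processor modes and Exception levels in Figure 3-6 can be written
down as a small lookup. The following C sketch does this. The enumeration and function names
are invented for this example, and the sketch covers only the simple arrangement in the figure.
It does not model the case described in Execution states, where Trusted OS software in AArch32
state executes in Secure EL3.

```c
#include <assert.h>

/* The processor modes listed in Table 3-1. The identifiers are illustrative. */
typedef enum {
    MODE_USR, MODE_FIQ, MODE_IRQ, MODE_SVC, MODE_MON,
    MODE_ABT, MODE_UND, MODE_SYS, MODE_HYP
} aarch32_mode;

/* Return the Exception level (0 to 3) that a mode corresponds to,
 * following the arrangement shown in Figure 3-6. */
int exception_level_for_mode(aarch32_mode mode)
{
    switch (mode) {
    case MODE_USR:
        return 0;       /* User mode: applications */
    case MODE_FIQ:
    case MODE_IRQ:
    case MODE_SVC:
    case MODE_ABT:
    case MODE_UND:
    case MODE_SYS:
        return 1;       /* operating system modes */
    case MODE_HYP:
        return 2;       /* hypervisor */
    case MODE_MON:
        return 3;       /* Secure monitor */
    }
    assert(0 && "unknown mode");
    return -1;
}
```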

Movement between Exception levels follows these rules:

•

Moves to a higher Exception level, such as from EL0 to EL1, indicate increased software
execution privilege.

•

An exception cannot be taken to a lower Exception level.

•

There is no exception handling at level EL0; exceptions must be handled at a higher
Exception level.

•

An exception causes a change of program flow. Execution of an exception handler starts,
at an Exception level higher than EL0, from a defined vector that relates to the exception
taken. Exceptions include:
— Interrupts such as IRQ and FIQ.
— Memory system aborts.
— Undefined instructions.
— System calls. These permit unprivileged software to make a system call to an
operating system.
— Secure monitor or hypervisor traps.

•

Ending exception handling and returning to the previous Exception level is performed by
executing the ERET instruction.

•

Returning from an exception can stay at the same Exception level or enter a lower
Exception level. It cannot move to a higher Exception level.

•

The security state does not change with a change of Exception level, except when returning
from EL3 to a Non-secure state. See Switching between Secure and Non-secure state on
page 17-8.
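
The rules above for moving between Exception levels can be summarized as two checks. The
following C sketch is a simplified model and not a description of the hardware: it encodes
only the rules that an exception is taken to the same or a higher level and never to EL0, and
that an exception return goes to the same or a lower level. The type and function names are
invented for this example.

```c
#include <assert.h>
#include <stdbool.h>

/* The four Exception levels, numbered as in the text. */
typedef enum { EL0 = 0, EL1 = 1, EL2 = 2, EL3 = 3 } exception_level;

/* An exception cannot be taken to a lower Exception level, and there is
 * no exception handling at EL0, so the target is never EL0. */
bool exception_entry_allowed(exception_level from, exception_level to)
{
    assert(from <= EL3 && to <= EL3);
    return to != EL0 && to >= from;
}

/* Returning from an exception stays at the same Exception level or enters
 * a lower one. Handlers never run at EL0, so a return never starts there. */
bool exception_return_allowed(exception_level from, exception_level to)
{
    assert(from <= EL3 && to <= EL3);
    return from != EL0 && to <= from;
}
```

For example, a system call from an application is an entry from EL0 to EL1, and the matching
return goes from EL1 back to EL0. Routing controls that decide which higher level actually
handles a given exception are outside the scope of this sketch.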

3.3 Changing execution state
There are times when you must change the execution state of your system. This could be, for
example, if you are running a 64-bit operating system, and want to run a 32-bit application at
EL0. To do this, the system must change to AArch32.
When the application has completed or execution returns to the OS, the system can switch back
to AArch64. Figure 3-7 on page 3-9 shows that you cannot do it the other way around. An
AArch32 operating system cannot host a 64-bit application.
To change between execution states at the same Exception level, you have to switch to a higher
Exception level then return to the original Exception level. For example, you might have 32-bit
and 64-bit applications running under a 64-bit OS. In this case, the 32-bit application can
execute and generate a Supervisor Call (SVC) instruction, or receive an interrupt, causing a
switch to EL1 and AArch64. (See Exception handling instructions on page 6-21.) The OS can
then do a task switch and return to EL0 in AArch64. Practically speaking, this means that you
cannot have a mixed 32-bit and 64-bit application, because there is no direct way of calling
between them.
You can only change execution state by changing Exception level. Taking an exception might
change from AArch32 to AArch64, and returning from an exception may change from AArch64
to AArch32.
Code at EL3 cannot take an exception to a higher Exception level, so cannot change execution
state, except by going through a reset.
The following is a summary of some of the points when changing between AArch64 and
AArch32 execution states:
•

Both AArch64 and AArch32 execution states have Exception levels that are generally
similar, but there are some differences between Secure and Non-secure operation. The
execution state the processor is in when the exception is generated can limit the Exception
levels available to the other execution state.

•

Changing to AArch32 requires going from a higher to a lower Exception level. This is the
result of exiting an exception handler by executing the ERET instruction. See Exception
handling instructions on page 6-21.

•

Changing to AArch64 requires going from a lower to a higher Exception level. The
exception can be the result of an instruction execution or an external signal.

•

If, when taking an exception or returning from an exception, the Exception level remains
the same, the execution state cannot change.

•

Where an ARMv8 processor operates in AArch32 execution state at a particular
Exception level, it uses the same exception model as in ARMv7 for exceptions taken to
that Exception level. In the AArch64 execution state, it uses the exception handling model
described in Chapter 10 AArch64 Exception Handling.

Interworking between the two states is therefore performed at the level of the Secure monitor,
hypervisor or operating system. A hypervisor or operating system executing in AArch64 state
can support AArch32 operation at lower privilege levels. This means that an OS running in
AArch64 can host both AArch32 and AArch64 applications. Similarly, an AArch64 hypervisor
can host both AArch32 and AArch64 guest operating systems. However, a 32-bit operating
system cannot host a 64-bit application and a 32-bit hypervisor cannot host a 64-bit guest
operating system.
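
The rules in this section can be condensed into two predicates. The following C sketch is an
informal model under the assumption that only the rules stated above apply. The type and
function names are invented for this example.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { AARCH32, AARCH64 } execution_state;

/* An AArch64 OS or hypervisor can host AArch32 and AArch64 software at a
 * lower Exception level. An AArch32 host can only host AArch32 software. */
bool can_host(execution_state host, execution_state guest)
{
    return host == AARCH64 || guest == AARCH32;
}

/* Whether a move between Exception levels (numbered 0 to 3) may be
 * accompanied by the given change of execution state. */
bool state_change_allowed(int from_el, int to_el,
                          execution_state from, execution_state to)
{
    assert(from_el >= 0 && from_el <= 3 && to_el >= 0 && to_el <= 3);
    if (from == to) {
        return true;                /* no change of execution state */
    }
    if (to == AARCH32) {
        return to_el < from_el;     /* only on return to a lower level */
    }
    return to_el > from_el;         /* to AArch64 only on entry to a higher level */
}
```

For instance, a 32-bit application at EL0 that makes a system call into a 64-bit OS corresponds
to state_change_allowed(0, 1, AARCH32, AARCH64), and the return to the application corresponds
to state_change_allowed(1, 0, AARCH64, AARCH32).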

[Diagram: applications at EL0, operating systems at EL1, and a hypervisor at EL2, annotated as
follows:
An AArch64 OS can host a mix of AArch64 and AArch32 applications.
An AArch64 hypervisor can host an AArch64 and AArch32 OS.
An AArch32 OS cannot host an AArch64 application.
An AArch32 hypervisor cannot host an AArch64 OS.]

Figure 3-7 Moving between AArch32 and AArch64

For the highest implemented Exception level (EL3 on the Cortex-A53 and Cortex-A57
processors), the execution state used when taking an exception to that level is fixed. It can
only be changed by resetting the processor. For EL2 and EL1, the execution state is controlled
by System registers. See System registers on page 4-7.

Chapter 4
ARMv8 Registers

The AArch64 execution state provides 31 × 64-bit general-purpose registers accessible at all
times and in all Exception levels.
Each register is 64 bits wide and they are generally referred to as registers X0-X30.

[Diagram: registers X0/W0 to X30/W30, available at EL0, EL1, EL2, and EL3. X29 is labelled as
the frame pointer and X30 as the procedure link register.]

Figure 4-1 AArch64 general-purpose registers

Each AArch64 64-bit general-purpose register (X0-X30) also has a 32-bit (W0-W30) form.

[Figure: Xn spans bits [63:0] of the register. Wn occupies bits [31:0].]

Figure 4-2 64-bit register with W and X access.

The 32-bit W register forms the lower half of the corresponding 64-bit X register. That is, W0
maps onto the lower word of X0, and W1 maps onto the lower word of X1.
Reads from W registers disregard the higher 32 bits of the corresponding X register and leave
them unchanged. Writes to W registers set the higher 32 bits of the X register to zero. That is,
writing 0xFFFFFFFF into W0 sets X0 to 0x00000000FFFFFFFF.
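
The read and write behavior of the W view can be modeled in a few lines of C. This is a sketch under the assumption that a register is simply 64 bits of storage; `gp_reg`, `read_w` and `write_w` are hypothetical names used for illustration.

```c
#include <stdint.h>

/* Hypothetical model of one general-purpose register: 64 bits of storage. */
typedef struct { uint64_t x; } gp_reg;

/* Reading Wn returns the low 32 bits and leaves Xn unchanged. */
static uint32_t read_w(const gp_reg *r)
{
    return (uint32_t)r->x;
}

/* Writing Wn zero-extends, so the upper 32 bits of Xn become zero. */
static void write_w(gp_reg *r, uint32_t value)
{
    r->x = (uint64_t)value;
}
```

Writing 0xFFFFFFFF through `write_w` leaves `x` holding 0x00000000FFFFFFFF, matching the example in the text.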

4.1 AArch64 special registers
In addition to the 31 core registers, there are also several special registers.

[Figure: the special registers are the zero register (XZR/WZR), the program counter (PC), and,
per Exception level, the stack pointer (SP_EL0, SP_EL1, SP_EL2, SP_EL3), the Program Status
Register (SPSR_EL1, SPSR_EL2, SPSR_EL3), and the Exception Link Register (ELR_EL1, ELR_EL2,
ELR_EL3).]

Figure 4-3 AArch64 special registers

Note
There is no register called X31 or W31. Many instructions are encoded such that the number 31
represents the zero register, ZR (WZR/XZR). There is also a restricted group of instructions
where one or more of the arguments are encoded such that number 31 represents the Stack
Pointer (SP).
When accessing the zero register, all writes are ignored and all reads return 0. Note that the
64-bit form of the SP register does not use an X prefix.
Table 4-1 Special registers in AArch64

Name    Size       Description
WZR     32 bits    Zero register
XZR     64 bits    Zero register
WSP     32 bits    Current stack pointer
SP      64 bits    Current stack pointer
PC      64 bits    Program counter

In the ARMv8 architecture, when executing in AArch64, the exception return state is held in the
following dedicated registers for each Exception level:
•   Exception Link Register (ELR).
•   Saved Program Status Register (SPSR).

There is a dedicated SP per Exception level, but it is not used to hold return state.

Table 4-2 Special registers by Exception level

                                        EL0       EL1         EL2         EL3
Stack Pointer (SP)                      SP_EL0    SP_EL1      SP_EL2      SP_EL3
Exception Link Register (ELR)           -         ELR_EL1     ELR_EL2     ELR_EL3
Saved Program Status Register (SPSR)    -         SPSR_EL1    SPSR_EL2    SPSR_EL3

4.1.1 Zero register
The zero register reads as zero when used as a source register and discards the result when used
as a destination register. You can use the zero register in most, but not all, instructions.
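
The zero register semantics can be sketched with a small register-file model. This assumes, as the earlier note describes, that register number 31 selects the zero register in the instruction forms being modeled; `reg_file`, `reg_read` and `reg_write` are hypothetical names, and the caller is assumed to pass a register number from 0 to 31.

```c
#include <stdint.h>

#define ZR_INDEX 31u   /* in this model, number 31 selects XZR/WZR */

/* Storage for X0-X30 only: there is no X31. */
typedef struct { uint64_t x[31]; } reg_file;

/* Reads of the zero register always return 0. */
static uint64_t reg_read(const reg_file *rf, unsigned n)
{
    return (n == ZR_INDEX) ? 0 : rf->x[n];
}

/* Writes to the zero register are discarded. */
static void reg_write(reg_file *rf, unsigned n, uint64_t value)
{
    if (n != ZR_INDEX)
        rf->x[n] = value;
}
```

Instruction forms where number 31 means the stack pointer would need a separate path, which this sketch leaves out.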

4.1.2 Stack pointer
In the ARMv8 architecture, the choice of stack pointer to use is separated to some extent from
the Exception level. By default, taking an exception selects the stack pointer for the target
Exception level, SP_ELn. For example, taking an exception to EL1 selects SP_EL1. Each
Exception level has its own stack pointer, SP_EL0, SP_EL1, SP_EL2, and SP_EL3.
When in AArch64 at an Exception level other than EL0, the processor can use either:
•   A dedicated 64-bit stack pointer associated with that Exception level (SP_ELn).
•   The stack pointer associated with EL0 (SP_EL0).

EL0 can only ever access SP_EL0.

Table 4-3 AArch64 Stack pointer options

Exception level    Options
EL0                EL0t
EL1                EL1t, EL1h
EL2                EL2t, EL2h
EL3                EL3t, EL3h

The t suffix indicates that the SP_EL0 stack pointer is selected. The h suffix indicates that the
SP_ELn stack pointer is selected.
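
The selection rule can be sketched in C as follows. The function names are hypothetical, and `spsel` stands for the stack pointer selector (0 for SP_EL0, 1 for SP_ELn). The sketch assumes `el` is in the range 0 to 3.

```c
#include <stdio.h>

/*
 * Returns n such that SP_ELn is the stack pointer in use.
 * EL0 can only ever use SP_EL0. Higher levels use SP_EL0 when the
 * selector is 0 and their own dedicated stack pointer otherwise.
 */
static unsigned selected_sp(unsigned el, unsigned spsel)
{
    if (el == 0 || spsel == 0)
        return 0;
    return el;
}

/* Builds the option name used in Table 4-3, for example "EL1h". */
static void mode_name(unsigned el, unsigned spsel, char buf[5])
{
    char suffix = (selected_sp(el, spsel) == 0) ? 't' : 'h';
    snprintf(buf, 5, "EL%u%c", el, suffix);
}
```

For example, `mode_name(1, 1, buf)` produces `EL1h`, and `mode_name(0, 1, buf)` still produces `EL0t` because EL0 has no dedicated alternative.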
The SP cannot be referenced by most instructions. However, some forms of arithmetic
instructions, for example, the ADD instruction, can read and write the current stack pointer to
adjust the stack pointer in a function. For example:

ADD SP, SP, #0x10    // Increase SP by 0x10 bytes from its current value

4.1.3 Program Counter
One feature of the original ARMv7 instruction set was the use of R15, the Program Counter
(PC) as a general-purpose register. The PC enabled some clever programming tricks, but it
introduced complications for compilers and the design of complex pipelines. Removing direct
access to the PC in ARMv8 makes return prediction easier and simplifies the ABI specification.
The PC is never accessible as a named register. Its use is implicit in certain instructions such as
PC-relative load and address generation. The PC cannot be specified as the destination of a data
processing instruction or load instruction.

4.1.4 Exception Link Register (ELR)
The Exception Link Register holds the exception return address.

4.1.5 Saved Program Status Register
When taking an exception, the processor state is stored in the relevant Saved Program Status
Register (SPSR), in a similar way to the CPSR in ARMv7. The SPSR holds the value of PSTATE
before taking an exception and is used to restore the value of PSTATE when executing an
exception return.

[Figure: SPSR bit assignments. Bits [31:28] hold N, Z, C, and V. Bit [21] is SS and bit [20] is
IL. Bits [9:6] hold D, A, I, and F. Bit [4] is M[4] and bits [3:0] are M[3:0].]

Figure 4-4 SPSR

The individual bits represent the following values for AArch64:

N         Negative result (N flag).
Z         Zero result (Z flag).
C         Carry out (C flag).
V         Overflow (V flag).
SS        Software Step. Indicates whether software step was enabled when an exception
          was taken.
IL        Illegal Execution State bit. Shows the value of PSTATE.IL immediately before
          the exception was taken.
D         Process state Debug mask. Indicates whether debug exceptions from watchpoint,
          breakpoint, and software step debug events that are targeted at the Exception level
          the exception occurred in were masked or not.
A         SError (System Error) mask bit.
I         IRQ mask bit.
F         FIQ mask bit.
M[4]      Execution state that the exception was taken from. A value of 0 indicates
          AArch64.
M[3:0]    Mode or Exception level that an exception was taken from.

In ARMv8, the SPSR that is written depends on the Exception level that the exception is taken
to. If the exception is taken to EL1, then SPSR_EL1 is used. If the exception is taken to EL2,
then SPSR_EL2 is used, and if the exception is taken to EL3, SPSR_EL3 is used. The core
populates the SPSR when taking an exception.
Note
The register pairs ELR_ELn and SPSR_ELn that are associated with an Exception level retain
their state during execution at a lower Exception level.
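
As an illustration of the field list above, the following C sketch decodes an SPSR value for an exception taken from AArch64. The bit positions are an assumption based on the layout shown in Figure 4-4 as the ARMv8-A architecture defines it (N, Z, C, V in bits 31 to 28; D, A, I, F in bits 9 to 6; M[4] in bit 4), and the split of M[3:0] into an Exception level in M[3:2] and a stack pointer select in M[0] is likewise taken from the architecture rather than from this section. The `spsr_fields` type and `decode_spsr` name are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool n, z, c, v;      /* condition flags, bits 31 to 28 */
    bool d, a, i, f;      /* exception mask bits, bits 9 to 6 */
    bool from_aarch32;    /* M[4]: 0 means the exception came from AArch64 */
    unsigned el;          /* M[3:2]: meaningful only when from_aarch32 is false */
    bool uses_sp_eln;     /* M[0]: true selects SP_ELn (h), false SP_EL0 (t) */
} spsr_fields;

/* Splits a 32-bit SPSR value into the fields listed above. */
static spsr_fields decode_spsr(uint32_t spsr)
{
    spsr_fields s;
    s.n = (spsr >> 31) & 1u;
    s.z = (spsr >> 30) & 1u;
    s.c = (spsr >> 29) & 1u;
    s.v = (spsr >> 28) & 1u;
    s.d = (spsr >> 9) & 1u;
    s.a = (spsr >> 8) & 1u;
    s.i = (spsr >> 7) & 1u;
    s.f = (spsr >> 6) & 1u;
    s.from_aarch32 = (spsr >> 4) & 1u;
    s.el = (spsr >> 2) & 3u;
    s.uses_sp_eln = spsr & 1u;
    return s;
}
```

With these assumptions, a value whose low bits are 0x3C5 decodes as an exception taken from EL1h with D, A, I, and F all masked.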

4.2 Processor state
AArch64 does not have a direct equivalent of the ARMv7 Current Program Status Register
(CPSR). In AArch64, the components of the traditional CPSR are supplied as fields that can be
made accessible independently. These are referred to collectively as Processor State (PSTATE).
The Processor State, or PSTATE fields, for AArch64 have the following definitions:
Table 4-4 PSTATE field definitions

Name      Description
N         Negative condition flag.
Z         Zero condition flag.
C         Carry condition flag.
V         oVerflow condition flag.
D         Debug mask bit.
A         SError mask bit.
I         IRQ mask bit.
F         FIQ mask bit.
SS        Software Step bit.
IL        Illegal execution state bit.
EL (2)    Exception level.
nRW       Execution state. 0 = 64-bit, 1 = 32-bit.
SP        Stack Pointer selector. 0 = SP_EL0, 1 = SP_ELn.

In AArch64, you return from an exception by executing the ERET instruction, which causes the
SPSR_ELn to be copied into PSTATE. This restores the ALU flags, execution state, and Exception
level. The processor then branches to the address in ELR_ELn and continues execution from
there.
The PSTATE.{N, Z, C, V} fields can be accessed at EL0. All other PSTATE fields can be accessed
at EL1 or higher and are UNDEFINED at EL0.

4.3 System registers
In AArch64, system configuration is controlled through system registers, and accessed using
MSR and MRS instructions. This contrasts with ARMv7-A, where such registers were typically
accessed through coprocessor 15 (CP15) operations. The name of a register tells you the lowest
Exception level that it can be accessed from.
For example:
•   TTBR0_EL1 is accessible from EL1, EL2, and EL3.
•   TTBR0_EL2 is accessible from EL2 and EL3.

Registers that have the suffix _ELn have a separate, banked copy in some or all of the levels,
though usually not EL0. Few system registers are accessible from EL0, although the Cache Type
Register (CTR_EL0) is an example of one that can be accessible.
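
The naming rule can be sketched as a small string check. This is an illustration of the convention only: `lowest_el` and `accessible_from` are hypothetical helpers, and real accessibility also depends on trap and enable controls that this sketch ignores.

```c
#include <string.h>

/*
 * Returns the n from a trailing "_ELn" suffix (0 to 3), or -1 when the
 * register name carries no such suffix, for example FPCR.
 */
static int lowest_el(const char *name)
{
    size_t len = strlen(name);
    if (len < 4)
        return -1;
    const char *suffix = name + len - 4;
    if (strncmp(suffix, "_EL", 3) != 0 || suffix[3] < '0' || suffix[3] > '3')
        return -1;
    return suffix[3] - '0';
}

/* Nonzero if, by the naming convention alone, current_el may access name. */
static int accessible_from(const char *name, int current_el)
{
    int el = lowest_el(name);
    return el >= 0 && current_el >= el;
}
```

Under this rule `accessible_from("TTBR0_EL2", 1)` is 0, while `accessible_from("TTBR0_EL1", 3)` is 1, matching the two examples in the text.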
Code to access system registers takes the following form:
MRS x0, TTBR0_EL1    // Move TTBR0_EL1 into x0
MSR TTBR0_EL1, x0    // Move x0 into TTBR0_EL1

Previous versions of the ARM architecture have used coprocessors for system configuration.
However, AArch64 does not include support for coprocessors. Table 4-5 lists only the system
registers mentioned in this book.
For a complete list, see Appendix J of the ARM Architecture Reference Manual - ARMv8, for
ARMv8-A architecture profile.
The table shows the Exception levels that have separate copies of each register. For example,
separate Auxiliary Control Registers (ACTLRs) exist as ACTLR_EL1, ACTLR_EL2 and
ACTLR_EL3.
Table 4-5 System registers

Name (allowed values of n): Register
    Description

ACTLR_ELn (n: 1, 2, 3): Auxiliary Control Register
    Controls processor-specific features.

CCSIDR_ELn (n: 1): Current Cache Size ID Register
    Provides information about the architecture of the currently selected cache. See Cache
    discovery on page 11-18.

CLIDR_ELn (n: 1, 2, 3): Cache Level ID Register
    The type of cache, or caches, implemented at each level. The Level of Coherency and Level
    of Unification for the cache hierarchy. See Cache maintenance on page 11-13.

CNTFRQ_ELn (n: 0): Counter-timer Frequency Register
    Reports the frequency of the system timer. See Timers on page 14-5.

CNTPCT_ELn (n: 0): Counter-timer Physical Count Register
    Holds the 64-bit current count value. See Timers on page 14-5.

CNTKCTL_ELn (n: 1): Counter-timer Kernel Control Register
    Controls the generation of an event stream from the virtual counter. Also controls access
    from EL0 to the physical counter, virtual counter, EL1 physical timers, and the virtual
    timer. See Timers on page 14-5.

CNTP_CVAL_ELn (n: 0): Counter-timer Physical Timer Compare Value Register
    Holds the compare value for the EL1 physical timer. See Timers on page 14-5.

CPACR_ELn (n: 1): Coprocessor Access Control Register
    Controls access to Trace, floating-point, and NEON functionality. See ISB in more detail
    on page 13-9.

CSSELR_ELn (n: 1): Cache Size Selection Register
    Selects the current Cache Size ID Register, CCSIDR_EL1, by specifying the required cache
    level and the cache type, either instruction or data cache. See Cache discovery on
    page 11-18.

CNTP_CTL_ELn (n: 0): Counter-timer Physical Control Register
    Control register for the EL1 physical timer. See Timers on page 14-5.

CTR_ELn (n: 0): Cache Type Register
    Information about the architecture of the integrated caches. See Cache discovery on
    page 11-18.

DCZID_ELn (n: 0): Data Cache Zero ID Register
    Indicates the block size written with byte values of 0 by the Data Cache Zero by Virtual
    Address (DC ZVA) system instruction. See Cache discovery on page 11-18.

ELR_ELn (n: 1, 2, 3): Exception Link Register
    Holds the address of the instruction which caused the exception.

ESR_ELn (n: 1, 2, 3): Exception Syndrome Register
    Includes information about the reasons for the exception. See The Exception Syndrome
    Register on page 10-9.

FAR_ELn (n: 1, 2, 3): Fault Address Register
    Holds the virtual faulting address. See Handling synchronous exceptions on page 10-7.

FPCR (n: -): Floating-point Control Register
    Controls floating-point extension behavior. The fields in this register map to the
    equivalent fields in the AArch32 FPSCR. See New features for NEON and Floating-point in
    AArch64 on page 7-2.

FPSR (n: -): Floating-point Status Register
    Provides floating-point system status information. The fields in this register map to the
    equivalent fields in the AArch32 FPSCR. See New features for NEON and Floating-point in
    AArch64 on page 7-2.

HCR_ELn (n: 2): Hypervisor Configuration Register
    Controls virtualization settings and trapping of exceptions to EL2. See Exception handling
    on page 18-8.

MAIR_ELn (n: 1, 2, 3): Memory Attribute Indirection Register
    Provides the memory attribute encodings corresponding to the possible values in a
    Long-descriptor format translation table entry for stage 1 translations at ELn. See Memory
    types on page 13-3.

MIDR_ELn (n: 1): Main ID Register
    The type of processor the code is running on (part number and revision).

MPIDR_ELn (n: 1): Multiprocessor Affinity Register
    The processor and cluster IDs, in multi-core or cluster systems. See Determining which
    core the code is running on on page 14-3.

SCR_ELn (n: 3): Secure Configuration Register
    Controls Secure state and trapping of exceptions to EL3. See Handling synchronous
    exceptions on page 10-7.

SCTLR_ELn (n: 0, 1, 2, 3): System Control Register
    Controls architectural features, for example the MMU, caches and alignment checking.

SPSR_ELn (n: abt, fiq, irq, und, 1, 2, 3): Saved Program Status Register
    Holds the saved processor state when an exception is taken to this mode or Exception
    level.

TCR_ELn (n: 1, 2, 3): Translation Control Register
    Determines which of the Translation Table Base Registers define the base address for a
    translation table walk required for the stage 1 translation of a memory access from ELn.
    Also controls the translation table format and holds cacheability and shareability
    information. See Separation of kernel and application Virtual Address spaces on page 12-7.

TPIDR_ELn (n: 0, 1, 2, 3): User Read/Write Thread ID Register
    Provides a location where software executing at ELn can store thread identifying
    information, for OS management purposes. See Context switching on page 12-27.

TPIDRRO_ELn (n: 0): User Read-Only Thread ID Register
    Provides a location where software executing at EL1 or higher can store thread identifying
    information. This information is visible to software executing at EL0, for OS management
    purposes. See Context switching on page 12-27.

TTBR0_ELn (n: 1, 2, 3): Translation Table Base Register 0
    Holds the base address of translation table 0, and information about the memory it
    occupies. This is one of the translation tables for the stage 1 translation of memory
    accesses at ELn. See Separation of kernel and application Virtual Address spaces on
    page 12-7.

TTBR1_ELn (n: 1): Translation Table Base Register 1
    Holds the base address of translation table 1, and information about the memory it
    occupies. This is one of the translation tables for the stage 1 translation of memory
    accesses at EL0 and EL1. See Separation of kernel and application Virtual Address spaces
    on page 12-7.

VBAR_ELn (n: 1, 2, 3): Vector Base Address Register
    Holds the exception base address for any exception that is taken to ELn. See AArch64
    exception table on page 10-12.

VTCR_ELn (n: 2): Virtualization Translation Control Register
    Controls the translation table walks required for the stage 2 translation of memory
    accesses from Non-secure EL0 and EL1. Also holds cacheability and shareability information
    for the accesses. See Translations at EL2 and EL3 on page 12-20.

VTTBR_ELn (n: 2): Virtualization Translation Table Base Register
    Holds the base address of the translation table for the stage 2 translation of memory
    accesses from Non-secure EL0 and EL1. See Memory translation on page 18-3.

4.3.1 The system control register
The System Control Register (SCTLR) is a register that controls standard memory, system
facilities and provides status information for functions that are implemented in the core.

[Figure: SCTLR bit assignments, with the bit number of each field in brackets.
SCTLR_EL1: UCI [26], EE [25], E0E [24], WXN [19], nTWE [18], nTWI [16], UCT [15], DZE [14],
I [12], UMA [9], SED [8], ITD [7], CP15BEN [5], SA0 [4], SA [3], C [2], A [1], M [0].
SCTLR_EL2 and SCTLR_EL3: EE [25], WXN [19], I [12], SA [3], C [2], A [1], M [0].]

Figure 4-5 SCTLR bit assignments

Not all bits are available above EL1. The individual bits represent the following:

UCI        When set, enables EL0 access in AArch64 for DC CVAU, DC CIVAC, DC CVAC, and
           IC IVAU instructions. See Cache maintenance on page 11-13.

EE         Exception endianness. See Endianness on page 4-12.
           0    Little endian.
           1    Big endian.

E0E        Endianness of explicit data accesses at EL0. The possible values of this bit are:
           0    Explicit data accesses at EL0 are little-endian.
           1    Explicit data accesses at EL0 are big-endian.

WXN        Write permission implies XN (eXecute Never). See Access permissions on
           page 12-23.
           0    Regions with write permission are not forced to XN.
           1    Regions with write permission are forced to XN.

nTWE       Not trap WFE. A value of 1 means that WFE instructions are executed as normal.

nTWI       Not trap WFI. A value of 1 means that WFI instructions are executed as normal.

UCT        When set, enables EL0 access in AArch64 to the CTR_EL0 register.

DZE        Access to DC ZVA instruction at EL0. See Cache maintenance on page 11-13.
           0    Execution prohibited.
           1    Execution allowed.

I          Instruction cache enable. This is an enable bit for instruction caches at EL0 and
           EL1. Instruction accesses to cacheable Normal memory are cached.

UMA        User Mask Access. Controls access to interrupt masks from EL0, when EL0 is
           using AArch64.

SED        SETEND Disable. Disables SETEND instructions at EL0 using AArch32.
           0    SETEND instructions are enabled.
           1    The SETEND instruction is disabled.

ITD        IT Disable. The possible values of this bit are:
           0    The IT instruction is available.
           1    The IT instruction is treated as a 16-bit instruction. Only another 16-bit
                instruction, or the first half of a 32-bit instruction, can follow. This
                depends upon the implementation.

CP15BEN    CP15 barrier enable. If implemented, it is an enable bit for the AArch32 CP15
           DMB, DSB, and ISB barrier operations.

SA0        Stack Alignment Check Enable for EL0.

SA         Stack Alignment Check Enable.

C          Data cache enable. This is an enable bit for data caches at EL0 and EL1. Data
           accesses to cacheable Normal memory are cached.

A          Alignment check enable bit.

M          Enable the MMU.

Accessing the SCTLR
To access the SCTLR_ELn, use:

MRS Xt, SCTLR_ELn    // Read SCTLR_ELn into Xt
MSR SCTLR_ELn, Xt    // Write Xt to SCTLR_ELn

For example:

Example 4-1 Setting bits in the SCTLR

MRS X0, SCTLR_EL1         // Read System Control Register configuration data
ORR X0, X0, #(1 << 2)     // Set [C] bit and enable data caching
ORR X0, X0, #(1 << 12)    // Set [I] bit and enable instruction caching
MSR SCTLR_EL1, X0         // Write System Control Register configuration data

Note
The caches in the processor must be invalidated before caching of data and instructions is
enabled in any of the Exception levels.
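
The read-modify-write pattern of Example 4-1 can also be expressed on a plain integer. The sketch below only manipulates a value that is assumed to have been read from SCTLR_EL1 already; it does not access the register itself. The macro and function names are hypothetical, and the bit numbers (2 for C, 12 for I) are those used in Example 4-1.

```c
#include <stdint.h>

#define SCTLR_C (UINT64_C(1) << 2)    /* data cache enable, as in Example 4-1 */
#define SCTLR_I (UINT64_C(1) << 12)   /* instruction cache enable, as in Example 4-1 */

/* Returns the value with the C and I bits set and all other bits preserved. */
static uint64_t sctlr_enable_caches(uint64_t sctlr)
{
    return sctlr | SCTLR_C | SCTLR_I;
}

/* Returns the value with the C and I bits cleared and all other bits preserved. */
static uint64_t sctlr_disable_caches(uint64_t sctlr)
{
    return sctlr & ~(SCTLR_C | SCTLR_I);
}
```

Preserving the untouched bits is the point of the pattern: writing a freshly built constant instead would silently change unrelated controls such as the MMU enable.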

4.4 Endianness
There are two basic ways of viewing bytes in memory, either as Little-Endian (LE) or
Big-Endian (BE). On big-endian machines, the most significant byte of an object in memory is
stored at the lowest address, that is the address closest to zero. On little-endian machines, the
least significant byte is stored at the lowest address. The term byte-ordering can also be used
rather than endianness.
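
The two byte orderings can be demonstrated with a pair of store helpers that behave the same on any host. The function names are hypothetical, and index 0 of the output array stands for the lowest address.

```c
#include <stdint.h>

/* Little-endian: the least significant byte goes to the lowest address. */
static void store_le32(uint8_t out[4], uint32_t value)
{
    out[0] = (uint8_t)(value & 0xFF);
    out[1] = (uint8_t)((value >> 8) & 0xFF);
    out[2] = (uint8_t)((value >> 16) & 0xFF);
    out[3] = (uint8_t)((value >> 24) & 0xFF);
}

/* Big-endian: the most significant byte goes to the lowest address. */
static void store_be32(uint8_t out[4], uint32_t value)
{
    out[0] = (uint8_t)((value >> 24) & 0xFF);
    out[1] = (uint8_t)((value >> 16) & 0xFF);
    out[2] = (uint8_t)((value >> 8) & 0xFF);
    out[3] = (uint8_t)(value & 0xFF);
}
```

Storing 0x12345678 gives the byte sequence 78 56 34 12 in little-endian order and 12 34 56 78 in big-endian order.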

[Figure: byte layout of the value 0x12345678 in memory.

Byte address     0     1     2     3
Little endian    78    56    34    12
Big endian       12    34    56    78]

Figure 4-6

Data endianness is controlled independently for each Exception level. For EL3, EL2 and
EL1, the EE bit of the relevant SCTLR_ELn register sets the endianness. The additional bit at
EL1, SCTLR_EL1.E0E, controls the data endian setting for EL0. In the AArch64 execution state,
data accesses can be LE or BE, while instruction fetches are always LE.
Whether a processor supports both LE and BE depends upon the implementation of the
processor. If only little-endianness is supported, then the EE and E0E bits are always 0.
Similarly, if only big-endianness is supported, then the EE and E0E bits are at a static 1 value.
When using AArch32, having the CPSR.E bit hold a different value from the equivalent System
Control register EE bit when in EL1, EL2, or EL3 is now deprecated. The use of the ARMv7
SETEND instruction is also deprecated. Setting the SCTLR.SED bit causes an Undefined
exception to be taken when a SETEND instruction is executed.

4.5 Changing execution state (again)
In Changing execution state on page 3-8, we described the change between AArch64 and
AArch32 in terms of Exception levels. Now we consider the change from the point of view of
the registers.
On entry to an Exception level using AArch64 from an Exception level using AArch32:
•   The values of the upper 32 bits of registers that were accessible to any lower Exception
    level using AArch32 execution are UNKNOWN.
•   The registers that are not accessible during AArch32 execution retain the state that they
    had before AArch32 execution.
•   On exception entry to EL3, when EL2 has been using AArch32, the values of the upper
    32 bits of the ELR_EL2 are UNKNOWN.
•   AArch64 Stack Pointers (SPs) and Exception Link Registers (ELRs) associated with an
    Exception level that is not accessible during AArch32 execution, at that Exception level,
    retain the state that they had before AArch32 execution. This applies to the following
    registers:
    -   SP_EL0.
    -   SP_EL1.
    -   SP_EL2.
    -   ELR_EL1.

In general, application programmers write applications for either AArch32 or AArch64. It is
only the OS that must take account of the two execution states and the switch between them.

4.5.1 Registers at AArch32
Being virtually identical to ARMv7 means AArch32 must match ARMv7 privilege levels. It
also means that AArch32 only deals with ARMv7 32-bit general-purpose registers. Therefore,
there must be some correspondence between the ARMv8 architecture, and the view of it
provided by the AArch32 execution state.
Remember that in the ARMv7 architecture there are sixteen 32-bit general-purpose registers
(R0-R15) for software use. Fifteen of them (R0-R14) can be used for general-purpose data
storage. The remaining register, R15, is the program counter (PC) whose value is altered as the
core executes instructions. Software can also access the CPSR. The saved copy of the CPSR
from the previously executed mode is the SPSR. On taking an exception, the CPSR is copied
to the SPSR of the mode to which the exception is taken.
Which of these registers is accessed, and where, depends upon the processor mode the software
is executing in and the register itself. This is called banking, and the shaded registers in
Figure 4-7 on page 4-14 are banked. They use physically distinct storage and are usually
accessible only when a process is executing in that particular mode.

[Figure: one column of registers for each of the User, Sys, FIQ, IRQ, ABT, SVC, UND, MON, and
HYP modes. Every column shows R0-R12, R13 (sp), R14 (lr), R15 (pc), and the (A/C)PSR or CPSR.
The banked register labels in the figure are:
FIQ mode       R8_fiq, R9_fiq, R10_fiq, R11_fiq, R12_fiq
Stack pointer  SP_fiq, SP_irq, SP_abt, SP_svc, SP_und, SP_mon, SP_hyp
Link register  LR_fiq, LR_irq, LR_abt, LR_svc, LR_und, LR_mon, LR_hyp
Saved status   SPSR_fiq, SPSR_irq, SPSR_abt, SPSR_svc, SPSR_und, SPSR_mon, SPSR_hyp
HYP mode       ELR_hyp]

Figure 4-7 The ARMv7 register set showing banked registers

Banking is used in ARMv7 to reduce the latency for exceptions. However, this also means that
of a considerable number of possible registers, fewer than half can be used at any one time.
In contrast, the AArch64 execution state has 31 × 64-bit general-purpose registers accessible at
all times and in all Exception levels. A change in execution state between AArch64 and
AArch32 means that the AArch64 registers must necessarily map onto the AArch32 (ARMv7)
register set. This mapping is shown in Figure 4-8 on page 4-15.
The upper 32 bits of the AArch64 registers are inaccessible when executing in AArch32. If the
processor is operating in AArch32 state, it uses the 32-bit W registers, which are equivalent to
the 32-bit ARMv7 registers.
AArch32 maps the banked registers to AArch64 registers that would otherwise be inaccessible.
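
The mapping of banked AArch32 registers onto otherwise unused AArch64 registers can be sketched as a lookup table. The register numbers below follow the mapping as the ARMv8-A architecture defines it, which Figure 4-8 also presents; treat them as an assumption to check against that figure. The `_usr` names for the User mode registers, and the `reg_map` and `x_index_for` names, are used here for illustration only.

```c
#include <stddef.h>
#include <string.h>

typedef struct {
    const char *aarch32_name;   /* register as seen from AArch32 */
    int x_index;                /* n of the Xn/Wn register that holds it */
} reg_map;

/* Banked AArch32 registers and the AArch64 register each one maps to. */
static const reg_map banked_map[] = {
    { "SP_usr", 13 }, { "LR_usr", 14 }, { "SP_hyp", 15 },
    { "LR_irq", 16 }, { "SP_irq", 17 },
    { "LR_svc", 18 }, { "SP_svc", 19 },
    { "LR_abt", 20 }, { "SP_abt", 21 },
    { "LR_und", 22 }, { "SP_und", 23 },
    { "R8_fiq", 24 }, { "R9_fiq", 25 }, { "R10_fiq", 26 },
    { "R11_fiq", 27 }, { "R12_fiq", 28 },
    { "SP_fiq", 29 }, { "LR_fiq", 30 },
};

/* Returns n for the mapped Xn register, or -1 if the name is not in the table. */
static int x_index_for(const char *aarch32_name)
{
    for (size_t i = 0; i < sizeof banked_map / sizeof banked_map[0]; i++) {
        if (strcmp(banked_map[i].aarch32_name, aarch32_name) == 0)
            return banked_map[i].x_index;
    }
    return -1;
}
```

A hypervisor or OS running in AArch64 that saves X0-X30 for an AArch32 guest therefore captures these banked registers as part of the same context.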

[Figure: mapping of the AArch32 registers onto the AArch64 W registers, with one column for each
of the User, Sys, FIQ, IRQ, ABT, SVC, UND, MON, and HYP modes. The labels in the figure give:
R0-R7       W0-W7
R8-R12      W8-W12, or W24-W28 in FIQ mode
R13 (sp)    W13 (User), W29 (FIQ), W17 (IRQ), W21 (ABT), W19 (SVC), W23 (UND), W15 (HYP)
R14 (lr)    W14 (User), W30 (FIQ), W16 (IRQ), W20 (ABT), W18 (SVC), W22 (UND)
SPSR        SPSR_fiq, SPSR_irq, SPSR_abt, SPSR_EL1 (SVC), SPSR_und, SPSR_EL3 (MON),
            SPSR_EL2 (HYP)
ELR_hyp     ELR_EL2
A legend marks the registers that are inaccessible from AArch64.]

Figure 4-8 AArch64 to AArch32 register mapping

The SPSR and ELR_hyp registers in AArch32 are additional registers that are accessible using
system instructions only. They are not mapped into the general-purpose register space of the
AArch64 architecture. Some of these registers are mapped between AArch32 and AArch64:

•   SPSR_svc maps to SPSR_EL1.
•   SPSR_hyp maps to SPSR_EL2.
•   ELR_hyp maps to ELR_EL2.

The following registers are only used during AArch32 execution. They retain their state while
EL1 executes using AArch64, even though they are inaccessible during AArch64 execution at that
Exception level:

•   SPSR_abt.
•   SPSR_und.
•   SPSR_irq.
•   SPSR_fiq.

These SPSR registers are only accessible during AArch64 execution at higher Exception levels
for context switching.
Again, if an exception is taken to an Exception level in AArch64 from an Exception level in
AArch32, the top 32 bits of the AArch64 ELR_ELn are all zero.

4.5.2 PSTATE at AArch32
In AArch64, the different components of the traditional CPSR are presented as Processor State
(PSTATE) fields that can be made accessible independently. At AArch32, there are extra fields
corresponding to the ARMv7 CPSR bits.

[Figure: CPSR bit assignments. N [31], Z [30], C [29], V [28], Q [27], IT[1:0] [26:25], J [24],
IL [20], GE [19:16], IT[7:2] [15:10], E [9], A [8], I [7], F [6], T [5], M [4], M[3:0] [3:0].]

Figure 4-9 CPSR bit assignments in AArch32

This gives the following additional PSTATE bits, which are accessible only at AArch32:

Table 4-6 PSTATE bit definitions

Name      Description
Q         Cumulative saturation (sticky) flag.
GE (4)    Greater than or Equal flags.
IT (8)    If-Then execution bits.
J         J bit.
T         T32 bit.
E         Endianness bit.
M         Mode field.

4.6 NEON and floating-point registers
In addition to the general-purpose registers, ARMv8 also has 32 128-bit floating-point registers
labeled V0-V31. The 32 registers are used to hold floating-point operands for scalar
floating-point instructions and both scalar and vector operands for NEON operations. NEON
and floating-point registers are also covered in Chapter 7 AArch64 Floating-point and NEON.

4.6.1 Floating-point register organization in AArch64
In NEON and floating-point instructions that operate on scalar data, the floating-point and
NEON registers behave similarly to the main general-purpose integer registers. Therefore, only
the lower bits are accessed, with the unused high bits ignored on a read and set to zero on a write.
The qualified names for scalar floating-point and NEON registers indicate the number of
significant bits as follows, where n is a register number 0-31.

Table 4-7 Operand name for differently sized floats

Precision    Size (bits)    Name
Half         16             Hn
Single       32             Sn
Double       64             Dn

[Figure: in each of the registers V0 to V31, Hn occupies bits [15:0], Sn occupies bits [31:0],
and Dn occupies bits [63:0]. The remaining upper bits, up to bit 127, are unused.]

Figure 4-10 Arrangement of floating-point values

Note
16-bit floating-point is supported, but only as a format to be converted from or to. It is not
supported for data processing operations.
The floating-point ADD instruction is written with an F prefix, and the register names specify
the float size:

FADD Sd, Sn, Sm    // Single-precision
FADD Dd, Dn, Dm    // Double-precision

The half-precision floating-point instructions are for converting between different sizes:

FCVT Sd, Hn    // half-precision to single-precision
FCVT Dd, Hn    // half-precision to double-precision
FCVT Hd, Sn    // single-precision to half-precision
FCVT Hd, Dn    // double-precision to half-precision

4.6.2 Scalar register sizes
In AArch64, the mapping for the integer scalars has changed from what is used in ARMv7-A to
the mapping shown in Figure 4-11:
[Figure: in each of the registers V0 to V31, Bn occupies bits [7:0], Hn occupies bits [15:0],
Sn occupies bits [31:0], Dn occupies bits [63:0], and Qn occupies bits [127:0]. For the smaller
views, the remaining upper bits are unused.]

Figure 4-11 Arrangement of ARMv8 registers when holding scalar values

In Figure 4-11 S0 is the bottom half of D0, which is the bottom half of Q0. S1 is the bottom half
of D1, which is the bottom half of Q1, and so on. This eliminates many of the problems
compilers have in auto-vectorizing high-level code.

•   The bottom 64 bits of each of the Q registers can also be viewed as D0-D31, 32 64-bit
    wide registers for floating-point and NEON use.
•   The bottom 32 bits of each of the Q registers can also be viewed as S0-S31, 32 32-bit wide
    registers for floating-point and NEON use.
•   The bottom 16 bits of each of the S registers can also be viewed as H0-H31, 32 16-bit
    wide registers for floating-point and NEON use.
•   The bottom 8 bits of each of the H registers can also be viewed as B0-B31, 32 8-bit wide
    registers for NEON use.

Note
Only the bottom bits of each register set are used in each case. The rest of the register space is
ignored when read, and filled with zeros when written.
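
The scalar views can be modeled with a 128-bit register held as two 64-bit halves. This is a sketch for illustration: `v_reg` and the accessor names are hypothetical, and the values are treated as raw bit patterns rather than as floating-point numbers.

```c
#include <stdint.h>

/* One 128-bit V register: lo holds bits [63:0], hi holds bits [127:64]. */
typedef struct { uint64_t lo, hi; } v_reg;

/* Scalar writes fill the low bits and zero the rest of the register. */
static void write_s(v_reg *v, uint32_t value) { v->lo = value; v->hi = 0; }
static void write_d(v_reg *v, uint64_t value) { v->lo = value; v->hi = 0; }

/* Scalar reads return the low bits and ignore everything above them. */
static uint8_t  read_b(const v_reg *v) { return (uint8_t)v->lo; }
static uint16_t read_h(const v_reg *v) { return (uint16_t)v->lo; }
static uint32_t read_s(const v_reg *v) { return (uint32_t)v->lo; }
static uint64_t read_d(const v_reg *v) { return v->lo; }
```

Because every view starts at bit 0, a value written through `write_s` can be read back through `read_d` with its upper 32 bits as zero, which is the nesting that Figure 4-11 describes.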
A consequence of this mapping is that if a program executing in AArch64 is interpreting D or
S registers from AArch32 execution, then the program must unpack the D or S registers from
the V registers before using them.
For the scalar ADD instruction:
ADD Vd, Vn, Vm

If the size was, for example, 32 bits, the instruction would be:
ADD Sd, Sn, Sm

Table 4-8 Operand name for differently sized scalars

Word size     Size (bits)    Name
Byte          8              Bn
Halfword      16             Hn
Word          32             Sn
Doubleword    64             Dn
Quadword      128            Qn

4.6.3

Vector register sizes
Vectors can be 64 bits wide with one or more elements or 128 bits wide with two or more
elements as shown in Figure 4-12:

[Figure: a 128-bit vector, shown for V0, can be arranged as V0.2D, V0.4S, V0.8H, or V0.16B. A
64-bit vector, shown for V31, occupies bits [63:0] as V31.1D, V31.2S, V31.4H, or V31.8B, with
bits [127:64] unused.]

Figure 4-12 Vector sizes

For the vector ADD instruction:
ADD Vd.T, Vn.T, Vm.T

With 32-bit elements in 4 lanes, for example, the instruction becomes:
ADD Vd.4S, Vn.4S, Vm.4S

Table 4-9 Operand names for different size vectors

Name      Shape
Vn.8B     8 lanes, each containing an 8-bit element
Vn.16B    16 lanes, each containing an 8-bit element
Vn.4H     4 lanes, each containing a 16-bit element
Vn.8H     8 lanes, each containing a 16-bit element
Vn.2S     2 lanes, each containing a 32-bit element
Vn.4S     4 lanes, each containing a 32-bit element
Vn.1D     1 lane containing a 64-bit element
Vn.2D     2 lanes, each containing a 64-bit element

When these registers are used in a specific instruction form, the names must be further qualified
to indicate the data shape. More specifically, this means the data element size and the number
of elements or lanes held within them.
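
As an illustration of the lane shapes in Table 4-9, the following C sketch models what a vector ADD on 32-bit elements computes: one independent addition per lane. The function name add_s_lanes is hypothetical and introduced only for this example. It is a behavioral model in portable C, not NEON code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of ADD Vd.<T>, Vn.<T>, Vm.<T> for 32-bit elements,
 * where <T> is 2S (64-bit vector) or 4S (128-bit vector).
 * Each lane is added independently and wraps modulo 2^32, so a carry
 * never crosses from one lane into the next. */
void add_s_lanes(uint32_t *d, const uint32_t *n, const uint32_t *m, int lanes)
{
    assert(lanes == 2 || lanes == 4);   /* only the .2S and .4S shapes */
    for (int i = 0; i < lanes; i++) {
        d[i] = n[i] + m[i];
    }
}
```

Calling add_s_lanes with lanes set to 4 corresponds to the ADD Vd.4S, Vn.4S, Vm.4S form shown above. A lane holding 0xFFFFFFFF plus 1 becomes 0 without disturbing its neighbors.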
4.6.4

NEON in AArch32 execution state
In AArch32, the smaller registers are packed into larger ones (D0 and D1 are combined to form
Q0, for instance). This introduces some tricky loop-carried dependencies which can reduce the
ability of the compiler to vectorize loop structures.

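[Figure: Q0 (bits [127:0]) is formed from D1 (upper 64 bits) and D0 (lower 64 bits), which in
turn hold S3, S2 and S1, S0. Q1 is formed in the same way from D3 and D2, which hold S7, S6
and S5, S4.]

Figure 4-13 Arrangement of ARMv7 SIMD registers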

The floating-point and Advanced SIMD registers in AArch32 are mapped into the AArch64 FP
and SIMD registers. This is done to allow the floating-point and NEON registers of an
application or a virtual machine to be interpreted (and, as necessary, modified) by a higher level
of system software, for example, the OS or the Hypervisor.
The AArch64 V16-V31 FP and NEON registers are not accessible from AArch32. As with the
general-purpose registers, during execution in an Exception level using AArch32 these registers
retain their state from the previous execution using AArch64.


Chapter 5
An Introduction to the ARMv8 Instruction Sets

One of the most significant changes introduced in the ARMv8 architecture is the addition of a
64-bit instruction set. This set complements the existing 32-bit instruction set architecture. This
addition provides access to 64-bit wide integer registers and data operations, and the ability to
use 64-bit sized pointers to memory. The new instructions are known as A64 and execute in the
AArch64 execution state. ARMv8 also includes the original ARM instruction set, now called
A32, and the Thumb (T32) instruction set. Both A32 and T32 execute in AArch32 state, and
provide backward compatibility with ARMv7.
Although ARMv8-A provides backward compatibility with the 32-bit ARM Architectures, the
A64 instruction set is separate and distinct from the older ISA and is encoded differently. A64
adds some additional capabilities while also removing other features that would potentially limit
the speed or energy efficiency of high performance implementations. The ARMv8 architecture
includes some enhancements to the 32-bit instruction sets (A32 and T32) as well. However,
code that makes use of such features is not compatible with older ARMv7 implementations.
Instruction opcodes in the A64 instruction set, however, are still 32 bits long, not 64 bits.
Programmers seeking a more detailed description of A64 assembly language can also refer to
the ARM® Compiler armasm Reference Guide v6.01.


5.1

The ARMv8 instruction sets
The new A64 instruction set is similar to the existing A32 instruction set. Instructions are 32 bits
wide and have similar syntax.
The instruction sets use a generic naming convention within the ARMv8 architecture, so that
the original 32-bit instruction set states are now called:
A32

When in AArch32 state, the instruction set is largely compatible with ARMv7,
though there are differences. See the ARMv8-A Architecture Reference Manual. It
also provides some new instructions to align with some of the features that are
introduced in the A64 instruction set.

T32

The Thumb instruction set was first included in the ARM7TDMI processor and
originally contained only 16-bit instructions. 16-bit instructions gave much
smaller programs at the cost of some performance. ARMv7 processors, including
those in the Cortex-A series, support Thumb-2 technology, which extends the
Thumb instruction set to provide a mix of 16-bit and 32-bit instructions. This
gives performance similar to that of ARM, while retaining the reduced code size.
Because of its size and performance advantages, it is increasingly common for all
32-bit code to be compiled or assembled to take advantage of Thumb-2
technology.

A new instruction set has been introduced that the core can use when in AArch64 state. In
keeping with the naming convention, and reflecting the 64-bit operation, this instruction set is
called:
A64

A64 provides similar functionality to the A32 and T32 instruction sets in
AArch32 or ARMv7. The design of the new A64 instruction set allowed several
improvements:
A consistent encoding scheme
The late addition of some instructions in A32 resulted in some
inconsistency in the encoding scheme. For example, LDR and STR
support for halfwords is encoded slightly differently to the mainstream
byte and word transfer instructions. The result of this is that the
addressing modes are slightly different.
Wide range of constants
A64 instructions provide a huge range of options for constants, each
tailored to the requirements of specific instruction types.
• Arithmetic instructions generally accept a 12-bit immediate constant.
• Logical instructions generally accept a 32-bit or 64-bit constant, which has some
  constraints in its encoding.
• MOV instructions accept a 16-bit immediate, which can be shifted to any 16-bit boundary.
• Address generation instructions are geared to addresses aligned to a 4KB page size.
There are slightly more complex rules for constants that are used in bit
manipulation instructions. However, bitfield manipulation instructions
can address any contiguous sequence of bits, in either the source or
destination operand.
A64 provides flexible constants, but encoding them, even determining
whether a particular constant can be legally encoded in a particular
context, can be non-trivial.
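
To make the idea of "legally encodable" concrete, here is a minimal C sketch of two of the simpler checks. The function names are hypothetical. The optional left shift by 12 for arithmetic immediates is taken from the description of immediate ranges in Chapter 6, and the much more involved rules for logical immediates are deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical check: can v be used as an arithmetic immediate?
 * Assumes a 12-bit unsigned value, optionally shifted left by 12. */
bool fits_arith_imm(uint64_t v)
{
    return (v & ~UINT64_C(0xFFF)) == 0 ||
           (v & ~(UINT64_C(0xFFF) << 12)) == 0;
}

/* Hypothetical check: can v be loaded by a single MOV of a 16-bit
 * immediate placed at a 16-bit boundary (bit 0, 16, 32 or 48)? */
bool fits_mov_imm16(uint64_t v)
{
    for (unsigned shift = 0; shift < 64; shift += 16) {
        if ((v & ~(UINT64_C(0xFFFF) << shift)) == 0) {
            return true;
        }
    }
    return false;
}

/* Returns the shift (0, 16, 32 or 48) that such a MOV would use. */
unsigned mov_imm16_shift(uint64_t v)
{
    assert(fits_mov_imm16(v));
    for (unsigned shift = 0; shift < 48; shift += 16) {
        if ((v & ~(UINT64_C(0xFFFF) << shift)) == 0) {
            return shift;
        }
    }
    return 48;
}
```

A code generator would typically run checks of this kind before choosing between a single instruction and a longer sequence. An assembler performs the equivalent test and reports an error when a constant cannot be encoded.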


Data types are easier
A64 deals naturally with 64-bit signed and unsigned data types in that
it offers more concise and efficient ways of manipulating 64-bit
integers. This can be advantageous for all languages which provide
64-bit integers such as C or Java.
Long offsets
A64 instructions generally provide longer offsets, both for PC-relative
branches and for offset addressing.
The increased branch range makes it easier to manage inter-section
jumps. Dynamically generated code is generally placed on the heap so
it can, in practice, be located anywhere. The runtime system finds it
much easier to manage this with increased branch ranges, and fewer
fix-ups are required.
The need for literal pools (blocks of literal data embedded in the code
stream) has long been a feature of ARM instruction sets. This still
exists in A64. However, the larger PC-relative load offset helps
considerably with the management of literal pools, making it possible
to use one per compilation unit. This removes the need to manufacture
locations for multiple pools in long code sequences.
Pointers
Pointers are 64-bit in AArch64, which allows larger amounts of virtual
memory to be addressed and gives more freedom for address mapping.
However, using 64-bit pointers does incur some costs. The same piece
of code typically uses more memory when running with 64-bit pointers
than with 32-bit pointers. Each pointer is stored in memory and
requires eight bytes instead of four. This might sound trivial, but can
add up to a significant penalty. Additionally, the increased use of
memory space that is associated with a move to 64 bits can cause a
drop in the number of accesses that hit in cache. This drop of cache hits
can reduce performance.
Some languages, such as Java, can be implemented with compressed
pointers to circumvent the performance issue.
Conditional constructs are used instead of IT blocks
IT blocks are a useful feature of T32, enabling efficient sequences that
avoid the need for short forward branches around unexecuted
instructions. However, they are sometimes difficult for hardware to
handle efficiently. A64 removes these blocks and replaces them with
conditional instructions such as CSEL (Conditional Select) and CINC
(Conditional Increment). These conditional constructs are more
straightforward and easier to handle without special cases.
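
The behavior of these two instructions can be sketched in C as follows. The helper names csel, cinc and count_above are hypothetical models written for this example. Whether a compiler actually emits CSEL or CINC for such source code depends on the compiler and its options.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of CSEL Xd, Xn, Xm, cond: Xd = cond ? Xn : Xm. */
uint64_t csel(bool cond, uint64_t n, uint64_t m)
{
    return cond ? n : m;
}

/* Hypothetical model of CINC Xd, Xn, cond: Xd = cond ? Xn + 1 : Xn. */
uint64_t cinc(bool cond, uint64_t n)
{
    return cond ? n + 1 : n;
}

/* Counts the values above a threshold. Conceptually, each iteration is
 * a compare followed by a conditional increment, with no forward branch
 * around the increment. */
uint64_t count_above(const int32_t *v, size_t n, int32_t threshold)
{
    assert(v != NULL || n == 0);
    uint64_t count = 0;
    for (size_t i = 0; i < n; i++) {
        count = cinc(v[i] > threshold, count);
    }
    return count;
}
```

In T32, the same loop body would typically need an IT block or a short branch around the increment.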
Shift and rotate behavior is more intuitive
The A32 or T32 shift and rotate behavior does not always map easily
to the behavior expected by high-level languages.
ARMv7 provides a barrel shifter that can be used as part of data
processing instructions. However, specifying the type of shift and the
amount to shift requires a certain number of opcode bits, which could
be used elsewhere.
A64 therefore removes options that were rarely used, and
instead adds new explicit instructions to carry out more complicated
shift operations.


Code generation
When generating code, both statically and dynamically, for common
arithmetic functions, A32 and T32 often require different instructions,
or instruction sequences. This is to cope with different data types.
These operations in A64 are much more consistent so it is much easier
to generate common sequences for simple operations on differently
sized data types.
For example, in T32 the same instruction can have different encodings
depending on what registers are used (either a low register or a high
register).
The A64 instruction set encodings are much more regular and
rationalized. Consequently, an assembler for A64 typically requires
fewer lines of code than an assembler for T32.
Fixed-length instructions
All A64 instructions are the same length, unlike T32, which is a
variable-length instruction set. This makes management and tracking
of generated code sequences easier, particularly affecting dynamic
code generators.
Three operands map better
A32, in general, preserves a true three-operand structure for
data-processing operations. T32, on the other hand, contains a
significant number of two-operand instruction formats, which make it
slightly less flexible when generating code. A64 sticks to a consistent
three-operand syntax, which further contributes to the regularity and
homogeneity of the instruction set for the benefit of compilers.
5.1.1

Distinguishing between 32-bit and 64-bit A64 instructions
Most integer instructions in the A64 instruction set have two forms, which operate on either
32-bit or 64-bit values within the 64-bit general-purpose register file.
When looking at the register name that the instruction uses:
• If the register name starts with X, it is a 64-bit value.
• If the register name starts with W, it is a 32-bit value.

Where a 32-bit instruction form is selected, the following facts hold true:
• Right shifts and rotates inject at bit 31, instead of bit 63.
• The condition flags, where set by the instruction, are computed from the lower 32 bits.
• Writes to the W register set bits [63:32] of the X register to zero.

This distinction applies even when the results of a 32-bit instruction form would be
indistinguishable from the lower 32 bits computed by the equivalent 64-bit instruction form. For
example, a 32-bit bitwise ORR could be performed using a 64-bit ORR and simply ignoring the top
32 bits of the result. The A64 instruction set includes separate 32 and 64-bit forms of the ORR
instruction.
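
The following C sketch models two of the properties of the 32-bit form listed above, using an arithmetic shift right as the example. The function names are hypothetical, and a 64-bit integer stands in for the X register. It is a behavioral model, not generated code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of a write to Wd: the 32-bit result lands in bits
 * [31:0] of the X register and bits [63:32] become zero. */
uint64_t write_w(uint32_t result)
{
    return (uint64_t)result;
}

/* Hypothetical model of ASR Wd, Wn, #shift: only the lower 32 bits of
 * the source are read, and copies of bit 31 are injected at bit 31. */
uint64_t asr_w(uint64_t xn, unsigned shift)
{
    assert(shift < 32);
    uint32_t w = (uint32_t)xn;
    uint32_t r = w >> shift;
    if ((w >> 31) != 0 && shift > 0) {
        r |= ~(UINT32_MAX >> shift);    /* fill the vacated top bits */
    }
    return write_w(r);
}

/* Hypothetical model of ASR Xd, Xn, #shift: copies of bit 63 are
 * injected at bit 63. */
uint64_t asr_x(uint64_t xn, unsigned shift)
{
    assert(shift < 64);
    uint64_t r = xn >> shift;
    if ((xn >> 63) != 0 && shift > 0) {
        r |= ~(UINT64_MAX >> shift);
    }
    return r;
}
```

With the source value 0x0000000080000000 and a shift of 4, the 32-bit form treats bit 31 as the sign and produces 0x00000000F8000000, while the 64-bit form produces 0x0000000008000000.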
The C and C++ LP64 and LLP64 data models are expected to be the most commonly used on
AArch64. They both define the frequently used int, short, and char types to be 32 bits or less.
By maintaining this semantic information in the instruction set, implementations can exploit it,
for example, to avoid expending energy or cycles to compute, forward, and store
the unused upper 32 bits of such data types. Implementations are free to exploit this freedom in
whatever way they choose to save energy.

So the new A64 instruction set provides distinct sign and zero-extend instructions. Additionally,
the A64 instruction set makes it possible to extend and shift the final source register of an ADD,
SUB, CMN, or CMP instruction and the index register of a Load or Store instruction. This results in
efficient implementation of array index calculations involving a 64-bit array pointer and 32-bit
array index.
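
As a sketch of the address calculation this enables, the C model below forms base + (extend(index) << shift), which is what an access such as LDR W0, [X1, W2, SXTW #2] computes for an array of 4-byte elements. The function names are hypothetical. The upper bound of 4 on the shift is an assumption made for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of SXTW: sign-extend a 32-bit value to 64 bits. */
uint64_t sxtw(uint32_t w)
{
    uint64_t x = w;
    if (w & UINT32_C(0x80000000)) {
        x |= UINT64_C(0xFFFFFFFF00000000);
    }
    return x;
}

/* Hypothetical model of UXTW: zero-extend a 32-bit value to 64 bits. */
uint64_t uxtw(uint32_t w)
{
    return (uint64_t)w;
}

/* Address of element `index` when the already-extended index is scaled
 * by the element size (1 << shift). Arithmetic wraps modulo 2^64. */
uint64_t indexed_address(uint64_t base, uint64_t extended_index, unsigned shift)
{
    assert(shift <= 4);   /* assumption: element sizes up to 16 bytes */
    return base + (extended_index << shift);
}
```

The choice of extension matters for negative indices. An int index of -1 with SXTW addresses the element just before the base, whereas the same bit pattern with UXTW is treated as 4294967295.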
5.1.2

Addressing
When the processor can store 64-bit values in a single register, it becomes much simpler to
access large amounts of memory within a program. A single thread executing on a 32-bit core
is limited to accessing 4GB of address space. Large parts of that addressable space are reserved
for use by the OS kernel, library code, peripherals, and more. As a result, lack of space means
that the program might need to map some data in or out of memory while executing. Having a
larger address space, with 64-bit pointers, avoids this problem. It also makes techniques such as
memory-mapped files more attractive and convenient to use. The file contents are mapped into
the memory map of a thread, even though the physical RAM might not be large enough to
contain the whole file.
Other improvements to addressing include the following:
Exclusive accesses
Exclusive load-store of a byte, halfword, word and doubleword. Exclusive access
to a pair of doublewords permits atomic updates of a pair of pointers, for example
circular list inserts. All exclusive accesses must be naturally aligned, and
exclusive pair access must be aligned to twice the data size, that is, 128 bits for a
pair of 64-bit values.
Increased PC-relative offset addressing
PC-relative literal loads have an offset range of ±1MB. Compared to the
PC-relative loads of A32, this reduces the number of literal pools, and increases
sharing of literal data between functions. In turn, this reduces I-cache and TLB
pollution.
Most conditional branches have a range of ±1MB, expected to be sufficient for
the majority of conditional branches that take place within a single function.
Unconditional branches, including branch and link, have a range of ±128MB,
expected to be sufficient to span the static code segment of most executable load
modules and shared objects, without needing linker-inserted veneers.
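
The branch ranges quoted above can be turned into a simple reachability test, sketched here in C. All names are hypothetical. The model assumes that the offset is encoded as a signed count of 4-byte instructions, so it must be a multiple of 4 and the largest forward offset is one instruction short of the quoted range.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define COND_BRANCH_RANGE   (INT64_C(1) << 20)   /* 1MB, from the text   */
#define UNCOND_BRANCH_RANGE (INT64_C(1) << 27)   /* 128MB, from the text */

/* True if a byte offset from branch to target is encodable under the
 * assumptions stated in the lead-in. */
bool offset_fits(int64_t offset, int64_t range)
{
    assert(range > 0);
    return offset % 4 == 0 && offset >= -range && offset < range;
}

bool cond_branch_fits(int64_t offset)
{
    return offset_fits(offset, COND_BRANCH_RANGE);
}

bool uncond_branch_fits(int64_t offset)
{
    return offset_fits(offset, UNCOND_BRANCH_RANGE);
}

/* Hypothetical linker-style decision: is a direct branch enough, or
 * would an intermediate veneer be needed to reach the target? */
bool call_needs_veneer(uint64_t branch_addr, uint64_t target_addr)
{
    bool backward = target_addr < branch_addr;
    uint64_t distance = backward ? branch_addr - target_addr
                                 : target_addr - branch_addr;
    if (distance > (uint64_t)UNCOND_BRANCH_RANGE) {
        return true;
    }
    int64_t offset = backward ? -(int64_t)distance : (int64_t)distance;
    return !uncond_branch_fits(offset);
}
```

A real linker applies further rules, so treat this only as an illustration of how the ranges translate into a yes or no answer.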
Note
Veneers are small pieces of code that are automatically inserted by the linker, for
example, when it detects that a branch target is out of range. The veneer becomes
an intermediate target of the original branch with the veneer itself then being a
branch to the target address.
The linker can reuse a veneer generated for a previous call, for other calls to the
same function if it is in range from both calls. Occasionally, such veneers can be
a performance factor.
If you have a loop that calls multiple functions through veneers, you will get
many pipeline flushes and therefore sub-optimal performance. Placing related
code together in memory can avoid this.
PC-relative load and store and address generation with a range of ±4GB can be
performed inline using only two instructions, that is, without the need to load an
offset from a literal pool.
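
The two-instruction sequence is typically an ADRP, which produces the address of a 4KB page relative to the page of the PC, followed by an ADD of the low 12 bits (written as add x0, x0, :lo12:symbol in GNU assembler syntax). The C sketch below models how a target address splits into those two parts. The type and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical pair of values that the two instructions encode. */
typedef struct {
    int64_t  page_delta;   /* signed distance in 4KB pages (ADRP) */
    uint32_t lo12;         /* offset within the page (ADD)        */
} adrp_add;

/* Splits target into a page delta from pc and a 12-bit page offset. */
adrp_add split_address(uint64_t pc, uint64_t target)
{
    adrp_add s;
    s.page_delta = (int64_t)(target >> 12) - (int64_t)(pc >> 12);
    s.lo12 = (uint32_t)(target & 0xFFF);
    return s;
}

/* Rebuilds the address as the instruction pair would. The assertion
 * reflects the range of +/-4GB, that is, 2^20 pages of 4KB each way. */
uint64_t rebuild_address(uint64_t pc, adrp_add s)
{
    assert(s.page_delta >= -(INT64_C(1) << 20) &&
           s.page_delta <   (INT64_C(1) << 20));
    uint64_t page = (pc & ~UINT64_C(0xFFF)) + ((uint64_t)s.page_delta << 12);
    return page + s.lo12;
}
```

Because the low 12 bits of the PC are discarded, the same pair of encoded values is valid from any instruction within the same 4KB page.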


Unaligned address support
Except for exclusive and ordered accesses, all loads and stores support the use of
unaligned addresses when accessing normal memory. This simplifies porting
code to A64.
Bulk transfers
The LDM, STM, PUSH, and POP instructions do not exist in A64. Bulk transfers can be
constructed using the LDP and STP instructions. These instructions load and store
a pair of independent registers from consecutive memory locations.
The LDNP and STNP instructions provide a streaming or non-temporal hint, that the
data does not need to be retained in caches.
The PRFM (prefetch memory) instructions enable targeting of a prefetch to a
specific cache level.
Load/Store
All Load/Store instructions now support consistent addressing modes. This
makes it much easier, for example, to treat char, short, int and long long in the
same way when loading and storing quantities from memory.
The floating-point and NEON registers now support the same addressing modes
as the core registers, making it easier to use the two register banks
interchangeably.
Alignment checking
When executing in AArch64, additional alignment checking is performed on
instruction fetches and on loads or stores using the stack pointer, enabling
misalignment checking of the PC or the current SP.
This approach is preferable to forcing the correct alignment of the PC or SP,
because a misalignment of the PC or SP commonly indicates a software error,
such as corruption of an address in software.
There are a number of types of alignment checking:
• Program Counter alignment checking generates an exception associated
with instruction fetch whenever an attempt is made to execute an
instruction fetched with a misaligned PC in AArch64.
A misaligned PC is defined to be one where bits [1:0] of the PC are not 00.
A PC misalignment is identified in the exception syndrome register
associated with the target Exception level.
When the exception is handled using AArch64, the associated exception
link register holds the entire PC in its misaligned form, as does the Fault
Address Register, FAR_ELn, for the Exception level in which the exception
is taken.
PC alignment checking is performed in AArch64, and in AArch32 as part
of Data Abort exception handling.

• Stack Pointer (SP) alignment checking generates an exception associated
with data memory access whenever a load or store in AArch64 is attempted using a
misaligned stack pointer as a base address.
A misaligned stack pointer is one where bits [3:0] of the stack pointer, used
as the base address of the calculation, are not 0000. The stack pointer must
be 16-byte aligned whenever it is used as a base address.
Stack pointer alignment checking is only performed in AArch64, and can
be enabled independently for each Exception level:
  - EL0 and EL1 are controlled by two separate bits in SCTLR_EL1.
  - EL2 is controlled by a bit in SCTLR_EL2.
  - EL3 is controlled by a bit in SCTLR_EL3.

5.1.3

Registers
The A64 64-bit register bank helps reduce register pressure in most applications.
The A64 Procedure Call Standard (PCS) passes up to eight parameters in registers (X0-X7). In
contrast, A32 and T32 pass only four arguments in registers, with any excess being passed on
the stack.
The PCS also defines a dedicated Frame Pointer (FP), which makes debugging and call-graph
profiling easier by making it possible to reliably unwind the stack. Refer to Chapter 9 The ABI
for ARM 64-bit Architecture for further information.
A consequence of adopting 64-bit wide integer registers is the varying widths of variables used
by programming languages. A number of standard models are currently in use, which differ
mainly in the size defined for integers, longs, and pointers:
Table 5-1 Variable width

Type         ILP32    LP64    LLP64
char         8        8       8
short        16       16      16
int          32       32      32
long         32       64      32
long long    64       64      64
size_t       32       64      64
pointer      32       64      64

64-bit Linux implementations use LP64 and this is supported by the A64 Procedure Call
Standard. Other PCS variants are defined that can be used by other operating systems.
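
One way to see which of the models in Table 5-1 a toolchain uses is to inspect the type widths at build time. The following C sketch does that. The function name data_model and the BITS macro are hypothetical helpers, and only the three models from the table are recognized.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Width of a type in bits. */
#define BITS(type) (sizeof(type) * CHAR_BIT)

/* Hypothetical helper: names the data model of the current build by
 * comparing the widths of int, long and pointers against Table 5-1. */
const char *data_model(void)
{
    size_t i = BITS(int);
    size_t l = BITS(long);
    size_t p = BITS(void *);

    if (i == 32 && l == 32 && p == 32) return "ILP32";
    if (i == 32 && l == 64 && p == 64) return "LP64";
    if (i == 32 && l == 32 && p == 64) return "LLP64";
    return "unknown";
}
```

Built for 64-bit Linux on AArch64, this would be expected to report LP64, in line with the statement above. Code that must be portable across models is usually better served by the fixed-width types in stdint.h than by assumptions about long.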
Zero register
The zero register (WZR/XZR) is used for a few encoding tricks. For example,
there is no plain multiply encoding, just multiply-add. The instruction MUL W0, W1,
W2 is identical to MADD W0, W1, W2, WZR which uses the zero register. Not all
instructions can use the XZR/WZR. As we mentioned in Chapter 4, the zero
register shares the same encoding as the stack pointer. This means that, for some
arguments, for a very limited number of instructions, WZR/XZR is not available,
but WSP/SP is used instead.
Example 5-1 Using the Zero register to write a zero to memory

In A32:
mov r0, #0
str r0, [...]

In A64 using the zero register:
str wzr, [...]

No need for a spare register. Or write 16 bytes of zeros using:
stp xzr, xzr, [...] etc


A convenient side-effect of the zero register is that there are many NOP instructions
with large immediate fields. For example, ADR XZR, #<imm> alone gives you 21 bits
of data in an instruction with no other side effects. This is very useful for JIT
compilers, where code can be patched at runtime.
Stack pointer
The Stack Pointer (SP) cannot be referenced by most instructions. Some forms of
arithmetic instructions can read or write the current stack pointer. This might be
done to adjust the stack pointer in a function prologue or epilogue. For example:
ADD SP, SP, #256    // SP = SP + 256

Program counter
The current Program Counter (PC) cannot be referred to by number as if part of
the general register file and therefore cannot be used as the source or destination
of arithmetic instructions, or as the base, index or transfer register of load and
store instructions.
The only instructions that read the PC are those whose function it is to compute a
PC-relative address (ADR, ADRP, literal load, and direct branches), and the
branch-and-link instructions that store a return address in the link register (BL and
BLR). The only way to modify the program counter is using branch, exception
generation and exception return instructions.
Where the PC is read by an instruction to compute a PC-relative address, then its
value is the address of that instruction. Unlike A32 and T32, there is no implied
offset of 4 or 8 bytes.
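
The difference can be summarized by a small C model of the value an instruction observes when it reads the PC. The names are hypothetical. The A32 and T32 offsets of 8 and 4 bytes are the usual ARMv7 values, and refinements such as word alignment of the T32 value for literal loads are left out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical tags for the three instruction sets. */
typedef enum { ISA_A64, ISA_A32, ISA_T32 } isa_t;

/* Value seen by an instruction at instr_addr when it reads the PC. */
uint64_t pc_as_read(isa_t isa, uint64_t instr_addr)
{
    switch (isa) {
    case ISA_A64: return instr_addr;        /* no implied offset */
    case ISA_A32: return instr_addr + 8;
    case ISA_T32: return instr_addr + 4;
    }
    assert(!"unknown instruction set");
    return instr_addr;
}

/* Target of a PC-relative address computation with a signed offset. */
uint64_t pc_relative_target(isa_t isa, uint64_t instr_addr, int64_t offset)
{
    return pc_as_read(isa, instr_addr) + (uint64_t)offset;
}
```

For code that computes branch targets by hand, such as a disassembler or a JIT compiler, this removes one common source of off-by-one-instruction errors when moving to A64.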
FP and NEON registers
The most significant update to the NEON registers is that NEON now has 32
16-byte registers, instead of the 16 registers it had before. The simpler mapping
scheme between the different register sizes in the floating-point and NEON
register bank makes these registers much easier to use. The mapping is easier for
compilers and optimizers to model and analyze.
Register indexed addressing
The A64 instruction set provides additional addressing modes with respect to
A32, allowing a 64-bit index register to be added to the 64-bit base register, with
optional scaling of the index by the access size. Additionally, it provides sign or
zero-extension of a 32-bit value within an index register, again with optional
scaling.


5.2

C/C++ inline assembly
In this section, we briefly cover how to include assembly code within C or C++ language
modules.
The asm keyword can incorporate inline GCC syntax assembly code into a function. For
example:
#include <stdio.h>

int add(int i, int j)
{
    int res = 0;
    asm (
        // Use `%w[name]` to operate on W registers (as in this case).
        // You can use `%x[name]` for X registers too, but this is the default.
        "ADD %w[result], %w[input_i], %w[input_j]"
        : [result] "=r" (res)
        : [input_i] "r" (i), [input_j] "r" (j)
    );
    return res;
}

int main(void)
{
    int a = 1;
    int b = 2;
    int c = 0;
    c = add(a, b);
    printf("Result of %d + %d = %d\n", a, b, c);
    return 0;
}

The general form of an asm inline assembly statement is:
asm(code [: output_operand_list [: input_operand_list [: clobber_list]]]);

where:
code is the assembly code. In our example, this is "ADD %w[result], %w[input_i], %w[input_j]".
output_operand_list is an optional list of output operands, separated by commas. Each operand
consists of a symbolic name in square brackets, a constraint string, and a C expression in
parentheses. In this example, there is a single output operand: [result] "=r" (res).
input_operand_list is an optional list of input operands, separated by commas. Input operands
use the same syntax as output operands. In this example, there are two input operands: [input_i]
"r" (i) and [input_j] "r" (j).
clobber_list is an optional list of clobbered registers, or other values. In our example, this is
omitted.
When calling functions between C/C++ and assembly code, you must follow the AAPCS64
rules.
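
The earlier example omits the clobber list. The sketch below shows a case where one is needed: ADDS changes the condition flags, so "cc" is listed as clobbered. The function name add_with_carry_out is hypothetical. The plain C fallback is an addition made here so that the same source also builds for other targets, and __asm__ is the GCC alternate spelling of asm that remains available in strict ISO C modes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: adds two 32-bit values and reports the carry out
 * of bit 31 through *carry (0 or 1). */
uint32_t add_with_carry_out(uint32_t a, uint32_t b, uint32_t *carry)
{
    uint32_t sum;
    uint32_t c;

    assert(carry != NULL);
#if defined(__aarch64__) && defined(__GNUC__)
    __asm__ (
        "ADDS %w[sum], %w[a], %w[b]\n\t"   /* add and set the flags    */
        "CSET %w[c], CS"                   /* c = 1 if carry set       */
        : [sum] "=r" (sum), [c] "=r" (c)
        : [a] "r" (a), [b] "r" (b)
        : "cc"                             /* the flags are clobbered  */
    );
#else
    /* Portable fallback for targets other than AArch64. */
    sum = a + b;
    c = (sum < a) ? 1u : 0u;
#endif
    *carry = c;
    return sum;
}
```

Without the "cc" entry, the compiler would be entitled to assume that a flag value computed before the asm statement is still valid after it.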
For further information, see:
https://gcc.gnu.org/onlinedocs/gcc/Using-Assembly-Language-with-C.html#Using-Assembly-Language-with-C


5.3

Switching between the instruction sets
It is not possible to use code from the two execution states within a single application. There is
no interworking between A64 and A32 or T32 instruction sets in ARMv8 as there is between
A32 and T32 instruction sets. Code written in A64 for the ARMv8 processors cannot run on
ARMv7 Cortex-A series processors. However, code written for ARMv7-A processors can run
on ARMv8 processors in the AArch32 execution state. This is summarized in Figure 5-1.

[Figure: T32 (mixed 16 and 32-bit instructions, 32-bit general purpose registers) and A32
(32-bit instructions, 32-bit general purpose registers) switch between each other with BX, BLX,
MOV PC, or LDR PC, or on exception entry or return. A64 (32-bit instructions, 32 and 64-bit
general purpose registers) is connected to A32 and T32 only through exception entry and
exception return.]

Figure 5-1 Switching between instruction sets


Chapter 6
The A64 instruction set

Many programmers writing at the application level do not need to write code in assembly
language. However, assembly code can be useful in cases where highly optimized code is
required. This is the case when writing compilers, or where use of low level features not
directly available in C is needed. It might be required for portions of boot code, device drivers,
or when developing operating systems. Finally, it can be useful to be able to read assembly code
when debugging C, and particularly, to understand the mapping between assembly instructions
and C statements.


6.1

Instruction mnemonics
The A64 assembly language overloads instruction mnemonics, and distinguishes between the
different forms of an instruction based on the operand register names. For example, the ADD
instructions below all have different encodings, but you only have to remember one mnemonic,
and the assembler automatically chooses the correct encoding based on the operands.
ADD W0, W1, W2             // add 32-bit registers
ADD X0, X1, X2             // add 64-bit registers
ADD X0, X1, W2, SXTW       // add sign extended 32-bit register to 64-bit extended
                           // register
ADD X0, X1, #42            // add immediate to 64-bit register
ADD V0.8H, V1.8H, V2.8H    // NEON 16-bit add, in each of 8 lanes

6.2

Data processing instructions
These are the fundamental arithmetic and logical operations of the processor and operate on
values in the general-purpose registers, or a register and an immediate value. Multiply and
divide instructions on page 6-4 can be considered special cases of these instructions.
Data processing instructions mostly use one destination register and two source operands. The
general format can be considered to be the instruction, followed by the operands, as follows:
Instruction Rd, Rn, Operand2

The second operand might be a register, a modified register, or an immediate value. The use of
R indicates that it can be either an X or a W register.
The data processing operations include:
• Arithmetic and logical operations.
• Move and shift operations.
• Instructions for sign and zero extension.
• Bit and bitfield manipulation.
• Conditional comparison and data processing.

6.2.1

Arithmetic and logical operations
Table 6-1 shows some of the available arithmetic and logical operations.
Table 6-1 Arithmetic and logical operations

Type          Instructions
Arithmetic    ADD, SUB, ADC, SBC, NEG
Logical       AND, BIC, ORR, ORN, EOR, EON
Comparison    CMP, CMN, TST
Move          MOV, MVN

Some instructions also have an S suffix, indicating that the instruction sets flags. Of the
instructions in Table 6-1, this includes ADDS, SUBS, ADCS, SBCS, ANDS, and BICS. There are other flag
setting instructions, notably CMP, CMN and TST, but these do not take an S suffix.
The operations ADC and SBC perform additions and subtractions that also use the carry condition
flag as an input.
ADC{S}: Rd = Rn + Rm + C
SBC{S}: Rd = Rn - Rm - 1 + C
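
The two formulas can be checked with a small C model. All names below are hypothetical. The carry out of a subtraction is modeled as the carry out of Rn + NOT(Rm) + C, which matches the formula above because NOT(Rm) equals -Rm - 1 in two's complement arithmetic.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result of a flag-setting operation: value and C flag. */
typedef struct {
    uint64_t value;
    unsigned carry;
} result64;

/* Model of ADCS: Rd = Rn + Rm + C, also producing the carry out. */
result64 adcs64(uint64_t n, uint64_t m, unsigned c)
{
    assert(c <= 1);
    result64 r;
    uint64_t partial = n + m;
    r.value = partial + c;
    r.carry = (partial < n) || (r.value < partial);
    return r;
}

/* Model of SBCS: Rd = Rn - Rm - 1 + C, computed as Rn + ~Rm + C. */
result64 sbcs64(uint64_t n, uint64_t m, unsigned c)
{
    return adcs64(n, ~m, c);
}

/* A 128-bit addition built from two 64-bit halves: the low halves are
 * added first, and the carry feeds the addition of the high halves. */
void add128(const uint64_t a[2], const uint64_t b[2], uint64_t sum[2])
{
    result64 lo = adcs64(a[0], b[0], 0);         /* like ADDS */
    result64 hi = adcs64(a[1], b[1], lo.carry);  /* like ADC  */
    sum[0] = lo.value;
    sum[1] = hi.value;
}
```

In a subtraction, a carry of 1 therefore means that no borrow occurred, which is why a multi-word subtraction starts with the carry set.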

Example 6-1 Arithmetic instructions

ADD W0, W1, W2, LSL #3     // W0 = W1 + (W2 << 3)
SUBS X0, X4, X3, ASR #2    // X0 = X4 - (X3 >> 2), set flags
MOV X0, X1                 // Copy X1 to X0
CMP W3, W4                 // Set flags based on W3 - W4
ADD W0, W5, #27            // W0 = W5 + 27

The logical operations are essentially the same as the corresponding boolean operators operating
on individual bits of the register.
The BIC (Bitwise bit Clear) instruction performs an AND of the first source register with
the inverted value of the second operand. For example, to clear bit [11]
of register X0, use:
MOV X1, #0x800
BIC X0, X0, X1

ORN and EON perform an OR or EOR respectively with a bitwise-NOT of the second operand.
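
Expressed with C operators, these three instructions look as follows. The function names are hypothetical and mirror the mnemonics. The clear_bit helper repeats the BIC example above in C.

```c
#include <assert.h>
#include <stdint.h>

/* Model of BIC Xd, Xn, Xm: AND with the inverted second operand. */
uint64_t bic(uint64_t n, uint64_t m) { return n & ~m; }

/* Model of ORN Xd, Xn, Xm: OR with the inverted second operand. */
uint64_t orn(uint64_t n, uint64_t m) { return n | ~m; }

/* Model of EON Xd, Xn, Xm: exclusive OR with the inverted second operand. */
uint64_t eon(uint64_t n, uint64_t m) { return n ^ ~m; }

/* Clears a single bit, as the MOV and BIC pair does for bit 11. */
uint64_t clear_bit(uint64_t x, unsigned bit)
{
    assert(bit < 64);
    return bic(x, UINT64_C(1) << bit);
}
```

For example, clear_bit(0xFFFF, 11) gives 0xF7FF, the same result as the two-instruction sequence with the mask 0x800.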

The comparison instructions only modify the flags and have no other effect. The range of
immediate values for these instructions is 12 bits, and this value can be optionally shifted 12 bits
to the left.
6.2.2

Multiply and divide instructions
The multiply instructions provided are broadly similar to those in ARMv7-A, but with the
ability to perform 64-bit multiplies in a single instruction.
Table 6-2 Multiplication operations in assembly language

Opcode    Description

Multiply instructions
MADD      Multiply add
MNEG      Multiply negate
MSUB      Multiply subtract
MUL       Multiply
SMADDL    Signed multiply-add long
SMNEGL    Signed multiply-negate long
SMSUBL    Signed multiply-subtract long
SMULH     Signed multiply returning high half
SMULL     Signed multiply long
UMADDL    Unsigned multiply-add long
UMNEGL    Unsigned multiply-negate long
UMSUBL    Unsigned multiply-subtract long
UMULH     Unsigned multiply returning high half
UMULL     Unsigned multiply long

Divide instructions
SDIV      Signed divide
UDIV      Unsigned divide

There are multiply instructions that operate on 32-bit or 64-bit values and return a result of the
same size as the operands. For example, two 64-bit registers can be multiplied to produce a
64-bit result with the MUL instruction.

MUL X0, X1, X2     // X0 = X1 * X2

There is also the ability to add or subtract an accumulator value in a third source register, using
the MADD or MSUB instructions.
The MNEG instruction can be used to negate the result, for example:
MNEG X0, X1, X2    // X0 = -(X1 * X2)

Additionally, there is a range of multiply instructions that produce a long result, that is,
they multiply two 32-bit numbers and generate a 64-bit result. There are both signed and
unsigned variants of these long multiplies (UMULL, SMULL). There are also options to accumulate
a value from another register (UMADDL, SMADDL) or to negate (UMNEGL, SMNEGL).
The 32-bit and 64-bit multiplies, with optional accumulation, give a result of the same size
as the operands:
• 32 ± (32 × 32) gives a 32-bit result.
• 64 ± (64 × 64) gives a 64-bit result.
• ± (32 × 32) gives a 32-bit result.
• ± (64 × 64) gives a 64-bit result.

The widening multiplies, both signed and unsigned, with optional accumulation, give a single
64-bit result:
• 64 ± (32 × 32) gives a 64-bit result.
• ± (32 × 32) gives a 64-bit result.

A 64 × 64 to 128-bit multiply requires a sequence of two instructions to generate a pair of 64-bit
result registers:
• ± (64 × 64) gives the lower 64 bits of the result [63:0].
• (64 × 64) gives the higher 64 bits of the result [127:64].
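The two-instruction 128-bit multiply can be sketched in portable C. Here mul64 stands in for MUL and umulh64 for UMULH; both names are hypothetical, and the high half is built from 32-bit partial products only because standard C has no 128-bit integer type.

```c
#include <stdint.h>

// MUL: lower 64 bits of the product (wraps modulo 2^64).
static uint64_t mul64(uint64_t a, uint64_t b) { return a * b; }

// UMULH: upper 64 bits of the unsigned 128-bit product.
static uint64_t umulh64(uint64_t a, uint64_t b) {
    uint64_t a_lo = (uint32_t)a, a_hi = a >> 32;
    uint64_t b_lo = (uint32_t)b, b_hi = b >> 32;

    uint64_t lo_lo = a_lo * b_lo;
    uint64_t hi_lo = a_hi * b_lo;
    uint64_t lo_hi = a_lo * b_hi;
    uint64_t hi_hi = a_hi * b_hi;

    // Sum everything that contributes to bits [63:32] of the product.
    // Three 32-bit terms cannot overflow a 64-bit accumulator.
    uint64_t mid = (lo_lo >> 32) + (uint32_t)hi_lo + (uint32_t)lo_hi;

    // The carry out of 'mid' feeds bit 64, the bottom of the high half.
    return hi_hi + (hi_lo >> 32) + (lo_hi >> 32) + (mid >> 32);
}
```

A caller would pair the two results, for example hi = umulh64(x, y) and lo = mul64(x, y), in the same way that assembly code pairs UMULH and MUL on the same source registers.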

Note
The list contains no 32 × 64 options. You cannot directly multiply a 32-bit W register by a 64-bit
X register.
The ARMv8-A architecture has support for signed and unsigned division of 32-bit and 64-bit
sized values. For example:
UDIV W0, W1, W2    // W0 = W1 / W2 (unsigned, 32-bit divide)
SDIV X0, X1, X2    // X0 = X1 / X2 (signed, 64-bit divide)

Overflow and divide-by-zero are not trapped:
• Any integer division by zero returns zero.
• Overflow can only occur in SDIV:
  — INT_MIN / -1 returns INT_MIN, where INT_MIN is the smallest negative number that
    can be encoded in the registers used for the operation.

The result of a division is always rounded towards zero, as in most C/C++ dialects.
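The untrapped corner cases can be written out as a small C model. This is a sketch under the assumption that the helpers udiv32 and sdiv64 (hypothetical names) stand in for UDIV on W registers and SDIV on X registers. In C itself, both corner cases are undefined behaviour, so the model has to test for them explicitly.

```c
#include <stdint.h>

// UDIV (32-bit): division by zero returns zero instead of trapping.
static uint32_t udiv32(uint32_t rn, uint32_t rm) {
    return rm == 0 ? 0 : rn / rm;
}

// SDIV (64-bit): also handles the single overflow case.
static int64_t sdiv64(int64_t rn, int64_t rm) {
    if (rm == 0)
        return 0;                  // divide-by-zero returns zero
    if (rn == INT64_MIN && rm == -1)
        return INT64_MIN;          // overflow returns INT_MIN
    return rn / rm;                // C99 and later truncate towards zero
}
```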


6.2.3 Shift operations
The following instructions are specifically for shifting:
• Logical Shift Left (LSL). The LSL instruction performs multiplication by a power of 2.
• Logical Shift Right (LSR). The LSR instruction performs unsigned division by a power of 2.
• Arithmetic Shift Right (ASR). The ASR instruction performs signed division by a power of 2,
  preserving the sign bit. Negative values round towards minus infinity rather than towards zero.
• Rotate Right (ROR). The ROR instruction performs a bitwise rotation, wrapping the bits
  rotated out of the LSB into the MSB.

Table 6-3 Shift and move operations

Instruction    Description

Shift
ASR            Arithmetic shift right
LSL            Logical shift left
LSR            Logical shift right
ROR            Rotate right

Move
MOV            Move
MVN            Bitwise NOT


LSL  Logical shift left        Zeros are shifted in at the LSB. Bits shifted out are lost.
                               Multiplication by 2^n, where n is the shift amount.
LSR  Logical shift right       Zeros are shifted in at the MSB. Bits shifted out are lost.
                               Unsigned division by 2^n, where n is the shift amount.
ASR  Arithmetic shift right    The sign bit is copied in at the MSB. Bits shifted out are lost.
                               Division by 2^n, where n is the shift amount, preserving the sign bit.
ROR  Rotate right              Bit rotate with wrap around from LSB to MSB.

Figure 6-1 Shift operations

The register that is specified for a shift can be 32-bit or 64-bit. The shift amount can be
specified either as an immediate, in the range 0 to register size minus one, or by a register, in
which case the value is taken only from the bottom five (modulo-32) or six (modulo-64) bits.
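A minimal C sketch of the register-specified forms on 32-bit operands, showing the modulo-32 treatment of the shift amount. The function names are hypothetical. ASR is written without using >> on a negative signed value, because that operation is implementation-defined in C.

```c
#include <stdint.h>

// LSL Wd, Wn, Wm: only the bottom five bits of Wm are used.
static uint32_t lsl32(uint32_t rn, uint32_t rm) { return rn << (rm & 31); }

// LSR Wd, Wn, Wm: zeros are shifted in at the top.
static uint32_t lsr32(uint32_t rn, uint32_t rm) { return rn >> (rm & 31); }

// ASR Wd, Wn, Wm: copies of the sign bit are shifted in at the top.
static uint32_t asr32(uint32_t rn, uint32_t rm) {
    unsigned n = rm & 31;
    uint32_t r = rn >> n;
    if ((rn & 0x80000000u) && n != 0)
        r |= ~(0xFFFFFFFFu >> n);   // fill the vacated bits with ones
    return r;
}

// ROR Wd, Wn, Wm: bits leaving the LSB re-enter at the MSB.
static uint32_t ror32(uint32_t rn, uint32_t rm) {
    unsigned n = rm & 31;
    return n == 0 ? rn : (rn >> n) | (rn << (32 - n));
}
```

In this model a shift amount of 33 behaves like a shift by 1, and a rotate by 32 leaves the value unchanged. The 64-bit forms would mask with 63 instead of 31.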


6.2.4 Bitfield and byte manipulation instructions
There are instructions that extend a byte, halfword, or word to register size, which can be either
X or W. These instructions exist in both signed (SXTB, SXTH, SXTW) and unsigned (UXTB, UXTH)
variants and are aliases to the appropriate bitfield manipulation instruction.
Both the signed and unsigned variants of these instructions extend a byte, halfword, or word
(although only SXTW operates on a word) to register size. The source is always a W register. The
destination register is either an X or a W register, except for SXTW which must be an X register.
For example:
SXTB X0, W1    // Sign extend the least significant byte of register W1
               // from 8 bits to 64 bits by repeating the leftmost bit of the
               // byte.

Bitfield instructions are similar to those that exist in ARMv7 and include Bit Field Insert (BFI),
and signed and unsigned Bit Field Extract ((S/U)BFX). There are extra bitfield instructions too,
such as BFXIL (Bit Field Extract and Insert Low), UBFIZ (Unsigned Bit Field Insert in Zero), and
SBFIZ (Signed Bit Field Insert in Zero).

Bit 31 is on the left and bit 0 is on the right.

W0 = 0000 0000 1010 1100 0100 1110 0111 0100

BFI W0, W0, #9, #6       ;Bit field insert

W0 = 0000 0000 1010 1100 0110 1000 0111 0100

UBFX W1, W0, #18, #7     ;Bit field extract (zero extend)

W1 = 0000 0000 0000 0000 0000 0000 0010 1011

BFI W1, WZR, #3, #4      ;Bit field clear (zeros inserted from WZR)

W1 = 0000 0000 0000 0000 0000 0000 0000 0011

Figure 6-2 Bit manipulation instructions
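The operations in Figure 6-2 can be checked against a small C model. The helpers below (low_mask, bfi32, ubfx32 and sbfx32) are hypothetical names for 32-bit forms, and they assume that lsb + width does not exceed 32, as the instruction encodings require.

```c
#include <stdint.h>

// Mask with the low 'width' bits set, for 1 <= width <= 32.
static uint32_t low_mask(unsigned width) {
    return width >= 32 ? 0xFFFFFFFFu : ((1u << width) - 1u);
}

// BFI Wd, Wn, #lsb, #width: copy the low 'width' bits of Wn into Wd
// starting at bit 'lsb', leaving the other bits of Wd unchanged.
static uint32_t bfi32(uint32_t rd, uint32_t rn, unsigned lsb, unsigned width) {
    uint32_t mask = low_mask(width) << lsb;
    return (rd & ~mask) | ((rn << lsb) & mask);
}

// UBFX Wd, Wn, #lsb, #width: extract a field and zero extend it.
static uint32_t ubfx32(uint32_t rn, unsigned lsb, unsigned width) {
    return (rn >> lsb) & low_mask(width);
}

// SBFX Wd, Wn, #lsb, #width: extract a field and sign extend it.
// The result is returned as the 32-bit register pattern.
static uint32_t sbfx32(uint32_t rn, unsigned lsb, unsigned width) {
    uint32_t field = ubfx32(rn, lsb, width);
    uint32_t sign = 1u << (width - 1);
    return (field ^ sign) - sign;
}
```

Starting from 0x00AC4E74, the figure's register value, bfi32(w0, w0, 9, 6) followed by ubfx32(w0, 18, 7) reproduces the second and third bit patterns in the figure. Inserting zeros with bfi32(w1, 0, 3, 4) gives the final one.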

Note
There are also BFM, UBFM, and SBFM instructions. These are Bit Field Move instructions, which are
new for ARMv8. However, the instructions do not need to be used explicitly, as aliases are
provided for all cases. These aliases are the bitfield operations already described: [SU]XT[BHWX],
ASR/LSL/LSR immediate, BFI, BFXIL, SBFIZ, SBFX, UBFIZ, and UBFX.
If you are familiar with the ARMv7 architecture, you might recognize the other bit manipulation
instruction:
• CLZ Count leading zero bits in a register.


Similarly, the same byte manipulation instructions are available:
• RBIT Reverse all bits.
• REV Reverse the byte order of a register.
• REV16 Reverse the byte order of each halfword in a register.

Figure 6-3 REV16 instruction (source Xn, destination Xd)

• REV32 Reverse the byte order of each word in a register.

Figure 6-4 REV32 instruction (source Xn, destination Xd)

These operations can be performed on either word (32-bit) or doubleword (64-bit) sized
registers, except for REV32, which applies only to 64-bit registers.
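The byte and bit reversal instructions, together with CLZ, can be modelled in C as shown below. This is a sketch on the assumption that the hypothetical names rev_w, rev16_w, rev32_x, rbit_w and clz_w denote the W-register forms (and the X-register form for REV32). The loops favour clarity over speed.

```c
#include <stdint.h>

// REV Wd, Wn: reverse the order of the four bytes.
static uint32_t rev_w(uint32_t x) {
    return (x >> 24) | ((x >> 8) & 0x0000FF00u) |
           ((x << 8) & 0x00FF0000u) | (x << 24);
}

// REV16 Wd, Wn: swap the two bytes inside each halfword.
static uint32_t rev16_w(uint32_t x) {
    return ((x >> 8) & 0x00FF00FFu) | ((x << 8) & 0xFF00FF00u);
}

// REV32 Xd, Xn: reverse the byte order inside each 32-bit word.
static uint64_t rev32_x(uint64_t x) {
    return ((uint64_t)rev_w((uint32_t)(x >> 32)) << 32) | rev_w((uint32_t)x);
}

// RBIT Wd, Wn: reverse all 32 bits.
static uint32_t rbit_w(uint32_t x) {
    uint32_t r = 0;
    for (int i = 0; i < 32; i++) {
        r = (r << 1) | (x & 1u);
        x >>= 1;
    }
    return r;
}

// CLZ Wd, Wn: count leading zero bits; an all-zero input gives 32.
static unsigned clz_w(uint32_t x) {
    unsigned n = 0;
    for (uint32_t bit = 0x80000000u; bit != 0 && !(x & bit); bit >>= 1)
        n++;
    return n;
}
```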

6.2.5 Conditional instructions
The A64 instruction set does not support conditional execution for every instruction. Predicated
execution of instructions does not offer sufficient benefit to justify its significant use of opcode
space.
Processor state on page 4-6 describes the four status flags: Zero (Z), Negative (N), Carry (C),
and Overflow (V). Table 6-4 indicates the value of these bits for flag setting operations.
Table 6-4 Condition flags

Flag  Name      Description
N     Negative  Set to the same value as bit[31] of the result. For a 32-bit signed integer,
                bit[31] being set indicates that the value is negative.
Z     Zero      Set to 1 if the result is zero, otherwise it is set to 0.
C     Carry     Set to the carry-out value from result, or to the value of the last bit shifted
                out from a shift operation.
V     Overflow  Set to 1 if signed overflow or underflow occurred, otherwise it is set to 0.


The C flag is set if the result of an unsigned operation overflows the result register.
The V flag operates in the same way as the C flag, but for signed operations.
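To make the flag definitions concrete, here is a C sketch that computes N, Z, C and V for 32-bit ADDS and SUBS. The struct nzcv_t and both function names are hypothetical. Note one detail that the summary above does not spell out: for a subtraction, the ARM carry flag is set when no borrow occurs, that is, when the first operand is greater than or equal to the second as unsigned values.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct { bool n, z, c, v; } nzcv_t;

// Flags produced by ADDS Wd, Wn, Wm.
static nzcv_t adds32_flags(uint32_t a, uint32_t b) {
    uint32_t r = a + b;
    nzcv_t f;
    f.n = (r >> 31) & 1u;                      // copy of result bit[31]
    f.z = (r == 0);
    f.c = (r < a);                             // unsigned overflow
    f.v = ((~(a ^ b) & (a ^ r)) >> 31) & 1u;   // same-sign inputs, other-sign result
    return f;
}

// Flags produced by SUBS Wd, Wn, Wm (and therefore by CMP Wn, Wm).
static nzcv_t subs32_flags(uint32_t a, uint32_t b) {
    uint32_t r = a - b;
    nzcv_t f;
    f.n = (r >> 31) & 1u;
    f.z = (r == 0);
    f.c = (a >= b);                            // C == 1 means no borrow
    f.v = (((a ^ b) & (a ^ r)) >> 31) & 1u;    // signs differ and result sign flipped
    return f;
}
```

For instance, adding 1 to 0x7FFFFFFF sets N and V but not C, while adding 1 to 0xFFFFFFFF sets Z and C but not V.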

Note
The condition flags (NZCV) and the condition codes are the same as in A32 and T32. However,
A64 adds NV (0b1111), though it behaves the same as its complement, AL (0b1110). This differs
from A32, which did not assign any meaning to 0b1111.

Table 6-5 Condition codes

Code  Encoding  Meaning (when set by CMP)                             Meaning (when set by FCMP)                               Condition flags
EQ    0b0000    Equal to.                                             Equal to.                                                Z==1
NE    0b0001    Not equal to.                                         Unordered, or not equal to.                              Z==0
CS    0b0010    Carry set (identical to HS).                          Greater than, equal to, or unordered (identical to HS).  C==1
HS    0b0010    Greater than, equal to (unsigned) (identical to CS).  Greater than, equal to, or unordered (identical to CS).  C==1
CC    0b0011    Carry clear (identical to LO).                        Less than (identical to LO).                             C==0
LO    0b0011    Unsigned less than (identical to CC).                 Less than (identical to CC).                             C==0
MI    0b0100    Minus, Negative.                                      Less than.                                               N==1
PL    0b0101    Positive or zero.                                     Greater than, equal to, or unordered.                    N==0
VS    0b0110    Signed overflow.                                      Unordered. (At least one argument was NaN).              V==1
VC    0b0111    No signed overflow.                                   Not unordered. (No argument was NaN).                    V==0
HI    0b1000    Greater than (unsigned).                              Greater than or unordered.                               (C==1) && (Z==0)
LS    0b1001    Less than or equal to (unsigned).                     Less than or equal to.                                   (C==0) || (Z==1)
GE    0b1010    Greater than or equal to (signed).                    Greater than or equal to.                                N==V
LT    0b1011    Less than (signed).                                   Less than or unordered.                                  N!=V
GT    0b1100    Greater than (signed).                                Greater than.                                            (Z==0) && (N==V)
LE    0b1101    Less than or equal to (signed).                       Less than, equal to or unordered.                        (Z==1) || (N!=V)
AL    0b1110    Always executed.                                      Default. Always executed.                                Any
NV    0b1111    Always executed.                                      Always executed.                                         Any

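The "Condition flags" column of Table 6-5 can be turned into a short C predicate. The sketch below assumes a hypothetical nzcv_t struct and a function named condition_holds that takes the 4-bit encoding from the table. It relies on the regular structure of the encodings: each odd encoding is the inverse of the even encoding before it, except for 0b1111 (NV), which always passes.

```c
#include <stdbool.h>

typedef struct { bool n, z, c, v; } nzcv_t;

// Returns true if the condition with 4-bit encoding 'code' passes.
static bool condition_holds(unsigned code, nzcv_t f) {
    bool result;
    switch ((code & 0xFu) >> 1) {
    case 0:  result = f.z;                   break;   // EQ / NE
    case 1:  result = f.c;                   break;   // CS (HS) / CC (LO)
    case 2:  result = f.n;                   break;   // MI / PL
    case 3:  result = f.v;                   break;   // VS / VC
    case 4:  result = f.c && !f.z;           break;   // HI / LS
    case 5:  result = (f.n == f.v);          break;   // GE / LT
    case 6:  result = (f.n == f.v) && !f.z;  break;   // GT / LE
    default: result = true;                  break;   // AL / NV
    }
    // Odd encodings invert the test, apart from NV (0b1111).
    if ((code & 1u) && (code & 0xFu) != 0xFu)
        result = !result;
    return result;
}
```

As a worked case, CMP with operands 1 and 2 leaves N=1, Z=0, C=0, V=0. With those flags LT (0b1011), LO (0b0011), LS (0b1001) and NE (0b0001) pass, and GE (0b1010) fails.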

There is a small set of conditional data processing instructions. These instructions are
unconditionally executed but use the condition flags as an extra input to the instruction. This set
has been provided to replace common usage of conditional execution in ARM code.
The instruction types that read the condition flags are:
Add/subtract with carry
    The traditional ARM instructions, used, for example, for multi-precision arithmetic and
    checksums.
Conditional select with optional increment, negate, or invert
    Conditionally select between one source register and a second incremented,
    negated, inverted, or unmodified source register.
    These are the most common uses of single conditional instructions in A32 and
    T32. Typical uses include conditional counting or calculating the absolute value
    of a signed quantity.
Conditional operations
The A64 instruction set enables conditional execution of only program flow control branch
instructions. This is in contrast to A32 and T32, where most instructions can be predicated with
a condition code. The conditional data processing instructions that A64 provides instead can be
summarized as follows:
Conditional select (move)
    • CSEL Select between two registers based on a condition. Unconditional
      instructions, followed by a conditional select, can replace short conditional
      sequences.
    • CSINC Select between two registers based on a condition. Return the first
      source register or the second source register incremented by one.
    • CSINV Select between two registers based on a condition. Return the first
      source register or the inverted second source register.
    • CSNEG Select between two registers based on a condition. Return the first
      source register or the negated second source register.
Conditional set
    Conditionally select between 0 and 1 (CSET) or 0 and -1 (CSETM). Used, for
    example, to capture the state of the condition flags as a boolean value or mask
    in a general-purpose register.
Conditional compare
    (CCMP and CCMN) Sets the condition flags to the result of a comparison if the original
    condition is true. If it is not true, the condition flags are set to a specified condition
    flag state. The conditional compare instruction is very useful for expressing
    nested or compound comparisons.
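As an illustration of a compound comparison, the sketch below models how a test such as (a == 0 && b == 5) might be expressed with a compare followed by a conditional compare (CCMP in the A64 mnemonics). It is deliberately simplified: only the Z flag is tracked, and the names cmp_z, ccmp_eq_z and both_match are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

// CMP Xn, Xm, reduced to its effect on Z: Z = (Xn == Xm).
static bool cmp_z(uint64_t rn, uint64_t rm) { return rn == rm; }

// CCMP Xn, Xm, #nzcv, EQ, reduced to Z:
// if EQ currently holds (Z == 1), perform the comparison;
// otherwise load the flags from the immediate (imm_z is its Z bit).
static bool ccmp_eq_z(bool z, uint64_t rn, uint64_t rm, bool imm_z) {
    return z ? cmp_z(rn, rm) : imm_z;
}

// if (a == 0 && b == 5) could be expressed along the lines of:
//     CMP  X0, #0
//     CCMP X1, #5, #0, EQ    // if a != 0, force Z = 0 so the test fails
//     B.EQ taken
static bool both_match(uint64_t a, uint64_t b) {
    bool z = cmp_z(a, 0);
    z = ccmp_eq_z(z, b, 5, false);
    return z;                  // B.EQ would be taken when Z == 1
}
```

The point of the technique is that the second comparison needs no branch. If the first test fails, the immediate flag value makes the final condition fail as well.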
Note
Conditional select and conditional compare are also available for floating-point registers using
the FCSEL and FCCMP instructions.
For example:
CSINC X0, X1, X0, NE    // Set the return register X0 to X1 if Zero flag clear,
                        // else increment X0

Some aliases of these instructions are provided for the cases where the zero register is used, or
where the same register is used for both source operands of the instruction.
For example:
CINC X0, X0, LS     // If less than or same (LS) then X0 = X0 + 1
CSET W0, EQ         // If the previous comparison was equal (Z=1) then W0 = 1,
                    // else W0 = 0
CSETM X0, NE        // If not equal then X0 = -1, else X0 = 0
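The conditional select family and its aliases can be summarized in a few lines of C. In this sketch the condition is passed in already evaluated as a bool, and all function names are hypothetical. The alias definitions follow the architectural pattern in which CINC, CSET and CSETM are encoded as CSINC or CSINV with the condition inverted.

```c
#include <stdbool.h>
#include <stdint.h>

// Base instructions: Rd = cond ? Rn : f(Rm)
static uint64_t csel (bool cond, uint64_t rn, uint64_t rm) { return cond ? rn : rm; }
static uint64_t csinc(bool cond, uint64_t rn, uint64_t rm) { return cond ? rn : rm + 1; }
static uint64_t csinv(bool cond, uint64_t rn, uint64_t rm) { return cond ? rn : ~rm; }
static uint64_t csneg(bool cond, uint64_t rn, uint64_t rm) { return cond ? rn : 0 - rm; }

// CINC Xd, Xn, cond   is   CSINC Xd, Xn, Xn, invert(cond)
static uint64_t cinc(bool cond, uint64_t rn) { return csinc(!cond, rn, rn); }

// CSET Xd, cond       is   CSINC Xd, XZR, XZR, invert(cond)
static uint64_t cset(bool cond) { return csinc(!cond, 0, 0); }

// CSETM Xd, cond      is   CSINV Xd, XZR, XZR, invert(cond)
static uint64_t csetm(bool cond) { return csinv(!cond, 0, 0); }

// Absolute value of a signed quantity held in a register:
// keep the value if it is non-negative, otherwise negate it.
static uint64_t abs64(uint64_t x) {
    bool ge = (x >> 63) == 0;      // sign bit clear
    return csneg(ge, x, x);
}
```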


This class of instructions provides a powerful way to avoid the use of branches or conditionally
executed instructions. Compilers, or assembly programmers, might adopt a technique of
performing the operations for both branches of an if-then-else statement. Then the correct result
is selected at the end.
For example, consider the simple C code:
if (i == 0)
    r = r + 2;
else
    r = r - 1;

This might produce code similar to:
CMP w0, #0             // if (i == 0)
SUB w2, w1, #1         // r = r - 1
ADD w1, w1, #2         // r = r + 2
CSEL w1, w1, w2, EQ    // select between the two results


6.3 Memory access instructions
As with all prior ARM processors, the ARMv8 architecture is a Load/Store architecture. This
means that no data processing instruction operates directly on data in memory. The data must
first be loaded into registers, modified, and then stored to memory. The program must specify
an address, the size of data to be transferred, and a source or destination register. There are
additional Load and Store instructions which provide further options, such as non-temporal
Load/Store, Load/Store exclusives, and Acquire/Release.
Memory instructions can access Normal memory in an unaligned fashion (see Chapter 13
Memory Ordering). This is not supported by exclusive accesses, load acquire or store release
variants. If unaligned accesses are not desired, they can be configured to be faulted.

6.3.1 Load instruction format
The general form of a Load instruction is as follows:
LDR Rt, 

For loads into integer registers, you can choose the size to load. For example, to load a value
that is smaller than the destination register, append a size suffix to the LDR instruction, giving
one of the following forms:
• LDRB (8-bit, zero extended).
• LDRSB (8-bit, sign extended).
• LDRH (16-bit, zero extended).
• LDRSH (16-bit, sign extended).
• LDRSW (32-bit, sign extended).
There are also unscaled-offset forms such as LDUR (see Specifying the address for a Load
or Store instruction on page 6-14). Programmers will not normally need to use the LDUR form
explicitly, because most assemblers can select the appropriate version based on the offset used.
You do not need to specify a zero-extended load to an X register, because writing a W register
effectively zero extends to the entire register width.

Instruction    Memory    Extension      R4 after the load (bits 63..0)
LDRSB W4,      8A        Sign extend    00 00 00 00 FF FF FF 8A
LDRSB X4,      8A        Sign extend    FF FF FF FF FF FF FF 8A
LDRB W4,       8A        Zero extend    00 00 00 00 00 00 00 8A

Figure 6-5 Load instructions
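The three cases in Figure 6-5 can be reproduced with a small C model of the 64-bit register R4. The function names are hypothetical, and each returns the full 64-bit register contents after the load, including the upper 32 bits that are cleared by a write to a W register.

```c
#include <stdint.h>

// LDRB Wt: zero extend the byte. Writing Wt also clears bits [63:32].
static uint64_t ldrb_w(const uint8_t *addr) {
    return *addr;
}

// LDRSB Wt: sign extend the byte to 32 bits. Bits [63:32] are cleared.
static uint64_t ldrsb_w(const uint8_t *addr) {
    uint32_t w = *addr;
    if (w & 0x80u)
        w |= 0xFFFFFF00u;
    return w;
}

// LDRSB Xt: sign extend the byte to the full 64 bits.
static uint64_t ldrsb_x(const uint8_t *addr) {
    uint64_t x = *addr;
    if (x & 0x80u)
        x |= 0xFFFFFFFFFFFFFF00ull;
    return x;
}
```

Loading the byte 0x8A through each helper gives the three register images shown in the figure.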


6.3.2 Store instruction format
Similarly, the general form of a Store instruction is as follows:
STR Rn, 

There are also unscaled-offset forms such as STUR (see Specifying the address for a Load
or Store instruction on page 6-14). Programmers will not normally need to use the STUR form
explicitly, as most assemblers can select the appropriate version based on the offset used.
The size to be stored might be smaller than the register. You specify this by adding a B or H
suffix to the STR. It is always the least significant part of the register that is stored in such a case.

6.3.3 Floating-point and NEON scalar loads and stores
Load and Store instructions can also access floating-point/NEON registers. Here, the size is
determined only by the register being loaded or stored, which can be any of the B, H, S, D, or
Q registers. This information is summarized in Table 6-6, and Table 6-7.
For Load instructions:
Table 6-6 Memory bits read by Load instructions

Load    Xt    Wt    Qt    Dt    St    Ht    Bt
LDR     64    32    128   64    32    16    8
LDP     128   64    256   128   64    -     -
LDRB    -     8     -     -     -     -     -
LDRH    -     16    -     -     -     -     -
LDRSB   8     8     -     -     -     -     -
LDRSH   16    16    -     -     -     -     -
LDRSW   32    -     -     -     -     -     -
LDPSW   -     -     -     -     -     -     -


For Store instructions:
Table 6-7 Memory bits written by Store instructions

Store   Xt    Wt    Qt    Dt    St    Ht    Bt
STR     64    32    128   64    32    16    8
STP     128   64    256   128   64    -     -
STRB    -     8     -     -     -     -     -
STRH    -     16    -     -     -     -     -


No sign-extension options are available for loads into FP/SIMD registers. Addresses for such
loads are still specified using the general-purpose registers.
For example:
LDR D0, [X0, X1]

Loads register D0 with the doubleword at the memory address pointed to by X0 plus X1.

Note
Floating-point and scalar NEON Loads and Stores use the same addressing modes as integer
Loads and Stores.

6.3.4 Specifying the address for a Load or Store instruction
The addressing modes available to A64 are similar to those in A32 and T32. There are some
additional restrictions as well as some new features, but the addressing modes available to A64
will not be surprising to someone familiar with A32 or T32.
In A64, the base register of an address operand must always be an X register. However, several
instructions support zero-extension or sign-extension so that a 32-bit offset can be provided as
a W register.
Offset modes
Offset addressing modes add an immediate value or an optionally-modified register value to a
64-bit base register to generate an address.
Table 6-8 Offset addressing modes

Example instruction           Description
LDR X0, [X1]                  Load from the address in X1
LDR X0, [X1, #8]              Load from address X1 + 8
LDR X0, [X1, X2]              Load from address X1 + X2
LDR X0, [X1, X2, LSL #3]      Load from address X1 + (X2 << 3)
LDR X0, [X1, W2, SXTW]        Load from address X1 + sign_extend(W2)
LDR X0, [X1, W2, SXTW #3]     Load from address X1 + (sign_extend(W2) << 3)


Typically, when specifying a shift or extension option, the shift amount can be either 0 (the
default) or log2 of the access size in bytes (so that Rn <<  multiplies Rn by the access
size). This supports common array-indexing operations.
// A C example showing accesses that a compiler is likely to generate.
#include <stdint.h>

void example_dup(int32_t a[], int32_t length) {
    int32_t first = a[0];                     // LDR W3, [X0]
    for (int32_t i = 1; i < length; i++) {
        a[i] = first;                         // STR W3, [X0, W2, SXTW #2]
    }
}
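
The address arithmetic behind the offset modes can also be written out directly. The following sketch computes the effective address for three of the forms in Table 6-8. The function names are hypothetical, and addresses are modelled as plain 64-bit integers rather than real pointers.

```c
#include <stdint.h>

// [Xn, #imm]: base plus a signed immediate offset.
static uint64_t addr_imm(uint64_t base, int64_t imm) {
    return base + (uint64_t)imm;
}

// [Xn, Xm, LSL #shift]: base plus a scaled 64-bit index.
static uint64_t addr_reg_lsl(uint64_t base, uint64_t index, unsigned shift) {
    return base + (index << shift);
}

// [Xn, Wm, SXTW #shift]: the 32-bit index is sign extended to 64 bits,
// then scaled, then added to the base.
static uint64_t addr_reg_sxtw(uint64_t base, uint32_t index, unsigned shift) {
    uint64_t ext = index;
    if (ext & 0x80000000u)
        ext |= 0xFFFFFFFF00000000ull;
    return base + (ext << shift);
}
```

With a shift of 2, the SXTW form scales a signed 32-bit loop counter by the size of an int32_t, as in the array store in the example above. An index of -1 therefore addresses the element just before the base.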

Index modes
Index modes are similar to offset modes, but they also update the base register. The syntax is the
same as in A32 and T32, but the set of operations is more restrictive. Usually, only immediate
offsets can be provided for index modes.


There are two variants: pre-index modes which apply the offset before accessing the memory,
and post-index modes which apply the offset after accessing the memory.
Table 6-9 Index addressing modes

Example instruction         Description
LDR X0, [X1, #8]!           Pre-index: Update X1 first (to X1 + #8), then load from the new address
LDR X0, [X1], #8            Post-index: Load from the unmodified address in X1 first, then update X1 (to X1 + #8)
STP X0, X1, [SP, #-16]!     Push X0 and X1 to the stack.
LDP X0, X1, [SP], #16       Pop X0 and X1 off the stack.


These options map cleanly onto some common C operations:
// A C example showing accesses that a compiler is likely to generate.
void example_strcpy(char * dst, const char * src)
{
    char c;
    do {
        c = *(src++);          // LDRB W2, [X1], #1
        *(dst++) = c;          // STRB W2, [X0], #1
    } while (c != '\0');
}
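
The stack push and pop rows of Table 6-9 show why both index variants exist. The sketch below models them on a small array. The type stack_model_t and the helpers push_pair and pop_pair are hypothetical, the stack is assumed to grow downwards, and sp is an index into 64-bit slots rather than a byte address, so a 16-byte adjustment becomes a step of two slots.

```c
#include <stdint.h>

// Hypothetical descending stack of 64-bit slots. sp == 16 means empty.
typedef struct {
    uint64_t slots[16];
    unsigned sp;
} stack_model_t;

// STP X0, X1, [SP, #-16]!
// Pre-index: update SP first, then store at the new address.
static void push_pair(stack_model_t *s, uint64_t x0, uint64_t x1) {
    s->sp -= 2;                    // SP = SP - 16 bytes
    s->slots[s->sp] = x0;          // first register at the lower address
    s->slots[s->sp + 1] = x1;
}

// LDP X0, X1, [SP], #16
// Post-index: load from the current address, then update SP.
static void pop_pair(stack_model_t *s, uint64_t *x0, uint64_t *x1) {
    *x0 = s->slots[s->sp];
    *x1 = s->slots[s->sp + 1];
    s->sp += 2;                    // SP = SP + 16 bytes
}
```

Pairs pushed in this model come back in last-in, first-out order, and sp returns to its starting value once every push has been matched by a pop.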

PC-relative modes (load-literal)
A64 adds another addressing mode specifically for accessing literal pools. Literal pools are
blocks of data encoded in an instruction stream. The pools are not executed, but their data can
be accessed from surrounding code using PC-relative memory addresses. Literal pools are often
used to encode constant values that do not fit into a simple move-immediate instruction.
In A32 and T32, the PC can be read like a general-purpose register, so a literal pool can be
accessed simply by specifying PC as the base register.
In A64, PC is not generally accessible, but instead there is a special addressing mode (for load
instructions only) that accesses a PC-relative address. This special-purpose addressing mode
also has a much greater range than the PC-relative loads in A32 and T32 could achieve, so literal
pools can be positioned more sparsely.
Table 6-10
Example instruction

Description

LDR W0, 

Source Exif Data:
File Type                       : PDF
File Type Extension             : pdf
MIME Type                       : application/pdf
PDF Version                     : 1.7
Linearized                      : Yes
Create Date                     : 2015:05:08 08:47:18Z
Copyright                       : Copyright © 2015 ARM. All rights reserved.
Author                          : ARM Limited
Creator                         : FrameMaker 8.0
Keywords                        : Cortex-A, Cortex-A50, Cortex-A53, Cortex-A57, ARMv8
Title                           : ARM Cortex-A Series Programmer’s Guide for ARMv8-A
Modify Date                     : 2017:12:07 07:56:44-05:00
Producer                        : 3-Heights(TM) PDF Optimization Shell 4.8.25.2 (http://www.pdf-tools.com)
Page Count                      : 296
Page Mode                       : UseOutlines