Page Count: 296
ARM® Cortex®-A Series
Programmer’s Guide for ARMv8-A
Version: 1.0
Copyright © 2015 ARM. All rights reserved.
ARM DEN0024A (ID050815)
Release Information
The following changes have been made to this book.
Change history
Date            Issue   Confidentiality     Change
24 March 2015   A       Non-Confidential    First release
Proprietary Notice
This document is protected by copyright and other related rights and the practice or implementation of the information
contained in this document may be protected by one or more patents or pending patent applications. No part of this
document may be reproduced in any form by any means without the express prior written permission of ARM. No
license, express or implied, by estoppel or otherwise to any intellectual property rights is granted by this document
unless specifically stated.
Your access to the information in this document is conditional upon your acceptance that you will not use or permit
others to use the information for the purposes of determining whether implementations infringe any third party patents.
THIS DOCUMENT IS PROVIDED “AS IS”. ARM PROVIDES NO REPRESENTATIONS AND NO
WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, INCLUDING, WITHOUT LIMITATION, THE IMPLIED
WARRANTIES OF MERCHANTABILITY, SATISFACTORY QUALITY, NON-INFRINGEMENT OR FITNESS
FOR A PARTICULAR PURPOSE WITH RESPECT TO THE DOCUMENT. For the avoidance of doubt, ARM makes
no representation with respect to, and has undertaken no analysis to identify or understand the scope and content of,
third party patents, copyrights, trade secrets, or other rights.
This document may include technical inaccuracies or typographical errors.
TO THE EXTENT NOT PROHIBITED BY LAW, IN NO EVENT WILL ARM BE LIABLE FOR ANY DAMAGES,
INCLUDING WITHOUT LIMITATION ANY DIRECT, INDIRECT, SPECIAL, INCIDENTAL, PUNITIVE, OR
CONSEQUENTIAL DAMAGES, HOWEVER CAUSED AND REGARDLESS OF THE THEORY OF LIABILITY,
ARISING OUT OF ANY USE OF THIS DOCUMENT, EVEN IF ARM HAS BEEN ADVISED OF THE
POSSIBILITY OF SUCH DAMAGES.
This document consists solely of commercial items. You shall be responsible for ensuring that any use, duplication or
disclosure of this document complies fully with any relevant export laws and regulations to assure that this document
or any portion thereof is not exported, directly or indirectly, in violation of such export laws. Use of the word “partner”
in reference to ARM’s customers is not intended to create or refer to any partnership relationship with any other
company. ARM may make changes to this document at any time and without notice.
If any of the provisions contained in these terms conflict with any of the provisions of any signed written agreement
covering this document with ARM, then the signed written agreement prevails over and supersedes the conflicting
provisions of these terms. This document may be translated into other languages for convenience, and you agree that if
there is any conflict between the English version of this document and any translation, the terms of the English version
of the Agreement shall prevail.
Words and logos marked with ® or ™ are registered trademarks or trademarks of ARM Limited or its affiliates in the
EU and/or elsewhere. All rights reserved. Other brands and names mentioned in this document may be the trademarks
of their respective owners. Please follow ARM’s trademark usage guidelines at
http://www.arm.com/about/trademark-usage-guidelines.php
Copyright © 2015, ARM Limited or its affiliates. All rights reserved.
ARM Limited. Company 02557590 registered in England.
110 Fulbourn Road, Cambridge, England CB1 9NJ.
Confidentiality Status
This document is Non-Confidential. The right to use, copy and disclose this document may be subject to license
restrictions in accordance with the terms of the agreement entered into by ARM and the party that ARM delivered this
document to.
Product Status
The information in this document is final, that is for a developed product.
Web Address
http://www.arm.com
Contents
ARM Cortex-A Series Programmer’s Guide for ARMv8-A

Preface
    Glossary
    References
    Feedback on this book
Chapter 1   Introduction
    1.1     How to use this book
Chapter 2   ARMv8-A Architecture and Processors
    2.1     ARMv8-A
    2.2     ARMv8-A Processor properties
Chapter 3   Fundamentals of ARMv8
    3.1     Execution states
    3.2     Changing Exception levels
    3.3     Changing execution state
Chapter 4   ARMv8 Registers
    4.1     AArch64 special registers
    4.2     Processor state
    4.3     System registers
    4.4     Endianness
    4.5     Changing execution state (again)
    4.6     NEON and floating-point registers
Chapter 5   An Introduction to the ARMv8 Instruction Sets
    5.1     The ARMv8 instruction sets
    5.2     C/C++ inline assembly
    5.3     Switching between the instruction sets
Chapter 6   The A64 instruction set
    6.1     Instruction mnemonics
    6.2     Data processing instructions
    6.3     Memory access instructions
    6.4     Flow control
    6.5     System control and other instructions
Chapter 7   AArch64 Floating-point and NEON
    7.1     New features for NEON and Floating-point in AArch64
    7.2     NEON and Floating-Point architecture
    7.3     AArch64 NEON instruction format
    7.4     NEON coding alternatives
Chapter 8   Porting to A64
    8.1     Alignment
    8.2     Data types
    8.3     Issues when porting code from a 32-bit to 64-bit environment
    8.4     Recommendations for new C code
Chapter 9   The ABI for ARM 64-bit Architecture
    9.1     Register use in the AArch64 Procedure Call Standard
Chapter 10  AArch64 Exception Handling
    10.1    Exception handling registers
    10.2    Synchronous and asynchronous exceptions
    10.3    Changes to execution state and Exception level caused by exceptions
    10.4    AArch64 exception table
    10.5    Interrupt handling
    10.6    The Generic Interrupt Controller
Chapter 11  Caches
    11.1    Cache terminology
    11.2    Cache controller
    11.3    Cache policies
    11.4    Point of coherency and unification
    11.5    Cache maintenance
    11.6    Cache discovery
Chapter 12  The Memory Management Unit
    12.1    The Translation Lookaside Buffer
    12.2    Separation of kernel and application Virtual Address spaces
    12.3    Translating a Virtual Address to a Physical Address
    12.4    Translation tables in ARMv8-A
    12.5    Translation table configuration
    12.6    Translations at EL2 and EL3
    12.7    Access permissions
    12.8    Operating system use of translation table descriptors
    12.9    Security and the MMU
    12.10   Context switching
    12.11   Kernel access with user permissions
Chapter 13  Memory Ordering
    13.1    Memory types
    13.2    Barriers
    13.3    Memory attributes
Chapter 14  Multi-core processors
    14.1    Multi-processing systems
    14.2    Cache coherency
    14.3    Multi-core cache coherency within a cluster
    14.4    Bus protocol and the Cache Coherent Interconnect
Chapter 15  Power Management
    15.1    Idle management
    15.2    Dynamic voltage and frequency scaling
    15.3    Assembly language power instructions
    15.4    Power State Coordination Interface
Chapter 16  big.LITTLE Technology
    16.1    Structure of a big.LITTLE system
    16.2    Software execution models in big.LITTLE
    16.3    big.LITTLE MP
Chapter 17  Security
    17.1    TrustZone hardware architecture
    17.2    Switching security worlds through interrupts
    17.3    Security in multi-core systems
    17.4    Switching between Secure and Non-secure state
Chapter 18  Debug
    18.1    ARM debug hardware
    18.2    ARM trace hardware
    18.3    DS-5 debug and trace
Chapter 19  ARMv8 Models
    19.1    ARM Fast Models
    19.2    ARMv8-A Foundation Platform
    19.3    The Base Platform FVP
Preface
In 2013, ARM released its 64-bit ARMv8 architecture, the first major change to the ARM
architecture since ARMv7 in 2007, and the most fundamental and far reaching change since the
original ARM architecture was created.
Development of the architecture has continued for some years. Early versions were being used
before the Cortex-A Series Programmer’s Guide for ARMv7-A was first released. The first of
the Programmer’s Guide series from ARM, it post-dated the introduction of the 32-bit ARMv7
architecture by some years. Almost immediately there were requests for a version to cover the
ARMv8 architecture. It was intended from the outset that a guide to ARMv8 should be available
as soon as possible.
This book was started when the first versions of the ARMv8 architecture were being tested and
codified. As always, moving from a system that is known and understood to something new and
unknown can present a number of problems. The engineers who supplied information for the
present book are, by and large, the same engineers who supplied the information for the original
Cortex-A Series Programmer’s Guide. This book has been made richer by their observations and
insights as they use the new architecture and solve the problems it presents.
The Programmer’s Guides are meant to complement, rather than replace, other ARM
documentation available, such as the Technical Reference Manuals (TRMs) for the processors
themselves, documentation for individual devices or boards or, most importantly, the ARM
Architecture Reference Manual (the ARM ARM). They are intended to provide a gentle
introduction to the ARM architecture, and cover all the main concepts that you need to know
about, in an easy to read format, with examples of actual code in both C and assembly language,
and with hints and tips for writing your own code.
It might be argued that if you are an application developer, you do not need to know what goes
on inside a processor. ARM Application processors can easily be regarded as black boxes which
simply run your code when you say go. Instead, this book provides a single guide, bringing
together information from a wide variety of sources, for those programmers who get the system
to the point where application developers can run applications, such as those involved in ASIC
verification, or those working on boot code and device drivers.
During bring-up of a new board or System-on-Chip (SoC), engineers may have to investigate
issues with the hardware. Memory system behavior is among the most common places for these
to manifest, for example, deadlocks where the processor cannot make forward progress because
of memory system lock. Debugging these problems requires an understanding of the operation
and effect of cache or MMU use. This is different from debugging a failing piece of code.
In a similar vein, system architects (usually hardware engineers) make choices early in the
design about the implementation of DMA, frame buffers and other parts of the memory system
where an understanding of data flow between agents is required. In this case it is difficult to
make sensible decisions about it if you do not understand when a cache will help you and when
it gets in the way, or how the OS will use the MMU. Similar considerations apply in many other
places.
This is not an introductory level book, nor is it a purely technical description of the architecture
and processors that merely states the facts with little or no explanation of ‘how’ and ‘why’.
ARM and all who have collaborated on this book hope it successfully navigates between the two
extremes, while attempting to explain some of the more intricate aspects of the architecture.
Glossary
Abbreviations and terms used in this document are defined here.
AAPCS
ARM Architecture Procedure Call Standard.
AArch32 state
The ARM 32-bit execution state that uses 32-bit general-purpose registers,
and a 32-bit Program Counter (PC), Stack Pointer (SP), and Link Register
(LR). AArch32 execution state provides a choice of two instruction sets,
A32 and T32, previously called the ARM and Thumb instruction sets.
AArch64 state
The ARM 64-bit execution state that uses 64-bit general-purpose registers,
and a 64-bit Program Counter (PC), Stack Pointer (SP), and Exception
Link Registers (ELR). AArch64 execution state provides a single
instruction set, A64.
ABI
Application Binary Interface.
ACE
AXI Coherency Extensions.
AES
Advanced Encryption Standard.
AMBA®
Advanced Microcontroller Bus Architecture.
AMP
Asymmetric Multi-Processing.
ARM ARM
The ARM Architecture Reference Manual.
ASIC
Application Specific Integrated Circuit.
ASID
Address Space ID.
AXI
Advanced eXtensible Interface.
BE8
Byte Invariant Big-Endian Mode.
BTAC
Branch Target Address Cache.
BTB
Branch Target Buffer.
CCI
Cache Coherent Interconnect.
CHI
Coherent Hub Interface.
CP15
Coprocessor 15, the System control coprocessor for AArch32 and ARMv7-A.
DAP
Debug Access Port.
DMA
Direct Memory Access.
DMB
Data Memory Barrier.
DS-5™
The ARM Development Studio.
DSB
Data Synchronization Barrier.
DSP
Digital Signal Processing.
DSTREAM
An ARM debug and trace unit.
DVFS
Dynamic Voltage/Frequency Scaling.
EABI
Embedded ABI.
ECC
Error Correcting Code.
ECT
Embedded Cross Trigger.
EL0
Exception level used to execute user applications.
EL1
Exception level normally used to run operating systems.
EL2
Hypervisor Exception level. In the Normal world, or Non-Secure state,
this is used to execute hypervisor code.
EL3
Secure Monitor Exception level. This is used to execute the code that
guards transitions between the Secure and Normal worlds.
ETB
Embedded Trace Buffer™.
ETM
Embedded Trace Macrocell™.
Execution state
The operational state of the processor, either 64-bit (AArch64) or 32-bit
(AArch32).
FIQ
An interrupt type (formerly fast interrupt).
FPSCR
Floating-Point Status and Control Register.
GCC
GNU Compiler Collection.
GIC
Generic Interrupt Controller.
Harvard architecture
Architecture with physically separate storage and signal pathways for
instructions and data.
HCR
Hyp Configuration Register.
HMP
Heterogeneous Multi-Processing.
IMPLEMENTATION DEFINED
Some properties of the processor are defined by the manufacturer.
IPA
Intermediate Physical Address.
IRQ
Interrupt Request, normally for external interrupts.
ISA
Instruction Set Architecture.
ISB
Instruction Synchronization Barrier.
ISR
Interrupt Service Routine.
Jazelle™
The ARM bytecode acceleration technology.
LLP64
Indicates the size in bits of basic C data types. Under LLP64, int and long
data types are 32 bits, and pointers and long long are 64 bits.
LP64
Indicates the size in bits of basic C data types. Under LP64, int is
32 bits, and long, long long, and pointers are 64 bits.
LPAE
Large Physical Address Extension.
LSB
Least Significant Bit.
MESI
A cache coherency protocol with four states that are Modified, Exclusive,
Shared and Invalid.
MMU
Memory Management Unit.
MOESI
A cache coherency protocol with five states that are Modified, Owned,
Exclusive, Shared and Invalid.
Monitor mode
When EL3 is using AArch32, the PE mode in which the Secure Monitor
must execute. This mode guards transitions between the Secure and
Normal worlds.
MPU
Memory Protection Unit.
NEON™
The ARM Advanced SIMD Extensions.
NIC
Network InterConnect.
Normal world
The execution environment when the processor is in the Non-secure state.
PCS
Procedure Call Standard.
PIPT
Physically Indexed, Physically Tagged.
PoC
Point of Coherency.
PoU
Point of Unification.
PSR
Program Status Register.
SCU
Snoop Control Unit.
Secure world
The execution environment when the processor is in the Secure State.
SIMD
Single Instruction, Multiple Data.
SMC
Secure Monitor Call. An ARM assembler instruction that causes an
exception that is taken synchronously to EL3.
SMC32
32-bit SMC calling convention.
SMC64
64-bit SMC calling convention.
SMC Function Identifier
A 32-bit integer that identifies which function is being invoked by this
SMC call. Passed in R0 or W0 to every SMC call.
SMMU
System MMU.
SMP
Symmetric Multi-Processing.
SoC
System on Chip.
SP
Stack Pointer.
SPSR
Saved Program Status Register.
Streamline
A graphical performance analysis tool.
SVC
Supervisor Call instruction.
SYS
System Mode.
Thumb®
An instruction set extension to ARM.
Thumb-2
A technology extending the Thumb instruction set to support both 16-bit
and 32-bit instructions.
TLB
Translation Lookaside Buffer.
TrustedOS
This is the operating system running in the Secure World. It supports the
execution of trusted applications in Secure EL0. When EL3 is using
AArch64 it executes in Secure EL1. When EL3 is using AArch32 it
executes in Secure EL3 modes other than Monitor mode.
TrustZone®
The ARM security extension.
TTB
Translation Table Base.
TTBR
Translation Table Base Register.
UART
Universal Asynchronous Receiver/Transmitter.
UEFI
Unified Extensible Firmware Interface.
U-Boot
A Linux Bootloader.
UNK
Unknown.
UNKNOWN
Values in a register cannot be known before they are reset.
UNPREDICTABLE
The value taken cannot be predicted.
USR
User mode, a non-privileged processor mode.
VFP
The ARM floating-point instruction set. Before ARMv7, the VFP
extension was called the Vector Floating-Point architecture, and was used
for vector operations.
VIPT
Virtually Indexed, Physically Tagged.
VMID
Virtual Machine Identifier.
XN
Execute Never.
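The LP64 and LLP64 entries above differ only in the width of long. As a minimal sketch, the following C function reports which data model the compiler targets by inspecting sizeof. The function name data_model is hypothetical and is not part of this guide or of any ARM library. The sketch assumes 8-bit bytes.

```c
#include <stddef.h>

/* Hypothetical helper: classify the C data model from type sizes.
 * Assumes 8-bit bytes, so sizeof results are in 8-bit units. */
static const char *data_model(void)
{
    if (sizeof(int) == 4 && sizeof(long) == 8 && sizeof(void *) == 8)
        return "LP64";   /* long and pointers are 64 bits */
    if (sizeof(int) == 4 && sizeof(long) == 4 && sizeof(void *) == 8)
        return "LLP64";  /* only long long and pointers are 64 bits */
    if (sizeof(int) == 4 && sizeof(long) == 4 && sizeof(void *) == 4)
        return "ILP32";  /* typical of 32-bit targets */
    return "other";
}
```

Calling puts(data_model()) from a program built for an LP64 target, as is usual for AArch64 Linux toolchains, would be expected to print LP64. Other platforms can differ, so check the result rather than assume it.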
References
ANSI/IEEE Std 754-1985, “IEEE Standard for Binary Floating-Point Arithmetic”.
ANSI/IEEE Std 754-2008, “IEEE Standard for Floating-Point Arithmetic”.
ANSI/IEEE Std 1003.1-1990, “Standard for Information Technology - Portable Operating
System Interface (POSIX) Base Specifications, Issue 7”.
ANSI/IEEE Std 1149.1-2001, “IEEE Standard Test Access Port and Boundary-Scan
Architecture”.
The ARMv8 Architecture Reference Manual, known as the ARM ARM, fully describes the
ARMv8 instruction set architecture, programmer’s model, system registers, debug features and
memory model. It forms a detailed specification to which all implementations of ARM
processors must adhere.
References to the ARM Architecture Reference Manual in this document are to:
ARM® Architecture Reference Manual - ARMv8, for ARMv8-A architecture profile (ARM DDI
0487).
Note
In the event of a contradiction between this book and the ARM ARM, the ARM ARM is
definitive and must take precedence. In most instances, however, the ARM ARM and the
Cortex-A Series Programmer’s Guide for ARMv8-A cover two separate world views. The most
likely scenario is that this book describes something in a way that does not cover all
architecturally permitted behaviors, or simply rewords an abstract concept in more practical
terms.
ARM® Cortex®-A Series Programmer’s Guide for ARMv7-A (DEN 0013).
ARM® NEON™ Programmer’s Guide (DEN 0018).
ARM® Cortex®-A53 MPCore Processor Technical Reference Manual (DDI 0500).
ARM® Cortex®-A57 MPCore Processor Technical Reference Manual (DDI 0488).
ARM® Generic Interrupt Controller Architecture Specification (ARM IHI 0048).
ARM® Compiler armasm Reference Guide v6.01 (DUI 0802).
ARM® Compiler Software Development Guide v5.05 (DUI 0471).
ARM® C Language Extensions (IHI 0053).
ELF for the ARM® Architecture (ARM IHI 0044).
The individual processor Technical Reference Manuals provide a detailed description of the
processor behavior. They can be obtained from the ARM website documentation area
http://infocenter.arm.com.
Connected community
The ARM Connected Community makes it easier to design using ARM processors and IP. It is
an interactive platform containing information, discussions and blogs which help you to develop
an ARM-based design efficiently, in collaboration with ARM engineers and our 1200+
ecosystem Partners and enthusiasts. Visitors also use the community to find new companies to
work with from the many ARM Partners who first introduced their products and services in their
dedicated area. You can join the Connected Community on http://community.arm.com.
Feedback on this book
ARM hopes you find the Cortex-A Series Programmer’s Guide for ARMv8-A easy to read, yet
in enough depth to provide a comprehensive introduction to using the processors.
If you have any comments on this book, don’t understand our explanations, think something is
missing, or think that it is incorrect, send an e-mail to errata@arm.com. Give:
• The title.
• The number, ARM DEN0024A.
• The page number(s) to which your comments apply.
• What you think needs to be changed.
ARM also welcomes general suggestions for additions and improvements.
Chapter 1
Introduction
ARMv8-A is the latest generation of the ARM architecture that is targeted at the Applications
Profile. In this book, the name ARMv8 is used to describe the overall architecture, which now
includes both 32-bit execution and 64-bit execution states. ARMv8 introduces the ability to
perform execution with 64-bit wide registers, but provides mechanisms for backwards
compatibility to enable existing ARMv7 software to be executed.
AArch64 is the name used to describe the 64-bit execution state of the ARMv8 architecture.
AArch32 describes the 32-bit execution state of the ARMv8 architecture, which is almost
identical to ARMv7. GNU and Linux documentation (except for Red Hat and Fedora
distributions) sometimes refers to AArch64 as ARM64.
Because many of the concepts of the ARMv8-A architecture are shared with the ARMv7-A
architecture, the details of all those concepts are not covered here. As a general introduction to
the ARMv7-A architecture, refer to the ARM® Cortex®-A Series Programmer’s Guide. This
guide can also help you to familiarize yourself with some of the concepts discussed in this
volume. However, the ARMv8-A architecture profile is backwards compatible with earlier
iterations, like most versions of the ARM architecture. Therefore, there is a certain amount of
overlap between the way the ARMv8 architecture and previous architectures function. The
general principles of the ARMv7 architecture are only covered to explain the differences
between the ARMv8 and earlier ARMv7 architectures.
Cortex-A series processors now include both ARMv8-A and ARMv7-A implementations:
• The Cortex-A5, Cortex-A7, Cortex-A8, Cortex-A9, Cortex-A15, and Cortex-A17
processors all implement the ARMv7-A architecture.
• The Cortex-A53 and Cortex-A57 processors implement the ARMv8-A architecture.
ARMv8 processors still support software (with some exceptions) written for the ARMv7-A
processors. This means, for example, that 32-bit code written for the ARMv7 Cortex-A series
processors also runs on ARMv8 processors such as the Cortex-A57. However, the code will
only run when the ARMv8 processor is in the AArch32 execution state. The A64 64-bit
instruction set, however, does not run on ARMv7 processors, and only runs on the ARMv8
processors.
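Because the same C source can be built for either instruction set family, code sometimes needs to know which one it was compiled for. The sketch below assumes a GCC or Clang toolchain, which predefine __aarch64__ when targeting A64 and __arm__ when targeting the 32-bit ARM instruction sets. Other compilers may use different macros. The function name target_instruction_sets is hypothetical.

```c
/* Report which ARM instruction set family this code was compiled for.
 * Assumption: __aarch64__ and __arm__ are the GCC/Clang predefined
 * macros; other compilers may spell them differently. */
static const char *target_instruction_sets(void)
{
#if defined(__aarch64__)
    return "A64";          /* runs only in the AArch64 execution state */
#elif defined(__arm__)
    return "A32/T32";      /* runs on ARMv7-A, or on ARMv8-A in AArch32 */
#else
    return "not ARM";      /* built for some other architecture */
#endif
}
```

This is a compile-time check only. It says what the binary was built for, not which execution state a given ARMv8 processor is in at run time.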
Some knowledge of the C programming language and microprocessors is assumed of the
readers of this book. There are pointers to further reading, referring to books and websites that
can give you a deeper level of background to the subject matter.
The change from 32-bit to 64-bit
There are several performance gains derived from moving to a 64-bit processor.
• The A64 instruction set provides some significant performance benefits, including a
larger register pool. The additional registers and the ARM Architecture Procedure Call
Standard (AAPCS) provide a performance boost when you must pass more than four
arguments in a function call. On ARMv7, this would require using the stack, whereas in
AArch64 up to eight parameters can be passed in registers.
• Wider integer registers enable code that operates on 64-bit data to work more efficiently.
A 32-bit processor might require several operations to perform an arithmetic operation on
64-bit data. A 64-bit processor might be able to perform the same task in a single
operation, typically at the same speed as a 32-bit operation on that processor. Therefore,
code that performs many 64-bit operations can be significantly faster.
• 64-bit operation enables applications to use a larger virtual address space. While the Large
Physical Address Extension (LPAE) extends the physical address space of a 32-bit
processor to 40 bits, it does not extend the virtual address space. This means that even with
LPAE, a single application is limited to a 32-bit (4GB) address space, and in practice to
less than that, because some of this address space is reserved for the operating system.
•
Software running on a 32-bit architecture might need to map some data in or out of
memory while executing. Having a larger address space, with 64-bit pointers, avoids this
problem. However, using 64-bit pointers does incur some cost. The same piece of code
typically uses more memory when running with 64-bit pointers than with 32-bit pointers.
Each pointer is stored in memory and requires eight bytes instead of four. This might
sound trivial, but can add up to a significant penalty. Furthermore, the increased usage of
memory space associated with a move to 64-bits can cause a drop in the number of
accesses that hit in the cache. This in turn can reduce performance.
The larger virtual address space also enables memory-mapping larger files. This is the
mapping of the file contents into the memory map of a thread. This can occur even though
the physical RAM might not be large enough to contain the whole file.
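The point about wider integer registers can be illustrated with a minimal C sketch. The function names are hypothetical and chosen only for this illustration. The first function spells out the steps that a core with 32-bit registers has to perform for one 64-bit addition (low-word add, carry detection, high-word add). The second is the same addition as a 64-bit core can perform it in one register-wide operation.

```c
#include <stdint.h>

/* Adds two 64-bit values using only 32-bit operations, as a core with
 * 32-bit registers has to: add the low words, detect the carry, then
 * add the high words plus the carry. */
uint64_t add64_using_32bit_ops(uint64_t a, uint64_t b)
{
    uint32_t a_lo = (uint32_t)a, a_hi = (uint32_t)(a >> 32);
    uint32_t b_lo = (uint32_t)b, b_hi = (uint32_t)(b >> 32);

    uint32_t lo = a_lo + b_lo;        /* wraps modulo 2^32 */
    uint32_t carry = (lo < a_lo);     /* 1 if the low-word add wrapped */
    uint32_t hi = a_hi + b_hi + carry;

    return ((uint64_t)hi << 32) | lo;
}

/* With 64-bit registers the same work is one register-wide add. */
uint64_t add64_native(uint64_t a, uint64_t b)
{
    return a + b;
}
```

Both functions return the same result for all inputs. The difference is only in how many operations the hardware needs, which is why code that performs many 64-bit operations tends to benefit most from the move to AArch64.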
1.1
How to use this book
This book provides a single guide for programmers who want to use the Cortex-A series
processors that implement the ARMv8 architecture. The guide brings together information from
a wide variety of sources that is useful to both ARM assembly language and C programmers. It
is meant to complement rather than replace other ARM documentation available for ARMv8
processors. Other documents that provide specific information include the ARM Technical
Reference Manuals (TRMs) for the processors themselves, documentation for individual
devices or boards and, most importantly, the ARM Architecture Reference Manual - ARMv8, for
ARMv8-A architecture profile (the ARM ARM).
This book is not written at an introductory level. It assumes some knowledge of the C
programming language and microprocessors. Hardware concepts such as caches and Memory
Management Units are covered, but only where this knowledge is valuable to the application
writer. The book looks at the way operating systems utilize ARMv8 features, and how to take
full advantage of the capabilities of the ARMv8 processors. Some chapters contain pointers to
additional reading. We also refer to books and web sites that can give a deeper level of
background to the subject matter, but often the main focus is the ARM-specific detail. No
assumptions are made on the use of any particular toolchain, and both GNU and ARM tools are
mentioned throughout the book.
If you are new to the ARMv8 architecture, Chapter 2 ARMv8-A Architecture and Processors
describes the previous 32-bit ARM architectures, introduces ARMv8, and describes some of the
properties of the ARMv8 processors. Next, Chapter 3 Fundamentals of ARMv8 describes the
building blocks of the architecture in the form of Exception levels and Execution states.
Chapter 4 ARMv8 Registers then describes the registers available to you in the ARMv8
architecture.
One of the most significant changes introduced in the ARMv8 architecture is the addition of a
64-bit instruction set, which complements the existing 32-bit architecture. Chapter 5 An
Introduction to the ARMv8 Instruction Sets describes the differences between the Instruction Set
Architecture (ISA) of ARMv7 (A32), and that of the A64 instruction set. Chapter 6 The A64
instruction set looks at the Instruction Set and its use in more detail. In addition to a new
instruction set for general operation, ARMv8 also has a changed NEON and floating-point
instruction set. Chapter 7 AArch64 Floating-point and NEON describes the changes in ARMv8
to ARM Advanced SIMD (NEON) and floating-point instructions. For a more detailed guide to
NEON and its capabilities at ARMv7, refer to the ARM® NEON™ Programmer’s Guide.
Chapter 8 Porting to A64 of this book covers the problems you might encounter when porting
code from other architectures, or previous ARM architectures to ARMv8. Chapter 9 The ABI
for ARM 64-bit Architecture describes the Application Binary Interface (ABI) for the ARM
architecture specification. The ABI is a specification for all the programming behavior of an
ARM target, which governs the form your 64-bit code takes. Chapter 10 AArch64 Exception
Handling describes the exception handling behavior of ARMv8 in AArch64 state.
Following this, the focus moves to the internal architecture of the processor. Chapter 11 Caches
describes the design of caches and how the use of caches can improve performance.
An important motivating factor behind ARMv8 and moving to a 64-bit architecture is
potentially enabling access to larger address space than is possible using just 32 bits. Chapter 12
The Memory Management Unit describes how the MMU converts virtual memory addresses to
physical addresses.
Chapter 13 Memory Ordering describes the weakly-ordered model of memory in the ARMv8
architecture. Generally, this means that the order of memory accesses is not required to be the
same as the program order for load and store operations. Only some programmers must be aware
of memory ordering issues. If your code interacts directly with the hardware or with code
executing on other cores, directly loads or writes instructions to be executed, or modifies page
tables, then you might have to think about ordering and barriers. This also applies if you are
implementing your own synchronization functions or lock-free algorithms.
Chapter 14 Multi-core processors describes how the ARMv8-A architecture supports systems
with multiple cores. Systems that use the ARMv8 processors are almost always implemented in
such a way. Chapter 15 Power Management describes the hardware features that ARM cores
use to reduce power use. A further aspect of power management, applied to multi-core and
multi-cluster systems is covered in Chapter 16 big.LITTLE Technology. This chapter describes
how big.LITTLE technology from ARM couples together an energy efficient LITTLE core with
a high performance big core, to provide a system with high performance and power efficiency.
Chapter 17 Security describes how the ARMv8 processors can create a Secure, or trusted system
that protects assets such as passwords or credit card details from unauthorized copying or
damage. The main part of the book then concludes with Chapter 18 Debug describing the
standard debug and trace features available in the Cortex-A53 and Cortex-A57 processors.
Chapter 2
ARMv8-A Architecture and Processors
The ARM architecture dates back to 1985, but it has not stayed static. On the contrary, it has
developed massively since the early ARM cores, adding features and capabilities at each step:
ARMv4 and earlier
These early processors used only the ARM 32-bit instruction set.
ARMv4T
The ARMv4T architecture added the Thumb 16-bit instruction set to the ARM
32-bit instruction set. This was the first widely licensed architecture. It was
implemented by the ARM7TDMI® and ARM9TDMI® processors.
ARMv5TE
The ARMv5TE architecture added improvements for DSP-type operations,
saturated arithmetic, and for ARM and Thumb interworking. The ARM926EJ-S®
implements this architecture.
ARMv6
ARMv6 made several enhancements, including support for unaligned memory
accesses, significant changes to the memory architecture and for multi-processor
support. Additionally, some support for SIMD operations operating on bytes or
halfwords within the 32-bit registers was included. The ARM1136JF-S®
implements this architecture. The ARMv6 architecture also provided some
optional extensions, notably Thumb-2 and Security Extensions (TrustZone®).
Thumb-2 extends Thumb to be a mixed length 16-bit and 32-bit instruction set.
ARMv7-A
The ARMv7-A architecture makes the Thumb-2 extensions mandatory and adds
the Advanced SIMD extensions (NEON). Before ARMv7, all cores conformed to
essentially the same architecture or feature set. To help address an increasing
range of differing applications, ARM introduced a set of architecture profiles:
•
ARMv7-A provides all the features necessary to support a platform
Operating System such as Linux.
•
ARMv7-R provides predictable real-time high-performance.
•
ARMv7-M is targeted at deeply-embedded microcontrollers.
An M profile was also added to the ARMv6 architecture to enable features
for the older architecture. The ARMv6-M profile is used by low-cost
microprocessors with low power consumption.
2.1
ARMv8-A
The ARMv8-A architecture is the latest generation ARM architecture targeted at the
Applications Profile. The name ARMv8 is used to describe the overall architecture, which now
includes both 32-bit execution and 64-bit execution. It introduces the ability to perform
execution with 64-bit wide registers, while preserving backwards compatibility with existing
ARMv7 software.
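[Figure: architecture generations v5, v6, v7 and v8, with the features VFPv2, Thumb-2,
TrustZone, SIMD, VFPv3/v4 and NEON added before ARMv8. ARMv8 is shown with an AArch32
state (key feature ARMv7-A compatibility, A32+T32 ISAs, scalar FP (SP and DP), Advanced
SIMD (SP float), Crypto) and an AArch64 state (A64 ISA, scalar FP (SP and DP), Advanced
SIMD (SP and DP float), Crypto).]
Figure 2-1 Development of the ARMv8 architecture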
The ARMv8-A architecture introduces a number of changes, which enable significantly higher
performance processor implementations to be designed.
Large physical address
This enables the processor to access beyond 4GB of physical memory.
64-bit virtual addressing
This enables virtual memory beyond the 4GB limit. This is important for modern
desktop and server software using memory mapped file I/O or sparse addressing.
Automatic event signaling
This enables power-efficient, high-performance spinlocks.
Larger register files
Thirty-one 64-bit general-purpose registers increase performance and reduce
stack use.
Efficient 64-bit immediate generation
There is less need for literal pools.
Large PC-relative addressing range
A +/-4GB addressing range for efficient data addressing within shared libraries
and position-independent executables.
Additional 16KB and 64KB translation granules
This reduces Translation Lookaside Buffer (TLB) miss rates and depth of page
walks.
New exception model
This reduces OS and hypervisor software complexity.
Efficient cache management
User space cache operations improve dynamic code generation efficiency. Fast
Data cache clear using a Data Cache Zero instruction.
Hardware-accelerated cryptography
Provides 3× to 10× better software encryption performance. This is useful for
small granule decryption and encryption too small to offload to a hardware
accelerator efficiently, for example HTTPS.
Load-Acquire, Store-Release instructions
Designed for C++11, C11, Java memory models. They improve performance of
thread-safe code by eliminating explicit memory barrier instructions.
NEON double-precision floating-point advanced SIMD
This enables SIMD vectorization to be applied to a much wider set of algorithms,
for example, scientific computing, High Performance Computing (HPC) and
supercomputers.
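The Load-Acquire, Store-Release entry in the list above refers to the C11 memory model. The following is a minimal sketch of that model in portable C11, not code taken from this guide. The names payload, ready, publish and try_consume are hypothetical. When such code is built for AArch64, a compiler can typically implement the release store and acquire load with the Store-Release and Load-Acquire instructions (STLR and LDAR) instead of separate barrier instructions, but the exact code generated depends on the toolchain.

```c
#include <stdatomic.h>
#include <stdbool.h>

static int payload;          /* ordinary data being handed over */
static atomic_bool ready;    /* flag that publishes the data; starts false */

/* Producer: write the data, then set the flag with release ordering so
 * that the payload write cannot be reordered after the flag write. */
void publish(int value)
{
    payload = value;
    atomic_store_explicit(&ready, true, memory_order_release);
}

/* Consumer: an acquire load of the flag guarantees that, if the flag is
 * observed as true, the payload written before the release store is
 * visible. Returns false if nothing has been published yet. */
bool try_consume(int *out)
{
    if (!atomic_load_explicit(&ready, memory_order_acquire))
        return false;
    *out = payload;
    return true;
}
```

In a real program publish() and try_consume() would run on different threads or cores. The ordering guarantee comes from the release and acquire pair on the flag, so no explicit barrier is written in the source.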
2.2
ARMv8-A Processor properties
Table 2-1 compares the properties of the processor implementations from ARM that support the
ARMv8-A architecture.
Table 2-1 Comparison of ARMv8-A processors

Processor                         Cortex-A53                                        Cortex-A57
Release date                      July 2014                                         January 2015
Typical clock speed               2GHz on 28nm                                      1.5 to 2.5 GHz on 20nm
Execution order                   In-order                                          Out of order, speculative issue, superscalar
Cores                             1 to 4                                            1 to 4
Integer Peak throughput           2.3MIPS/MHz                                       4.1 to 4.76MIPS/MHz (a)
Floating-point Unit               Yes                                               Yes
Half-precision                    Yes                                               Yes
Hardware Divide                   Yes                                               Yes
Fused Multiply Accumulate         Yes                                               Yes
Pipeline stages                   8                                                 15+
Return stack entries              4                                                 8
Generic Interrupt Controller      External                                          External
AMBA interface                    64-bit I/F AMBA 4 (Supports AMBA 4 and AMBA 5)    128-bit I/F AMBA 4 (Supports AMBA 4 and AMBA 5)
L1 Cache size (Instruction)       8KB to 64KB                                       48KB
L1 Cache structure (Instruction)  2-way set associative                             3-way set associative
L1 Cache size (Data)              8KB to 64KB                                       32KB
L1 Cache structure (Data)         4-way set associative                             2-way set associative
L2 Cache                          Optional                                          Integrated
L2 Cache size                     128KB to 2MB                                      512KB to 2MB
L2 Cache structure                16-way set associative                            16-way set associative
Main TLB entries                  512                                               1024
uTLB entries                      10                                                48 I-side, 32 D-side

a. IMPLEMENTATION DEFINED
2.2.1
ARMv8 processors
This section describes each of the processors that implement the ARMv8-A architecture. It only
gives a general description in each case. For more specific information on each processor, see
Table 2-1 on page 2-5.
The Cortex-A53 processor
The Cortex-A53 processor is a mid-range, low-power processor with between one and four
cores in a single cluster, each with an L1 cache subsystem, an optional integrated GICv3/4
interface, and an optional L2 cache controller.
The Cortex-A53 processor is an extremely power efficient processor capable of supporting
32-bit and 64-bit code. It delivers significantly higher performance than the highly successful
Cortex-A7 processor. It is capable of deployment as a standalone applications processor, or
paired with the Cortex-A57 processor in a big.LITTLE configuration for optimum performance,
scalability, and energy efficiency.
[Figure: block diagram of the Cortex-A53 processor with cores 0 to 3. Labels: ARM CoreSight
Multicore Debug and Trace, Generic Interrupt Controller, NEON Data Engine with crypto
extension, Floating-point unit, Level 1 Instruction Cache, Level 1 Data Cache with ECC,
Performance Monitor Unit, Memory Management Unit, Data Processing Unit, SCU, ACP,
Integrated Level 2 Cache with ECC, AMBA 4 ACE or AMBA 5 CHI Coherent Bus Interface.]
Figure 2-2 Cortex-A53 processor
The Cortex-A53 processor has the following features:
•
In-order, eight stage pipeline.
•
Lower power consumption from the use of hierarchical clock gating, power domains, and
advanced retention modes.
•
Increased dual-issue capability from duplication of execution resources and dual
instruction decoders.
•
Power-optimized L2 cache design delivers lower latency and balances performance with
efficiency.
The Cortex-A57 processor
The Cortex-A57 processor is targeted at mobile and enterprise computing applications
including compute intensive 64-bit applications such as high end computer, tablet, and server
products. It can be used with the Cortex-A53 processor in an ARM big.LITTLE configuration,
for scalable performance and more efficient energy use.
The Cortex-A57 processor features cache coherent interoperability with other processors,
including the ARM Mali™ family of Graphics Processing Units (GPUs) for GPU compute and
provides optional reliability and scalability features for high-performance enterprise
applications. It provides significantly more performance than the ARMv7 Cortex-A15
processor, at a higher level of power efficiency. The inclusion of cryptography extensions
improves performance on cryptography algorithms by 10 times over the previous generation of
processors.
[Figure: block diagram of the Cortex-A57 processor with cores 0 to 3. Labels: ARM CoreSight
Multicore Debug and Trace, Generic Interrupt Controller, NEON Data Engine with crypto
extension, Floating-point unit, Level 1 Instruction Cache, Level 1 Data Cache with ECC,
Performance Monitor Unit, Memory Protection Unit, SCU, ACP, Integrated Level 2 Cache with
ECC, AMBA 4 ACE or AMBA 5 CHI Coherent Bus Interface.]
Figure 2-3 Cortex-A57 processor core
The Cortex-A57 processor fully implements the ARMv8-A architecture. It enables multi-core
operation, with between one and four cores in a single cluster. Multiple
coherent SMP clusters are possible, through AMBA5 CHI or AMBA 4 ACE technology. Debug
and trace are available through CoreSight technology.
The Cortex-A57 processor has the following features:
•
Out-of-order, 15+ stage pipeline.
•
Power-saving features include way-prediction, tag-reduction, and cache-lookup
suppression.
•
Increased peak instruction throughput through duplication of execution resources.
Power-optimized instruction decode with localized decoding, 3-wide decode bandwidth.
•
Performance optimized L2 cache design enables more than one core in the cluster to
access the L2 at the same time.
Chapter 3
Fundamentals of ARMv8
In ARMv8, execution occurs at one of four Exception levels. In AArch64, the Exception level
determines the level of privilege, in a similar way to the privilege levels defined in ARMv7, so
execution at ELn corresponds to privilege PLn. An Exception level with a larger value of n than
another is described as being at a higher Exception level. An Exception level with a smaller
value of n than another is described as being at a lower Exception level.
Exception levels provide a logical separation of software execution privilege that applies across
all operating states of the ARMv8 architecture. It is similar to, and supports the concept of,
hierarchical protection domains common in computer science.
The following is a typical example of what software runs at each Exception level:
EL0
Normal user applications.
EL1
Operating system kernel, typically described as privileged.
EL2
Hypervisor.
EL3
Low-level firmware, including the Secure Monitor.
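The list above can be expressed as a small lookup, shown here as an illustrative C sketch with hypothetical function names. It also shows how a value read from the CurrentEL System register can be decoded: the architecture holds the current Exception level in bits [3:2] of that register. Reading the register itself requires an MRS instruction at EL1 or higher and is not shown here.

```c
#include <stdint.h>

/* Extracts the Exception level (0 to 3) from a CurrentEL register value.
 * The level is held in bits [3:2]; the other bits are not used here. */
unsigned el_from_currentel(uint64_t currentel)
{
    return (unsigned)((currentel >> 2) & 0x3u);
}

/* Typical software at each Exception level, following the list above. */
const char *typical_software_at(unsigned el)
{
    switch (el) {
    case 0:  return "user application";
    case 1:  return "operating system kernel";
    case 2:  return "hypervisor";
    case 3:  return "low-level firmware, including the Secure monitor";
    default: return "invalid Exception level";
    }
}
```

For example, a CurrentEL value of 0x4 decodes to EL1 and a value of 0xC decodes to EL3.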
[Figure: Normal world software by Exception level. EL0: Applications. EL1: Kernels.
EL2: Hypervisor. EL3: Secure monitor.]
Figure 3-1 Exception levels
In general, a piece of software, such as an application, the kernel of an operating system, or a
hypervisor, occupies a single Exception level. An exception to this rule is in-kernel hypervisors
such as KVM, which operate across both EL2 and EL1.
ARMv8-A provides two security states, Secure and Non-secure. The Non-secure state is also
referred to as the Normal World. This enables an Operating System (OS) to run in parallel with
a trusted OS on the same hardware, and provides protection against certain software attacks and
hardware attacks. ARM TrustZone technology enables the system to be partitioned between the
Normal and Secure worlds. As with the ARMv7-A architecture, the Secure monitor acts as a
gateway for moving between the Normal and Secure worlds.
[Figure: Exception levels EL0 to EL3 in both worlds. Normal world: Applications, Guest OSs
and a Hypervisor. Secure world: Secure firmware and Trusted OS, with no Hypervisor in the
Secure world. The Secure monitor is at EL3.]
Figure 3-2 ARMv8 Exception levels in the Normal and Secure worlds
ARMv8-A also provides support for virtualization, though only in the Normal world. This
means that hypervisor, or Virtual Machine Manager (VMM) code can run on the system and
host multiple guest operating systems. Each of the guest operating systems is, essentially,
running on a virtual machine. Each OS is then unaware that it is sharing time on the system with
other guest operating systems.
The Normal world (which corresponds to the Non-secure state) has the following privileged
components:
Guest OS kernels
Such kernels include Linux or Windows running in Non-secure EL1. When
running under a hypervisor, the rich OS kernels can be running as a guest or host
depending on the hypervisor model.
Hypervisor
This runs at EL2, which is always Non-secure. The hypervisor, when present and
enabled, provides virtualization services to rich OS kernels.
The Secure world has the following privileged components:
Secure firmware
On an application processor, this firmware must be the first thing that runs at boot
time. It provides several services, including platform initialization, the
installation of the trusted OS, and routing of Secure monitor calls.
Trusted OS
Trusted OS provides Secure services to the Normal world and provides a runtime
environment for executing Secure or trusted applications.
The Secure monitor in the ARMv8 architecture is at a higher Exception level and is more
privileged than all other levels. This provides a logical model of software privilege.
Figure 3-2 on page 3-2 shows that a Secure version of EL2 is not available.
3.1
Execution states
The ARMv8 architecture defines two Execution States, AArch64 and AArch32. Each state is
used to describe execution using 64-bit wide general-purpose registers or 32-bit wide
general-purpose registers, respectively. While ARMv8 AArch32 retains the ARMv7 definitions
of privilege, in AArch64, privilege level is determined by the Exception level. Therefore,
execution at ELn corresponds to privilege PLn.
When in AArch64 state, the processor executes the A64 instruction set. When in AArch32 state,
the processor can execute either the A32 (called ARM in earlier versions of the architecture) or
the T32 (Thumb) instruction set.
The following diagrams show the organization of the Exception levels in AArch64 and
AArch32.
In AArch64:
[Figure: Exception levels EL0 to EL3 in AArch64. Normal world: Applications, Guest OSs and
a Hypervisor. Secure world: Secure firmware and Trusted OS, with no Hypervisor in the
Secure world. The Secure monitor is at EL3.]
Figure 3-3 Exception levels in AArch64
In AArch32:
[Figure: Exception levels EL0 to EL3 in AArch32. Normal world: Applications, Guest OSs and
a Hypervisor. Secure world: Secure firmware and a Trusted kernel (operates at EL3), with no
EL2 in the Secure world. The Secure monitor is at EL3.]
Figure 3-4 Exception levels in AArch32
In AArch32 state, Trusted OS software executes in Secure EL3, and in AArch64 state it
primarily executes in Secure EL1.
3.2
Changing Exception levels
In the ARMv7 architecture, the processor mode can change under privileged software control
or automatically when taking an exception. When an exception occurs, the core saves the
current execution state and the return address, enters the required mode, and possibly disables
hardware interrupts.
This is summarized in the following table. Applications operate at the lowest level of privilege,
PL0, previously unprivileged mode. Operating systems run at PL1, and the Hypervisor in a
system with the Virtualization extensions at PL2. The Secure monitor, which acts as a gateway
for moving between the Secure and Non-secure (Normal) worlds, also operates at PL1.
Table 3-1 ARMv7 processor modes

Mode              Function                                                       Security state    Privilege level
User (USR)        Unprivileged mode in which most applications run               Both              PL0
FIQ               Entered on an FIQ interrupt exception                          Both              PL1
IRQ               Entered on an IRQ interrupt exception                          Both              PL1
Supervisor (SVC)  Entered on reset or when a Supervisor Call instruction (SVC)   Both              PL1
                  is executed
Monitor (MON)     Entered when the SMC instruction (Secure Monitor Call) is      Secure only       PL1
                  executed or when the processor takes an exception which is
                  configured for secure handling.
                  Provided to support switching between Secure and
                  Non-secure states.
Abort (ABT)       Entered on a memory access exception                           Both              PL1
Undef (UND)       Entered when an undefined instruction is executed              Both              PL1
System (SYS)      Privileged mode, sharing the register view with User mode      Both              PL1
Hyp (HYP)         Entered by the Hypervisor Call and Hyp Trap exceptions.        Non-secure only   PL2
[Figure: ARMv7 privilege levels in the Non-secure and Secure states. Non-secure PL0: User
mode. Non-secure PL1: System (SYS), Supervisor (SVC), FIQ, IRQ, Undef (UND) and Abort
(ABT) modes. Non-secure PL2: Hyp mode. Secure PL0: User mode. Secure PL1: System (SYS),
Supervisor (SVC), FIQ, IRQ, Undef (UND) and Abort (ABT) modes, and Monitor mode (MON).]
Figure 3-5 ARMv7 privilege levels
In AArch64, the processor modes are mapped onto the Exception levels as in Figure 3-6. As in
ARMv7 (AArch32) when an exception is taken, the processor changes to the Exception level
(mode) that supports the handling of the exception.
[Figure: AArch32 processor modes mapped onto Exception levels in the Normal and Secure
worlds. User: EL0. SVC, ABT, IRQ, FIQ, UND, SYS: EL1. Hyp: EL2. Mon: EL3. Normal world:
Applications, Guest OSs and a Hypervisor. Secure world: Secure firmware and Trusted OS,
with no Hypervisor in the Secure world. The Secure monitor is at EL3.]
Figure 3-6 AArch32 processor modes
Movement between Exception levels follows these rules:
•
Moves to a higher Exception level, such as from EL0 to EL1, indicate increased software
execution privilege.
•
An exception cannot be taken to a lower Exception level.
•
There is no exception handling at EL0. Exceptions must be handled at a higher
Exception level.
•
An exception causes a change of program flow. Execution of an exception handler starts,
at an Exception level higher than EL0, from a defined vector that relates to the exception
taken. Exceptions include:
— Interrupts such as IRQ and FIQ.
— Memory system aborts.
— Undefined instructions.
— System calls. These permit unprivileged software to make a system call to an
operating system.
— Secure monitor or hypervisor traps.
•
Ending exception handling and returning to the previous Exception level is performed by
executing the ERET instruction.
•
Returning from an exception can stay at the same Exception level or enter a lower
Exception level. It cannot move to a higher Exception level.
•
The security state does not change with a change of Exception level, except when returning
from EL3 to a Non-secure state. See Switching between Secure and Non-secure state on
page 17-8.
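The rules above about which Exception level can be reached from which can be summarized as two predicates. This is an illustrative C sketch with hypothetical function names. It models only the level rules listed above, not the hardware configuration that selects the actual target level.

```c
#include <stdbool.h>

/* Taking an exception: the target level cannot be lower than the current
 * level, and cannot be EL0 because no exceptions are handled at EL0. */
bool exception_entry_allowed(unsigned from_el, unsigned to_el)
{
    if (from_el > 3 || to_el > 3)
        return false;                 /* only EL0 to EL3 exist */
    return to_el >= from_el && to_el >= 1;
}

/* Returning from an exception (ERET): the target is the same level or a
 * lower one, never a higher one. Handlers run at EL1 or above, so a
 * return from EL0 is treated as not allowed in this model. */
bool exception_return_allowed(unsigned from_el, unsigned to_el)
{
    if (from_el > 3 || to_el > 3)
        return false;
    return from_el >= 1 && to_el <= from_el;
}
```

For example, exception_entry_allowed(0, 1) is true for a system call from an application into the kernel, while exception_entry_allowed(2, 1) and exception_return_allowed(1, 2) are both false.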
3.3
Changing execution state
There are times when you must change the execution state of your system. This could be, for
example, if you are running a 64-bit operating system, and want to run a 32-bit application at
EL0. To do this, the system must change to AArch32.
When the application has completed or execution returns to the OS, the system can switch back
to AArch64. Figure 3-7 on page 3-9 shows that you cannot do it the other way around. An
AArch32 operating system cannot host a 64-bit application.
To change between execution states at the same Exception level, you have to switch to a higher
Exception level then return to the original Exception level. For example, you might have 32-bit
and 64-bit applications running under a 64-bit OS. In this case, the 32-bit application can
execute and generate a Supervisor Call (SVC) instruction, or receive an interrupt, causing a
switch to EL1 and AArch64. (See Exception handling instructions on page 6-21.) The OS can
then do a task switch and return to EL0 in AArch64. Practically speaking, this means that you
cannot have a mixed 32-bit and 64-bit application, because there is no direct way of calling
between them.
You can only change execution state by changing Exception level. Taking an exception might
change from AArch32 to AArch64, and returning from an exception may change from AArch64
to AArch32.
Code at EL3 cannot take an exception to a higher exception level, so cannot change execution
state, except by going through a reset.
The following is a summary of some of the points when changing between AArch64 and
AArch32 execution states:
•
Both AArch64 and AArch32 execution states have Exception levels that are generally
similar, but there are some differences between Secure and Non-secure operation. The
execution state the processor is in when the exception is generated can limit the Exception
levels available to the other execution state.
•
Changing to AArch32 requires going from a higher to a lower Exception level. This is the
result of exiting an exception handler by executing the ERET instruction. See Exception
handling instructions on page 6-21.
•
Changing to AArch64 requires going from a lower to a higher Exception level. The
exception can be the result of an instruction execution or an external signal.
•
If, when taking an exception or returning from an exception, the Exception level remains
the same, the execution state cannot change.
•
Where an ARMv8 processor operates in AArch32 execution state at a particular
Exception level, it uses the same exception model as in ARMv7 for exceptions taken to
that Exception level. In the AArch64 execution state, it uses the exception handling model
described in Chapter 10 AArch64 Exception Handling.
Interworking between the two states is therefore performed at the level of the Secure monitor,
hypervisor or operating system. A hypervisor or operating system executing in AArch64 state
can support AArch32 operation at lower privilege levels. This means that an OS running in
AArch64 can host both AArch32 and AArch64 applications. Similarly, an AArch64 hypervisor
can host both AArch32 and AArch64 guest operating systems. However, a 32-bit operating
system cannot host a 64-bit application and a 32-bit hypervisor cannot host a 64-bit guest
operating system.
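The hosting and state-change rules described in this section can be captured in a short C sketch. The type and function names are hypothetical and the model is deliberately simplified: it encodes only the rules stated above and ignores reset and the System register controls.

```c
#include <stdbool.h>

typedef enum { AARCH32, AARCH64 } exec_state;

/* A hypervisor or OS in AArch64 can host AArch32 or AArch64 software at a
 * lower level. A host in AArch32 can only host AArch32 software. */
bool can_host(exec_state host, exec_state hosted)
{
    return host == AARCH64 || hosted == AARCH32;
}

/* The execution state can change only together with the Exception level:
 * taking an exception (moving up) may change AArch32 to AArch64, and
 * returning from one (moving down) may change AArch64 to AArch32. */
bool state_change_allowed(unsigned from_el, exec_state from,
                          unsigned to_el, exec_state to)
{
    if (from == to)
        return true;                  /* the state is not changing */
    if (from_el == to_el)
        return false;                 /* same level: state cannot change */
    if (to_el > from_el)
        return from == AARCH32 && to == AARCH64;
    return from == AARCH64 && to == AARCH32;
}
```

With this model, a 32-bit application making a system call into a 64-bit OS is state_change_allowed(0, AARCH32, 1, AARCH64), which is true, whereas can_host(AARCH32, AARCH64) is false for a 32-bit OS and a 64-bit application.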
[Figure: AArch32 and AArch64 applications (EL0), operating systems (EL1) and a Hypervisor
(EL2), with the following annotations. An AArch64 OS can host a mix of AArch64 and AArch32
applications. An AArch64 hypervisor can host an AArch64 and AArch32 OS. An AArch32 OS
cannot host an AArch64 application. An AArch32 hypervisor cannot host an AArch64 OS.]
Figure 3-7 Moving between AArch32 and AArch64
For the highest implemented Exception level (EL3 on the Cortex-A53 and Cortex-A57
processors), the execution state that is used when taking an exception is fixed, and can only
be changed by resetting the processor. For EL2 and EL1, the execution state is controlled by
System registers. See System registers on page 4-7.
Chapter 4
ARMv8 Registers
The AArch64 execution state provides 31 × 64-bit general-purpose registers accessible at all
times and in all Exception levels.
Each register is 64 bits wide and they are generally referred to as registers X0-X30.
[Figure: the general-purpose registers X0/W0 to X30/W30, available at EL0, EL1, EL2 and
EL3. X29 is labeled as the frame pointer and X30 as the procedure link register.]
Figure 4-1 AArch64 general-purpose registers
Each AArch64 64-bit general-purpose register (X0-X30) also has a 32-bit (W0-W30) form.
[Figure: a 64-bit register Xn occupying bits 63 to 0, with Wn occupying bits 31 to 0.]
Figure 4-2 64-bit register with W and X access.
The 32-bit W register forms the lower half of the corresponding 64-bit X register. That is, W0
maps onto the lower word of X0, and W1 maps onto the lower word of X1.
Reads from W registers disregard the higher 32 bits of the corresponding X register and leave
them unchanged. Writes to W registers set the higher 32 bits of the X register to zero. That is,
writing 0xFFFFFFFF into W0 sets X0 to 0x00000000FFFFFFFF.
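The read and write behavior of the W registers can be modeled in a few lines of C. This is only an illustration of the semantics described above, using hypothetical helper names. It is not how registers are accessed in real code.

```c
#include <stdint.h>

/* Reading Wn returns the low 32 bits of Xn and ignores the upper half. */
uint32_t read_w(uint64_t x)
{
    return (uint32_t)x;
}

/* Writing Wn produces a new Xn value whose upper 32 bits are zero,
 * regardless of what Xn held before. */
uint64_t write_w(uint64_t old_x, uint32_t value)
{
    (void)old_x;                      /* previous contents are discarded */
    return (uint64_t)value;
}
```

For example, write_w(0xFFFFFFFFFFFFFFFF, 0xFFFFFFFF) yields 0x00000000FFFFFFFF, matching the X0 example in the text.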
4.1
AArch64 special registers
In addition to the 31 core registers, there are also several special registers.
Figure 4-3 AArch64 special registers (zero register XZR/WZR; program counter PC; stack pointers
SP_EL0, SP_EL1, SP_EL2, and SP_EL3; Program Status Registers SPSR_EL1, SPSR_EL2, and SPSR_EL3;
Exception Link Registers ELR_EL1, ELR_EL2, and ELR_EL3)
Note
There is no register called X31 or W31. Many instructions are encoded such that the number 31
represents the zero register, ZR (WZR/XZR). There is also a restricted group of instructions
where one or more of the arguments are encoded such that number 31 represents the Stack
Pointer (SP).
When accessing the zero register, all writes are ignored and all reads return 0. Note that the
64-bit form of the SP register does not use an X prefix.
Table 4-1 Special registers in AArch64
Name    Size       Description
WZR     32 bits    Zero register
XZR     64 bits    Zero register
WSP     32 bits    Current stack pointer
SP      64 bits    Current stack pointer
PC      64 bits    Program counter
In the ARMv8 architecture, when executing in AArch64, the exception return state is held in the
following dedicated registers for each Exception level:
•
Exception Link Register (ELR).
•
Saved Program Status Register (SPSR).
There is a dedicated SP per Exception level, but it is not used to hold return state.
Table 4-2 Special registers by Exception level
                                        EL0       EL1         EL2         EL3
Stack Pointer (SP)                      SP_EL0    SP_EL1      SP_EL2      SP_EL3
Exception Link Register (ELR)           -         ELR_EL1     ELR_EL2     ELR_EL3
Saved Program Status Register (SPSR)    -         SPSR_EL1    SPSR_EL2    SPSR_EL3
4.1.1
Zero register
The zero register reads as zero when used as a source register and discards the result when used
as a destination register. You can use the zero register in most, but not all, instructions.
4.1.2
Stack pointer
In the ARMv8 architecture, the choice of stack pointer to use is separated to some extent from
the Exception level. By default, taking an exception selects the stack pointer for the target
Exception level, SP_ELn. For example, taking an exception to EL1 selects SP_EL1. Each
Exception level has its own stack pointer, SP_EL0, SP_EL1, SP_EL2, and SP_EL3.
When in AArch64 at an Exception level other than EL0, the processor can use either:
•
A dedicated 64-bit stack pointer associated with that Exception level (SP_ELn).
•
The stack pointer associated with EL0 (SP_EL0).
EL0 can only ever access SP_EL0.
Table 4-3 AArch64 Stack pointer options
Exception level    Options
EL0                EL0t
EL1                EL1t, EL1h
EL2                EL2t, EL2h
EL3                EL3t, EL3h
The t suffix indicates that the SP_EL0 stack pointer is selected. The h suffix indicates that the
SP_ELn stack pointer is selected.
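The selection rule in Table 4-3 can be sketched as a small C function. This is a simplified model under stated assumptions: the type sp_bank_t and the function current_sp are hypothetical names, and the spsel argument stands for the Stack Pointer selector described for PSTATE (0 = SP_EL0, 1 = SP_ELn).

```c
#include <stdint.h>

/* Hypothetical model: one stack pointer per Exception level, indexed by EL. */
typedef struct { uint64_t sp_el[4]; } sp_bank_t;

/*
 * Return the stack pointer in use at Exception level el (0 to 3).
 * spsel == 0 selects SP_EL0 (the "t" options in Table 4-3),
 * spsel == 1 selects SP_ELn (the "h" options).
 * EL0 can only ever use SP_EL0, whatever spsel holds.
 */
static inline uint64_t current_sp(const sp_bank_t *b, unsigned el, unsigned spsel)
{
    if (el == 0 || spsel == 0)
        return b->sp_el[0];
    return b->sp_el[el & 3u];
}
```

For instance, EL1t corresponds to current_sp(&bank, 1, 0) and returns the SP_EL0 value, while EL1h corresponds to current_sp(&bank, 1, 1) and returns the SP_EL1 value.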
The SP cannot be referenced by most instructions. However, some forms of arithmetic
instructions, for example, the ADD instruction, can read and write to the current stack pointer to
adjust the stack pointer in a function. For example:
ADD SP, SP, #0x10    // Increase SP by 0x10 (16) bytes
4.1.3
Program Counter
One feature of the original ARMv7 instruction set was the use of R15, the Program Counter
(PC) as a general-purpose register. The PC enabled some clever programming tricks, but it
introduced complications for compilers and the design of complex pipelines. Removing direct
access to the PC in ARMv8 makes return prediction easier and simplifies the ABI specification.
The PC is never accessible as a named register. Its use is implicit in certain instructions such as
PC-relative load and address generation. The PC cannot be specified as the destination of a data
processing instruction or load instruction.
4.1.4
Exception Link Register (ELR)
The Exception Link Register holds the exception return address.
4.1.5
Saved Program Status Register
When taking an exception, the processor state is stored in the relevant Saved Program Status
Register (SPSR), in a similar way to the CPSR in ARMv7. The SPSR holds the value of PSTATE
before taking an exception and is used to restore the value of PSTATE when executing an
exception return.
Figure 4-4 SPSR (bit assignments: N [31], Z [30], C [29], V [28], SS [21], IL [20], D [9],
A [8], I [7], F [6], M[4] [4], M[3:0] [3:0])
The individual bits represent the following values for AArch64:
N
Negative result (N flag).
Z
Zero result (Z flag).
C
Carry out (C flag).
V
Overflow (V flag).
SS
Software Step. Indicates whether software step was enabled when an exception
was taken.
IL
Illegal Execution State bit. Shows the value of PSTATE.IL immediately before
the exception was taken.
D
Process state Debug mask. Indicates whether debug exceptions from watchpoint,
breakpoint, and software step debug events that are targeted at the Exception level
the exception occurred in were masked or not.
A
SError (System Error) mask bit.
I
IRQ mask bit.
F
FIQ mask bit.
M[4]
Execution state that the exception was taken from. A value of 0 indicates
AArch64.
M[3:0]
Mode or Exception level that an exception was taken from.
In ARMv8, the SPSR written to depends on the Exception level that the exception is taken to.
If the exception is taken to EL1, then SPSR_EL1 is used. If the exception is taken to EL2, then
SPSR_EL2 is used, and if the exception is taken to EL3, SPSR_EL3 is used. The core populates
the SPSR when taking an exception.
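As an illustration of the field list above, the following C sketch unpacks a saved SPSR value into its AArch64 fields. The bit positions are an assumption taken from the ARMv8-A SPSR_ELn layout (N, Z, C, V in bits 31 to 28, SS in bit 21, IL in bit 20, D, A, I, F in bits 9 to 6, M[4] in bit 4, M[3:0] in bits 3 to 0), and the names spsr_fields_t, spsr_bit, and decode_spsr are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Decoded AArch64 view of a saved SPSR_ELn value (hypothetical type). */
typedef struct {
    bool n, z, c, v;      /* condition flags, assumed bits 31..28 */
    bool ss, il;          /* Software Step and Illegal Execution State, bits 21 and 20 */
    bool d, a, i, f;      /* exception mask bits, assumed bits 9..6 */
    bool from_aarch32;    /* M[4], bit 4: false means taken from AArch64 */
    unsigned m;           /* M[3:0], bits 3..0 */
} spsr_fields_t;

/* Return the value of a single bit of v. */
static inline bool spsr_bit(uint32_t v, unsigned pos)
{
    return ((v >> pos) & 1u) != 0;
}

/* Split a raw SPSR value into the fields listed above. */
static inline spsr_fields_t decode_spsr(uint32_t spsr)
{
    spsr_fields_t s;
    s.n = spsr_bit(spsr, 31);
    s.z = spsr_bit(spsr, 30);
    s.c = spsr_bit(spsr, 29);
    s.v = spsr_bit(spsr, 28);
    s.ss = spsr_bit(spsr, 21);
    s.il = spsr_bit(spsr, 20);
    s.d = spsr_bit(spsr, 9);
    s.a = spsr_bit(spsr, 8);
    s.i = spsr_bit(spsr, 7);
    s.f = spsr_bit(spsr, 6);
    s.from_aarch32 = spsr_bit(spsr, 4);
    s.m = spsr & 0xFu;
    return s;
}
```

A handler could, for example, call decode_spsr on the value read from SPSR_EL1 and inspect from_aarch32 to find out which execution state the exception came from.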
Note
The register pairs ELR_ELn and SPSR_ELn that are associated with an Exception level retain
their state during execution at a lower Exception level.
4.2
Processor state
AArch64 does not have a direct equivalent of the ARMv7 Current Program Status Register
(CPSR). In AArch64, the components of the traditional CPSR are supplied as fields that can be
made accessible independently. These are referred to collectively as Processor State (PSTATE).
The Processor State, or PSTATE fields, for AArch64 have the following definitions:
Table 4-4 PSTATE field definitions
Name      Description
N         Negative condition flag.
Z         Zero condition flag.
C         Carry condition flag.
V         oVerflow condition flag.
D         Debug mask bit.
A         SError mask bit.
I         IRQ mask bit.
F         FIQ mask bit.
SS        Software Step bit.
IL        Illegal execution state bit.
EL (2)    Exception level.
nRW       Execution state: 0 = 64-bit, 1 = 32-bit.
SP        Stack Pointer selector: 0 = SP_EL0, 1 = SP_ELn.
In AArch64, you return from an exception by executing the ERET instruction, which copies
SPSR_ELn into PSTATE. This restores the ALU flags, execution state, and Exception level. The
processor then branches to the address in ELR_ELn and continues execution from there.
The PSTATE.{N, Z, C, V} fields can be accessed at EL0. All other PSTATE fields can be accessed
at EL1 or higher and are UNDEFINED at EL0.
4.3
System registers
In AArch64, system configuration is controlled through system registers, and accessed using
MSR and MRS instructions. This contrasts with ARMv7-A, where such registers were typically
accessed through coprocessor 15 (CP15) operations. The name of a register tells you the lowest
Exception level that it can be accessed from.
For example:
•
TTBR0_EL1 is accessible from EL1, EL2, and EL3.
•
TTBR0_EL2 is accessible from EL2 and EL3.
Registers that have the suffix _ELn have a separate, banked copy in some or all of the levels,
though usually not EL0. Few system registers are accessible from EL0, although the Cache Type
Register (CTR_EL0) is an example of one that can be made accessible.
Code to access system registers takes the following form:
MRS x0, TTBR0_EL1    // Move TTBR0_EL1 into x0
MSR TTBR0_EL1, x0    // Move x0 into TTBR0_EL1
Previous versions of the ARM architecture have used coprocessors for system configuration.
However, AArch64 does not include support for coprocessors. Table 4-5 lists only the system
registers mentioned in this book.
For a complete list, see Appendix J of the ARM Architecture Reference Manual - ARMv8, for
ARMv8-A architecture profile.
The table shows the Exception levels that have separate copies of each register. For example,
separate Auxiliary Control Registers (ACTLRs) exist as ACTLR_EL1, ACTLR_EL2 and
ACTLR_EL3.
Table 4-5 System registers
Name             Register                                                Allowed values of n
                 Description
ACTLR_ELn        Auxiliary Control Register                              1, 2, 3
                 Controls processor-specific features.
CCSIDR_ELn       Current Cache Size ID Register                          1
                 Provides information about the architecture of the currently selected cache.
                 See Cache discovery on page 11-18.
CLIDR_ELn        Cache Level ID Register                                 1, 2, 3
                 The type of cache, or caches, implemented at each level. The Level of
                 Coherency and Level of Unification for the cache hierarchy. See Cache
                 maintenance on page 11-13.
CNTFRQ_ELn       Counter-timer Frequency Register                        0
                 Reports the frequency of the system timer. See Timers on page 14-5.
CNTPCT_ELn       Counter-timer Physical Count Register                   0
                 Holds the 64-bit current count value. See Timers on page 14-5.
CNTKCTL_ELn      Counter-timer Kernel Control Register                   1
                 Controls the generation of an event stream from the virtual counter. Also
                 controls access from EL0 to the physical counter, virtual counter, EL1
                 physical timers, and the virtual timer. See Timers on page 14-5.
CNTP_CVAL_ELn    Counter-timer Physical Timer Compare Value Register     0
                 Holds the compare value for the EL1 physical timer. See Timers on page 14-5.
CPACR_ELn        Coprocessor Access Control Register                     1
                 Controls access to Trace, floating-point, and NEON functionality. See ISB in
                 more detail on page 13-9.
CSSELR_ELn       Cache Size Selection Register                           1
                 Selects the current Cache Size ID Register, CCSIDR_EL1, by specifying the
                 required cache level and the cache type, either instruction or data cache.
                 See Cache discovery on page 11-18.
CNTP_CTL_ELn     Counter-timer Physical Control Register                 0
                 Control register for the EL1 physical timer. See Timers on page 14-5.
CTR_ELn          Cache Type Register                                     0
                 Information about the architecture of the integrated caches. See Cache
                 discovery on page 11-18.
DCZID_ELn        Data Cache Zero ID Register                             0
                 Indicates the block size written with byte values of 0 by the Data Cache
                 Zero by Virtual Address (DC ZVA) system instruction. See Cache discovery on
                 page 11-18.
ELR_ELn          Exception Link Register                                 1, 2, 3
                 Holds the address of the instruction which caused the exception.
ESR_ELn          Exception Syndrome Register                             1, 2, 3
                 Includes information about the reasons for the exception. See The Exception
                 Syndrome Register on page 10-9.
FAR_ELn          Fault Address Register                                  1, 2, 3
                 Holds the virtual faulting address. See Handling synchronous exceptions on
                 page 10-7.
FPCR             Floating-point Control Register                         -
                 Controls floating-point extension behavior. The fields in this register map
                 to the equivalent fields in the AArch32 FPSCR. See New features for NEON and
                 Floating-point in AArch64 on page 7-2.
FPSR             Floating-point Status Register                          -
                 Provides floating-point system status information. The fields in this
                 register map to the equivalent fields in the AArch32 FPSCR. See New features
                 for NEON and Floating-point in AArch64 on page 7-2.
HCR_ELn          Hypervisor Configuration Register                       2
                 Controls virtualization settings and trapping of exceptions to EL2. See
                 Exception handling on page 18-8.
MAIR_ELn         Memory Attribute Indirection Register                   1, 2, 3
                 Provides the memory attribute encodings corresponding to the possible values
                 in a Long-descriptor format translation table entry for stage 1 translations
                 at ELn. See Memory types on page 13-3.
MIDR_ELn         Main ID Register                                        1
                 The type of processor the code is running on (part number and revision).
MPIDR_ELn        Multiprocessor Affinity Register                        1
                 The processor and cluster IDs, in multi-core or cluster systems. See
                 Determining which core the code is running on on page 14-3.
SCR_ELn          Secure Configuration Register                           3
                 Controls Secure state and trapping of exceptions to EL3. See Handling
                 synchronous exceptions on page 10-7.
SCTLR_ELn        System Control Register                                 0, 1, 2, 3
                 Controls architectural features, for example the MMU, caches and alignment
                 checking.
SPSR_ELn         Saved Program Status Register                           abt, fiq, irq, und, 1, 2, 3
                 Holds the saved processor state when an exception is taken to this mode or
                 Exception level.
TCR_ELn          Translation Control Register                            1, 2, 3
                 Determines which of the Translation Table Base Registers define the base
                 address for a translation table walk required for the stage 1 translation of
                 a memory access from ELn. Also controls the translation table format and
                 holds cacheability and shareability information. See Separation of kernel
                 and application Virtual Address spaces on page 12-7.
TPIDR_ELn        User Read/Write Thread ID Register                      0, 1, 2, 3
                 Provides a location where software executing at ELn can store thread
                 identifying information, for OS management purposes. See Context switching
                 on page 12-27.
TPIDRRO_ELn      User Read-Only Thread ID Register                       0
                 Provides a location where software executing at EL1 or higher can store
                 thread identifying information. This information is visible to software
                 executing at EL0, for OS management purposes. See Context switching on
                 page 12-27.
TTBR0_ELn        Translation Table Base Register 0                       1, 2, 3
                 Holds the base address of translation table 0, and information about the
                 memory it occupies. This is one of the translation tables for the stage 1
                 translation of memory accesses at ELn. See Separation of kernel and
                 application Virtual Address spaces on page 12-7.
TTBR1_ELn        Translation Table Base Register 1                       1
                 Holds the base address of translation table 1, and information about the
                 memory it occupies. This is one of the translation tables for the stage 1
                 translation of memory accesses at EL0 and EL1. See Separation of kernel and
                 application Virtual Address spaces on page 12-7.
VBAR_ELn         Vector Base Address Register                            1, 2, 3
                 Holds the exception base address for any exception that is taken to ELn. See
                 AArch64 exception table on page 10-12.
VTCR_ELn         Virtualization Translation Control Register             2
                 Controls the translation table walks required for the stage 2 translation of
                 memory accesses from Non-secure EL0 and EL1. Also holds cacheability and
                 shareability information for the accesses. See Translations at EL2 and EL3
                 on page 12-20.
VTTBR_ELn        Virtualization Translation Table Base Register          2
                 Holds the base address of the translation table for the stage 2 translation
                 of memory accesses from Non-secure EL0 and EL1. See Memory translation on
                 page 18-3.
4.3.1
The system control register
The System Control Register (SCTLR) is a register that controls standard memory, system
facilities and provides status information for functions that are implemented in the core.
Figure 4-5 SCTLR bit assignments (SCTLR_EL1 fields: UCI, EE, E0E, WXN, nTWE, nTWI, UCT, DZE, I,
UMA, SED, ITD, CP15BEN, SA0, SA, C, A, M; SCTLR_EL2 and SCTLR_EL3 fields: EE, WXN, I, SA, C, A, M)
Not all bits are available above EL1. The individual bits represent the following:
UCI
When set, enables EL0 access in AArch64 for DC CVAU, DC CIVAC, DC CVAC, and
IC IVAU instructions. See Cache maintenance on page 11-13.
EE
Exception endianness. See Endianness on page 4-12.
    0    Little endian.
    1    Big endian.
E0E
Endianness of explicit data accesses at EL0. The possible values of this bit are:
    0    Explicit data accesses at EL0 are little-endian.
    1    Explicit data accesses at EL0 are big-endian.
WXN
Write permission implies XN (eXecute Never). See Access permissions on
page 12-23.
    0    Regions with write permission are not forced to XN.
    1    Regions with write permission are forced to XN.
nTWE
Not trap WFE. A value of 1 means that WFE instructions are executed as normal.
nTWI
Not trap WFI. A value of 1 means that WFI instructions are executed as normal.
UCT
When set, enables EL0 access in AArch64 to the CTR_EL0 register.
DZE
Access to DC ZVA instruction at EL0. See Cache maintenance on page 11-13.
    0    Execution prohibited.
    1    Execution allowed.
I
Instruction cache enable. This is an enable bit for instruction caches at EL0 and
EL1. Instruction accesses to cacheable Normal memory are cached.
UMA
User Mask Access. Controls access to interrupt masks from EL0, when EL0 is
using AArch64.
SED
SETEND Disable. Disables SETEND instructions at EL0 using AArch32.
    0    SETEND instructions are enabled.
    1    The SETEND instruction is disabled.
ITD
IT Disable. The possible values of this bit are:
    0    The IT instruction is available.
    1    The IT instruction is treated as a 16-bit instruction. Only another 16-bit
         instruction, or the first half of a 32-bit instruction, can follow. This
         depends upon the implementation.
CP15BEN
CP15 barrier enable. If implemented, it is an enable bit for the AArch32 CP15
DMB, DSB, and ISB barrier operations.
SA0
Stack Alignment Check Enable for EL0.
SA
Stack Alignment Check Enable.
C
Data cache enable. This is an enable bit for data caches at EL0 and EL1. Data
accesses to cacheable Normal memory are cached.
A
Alignment check enable bit.
M
Enable the MMU.
Accessing the SCTLR
To access the SCTLR_ELn, use:
MRS <Xt>, SCTLR_ELn    // Read SCTLR_ELn into Xt
MSR SCTLR_ELn, <Xt>    // Write Xt to SCTLR_ELn
For example:
Example 4-1 Setting bits in the SCTLR
MRS X0, SCTLR_EL1         // Read System Control Register configuration data
ORR X0, X0, #(1 << 2)     // Set [C] bit and enable data caching
ORR X0, X0, #(1 << 12)    // Set [I] bit and enable instruction caching
MSR SCTLR_EL1, X0         // Write System Control Register configuration data
Note
The caches in the processor must be invalidated before caching of data and instructions is
enabled in any of the Exception levels.
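The read-modify-write pattern of Example 4-1 can also be expressed as a pure C helper that only computes the new register value, which is convenient for checking the bit arithmetic on a host machine. This is a sketch, not driver code: the macro names SCTLR_C and SCTLR_I and the function sctlr_enable_caches are hypothetical, and the bit positions (2 and 12) are the ones used in the example. Actually reading or writing SCTLR_EL1 still requires the MRS and MSR instructions at EL1 or higher.

```c
#include <stdint.h>

/* Bit positions as used in Example 4-1: C is bit 2, I is bit 12. */
#define SCTLR_C (UINT64_C(1) << 2)   /* data cache enable */
#define SCTLR_I (UINT64_C(1) << 12)  /* instruction cache enable */

/* Return the SCTLR value with data and instruction caching enabled.
 * All other bits of the input value are preserved. */
static inline uint64_t sctlr_enable_caches(uint64_t sctlr)
{
    return sctlr | SCTLR_C | SCTLR_I;
}
```

Using ORR (or the | operator here) rather than writing a constant keeps every other control bit, such as the M bit, at the value that was read back.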
4.4
Endianness
There are two basic ways of viewing bytes in memory, either as Little-Endian (LE) or
Big-Endian (BE). On big-endian machines, the most significant byte of an object in memory is
stored at the lowest address, that is the address closest to zero. On little-endian machines, the
least significant byte is stored at the lowest address. The term byte-ordering can also be used
rather than endianness.
Figure 4-6 (byte layout of the word 0x12345678: in little endian, byte addresses 0 to 3 hold
0x78, 0x56, 0x34, 0x12; in big endian, byte addresses 0 to 3 hold 0x12, 0x34, 0x56, 0x78)
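The two byte orderings can be made concrete with a short C sketch that stores a 32-bit value into a byte array explicitly, so the result does not depend on the endianness of the machine running it. The function names store_le32 and store_be32 are hypothetical helpers introduced for illustration.

```c
#include <stdint.h>

/* Store v with the least significant byte at the lowest address (index 0). */
static inline void store_le32(uint8_t out[4], uint32_t v)
{
    for (int i = 0; i < 4; i++)
        out[i] = (uint8_t)(v >> (8 * i));
}

/* Store v with the most significant byte at the lowest address (index 0). */
static inline void store_be32(uint8_t out[4], uint32_t v)
{
    for (int i = 0; i < 4; i++)
        out[i] = (uint8_t)(v >> (8 * (3 - i)));
}
```

Storing 0x12345678 with store_le32 places 0x78 at index 0 and 0x12 at index 3, while store_be32 places 0x12 at index 0 and 0x78 at index 3.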
Data endianness is controlled independently for each Exception level. For EL3, EL2 and
EL1, the EE bit of the relevant SCTLR_ELn register sets the endianness. An additional bit at
EL1, SCTLR_EL1.E0E, controls the data endianness for EL0. In the AArch64 execution state, data
accesses can be LE or BE, while instruction fetches are always LE.
Whether a processor supports both LE and BE depends upon the implementation of the
processor. If only little-endianness is supported, then the EE and E0E bits are always 0.
Similarly, if only big-endianness is supported, then the EE and E0E bits are at a static 1 value.
When using AArch32, giving the CPSR.E bit a different value from the equivalent System
Control register EE bit when in EL1, EL2, or EL3 is now deprecated. The use of the ARMv7
SETEND instruction is also deprecated. Setting the SCTLR.SED bit causes an Undefined exception
to be taken when a SETEND instruction is executed.
4.5
Changing execution state (again)
In Changing execution state on page 3-8, we described the change between AArch64 and
AArch32 in terms of Exception levels. Now we consider the change from the point of view of
the registers.
On entry to an Exception level using AArch64 from an Exception level using AArch32:
•
The values of the upper 32 bits of registers that were accessible to any lower Exception
level using AArch32 execution are UNKNOWN.
•
The registers that are not accessible during AArch32 execution retain the state that they
had before AArch32 execution.
•
On exception entry to EL3, when EL2 has been using AArch32, the values of the upper
32 bits of the ELR_EL2 are UNKNOWN.
•
AArch64 Stack Pointers (SPs) and Exception Link Registers (ELRs) associated with an
Exception level that is not accessible during AArch32 execution, at that Exception level,
retain the state that they had before AArch32 execution. This applies to the following
registers:
— SP_EL0.
— SP_EL1.
— SP_EL2.
— ELR_EL1.
In general, application programmers write applications for either AArch32 or AArch64. It is
only the OS that must take account of the two execution states and the switch between them.
4.5.1
Registers at AArch32
Being virtually identical to ARMv7 means AArch32 must match ARMv7 privilege levels. It
also means that AArch32 only deals with ARMv7 32-bit general-purpose registers. Therefore,
there must be some correspondence between the ARMv8 architecture, and the view of it
provided by the AArch32 execution state.
Remember that in the ARMv7 architecture there are sixteen 32-bit general-purpose registers
(R0-R15) for software use. Fifteen of them (R0-R14) can be used for general-purpose data
storage. The remaining register, R15, is the program counter (PC) whose value is altered as the
core executes instructions. Software can also access the CPSR and the SPSR, which holds a
saved copy of the CPSR from the previously executed mode. On taking an exception, the CPSR is
copied to the SPSR of the mode to which the exception is taken.
Which of these registers is accessed, and where, depends upon the processor mode the software
is executing in and the register itself. This is called banking, and the shaded registers in
Figure 4-7 on page 4-14 are banked. They use physically distinct storage and are usually
accessible only when a process is executing in that particular mode.
Figure 4-7 The ARMv7 register set showing banked registers (one column per mode: User/Sys, FIQ,
IRQ, ABT, SVC, UND, MON, and HYP; the banked registers shown include R8_fiq to R12_fiq, the
per-mode SP and LR copies, the per-mode SPSRs, and ELR_hyp)
Banking is used in ARMv7 to reduce the latency for exceptions. However, this also means that
of a considerable number of possible registers, fewer than half can be used at any one time.
In contrast, the AArch64 execution state has 31 × 64-bit general-purpose registers accessible at
all times and in all Exception levels. A change in execution state between AArch64 and
AArch32 means that the AArch64 registers must necessarily map onto the AArch32 (ARMv7)
register set. This mapping is shown in Figure 4-8 on page 4-15.
The upper 32 bits of the AArch64 registers are inaccessible when executing in AArch32. If the
processor is operating in AArch32 state, it uses the 32-bit W registers, which are equivalent to
the 32-bit ARMv7 registers.
AArch32 maps the banked registers to AArch64 registers that would otherwise be inaccessible.
Figure 4-8 AArch64 to AArch32 register mapping (R0 to R12 map to W0 to W12; the User/Sys SP
and LR map to W13 and W14; the HYP SP maps to W15; the IRQ LR and SP map to W16 and W17; the SVC
LR and SP map to W18 and W19; the ABT LR and SP map to W20 and W21; the UND LR and SP map to
W22 and W23; R8_fiq to R12_fiq map to W24 to W28; the FIQ SP and LR map to W29 and W30;
SPSR_svc, SPSR_mon, and SPSR_hyp are shown as SPSR_EL1, SPSR_EL3, and SPSR_EL2, and ELR_hyp as
ELR_EL2; the figure also marks the registers that are inaccessible from AArch64)
The SPSR and ELR_Hyp registers in AArch32 are additional registers that are accessible using
system instructions only. They are not mapped into the general-purpose register space of the
AArch64 architecture. Some of these registers are mapped between AArch32 and AArch64:
•
SPSR_svc maps to SPSR_EL1.
•
SPSR_hyp maps to SPSR_EL2.
•
ELR_hyp maps to ELR_EL2.
The following registers are only used during AArch32 execution. They retain their state while
EL1 executes using AArch64, even though they are inaccessible during AArch64 execution at that
Exception level.
•
SPSR_abt.
•
SPSR_und.
•
SPSR_irq.
•
SPSR_fiq.
During AArch64 execution, these SPSR registers are accessible only at higher Exception levels,
for context switching.
Again, if an exception is taken to an Exception level in AArch64 from an Exception level in
AArch32, the top 32 bits of the AArch64 ELR_ELn are all zero.
4.5.2
PSTATE at AArch32
In AArch64, the different components of the traditional CPSR are presented as Processor State
(PSTATE) fields that can be made accessible independently. At AArch32, there are extra fields
corresponding to the ARMv7 CPSR bits.
Figure 4-9 CPSR bit assignments in AArch32 (fields, from bit 31 down to bit 0: N, Z, C, V, Q,
IT[1:0], J, IL, GE, IT[7:2], E, A, I, F, T, M[4], M[3:0])
This gives the following additional PSTATE bits, which are accessible only at AArch32:
Table 4-6 PSTATE bit definitions
Name      Description
Q         Cumulative saturation (sticky) flag.
GE (4)    Greater than or Equal flags.
IT (8)    If-Then execution bits.
J         J bit.
T         T32 bit.
E         Endianness bit.
M         Mode field.
4.6
NEON and floating-point registers
In addition to the general-purpose registers, ARMv8 also has 32 128-bit floating-point registers
labeled V0-V31. The 32 registers are used to hold floating-point operands for scalar
floating-point instructions and both scalar and vector operands for NEON operations. NEON
and floating-point registers are also covered in Chapter 7 AArch64 Floating-point and NEON.
4.6.1
Floating-point register organization in AArch64
In NEON and floating-point instructions that operate on scalar data, the floating-point and
NEON registers behave similarly to the main general-purpose integer registers. Therefore, only
the lower bits are accessed, with the unused high bits ignored on a read and set to zero on a write.
The qualified names for scalar floating-point and NEON registers indicate the number of
significant bits as follows, where n is a register number 0-31.
Table 4-7 Operand name for differently sized floats
Precision    Size (bits)    Name
Half         16             Hn
Single       32             Sn
Double       64             Dn
Figure 4-10 Arrangement of floating-point values (in each register V0 to V31, Hn occupies bits
[15:0], Sn bits [31:0], and Dn bits [63:0]; the remaining bits up to bit 127 are unused)
Note
16-bit floating-point is supported, but only as a format to be converted from or to. It is not
supported for data processing operations.
The floating-point ADD instruction is specified with the F prefix, and the register names give
the float size:
FADD Sd, Sn, Sm    // Single-precision
FADD Dd, Dn, Dm    // Double-precision
The half-precision floating-point instructions are for converting between different sizes:
FCVT Sd, Hn    // half-precision to single-precision
FCVT Dd, Hn    // half-precision to double-precision
FCVT Hd, Sn    // single-precision to half-precision
FCVT Hd, Dn    // double-precision to half-precision
4.6.2
Scalar register sizes
In AArch64, the mapping for the integer scalars has changed from what is used in ARMv7-A to
the mapping shown in Figure 4-11:
Figure 4-11 Arrangement of ARMv8 registers when holding scalar values (in each register V0 to
V31, Bn occupies bits [7:0], Hn bits [15:0], Sn bits [31:0], Dn bits [63:0], and Qn bits
[127:0]; the bits above each scalar are unused)
In Figure 4-11 S0 is the bottom half of D0, which is the bottom half of Q0. S1 is the bottom half
of D1, which is the bottom half of Q1, and so on. This eliminates many of the problems
compilers have in auto-vectorizing high-level code.
•
The bottom 64-bits of each of the Q registers can also be viewed as D0-D31, 32 64-bit
wide registers for floating-point and NEON use.
•
The bottom 32-bits of each of the Q registers can also be viewed as S0-S31, 32 32-bit wide
registers for floating-point and NEON use.
•
The bottom 16-bits of each of the S registers can also be viewed as H0-H31, 32 16-bit
wide registers for floating-point and NEON use.
•
The bottom 8-bits of each of the H registers can also be viewed as B0-B31, 32 8-bit wide
registers for NEON use.
Note
Only the bottom bits of each register set are used in each case. The rest of the register space is
ignored when read, and filled with zeros when written.
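The scalar views described above can be modeled in C by treating a V register as 16 bytes, with index 0 holding bits [7:0]. This is an illustrative sketch only: vreg_t, write_scalar, and read_scalar are hypothetical names, and the size argument stands for the scalar width in bytes (1 for Bn, 2 for Hn, 4 for Sn, 8 for Dn, 16 for Qn).

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical model of one 128-bit V register; b[0] holds bits [7:0]. */
typedef struct { uint8_t b[16]; } vreg_t;

/* Scalar write: the low `size` bytes take the value and the rest of the
 * register is filled with zeros. Sizes above 16 are clamped to 16. */
static inline void write_scalar(vreg_t *v, const uint8_t *value, size_t size)
{
    if (size > sizeof v->b)
        size = sizeof v->b;
    memset(v->b, 0, sizeof v->b);
    memcpy(v->b, value, size);
}

/* Scalar read: only the low `size` bytes are returned; higher bytes are ignored. */
static inline void read_scalar(const vreg_t *v, uint8_t *out, size_t size)
{
    if (size > sizeof v->b)
        size = sizeof v->b;
    memcpy(out, v->b, size);
}
```

In this model, writing a 4-byte value as Sn clears bytes 4 to 15, and a later 2-byte read as Hn returns the bottom half of that same value, which matches the nesting of B, H, S, D, and Q views in the list above.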
A consequence of this mapping is that if a program executing in AArch64 is interpreting D or
S registers from AArch32 execution, then the program must unpack the D or S registers from
the V registers before using them.
For the scalar ADD instruction:
ADD Vd, Vn, Vm
If the size was, for example, 32 bits, the instruction would be:
ADD Sd, Sn, Sm
Table 4-8 Operand name for differently sized scalars
Word size     Size (bits)    Name
Byte          8              Bn
Halfword      16             Hn
Word          32             Sn
Doubleword    64             Dn
Quadword      128            Qn
4.6.3
Vector register sizes
Vectors can be 64-bits wide with one or more elements or 128-bits wide with two or more
elements as shown in Figure 4-12:
Figure 4-12 Vector sizes (128-bit vectors: V0.2D, V0.4S, V0.8H, and V0.16B; 64-bit vectors,
with bits [127:64] unused: V31.1D, V31.2S, V31.4H, and V31.8B)
For the vector ADD instruction:
ADD Vd.T, Vn.T, Vm.T
For 32-bit elements in a 128-bit vector, giving 4 lanes, the instruction becomes:
ADD Vd.4S, Vn.4S, Vm.4S
Table 4-9 Operand names for different size vectors
Name      Shape
Vn.8B     8 lanes, each containing an 8-bit element
Vn.16B    16 lanes, each containing an 8-bit element
Vn.4H     4 lanes, each containing a 16-bit element
Vn.8H     8 lanes, each containing a 16-bit element
Vn.2S     2 lanes, each containing a 32-bit element
Vn.4S     4 lanes, each containing a 32-bit element
Vn.1D     1 lane containing a 64-bit element
Vn.2D     2 lanes, each containing a 64-bit element
When these registers are used in a specific instruction form, the names must be further qualified
to indicate the data shape. More specifically, this means the data element size and the number
of elements or lanes held within them.
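The relationship between vector width, element size, and lane count in Table 4-9 can be sketched in C, together with a lane-by-lane model of the 4S form of the vector ADD. The names lane_count and add_4s are hypothetical, and the model assumes the integer ADD wraps within each 32-bit lane without carrying into its neighbor.

```c
#include <stdint.h>

/* Number of lanes in a vector of vector_bits (64 or 128) whose elements
 * are element_bits wide (8, 16, 32, or 64). */
static inline unsigned lane_count(unsigned vector_bits, unsigned element_bits)
{
    return vector_bits / element_bits;
}

/* Model of ADD Vd.4S, Vn.4S, Vm.4S: four independent 32-bit additions.
 * Each lane wraps modulo 2^32; no carry crosses into the next lane. */
static inline void add_4s(uint32_t vd[4], const uint32_t vn[4], const uint32_t vm[4])
{
    for (unsigned lane = 0; lane < 4; lane++)
        vd[lane] = vn[lane] + vm[lane];
}
```

For example, lane_count(128, 32) gives the 4 lanes of Vn.4S and lane_count(64, 8) gives the 8 lanes of Vn.8B.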
4.6.4
NEON in AArch32 execution state
In AArch32, the smaller registers are packed into larger ones (D0 and D1 are combined to form
Q0, for instance). This introduces some tricky loop-carried dependencies which can reduce the
ability of the compiler to vectorize loop structures.
[Figure: Q0 (bits [127:0]) holds D1 in bits [127:64] and D0 in bits [63:0]; D0 holds S1 and S0, and D1 holds S3 and S2. Q1 likewise holds D3 (S7 and S6) and D2 (S5 and S4).]
Figure 4-13 Arrangement of ARMv7 SIMD registers
The floating-point and Advanced SIMD registers in AArch32 are mapped into the AArch64 FP
and SIMD registers. This is done to allow the floating-point and NEON registers of an
application or a virtual machine to be interpreted (and, as necessary, modified) by a higher level
of system software, for example, the OS or the Hypervisor.
The AArch64 V16-V31 FP and NEON registers are not accessible from AArch32. As with the
general-purpose registers, during execution in an Exception level using AArch32 these registers
retain their state from the previous execution using AArch64.
Chapter 5
An Introduction to the ARMv8 Instruction Sets
One of the most significant changes introduced in the ARMv8 architecture is the addition of a
64-bit instruction set. This set complements the existing 32-bit instruction set architecture. This
addition provides access to 64-bit wide integer registers and data operations, and the ability to
use 64-bit sized pointers to memory. The new instructions are known as A64 and execute in the
AArch64 execution state. ARMv8 also includes the original ARM instruction set, now called
A32, and the Thumb (T32) instruction set. Both A32 and T32 execute in AArch32 state, and
provide backward compatibility with ARMv7.
Although ARMv8-A provides backward compatibility with the 32-bit ARM Architectures, the
A64 instruction set is separate and distinct from the older ISA and is encoded differently. A64
adds some additional capabilities while also removing other features that would potentially limit
the speed or energy efficiency of high performance implementations. The ARMv8 architecture
includes some enhancements to the 32-bit instruction sets (A32 and T32) as well. However,
code that makes use of such features is not compatible with older ARMv7 implementations.
Instruction opcodes in the A64 instruction set, however, are still 32 bits long, not 64 bits.
Programmers seeking a more detailed description of A64 assembly language can also refer to
the ARM® Compiler armasm Reference Guide v6.01.
5.1
The ARMv8 instruction sets
The new A64 instruction set is similar to the existing A32 instruction set. Instructions are 32 bits
wide and have similar syntax.
The instruction sets use a generic naming convention within the ARMv8 architecture, so that
the original 32-bit instruction set states are now called:
A32
When in AArch32 state, the instruction set is largely compatible with ARMv7,
though there are differences. See the ARMv8-A Architecture Reference Manual. It
also provides some new instructions to align with some of the features that are
introduced in the A64 instruction set.
T32
The Thumb instruction set was first included in the ARM7TDMI processor and
originally contained only 16-bit instructions. 16-bit instructions gave much
smaller programs at the cost of some performance. ARMv7 processors, including
those in the Cortex-A series, support Thumb-2 technology, which extends the
Thumb instruction set to provide a mix of 16-bit and 32-bit instructions. This
gives performance similar to that of ARM, while retaining the reduced code size.
Because of its size and performance advantages, it is increasingly common for all
32-bit code to be compiled or assembled to take advantage of Thumb-2
technology.
A new instruction set has been introduced that the core can use when in AArch64 state. In
keeping with the naming convention, and reflecting the 64-bit operation, this instruction set is
called:
A64
A64 provides similar functionality to the A32 and T32 instruction sets in
AArch32 or ARMv7. The design of the new A64 instruction set allowed several
improvements:
A consistent encoding scheme
The late addition of some instructions in A32 resulted in some
inconsistency in the encoding scheme. For example, LDR and STR
support for halfwords is encoded slightly differently to the mainstream
byte and word transfer instructions. The result of this is that the
addressing modes are slightly different.
Wide range of constants
A64 instructions provide a huge range of options for constants, each
tailored to the requirements of specific instruction types.
• Arithmetic instructions generally accept a 12-bit immediate
constant.
• Logical instructions generally accept a 32-bit or 64-bit constant,
which has some constraints in its encoding.
• MOV instructions accept a 16-bit immediate, which can be shifted
to any 16-bit boundary.
• Address generation instructions are geared to addresses aligned
to a 4KB page size.
There are slightly more complex rules for constants that are used in bit
manipulation instructions. However, bitfield manipulation instructions
can address any contiguous sequence of bits, in either the source or
destination operand.
A64 provides flexible constants, but encoding them, even determining
whether a particular constant can be legally encoded in a particular
context, can be non-trivial.
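To give a feel for these rules, the following C sketch checks two of the simpler cases: whether a value fits an arithmetic immediate (12 bits, optionally shifted left by 12, as described for the comparison instructions later in this guide), and whether it can be produced by a single move of a 16-bit immediate at a 16-bit boundary. The function names are invented for this sketch. It deliberately ignores the inverted and bitmask forms that an assembler may also choose for MOV, and it does not attempt the more involved logical immediate rules.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if value is a 12-bit unsigned immediate, optionally shifted
   left by 12 bits (the arithmetic immediate form). */
static bool fits_arith_imm(uint64_t value)
{
    return (value & ~UINT64_C(0xFFF)) == 0 ||
           (value & ~(UINT64_C(0xFFF) << 12)) == 0;
}

/* True if value is a 16-bit immediate placed at bit 0, 16, 32, or 48
   with every other bit zero (a single 16-bit move). */
static bool fits_single_mov16(uint64_t value)
{
    for (int shift = 0; shift < 64; shift += 16) {
        if ((value & ~(UINT64_C(0xFFFF) << shift)) == 0) {
            return true;
        }
    }
    return false;
}
```

A code generator would typically try such checks in turn and fall back to a multi-instruction sequence or a literal load when none succeeds.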
Data types are easier
A64 deals naturally with 64-bit signed and unsigned data types in that
it offers more concise and efficient ways of manipulating 64-bit
integers. This can be advantageous for all languages which provide
64-bit integers such as C or Java.
Long offsets
A64 instructions generally provide longer offsets, both for PC-relative
branches and for offset addressing.
The increased branch range makes it easier to manage inter-section
jumps. Dynamically generated code is generally placed on the heap so
it can, in practice, be located anywhere. The runtime system finds it
much easier to manage this with increased branch ranges, and fewer
fix-ups are required.
The need for literal pools (blocks of literal data embedded in the code
stream) has long been a feature of ARM instruction sets. This still
exists in A64. However, the larger PC-relative load offset helps
considerably with the management of literal pools, making it possible
to use one per compilation unit. This removes the need to manufacture
locations for multiple pools in long code sequences.
Pointers
Pointers are 64-bit in AArch64, which allows larger amounts of virtual
memory to be addressed and gives more freedom for address mapping.
However, using 64-bit pointers does incur some costs. The same piece
of code typically uses more memory when running with 64-bit pointers
than with 32-bit pointers. Each pointer is stored in memory and
requires eight bytes instead of four. This might sound trivial, but can
add up to a significant penalty. Additionally, the increased use of
memory space that is associated with a move to 64 bits can cause a
drop in the number of accesses that hit in cache. This drop of cache hits
can reduce performance.
Some languages can be implemented with compressed pointers, such
as Java, to circumvent the performance issue.
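The memory cost can be illustrated with a linked list in C. The sketch below contrasts a node that stores a full-width pointer with one that stores a 32-bit index into a pool, which is one simple way of compressing a pointer. All names here are hypothetical, and this is not how any particular Java virtual machine implements compressed pointers.

```c
#include <stddef.h>
#include <stdint.h>

/* A list node that links to its successor with a full-width pointer.
   With 64-bit pointers this is typically 16 bytes after padding. */
struct node_ptr {
    struct node_ptr *next;
    uint32_t value;
};

/* Marks the end of a list that is linked by index. */
#define NODE_NONE UINT32_MAX

/* The same node using a 32-bit index into a pool instead of a pointer. */
struct node_idx {
    uint32_t next;   /* index into pool[], not an address */
    uint32_t value;
};

/* Sums the values of a list stored in pool[], starting at index head. */
static uint64_t sum_list(const struct node_idx *pool, uint32_t head)
{
    uint64_t total = 0;
    for (uint32_t i = head; i != NODE_NONE; i = pool[i].next) {
        total += pool[i].value;
    }
    return total;
}
```

The index form halves the per-node footprint on a 64-bit target at the price of an extra address calculation on each access, and it limits the pool to 2^32 - 1 nodes.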
Conditional constructs are used instead of IT blocks
IT blocks are a useful feature of T32, enabling efficient sequences that
avoid the need for short forward branches around unexecuted
instructions. However, they are sometimes difficult for hardware to
handle efficiently. A64 removes these blocks and replaces them with
conditional instructions such as CSEL (Conditional Select) and CINC
(Conditional Increment). These conditional constructs are more
straightforward and easier to handle without special cases.
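As a rough model of these two instructions, the C functions below express their results as plain expressions. The names csel, cinc, and max_i64 are chosen for this sketch only. Whether a compiler actually emits CSEL or CINC for such source code depends on the compiler and its options.

```c
#include <stdbool.h>
#include <stdint.h>

/* Models CSEL Xd, Xn, Xm, cond: Xd = cond ? Xn : Xm. */
static uint64_t csel(bool cond, uint64_t n, uint64_t m)
{
    return cond ? n : m;
}

/* Models CINC Xd, Xn, cond: Xd = cond ? Xn + 1 : Xn. */
static uint64_t cinc(bool cond, uint64_t n)
{
    return cond ? n + 1 : n;
}

/* Example use: a signed maximum, which can be implemented as a
   compare followed by a conditional select, with no branch. */
static int64_t max_i64(int64_t a, int64_t b)
{
    return (int64_t)csel(a > b, (uint64_t)a, (uint64_t)b);
}
```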
Shift and rotate behavior is more intuitive
The A32 or T32 shift and rotate behavior does not always map easily
to the behavior expected by high-level languages.
ARMv7 provides a barrel shifter that can be used as part of data
processing instructions. However, specifying the type of shift and the
amount to shift requires a certain number of opcode bits, which could
be used elsewhere.
A64 therefore removes options that were rarely used, and
instead adds new explicit instructions to carry out more complicated
shift operations.
Code generation
When generating code, both statically and dynamically, for common
arithmetic functions, A32 and T32 often require different instructions,
or instruction sequences. This is to cope with different data types.
These operations in A64 are much more consistent so it is much easier
to generate common sequences for simple operations on differently
sized data types.
For example, in T32 the same instruction can have different encodings
depending on what registers are used (either a low register or a high
register).
The A64 instruction set encodings are much more regular and
rationalized. Consequently, an assembler for A64 typically requires
fewer lines of code than an assembler for T32.
Fixed-length instructions
All A64 instructions are the same length, unlike T32, which is a
variable-length instruction set. This makes management and tracking
of generated code sequences easier, particularly affecting dynamic
code generators.
Three operands map better
A32, in general, preserves a true three-operand structure for
data-processing operations. T32, on the other hand, contains a
significant number of two-operand instruction formats, which make it
slightly less flexible when generating code. A64 sticks to a consistent
three-operand syntax, which further contributes to the regularity and
homogeneity of the instruction set for the benefit of compilers.
5.1.1
Distinguishing between 32-bit and 64-bit A64 instructions
Most integer instructions in the A64 instruction set have two forms, which operate on either
32-bit or 64-bit values within the 64-bit general-purpose register file.
When looking at the register name that the instruction uses:
• If the register name starts with X, it is a 64-bit value.
• If the register name starts with W, it is a 32-bit value.
Where a 32-bit instruction form is selected, the following facts hold true:
• Right shifts and rotates inject at bit 31, instead of bit 63.
• The condition flags, where set by the instruction, are computed from the lower 32 bits.
• Writes to the W register set bits [63:32] of the X register to zero.
This distinction applies even when the results of a 32-bit instruction form would be
indistinguishable from the lower 32 bits computed by the equivalent 64-bit instruction form. For
example, a 32-bit bitwise ORR could be performed using a 64-bit ORR and simply ignoring the top
32 bits of the result. The A64 instruction set includes separate 32 and 64-bit forms of the ORR
instruction.
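The difference between the two forms can be modelled in C by treating each register as a 64-bit value. This is only a sketch of the rules listed above; the function names are invented here.

```c
#include <stdint.h>

/* Models ADD Xd, Xn, Xm: a full 64-bit addition. */
static uint64_t add_x_form(uint64_t xn, uint64_t xm)
{
    return xn + xm;
}

/* Models ADD Wd, Wn, Wm: only the low 32 bits of each source are used,
   and the value written to the 64-bit register has bits [63:32] zero. */
static uint64_t add_w_form(uint64_t xn, uint64_t xm)
{
    uint32_t result = (uint32_t)xn + (uint32_t)xm;
    return (uint64_t)result;  /* zero-extended into the X register */
}

/* Models LSR Wd, Wn, #shift for shift in 0..31: zeros enter at bit 31,
   not at bit 63, and the upper half of the result is zero. */
static uint64_t lsr_w_form(uint64_t xn, unsigned shift)
{
    return (uint64_t)((uint32_t)xn >> shift);
}
```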
The C and C++ LP64 and LLP64 data models are expected to be the most commonly used on
AArch64. They both define the frequently used int, short, and char types to be 32 bits or less.
By maintaining this semantic information in the instruction set, implementations can exploit it,
for example, to avoid expending energy or cycles to compute, forward, and store
the unused upper 32 bits of such data types. Implementations are free to exploit this freedom in
whatever way they choose to save energy.
The A64 instruction set therefore provides distinct sign and zero-extend instructions. Additionally,
the A64 instruction set makes it possible to extend and shift the final source register of an ADD,
SUB, CMN, or CMP instruction and the index register of a Load or Store instruction. This results in
efficient implementation of array index calculations involving a 64-bit array pointer and 32-bit
array index.
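For instance, a load such as LDR W0, [X1, W2, SXTW #2] forms its address from a 64-bit base and a sign-extended, scaled 32-bit index. The C sketch below models that address calculation and shows the source-level operation it corresponds to. The function names are illustrative only.

```c
#include <stdint.h>

/* Models the address formed by LDR Wt, [Xn, Wm, SXTW #2]:
   the 32-bit index is sign-extended to 64 bits, scaled by the
   4-byte element size, and added to the 64-bit base. */
static uint64_t index_address_sxtw(uint64_t base, uint32_t w_index)
{
    int64_t index = (int32_t)w_index;       /* SXTW: sign-extend */
    return base + ((uint64_t)index << 2);   /* scale by 4, then add */
}

/* The C-level operation this addressing mode supports directly:
   a 64-bit array pointer indexed by a 32-bit signed integer. */
static int32_t load_element(const int32_t *array, int32_t index)
{
    return array[index];
}
```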
5.1.2
Addressing
When the processor can store 64-bit values in a single register, it becomes much simpler to
access large amounts of memory within a program. A single thread executing on a 32-bit core
is limited to accessing 4GB of address space. Large parts of that addressable space are reserved
for use by the OS kernel, library code, peripherals, and more. As a result, lack of space means
that the program might need to map some data in or out of memory while executing. Having a
larger address space, with 64-bit pointers, avoids this problem. It also makes techniques such as
memory-mapped files more attractive and convenient to use. The file contents are mapped into
the memory map of a thread, even though the physical RAM might not be large enough to
contain the whole file.
Other improvements to addressing include the following:
Exclusive accesses
Exclusive load-store of a byte, halfword, word and doubleword. Exclusive access
to a pair of doublewords permits atomic updates of a pair of pointers, for example
circular list inserts. All exclusive accesses must be naturally aligned, and
exclusive pair access must be aligned to twice the data size, that is, 128 bits for a
pair of 64-bit values.
Increased PC-relative offset addressing
PC-relative literal loads have an offset range of ±1MB. Compared to the
PC-relative loads of A32, this reduces the number of literal pools, and increases
sharing of literal data between functions. In turn, this reduces I-cache and TLB
pollution.
Most conditional branches have a range of ±1MB, expected to be sufficient for
the majority of conditional branches that take place within a single function.
Unconditional branches, including branch and link, have a range of ±128MB,
expected to be sufficient to span the static code segment of most executable load
modules and shared objects, without needing linker-inserted veneers.
Note
Veneers are small pieces of code that are automatically inserted by the linker, for
example, when it detects that a branch target is out of range. The veneer becomes
an intermediate target of the original branch with the veneer itself then being a
branch to the target address.
The linker can reuse a veneer generated for a previous call, for other calls to the
same function if it is in range from both calls. Occasionally, such veneers can be
a performance factor.
If you have a loop that calls multiple functions through veneers, you will get
many pipeline flushes and therefore sub-optimal performance. Placing related
code together in memory can avoid this.
PC-relative load and store and address generation with a range of ±4GB can be
performed inline using only two instructions, that is, without the need to load an
offset from a literal pool.
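The ranges quoted above can be written down as simple reachability checks, of the kind a linker or JIT compiler might perform before deciding whether a veneer is required. This is a simplified sketch with invented names. It treats each range as a signed byte displacement and ignores the alignment of targets and the page granularity used by address generation.

```c
#include <stdbool.h>
#include <stdint.h>

#define KB INT64_C(1024)
#define MB (1024 * KB)
#define GB (1024 * MB)

/* True if a byte displacement lies in the signed range
   -limit <= disp < limit. */
static bool in_range(int64_t disp, int64_t limit)
{
    return disp >= -limit && disp < limit;
}

/* Most conditional branches, and literal loads: +/-1MB. */
static bool cond_branch_reaches(uint64_t from, uint64_t to)
{
    return in_range((int64_t)(to - from), 1 * MB);
}

/* Unconditional branches, including branch and link: +/-128MB. */
static bool branch_reaches(uint64_t from, uint64_t to)
{
    return in_range((int64_t)(to - from), 128 * MB);
}

/* Two-instruction PC-relative address generation: +/-4GB. */
static bool two_insn_address_reaches(uint64_t from, uint64_t to)
{
    return in_range((int64_t)(to - from), 4 * GB);
}
```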
Unaligned address support
Except for exclusive and ordered accesses, all loads and stores support the use of
unaligned addresses when accessing normal memory. This simplifies porting
code to A64.
Bulk transfers
The LDM, STM, PUSH, and POP instructions do not exist in A64. Bulk transfers can be
constructed using the LDP and STP instructions. These instructions load and store
a pair of independent registers from consecutive memory locations.
The LDNP and STNP instructions provide a streaming or non-temporal hint that the
data does not need to be retained in caches.
The PRFM (prefetch memory) instructions enable targeting of a prefetch to a
specific cache level.
Load/Store
All Load/Store instructions now support consistent addressing modes. This
makes it much easier, for example, to treat char, short, int and long long in the
same way when loading and storing quantities from memory.
The floating-point and NEON registers now support the same addressing modes
as the core registers, making it easier to use the two register banks
interchangeably.
Alignment checking
When executing in AArch64, additional alignment checking is performed on
instruction fetches and on loads or stores using the stack pointer, enabling
misalignment checking of the PC or the current SP.
This approach is preferable to forcing the correct alignment of the PC or SP,
because a misalignment of the PC or SP commonly indicates a software error,
such as corruption of an address in software.
There are a number of types of alignment checking:
• Program Counter alignment checking generates an exception associated
with instruction fetch whenever an attempt is made to execute an
instruction fetched with a misaligned PC in AArch64.
A misaligned PC is defined to be one where bits [1:0] of the PC are not 00.
A PC misalignment is identified in the exception syndrome register
associated with the target Exception level.
When the exception is handled using AArch64, the associated exception
link register holds the entire PC in its misaligned form, as does the Fault
Address Register, FAR_ELn, for the Exception level in which the exception
is taken.
PC alignment checking is performed in AArch64, and in AArch32 as part
of Data Abort exception handling.
• Stack Pointer (SP) alignment checking generates an exception associated
with data memory access whenever a load or store using a misaligned stack
pointer as a base address in AArch64 is attempted.
A misaligned stack pointer is one where bits [3:0] of the stack pointer, used
as the base address of the calculation, are not 0000. The stack pointer must
be 16-byte aligned whenever it is used as a base address.
Stack pointer alignment checking is only performed in AArch64, and can
be enabled independently for each Exception level:
— EL0 and EL1 are controlled by two separate bits in SCTLR_EL1.
— EL2 is controlled by a bit in SCTLR_EL2.
— EL3 is controlled by a bit in SCTLR_EL3.
5.1.3
Registers
The A64 64-bit register bank helps reduce register pressure in most applications.
The A64 Procedure Call Standard (PCS) passes up to eight parameters in registers (X0-X7). In
contrast, A32 and T32 pass only four arguments in registers, with any excess being passed on
the stack.
The PCS also defines a dedicated Frame Pointer (FP), which makes debugging and call-graph
profiling easier by making it possible to reliably unwind the stack. Refer to Chapter 9 The ABI
for ARM 64-bit Architecture for further information.
A consequence of adopting 64-bit wide integer registers is the varying widths of variables used
by programming languages. A number of standard models are currently in use, which differ
mainly in the size defined for integers, longs, and pointers:
Table 5-1 Variable width (bits)
Type        ILP32   LP64   LLP64
char        8       8      8
short       16      16     16
int         32      32     32
long        32      64     32
long long   64      64     64
size_t      32      64     64
pointer     32      64     64
64-bit Linux implementations use LP64 and this is supported by the A64 Procedure Call
Standard. Other PCS variants are defined that can be used by other operating systems.
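A program can work out which of the models in Table 5-1 its compiler uses by inspecting the widths of int, long, and pointers. The sketch below does this with sizeof; the enum and function names are made up for the example.

```c
#include <limits.h>
#include <stddef.h>

enum data_model { MODEL_ILP32, MODEL_LP64, MODEL_LLP64, MODEL_OTHER };

/* Returns the data model from Table 5-1 that matches the given widths
   in bits, or MODEL_OTHER if none of the three matches. */
static enum data_model classify_data_model(size_t int_bits,
                                           size_t long_bits,
                                           size_t pointer_bits)
{
    if (int_bits == 32 && long_bits == 32 && pointer_bits == 32) {
        return MODEL_ILP32;
    }
    if (int_bits == 32 && long_bits == 64 && pointer_bits == 64) {
        return MODEL_LP64;
    }
    if (int_bits == 32 && long_bits == 32 && pointer_bits == 64) {
        return MODEL_LLP64;
    }
    return MODEL_OTHER;
}

/* Classifies the compiler that is building this file. */
static enum data_model this_data_model(void)
{
    return classify_data_model(sizeof(int) * CHAR_BIT,
                               sizeof(long) * CHAR_BIT,
                               sizeof(void *) * CHAR_BIT);
}
```

Code that must be portable between models is usually better served by the fixed-width types in stdint.h than by run-time checks of this kind.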
Zero register
The zero register (WZR/XZR) is used for a few encoding tricks. For example,
there is no plain multiply encoding, just multiply-add. The instruction MUL W0, W1,
W2 is identical to MADD W0, W1, W2, WZR which uses the zero register. Not all
instructions can use the XZR/WZR. As we mentioned in Chapter 4, the zero
register shares the same encoding as the stack pointer. This means that, for some
arguments, for a very limited number of instructions, WZR/XZR is not available,
but WSP/SP is used instead.
Example 5-1 Using the Zero register to write a zero to memory
In A32:
mov r0, #0
str r0, [...]
In A64 using the zero register:
str wzr, [...]
No need for a spare register. Or write 16 bytes of zeros using:
stp xzr, xzr, [...] etc
A convenient side-effect of the zero register is that there are many NOP instructions
with large immediate fields. For example, ADR XZR, #imm alone gives you 21 bits
of data in an instruction with no other side effects. This is very useful for JIT
compilers, where code can be patched at runtime.
Stack pointer
The Stack Pointer (SP) cannot be referenced by most instructions. Some forms of
arithmetic instructions can read or write the current stack pointer. This might be
done to adjust the stack pointer in a function prologue or epilogue. For example:
ADD SP, SP, #256 // SP = SP + 256
Program counter
The current Program Counter (PC) cannot be referred to by number as if part of
the general register file and therefore cannot be used as the source or destination
of arithmetic instructions, or as the base, index or transfer register of load and
store instructions.
The only instructions that read the PC are those whose function it is to compute a
PC-relative address (ADR, ADRP, literal load, and direct branches), and the
branch-and-link instructions that store a return address in the link register (BL and
BLR). The only way to modify the program counter is using branch, exception
generation and exception return instructions.
Where the PC is read by an instruction to compute a PC-relative address, then its
value is the address of that instruction. Unlike A32 and T32, there is no implied
offset of 4 or 8 bytes.
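The difference can be stated compactly in C. The function below returns the value seen by an instruction that reads the PC, given the address of that instruction. The names are invented for this sketch, and the offsets used for A32 and T32 are the usual ARMv7 values of 8 and 4 bytes respectively.

```c
#include <stdint.h>

enum isa { ISA_A32, ISA_T32, ISA_A64 };

/* Value observed when the instruction at insn_addr reads the PC. */
static uint64_t pc_read_value(enum isa isa, uint64_t insn_addr)
{
    switch (isa) {
    case ISA_A32: return insn_addr + 8;  /* implied offset of 8 bytes */
    case ISA_T32: return insn_addr + 4;  /* implied offset of 4 bytes */
    case ISA_A64: return insn_addr;      /* no implied offset */
    }
    return insn_addr;  /* not reached for valid enum values */
}
```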
FP and NEON registers
The most significant update to the NEON registers is that NEON now has 32
16-byte registers, instead of the 16 registers it had before. The simpler mapping
scheme between the different register sizes in the floating-point and NEON
register bank makes these registers much easier to use. The mapping is easier for
compilers and optimizers to model and analyze.
Register indexed addressing
The A64 instruction set provides additional addressing modes with respect to
A32, allowing a 64-bit index register to be added to the 64-bit base register, with
optional scaling of the index by the access size. Additionally, it provides sign or
zero-extension of a 32-bit value within an index register, again with optional
scaling.
5.2
C/C++ inline assembly
In this section, we briefly cover how to include assembly code within C or C++ language
modules.
The asm keyword can incorporate inline GCC syntax assembly code into a function. For
example:
#include <stdio.h>
int add(int i, int j)
{
    int res = 0;
    asm (
        "ADD %w[result], %w[input_i], %w[input_j]"
            // Use `%w[name]` to operate on W
            // registers (as in this case).
            // You can use `%x[name]` for X
            // registers too, but this is the
            // default.
        : [result] "=r" (res)
        : [input_i] "r" (i), [input_j] "r" (j)
    );
    return res;
}
int main(void)
{
    int a = 1;
    int b = 2;
    int c = 0;
    c = add(a, b);
    printf("Result of %d + %d = %d\n", a, b, c);
    return 0;
}
The general form of an asm inline assembly statement is:
asm(code [: output_operand_list [: input_operand_list [: clobber_list]]]);
where:
code is the assembly code. In our example, this is "ADD %w[result], %w[input_i], %w[input_j]".
output_operand_list is an optional list of output operands, separated by commas. Each operand
consists of a symbolic name in square brackets, a constraint string, and a C expression in
parentheses. In this example, there is a single output operand: [result] "=r" (res).
input_operand_list is an optional list of input operands, separated by commas. Input operands
use the same syntax as output operands. In this example, there are two input operands: [input_i]
"r" (i) and [input_j] "r" (j).
clobber_list is an optional list of clobbered registers, or other values. In our example, this is
omitted.
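As an example of a statement that does need a clobber list, the following sketch uses ADDS, which writes the condition flags, and therefore lists "cc" as clobbered. The function name add_setting_flags is invented for this example. So that the code also builds on other targets, the inline assembly is guarded by the __aarch64__ and __GNUC__ predefined macros, with a plain C fallback.

```c
#include <stdint.h>

/* Returns a + b. On AArch64 with a GCC-compatible compiler this uses
   ADDS, which also updates the condition flags, so "cc" is declared
   in the clobber list to tell the compiler the flags are modified. */
static uint32_t add_setting_flags(uint32_t a, uint32_t b)
{
    uint32_t res;
#if defined(__aarch64__) && defined(__GNUC__)
    __asm__ (
        "ADDS %w[result], %w[input_a], %w[input_b]"
        : [result] "=r" (res)
        : [input_a] "r" (a), [input_b] "r" (b)
        : "cc"
    );
#else
    res = a + b;  /* portable fallback for other targets */
#endif
    return res;
}
```

Omitting "cc" here could allow the compiler to assume that flags set before the statement are still valid after it.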
When calling functions between C/C++ and assembly code, you must follow the AAPCS64
rules.
For further information, see:
https://gcc.gnu.org/onlinedocs/gcc/Using-Assembly-Language-with-C.html#Using-Assembly-Language-with-C
5.3
Switching between the instruction sets
It is not possible to use code from the two execution states within a single application. There is
no interworking between A64 and A32 or T32 instruction sets in ARMv8 as there is between
A32 and T32 instruction sets. Code written in A64 for the ARMv8 processors cannot run on
ARMv7 Cortex-A series processors. However, code written for ARMv7-A processors can run
on ARMv8 processors in the AArch32 execution state. This is summarized in Figure 5-1.
[Figure: A32 (32-bit instructions, 32-bit general purpose registers) and T32 (mixed 16 and 32-bit instructions, 32-bit general purpose registers) switch between each other using BX, BLX, MOV PC, LDR PC, or exception entry or return. A64 (32-bit instructions, 32 and 64-bit general purpose registers) is reached from, and left for, A32 or T32 only through exception entry and exception return.]
Figure 5-1 Switching between instruction sets
Chapter 6
The A64 instruction set
Many programmers writing at the application level do not need to write code in assembly
language. However, assembly code can be useful in cases where highly optimized code is
required. This is the case when writing compilers, or where use of low level features not
directly available in C is needed. It might be required for portions of boot code, device drivers,
or when developing operating systems. Finally, it can be useful to be able to read assembly code
when debugging C, and particularly, to understand the mapping between assembly instructions
and C statements.
6.1
Instruction mnemonics
The A64 assembly language overloads instruction mnemonics, and distinguishes between the
different forms of an instruction based on the operand register names. For example, the ADD
instructions below all have different encodings, but you only have to remember one mnemonic,
and the assembler automatically chooses the correct encoding based on the operands.
ADD W0, W1, W2            // add 32-bit registers
ADD X0, X1, X2            // add 64-bit registers
ADD X0, X1, W2, SXTW      // add sign extended 32-bit register to 64-bit extended
                          // register
ADD X0, X1, #42           // add immediate to 64-bit register
ADD V0.8H, V1.8H, V2.8H   // NEON 16-bit add, in each of 8 lanes
6.2
Data processing instructions
These are the fundamental arithmetic and logical operations of the processor and operate on
values in the general-purpose registers, or a register and an immediate value. Multiply and
divide instructions on page 6-4 can be considered special cases of these instructions.
Data processing instructions mostly use one destination register and two source operands. The
general format can be considered to be the instruction, followed by the operands, as follows:
Instruction Rd, Rn, Operand2
The second operand might be a register, a modified register, or an immediate value. The use of
R indicates that it can be either an X or a W register.
The data processing operations include:
• Arithmetic and logical operations.
• Move and shift operations.
• Instructions for sign and zero extension.
• Bit and bitfield manipulation.
• Conditional comparison and data processing.
6.2.1
Arithmetic and logical operations
Table 6-1 shows some of the available arithmetic and logical operations.
Table 6-1 Arithmetic and logical operations
Type         Instructions
Arithmetic   ADD, SUB, ADC, SBC, NEG
Logical      AND, BIC, ORR, ORN, EOR, EON
Comparison   CMP, CMN, TST
Move         MOV, MVN
Some instructions also have an S suffix, indicating that the instruction sets flags. Of the
instructions in Table 6-1, this includes ADDS, SUBS, ADCS, SBCS, ANDS, and BICS. There are other flag
setting instructions, notably CMP, CMN and TST, but these do not take an S suffix.
The operations ADC and SBC perform additions and subtractions that also use the carry condition
flag as an input.
ADC{S}: Rd = Rn + Rm + C
SBC{S}: Rd = Rn - Rm - 1 + C
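The two formulas translate directly into C. The sketch below models them on 64-bit values and, as one typical use, chains a flag-setting add with ADC to perform a 128-bit addition. The type u128_t and all function names are invented for this illustration.

```c
#include <stdint.h>

/* Models ADC: Rd = Rn + Rm + C, where c is the carry flag (0 or 1). */
static uint64_t adc(uint64_t n, uint64_t m, unsigned c)
{
    return n + m + (c & 1u);
}

/* Models SBC: Rd = Rn - Rm - 1 + C. */
static uint64_t sbc(uint64_t n, uint64_t m, unsigned c)
{
    return n - m - 1 + (c & 1u);
}

/* A 128-bit value held as two 64-bit halves. */
typedef struct {
    uint64_t lo;
    uint64_t hi;
} u128_t;

/* 128-bit addition: the low halves are added first (as ADDS would do),
   and the carry out feeds the addition of the high halves (ADC). */
static u128_t add128(u128_t a, u128_t b)
{
    u128_t r;
    r.lo = a.lo + b.lo;
    unsigned carry = r.lo < a.lo;      /* carry out of the low half */
    r.hi = adc(a.hi, b.hi, carry);
    return r;
}
```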
Example 6-1 Arithmetic instructions
ADD W0, W1, W2, LSL #3    // W0 = W1 + (W2 << 3)
SUBS X0, X4, X3, ASR #2   // X0 = X4 - (X3 >> 2), set flags
MOV X0, X1                // Copy X1 to X0
CMP W3, W4                // Set flags based on W3 - W4
ADD W0, W5, #27           // W0 = W5 + 27
The logical operations are essentially the same as the corresponding boolean operators operating
on individual bits of the register.
The BIC (Bitwise bit Clear) instruction performs an AND of the first source register (the register
after the destination register) with the inverted value of the second operand. For example, to clear bit [11]
of register X0, use:
MOV X1, #0x800
BIC X0, X0, X1
ORN and EON perform an OR or EOR respectively with a bitwise-NOT of the second operand.
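Expressed with C operators, and using function names chosen only for this sketch, the three inverted-operand instructions behave as follows:

```c
#include <stdint.h>

/* Models BIC Xd, Xn, Xm: AND with the inverted second operand. */
static uint64_t bic(uint64_t n, uint64_t m)
{
    return n & ~m;
}

/* Models ORN Xd, Xn, Xm: OR with the inverted second operand. */
static uint64_t orn(uint64_t n, uint64_t m)
{
    return n | ~m;
}

/* Models EON Xd, Xn, Xm: exclusive OR with the inverted second operand. */
static uint64_t eon(uint64_t n, uint64_t m)
{
    return n ^ ~m;
}
```

With these definitions, bic(x, 0x800) clears bit [11] of x, matching the MOV and BIC sequence shown above.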
The comparison instructions only modify the flags and have no other effect. The range of
immediate values for these instructions is 12 bits, and this value can be optionally shifted 12 bits
to the left.
6.2.2
Multiply and divide instructions
The multiply instructions provided are broadly similar to those in ARMv7-A, but with the
ability to perform 64-bit multiplies in a single instruction.
Table 6-2 Multiplication operations in assembly language
Opcode    Description
Multiply instructions
MADD      Multiply add
MNEG      Multiply negate
MSUB      Multiply subtract
MUL       Multiply
SMADDL    Signed multiply-add long
SMNEGL    Signed multiply-negate long
SMSUBL    Signed multiply-subtract long
SMULH     Signed multiply returning high half
SMULL     Signed multiply long
UMADDL    Unsigned multiply-add long
UMNEGL    Unsigned multiply-negate long
UMSUBL    Unsigned multiply-subtract long
UMULH     Unsigned multiply returning high half
UMULL     Unsigned multiply long
Divide instructions
SDIV      Signed divide
UDIV      Unsigned divide
There are multiply instructions that operate on 32-bit or 64-bit values and return a result of the
same size as the operands. For example, two 64-bit registers can be multiplied to produce a
64-bit result with the MUL instruction.
MUL X0, X1, X2 // X0 = X1 * X2
There is also the ability to add or subtract an accumulator value in a third source register, using
the MADD or MSUB instructions.
The MNEG instruction can be used to negate the result, for example:
MNEG X0, X1, X2 // X0 = -(X1 * X2)
Additionally, there are a range of multiply instructions that produce a long result, that is,
multiplying two 32-bit numbers and generating a 64-bit result. There are both signed and
unsigned variants of these long multiplies (UMULL, SMULL). There are also options to accumulate
a value from another register (UMADDL, SMADDL) or to negate (UMNEGL, SMNEGL).
The 32-bit and 64-bit multiplies, with optional accumulation, give a result of the same size
as the operands:
• 32 ± (32 × 32) gives a 32-bit result.
• 64 ± (64 × 64) gives a 64-bit result.
• ± (32 × 32) gives a 32-bit result.
• ± (64 × 64) gives a 64-bit result.
Widening multiplies, both signed and unsigned, with optional accumulation give a single 64-bit result:
• 64 ± (32 × 32) gives a 64-bit result.
• ± (32 × 32) gives a 64-bit result.
A 64 × 64 to 128-bit multiply requires a sequence of two instructions to generate a pair of 64-bit
result registers:
• ± (64 × 64) gives the lower 64 bits of the result [63:0].
• (64 × 64) gives the higher 64 bits of the result [127:64].
Note
The list contains no 32 × 64 options. You cannot directly multiply a 32-bit W register by a 64-bit
X register.
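To make the two-instruction 128-bit multiply concrete, the following portable C sketch computes the same pair of results that MUL and UMULH would produce for unsigned operands. It builds the high half from 32-bit partial products so that it does not depend on any 128-bit integer extension. The function names are invented for this example.

```c
#include <stdint.h>

/* Models MUL Xd, Xn, Xm: the lower 64 bits of the product. */
static uint64_t mul_low(uint64_t a, uint64_t b)
{
    return a * b;  /* wraps modulo 2^64 */
}

/* Models UMULH Xd, Xn, Xm: the upper 64 bits of the unsigned
   128-bit product, built from four 32 x 32 partial products. */
static uint64_t umulh_model(uint64_t a, uint64_t b)
{
    uint64_t a_lo = (uint32_t)a, a_hi = a >> 32;
    uint64_t b_lo = (uint32_t)b, b_hi = b >> 32;

    uint64_t p0 = a_lo * b_lo;   /* contributes to bits [63:0]   */
    uint64_t p1 = a_lo * b_hi;   /* contributes to bits [95:32]  */
    uint64_t p2 = a_hi * b_lo;   /* contributes to bits [95:32]  */
    uint64_t p3 = a_hi * b_hi;   /* contributes to bits [127:64] */

    /* Sum of everything that lands in bits [63:32]; the part above
       bit 31 of this sum is the carry into the high half. */
    uint64_t mid = (p0 >> 32) + (uint32_t)p1 + (uint32_t)p2;

    return p3 + (p1 >> 32) + (p2 >> 32) + (mid >> 32);
}
```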
The ARMv8-A architecture has support for signed and unsigned division of 32-bit and 64-bit
sized values. For example:
UDIV W0, W1, W2   // W0 = W1 / W2 (unsigned, 32-bit divide)
SDIV X0, X1, X2   // X0 = X1 / X2 (signed, 64-bit divide)
Overflow and divide-by-zero are not trapped:
• Any integer division by zero returns zero.
• Overflow can only occur in SDIV:
— INT_MIN / -1 returns INT_MIN, where INT_MIN is the smallest negative number that
can be encoded in the registers used for the operation. The result is always rounded
towards zero, as in most C/C++ dialects.
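The untrapped corner cases can be sketched in C as follows. The helper names `udiv32` and `sdiv64` are hypothetical. In C itself, division by zero and INT_MIN / -1 are undefined, so the sketch has to test for them explicitly to reproduce the results described above.

```c
#include <stdint.h>

/* Model of UDIV Wd, Wn, Wm: division by zero yields 0 instead of trapping. */
uint32_t udiv32(uint32_t n, uint32_t m)
{
    return (m == 0) ? 0 : n / m;
}

/* Model of SDIV Xd, Xn, Xm. */
int64_t sdiv64(int64_t n, int64_t m)
{
    if (m == 0)
        return 0;                   /* divide-by-zero returns zero */
    if (n == INT64_MIN && m == -1)
        return INT64_MIN;           /* the only overflow case: INT_MIN / -1 */
    return n / m;                   /* C99 and later truncate towards zero */
}
```

For example, `sdiv64(-7, 2)` gives -3 rather than -4, because the quotient is rounded towards zero.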
6.2.3 Shift operations
The following instructions are specifically for shifting:
• Logical Shift Left (LSL). The LSL instruction performs multiplication by a power of 2.
• Logical Shift Right (LSR). The LSR instruction performs unsigned division by a power of 2.
• Arithmetic Shift Right (ASR). The ASR instruction performs division by a power of 2,
  preserving the sign bit.
• Rotate right (ROR). The ROR instruction performs a bitwise rotation, wrapping the bits
  rotated from the LSB into the MSB.
Table 6-3 Shift and move operations
         Instruction   Description
Shift    ASR           Arithmetic shift right
         LSL           Logical shift left
         LSR           Logical shift right
         ROR           Rotate right
Move     MOV           Move
         MVN           Bitwise NOT
LSL Logical shift left: zeros are shifted in at the LSB and bits shifted out are lost.
    Multiplication by 2^n, where n is the shift amount.
LSR Logical shift right: zeros are shifted in at the MSB and bits shifted out are lost.
    Unsigned division by 2^n, where n is the shift amount.
ASR Arithmetic shift right: the sign bit is copied in at the MSB and bits shifted out are lost.
    Division by 2^n, where n is the shift amount, preserving the sign bit.
ROR Rotate right: bit rotate with wrap around from LSB to MSB.
Figure 6-1 Shift operations
The register being shifted can be 32-bit or 64-bit. The shift amount can be specified either as
an immediate, from zero up to the register size minus one, or in a register, in which case only
the bottom five bits (modulo-32) or six bits (modulo-64) of the value are used.
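A minimal C sketch of the four shift operations on a 32-bit register, with the shift amount supplied in a register, might look like this. The function names are hypothetical. Each one masks the amount to its bottom five bits, as described above, and the arithmetic shift is written so that it does not rely on implementation-defined behavior when shifting negative values.

```c
#include <stdint.h>

/* Only the bottom five bits of n are used (shift amount modulo 32). */
uint32_t lsl32(uint32_t x, uint32_t n) { return x << (n & 31); }
uint32_t lsr32(uint32_t x, uint32_t n) { return x >> (n & 31); }

/* Arithmetic shift right: the sign bit fills the vacated positions. */
int32_t asr32(int32_t x, uint32_t n)
{
    n &= 31;
    /* For negative x, ~x is non-negative, so its right shift is well defined. */
    return (x < 0) ? ~(~x >> n) : (x >> n);
}

/* Rotate right: bits leaving the LSB re-enter at the MSB. */
uint32_t ror32(uint32_t x, uint32_t n)
{
    n &= 31;
    return n ? (x >> n) | (x << (32 - n)) : x;
}
```

Note that `asr32(-7, 1)` gives -4, not -3: an arithmetic shift rounds towards minus infinity, whereas SDIV rounds towards zero.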
6.2.4 Bitfield and byte manipulation instructions
There are instructions that extend a byte, halfword, or word to register size, which can be either
X or W. These instructions exist in both signed (SXTB, SXTH, SXTW) and unsigned (UXTB, UXTH)
variants and are aliases to the appropriate bitfield manipulation instruction.
Both the signed and unsigned variants of these instructions extend a byte, halfword, or word
(although only SXTW operates on a word) to register size. The source is always a W register. The
destination register is either an X or a W register, except for SXTW which must be an X register.
For example:
SXTB X0, W1             // Sign extend the least significant byte of register W1
                        // from 8 bits to 64 bits by repeating the leftmost bit of the
                        // byte.
Bitfield instructions are similar to those that exist in ARMv7 and include Bit Field Insert (BFI),
and signed and unsigned Bit Field Extract ((S/U)BFX). There are extra bitfield instructions too,
such as BFXIL (Bit Field Extract and Insert Low), UBFIZ (Unsigned Bit Field Insert in Zero), and
SBFIZ (Signed Bit Field Insert in Zero).
Register values are shown from bit 31 (left) to bit 0 (right).
W0 = 0000 0000 1010 1100 0100 1110 0111 0100
BFI W0, W0, #9, #6      ; Bit field insert
W0 = 0000 0000 1010 1100 0110 1000 0111 0100
UBFX W1, W0, #18, #7    ; Bit field extract (zero extend)
W1 = 0000 0000 0000 0000 0000 0000 0010 1011
BFC W1, WZR, #3, #4     ; Bit field clear
W1 = 0000 0000 0000 0000 0000 0000 0000 0011
Figure 6-2 Bit manipulation instructions
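The register values in Figure 6-2 can be reproduced with a short C sketch. The helpers `low_mask`, `bfi32`, `ubfx32` and `bfc32` are hypothetical names that model the insert, extract and clear operations on a 32-bit register. They assume `lsb + width` does not exceed 32.

```c
#include <stdint.h>

/* Mask with the low `width` bits set; valid for 1 <= width <= 32. */
uint32_t low_mask(unsigned width)
{
    return (width >= 32) ? UINT32_MAX : ((UINT32_C(1) << width) - 1);
}

/* Bit field insert: copy the low `width` bits of n into d at bit `lsb`. */
uint32_t bfi32(uint32_t d, uint32_t n, unsigned lsb, unsigned width)
{
    uint32_t mask = low_mask(width) << lsb;
    return (d & ~mask) | ((n << lsb) & mask);
}

/* Unsigned bit field extract: take `width` bits of n from bit `lsb`,
   zero extended. */
uint32_t ubfx32(uint32_t n, unsigned lsb, unsigned width)
{
    return (n >> lsb) & low_mask(width);
}

/* Bit field clear: zero `width` bits of d starting at bit `lsb`. */
uint32_t bfc32(uint32_t d, unsigned lsb, unsigned width)
{
    return d & ~(low_mask(width) << lsb);
}
```

Starting from 0x00AC4E74, `bfi32(w0, w0, 9, 6)` gives 0x00AC6874, `ubfx32` of that value with an lsb of 18 and a width of 7 gives 0x2B, and clearing four bits from bit 3 leaves 0x03, matching the figure.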
Note
There are also BFM, UBFM, and SBFM instructions. These are Bit Field Move instructions, which are
new for ARMv8. However, the instructions do not need to be used explicitly, as aliases are
provided for all cases. These aliases are the bitfield operations already described: [SU]XT[BHWX],
ASR/LSL/LSR immediate, BFI, BFXIL, SBFIZ, SBFX, UBFIZ, and UBFX.
If you are familiar with the ARMv7 architecture, you might recognize another bit manipulation
instruction:
• CLZ Count leading zero bits in a register.
Similarly, the same bit and byte reversal instructions are available:
• RBIT Reverse all bits.
• REV Reverse the byte order of a register.
• REV16 Reverse the byte order of each halfword in a register.
  Figure 6-3 REV16 instruction (Xn to Xd)
• REV32 Reverse the byte order of each word in a register.
  Figure 6-4 REV32 instruction (Xn to Xd)
These operations can be performed on either word (32-bit) or doubleword (64-bit) sized
registers, except for REV32, which applies only to 64-bit registers.
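The three byte-reversal granularities can be compared with a small C sketch operating on a 64-bit register value. The function names are hypothetical, and the mask-and-shift approach is only one of several ways to express the same result.

```c
#include <stdint.h>

/* REV16 model: swap the two bytes inside each 16-bit halfword. */
uint64_t rev16_64(uint64_t x)
{
    return ((x & UINT64_C(0x00FF00FF00FF00FF)) << 8) |
           ((x >> 8) & UINT64_C(0x00FF00FF00FF00FF));
}

/* REV32 model: reverse the four bytes inside each 32-bit word. */
uint64_t rev32_64(uint64_t x)
{
    x = rev16_64(x);                        /* bytes within halfwords */
    return ((x & UINT64_C(0x0000FFFF0000FFFF)) << 16) |
           ((x >> 16) & UINT64_C(0x0000FFFF0000FFFF));  /* halfwords within words */
}

/* REV model: reverse all eight bytes of the register. */
uint64_t rev64(uint64_t x)
{
    x = rev32_64(x);                        /* bytes within words */
    return (x << 32) | (x >> 32);           /* then swap the two words */
}
```

With the input 0x0123456789ABCDEF, the three functions return 0x23016745AB89EFCD, 0x67452301EFCDAB89 and 0xEFCDAB8967452301 respectively.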
6.2.5 Conditional instructions
The A64 instruction set does not support conditional execution for every instruction. Predicated
execution of instructions does not offer sufficient benefit to justify its significant use of opcode
space.
Processor state on page 4-6 describes the four status flags, Zero (Z), Negative (N), Carry (C)
and Overflow (V). Table 6-4 indicates the value of these bits for flag setting operations.
Table 6-4 Condition flag
Flag  Name      Description
N     Negative  Set to the same value as bit[31] of the result. For a 32-bit signed integer,
                bit[31] being set indicates that the value is negative.
Z     Zero      Set to 1 if the result is zero, otherwise it is set to 0.
C     Carry     Set to the carry-out value from result, or to the value of the last bit shifted
                out from a shift operation.
V     Overflow  Set to 1 if signed overflow or underflow occurred, otherwise it is set to 0.
The C flag is set if the result of an unsigned operation overflows the result register.
The V flag operates in the same way as the C flag, but for signed operations.
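To make the distinction between C and V concrete, the following C sketch computes the four flags for a 32-bit flag-setting addition. The `struct nzcv` type and the `adds32_flags` function are hypothetical names introduced only for this illustration.

```c
#include <stdint.h>
#include <stdbool.h>

struct nzcv { bool n, z, c, v; };

/* Flags that a 32-bit flag-setting add of a and b would produce. */
struct nzcv adds32_flags(uint32_t a, uint32_t b)
{
    uint32_t r = a + b;                 /* wraps modulo 2^32 */
    struct nzcv f;

    f.n = (r >> 31) & 1;                /* copy of bit[31] of the result */
    f.z = (r == 0);
    f.c = (r < a);                      /* unsigned overflow: carry out of bit[31] */
    /* Signed overflow: both operands have the same sign, and the result
       has the opposite sign. */
    f.v = ((~(a ^ b) & (a ^ r)) >> 31) & 1;
    return f;
}
```

Adding 0xFFFFFFFF and 1 sets C and Z but not V, because as signed values the sum -1 + 1 = 0 is exact. Adding 0x7FFFFFFF and 1 sets V and N but not C, because the unsigned sum still fits in 32 bits.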
Note
The condition flags (NZCV) and the condition codes are the same as in A32 and T32. However,
A64 adds NV (0b1111), though it behaves the same as its complement, AL (0b1110). This differs
from A32, which did not assign any meaning to 0b1111.
Table 6-5 Condition codes
Code  Encoding  Meaning (when set by CMP)                             Meaning (when set by FCMP)                               Condition flags
EQ    0b0000    Equal to.                                             Equal to.                                                Z = 1
NE    0b0001    Not equal to.                                         Unordered, or not equal to.                              Z = 0
CS    0b0010    Carry set (identical to HS).                          Greater than, equal to, or unordered (identical to HS).  C = 1
HS    0b0010    Greater than, equal to (unsigned) (identical to CS).  Greater than, equal to, or unordered (identical to CS).  C = 1
CC    0b0011    Carry clear (identical to LO).                        Less than (identical to LO).                             C = 0
LO    0b0011    Unsigned less than (identical to CC).                 Less than (identical to CC).                             C = 0
MI    0b0100    Minus, Negative.                                      Less than.                                               N = 1
PL    0b0101    Positive or zero.                                     Greater than, equal to, or unordered.                    N = 0
VS    0b0110    Signed overflow.                                      Unordered. (At least one argument was NaN).              V = 1
VC    0b0111    No signed overflow.                                   Not unordered. (No argument was NaN).                    V = 0
HI    0b1000    Greater than (unsigned).                              Greater than or unordered.                               (C = 1) && (Z = 0)
LS    0b1001    Less than or equal to (unsigned).                     Less than or equal to.                                   (C = 0) || (Z = 1)
GE    0b1010    Greater than or equal to (signed).                    Greater than or equal to.                                N == V
LT    0b1011    Less than (signed).                                   Less than or unordered.                                  N != V
GT    0b1100    Greater than (signed).                                Greater than.                                            (Z == 0) && (N == V)
LE    0b1101    Less than or equal to (signed).                       Less than, equal to or unordered.                        (Z == 1) || (N != V)
AL    0b1110    Always executed.                                      Default. Always executed.                                Any
NV    0b1111    Always executed.                                      Always executed.                                         Any
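The encodings in Table 6-5 are arranged in pairs, where the low bit of the encoding inverts the test selected by the upper three bits. A C sketch of the evaluation, using the hypothetical names `struct pstate_flags` and `cond_holds`, could look like this:

```c
#include <stdbool.h>

struct pstate_flags { bool n, z, c, v; };

/* Evaluate a 4-bit condition encoding from Table 6-5 against the flags. */
bool cond_holds(unsigned cond, struct pstate_flags f)
{
    bool r;

    cond &= 0xF;
    switch (cond >> 1) {
    case 0: r = f.z; break;                     /* EQ / NE */
    case 1: r = f.c; break;                     /* CS (HS) / CC (LO) */
    case 2: r = f.n; break;                     /* MI / PL */
    case 3: r = f.v; break;                     /* VS / VC */
    case 4: r = f.c && !f.z; break;             /* HI / LS */
    case 5: r = (f.n == f.v); break;            /* GE / LT */
    case 6: r = (f.n == f.v) && !f.z; break;    /* GT / LE */
    default: return true;                       /* AL and NV both always pass */
    }
    /* The low bit selects the inverted condition of each pair. */
    return (cond & 1) ? !r : r;
}
```

For instance, after comparing 1 with 2 the flags are N = 1, Z = 0, C = 0, V = 0, so both LT (0b1011) and LO (0b0011) hold, while GE and HI do not.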
There is a small set of conditional data processing instructions. These instructions are
unconditionally executed but use the condition flags as an extra input to the instruction. This set
has been provided to replace common usage of conditional execution in ARM code.
The instruction types that read the condition flags are:
Add/subtract with carry
The traditional ARM instructions, for example, for multi-precision arithmetic and
checksums.
Conditional select with optional increment, negate, or invert
Conditionally select between one source register and a second incremented,
negated, inverted, or unmodified source register.
These are the most common uses of single conditional instructions in A32 and
T32. Typical uses include conditional counting or calculating the absolute value
of a signed quantity.
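As an illustration of the multi-precision use of add with carry, the sketch below adds two 128-bit values held as pairs of 64-bit words. The `struct u128` type and the `add128` function are hypothetical. The comments indicate where a flag-setting add (ADDS) followed by an add with carry (ADC) would correspond.

```c
#include <stdint.h>

struct u128 { uint64_t lo, hi; };

/* 128-bit addition built from two 64-bit additions. */
struct u128 add128(struct u128 a, struct u128 b)
{
    struct u128 r;
    uint64_t carry;

    r.lo = a.lo + b.lo;             /* ADDS: low halves, sets the C flag */
    carry = (r.lo < a.lo);          /* C flag: unsigned overflow of the low add */
    r.hi = a.hi + b.hi + carry;     /* ADC: high halves plus the carry in */
    return r;
}
```

The carry is consumed as a data input to the second addition, so no branch on the flag is needed.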
Conditional operations
The A64 instruction set enables conditional execution of only program flow control branch
instructions. This is in contrast to A32 and T32 where most instructions can be predicated with
a condition code. The conditional data processing operations that A64 provides instead can be
summarized as follows:
Conditional select (move)
• CSEL Select between two registers based on a condition. Unconditional
  instructions, followed by a conditional select, can replace short conditional
  sequences.
• CSINC Select between two registers based on a condition. Return the first
  source register or the second source register incremented by one.
• CSINV Select between two registers based on a condition. Return the first
  source register or the inverted second source register.
• CSNEG Select between two registers based on a condition. Return the first
  source register or the negated second source register.
Conditional set
Conditionally select between 0 and 1 (CSET) or 0 and -1 (CSETM). Used, for
example, to capture the condition flags as a Boolean value or mask in a general
register.
Conditional compare
(CCMP and CCMN) Sets the condition flags to the result of a comparison if the original
condition is true. If not true, the condition flags are set to a specified condition
flag state. The conditional compare instruction is very useful for expressing
nested or compound comparisons.
Note
Conditional select and conditional compare are also available for floating-point registers using
the FCSEL and FCCMP instructions.
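The conditional select and conditional set operations can be modelled in C as plain selection functions. In this sketch the helper names are hypothetical, and `cond` stands for the already-evaluated condition code. `abs_via_csneg` shows the absolute value use mentioned earlier, returning the magnitude as an unsigned value.

```c
#include <stdint.h>
#include <stdbool.h>

/* Return n when the condition holds, otherwise a function of m. */
uint64_t csel (bool cond, uint64_t n, uint64_t m) { return cond ? n : m; }
uint64_t csinc(bool cond, uint64_t n, uint64_t m) { return cond ? n : m + 1; }
uint64_t csinv(bool cond, uint64_t n, uint64_t m) { return cond ? n : ~m; }
uint64_t csneg(bool cond, uint64_t n, uint64_t m) { return cond ? n : 0 - m; }

/* Conditional set: 1 or 0 (CSET), all ones or 0 (CSETM). */
uint64_t cset (bool cond) { return cond ? 1 : 0; }
uint64_t csetm(bool cond) { return cond ? UINT64_MAX : 0; }

/* Absolute value: keep x when it is non-negative, otherwise negate it. */
uint64_t abs_via_csneg(int64_t x)
{
    uint64_t u = (uint64_t)x;
    return csneg(x >= 0, u, u);
}
```

Each function always computes a result; the condition only chooses which value is returned, which is what lets these instructions stand in for short branches.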
For example:
CSINC X0, X1, X0, NE    // Set the return register X0 to X1 if Zero flag clear,
                        // else increment X0
Some aliases to the example instructions are provided, where either the zero register is used, or
the same register is used for both source operands of the instruction.
For example:
CINC X0, X0, LS         // If less than or same (LS) then X0 = X0 + 1
CSET W0, EQ             // If the previous comparison was equal (Z=1) then W0 = 1,
                        // else W0 = 0
CSETM X0, NE            // If not equal then X0 = -1, else X0 = 0
This class of instructions provides a powerful way to avoid the use of branches or conditionally
executed instructions. Compilers, or assembly programmers, might adopt a technique of
performing the operations for both branches of an if-then-else statement. Then the correct result
is selected at the end.
For example, consider the simple C code:
if (i == 0)
    r = r + 2;
else
    r = r - 1;
This might produce code similar to:
CMP w0, #0              // if (i == 0)
SUB w2, w1, #1          // r = r - 1
ADD w1, w1, #2          // r = r + 2
CSEL w1, w1, w2, EQ     // select between the two results
6.3 Memory access instructions
As with all prior ARM processors, the ARMv8 architecture is a Load/Store architecture. This
means that no data processing instruction operates directly on data in memory. The data must
first be loaded into registers, modified, and then stored to memory. The program must specify
an address, the size of data to be transferred, and a source or destination register. There are
additional Load and Store instructions which provide further options, such as non-temporal
Load/Store, Load/Store exclusives, and Acquire/Release.
Memory instructions can access Normal memory in an unaligned fashion (see Chapter 13
Memory Ordering). This is not supported by exclusive accesses or by the load-acquire and
store-release variants. If unaligned accesses are not wanted, they can be configured to fault.
6.3.1 Load instruction format
The general form of a Load instruction is as follows:
LDR Rt,
For loads into integer registers, you can choose a size to load. For example, to load a value
smaller than the specified register, use one of the following variants of the LDR instruction:
• LDRB (8-bit, zero extended).
• LDRSB (8-bit, sign extended).
• LDRH (16-bit, zero extended).
• LDRSH (16-bit, sign extended).
• LDRSW (32-bit, sign extended).
There are also unscaled-offset forms such as LDUR (see Specifying the address for a Load
or Store instruction on page 6-14). Programmers will not normally need to use the LDUR form
explicitly, because most assemblers can select the appropriate version based on the offset used.
You do not need to specify a zero-extended load to an X register, because writing a W register
effectively zero extends to the entire register width.
In each case the byte 0x8A is loaded from memory into R4:
LDRSB into W4 (sign extend):  R4 = 00 00 00 00 FF FF FF 8A
LDRSB into X4 (sign extend):  R4 = FF FF FF FF FF FF FF 8A
LDRB into W4 (zero extend):   R4 = 00 00 00 00 00 00 00 8A
Figure 6-5 Load instructions
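The three results in Figure 6-5 can be reproduced with the following C sketch, which returns the full 64-bit register contents after each kind of byte load. The function names and the `sext8` helper are hypothetical.

```c
#include <stdint.h>

/* Sign extend an 8-bit value without relying on narrowing conversions. */
int32_t sext8(uint8_t b)
{
    return (b & 0x80) ? (int32_t)b - 256 : (int32_t)b;
}

/* Byte load with zero extension: the upper 56 bits of the register are zero. */
uint64_t ldrb_w(const uint8_t *addr)
{
    return *addr;
}

/* Signed byte load into a W register: sign extended to 32 bits, and the
   write to W clears the upper 32 bits of the X register. */
uint64_t ldrsb_w(const uint8_t *addr)
{
    return (uint32_t)sext8(*addr);
}

/* Signed byte load into an X register: sign extended to all 64 bits. */
uint64_t ldrsb_x(const uint8_t *addr)
{
    return (uint64_t)(int64_t)sext8(*addr);
}
```

The middle case shows why the destination register width matters for signed loads: the sign extension stops at bit 31 when the destination is a W register.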
6.3.2 Store instruction format
Similarly, the general form of a Store instruction is as follows:
STR Rn,
There are also unscaled-offset forms such as STUR (see Specifying the address for a Load
or Store instruction on page 6-14). Programmers will not normally need to use the STUR form
explicitly, as most assemblers can select the appropriate version based on the offset used.
The size to be stored might be smaller than the register. You specify this by adding a B or H
suffix to the STR. It is always the least significant part of the register that is stored in such a case.
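A short C sketch of the narrow stores is shown below. The names `store_byte` and `store_halfword` are hypothetical, and the halfword version assumes a little-endian data layout.

```c
#include <stdint.h>

/* Byte store: only bits [7:0] of the register reach memory. */
void store_byte(uint8_t *addr, uint32_t wt)
{
    *addr = (uint8_t)wt;
}

/* Halfword store: only bits [15:0] of the register reach memory
   (little-endian byte order assumed). */
void store_halfword(uint8_t *addr, uint32_t wt)
{
    addr[0] = (uint8_t)wt;
    addr[1] = (uint8_t)(wt >> 8);
}
```

Storing 0x12345678 with `store_byte` writes 0x78, and with `store_halfword` writes 0x78 followed by 0x56. The upper bits of the register are ignored.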
6.3.3 Floating-point and NEON scalar loads and stores
Load and Store instructions can also access floating-point/NEON registers. Here, the size is
determined only by the register being loaded or stored, which can be any of the B, H, S, D, or
Q registers. This information is summarized in Table 6-6 and Table 6-7.
For Load instructions:
Table 6-6 Memory bits read by Load instructions
Load    Xt    Wt    Qt    Dt    St    Ht    Bt
LDR     64    32    128   64    32    16    8
LDP     128   64    256   128   64    -     -
LDRB    -     8     -     -     -     -     -
LDRH    -     16    -     -     -     -     -
LDRSB   8     8     -     -     -     -     -
LDRSH   16    16    -     -     -     -     -
LDRSW   32    -     -     -     -     -     -
LDPSW   -     -     -     -     -     -     -
For Store instructions:
Table 6-7 Memory bits written by Store instructions
Store   Xt    Wt    Qt    Dt    St    Ht    Bt
STR     64    32    128   64    32    16    8
STP     128   64    256   128   64    -     -
STRB    -     8     -     -     -     -     -
STRH    -     16    -     -     -     -     -
No sign-extension options are available for loads into FP/SIMD registers. Addresses for such
loads are still specified using the general-purpose registers.
For example:
LDR D0, [X0, X1]
Loads register D0 with the doubleword at the memory address pointed to by X0 plus X1.
Note
Floating-point and scalar NEON Loads and Stores use the same addressing modes as integer
Loads and Stores.
6.3.4 Specifying the address for a Load or Store instruction
The addressing modes available to A64 are similar to those in A32 and T32. There are some
additional restrictions as well as some new features, but the addressing modes available to A64
will not be surprising to someone familiar with A32 or T32.
In A64, the base register of an address operand must always be an X register. However, several
instructions support zero-extension or sign-extension so that a 32-bit offset can be provided as
a W register.
Offset modes
Offset addressing modes add an immediate value or an optionally-modified register value to a
64-bit base register to generate an address.
Table 6-8 Offset addressing modes
Example instruction            Description
LDR X0, [X1]                   Load from the address in X1
LDR X0, [X1, #8]               Load from address X1 + 8
LDR X0, [X1, X2]               Load from address X1 + X2
LDR X0, [X1, X2, LSL #3]       Load from address X1 + (X2 << 3)
LDR X0, [X1, W2, SXTW]         Load from address X1 + sign_extend(W2)
LDR X0, [X1, W2, SXTW #3]      Load from address X1 + (sign_extend(W2) << 3)
Typically, when specifying a shift or extension option, the shift amount can be either 0 (the
default) or log2 of the access size in bytes (so that the shift multiplies the offset register by
the access size). This supports common array-indexing operations.
// A C example showing accesses that a compiler is likely to generate.
void example_dup(int32_t a[], int32_t length) {
    int32_t first = a[0];                   // LDR W3, [X0]
    for (int32_t i = 1; i < length; i++) {
        a[i] = first;                       // STR W3, [X0, W2, SXTW #2]
    }
}
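The address arithmetic behind the offset modes can be written out as a C sketch. The function names are hypothetical. Each one returns the effective address that the corresponding form in Table 6-8 would access.

```c
#include <stdint.h>

/* [Xn, #imm]: base plus a signed immediate offset. */
uint64_t addr_imm(uint64_t base, int64_t imm)
{
    return base + (uint64_t)imm;
}

/* [Xn, Xm, LSL #shift]: base plus a 64-bit index scaled by 2^shift. */
uint64_t addr_reg_lsl(uint64_t base, uint64_t index, unsigned shift)
{
    return base + (index << shift);
}

/* [Xn, Wm, SXTW #shift]: the 32-bit index is sign extended to 64 bits
   first, and then scaled. */
uint64_t addr_reg_sxtw(uint64_t base, int32_t index, unsigned shift)
{
    return base + ((uint64_t)(int64_t)index << shift);
}
```

With a base of 0x1000, an `int32_t` index of 5 and a shift of 2, `addr_reg_sxtw` gives 0x1014, which is the address of element 5 of an `int32_t` array. A negative index such as -1 gives 0xFFC.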
Index modes
Index modes are similar to offset modes, but they also update the base register. The syntax is the
same as in A32 and T32, but the set of operations is more restrictive. Usually, only immediate
offsets can be provided for index modes.
There are two variants: pre-index modes which apply the offset before accessing the memory,
and post-index modes which apply the offset after accessing the memory.
Table 6-9 Index addressing modes
Example instruction        Description
LDR X0, [X1, #8]!          Pre-index: Update X1 first (to X1 + #8), then load from the new address
LDR X0, [X1], #8           Post-index: Load from the unmodified address in X1 first, then update X1 (to X1 + #8)
STP X0, X1, [SP, #-16]!    Push X0 and X1 to the stack.
LDP X0, X1, [SP], #16      Pop X0 and X1 off the stack.
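The difference in ordering between the two index modes can be sketched in C. Here the base register is modelled as a byte pointer that the function updates through `xn`. The names `load64`, `ldr_pre` and `ldr_post` are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Read a 64-bit value from a byte address. */
uint64_t load64(const uint8_t *p)
{
    uint64_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* Pre-index, as in LDR X0, [X1, #8]! */
uint64_t ldr_pre(const uint8_t **xn, ptrdiff_t imm)
{
    *xn += imm;                 /* update the base register first */
    return load64(*xn);         /* then load from the new address */
}

/* Post-index, as in LDR X0, [X1], #8 */
uint64_t ldr_post(const uint8_t **xn, ptrdiff_t imm)
{
    uint64_t v = load64(*xn);   /* load from the unmodified address */
    *xn += imm;                 /* then update the base register */
    return v;
}
```

In both cases the base register ends up advanced by the immediate. Only the address used for the access differs.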
These options map cleanly onto some common C operations:
// A C example showing accesses that a compiler is likely to generate.
void example_strcpy(char * dst, const char * src)
{
    char c;
    do {
        c = *(src++);       // LDRB W2, [X1], #1
        *(dst++) = c;       // STRB W2, [X0], #1
    } while (c != '\0');
}
PC-relative modes (load-literal)
A64 adds another addressing mode specifically for accessing literal pools. Literal pools are
blocks of data encoded in an instruction stream. The pools are not executed, but their data can
be accessed from surrounding code using PC-relative memory addresses. Literal pools are often
used to encode constant values that do not fit into a simple move-immediate instruction.
In A32 and T32, the PC can be read like a general-purpose register, so a literal pool can be
accessed simply by specifying PC as the base register.
In A64, PC is not generally accessible, but instead there is a special addressing mode (for load
instructions only) that accesses a PC-relative address. This special-purpose addressing mode
also has a much greater range than the PC-relative loads in A32 and T32 could achieve, so literal
pools can be positioned more sparsely.
Table 6-10
Example instruction
Description
LDR W0,