The Definitive Guide to ARM® Cortex®-M3 and Cortex-M4 Processors
Third Edition

Joseph Yiu
ARM Ltd., Cambridge, UK

AMSTERDAM • BOSTON • HEIDELBERG • LONDON
NEW YORK • OXFORD • PARIS • SAN DIEGO
SAN FRANCISCO • SINGAPORE • SYDNEY • TOKYO

Newnes is an imprint of Elsevier

The Boulevard, Langford Lane, Kidlington, Oxford OX5 1GB, UK
225 Wyman Street, Waltham, MA 02451, USA
Copyright © 2014 Elsevier Inc. All rights reserved.
No part of this publication may be reproduced, stored in a retrieval system or transmitted in
any form or by any means electronic, mechanical, photocopying, recording or otherwise
without the prior written permission of the publisher
Permissions may be sought directly from Elsevier’s Science & Technology Rights Department
in Oxford, UK: phone (+44) (0) 1865 843830; fax (+44) (0) 1865 853333; email:
permissions@elsevier.com. Alternatively you can submit your request online by visiting
the Elsevier web site at http://elsevier.com/locate/permissions, and selecting Obtaining
permission to use Elsevier material
Notice
No responsibility is assumed by the publisher for any injury and/or damage to persons or
property as a matter of products liability, negligence or otherwise, or from any use or
operation of any methods, products, instructions or ideas contained in the material herein.
Because of rapid advances in the medical sciences, in particular, independent verification
of diagnoses and drug dosages should be made
British Library Cataloguing in Publication Data
A catalogue record for this book is available from the British Library
Library of Congress Cataloging-in-Publication Data
A catalog record for this book is available from the Library of Congress
ISBN-13: 978-0-12-408082-9
For information on all Newnes publications
visit our website at www.newnespress.com
Printed and bound in the United States
14 15 16 17 18 10 9 8 7 6 5 4 3 2 1

Foreword
There is a revolution in the embedded market: most new microcontrollers are now based on
the ARM architecture, and specifically on the popular Cortex-M3 and Cortex-M4 processors.
Recently we also saw the launch of several new ARM processors. At the low end of the
spectrum, the Cortex-M0+ processor has been introduced for applications that were
previously dominated by 8-bit and 16-bit microcontrollers. The new 64-bit Cortex-A50
series processors address the high-end market such as servers. Along with the demand for
standardized systems and energy-efficient computing performance, the Internet of Things
(IoT) is one driver for this revolution. Analysts forecast that by the year 2020, 50 billion
devices will be connected to the IoT, and ARM processors will span the whole application
range from sensors to servers. Many devices will be based on Cortex-M3 and Cortex-M4
microcontrollers and may use just a small battery or even energy harvesting as their
power source.
Using ARM Cortex-M3 and Cortex-M4 processor-based devices today is straightforward,
since a wide range of development tools, debug utilities, and example projects are
available. However, writing efficient applications can require in-depth knowledge of the
hardware architecture and the software model. This book provides essential information
for system architects and software engineers: it gives insight into popular software
development tools along with extensive programming examples that are based on the Cortex
Microcontroller Software Interface Standard (CMSIS). It also covers the Digital Signal
Processing (DSP) features of the Cortex-M4 processor and the CMSIS-DSP library for
interfacing with the analog world. And with many embedded applications becoming more
complex, and more capable microcontrollers becoming widely available, the use of
real-time operating systems is becoming common practice. All these topics are covered
with easy-to-understand application examples.
I recommend this book to all types of users, from students starting a small Cortex-M
microcontroller project to system experts who need an in-depth understanding of
processor features.
Reinhard Keil
Director of MCU Tools, ARM


Preface
The last few years have seen the ARM® Cortex®-M3 processor continue to expand its
market coverage, and adoption of the Cortex-M4 processor gain momentum. At the same
time, the software development tools and various technologies surrounding the Cortex-M
processors have also evolved. For example, CMSIS-Core is now used in almost all Cortex-M
device driver libraries, and the CMSIS project has expanded into areas such as the DSP
library software.
In this edition, I have restructured my original book to enable beginners to quickly
understand the Cortex-M3 and Cortex-M4 processor architecture and, in the process, to
quickly develop software applications. I have also covered a number of advanced topics
that numerous users have asked me to cover and which were missing from the previous
editions, and were not covered in other books or in documentation created by ARM. In
this edition I have also added a great deal of new information on the Cortex-M4
processor, for example, the detailed use of the floating point unit and the DSP
instructions, and have extended the coverage of a number of topics. For example, this
edition covers more microcontroller software development suites than previous editions,
includes a chapter on Real-Time Operating Systems (RTOS) based on the CMSIS-RTOS API,
and provides additional information on a number of advanced topics.
Also included in this edition are two chapters on DSP written by Paul Beckmann, CEO of
DSP Concepts, a company that has developed the CMSIS-DSP library for ARM. I am extremely
pleased to have his contribution, since his in-depth knowledge of DSP applications and
the CMSIS-DSP library makes this book a worthwhile investment for any ARM embedded
software developer.
This book is for both embedded hardware system designers and software engineers. Because
its chapters range from “Getting Started” topics to detailed advanced information, it is
suitable for a wide range of readers including programmers, embedded product designers,
electronics enthusiasts, academic researchers, and even System-on-a-Chip (SoC) engineers.
A chapter on software porting is also included to help readers who are porting software
from other architectures or from ARM7TDMI™, a classic ARM processor, to Cortex-M
microcontrollers.
Hopefully you will find this book useful and well worth reading.
Joseph Yiu


Synopsis
This is the third edition of the Definitive Guide to the ARM® Cortex®-M3. The book name
has been changed to reflect the addition of the details for the ARM Cortex-M4 processor.
This third edition has been fully revised and updated, and now includes extensive
information on the ARM Cortex-M4 processor, providing a complete, up-to-date guide to
both the Cortex-M3 and Cortex-M4 processors that also enables migration from various
processor architectures to the exciting world of the Cortex-M3 and Cortex-M4.
The book presents the background of the ARM architecture, outlines the features of the
processors such as the instruction set and interrupt handling, and demonstrates how to
program and utilize various advanced features such as the floating point unit.
Chapters on getting started with the Keil™ MDK-ARM, IAR EWARM, gcc, and CooCox CoIDE
tools enable beginners to start developing program code. The book then covers several
important areas of software development such as input/output of information, using
embedded OSs (CMSIS-RTOS), and mixed-language projects with assembly and C.
Two chapters on DSP features and CMSIS-DSP libraries are contributed by Paul Beckmann,
PhD, the founder and CEO of DSP Concepts. DSP Concepts is the company that developed the
CMSIS-DSP library for ARM. These two chapters cover DSP fundamentals and how to write
DSP software for the Cortex-M4 processor, including examples of using the CMSIS-DSP
library, as well as useful information about the DSP capability of the Cortex-M4
processor.
Debugging techniques are also covered in various chapters of the book, as well as topics
on software porting from other architectures. This is the most comprehensive guide to
the ARM Cortex-M3 and Cortex-M4 processors, written by an ARM engineer who helped to
develop the core. It includes a full range of easy-to-understand examples, diagrams, and
quick reference appendices covering topics such as the instruction sets and CMSIS-Core
APIs.
ARM, CORTEX, CORESIGHT, CORELINK, THUMB, AMBA, AHB, APB,
Keil, ARM7TDMI, ARM7, ARM9, ARM1156T2(F)-S, Mali, DS-5, Embedded
Trace Macrocell, and PrimeCell are registered trademarks of ARM Limited in the
EU and/or elsewhere. All rights reserved. Other names may be trademarks of their
respective owners.


About this Book
The source code of the example projects in this book can be downloaded from the
Elsevier companion website: http://booksite.elsevier.com/9780124080829


Contributor Bio-Paul Beckmann
Paul Beckmann is the founder of DSP Concepts, an engineering services company
that specializes in DSP algorithm development and supporting tools. He has many
years of experience developing and implementing numerically intensive algorithms
for audio, communications, and video. Paul has taught industry courses on digital
signal processing, and holds a variety of patents in processing techniques. Prior to
founding DSP Concepts, Paul spent 9 years at Bose Corporation and was involved
in R&D and product development activities.


Acknowledgments
I would like to thank the following people for providing me with help, advice and
feedback for the 3rd edition of this book:
First of all, a big thank you to Paul Beckmann, PhD, for contributing two chapters on the DSP subject. The DSP capability is an important part of the Cortex-M4
processor and the CMSIS-DSP library is a significant stepping stone for allowing
microcontroller users to develop DSP applications. This book would not be complete without these two chapters.
Secondly, I would like to thank my colleagues at ARM for their support. I have
received much useful feedback from Joey Ye, Stephen Theobald, Graham Cunningham, Edmund Player, Drew Barbier, Chris Shore, Simon Craske, and Robert Boys.
Also many thanks for the support from the ARM Embedded marketing team:
Richard York, Andrew Frame, Neil Werdmuller, and Ian Johnson.
I would also like to thank Reinhard Keil, Robert Rostohar, and Martin Günther of
Keil for answering my many questions on CMSIS, Anders Lundgren of IAR Systems for reviewing the materials related to EWARM, and Magnus Unemyr for
reviewing materials related to Atollic TrueStudio®.
I also want to thank the following people for their help in assisting with the
writing of the first and second editions of this book: Dominic Pajak, Alan Tringham,
Nick Sampays, Dan Brook, David Brash, Haydn Povey, Gary Campbell, Kevin
McDermott, Richard Earnshaw, Shyam Sadasivan, Simon Axford, Takashi Ugajin,
Wayne Lyons, Samin Ishtiaq, Dev Banerjee, Simon Smith, Ian Bell, Jamie Brettle,
Carlos O’Donell, Brian Barrera, and Daniel Jacobowitz.
And of course, I must express my gratitude to all the readers of my previous
books who have provided me with their very useful feedback.
Also, many thanks to the staff at Elsevier for their professional work, which has
enabled this book to be published.
And finally, a special thank you to all of my friends for their support and understanding whilst I was writing this book.
Regards,
Joseph Yiu


Terms and Abbreviations
Abbreviation    Meaning

ADK             AMBA Design Kit
AHB             Advanced High-Performance Bus
AHB-AP          AHB Access Port
AMBA            Advanced Microcontroller Bus Architecture
APB             Advanced Peripheral Bus
API             Application Programming Interface
ARM ARM         ARM Architecture Reference Manual
ASIC            Application Specific Integrated Circuit
ATB             Advanced Trace Bus
BE8             Byte Invariant Big Endian Mode
CMSIS           Cortex Microcontroller Software Interface Standard
CPI             Cycles Per Instruction
CPU             Central Processing Unit
DAP             Debug Access Port
DSP             Digital Signal Processor/Digital Signal Processing
DWT             Data WatchPoint and Trace
EABI/ABI        Embedded Application Binary Interface
ETM             Embedded Trace Macrocell
FPB             Flash Patch and Breakpoint
FPGA            Field Programmable Gate Array
FPU             Floating Point Unit
FSR             Fault Status Register
ICE             In-Circuit Emulator
IDE             Integrated Development Environment
IRQ             Interrupt Request (normally refers to external interrupts)
ISA             Instruction Set Architecture
ISR             Interrupt Service Routine
ITM             Instrumentation Trace Macrocell
JTAG            Joint Test Action Group (a standard of test/debug interfaces)
JTAG-DP         JTAG Debug Port
LR              Link Register
LSB             Least Significant Bit
LSU             Load/Store Unit
MAC             Multiply Accumulate
MCU             Microcontroller Unit
MMU             Memory Management Unit
MPU             Memory Protection Unit
MSB             Most Significant Bit
MSP             Main Stack Pointer
NaN             Not-a-Number (floating point representation)
NMI             Non-maskable Interrupt
NVIC            Nested Vectored Interrupt Controller
OS              Operating System
PC              Program Counter
PMU             Power Management Unit
PSP             Process Stack Pointer
PPB             Private Peripheral Bus
PSR             Program Status Register
RTOS            Real-Time Operating System
SCB             System Control Block
SCS             System Control Space
SIMD            Single Instruction, Multiple Data
SP, MSP, PSP    Stack Pointer, Main Stack Pointer, Process Stack Pointer
SoC             System-on-a-Chip
SP              Stack Pointer
SRPG            State Retention Power Gating
SW              Serial-Wire
SW-DP           Serial-Wire Debug Port
SWJ-DP          Serial-Wire JTAG Debug Port
SWV             Serial-Wire Viewer (an operation mode of TPIU)
TCM             Tightly Coupled Memory (Cortex-M1 feature)
TPA             Trace Port Analyzer
TPIU            Trace Port Interface Unit
TRM             Technical Reference Manual
UAL             Unified Assembly Language
WIC             Wakeup Interrupt Controller

Conventions
Various typographical conventions have been used in this book, as follows:
• Normal assembly program code:
  MOV R0, R1 ; Move data from Register R1 to Register R0
• Assembly code in generalized syntax; items inside “< >” must be replaced by
  real register names:
  MRS <reg>, <special_reg>
• C program code:
  for (i=0;i<3;i++) { func1(); }
• Values:
  1. 4’hC and 0x123 are both hexadecimal values
  2. #3 indicates item number 3 (e.g., IRQ #3 means IRQ number 3)
  3. #immed_12 refers to 12-bit immediate data
• Register bits:
  Typically used to illustrate a part of a value based on bit position. For example,
  bit[15:12] means bit number 15 down to 12.
• Register access types:
  1. R is Read only
  2. W is Write only
  3. R/W is Read or Write accessible
  4. R/Wc is Readable and clear by a Write access
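As a small illustration of the bit[msb:lsb] notation, the following C helper extracts such a field from a 32-bit value. This is only a sketch: the function name bits() is hypothetical and is not taken from this book or from CMSIS.

```c
#include <assert.h>
#include <stdint.h>

/* Return the field value[msb:lsb], shifted down so that its lowest bit
 * lands in bit 0. bits() is a hypothetical helper name. */
static inline uint32_t bits(uint32_t value, unsigned msb, unsigned lsb)
{
    assert(msb < 32u && lsb <= msb);

    uint32_t width = msb - lsb + 1u;
    /* A full 32-bit field is handled separately because shifting a
     * 32-bit value by 32 is undefined behavior in C. */
    uint32_t mask = (width == 32u) ? 0xFFFFFFFFu : ((1u << width) - 1u);

    return (value >> lsb) & mask;
}
```

For example, bits(0x0000C123, 15, 12) evaluates to 0xC, which is the value written as 4’hC in the notation above.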


CHAPTER 1
Introduction to ARM® Cortex®-M Processors

CHAPTER OUTLINE
1.1 What are the ARM® Cortex®-M processors?
1.1.1 The Cortex®-M3 and Cortex-M4 processors
1.1.2 The Cortex®-M processor family
1.1.3 Differences between a processor and a microcontroller
1.1.4 ARM® and the microcontroller vendors
1.1.5 Selecting Cortex®-M3 and Cortex-M4 microcontrollers
1.2 Advantages of the Cortex®-M processors
1.2.1 Low power
1.2.2 Performance
1.2.3 Energy efficiency
1.2.4 Code density
1.2.5 Interrupts
1.2.6 Ease of use, C friendly
1.2.7 Scalability
1.2.8 Debug features
1.2.9 OS support
1.2.10 Versatile system features
1.2.11 Software portability and reusability
1.2.12 Choices (devices, tools, OS, etc.)
1.3 Applications of the ARM® Cortex®-M processors
1.4 Resources for using ARM® processors and ARM microcontrollers
1.4.1 What can you find on the ARM® website
1.4.2 Documentation from the microcontroller vendors
1.4.3 Documentation from tool vendors
1.4.4 Other resources
1.5 Background and history
1.5.1 A brief history of ARM®
1.5.2 ARM® processor evolution
1.5.3 Architecture versions and Thumb® ISA
1.5.4 Processor naming
1.5.5 About the ARM® ecosystem

The Definitive Guide to ARM® Cortex®-M3 and Cortex-M4 Processors. http://dx.doi.org/10.1016/B978-0-12-408082-9.00001-4
Copyright © 2014 Elsevier Inc. All rights reserved.


1.1 What are the ARM® Cortex®-M processors?
1.1.1 The Cortex®-M3 and Cortex-M4 processors

The Cortex®-M3 and Cortex-M4 are processors designed by ARM®. The Cortex-M3 processor
was the first of the Cortex generation of processors, released by
ARM in 2005 (silicon products released in 2006). The Cortex-M4 processor was
released in 2010 (released products also in 2010).
The Cortex-M3 and Cortex-M4 processors use a 32-bit architecture. Internal
registers in the register bank, the data path, and the bus interfaces are all 32 bits
wide. The Instruction Set Architecture (ISA) in the Cortex-M processors is called
the Thumb® ISA and is based on Thumb-2 Technology which supports a mixture
of 16-bit and 32-bit instructions.
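To make the mixture of 16-bit and 32-bit instructions a little more concrete, the sketch below classifies the first halfword of a Thumb instruction. It relies on an encoding rule from the ARMv7-M architecture that is not stated in this section: a halfword whose bits[15:11] are 0b11101, 0b11110, or 0b11111 is the first half of a 32-bit instruction, and any other value is a complete 16-bit instruction. The function name thumb_is_32bit() is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Return true when hw1 is the first halfword of a 32-bit Thumb
 * instruction, false when hw1 is a complete 16-bit instruction.
 * thumb_is_32bit() is a hypothetical helper name. */
static inline bool thumb_is_32bit(uint16_t hw1)
{
    unsigned top5 = (unsigned)hw1 >> 11;   /* bits[15:11], range 0..31 */

    /* 0x1D (0b11101), 0x1E (0b11110) and 0x1F (0b11111) mark a
     * 32-bit instruction. */
    return top5 >= 0x1Du;
}
```

With this rule, 0x4608 (the 16-bit encoding of MOV R0, R1) is classified as a 16-bit instruction, while 0xF3EF (the first halfword of an MRS instruction) is classified as the start of a 32-bit instruction.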
The Cortex-M3 and Cortex-M4 processors have:
• Three-stage pipeline design
• Harvard bus architecture with unified memory space: instructions and data use
  the same address space
• 32-bit addressing, supporting 4GB of memory space
• On-chip bus interfaces based on ARM AMBA® (Advanced Microcontroller Bus
  Architecture) Technology, which allow pipelined bus operations for higher throughput
• An interrupt controller called NVIC (Nested Vectored Interrupt Controller)
  supporting up to 240 interrupt requests and from 8 to 256 interrupt priority levels
  (dependent on the actual device implementation)
• Support for various features for OS (Operating System) implementation such as a
  system tick timer, shadowed stack pointer
• Sleep mode support and various low power features
• Support for an optional MPU (Memory Protection Unit) to provide memory
  protection features like programmable memory, or access permission control
• Support for bit-data accesses in two specific memory regions using a feature
  called Bit Band
• The option of being used in single processor or multi-processor designs

The ISA used in Cortex-M3 and Cortex-M4 processors provides a wide range of
instructions:
• General data processing, including hardware divide instructions
• Memory access instructions supporting 8-bit, 16-bit, 32-bit, and 64-bit data, as
  well as instructions for transferring multiple 32-bit data
• Instructions for bit field processing
• Multiply Accumulate (MAC) and saturate instructions
• Instructions for branches, conditional branches and function calls
• Instructions for system control, OS support, etc.

In addition, the Cortex-M4 processor also supports:
• Single Instruction Multiple Data (SIMD) operations
• Additional fast MAC and multiply instructions
• Saturating arithmetic instructions
• Optional floating point instructions (single precision)¹
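The Bit Band feature listed above can be illustrated with a little address arithmetic. In the sketch below, each bit in a bit band region is mapped to its own 32-bit word in a separate alias region. The base addresses are the ones defined by the ARMv7-M memory map rather than anything given in this section, and bit band support is an optional feature of the processor, so please check the documentation of the actual device before relying on it. The function name bitband_alias() is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Addresses assumed from the ARMv7-M memory map (not from this section). */
#define SRAM_BITBAND_BASE    0x20000000u  /* first 1 MB of the SRAM region       */
#define SRAM_ALIAS_BASE      0x22000000u  /* 32 MB alias of that 1 MB            */
#define PERIPH_BITBAND_BASE  0x40000000u  /* first 1 MB of the peripheral region */
#define PERIPH_ALIAS_BASE    0x42000000u  /* 32 MB alias of that 1 MB            */
#define BITBAND_REGION_SIZE  0x00100000u  /* 1 MB */

/* Map (byte address, bit number 0..7) to the word address in the alias
 * region. Each bit owns one 32-bit alias word, so one byte of the bit
 * band region spans 8 words = 32 bytes of alias space.
 * bitband_alias() is a hypothetical helper name. */
static inline uint32_t bitband_alias(uint32_t byte_addr, unsigned bit)
{
    uint32_t region_base;
    uint32_t alias_base;

    assert(bit < 8u);

    /* Unsigned subtraction wraps for addresses below the base, so one
     * comparison checks both ends of the 1 MB region. */
    if (byte_addr - SRAM_BITBAND_BASE < BITBAND_REGION_SIZE) {
        region_base = SRAM_BITBAND_BASE;
        alias_base  = SRAM_ALIAS_BASE;
    } else {
        assert(byte_addr - PERIPH_BITBAND_BASE < BITBAND_REGION_SIZE);
        region_base = PERIPH_BITBAND_BASE;
        alias_base  = PERIPH_ALIAS_BASE;
    }

    return alias_base + (byte_addr - region_base) * 32u + bit * 4u;
}
```

On a device with bit band support, writing 1 or 0 to the returned address (through a volatile uint32_t pointer) sets or clears just that one bit. The function itself only computes an address, so it can be tried out on a host PC.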

Both the Cortex-M3 and Cortex-M4 processors are widely used in modern
microcontroller products, as well as other specialized silicon designs such as
Systems-on-Chip (SoC) and Application Specific Standard Products (ASSP).
In general, the ARM Cortex-M processors are regarded as RISC (Reduced
Instruction Set Computing) processors. Some might argue that certain characteristics of the Cortex-M3 and Cortex-M4 processors, such as the rich instruction
set and mixed instruction sizes, are closer to CISC (Complex Instruction Set
Computing) processors. But as processor technologies advance, the instruction
sets of most RISC processors are also getting more complex, so much so that this
traditional boundary between RISC and CISC processor definition can no longer
be applied.
There are a lot of similarities between the Cortex-M3 and Cortex-M4 processors.
Most of the instructions are available on both processors, and the processors have the
same programmer’s model for NVIC, MPU, etc. However, there are some differences in their internal designs, which allow the Cortex-M4 processor to deliver
higher performance in DSP applications, and to support floating point operations.
As a result, some of the instructions available on both processors can be executed
in fewer clock cycles on the Cortex-M4.

1.1.2 The Cortex®-M processor family
The Cortex®-M3 and Cortex-M4 processors are two of the products in the ARM®
Cortex-M processor family. The whole Cortex-M processor family is shown in
Figure 1.1.
The Cortex-M3 and Cortex-M4 processors are based on ARMv7-M architecture. Both are high-performance processors that are designed for microcontrollers.
Because the Cortex-M4 processor has SIMD, fast MAC, and saturate arithmetic instructions, it can also carry out some of the digital signal processing applications
that traditionally have been carried out by a separate Digital Signal Processor
(DSP).
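To show what multiply-accumulate and saturating arithmetic mean in practice, here is a minimal sketch in portable C of a dot product over 16-bit samples with a saturating 32-bit accumulator. It does not use the Cortex-M4 instructions themselves, and it is not how the CMSIS-DSP library is implemented; it only models the behavior, where a result that would overflow is clamped to the largest or smallest representable value instead of wrapping around. The function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Clamp a 64-bit intermediate result to the signed 32-bit range
 * instead of letting it wrap around. Hypothetical helper name. */
static inline int32_t saturate_s32(int64_t x)
{
    if (x > INT32_MAX) return INT32_MAX;
    if (x < INT32_MIN) return INT32_MIN;
    return (int32_t)x;
}

/* Dot product of two buffers of 16-bit samples using a saturating
 * 32-bit accumulator. Hypothetical helper name. */
static inline int32_t dot_s16_sat(const int16_t *a, const int16_t *b, size_t n)
{
    int32_t acc = 0;

    for (size_t i = 0; i < n; i++) {
        int32_t product = (int32_t)a[i] * (int32_t)b[i];  /* multiply: always fits in 32 bits */
        acc = saturate_s32((int64_t)acc + product);       /* saturating accumulate */
    }
    return acc;
}
```

With wrap-around arithmetic, an overflowing sum of large positive samples could flip sign and produce a loud glitch in an audio signal; with saturation the output simply clips, which is usually the less harmful failure.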
The Cortex-M0, Cortex-M0+, and Cortex-M1 processors are based on
ARMv6-M, which has a smaller instruction set. Both the Cortex-M0 and Cortex-M0+ are very
small in terms of gate count, with just about 12K gates² in the minimum configuration,
and are ideal for low-cost microcontroller products. The
Cortex-M0+ processor has the most state-of-the-art low power optimizations, and
has more available optional features.
The Cortex-M1 processor is designed specifically for FPGA applications. It has
Tightly Coupled Memory (TCM) features that can be implemented using memories
inside the FPGA, and the design allows high clock frequency operations in advanced
FPGAs. For example, it can run at over 200 MHz in an Altera Stratix III FPGA.

¹ In many technical documents, and in command line option switches for a number of C compilers, the
name Cortex-M4F is used for the Cortex-M4 processor with the optional floating point unit.
² The silicon area is equivalent to approximately 12000 2-input NAND gates.

FIGURE 1.1
The Cortex-M processor family

For general data processing and I/O control tasks, the Cortex-M0 and Cortex-M0+
processors have excellent energy efficiency due to the low gate count design.
But for applications with complex data processing requirements, they may take more
instructions and clock cycles. In this case, the Cortex-M3 or Cortex-M4 processor
would be more suitable, because the additional instructions available in these
processors allow the processing to be carried out with fewer instructions compared to
the ARMv6-M architecture. As a result, we need different processors for different
applications.
It is worthwhile to note that the Cortex-M processors are not the only ARM processors
to be used in generic microcontroller products. The venerable ARM7™ processor has been
very successful in this market, with companies like NXP (formerly
Philips Semiconductor), Texas Instruments, Atmel, OKI, and many other vendors
delivering ARM-based microcontrollers using classic ARM processors like
ARM7TDMI™. There is also a wide range of microcontrollers designed with
ARM9™ processors. The ARM7 processor is the most widely used 32-bit embedded
processor in history, with over 2 billion processors produced each year in a huge
variety of electronics products, from mobile phones to automotive systems.

1.1.3 Differences between a processor and a microcontroller
ARM® does not make microcontrollers. ARM designs processors and various components
that silicon designers need and licenses these designs to various silicon
design companies including microcontroller vendors. Typically we call these
designs “Intellectual Property” (IP) and the business model is called IP licensing.


FIGURE 1.2
A microcontroller contains many different blocks

In a typical microcontroller design, the processor takes only a small part of the silicon area. The other areas are taken up by memories, clock generation (e.g., PLL) and
distribution logic, system bus, and peripherals (hardware units like I/O interface units,
communication interface, timers, ADC, DAC, etc.) as shown in Figure 1.2.
Although many microcontroller vendors use ARM Cortex-M processors as their
choice of CPU, the memory system, memory map, peripherals, and operation
characteristics (e.g., clock speed and voltage) can be completely different from one
product to another. This allows microcontroller manufacturers to add additional
features in their products and differentiate their products from others on the market.
This book is focused on the Cortex-M3 and the Cortex-M4 processors. For details
of the complete microcontroller system design, such as peripheral details, memory
map, and I/O pin assignments, you still need to read the reference manuals provided
by the microcontroller vendor.

1.1.4 ARM® and the microcontroller vendors
Currently there are more than 15 silicon vendors³ using ARM® Cortex®-M3 or
Cortex-M4 processors in microcontroller products. There are also some other
companies that use Cortex-M3 or Cortex-M4 for SoC designs, and other companies that
use only Cortex-M0 or Cortex-M0+ processors.

³ Current Cortex-M3/M4 microcontroller vendors include: Analog Devices, Atmel, Cypress,
EnergyMicro, Freescale, Fujitsu, Holtek, Infineon, Microsemi, Milandr, NXP, Samsung, Silicon
Laboratories, ST Microelectronics, Texas Instruments, and Toshiba.

After a company licenses the Cortex-M processor design, ARM provides the
design source code of the processor in a language called Verilog-HDL (Hardware
Description Language). The design engineers in these companies then add their
own design blocks like peripherals and memories, and use various EDA tools to
convert the whole design from Verilog-HDL and various other forms into a transistor
level chip layout.
ARM also provides other Intellectual Property (IP) products, and some can be used
by these companies in their microcontroller products (see Figure 1.3). For example:
• Design of the cell libraries such as logic gates and memories (ARM Physical IP
  products)
• Peripherals and AMBA® infrastructure components (Cortex-M System Design
  Kit (CMSDK), ARM CoreLink™ IP products)
• Additional debug components for linking debug systems in multi-processor
  design (ARM CoreSight™ IP products)

For example, ARM provides a product called the Cortex-M System Design Kit
(CMSDK), a design kit for Cortex-M processor with AMBA infrastructure components, baseline peripherals, example systems, and example software. This allows
chip designers to start using the Cortex-M processors quickly and reduces the total
chip development effort with reusable IP.
But of course, there is still a lot of work for the microcontroller chip designers to
do. All of these microcontroller companies are working hard to develop better
peripherals and lower power memories, and to add their own secret recipes to try to
make their products better than others. In addition, they also need to develop
example software and support materials to make it easier for embedded product
designers to use their chips.
On the software side, ARM has various software development platforms such as
the Keil™ Microcontroller Development Kit (MDK-ARM) and ARM Development
Studio 5 (DS-5™). These software development suites contain compilers, debuggers,
and instruction set simulators. Designers can also use other third-party software
development tools if they prefer. Since all of the Cortex-M microcontrollers have
the same processor cores, embedded product designers can use the same development
suite for a massive range of microcontrollers from different vendors.

1.1.5 Selecting Cortex®-M3 and Cortex-M4 microcontrollers
On the market there are a wide range of Cortex®-M microcontroller products. These
range from low-cost, off-the-shelf microcontroller products to high-performance
multi-processor systems on a chip. There are many factors to be considered when
selecting a microcontroller device for a product. For example:
• Peripherals and interface features
• Memory size requirements of the application
• Low power requirements
• Performance and maximum frequency
• Chip package
• Operation conditions (voltage, temperature, electromagnetic interference)
• Cost and availability
• Software development tool support and development kits
• Future upgradability
• Firmware packages and firmware security
• Availability of application notes, design examples, and support

FIGURE 1.3
A microcontroller might contain multiple ARM IP products
(Diagram: a broad range of IP blocks is available for licensing from ARM, including
processor IP such as Cortex-A, Cortex-R, Cortex-M, SecurCore, ARM7, ARM9, ARM11, and the
Mali GPU, and other IP such as bus components, peripherals, memory controllers, and cell,
I/O, and memory libraries. The microcontroller design is completed by the microcontroller
vendor, with possibly a number of ARM IP components inside, and is manufactured by the
microcontroller vendor or a semiconductor foundry.)

There are no golden rules on how to select the best microcontroller. All of
the factors depend on your target applications as well as your project’s situation.
Some of the factors, like cost and product availability, might vary from time
to time.
When developing projects based on off-the-shelf microcontrollers, usually the
example projects and documentation from the microcontroller vendors are the
best starting points. In addition, microcontroller vendors might also provide:
• Application notes
• Development kits
• Reference designs

You might also find additional examples from tool vendors and from various
websites on the Internet.
When designing a Printed Circuit Board (PCB) for the Cortex-M microcontrollers or
any other ARM-based microcontroller, it is best to bring out the debug interface with
a standardized connector layout as documented in Appendix H. This makes
debugging much easier.

1.2 Advantages of the Cortex®-M processors

The ARM® Cortex®-M processors have many technical and non-technical advantages compared to other architectures.

1.2.1 Low power
Compared to other 32-bit processor designs, Cortex®-M processors are relatively
small. The Cortex-M processor designs are also optimized for low power consumption.
Currently, many Cortex-M microcontrollers have power consumption of less
than 200 µA/MHz, with some of them well under 100 µA/MHz. In addition, the
Cortex-M processors include support for sleep mode features and can be
used with various advanced ultra-low power design technologies. All of these allow
the Cortex-M processors to be used in various ultra-low power microcontroller
products.

1.2.2 Performance
The Cortex®-M3 and Cortex-M4 processors can deliver over 3 CoreMark/MHz and
1.25 DMIPS/MHz (based on the Dhrystone 2.1 benchmark). This allows Cortex-M3
and Cortex-M4 microcontrollers to handle many complex and demanding applications.
Alternatively, you can run the application with a much slower clock speed
to reduce power consumption.

1.2.3 Energy efficiency
Combining low power and high-performance characteristics, the Cortex®-M3 and
Cortex-M4 processors have excellent energy efficiency. This means that you can
still do a lot of processing with a limited supply of energy. Alternatively, you can get
tasks done more quickly and allow the system to stay in sleep mode for longer,
enabling longer battery life in portable products.
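To make the trade-off between running fast and sleeping longer concrete, the following sketch estimates the average supply current of a duty-cycled system. The function names and every figure mentioned in the comments (150 µA/MHz, 24 MHz, 2 µA sleep current, 1% duty cycle) are illustrative assumptions rather than data for any particular device, and the model ignores wake-up time and peripheral current.

```c
#include <assert.h>

/* Active current in microamps for a core rated at ua_per_mhz (uA/MHz)
 * and clocked at clock_mhz. Example (assumed values): 150 uA/MHz at 24 MHz. */
static double active_current_ua(double ua_per_mhz, double clock_mhz)
{
    return ua_per_mhz * clock_mhz;
}

/* Average current in microamps when the system is awake for a fraction
 * "duty" (0.0 to 1.0) of the time and asleep for the rest.
 * Simplified model: wake-up overhead and peripherals are not included. */
static double average_current_ua(double active_ua, double sleep_ua, double duty)
{
    assert(duty >= 0.0 && duty <= 1.0);
    return duty * active_ua + (1.0 - duty) * sleep_ua;
}
```

With the assumed figures, active_current_ua(150.0, 24.0) gives 3600 µA, and a 1% duty cycle with a 2 µA sleep current brings the average down to about 38 µA. This is why finishing the work quickly and returning to sleep can extend battery life.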

1.2.4 Code density
The Thumb® ISA provides excellent code density. This means that to achieve the
same tasks, you need a smaller program size. As a result you can reduce cost and
power consumption by using a microcontroller with smaller flash memory size,
and chip manufacturers can produce microcontroller chips with smaller packages.

1.2.5 Interrupts
The Cortex®-M3 and Cortex-M4 processors have a configurable interrupt controller
design, which can support up to 240 vectored interrupts and multiple levels of interrupt
priorities (from 8 to 256 levels). Nesting of interrupts is automatically handled
by hardware, and the interrupt latency is only 12 clock cycles for systems with zero
wait state memory. The interrupt processing capability makes the Cortex-M processors
suitable for many real-time control applications.⁴
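The range of 8 to 256 priority levels follows from the number of priority bits that a chip designer chooses to implement. As a minimal sketch, assuming the ARMv7-M arrangement in which between 3 and 8 bits are implemented in the most significant bits of each 8-bit priority field, the helper functions below show how the level count and the register value could be derived. The function names are hypothetical and are not part of CMSIS or any vendor library.

```c
#include <assert.h>
#include <stdint.h>

/* Number of programmable priority levels for a given number of
 * implemented priority bits (assumed range: 3 to 8). */
static unsigned priority_levels(unsigned prio_bits)
{
    assert(prio_bits >= 3u && prio_bits <= 8u);
    return 1u << prio_bits;
}

/* Place a logical priority level (0 = highest) into an 8-bit priority
 * field. The implemented bits occupy the top of the field, so the
 * level is shifted left by the number of unimplemented bits. */
static uint8_t encode_priority(unsigned level, unsigned prio_bits)
{
    assert(level < priority_levels(prio_bits));
    return (uint8_t)(level << (8u - prio_bits));
}
```

For example, with 3 implemented bits there are 8 levels, and encode_priority(5, 3) yields 0xA0. With all 8 bits implemented there are 256 levels and the value is stored unshifted.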

1.2.6 Ease of use, C friendly
The Cortex®-M processors are very easy to use. In fact, they are easier to use than
many 8-bit processors because Cortex-M processors have a simple,
linear memory map, and there are none of the special architectural restrictions that you
often find in 8-bit microcontrollers (e.g., memory banking, limited stack levels,
non-re-entrant code, etc.). You can program almost everything in C, including the
interrupt handlers.
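As an illustration of an interrupt handler written in plain C, the sketch below counts timer ticks. It assumes a CMSIS-style project in which the startup code places a function named SysTick_Handler in the vector table; the counter variable and the get_ticks function are hypothetical names chosen for this example, and the timer setup itself is device specific and not shown.

```c
#include <stdint.h>

/* Shared between the handler and thread code, hence volatile. */
static volatile uint32_t tick_count = 0;

/* An ordinary C function: no special keyword or assembly wrapper is
 * used here. The name is assumed to match the vector table entry
 * provided by the startup code. */
void SysTick_Handler(void)
{
    tick_count++;
}

/* Read the current tick count from the main program. */
uint32_t get_ticks(void)
{
    return tick_count;
}
```

Because the handler is a normal function, it can also be called directly from a host-side unit test, which is a convenient way to check its logic before running on hardware.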
⁴ There is always great debate as to whether we can have a “real-time” system using general
processors. By definition, “real-time” means that the system can get a response within a
guaranteed period. In any processor-based system, you may or may not be able to get this
response, depending on the choice of OS, the interrupt latency, the memory latency, and
whether the CPU is already running a higher priority interrupt.

1.2.7 Scalability
The Cortex®-M processor family allows easy scaling of designs from low-cost, simple microcontrollers costing less than a dollar to high-end microcontrollers running
at 200 MHz or more. You can also find Cortex-M microcontrollers with multiprocessor designs. Across all of these, because the processor architecture is consistent, you only need one tool chain and you can reuse your software easily.

1.2.8 Debug features
The Cortex®-M processors include many debug features that allow you to analyze
design problems easily. Besides standard debug features such as halting and single
stepping, which you can find in most microcontrollers, you can also generate a trace
to capture program flow, data changes, profiling information, and so on. In multiple
processor designs, the debug system of each Cortex-M processor can be linked
together to share debug connections.

1.2.9 OS support
The Cortex®-M processors are designed with OS applications in mind. A number of
features are available to make OS implementation easier and make OS operations
more efficient. Currently there are over 30 embedded OSs available for Cortex-M
processors.

1.2.10 Versatile system features
The Cortex®-M3 and Cortex-M4 processors support a number of system features
such as a bit addressable memory range (the bit band feature) and an MPU (Memory Protection Unit).
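The bit band feature can be pictured as an address calculation. The sketch below assumes the ARMv7-M memory map, in which the first 1 MB of SRAM starting at 0x20000000 is mirrored in an alias region starting at 0x22000000, with one 32-bit word per bit. These addresses come from the architecture rather than from this section, and the function name is hypothetical. The function only computes the alias address; it does not access memory, so it can be checked on any host.

```c
#include <assert.h>
#include <stdint.h>

#define SRAM_BITBAND_BASE   0x20000000u  /* start of bit band region (assumed) */
#define SRAM_ALIAS_BASE     0x22000000u  /* start of alias region (assumed)    */
#define BITBAND_REGION_SIZE 0x00100000u  /* 1 MB                               */

/* Return the alias word address that maps to one bit of a byte in the
 * SRAM bit band region: each byte occupies 32 alias bytes and each bit
 * occupies one 4-byte word. */
static uint32_t sram_bitband_alias(uint32_t byte_addr, unsigned bit)
{
    assert(byte_addr >= SRAM_BITBAND_BASE);
    assert(byte_addr - SRAM_BITBAND_BASE < BITBAND_REGION_SIZE);
    assert(bit < 8u);
    uint32_t byte_offset = byte_addr - SRAM_BITBAND_BASE;
    return SRAM_ALIAS_BASE + byte_offset * 32u + bit * 4u;
}
```

For example, bit 3 of the byte at 0x20000004 maps to the alias word at 0x2200008C. On the target, writing 1 or 0 to that word would set or clear the single bit without a read-modify-write sequence in software.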

1.2.11 Software portability and reusability
Since the architecture is very C friendly, you can program almost everything in standard
ANSI C. One of ARM’s initiatives, called CMSIS (Cortex® Microcontroller
Software Interface Standard), makes programming for Cortex-M processor-based
products even easier by providing standard header files and an API for standard
Cortex-M processor functions. This allows better software reusability and also
makes porting application code easier.

1.2.12 Choices (devices, tools, OS, etc.)
One of the best things about using Cortex®-M microcontrollers is the number of
available choices. Besides the thousands of microcontroller devices available, you
also have a wide range of choices of software development/debug tools, embedded
OS, middleware, etc.

1.3 Applications of the ARM® Cortex®-M processors

With their wide range of powerful features, the ARM® Cortex®-M3 and Cortex-M4
processors are ideal for a wide variety of applications:
Microcontrollers: The Cortex-M processor family is ideally suited for microcontroller products.
This includes low-cost microcontrollers with small memory sizes and high-performance
microcontrollers with high operation speeds. These microcontrollers
can be used in consumer products, from toys to electrical appliances, as well as in
specialized products for Information Technology (IT), industrial, or even medical
systems.
Automotive: Another application for the Cortex-M3 and Cortex-M4 processors
is in the automotive industry. As these processors offer great performance, very high
energy efficiency, and low interrupt latency, they are ideal for many real-time control
systems. In addition, the flexibility of the processor design (e.g., it supports up to 240
interrupt sources, optional MPU) makes it ideal for highly integrated ASSPs (Application Specific Standard Products) for the automotive industry. The MPU feature
also provides robust memory protection, which is required in some of these
applications.
Data communications: The processor’s low power and high efficiency, coupled
with instructions in Thumb®-2 for bit-field manipulation, make the Cortex-M3 and
Cortex-M4 processors ideal for many communication applications, such as Bluetooth and ZigBee.
Industrial control: In industrial control applications, simplicity, fast response,
and reliability are key factors. Again, the interrupt support features on Cortex-M3
and Cortex-M4 processors, including their deterministic behavior, automatic
nested interrupt handling, MPU, and enhanced fault-handling, make them strong
candidates in this area.
Consumer products: In many consumer products, a high-performance microprocessor (or several) is used. The Cortex-M3 and Cortex-M4 processors, being
small, are highly efficient and low in power, and at the same time provide the performance required for handling complex GUIs on LCD panels and various communication protocols.
Systems-on-Chips (SoC): In some high-end application processor designs,
Cortex-M processors are used in various subsystems such as audio processing
engines, power management systems, FSM (Finite State Machine) replacement,
I/O control task offloading, etc.
Mixed signal designs: In the IC design world, the digital and analog designs are
converging. While microcontrollers contain more and more analog components
(e.g., ADC, DAC), some analog ICs such as sensors, PMIC (Power Management
IC), and MEMS (Microelectromechanical Systems) now also include processors
to provide additional intelligence. The low power capability and small gate count
characteristics of the Cortex-M processors make it possible for them to be integrated
on mixed signal IC designs.
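The bit-field manipulation mentioned under data communications typically means pulling a field out of a packed protocol word or writing one back. The portable C sketch below shows both operations. The function names are hypothetical. Thumb-2 includes bit-field instructions such as UBFX and BFI, and a compiler targeting the Cortex-M3 or Cortex-M4 may map code like this onto them, but that depends on the compiler and optimization settings and is not guaranteed.

```c
#include <assert.h>
#include <stdint.h>

/* Build a mask of "width" ones (width is 1 to 32). */
static uint32_t field_mask(unsigned width)
{
    assert(width >= 1u && width <= 32u);
    return (width == 32u) ? 0xFFFFFFFFu : ((1u << width) - 1u);
}

/* Extract an unsigned field of "width" bits starting at bit "lsb". */
static uint32_t bitfield_extract(uint32_t value, unsigned lsb, unsigned width)
{
    assert(lsb + width <= 32u);
    return (value >> lsb) & field_mask(width);
}

/* Return "value" with the field at [lsb, lsb + width) replaced by
 * the low bits of "field"; all other bits are preserved. */
static uint32_t bitfield_insert(uint32_t value, uint32_t field,
                                unsigned lsb, unsigned width)
{
    assert(lsb + width <= 32u);
    uint32_t mask = field_mask(width);
    return (value & ~(mask << lsb)) | ((field & mask) << lsb);
}
```

For example, bitfield_extract(0xABCD1234, 8, 8) returns 0x12, and bitfield_insert(0xFFFFFFFF, 0, 4, 8) returns 0xFFFFF00F.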

There are already many Cortex-M3 and Cortex-M4 processor-based products on
the market,⁵ including low-end microcontrollers priced at less than US$0.50, making
the cost of ARM microcontrollers comparable to or lower than that of many 8-bit
microcontrollers.

1.4 Resources for using ARM® processors and ARM microcontrollers
1.4.1 What can you find on the ARM® website

Although ARM® does not make or sell Cortex®-M3 or Cortex-M4 microcontrollers,
there is quite a range of useful documentation on the ARM website. The documentation
section of the ARM website (called Infocenter, http://infocenter.arm.com/)
contains various specifications, application notes, knowledge articles, etc. Table 1.1
lists some of the reference documents containing details of the Cortex-M3 and
Cortex-M4 processors.
Table 1.2 lists some of the Application Notes that can be useful for microcontroller software developers.
On the Infocenter you can also find manuals for ARM software products, such as
the C compiler and linker, including Keil™ products.
For readers who are interested in the details of integrating Cortex-M processors
into System-on-Chip designs or FPGA, the information listed in Table 1.3 might be
useful.

1.4.2 Documentation from the microcontroller vendors
The documentation and resources from the microcontroller vendors are essential in
embedded software development. Typically you can find:
• Reference manual for the microcontroller chip. This provides the programmer’s
model of the peripherals, memory maps, and other information needed for software development.
• Data sheet of the microcontroller you use. This contains the information on
package, pin layout, operation conditions (e.g., temperature), voltage and current
characteristics, and other information you may need when designing the PCB.
• Application notes. These contain examples of using the peripherals or features
on the microcontrollers, or information on handling specific tasks (e.g., flash
programming).

You might also find additional resources on development kits, and additional
firmware libraries.

⁵ At the end of the third quarter of 2012, the cumulative shipment of Cortex-M3 and Cortex-M4
processors was 2.5 billion units.

Table 1.1 Reference ARM Documents on the Cortex-M3 and Cortex-M4 Processors
Document | Reference
ARMv7-M Architecture Reference Manual. This is the specification of the architecture on which Cortex-M3 and Cortex-M4 processors are based. It contains detailed information about the instruction set, architecture defined behaviors, etc. This document can be accessed via the ARM website after a simple registration process. | 1
Cortex-M3 Devices Generic User Guide. This is a user guide written for software developers using the Cortex-M3 processor. It provides information on the programmer’s model, details on using core peripherals such as NVIC, and general information about the instruction set. | 2
Cortex-M4 Devices Generic User Guide. This is a user guide written for software developers using the Cortex-M4 processor. It provides information on the programmer’s model, details on using core peripherals such as NVIC, and general information about the instruction set. | 3
Cortex-M3 Technical Reference Manual. This is a specification of the Cortex-M3 processor product. It contains implementation specific information such as instruction timing and some of the interface information (for silicon designers). | 4
Cortex-M4 Technical Reference Manual. This is a specification of the Cortex-M4 processor product. It contains implementation specific information such as instruction timing and some of the interface information (for silicon designers). | 5
Procedure Call Standard for the ARM Architecture. This document specifies how software code should work in procedure calls. This information is often needed for software projects with mixed assembly and C languages. | 13

Table 1.2 ARM Application Notes That Can be Useful for Microcontroller Software Developers
Document | Reference
AN179 – Cortex-M3 Embedded Software Development | 10
AN210 – Running FreeRTOS on the Keil MCBSTM32 Board with RVMDK Evaluation Tools | 17
AN234 – Migrating from PIC Microcontrollers to Cortex-M3 | 18
AN237 – Migrating from 8051 to Cortex Microcontrollers | 19
AN298 – Cortex-M4 Lazy Stacking and Context Switching | 11
AN321 – ARM Cortex-M Programming Guide to Memory Barrier Instructions | 8

Table 1.3 ARM Documents That Can be Useful for SoC/FPGA Designers
Document | Reference
AMBA 3 AHB-Lite Protocol Specification. This is the specification for the AHB (Advanced High-performance Bus) Lite protocol, an on-chip bus protocol used on the bus interfaces of the Cortex-M processors. AMBA (Advanced Microcontroller Bus Architecture) is a collection of on-chip bus protocols developed by ARM and is used by many IC design companies. | 14
AMBA 3 APB Protocol Specification. This is the specification for the APB (Advanced Peripheral Bus) protocol, an on-chip bus protocol used for connecting peripherals to the internal bus system, and to connect debug components to the Cortex-M processors. APB is part of the AMBA specification. | 15
CoreSight Technology System Design Guide. An introductory guide for silicon/FPGA designers who want to understand the basics of the CoreSight Debug Architecture. The debug system for the Cortex-M processors is based on the CoreSight Debug Architecture. | 17

Table 1.4 Keil Application Notes That Can be Useful for Microcontroller Software Developers
Document | Reference
Keil Application Note 202 – MDK-ARM Compiler Optimizations | 20
Keil Application Note 209 – Using Cortex-M3 and Cortex-M4 Fault Exceptions | 21
Keil Application Note 221 – Using CMSIS-DSP Algorithms with RTX | 22

1.4.3 Documentation from tool vendors
Very often the software development tool vendors also provide lots of useful information.
In addition to tool chain manuals (e.g., compiler, linker), you can also find application
notes. For example, on the Keil™ website (http://www.keil.com/appnotes/list/arm.h),
you can find various tutorials on using Keil MDK-ARM with Cortex-M
development kits, as well as some application notes that cover general
programming information. Table 1.4 lists several application notes on the Keil
website that are particularly useful for application development on the Cortex-M
processors.

1.4.4 Other resources
On the ARM® website there are lots of other useful documents. For example, on the
Infocenter you can find an ARM and Thumb®-2 Instruction Set quick reference card
(reference 25). Although this quick reference card is not specific to the instruction
set for the Cortex®-M processor, it can still be a handy reference for the majority
of the instructions.
There are plenty of software vendors that provide software products like RTOS for
Cortex-M processors. Often these companies also provide useful documentation on
their websites that show how to use their products as well as general design guidelines.
On social media websites like YouTube, you can also find various tutorials on
using Cortex-M based products, such as an introduction to microcontroller products
and software tools.
There are a number of online discussion forums available that are focused on
ARM technologies. For example, the ARM website has a forum (http://forums.arm.com),
and tool vendors and microcontroller vendors might also have their
own online forums. In addition, some social media websites also have ARM-focused
groups; for example, LinkedIn has an ARM Based Group.⁶
There are already a number of books available on Cortex-M processors. Besides
this book and “The Definitive Guide to the ARM Cortex-M0,” Hitex also has a free
online book on the STM32,⁷ and you can also find quite a number of other books
available in various online stores.
Don’t forget that the distributor which provides you with the microcontroller
chips can also be a useful source of information.

1.5 Background and history
1.5.1 A brief history of ARM®

Over the years, ARM® has designed many processors, and many features of the
Cortex®-M3 and Cortex-M4 processors are based on the successful technologies
which have evolved from some of the processors designed in the past. To help
you understand the variations of ARM processors and architecture versions, let’s
look at a little bit of ARM history.
ARM was formed in 1990 as Advanced RISC Machines Ltd., a joint venture
between Apple Computers, Acorn Computer Group, and VLSI Technology.
In 1991, ARM introduced the ARM6 processor family (used in Apple Newton,
see Figure 1.4), and VLSI became the initial licensee. Subsequently, additional
companies, including Texas Instruments, NEC, Sharp, and ST Microelectronics,
licensed the ARM processor designs, extending the applications of ARM processors into mobile phones, computer hard disks, personal digital assistants (PDAs),
home entertainment systems, and many other consumer products.
Nowadays, ARM partners ship in excess of 5 billion chips with ARM processors each year
(7.9 billion in 2011⁸). Unlike many semiconductor companies,
ARM does not manufacture processors or sell the chips directly. Instead, ARM
licenses the processor designs to business partners, including a majority of the
world’s leading semiconductor companies. Based on the ARM low-cost and
power-efficient processor designs, these partners create their processors, microcontrollers,
and system-on-chip solutions. This business model is commonly called
IP licensing.

FIGURE 1.4
The Apple Newton MessagePad H1000 PDA (based on ARM 610, released in 1993) placed
next to an Apple iPhone 4, which is based on the Apple A4 processor that contains an
ARM Cortex-A8 processor, released in 2010

⁶ http://www.linkedin.com/groups/ARM-Based-Group-85447
⁷ http://www.hitex.co.uk/index.php?id=download-insiders-guides00
⁸ Data from ARM Holdings – H1Q2 2012 presentation
In addition to processor designs, ARM also licenses systems-level IP such as
peripherals and memory controllers. To support the customers who use ARM
products, ARM has developed a strong base of development tools, hardware, and
software products to enable partners to develop their own products, and to enable
software developers to develop software for ARM platforms.

1.5.2 ARM® processor evolution
Before the Cortex®-M3 processor was released, there were already quite a number of
different ARM® processors available and some of them were already used in
microcontrollers. One of the most successful processor products from ARM is the
ARM7TDMI™ processor, which is used in many 32-bit microcontrollers around the
world. Unlike traditional 32-bit processors, the ARM7TDMI supports two instruction
sets, one called the ARM instruction set with 32-bit instructions, and another 16-bit
instruction set called Thumb®. By allowing both instruction sets to be used on the
processor, the code density is greatly increased, hence reducing the memory footprint of
the application code. At the same time, critical tasks can still execute with good speed.
This enables ARM processors to be used in many portable devices, which require low
power and small memory. As a result, ARM processors are the first choice for mobile
devices like mobile phones.
Since then, ARM has continued to develop new processors to address the needs
of different applications. For example, the ARM9™ processor family is used in a
large number of high-performance microcontrollers and the ARM11™ processor
family is used in a large number of mobile phones.
Following the introduction of the ARM11 family, it was decided that many
of the new technologies, such as the optimized Thumb-2 instruction set, were
just as applicable to the lower cost markets of microcontrollers and automotive
components. It was also decided that although the architecture needed to be
consistent from the lowest MCU to the highest performance application
processor, there was a need to deliver processor architectures that best fit
applications, enabling very deterministic and low gate count processors for
cost-sensitive markets, and feature-rich and high-performance ones for high-end applications.
Over the past few years, ARM has extended its product portfolio by diversifying
its CPU development, which resulted in the new processor family name “Cortex.”
In this Cortex processor range, the processors are divided into three profiles
(Figure 1.5):
• The A profile is designed for high-performance open application platforms.
• The R profile is designed for high-end embedded systems in which real-time
performance is needed.
• The M profile is designed for deeply embedded microcontroller-type systems.

Let’s look at these profiles in a bit more detail.
Cortex-A: Application processors that are designed to handle complex
applications such as high-end embedded operating systems (OSs) (e.g., iOS,
Android, Linux, and Windows). These applications require the highest
processing power, virtual memory system support with memory management
units (MMUs), and, optionally, enhanced Java support and a secure program
execution environment. Example products include high-end smartphones,
tablets, televisions, and even computing servers.
Cortex-R: Real-time, high-performance processors targeted primarily at the
higher end of the real-time market. These are applications, such as hard drive
controllers, baseband controllers for mobile communications, and automotive
systems, in which high processing power and high reliability are essential and for
which low latency and determinism are important.

FIGURE 1.5
Diversity of processor products for three areas in the Cortex processor family
[Figure: ARM Cortex processors plotted by performance and functionality against time (2003, 2005, 2009, 2012, future). High-end application processors: Cortex-A5, Cortex-A7, Cortex-A8, Cortex-A9, Cortex-A15, Cortex-A53, Cortex-A57. High-performance real-time systems: Cortex-R4, R4F, Cortex-R5, Cortex-R7. Microcontroller applications: Cortex-M0, Cortex-M0+, Cortex-M1, Cortex-M3, Cortex-M4. Earlier processors shown for comparison: ARM7TDMI, ARM9E series, ARM11 series.]

Cortex-M: Processors targeting smaller scale applications such as
microcontrollers and mixed signal design, where criteria like low cost, low
power, energy efficiency, and low interrupt latency are important. At the same
time, the processor design has to be easy to use and able to provide deterministic
behavior as required in many real-time control systems.
By creating this product range partitioning, the requirements of each market
segment are addressed, allowing the ARM architecture to reach even more applications than before.
The Cortex processor families are the first products developed on ARM
architecture v7, and the Cortex-M3 processor is based on one profile of ARMv7,
called ARMv7-M, an architecture specification for microcontroller products.

1.5.3 Architecture versions and Thumb® ISA
As ARM® develops new processors, new instructions and architectural features are
added from time to time (Figure 1.6). As a result, there are different versions of
the architecture. For example, the successful ARM7TDMI™ is based on the architecture
version ARMv4T (the “T” means Thumb instruction support). Note that
architecture version numbers are independent of processor names.

FIGURE 1.6
Instruction set enhancement
[Figure: architecture development from v4 and v4T through v5, v5E, and v6 to v7, shown for the ARM and Thumb instruction sets. Thumb instructions were introduced with v4T, enhanced DSP instructions were added with v5E, SIMD and v6 memory support were added with v6, followed by the introduction of Thumb-2 technology.]

The ARMv5TE architecture was introduced with the ARM9E processor families,
including the ARM926E-S and ARM946E-S processors. This architecture added
“Enhanced” Digital Signal Processing (DSP) instructions for multimedia applications.
With the arrival of the ARM11 processor family, the architecture was extended to
ARMv6. New features in this architecture included memory system features and
Single Instruction Multiple Data (SIMD) instructions. Processors based on the
ARMv6 architecture include the ARM1136J(F)-S, the ARM1156T2(F)-S, and the
ARM1176JZ(F)-S.
In order to address different needs of a wide range of application areas, architecture version 7 is divided into three profiles (Figure 1.7):
• Cortex®-A Processors: ARMv7-A Architecture
• Cortex-R Processors: ARMv7-R Architecture
• Cortex-M Processors: ARMv7-M & ARMv6-M Architectures

Following the success of the Cortex-M3 processor, an additional architecture
profile called ARMv6-M architecture was also created, to address the needs of
ultra-low power designs. It uses the same programmer’s model and exception
handling methods as ARMv7-M (i.e., NVIC), but uses mostly just Thumb instructions
from ARMv6 to reduce the complexity of the design. The Cortex-M0,
Cortex-M0+, and Cortex-M1 processors are based on the ARMv6-M architecture.
FIGURE 1.7
The evolution of ARM processor architecture
[Figure: Architecture v4/v4T (examples: ARM7TDMI, 920T, Intel StrongARM); Architecture v5/v5E (examples: ARM926, 946, 966, Intel XScale); Architecture v6 (examples: ARM1136, 1176, 1156T-2); Architecture v7, divided into v7-A (Application; e.g., Cortex-A8, Cortex-A9, Cortex-A15), v7-R (Real-Time; e.g., Cortex-R4, Cortex-R5, Cortex-R7), and v7-M (Microcontroller; e.g., Cortex-M3, Cortex-M4); Architecture v6-M (examples: Cortex-M0+, Cortex-M0, Cortex-M1 (FPGA)).]

The Cortex-M3 and Cortex-M4 processors are based on ARMv7-M, an architecture
specification for microcontroller products. Please note that the enhanced DSP
features in the Cortex-M4 processor are often referred to as ARMv7E-M, where the
“E” refers to the “Enhanced” DSP instructions, as in ARMv5TE. The architecture
details are documented in the ARMv7-M Architecture Reference Manual (reference 1).
The architecture documentation contains the following key areas:
• Programmer’s model
• Instruction set
• Memory model
• Debug architecture

Processor-specific information, such as interface details and instruction timing,
is documented in the product-specific Technical Reference Manual (TRM) and
other manuals from ARM.
In Figure 1.7 we can see that the Cortex-M0, Cortex-M0+, and Cortex-M1 processors
are based on ARMv6-M. The ARMv6-M architecture is very similar to
ARMv7-M in many ways, such as its interrupt handling, Thumb®-2 technology,
and debug architecture. However, ARMv6-M has a smaller instruction set.
Following the success of the Cortex-M3 processor release, ARM decided to
further expand its product range in microcontroller applications. The first step
was to allow users to implement their ARM processor in an FPGA (Field Programmable Gate Array) easily, and the second step was to address the ultra-low power
embedded processor. To do this, ARM took the Thumb instruction set from the existing ARMv6 architecture and developed a new architecture based on the exception
and debug features in ARMv7-M architecture. As a result, ARMv6-M architecture
was formed, and the processors based on this architecture are the Cortex-M0 processor (for microcontrollers and ASICs) and the Cortex-M1 processor (for FPGA)
(Figure 1.8).
The result of this development is a processor architecture that enables development of very small and energy efficient processors. At the same time, they are very
easy to use, just like the Cortex-M3 and Cortex-M4.

FIGURE 1.8
ARMv6-M architecture is based on many features from ARMv7-M
[Figure: the ARMv6-M architecture combines the Thumb instruction set from the ARMv6 architecture with the memory map, the programmer’s model and exception model, and the “Thumb-2 system” block from the ARMv7-M architecture, plus Serial-Wire and debug control from the CoreSight Debug Architecture. Processors based on ARMv6-M: ARM Cortex-M0+ (ultra low power design with wide range of features), ARM Cortex-M0 (low power optimized, low cost design), and ARM Cortex-M1 (FPGA specific features and optimization).]

All the Cortex-M processors support Thumb-2 technology and support
different subsets of the Thumb ISA (Instruction Set Architecture). Before
Thumb-2 Technology was available, the Thumb ISA was a 16-bit only instruction
set. Thumb-2 technology extended the Thumb ISA
into a highly efficient and powerful instruction set that delivers significant
benefits in terms of ease of use, code size, and performance (Figure 1.9).
With support for both 16-bit and 32-bit instructions in the Thumb-2 instruction
set, there is no need to switch the processor between Thumb state (16-bit
instructions) and ARM state (32-bit instructions). For example, in the ARM7 or
ARM9™ family of processors, you might need to switch to ARM state if you
want to carry out complex calculations or a large number of conditional operations
and good performance is needed, whereas in the Cortex-M processors, you can mix
32-bit instructions with 16-bit instructions without switching state, getting high code
density and high performance with no extra complexity.
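Mixing 16-bit and 32-bit instructions without a state switch works because the instruction length can be determined from the first halfword alone. As a sketch, assuming the Thumb encoding rule from the ARMv7-M architecture (a halfword whose top five bits are 0b11101, 0b11110, or 0b11111 is the first half of a 32-bit instruction), the hypothetical helper functions below classify a halfword. They are an illustration only and are not a disassembler.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if this halfword is the first half of a 32-bit Thumb
 * instruction, judged by its top five bits. */
static bool is_32bit_thumb_prefix(uint16_t halfword)
{
    unsigned top5 = (unsigned)halfword >> 11;
    return top5 == 0x1Du || top5 == 0x1Eu || top5 == 0x1Fu;
}

/* Size in bytes of the instruction that starts with this halfword. */
static unsigned thumb_instruction_bytes(uint16_t first_halfword)
{
    return is_32bit_thumb_prefix(first_halfword) ? 4u : 2u;
}
```

For example, 0xBF00 (the 16-bit NOP) is classified as a 2-byte instruction, while a first halfword of 0xF7FF (the start of a BL encoding) is classified as the start of a 4-byte instruction.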
Thumb-2 Technology is a very important feature of ARMv7. Compared with the
instructions supported on ARM7 family processors (architecture ARMv4T), the
Cortex-M3 and Cortex-M4 processor’s instruction set has a large number of new
features. For the first time, a hardware divide instruction is available on an ARM
processor, and a number of multiply instructions are also available on the
Cortex-M3 and Cortex-M4 processors to improve data-crunching performance. The
Cortex-M3 and Cortex-M4 processors also support unaligned data accesses, a
feature previously available only in high-end processors.
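Unaligned data typically appears when a multi-byte value sits at an odd offset inside a packed buffer, such as a communication frame. A minimal, portable way to read and write such a value in C is to copy it with memcpy, as sketched below with hypothetical function names. On a processor that supports unaligned accesses, a compiler may reduce each copy to a single load or store, but that is an optimization choice of the tool chain rather than something this code guarantees.

```c
#include <stdint.h>
#include <string.h>

/* Read a 32-bit value (in the processor's native byte order) from an
 * address that is not necessarily word aligned. */
static uint32_t load_u32_unaligned(const uint8_t *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* Write a 32-bit value to an address that is not necessarily
 * word aligned. */
static void store_u32_unaligned(uint8_t *p, uint32_t v)
{
    memcpy(p, &v, sizeof v);
}
```

Using memcpy avoids casting a byte pointer to a uint32_t pointer, which is undefined behavior in C when the address is misaligned, even on hardware that would tolerate the access.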

FIGURE 1.9
The relationship between the Thumb instruction set and the instruction set implemented in
the Cortex-M processors
[Figure: Thumb-2 technology covers the 32-bit and 16-bit Thumb instruction set. Nested subsets are shown for the ARMv6-M architecture (Cortex-M0, Cortex-M0+, Cortex-M1), the ARMv7-M architecture (Cortex-M3), and the ARMv7E-M architecture (Cortex-M4). A few 16-bit Thumb instructions are not available in ARMv6-M.]

1.5.4 Processor naming
Traditionally, ARM® used a numbering scheme to name processors. In the early
days (the 1990s), suffixes were also used to indicate features on the processors.
For example, with the ARM7TDMI™ processor, the “T” indicates Thumb instruction
support, “D” indicates JTAG debugging, “M” indicates fast multiplier, and
“I” indicates an embedded ICE module. Subsequently, it was decided that these
features should become standard features of future ARM processors; therefore, these
suffixes are no longer added to the new processor family names. Instead, ARM
created a new scheme for processor numbering to indicate variations in memory
interface, cache, and tightly coupled memory (TCM).
For example, ARM processors with cache and MMUs are now given the suffix
“26” or “36,” whereas processors with MPUs are given the suffix “46” (e.g.,
ARM946E-S). In addition, other suffixes are added to indicate synthesizable⁹ (S)
and Jazelle (J) technology. Table 1.5 presents a summary of processor names.
With ARMv7, ARM has migrated away from these complex numbering schemes
that needed to be decoded, moving to a consistent naming for families of processors,
with Cortex as the overall brand. In addition to illustrating the compatibility across
processors, this system removes confusion between architectural version and processor
family number; for example, Cortex®-M always refers to processors for microcontroller
applications, and this covers both ARMv7-M and ARMv6-M products.

9 A synthesizable core design is available in the form of a hardware description language (HDL) such
as Verilog or VHDL and can be converted into a design netlist using synthesis software.


Table 1.5 Naming of Classic ARM Processors; “(F)” Means Optional Floating Point Unit

Processor Name     Architecture Version   Memory Management Features              Other Features
ARM7TDMI           ARMv4T                 -                                       -
ARM7TDMI-S         ARMv4T                 -                                       -
ARM7EJ-S           ARMv5TEJ               -                                       DSP, Jazelle
ARM920T            ARMv4T                 MMU                                     -
ARM922T            ARMv4T                 MMU                                     -
ARM926EJ-S         ARMv5TEJ               MMU                                     DSP, Jazelle
ARM946E-S          ARMv5TE                MPU                                     DSP
ARM966E-S          ARMv5TE                -                                       DSP
ARM968E-S          ARMv5TE                -                                       DMA, DSP
ARM966HS           ARMv5TE                MPU (optional)                          DSP
ARM1020E           ARMv5TE                MMU                                     DSP
ARM1022E           ARMv5TE                MMU                                     DSP
ARM1026EJ-S        ARMv5TEJ               MMU or MPU                              DSP, Jazelle
ARM1136J(F)-S      ARMv6                  MMU                                     DSP, Jazelle
ARM1176JZ(F)-S     ARMv6Z                 MMU + TrustZone                         DSP, Jazelle
ARM11 MPCore       ARMv6K                 MMU + multiprocessor cache support      DSP, Jazelle
ARM1156T2(F)-S     ARMv6T2                MPU                                     DSP

1.5.5 About the ARM® ecosystem
Besides working with silicon vendors, ARM® is also working closely and actively
with various parties that develop ARM solutions or use ARM products. For example,
this includes vendors that provide software development suites, embedded OS and
middleware, as well as design services providers, distributors, training providers,
academic researchers, and so on (Figure 1.10). This close collaboration allows these
parties to provide high-quality products or services, and allows more users to benefit
from using the ARM architecture.
The ARM ecosystem also enables better knowledge sharing, which helps software
developers to develop their applications faster and with better quality. For example,
microcontroller users can get help and expert advice easily from various public forums on the Internet. Various microcontroller vendors, distributors, and other training
service providers also organize regular ARM microcontroller training courses.
ARM also works closely with various open source projects to help the open
source community to develop software for ARM platforms. For example, the Linaro
organization (http://www.linaro.org) was set up by ARM as a not-for-profit engineering organization to enhance open source software such as GCC, Linux, and
multimedia support.


FIGURE 1.10
The ARM ecosystem
[Figure: ARM at the center of the ecosystem, connected to silicon partners, EDA tool vendors, researchers and academics, the open source community, design services and training providers, software and middleware vendors, distributors, development tools vendors, and users.]

Choices
● More choices of microcontrollers
● More choice on development tools
● More development boards
● More open source project support
● More OS support
● More middleware and software solutions

Knowledge sharing
● Resources on the Internet
● Large user community
● Technical forums
● Seminars and webinars (many free)
● Strong support

Companies that develop ARM products or use ARM technologies can join the
ecosystem by becoming a member of the ARM Connected Community. The
ARM Connected Community is a global network of companies aligned to provide
a complete solution, from design to manufacture and end use, for products based
on the ARM architecture. ARM offers a variety of resources to Community members, including promotional programs and peer-networking opportunities that enable
a variety of ARM Partners to come together to provide end-to-end customer solutions. Today, the ARM Connected Community has more than 1000¹⁰ corporate
members. Joining the ARM Connected Community is easy; details are on the
ARM website http://cc.arm.com.
ARM also has a University Program that enables academic organizations like
universities to access ARM technologies such as processor IP, reference materials,
and so on. Details of the ARM University Program can be found on the ARM
website (http://www.arm.com/support/university/).

10 Information from Q4 2012.

CHAPTER 2

Introduction to Embedded
Software Development

CHAPTER OUTLINE
2.1 What are inside typical ARM® microcontrollers?
2.2 What you need to start
2.2.1 Development suites
2.2.2 Development boards
2.2.3 Debug adaptor
2.2.4 Software device driver
2.2.5 Examples
2.2.6 Documentation and other resources
2.2.7 Other equipment
2.3 Software development flow
2.4 Compiling your applications
2.5 Software flow
2.5.1 Polling
2.5.2 Interrupt driven
2.5.3 Multi-tasking systems
2.6 Data types in C programming
2.7 Inputs, outputs, and peripherals accesses
2.8 Microcontroller interfaces
2.9 The Cortex® microcontroller software interface standard (CMSIS)
2.9.1 Introduction to CMSIS
2.9.2 Areas of standardization in CMSIS-Core
2.9.3 Organization of CMSIS-Core
2.9.4 How do I use CMSIS-Core?
2.9.5 Benefits of CMSIS-Core
2.9.6 Various versions of CMSIS

The Definitive Guide to ARM® Cortex®-M3 and Cortex-M4 Processors. http://dx.doi.org/10.1016/B978-0-12-408082-9.00002-6
Copyright © 2014 Elsevier Inc. All rights reserved.

2.1 What are inside typical ARM® microcontrollers?
There are many different things inside a microcontroller. In many microcontrollers,
the processor takes less than 10% of the silicon area, and the rest of the silicon die is
occupied by other components such as:
• Program memory (e.g., flash memory)
• SRAM
• Peripherals
• Internal bus infrastructure
• Clock generator (including Phase Locked Loop), reset generator, and distribution
  network for these signals
• Voltage regulator and power control circuits
• Other analog components (e.g., ADC, DAC, voltage reference circuits)
• I/O pads
• Support circuits for manufacturing tests, etc.

While some of these components are directly visible to programmers, some others
could be invisible to software developers (e.g., support circuit for manufacturing tests).
Don’t worry; to use a Cortex®-M microcontroller, we only need to have a basic
understanding of the processors (e.g., how to use the interrupt features), as well as
the detailed programmer’s model of the peripherals. Since the peripherals from different
microcontroller vendors are different, you need to download and read the user manuals
(or similar documents) from microcontroller vendors. This book is focused on the processors, although a number of examples on using peripherals are also covered.
Peripherals and control registers for system management are accessible from the
memory map. To make it easier for software developers, most microcontroller vendors provide C header files and driver libraries for their microcontrollers. In most
cases, these files are developed with the Cortex Microcontroller Software Interface
Standard (CMSIS), which means they use a set of standardized headers for accessing
processor features. We will cover more on this later in this chapter.
In most cases, the processor does all the work of controlling the peripherals and
handles the system management. This book will cover a few examples of using a
number of popular Cortex-M3/M4-based microcontrollers. In some microcontrollers there are also some smart peripherals that can do small amounts of processing
without processor intervention. This depends on the vendor-specific peripherals on
the microcontrollers and is beyond the scope of this book, but you can find the details
in user manuals on the microcontroller vendor’s website.

2.2 What you need to start
2.2.1 Development suites

With more than 10 different vendors selling C compiler suites for Cortex®-M microcontrollers, deciding which one to use can be a difficult choice. The development
suites range from open-source free tools, to budget low-cost tools, to high-end commercial packages. The currently available choices include various products from the
following vendors:
• Keil™ Microcontroller Development Kit (MDK-ARM)
• ARM® DS-5™ (Development Studio 5)
• IAR Systems (Embedded Workbench for ARM Cortex-M)
• Red Suite from Code Red Technologies (acquired by NXP in 2013)
• Mentor Graphics Sourcery CodeBench (formerly CodeSourcery Sourcery g++)
• mbed.org
• Altium Tasking VX-toolset for ARM Cortex-M
• Rowley Associates (CrossWorks)
• Coocox
• Texas Instruments Code Composer Studio (CCS)
• Raisonance RIDE
• Atollic TrueStudio
• GNU Compiler Collection (GCC)
• ImageCraft ICCV8
• Cosmic Software C Cross Compiler for Cortex-M
• mikroElektronika mikroC
• Arduino

Some development boards also include a basic or evaluation edition of the development suites. In addition, there are development suites for other languages. For
example:
• Oracle Java ME Embedded
• IS2T MicroEJ Java virtual machine
• mikroElektronika mikroBasic, mikroPascal

The illustrations in this book are mostly based on the Keil Microcontroller
Development Kit (MDK-ARM) because of its popularity, but most of the example
code can be used with the other development suites.

2.2.2 Development boards

There are already a large number of development kits for the Cortex®-M3/M4
microcontrollers from various microcontroller vendors and their distributors.
Many of them are offered at an excellent price. For example, you can get a
Cortex-M3 evaluation board for less than $12.
You can also get development kits from software tool vendors; for example, companies like Keil™ (an example is shown in Figure 2.1), IAR Systems, and Code Red
Technologies all have a number of development boards available.
A number of low-cost development boards are designed to work with particular
development suites. For example, the “mbed.org” development boards, a low-cost
solution for rapid software prototyping, are designed to work with the mbed development platform.
To start learning about ARM® Cortex-M microcontrollers, it is not always necessary to get an actual development board. Several development suites include an instruction set simulator, and the Keil MDK-ARM even supports device-level
simulation for some of the popular Cortex-M microcontrollers. So you can learn
Cortex-M programming just by using simulation.

FIGURE 2.1
A Cortex-M3 development board from Keil (MCBSTM32)

2.2.3 Debug adaptor
In order to download your program code to the microcontroller, and to carry out
debug operations like halting and single stepping, you might need a debug adaptor
to convert a USB connection from your PC to a debug communication protocol used
by the microcontrollers. Most C compiler vendors have their own debug adaptor
products. For example, Keil™ has the ULINK product family (Figure 2.2), and
IAR provides the I-Jet product. Most development suites also support third-party
debug adaptors. Note that different vendors might have different terminologies for
these debug adaptors, for example, debug probe, USB-JTAG adaptor, JTAG/SW
Emulator, JTAG In-Circuit Emulator (ICE), etc.

FIGURE 2.2
The Keil ULINK debug adaptor family

Some of the development kits already have a USB debug adaptor built-in on the
board. This includes some of the low-cost evaluation boards from Texas Instruments,
ST Microelectronics (e.g., STM32 Value Line Discovery; Figure 2.3), NXP, EnergyMicro, etc. Many of these onboard USB adaptors are also supported by mainstream
commercial development suites. So you can start developing software for the
Cortex®-M microcontrollers with a tiny budget.

FIGURE 2.3
An example of development board with USB debug adaptor - STM32 Value Line Discovery

In a number of evaluation/development boards, the built-in USB debug adaptor
can also be used to connect to other development boards. You can also find “open-source” versions of such debug adaptors. The CMSIS-DAP from ARM and CoLink
from Coocox are two examples.
While these low-cost debug adaptors work for most debug operations, there
might be some features that are not well supported. There are a number of commercial USB debug adaptor products that offer a large number of useful features.
2.2.4 Software device driver
The term device driver here is quite different from its meaning in a PC environment.
In order to help microcontroller software developers, microcontroller vendors usually provide header files and C code that include:
• Definitions of peripheral registers
• Access functions for configuring and accessing the peripherals

By adding these files to your software projects, you can access various peripheral
functions via function calls and access peripheral registers easily. If you want to, you
can also create modified versions of the access functions based on the methods
shown in the driver code and optimize them for your application.

2.2.5 Examples
Don’t forget to download some example code from the microcontroller vendor’s
website. Most of the microcontroller vendors put their device-driver code and
examples on their websites as free downloads. This can save you a lot of time in
developing new applications.

2.2.6 Documentation and other resources
Aside from user manuals of the microcontrollers, often you can also find application
notes, FAQs, and online discussion forums on microcontroller vendor websites. The
user manuals are essential, as they provide the details of the peripherals’ programmer models.
On the ARM® website, the documentation is placed in a section called Info
Center (http://infocenter.arm.com). From there you can find the Cortex®-M3/M4
Devices Generic User Guides (references 2 and 3), which cover the programming
model of the processors, as well as various application notes.
Finally, you can also find a number of useful application notes and online
discussion forums from tool vendor websites.

2.2.7 Other equipment
Depending on the applications you are developing and the development board you
are using, you might need additional hardware that interfaces to the development
boards, such as external LCD display modules or communication interface adaptors.
Also you might need some hardware development tools like a laboratory power
supply, logic analyzer/oscilloscope, signal generator, etc.

2.3 Software development flow
The software development flow depends on the compiler suite you use. Assuming
that you are using a compiler suite with Integrated Development Environment
(IDE), the software development flow (as shown in Figure 2.4) usually involves:
Create project - you need to create a project that will specify the location of
source files, compile target, memory configurations, compilation options, and so
on. Many IDEs have a project creation wizard for this step.
Add files to project - you need to add the source code files required by the
project. You might also need to specify the path of any included header files in
the project options. Obviously you might also need to create new program source
code files and write the program. Note that you should be able to reuse a number
of files from the device-driver library to reduce the effort in writing new files.
This includes startup code, header files, and some of the peripheral control
functions.
Set up project options - In most cases, the project file created allows a number of
project options such as compiler optimization options, memory map, and output
file types. Based on the development board and debug adaptor you have, you
might also need to set up options for debug and code download.

FIGURE 2.4
A simplified software development flow

Compile and link - In most cases, a project contains a number of files that are
compiled separately. After the compilation process, each source file will have a
corresponding object file. In order to generate the final combined executable
image, a separate linking process is required. After the link stage, the IDE can
also generate the program image in other file formats for the purpose of programming the image to the device.
Flash programming - Almost all of the Cortex®-M microcontrollers use flash
memories for program storage. After a program image is created, we need
to download the program to the flash memory of the microcontroller. To do
this, you need a debug adaptor if the microcontroller board you use does not
have one built in. The actual flash programming procedures can be quite
complex, but these are usually fully handled by the IDE and you can carry out
the whole programming process with a single mouse click. Note that if you
want to, you can also download applications to SRAM and execute them
from there.
Execute program and debug - After the compiled program is downloaded
to the microcontroller, you can then run the program and see if it works.
You can use the debug environment in the IDE to stop the processor
(commonly referred to as halt) and check the status of the system to ensure it is
working properly. If it doesn’t work correctly, you can use various debug
features like single stepping to examine the program operations in detail. All
these operations will require a debug adaptor (or the one built in to the
development kit if available) to link up the IDE and the microcontroller being
tested. If a software bug is found, then you can edit your program code,
recompile the project, download the code to the microcontroller, and test
it again.
If you are using an open source toolchain, you might not have an IDE and might
need to handle the compile and link process using scripts or makefiles. Depending
on the microcontroller product you are using, there can be third-party tools that
can be used to download the compiled program image to the flash memory in the
microcontroller.
During execution of the compiled program, you can check the program execution
status and results by outputting information via various I/O mechanisms such as a
UART interface or an LCD module. A number of examples in this book will
show how some of these methods can be implemented. See Chapter 18 for some
of the examples.
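
As a rough illustration of the UART approach, the sketch below sends a text message one character at a time by polling a "transmitter ready" flag. The register layout (uart_regs_t), the flag UART_STATUS_TX_READY, and the function names are hypothetical and chosen only for this example; real UART registers differ between vendors, so the user manual of the device remains the reference. The register block is passed in as a pointer, so the same code can be tried on a PC with an ordinary variable standing in for the hardware.

```c
#include <stdint.h>

/* Hypothetical UART register block; real layouts are vendor specific. */
typedef struct {
    volatile uint32_t DATA;    /* writing a character here transmits it   */
    volatile uint32_t STATUS;  /* bit 0 set = transmitter can accept data */
} uart_regs_t;

#define UART_STATUS_TX_READY (1UL << 0)

/* Send one character, waiting until the transmitter is ready. */
void uart_putc(uart_regs_t *uart, char c)
{
    while ((uart->STATUS & UART_STATUS_TX_READY) == 0) {
        /* busy-wait until the previous character has been taken */
    }
    uart->DATA = (uint32_t)(unsigned char)c;
}

/* Send a null-terminated string; returns the number of characters sent. */
unsigned uart_puts(uart_regs_t *uart, const char *s)
{
    unsigned count = 0;
    while (*s != '\0') {
        uart_putc(uart, *s++);
        count++;
    }
    return count;
}
```

On a real device the pointer would typically be a fixed address taken from the vendor's header file, and a call such as uart_puts(uart, "Init done\n") could then be placed at points of interest in the program.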

2.4 Compiling your applications
The procedure for compiling an embedded program depends on the development
tools you use. Later in this book, a number of chapters cover the use of a couple
of development tools to compile simple applications (Chapters 15 to 17). Here we
will first have a look at some basic concepts of the compilation process.

FIGURE 2.5
Common software compilation flow

First, we assume that you are developing your project using the C programming
language. This is the most commonly used programming language for microcontroller software development. Your project might also contain some assembly language files; for example, startup code that is supplied by microcontroller vendors.
In most cases, the compilation process will be similar to the one shown in
Figure 2.5.
Most development suites contain the tools listed in Table 2.1.
Different development tools have different ways to specify the layout of the program and data memory in the microcontroller system. In ARM® toolchains, you can
use a file type called scatter-loading file, or in the case of Keil™ MDK-ARM, the
scatter-loading file can be generated automatically by the µVision development
environment. For some other ARM toolchains, you can also use command line
options to specify the locations of ROM and RAM.
In a GNU-based toolchain, the memory specification is handled by linker scripts.
These scripts are typically included in the installation of commercial gcc toolchains.
However, some gcc users might have to create these files themselves. A later chapter
of this book contains examples for compiling programs using gcc, which covers
more information on linker scripts.
When using the GNU gcc toolchain, it is common to compile the whole
application in one go instead of separating the compilation and linking stages
(Figure 2.6).
The gcc compilation automatically invokes the linker and assembler if needed.
This arrangement ensures that the details of the required parameters and libraries
are passed on to the linker correctly. Using the linker as a separate step can be error
prone and therefore is not recommended by most gcc tool vendors.

Table 2.1 Various Tools You Can Find in a Development Suite

Tools              Descriptions
C compiler         To compile C program files into object files
Assembler          To assemble assembly code files into object files
Linker             A tool to join multiple object files together and define memory
                   configuration
Flash programmer   A tool to program the compiled program image to the flash
                   memory of the microcontroller
Debugger           A tool to control the operation of the microcontroller and to
                   access internal operation information so that status of the
                   system can be examined and the program operations can be
                   checked
Simulator          A tool to allow the program execution to be simulated without
                   real hardware
Other utilities    Various tools, for example, file converters to convert the
                   compiled files into various formats

FIGURE 2.6
Common software compilation flow for GNU toolchain

2.5 Software flow
There are many ways to construct program flow for an application. Here we will
cover some of the basic concepts.

2.5.1 Polling
For very simple applications, the processor can wait until there is data ready for processing, process it, and then wait again. This is very easy to set up and works fine for
simple tasks. Figure 2.7 shows a simple polling program flow chart.
In most cases, a microcontroller will have to serve multiple interfaces and therefore be required to support multiple processes. The polling program flow method can
be expanded to support multiple processes easily (Figure 2.8). This arrangement is
sometimes called a “super-loop.”
The polling method works well for simple applications, but it has several disadvantages. For example, when the application gets more complex, the polling loop
design might get very difficult to maintain. Also, it is difficult to define priorities between different services using polling - you might end up with poor responsiveness,
where a peripheral requesting service might need to wait a long time while the processor is handling less important tasks.
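
A super-loop of this kind can be sketched in a few lines of C. In the hedged example below, the two devices, their "ready" flags, and the processing functions are all hypothetical; on real hardware each flag would be replaced by a read of a peripheral status register. One pass of the loop is written as a function so that it can be exercised on a PC.

```c
#include <stdbool.h>

/* Simulated "data ready" flags; on real hardware these would be
   status-register reads (hypothetical devices A and B). */
volatile bool device_a_ready = false;
volatile bool device_b_ready = false;

/* Counters so that the effect of each service routine can be observed. */
unsigned device_a_count = 0;
unsigned device_b_count = 0;

void process_device_a(void) { device_a_ready = false; device_a_count++; }
void process_device_b(void) { device_b_ready = false; device_b_count++; }

/* One pass of the super-loop: check each device in a fixed order and
   service the ones that need attention. */
void super_loop_once(void)
{
    if (device_a_ready) { process_device_a(); }
    if (device_b_ready) { process_device_b(); }
}
```

In an application, main() would call super_loop_once() inside an endless for (;;) loop. The fixed checking order also shows the priority problem described above: device B is only looked at after device A has been fully processed.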

FIGURE 2.7
Polling method for simple application processing

FIGURE 2.8
Polling method for application with multiple devices that need processing

2.5.2 Interrupt driven
Another main disadvantage of the polling method is that it is not energy efficient. Lots
of energy is wasted during the polling when service is not required. To solve this
problem, almost all microcontrollers have some sort of sleep mode support to reduce
power, in which the peripheral can wake up the processor when it requires a service
(Figure 2.9). This is commonly known as an interrupt-driven application.
In an interrupt-driven application, interrupts from different peripherals can be
assigned with different interrupt priority levels. For example, important/critical
peripherals can be assigned with a higher priority level so that if the interrupt arrives
when the processor is servicing a lower priority interrupt, the execution of the lower
priority interrupt service is suspended, allowing the higher priority interrupt service
to start immediately. This arrangement allows much better responsiveness.
In some cases, the processing of data from peripheral services can be partitioned
into two parts: the first part needs to be done quickly, and the second part can be carried out a little bit later. In such situations we can use a mixture of interrupt-driven
and polling methods to construct the program. When a peripheral requires service, it
triggers an interrupt request as in an interrupt-driven application. Once the first part
of the interrupt service is carried out, it updates some software variables so that the
second part of the service can be executed in the polling-based application code
(Figure 2.10).
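
The split between a short interrupt handler and deferred processing in the main loop might look like the sketch below. All names are hypothetical, and the handler takes the captured value as a parameter only so that the code can be run on a PC; a real interrupt handler takes no arguments and would read the value from a peripheral data register instead.

```c
#include <stdbool.h>
#include <stdint.h>

volatile bool     sample_pending = false; /* set by the handler, cleared by the main loop */
volatile uint16_t latest_sample  = 0;     /* data captured in the handler                 */
uint32_t          running_total  = 0;     /* result of the deferred processing            */

/* First part: must be quick. It only captures the data and sets a flag.
   raw_value stands in for a read of a peripheral data register. */
void adc_irq_handler(uint16_t raw_value)
{
    latest_sample  = raw_value;
    sample_pending = true;
}

/* Second part: called from the polling loop in the main program.
   Returns true if a sample was processed. */
bool process_pending_sample(void)
{
    if (!sample_pending) {
        return false;
    }
    sample_pending = false;
    running_total += latest_sample;
    return true;
}
```

The shared variables are declared volatile because they are changed outside the normal program flow. If a new interrupt can arrive while the main loop is still handling the previous sample, a real application may also need to disable the interrupt briefly or use a small queue; that detail is left out of this sketch.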

FIGURE 2.9
Simple interrupt-driven application

Using this arrangement, we can reduce the duration of high-priority interrupt handlers so that lower priority interrupt services can get served quicker. At the same time,
the processor can still enter sleep mode to save power when no servicing is needed.

FIGURE 2.10
Application with both polling method and interrupt-driven arrangement

2.5.3 Multi-tasking systems
When the applications get more complex, a polling and interrupt-driven program
structure might not be able to handle the processing requirements. For example,
some tasks that can take a long time to execute might need to be processed concurrently. This can be done by dividing the processor’s time into a number of time slots
and allocating the time slots to these tasks. While it is technically possible to create
such an arrangement by manually partitioning the tasks and building a simple scheduler to handle this, it is often impractical to do this in real projects as it is time
consuming and can make the program much harder to maintain and debug.
In these applications, a Real-Time Operating System (RTOS) can be used to handle
the task scheduling (Figure 2.11). An RTOS allows multiple processes to be executed
concurrently, by dividing the processor’s time into time slots and allocating the time
slots to the processes that require services. A timer is needed to handle the timekeeping
for the RTOS, and at the end of each time slot, the timer generates a timer interrupt,
which triggers the task scheduler to decide whether context switching should be carried
out. If yes, the currently executing process is suspended and the processor executes
another process.
Besides task scheduling, RTOSs also have many other features such as semaphores, message passing, etc. There are many RTOSs developed for the Cortex®-M
processors, and many of them are completely free of charge.
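
To make the time-slot idea more concrete, the sketch below models only the decision taken on each timer tick: pick the next ready task in round-robin order and report whether a context switch is needed. It is a simplified illustration with hypothetical names, not how any particular RTOS is implemented; a real RTOS also saves and restores registers and stack pointers during the switch, and usually supports task priorities.

```c
#include <stdbool.h>
#include <stddef.h>

#define NUM_TASKS 3

typedef struct {
    bool ready;   /* true if the task has work to do */
} task_t;

/* Hypothetical task table: tasks 0 and 2 are ready, task 1 is waiting. */
task_t tasks[NUM_TASKS] = { { true }, { false }, { true } };
size_t current_task = 0;

/* Called at the end of each time slot (for example, from a timer
   interrupt). Searches the other tasks in round-robin order; if one is
   ready it becomes the current task and true is returned to indicate
   that a context switch is needed. */
bool scheduler_tick(void)
{
    for (size_t step = 1; step < NUM_TASKS; step++) {
        size_t candidate = (current_task + step) % NUM_TASKS;
        if (tasks[candidate].ready) {
            current_task = candidate;
            return true;
        }
    }
    return false;   /* no other task is ready; keep running the current one */
}
```

With the table above, successive ticks alternate between task 0 and task 2, and task 1 is skipped until its ready flag is set.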

FIGURE 2.11
Using an RTOS to handle multiple tasks

2.6 Data types in C programming
The C programming language supports a number of “standard” data types. However,
the way a data item is represented in hardware depends on the processor architecture
as well as the C compiler. In different processor architectures, the size of certain data
types can be different. For example, the integer is often 16-bit in 8-bit or 16-bit
microcontrollers, and is always 32-bit in the ARM® architecture. Table 2.2 shows
the common data types in ARM architecture, including all Cortex®-M processors.
These data types are supported by all C compilers.
Because of differences of size in certain data types, it might be necessary to
modify the source code when porting an application from an 8-bit or 16-bit microcontroller to an ARM Cortex-M microcontroller. More details on porting software
from 8-bit and 16-bit architectures are covered in Chapter 24.
In ARM programming, we also refer to the size of a data item as BYTE, HALF
WORD, WORD, and DOUBLE WORD, as shown in Table 2.3.
These terms are very common in ARM documentation, including the instruction
set descriptions as well as hardware descriptions.
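
One common way to reduce such porting problems is to use the fixed-width types from the C99 header stdint.h wherever the size matters. The short sketch below is an illustration only; the macro BITS_OF and the function sum_samples are names invented for this example.

```c
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/* Size of a type in bits on whatever target the code is compiled for. */
#define BITS_OF(type) (sizeof(type) * CHAR_BIT)

/* Adds up 16-bit samples in a 32-bit accumulator. Declaring the total as
   uint32_t (rather than int or unsigned int) gives the same range on an
   8-bit, 16-bit, or 32-bit microcontroller. */
uint32_t sum_samples(const uint16_t *samples, size_t count)
{
    uint32_t total = 0;
    for (size_t i = 0; i < count; i++) {
        total += samples[i];
    }
    return total;
}
```

Here BITS_OF(uint16_t) evaluates to 16 and BITS_OF(uint32_t) to 32 on any platform that provides these types, whereas BITS_OF(int) depends on the processor architecture and the compiler.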

Table 2.2 Size and Range of Data Types in ARM Architecture Including Cortex-M Processors

C and C99 (stdint.h) Data Type     Number of Bits   Range (Signed)                   Range (Unsigned)
char, int8_t, uint8_t              8                -128 to 127                      0 to 255
short, int16_t, uint16_t           16               -32768 to 32767                  0 to 65535
int, int32_t, uint32_t             32               -2147483648 to 2147483647        0 to 4294967295
long                               32               -2147483648 to 2147483647        0 to 4294967295
long long, int64_t, uint64_t       64               -(2^63) to (2^63 - 1)            0 to (2^64 - 1)
float                              32               -3.4028234 × 10^38 to 3.4028234 × 10^38
double                             64               -1.7976931348623157 × 10^308 to 1.7976931348623157 × 10^308
long double                        64               -1.7976931348623157 × 10^308 to 1.7976931348623157 × 10^308
Pointers                           32               0x0 to 0xFFFFFFFF
enum                               8 / 16 / 32      Smallest possible data type, except when overridden by compiler option
bool (C++ only), _Bool (C only)    8                True or false
wchar_t                            16               0 to 65535

Table 2.3 Data Size Definition in ARM Processor

Terms          Size
Byte           8-bit
Half word      16-bit
Word           32-bit
Double word    64-bit

2.7 Inputs, outputs, and peripherals accesses
Almost all microcontrollers have various Input/Output (I/O) interfaces and peripherals such as timers, Real-time Clock (RTC), and so on. For microcontroller
products based on the ARM® Cortex®-M3 and Cortex-M4 processors, as well as common
interface peripherals such as GPIO, SPI, UART, and I2C, you can also find many
advanced interface peripherals like USB, CAN, and Ethernet, and analog interfaces
like ADCs (Analog to Digital Converters) and DACs (Digital to Analog Converters).
Most of these interface peripherals are vendor specific, so you need to read the user
manuals provided by the microcontroller vendors to learn how to use them. In most
cases you can also find programming examples on microcontroller vendor websites.
On these microcontrollers, the peripherals are memory-mapped, which means
the registers are accessible from the system memory map. In order to access these
peripheral registers in C programs, we can use pointers. We will see some examples
of how this can be done in the following sections.


Typically, a peripheral requires an initialization process before it can be used.
This might include some of the following steps:
• Programming the clock control circuitry to enable the clock signal connection to
  the peripheral, and the clock signal connection to the corresponding I/O pins if needed.
  Many modern microcontrollers allow fine tuning of clock signal distribution,
  such as enabling/disabling the clock connection to each individual peripheral for
  better energy saving. Typically the clocks to peripherals are turned off by default
  and you need to enable the clock before programming the peripheral. In some
  cases you might also need to enable the clock to the peripheral bus system.
• In some cases you might need to configure the operation mode of the I/O pins. Most
  microcontrollers have multiplexed I/O pins that can be used for multiple purposes.
  In order to use a peripheral, it might be necessary to configure its I/O pins to match
  the usage (e.g., input/output direction, function, etc.). In addition, you might also
  need to program additional configuration registers to define the expected electrical
  characteristics such as output type (voltage, pull up/down, open drain, etc.).
• Peripheral configuration. Most peripherals contain a number of programmable
  registers that need configuration before using the peripheral. In some cases, you
  can find the programming sequence a bit more complex than that of an 8-bit
  microcontroller, because the peripherals on 32-bit microcontrollers are often much
  more sophisticated than peripherals on 8-bit/16-bit systems. On the other hand,
  the microcontroller vendors often provide device-driver library code,
  and you can use these driver functions to reduce the programming work required.
• Interrupt configuration. If a peripheral is to be used with interrupt operations, you
  will need to program the interrupt controller on the Cortex-M3/M4 processor
  (NVIC) to enable the interrupt and to configure the interrupt priority level.
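
The four steps above can be sketched in C as follows. This is only an illustration of the ordering: the register block, its field names, and every bit position are hypothetical and do not describe any real device. So that the sketch can also be compiled and run on a PC, the "registers" are an ordinary variable rather than a memory-mapped address.

```c
#include <stdint.h>

/* Hypothetical register block. On real hardware these registers would belong
   to different blocks (clock controller, GPIO, UART, interrupt controller). */
typedef struct {
    volatile uint32_t CLK_ENABLE;  /* one clock-enable bit per peripheral */
    volatile uint32_t PIN_MODE;    /* 2-bit mode field per I/O pin        */
    volatile uint32_t UART_CTRL;   /* UART control register               */
    volatile uint32_t IRQ_ENABLE;  /* one interrupt-enable bit per source */
} DemoRegs_t;

/* Stand-in for the memory-mapped registers (assumption for host builds). */
static DemoRegs_t demo_regs;

/* All bit positions below are made up for this sketch. */
#define DEMO_GPIO_CLK_BIT    (1UL << 2)
#define DEMO_UART_CLK_BIT    (1UL << 4)
#define DEMO_UART_TX_PIN     9U
#define DEMO_PIN_ALT_FUNC    0x2UL
#define DEMO_UART_ENABLE     (1UL << 0)
#define DEMO_UART_RX_IRQ_EN  (1UL << 5)
#define DEMO_UART_IRQ_BIT    (1UL << 7)

void demo_uart_init(DemoRegs_t *regs)
{
    /* Step 1: enable the clocks to the peripheral and to its I/O port. */
    regs->CLK_ENABLE |= DEMO_UART_CLK_BIT | DEMO_GPIO_CLK_BIT;

    /* Step 2: switch the TX pin to its alternate (peripheral) function. */
    regs->PIN_MODE &= ~(0x3UL << (DEMO_UART_TX_PIN * 2U));
    regs->PIN_MODE |= DEMO_PIN_ALT_FUNC << (DEMO_UART_TX_PIN * 2U);

    /* Step 3: configure and enable the peripheral itself. */
    regs->UART_CTRL = DEMO_UART_ENABLE | DEMO_UART_RX_IRQ_EN;

    /* Step 4: enable the interrupt at the interrupt controller. */
    regs->IRQ_ENABLE |= DEMO_UART_IRQ_BIT;
}
```

Calling demo_uart_init(&demo_regs) performs the whole sequence. The order matters on real devices: if the clock to a peripheral is still off, writes to its registers usually have no effect.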

All these initialization steps are carried out by programming peripheral registers
in various peripheral blocks. As mentioned, peripheral registers are memory-mapped
and therefore can be accessed using pointers. For example, you can define
a General Purpose Input Output (GPIO) register set as a number of pointers:
/* STM32F100RBT6B - GPIO A Port Configuration Register Low */
#define GPIOA_CRL  (*((volatile unsigned long *) (0x40010800)))
/* STM32F100RBT6B - GPIO A Port Configuration Register High */
#define GPIOA_CRH  (*((volatile unsigned long *) (0x40010804)))
/* STM32F100RBT6B - GPIO A Port Input Data Register */
#define GPIOA_IDR  (*((volatile unsigned long *) (0x40010808)))
/* STM32F100RBT6B - GPIO A Port Output Data Register */
#define GPIOA_ODR  (*((volatile unsigned long *) (0x4001080C)))
/* STM32F100RBT6B - GPIO A Port Bit Set/Reset Register */
#define GPIOA_BSRR (*((volatile unsigned long *) (0x40010810)))
/* STM32F100RBT6B - GPIO A Port Bit Reset Register */
#define GPIOA_BRR  (*((volatile unsigned long *) (0x40010814)))
/* STM32F100RBT6B - GPIO A Port Configuration Lock Register */
#define GPIOA_LCKR (*((volatile unsigned long *) (0x40010818)))

Then we can use the definitions directly. For example:
void GPIOA_reset(void) /* Reset GPIO A */
{
    // Set all pins as analog input mode
    GPIOA_CRL = 0; // Pins 0 to 7, all set as analog input
    GPIOA_CRH = 0; // Pins 8 to 15, all set as analog input
    GPIOA_ODR = 0; // Default output value is 0
    return;
}

This method is fine for a small number of peripheral registers. However, as the
number of peripheral registers increases, this coding style can be problematic
because:
• For each register address definition, the program needs to store the 32-bit address
  constant, resulting in increased code size.
• When there are multiple instantiations of the same peripheral (for example, the
  STM32 microcontroller has five GPIO peripherals), the same definition
  has to be repeated for each of the instantiations. This is not scalable and
  makes software maintenance harder.
• It is not easy to create a function that can be shared between multiple
  instantiations of the same peripheral. For example, with the above example
  definition we might have to create the same GPIO reset function for each of the
  GPIO ports, resulting in increased code size.

In order to solve these problems, the common practice is to define the peripheral
registers as data structures. For example, in the device-driver software package from
the microcontroller vendors, we can find:
typedef struct
{
    __IO uint32_t CRL;
    __IO uint32_t CRH;
    __IO uint32_t IDR;
    __IO uint32_t ODR;
    __IO uint32_t BSRR;
    __IO uint32_t BRR;
    __IO uint32_t LCKR;
} GPIO_TypeDef;

Then each peripheral base address (GPIO A to GPIO G) is defined as a pointer to
the data structure:
#define PERIPH_BASE      ((uint32_t)0x40000000)
/*!< Peripheral base address in the bit-band region */
...
#define APB2PERIPH_BASE  (PERIPH_BASE + 0x10000)
...
#define GPIOA_BASE       (APB2PERIPH_BASE + 0x0800)
#define GPIOB_BASE       (APB2PERIPH_BASE + 0x0C00)
#define GPIOC_BASE       (APB2PERIPH_BASE + 0x1000)
#define GPIOD_BASE       (APB2PERIPH_BASE + 0x1400)
#define GPIOE_BASE       (APB2PERIPH_BASE + 0x1800)
...
#define GPIOA            ((GPIO_TypeDef *) GPIOA_BASE)
#define GPIOB            ((GPIO_TypeDef *) GPIOB_BASE)
#define GPIOC            ((GPIO_TypeDef *) GPIOC_BASE)
#define GPIOD            ((GPIO_TypeDef *) GPIOD_BASE)
#define GPIOE            ((GPIO_TypeDef *) GPIOE_BASE)
...

In these code snippets, there are a number of new things we have not covered:
The “__IO” is defined in a standardized header file in CMSIS. It implies a
volatile data item (e.g., a peripheral register), which can be read or written to by
software. Aside from “__IO,” a peripheral register can also be defined as “__I”
(read only) and “__O” (write only).
#ifdef __cplusplus
#define __I  volatile       /*!< defines 'read only' permissions */
#else
#define __I  volatile const /*!< defines 'read only' permissions */
#endif
#define __O  volatile       /*!< defines 'write only' permissions */
#define __IO volatile       /*!< defines 'read / write' permissions */
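
As a small illustration of how the three qualifiers are used together, the sketch below declares a made-up timer peripheral. The type DemoTimer_TypeDef, its registers, and both helper functions are hypothetical; the qualifier definitions are local copies of the C variant shown above so that the fragment compiles without a CMSIS header.

```c
#include <stdint.h>

/* Local copies of the CMSIS qualifiers (C variant), guarded so that they do
   not clash with a real CMSIS header if one is already included. */
#ifndef __IO
#define __I  volatile const  /* read only    */
#define __O  volatile        /* write only   */
#define __IO volatile        /* read / write */
#endif

/* Hypothetical timer peripheral, for illustration only. */
typedef struct {
    __IO uint32_t CTRL;   /* control: software reads and writes      */
    __I  uint32_t COUNT;  /* current count: software may only read   */
    __O  uint32_t CLEAR;  /* write 1 to clear: reading is not useful */
} DemoTimer_TypeDef;

uint32_t demo_timer_elapsed(const DemoTimer_TypeDef *timer, uint32_t start)
{
    /* COUNT is volatile, so it is read again on every call. */
    return timer->COUNT - start;
}

void demo_timer_restart(DemoTimer_TypeDef *timer)
{
    timer->CLEAR = 1U;    /* allowed: __O is writable               */
    timer->CTRL |= 1U;    /* allowed: __IO is readable and writable */
    /* timer->COUNT = 0U;    would not compile in C: __I adds const */
}
```

Note that "__O" cannot stop software from reading a register; in C it only documents the intent. "__I", on the other hand, lets the compiler reject accidental writes.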

The “uint32_t” (unsigned 32-bit integer) is a data type supported in C99. This
ensures the data size is 32-bit, independent of the processor architecture, which can
help the software to be more portable. To use this data type, the project needs to
include the standard data type header (Note: if you are using a CMSIS-compliant
device header file this is already done for you in the device header file):
#include <stdint.h> /* Include standard types */
/* C99 standard data types:
uint8_t : unsigned 8-bit, int8_t : signed 8-bit,
uint16_t : unsigned 16-bit, int16_t : signed 16-bit,
uint32_t : unsigned 32-bit, int32_t : signed 32-bit,
uint64_t : unsigned 64-bit, int64_t : signed 64-bit
*/
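
If a project depends on these exact widths, the assumption can be checked at compile time. The following is a minimal sketch: the helper demo_make_word() is a hypothetical name, and _Static_assert is a C11 feature (not C99), so it needs a reasonably recent compiler.

```c
#include <limits.h>
#include <stdint.h>

/* Compile-time checks (C11): the build stops if a width is not as expected. */
_Static_assert(sizeof(uint8_t)  * CHAR_BIT == 8,  "uint8_t must be 8 bits");
_Static_assert(sizeof(uint16_t) * CHAR_BIT == 16, "uint16_t must be 16 bits");
_Static_assert(sizeof(uint32_t) * CHAR_BIT == 32, "uint32_t must be 32 bits");
_Static_assert(sizeof(uint64_t) * CHAR_BIT == 64, "uint64_t must be 64 bits");

/* Combine two half words into one word. The casts make the result independent
   of the width of "int" on the processor the code is compiled for. */
uint32_t demo_make_word(uint16_t high, uint16_t low)
{
    return ((uint32_t)high << 16) | (uint32_t)low;
}
```

For example, demo_make_word(0x4001, 0x0800) gives 0x40010800 on any architecture, whereas the same expression written with plain "int" could overflow on a processor where "int" is 16 bits wide.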

When peripherals are declared using such a method, we can create functions that
can be used for each instance of the peripheral easily. For example, the code to reset
the GPIO port can be written as:
void GPIO_reset(GPIO_TypeDef* GPIOx)
{
    // Set all pins as analog input mode
    GPIOx->CRL = 0; // Pins 0 to 7, all set as analog input
    GPIOx->CRH = 0; // Pins 8 to 15, all set as analog input
    GPIOx->ODR = 0; // Default output value is 0
    return;
}

To use this function, we just need to pass the peripheral base pointer to the function:
GPIO_reset(GPIOA); /* Reset GPIO A */
GPIO_reset(GPIOB); /* Reset GPIO B */
...

This method for declaring peripheral registers is used by almost all of the Cortex-M
microcontroller device-driver packages.
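
One way to convince yourself that the structure describes the same registers as the individual address definitions is to compare the member offsets with the addresses used earlier (0x40010800 for CRL up to 0x40010818 for LCKR). The sketch below does this with offsetof. The names prefixed with "Demo" or "demo_" are introduced only for this illustration, "__IO" is written out as "volatile", and the reset function is exercised on an ordinary variable so that it can run on a PC.

```c
#include <stddef.h>
#include <stdint.h>

/* Same layout as GPIO_TypeDef, with __IO written out as volatile. */
typedef struct {
    volatile uint32_t CRL;
    volatile uint32_t CRH;
    volatile uint32_t IDR;
    volatile uint32_t ODR;
    volatile uint32_t BSRR;
    volatile uint32_t BRR;
    volatile uint32_t LCKR;
} DemoGPIO_TypeDef;

/* Each offset must match the low bits of the individual register addresses. */
_Static_assert(offsetof(DemoGPIO_TypeDef, CRL)  == 0x00, "CRL offset");
_Static_assert(offsetof(DemoGPIO_TypeDef, CRH)  == 0x04, "CRH offset");
_Static_assert(offsetof(DemoGPIO_TypeDef, IDR)  == 0x08, "IDR offset");
_Static_assert(offsetof(DemoGPIO_TypeDef, ODR)  == 0x0C, "ODR offset");
_Static_assert(offsetof(DemoGPIO_TypeDef, BSRR) == 0x10, "BSRR offset");
_Static_assert(offsetof(DemoGPIO_TypeDef, BRR)  == 0x14, "BRR offset");
_Static_assert(offsetof(DemoGPIO_TypeDef, LCKR) == 0x18, "LCKR offset");

/* Address of the ODR register of a port located at port_base. */
uint32_t demo_odr_address(uint32_t port_base)
{
    return port_base + (uint32_t)offsetof(DemoGPIO_TypeDef, ODR);
}

/* Same body as GPIO_reset(), usable for any port (or a RAM copy in a test). */
void demo_gpio_reset(DemoGPIO_TypeDef *GPIOx)
{
    GPIOx->CRL = 0;
    GPIOx->CRH = 0;
    GPIOx->ODR = 0;
}
```

Because only one base pointer is passed in, the compiler can reach every register with a small offset from that pointer instead of loading a separate 32-bit address constant for each register, which addresses the code-size concern listed earlier.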

2.8 Microcontroller interfaces
The applications running in the microcontroller connect with the external world using
various peripheral interfaces. While usage of peripheral interfaces is not the main
focus of this book, a few basic examples will be covered. In most cases, you can
use device-driver library software packages from the microcontroller vendors to
simplify the software development, and you can find examples and application notes
on the Internet for such information.
Unlike programming for PCs, most embedded applications do not have a rich GUI.
Some development boards might have an LCD screen, but many others just have a
couple of LEDs and buttons. While the application itself might not require a user
interface, often a simple text-based communication method is very useful for software
development. For example, it can be handy to be able to use printf to display a value
captured by the Analog-to-Digital Converter (ADC) during program execution.
A number of methods can be used to handle such message display:
• Using a character LCD module connected to the I/O pins of the
  microcontroller
• Using a simple UART to communicate with a terminal program running on a PC
• Setting up a USB interface on the microcontroller as a virtual COM port to
  communicate with a terminal program running on a PC
• Using the Instrumentation Trace Macrocell (ITM), a standard debug feature on the
  Cortex®-M3/M4, to communicate with the debugger software

In some cases, a character LCD might be part of the embedded product, so using
this hardware to display information can be convenient. However, the size of the
screen limits the amount of information that can be displayed at a time.
A UART is easy to use, and allows more information to be passed to the developer
quickly. The Cortex-M3/M4 processor does not have a UART as standard, but
most microcontroller vendors have included a UART peripheral in their microcontroller
designs. However, most modern computers do not have a UART interface
(COM port) anymore, so you might need to use a USB-to-UART adaptor cable to
handle this communication. In addition, you need to have a TTL-to-RS232 adaptor
in your development setup to convert the signal's voltage (see Figure 2.12).
In some development boards (e.g., Texas Instruments Stellaris LaunchPad), the
onboard debug adaptor has the feature of converting UART communications to USB.

FIGURE 2.12
Using a UART to communicate with a PC via USB

If the microcontroller you use has a USB interface, you can use this to communicate with a PC using USB. For example, you can use a Virtual COM port solution
for text-based communication with a terminal program running on a computer. It
requires more effort in setting up the software but allows the microcontroller hardware to interface with the PC directly, avoiding the cost of the RS232 adaptors.
If you are using commercial debug adaptors like the Keil ULINK2, Segger J-LINK,
or similar, you can use a feature called Instrumentation Trace Macrocell (ITM) to transfer messages to the debug host (the PC running the debugger) and display the messages
in the development environment. This does not require any extra hardware and does not
require much software overhead. It allows the peripheral interfaces to be free for other
purposes. Examples of using the ITM are covered in Chapter 18.
The technique to redirect text messages from a "printf" (in C language) to specific
hardware (e.g., UART, character LCD, etc.) is commonly referred to as "retargeting."
Retargeting can also be used to handle user inputs and system functions. The
C code for retargeting is toolchain specific. Examples of retargeting for a couple of
development tools will be covered in Chapter 18.
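
The general shape of a retargeting layer can be sketched without committing to any toolchain. In the fragment below every name is hypothetical: demo_uart_putc() stands in for a routine that would write one character to a UART data register, and demo_retarget_write() stands in for whatever low-level output hook the C library of a given toolchain expects (the actual hook name and signature differ between toolchains). So that the sketch can run on a PC, the "UART" simply records the characters in a buffer.

```c
#include <stddef.h>

/* Hypothetical low-level output. On real hardware this would wait until the
   UART transmitter is ready and then write the character to its data
   register; here the characters are logged in RAM instead. */
static char   demo_tx_log[64];
static size_t demo_tx_count;

static void demo_uart_putc(char c)
{
    if (demo_tx_count < sizeof(demo_tx_log)) {
        demo_tx_log[demo_tx_count++] = c;
    }
}

/* Hypothetical retargeting hook: receives the characters produced by the C
   library and forwards them to the chosen hardware. */
int demo_retarget_write(const char *data, int length)
{
    int i;

    for (i = 0; i < length; i++) {
        if (data[i] == '\n') {
            demo_uart_putc('\r');  /* many terminal programs expect CR+LF */
        }
        demo_uart_putc(data[i]);
    }
    return length;                 /* number of characters consumed */
}
```

Swapping demo_uart_putc() for a routine that drives a character LCD or the ITM changes the output device without touching the application code that calls printf.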

2.9 The Cortex® microcontroller software interface standard (CMSIS)
2.9.1 Introduction to CMSIS

Earlier in this chapter we mentioned CMSIS. CMSIS was developed by ARM® to
allow microcontroller and software vendors to use a consistent software infrastructure
to develop software solutions for Cortex®-M microcontrollers. As such, you can see
that many software products for Cortex-M microcontrollers are CMSIS-compliant.

Currently the Cortex-M microcontroller market comprises:
• More than 15 microcontroller vendors shipping Cortex-M microcontroller
  products (see section 1.1.4 for the list of Cortex-M3 and Cortex-M4 microcontroller
  vendors), with some other silicon vendors providing Cortex-M based
  FPGAs and ASICs
• More than 10 toolchain vendors
• More than 30 embedded operating systems
• Additional Cortex-M middleware software providers for codecs, communication
  protocol stacks, etc.

With such a large ecosystem, some form of standardization of the way the software
infrastructure works becomes necessary to ensure software compatibility with
various development tools and between different software solutions.
At the same time, embedded systems are also becoming more and more complex,
and the amount of effort in developing and testing the software has increased
substantially. In order to reduce development time as well as the risk of having
defects in products, software reuse is becoming more and more common. In addition,
the complexity of the embedded systems has also increased the use of third-party
software solutions. For example, an embedded software project might involve
software components from many different sources:
• Software developed by in-house developers
• Software reused from other projects
• Device-driver libraries from microcontroller vendors
• Embedded OSs
• Other third-party software products such as communication protocol stacks

In such scenarios, the interoperability of various software components becomes
critical. For all these reasons, ARM worked with various microcontroller vendors,
tools vendors, and software solution providers to develop CMSIS, a software framework covering most Cortex-M processors and Cortex-M microcontroller products.
The aims of CMSIS include:
• Enhanced software reusability - makes it easier to reuse software code in
  different Cortex-M projects, reducing time to market and verification efforts.
• Enhanced software compatibility - by having a consistent software infrastructure
  (e.g., API for processor core access functions, system initialization method,
  common style for defining peripherals), software from various sources can work
  together, reducing the risk in integration.
• Easy to learn - the CMSIS allows easy access to processor core features
  from the C language. In addition, once you learn to use one Cortex-M
  microcontroller product, starting to use another Cortex-M product is much easier
  because of the consistency in software setup.
• Toolchain independent - CMSIS-compliant device drivers can be used with
  various compilation tools, providing much greater freedom.
• Openness - the source code for CMSIS core files can be downloaded and
  accessed by everyone, and everyone can develop software products with CMSIS.

CMSIS is an evolving project. It started out as a way to establish consistency in
device-driver libraries for the Cortex-M microcontrollers, and this has become
CMSIS-Core. Since then additional CMSIS projects have started:
• CMSIS-Core (Cortex-M processor support) - a set of APIs for application or
  middleware developers to access the features on the Cortex-M processor regardless
  of the microcontroller devices or toolchain used. Currently the CMSIS processor
  support includes the Cortex-M0, Cortex-M0+, Cortex-M3, and Cortex-M4 processors
  and SecurCore products like SC000 and SC300. Users of the Cortex-M1
  can use the Cortex-M0 version because they share the same architecture.
• CMSIS-DSP library - in 2010 the CMSIS DSP library was released, supporting
  many common DSP operations such as FFT and filters. The CMSIS-DSP is
  intended to allow software developers to create DSP applications on Cortex-M
  microcontrollers easily.
• CMSIS-SVD - the CMSIS System View Description is an XML-based file
  format to describe the peripheral set in microcontroller products. Debug tool vendors
  can then use the CMSIS SVD files prepared by the microcontroller vendors to
  construct peripheral viewers quickly.
• CMSIS-RTOS - the CMSIS-RTOS is an API specification for embedded OSs
  running on Cortex-M microcontrollers. This allows middleware and application
  code to be developed for multiple embedded OS platforms, and allows better
  reusability and portability.
• CMSIS-DAP - the CMSIS-DAP (Debug Access Port) is a reference design for a
  debug interface adaptor, which supports USB to JTAG/Serial protocol conversions.
  This allows low-cost debug adaptors to be developed which work for
  multiple development toolchains.

In this chapter we will first look at the processor support in CMSIS
(CMSIS-Core). The CMSIS DSP library will be covered in Chapter 22. The
CMSIS-SVD and CMSIS-DAP topics are beyond the scope of this book.

2.9.2 Areas of standardization in CMSIS-Core
From a software development point of view, the CMSIS-Core standardizes a number
of areas:
Standardized definitions for the processor's peripherals - These include the registers
in the Nested Vectored Interrupt Controller (NVIC), a system tick timer in the
processor (SysTick), an optional Memory Protection Unit (MPU), various programmable
registers in the System Control Block (SCB), and some software programmable
registers related to debug features. Note: Some of the registers in the Cortex®-M4
are not available in the Cortex-M3, and similarly, some registers in the Cortex-M3
and Cortex-M4 are not available in the Cortex-M0.

Standardized access functions to access processor's features - These include
various functions for interrupt control using the NVIC, and functions for accessing
special registers in the processors. It is still possible to access the registers directly if
needed, but for general programming, using the access functions (sometimes
referred to as the Application Programming Interface, API, in some literature) can help
software portability. More details of these functions are covered in Appendix E.
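
To give a feel for what such an access function hides, the sketch below models two CMSIS-Core functions, NVIC_EnableIRQ() and NVIC_SetPriority(), on a simplified copy of the NVIC registers held in RAM. It is not the CMSIS source code: the type DemoNVIC_t, the function names, and the assumed four implemented priority bits are all illustrative (the number of priority bits depends on the device).

```c
#include <stdint.h>

/* Simplified model of two NVIC register groups (not the full register map). */
typedef struct {
    volatile uint32_t ISER[8];  /* interrupt set-enable: 32 IRQs per word */
    volatile uint8_t  IP[240];  /* one priority byte per IRQ              */
} DemoNVIC_t;

#define DEMO_PRIO_BITS 4U       /* assumed number of implemented bits */

void demo_enable_irq(DemoNVIC_t *nvic, uint32_t irq)
{
    /* Word index is irq / 32, bit index is irq % 32. The real register is
       write-1-to-set, so hardware needs only a plain write; this RAM model
       uses |= to get the same effect. */
    nvic->ISER[irq >> 5] |= 1UL << (irq & 0x1FU);
}

void demo_set_priority(DemoNVIC_t *nvic, uint32_t irq, uint32_t priority)
{
    /* The implemented priority bits sit at the top of each priority byte. */
    nvic->IP[irq] = (uint8_t)((priority << (8U - DEMO_PRIO_BITS)) & 0xFFU);
}
```

Application code that calls the access function does not need to know which word and bit belong to an interrupt, or how many priority bits a particular device implements; those details stay inside the CMSIS-Core files.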
Standardized functions for accessing special instructions easily - The Cortex-M
processors support a number of instructions for special purposes (e.g., Wait-For-Interrupt,
WFI, for entering sleep mode). These instructions cannot be generated
using the generic ISO/IEC C language (footnote 1). Instead, CMSIS implements a set of functions
to allow these instructions to be accessed within C program code. Without these
functions, the users would have to rely on toolchain-specific solutions such
as intrinsic functions or inline assembly to inject special instructions into the application,
which makes the software less reusable and might require certain in-depth
knowledge of the toolchain in order to handle them correctly. CMSIS provides a
standardized API for these features so that they can be easily used by application
developers.
Standardized function names for system exception handlers - A number of system
exception types are presented in the architecture for the Cortex-M processors.
By giving the corresponding system exception handlers standardized names, it
makes it much easier to develop software solutions that can be applied to multiple
Cortex-M products. This is especially important for embedded OS developers, as
the embedded OS requires the use of several types of system exception.
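
As an example of a standardized handler name, CMSIS uses SysTick_Handler for the SysTick exception. The sketch below assumes that the startup code of the device places this name in the vector table (typically as a weak symbol that an application definition replaces); the tick counter and the helper demo_ticks_since() are hypothetical.

```c
#include <stdint.h>

/* Tick counter shared between the handler and the application, so it is
   declared volatile. */
static volatile uint32_t demo_tick_count;

/* Defining a function with exactly this name is normally all that is needed
   to install it as the SysTick exception handler. */
void SysTick_Handler(void)
{
    demo_tick_count++;
}

/* Number of ticks elapsed since an earlier reading of the counter. */
uint32_t demo_ticks_since(uint32_t start)
{
    return demo_tick_count - start;
}
```

Because the name is the same on every CMSIS-compliant device, an embedded OS or a timing library can provide this handler once and reuse it across Cortex-M products.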
Standardized functions for system initialization - Most modern feature-rich
microcontroller products require some configuration of clock circuitry and
power management registers before the application starts. In CMSIS-compliant
device-driver libraries, these configuration steps are placed in a function called
"SystemInit()." Obviously, the actual implementation of this function is device
specific and might need adaptation for various project requirements. However,
having a standardized function name, a standardized way that this function is used,
and a standardized location where this function can be found makes it much
easier for a designer to pick up and start using a new Cortex-M microcontroller device.
Standardized software variables for clock speed information - This might not be
obvious, but often our application code does need to know what clock frequency the
system is running at. For example, such information might be needed for setting up
the baud rate divider in a UART, or to initialize the SysTick timer for an embedded
OS. A software variable called "SystemCoreClock" (in CMSIS 1.3 or newer versions;
"SystemFrequency" in older versions of CMSIS) is defined in the CMSIS-Core.
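
The two uses mentioned above come down to simple arithmetic on the clock value, as the following sketch shows. The variable defined here is only a stand-in for the real "SystemCoreClock" (on a device it is supplied by the device-driver library), 72 MHz is an assumed frequency, and the UART model, a baud generator that divides the core clock by an integer, is hypothetical.

```c
#include <stdint.h>

/* Stand-in for the CMSIS variable; do not define this yourself in a project
   that already uses a CMSIS device-driver library. 72 MHz is assumed. */
uint32_t SystemCoreClock = 72000000UL;

/* Reload value for a periodic tick. The SysTick counter counts down from the
   reload value to zero, so one period lasts (reload + 1) clock cycles. The
   reload register is 24 bits wide, so the result must not exceed 0xFFFFFF. */
uint32_t demo_systick_reload(uint32_t core_clock_hz, uint32_t tick_hz)
{
    return (core_clock_hz / tick_hz) - 1UL;
}

/* Integer baud rate divider, rounded to the nearest value. */
uint32_t demo_uart_divider(uint32_t core_clock_hz, uint32_t baud)
{
    return (core_clock_hz + (baud / 2UL)) / baud;
}
```

With the assumed 72 MHz clock, a 1 kHz tick needs a reload value of 71999, and 115200 baud needs a divider of 625. Reading the frequency from the variable instead of hard-coding it keeps such code correct when the clock configuration changes.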
In addition, the CMSIS-Core also provides:
A common platform for device-driver libraries - Each device-driver library has
the same look and feel, making it easier for beginners to learn how to use the devices.
This also makes it easier for software developers to develop software for multiple
Cortex-M microcontroller products.

1 The C language is specified in the standard document "ISO/IEC 9899" and C++ in "ISO/IEC 14882," both prepared by the International Organization for Standardization (ISO) and the International Electrotechnical Commission (IEC).

2.9.3 Organization of CMSIS-Core
The CMSIS files are integrated into device-driver library packages from microcontroller
vendors. Some of the files in the device-driver library are prepared by ARM®
and are common to various microcontroller vendors. Other files are vendor/device
specific. In a general sense, we can divide CMSIS into multiple layers:
• Core Peripheral Access Layer - Name definitions, address definitions, and
  helper functions to access core registers and core peripherals. This is processor
  specific and is provided by ARM.
• Device Peripheral Access Layer - Name definitions, address definitions of
  peripheral registers, as well as system implementations including interrupt
  assignments, exception vector definitions, etc. This is device specific (note:
  multiple devices from the same vendor might use the same file set).
• Access Functions for Peripherals - The driver code for peripheral accesses. This
  is vendor specific and is optional. You can choose to develop your application
  using the peripheral driver code provided by the microcontroller vendor, or you
  can program the peripherals directly if you prefer.

There is also a proposed additional layer for peripheral accesses:
Middleware Access Layer - This layer does not exist in the current version of
CMSIS. The idea is to develop a set of APIs for interfacing with common peripherals
such as UART, SPI, and Ethernet. If this layer exists, developers of middleware
can develop their applications based on this layer to allow software to be ported
between devices easily.
The roles of the various layers are summarized in Figure 2.13.
Note that in some cases, the device-driver libraries might contain additional
vendor-specific functions for the NVIC implemented by the microcontroller vendor.
The aim of CMSIS is to provide a common starting point, and the microcontroller vendors can add additional functions if they prefer. But software using these functions will
need porting if the software design is to be reused on another microcontroller product.

2.9.4 How do I use CMSIS-Core?
The CMSIS files are included in the device-driver packages provided by the microcontroller vendors. So when you are using CMSIS-compliant device-driver libraries
provided by the microcontroller vendors, you are already using CMSIS.
Typically, you need to do the following.
• Add source files to the project. This includes:
  • Device-specific, toolchain-specific startup code, in the form of assembly or C
  • Device-specific device initialization code (e.g., system_<device>.c)
  • Additional vendor-specific source files for peripheral access functions. This is
    optional.
  • For CMSIS 2.00 or older versions of CMSIS-Core libraries, you might also
    need to add a processor-specific C program file (e.g., core_cm3.c) to the
    project for some of the core register access functions. This is not required
    from CMSIS-Core 2.10.
• Add header files into the search path of the project. This includes:
  • A device-specific header file for peripheral register definitions and interrupt
    assignment definitions (e.g., <device>.h)
  • A device-specific header file for functions in the device initialization code (e.g.,
    system_<device>.h)
  • A number of processor-specific header files (e.g., core_cm3.h, core_cm4.h;
    they are generic for all microcontroller vendors)
  • Optionally, additional vendor-specific header files for peripheral access
    functions
  • In some cases the development suites might also have some of the generic
    CMSIS support files pre-installed.

FIGURE 2.13
CMSIS-Core structure

Figure 2.14 shows a typical project setup using a CMSIS device-driver package.
Inside the device-driver package obtained from the microcontroller vendor, you will
find the various files you need, including the CMSIS generic files. The names of
some of these files depend on the actual microcontroller device name chosen by
the microcontroller vendor (indicated as <device> in the diagram).
When the device-specific header file is included in the application code,
it automatically includes additional header files. You therefore need to set up the
project search path for the header files in order to compile the project correctly.

FIGURE 2.14
Using CMSIS-Core in a project
[Diagram: the project contains the startup code (including the vector table), the application code (which includes <device>.h), system_<device>.c, and the peripheral driver files. The CMSIS-compliant device-driver library supplies startup files for different toolchains; <device>.h (interrupt number and peripheral register definitions); the core peripheral access layer headers core_cm0.h / core_cm0plus.h / core_cm3.h / core_cm4.h, core_cmFunc.h (special register access functions, CMSIS v2), core_cmInstr.h (special instruction access functions, CMSIS v2), and core_cm4_simd.h (Cortex-M4 SIMD instructions, CMSIS v2); system_<device>.h and system_<device>.c (system functions including initialization); and the peripheral driver code. The files core_cm0.c / core_cm3.c / core_cm4.c are needed for CMSIS 1.x to 2.0 only.]

In some cases, the Integrated Development Environment (IDE) automatically
sets up the startup code for you when you create a new project. Otherwise
you just need to add the startup code from the device-driver library to the project
manually. Startup code is required for the starting sequence of the processor, and
it also includes the exception vector table definition that is required for interrupt
handling.

2.9.5 Benefits of CMSIS-Core
So what does CMSIS mean to users?
The main advantage is much better software portability and reusability:
• A project for a Cortex®-M microcontroller device can be migrated to another
  device from the same vendor with a different Cortex-M processor very easily.
  Often microcontroller vendors provide devices with Cortex-M0/M0+/M3/M4
  processors that have the same peripherals and the same pinout, and the change
  required is just replacing a couple of CMSIS files in the project.
• CMSIS-Core makes it easier for a Cortex-M microcontroller project to be
  migrated to another device from a different vendor. Obviously, peripheral setup
  and access code will need to be modified, but processor core access functions are
  based on the same CMSIS source code and do not require changes.

CMSIS allows software to be much more future proof because embedded software
developed today can be reused on other Cortex-M products in the future.

The CMSIS-Core also allows faster time to market because:
• It is easier to reuse software code from previous projects.
• Since all CMSIS-compliant device drivers have a similar structure, learning to
  use a new Cortex-M microcontroller is much easier.
• The CMSIS code has been tested by many silicon vendors and software
  developers around the world. It is compliant with the Motor Industry Software
  Reliability Association (MISRA) guidelines. Therefore it reduces the validation effort
  required, as there is no need to develop and test your own processor feature
  access functions.
• Starting from CMSIS 2.0, a DSP library is included that provides tested, optimized
  DSP functions. The DSP library code is available as a free download and
  can be used by software developers free of charge.

There are also a number of other advantages:

• CMSIS is supported by multiple compiler toolchain vendors.
• CMSIS has a small memory footprint (less than 1KB for all core access functions
  and a few bytes of RAM for several variables).
• CMSIS files contain Doxygen tags (http://www.doxygen.org) to enable easy
  automatic generation of documentation.

FIGURE 2.15
CMSIS-Core avoids the need for middleware or OS to carry their own driver code

For developers of embedded OS and middleware, the advantage of CMSIS is
significant:
• By using processor core access functions from CMSIS, embedded OSs and
  middleware can work with device-driver libraries from various microcontroller
  vendors, including future products that are yet to be released.
• Since CMSIS is designed to work with various toolchains, many software
  products can be designed to be toolchain independent.
• Without CMSIS, middleware might need to include a small set of driver functions
  for accessing processor peripherals such as the interrupt controller. Such an
  arrangement increases the program size, and might cause compatibility issues
  with other software products (Figure 2.15).

2.9.6 Various versions of CMSIS
The CMSIS project is evolving. Over the last few years several versions of CMSIS
have been released, bringing wider processor support and improvements. Apart from
coding improvement, there have also been a number of other changes:
Version   Main Changes

1.0       Nov 2008
          Initial release. Supports the Cortex®-M3 processor only.

1.10      Feb 2009
          Support for Cortex-M0 added.

1.20      May 2009
          Add support for TASKING compiler.
          Add more functions to manage priority settings in NVIC.

1.30      Oct 2009
          The system initialization function SystemInit() is called in startup code
          instead of at the beginning of main().
          SystemFrequency variable renamed to SystemCoreClock to reflect the
          processor clock definition. Additional function "void
          SystemCoreClockUpdate(void)" added.
          Add support for data receive for debug communication. (Previous versions
          use ITM for data output in debug communication.)
          Add bit definitions for processor's peripheral registers.
          Directory structure changed.

2.0       Nov 2010
          Support for Cortex-M4 added.
          Included a CMSIS DSP library (CMSIS-DSP) for Cortex-M4 and Cortex-M3.
          New header files core_cm4_simd.h, core_cmFunc.h and core_cmInstr.h
          introduced, with a number of core access functions moved to
          these files and inlined.
          Add CMSIS System View Description.

2.10      July 2011
          CMSIS-DSP library for Cortex-M0 added.
          Added big endian support for DSP library.
          Directory structure simplified.
          Processor-specific C program files (e.g., core_cm3.c, core_cm4.c) are
          no longer required and are removed.
          Reworded CMSIS-DSP library example.
          Documentation update.

3.0       October 2011
          Added support for GNU Tools for ARM Embedded Processors.
          Added function __ROR.
          Added Register Mapping for TPIU, DWT.
          Added support for SC000 and SC300 processors.
          Corrected ITM_SendChar function.
          Corrected the functions __STREXB, __STREXH, __STREXW for
          the GNU GCC compiler section.
          Documentation restructured.

3.01      March 2012
          Added support for Cortex-M0+ processor.
          Integration of CMSIS DSP Library version 1.1.0.

In normal cases, embedded applications can work with different versions of the
CMSIS source files without problems. Most microcontroller vendors keep their
device-driver library up to date with the most recent versions of CMSIS, but there
is always the chance that the device-driver library package from microcontroller
vendors could be a couple of releases behind the latest CMSIS version. This is
not usually a problem, as the functionalities of the driver functions remain
unchanged.

In a few cases, application code might need to be updated to allow it to be used
with a newer version of the CMSIS driver package (e.g., when the “SystemFrequency”
variable is used, which is replaced by “SystemCoreClock” from CMSIS 1.3).
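
Where older application code cannot be edited immediately, one possible stopgap is a compatibility alias, sketched below. This is not something CMSIS provides: the macro is an illustration only, the variable definition is a stand-in for the one in the device-driver library, and 48 MHz is an assumed clock frequency.

```c
#include <stdint.h>

/* Stand-in for the variable provided by a CMSIS 1.3 (or newer) device-driver
   library; 48 MHz is an assumed value. */
uint32_t SystemCoreClock = 48000000UL;

/* Compatibility alias: code written for older CMSIS versions keeps building
   until it is updated to use the new name directly. */
#define SystemFrequency SystemCoreClock

/* Example of legacy code that still refers to the old variable name. */
uint32_t demo_cycles_per_ms(void)
{
    return SystemFrequency / 1000UL;
}
```

Renaming the variable in the application source remains the cleaner long-term fix; the alias only bridges the transition.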
You can download the latest version of the CMSIS source package from
http://www.arm.com/cmsis.

CHAPTER 3

Technical Overview

CHAPTER OUTLINE

3.1 General information about the Cortex®-M3 and Cortex-M4 processors
    3.1.1 Processor type
    3.1.2 Processor architecture
    3.1.3 Instruction set
    3.1.4 Block diagram
    3.1.5 Memory system
    3.1.6 Interrupt and exception support
3.2 Features of the Cortex®-M3 and Cortex-M4 processors
    3.2.1 Performance
    3.2.2 Code density
    3.2.3 Low power
    3.2.4 Memory system
    3.2.5 Memory protection unit
    3.2.6 Interrupt handling
    3.2.7 OS support and system level features
    3.2.8 Cortex®-M4 specific features
    3.2.9 Ease of use
    3.2.10 Debug support
    3.2.11 Scalability
    3.2.12 Compatibility

3.1 General information about the Cortex®-M3 and Cortex-M4 processors
3.1.1 Processor type

All the ARM® Cortex®-M processors are 32-bit RISC (Reduced Instruction Set
Computing) processors. They have:

• 32-bit registers
• 32-bit internal data path
• 32-bit bus interface

The Definitive Guide to ARM® Cortex®-M3 and Cortex-M4 Processors. http://dx.doi.org/10.1016/B978-0-12-408082-9.00003-8
Copyright © 2014 Elsevier Inc. All rights reserved.

In addition to 32-bit data, the Cortex-M processors (as well as any other ARM
processors) can also handle 8-bit and 16-bit data efficiently. The Cortex-M3 and
M4 processors also support a number of operations involving 64-bit data (e.g.,
multiply, accumulate).
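
The mix of data sizes mentioned above can be illustrated with a short C sketch. The function names mac64 and sum_samples are hypothetical and chosen only for this illustration. On a Cortex-M3 or Cortex-M4, a compiler can typically map the 64-bit accumulate to a long multiply-accumulate instruction, but the exact code generated depends on the toolchain and its options.

```c
#include <stdint.h>

/* 32 x 32 -> 64-bit multiply-accumulate (hypothetical helper).
   The operands are widened before the multiply so that the full
   64-bit product is kept. */
static int64_t mac64(int64_t acc, int32_t a, int32_t b)
{
    return acc + (int64_t)a * (int64_t)b;
}

/* Sum 16-bit samples into a 32-bit total. Each halfword is widened
   to 32 bits before it is added (hypothetical helper). */
static int32_t sum_samples(const int16_t *samples, uint32_t count)
{
    int32_t total = 0;
    for (uint32_t i = 0; i < count; i++) {
        total += samples[i];
    }
    return total;
}
```

For example, mac64(10, 100000, 100000) keeps the full 64-bit result, which would not fit in a 32-bit register.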
The Cortex-M3 and Cortex-M4 processors both have a three-stage pipeline
design (instruction fetch, decode, and execution), and both have a Harvard bus architecture, which allows simultaneous instruction fetches and data accesses.
The memory system of the ARM Cortex-M processors uses 32-bit addressing,
which allows a maximum 4GB address space. The memory map is unified, which
means that although there can be multiple bus interfaces, there is only one 4GB
memory space. The memory space is used by the program code, data, peripherals,
and some of the debug support components inside the processors.
Like all other ARM processors, the Cortex-M processors are based on a
load-store architecture. This means data needs to be loaded from memory, processed, and then written back to memory using a number of separate instructions.
For example, to increment a data value stored in SRAM, the processor needs to
use one instruction to read the data from SRAM and put it in a register inside the
processor, a second instruction to increment the value of the register, and then a third
instruction to write the value back to memory. The details of the registers inside the
processors are commonly known as a programmer’s model.
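
The three steps of the increment example can be written out explicitly in C. This is only a sketch: the variable counter_in_sram is a hypothetical stand-in for a data value in SRAM, and the instructions named in the comments are what a compiler would typically generate, not a guaranteed output.

```c
#include <stdint.h>

/* Hypothetical data value that lives in memory (SRAM on a real device) */
static volatile uint32_t counter_in_sram = 41u;

/* Increment written as three separate steps to mirror a load-store
   architecture: load, modify in a register, store. */
static uint32_t increment_counter(void)
{
    uint32_t reg;

    reg = counter_in_sram;   /* step 1: load the value into a register (e.g., LDR) */
    reg = reg + 1u;          /* step 2: increment the register (e.g., ADDS) */
    counter_in_sram = reg;   /* step 3: store the result back to memory (e.g., STR) */

    return reg;
}
```

In ordinary C code you would simply write counter_in_sram++, and the compiler generates the same load, modify, and store sequence for you.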

3.1.2 Processor architecture
As explained in Chapter 1, the processor is only a part of a microcontroller chip. The
memory system, peripherals, and various interface features are developed by the
microcontroller vendors. As a result, you can find Cortex®-M processors being
used in a wide range of devices, from low-cost microcontroller products to high-end multi-processor products. But these devices share the same architecture. In
ARM® processors, the term architecture can refer to two areas:

• Architecture: Instruction Set Architecture (ISA), programmer’s model (what the
  software sees), and debug methodology (what the debugger sees).
• Micro-architecture: Implementation-specific details such as interface signals,
  instruction execution timing, pipeline stages. Micro-architecture is processor
  design-specific.

Various versions of the ARM Architecture exist for the different ARM processors released over the years. For example, the Cortex-M3 and Cortex-M4 processors
are both implementations of ARMv7-M Architecture. An Instruction Set Architecture can be implemented with various implementations of micro-architecture; for
example, different numbers of pipeline stages, different types of bus interface
protocol, etc.
The details of the ARMv7-M architecture are documented in the ARMv7-M Architecture Reference Manual (also known as ARMv7-M ARM). This document
covers:

• Instruction set details
• Programmer’s model
• Exception model
• Memory model
• Debug architecture

This document can be obtained from ARM after a simple registration process.
However, for general programming, it is not necessary to have the full architecture
reference manual. ARM provides alternative documents for software developers
called Cortex-M3/M4/M0 Devices Generic User Guides. These can be found on the
ARM website:
http://infocenter.arm.com
→ Cortex-M series processors
→ Cortex-M0/M0+/M3/M4
→ Revision number
→ Cortex-M4/M3/M0/M0+ Devices Generic User Guide
Some of the micro-architecture information such as instruction execution timing
information can be found in the Technical Reference Manuals (TRM) of the Cortex-M processors, which can be found on the ARM website. Other micro-architecture
information like the processor interface details are documented in other Cortex-M product documentation, which is normally accessible only by silicon chip
designers.
Theoretically, a software developer does not necessarily need to know anything
about the micro-architecture to develop software for the Cortex-M products. But in
some cases, knowing some of the micro-architecture details could help. This is
particularly true for optimizing software or even C compilers for best performance.

3.1.3 Instruction set
The instruction set used by the Cortex®-M processors is called Thumb (this covers
both the 16-bit Thumb instructions and the newer 32-bit Thumb® instructions). The
Cortex-M3 and Cortex-M4 processors incorporate Thumb®-2 Technology,¹ which
allows a mixture of 16-bit and 32-bit instructions for high code density and high
efficiency.
In classic ARM® processors, for example, the ARM7TDMI™, the processor has
two operation states: a 32-bit ARM state and a 16-bit Thumb state. In the ARM state,
the instructions are 32-bit and the core can execute all supported instructions with
very high performance. In Thumb state, the instructions are 16-bit, which provides
excellent code density, but Thumb instructions do not have all the functionality of
ARM instructions and more instructions may be needed to complete certain types
of operation.
To get the best of both worlds, many applications for classic ARM processors
have mixed ARM and Thumb code. However, the mixed-code arrangement does
not always work ideally. There is overhead (in terms of both execution time and
instruction count; see Figure 3.1) to switch between the states, and the separation
of two states can increase the complexity of the software compilation process and
make it harder for inexperienced developers to optimize the software.

FIGURE 3.1
Switching between ARM code and Thumb code in classic ARM processors such as the
ARM7TDMI

¹ From a trademark point of view, “Thumb-2” is a technology to support a mixture of 16-bit and 32-bit
Thumb instructions. Officially the whole instruction set is called “Thumb.”

With the introduction of Thumb-2 technology, the Thumb instruction set has
been extended to support both 16-bit and 32-bit instruction encoding. It is now
possible to handle all processing requirements without switching between the
two different operation states. In fact, the Cortex-M processors do not support
32-bit ARM instructions at all (Figure 3.2). Even interrupt processing is handled
entirely in Thumb state, whereas in classic ARM processors interrupt handlers
are entered in ARM state. With Thumb-2 technology, the Cortex-M processor
has a number of advantages over classic ARM processors, such as:

• No state switching overhead, saving both execution time and instruction space.
• No need to specify ARM state or Thumb state in source files, making software
  development easier.
• It is easier to get the best code density, efficiency, and performance at the same
  time.
• With Thumb-2 technology, the Thumb instruction set has been extended by a
  wide margin when compared to a classic processor like the ARM7TDMI. Note
  that although all of the Cortex-M processors support Thumb-2 technology, they
  implement various subsets of the Thumb ISA (Table 3.1).

FIGURE 3.2
Instruction set comparison between Cortex-M processors and ARM7TDMI

Table 3.1 Range of Instructions in Different Cortex-M Processors

Instruction Groups                       Cortex-M0, M1   Cortex-M3   Cortex-M4   Cortex-M4 with FPU²
16-bit ARMv6-M instructions              Yes             Yes         Yes         Yes
32-bit Branch with Link instruction      Yes             Yes         Yes         Yes
32-bit system instructions               Yes             Yes         Yes         Yes
16-bit ARMv7-M instructions              -               Yes         Yes         Yes
32-bit ARMv7-M instructions              -               Yes         Yes         Yes
DSP extensions                           -               -           Yes         Yes
Floating point instructions              -               -           -           Yes

Some instructions defined in the Thumb instruction set are not available in the
current Cortex-M processors. For example, the co-processor instructions are not supported (though separate memory-mapped data processing engines could be added).
Also, a few other Thumb instructions from classic ARM processors are not supported, such as Branch with Link and Exchange (BLX) with immediate (used to
switch processor state from Thumb to ARM), a couple of Change Processor State
(CPS) instructions, and the SETEND (Set Endian) instruction, which were introduced in architecture v6. For a complete list of supported instructions, refer to
Appendix A.
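
Because 16-bit and 32-bit instructions are mixed in the same instruction stream, the size of each instruction has to be determined from its first halfword. The sketch below assumes the encoding rule given in the ARMv7-M Architecture Reference Manual (it is not described in this chapter): if bits [15:11] of the first halfword are 0b11101, 0b11110, or 0b11111, the halfword is the first half of a 32-bit instruction; otherwise it is a complete 16-bit instruction. The function names are hypothetical.

```c
#include <stdint.h>
#include <stdbool.h>

/* Returns true if the halfword is the first half of a 32-bit Thumb
   instruction, based on bits [15:11] (rule assumed from the ARMv7-M
   Architecture Reference Manual). */
static bool thumb_is_32bit(uint16_t first_halfword)
{
    uint16_t top5 = (uint16_t)(first_halfword >> 11);
    return (top5 == 0x1Du) || (top5 == 0x1Eu) || (top5 == 0x1Fu);
}

/* Instruction size in bytes: 4 for a 32-bit encoding, 2 otherwise */
static uint32_t thumb_instruction_size(uint16_t first_halfword)
{
    return thumb_is_32bit(first_halfword) ? 4u : 2u;
}
```

For example, 0xBF00 (a 16-bit NOP) is reported as 2 bytes, while 0xF000 (the first halfword of a 32-bit Branch with Link) is reported as 4 bytes. No state switching is involved in either case.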

3.1.4 Block diagram
From a high-level point of view, Cortex®-M3 and Cortex-M4 are very similar to
each other. Although there are significant differences in the internal data path
designs, some parts of the processors such as instruction fetch buffer, parts of the
instruction decode and execution stages, and the NVIC are similar to each other.
In addition, the components outside the “core” level are almost identical.
The Cortex-M3 and the Cortex-M4 processors contain the core of the processor,
the Nested Vectored Interrupt Controller (NVIC), the SysTick timer, and optionally
the floating point unit (for Cortex-M4). Apart from these, the processors also contain
some internal bus systems, an optional Memory Protection Unit (MPU), and a set of
components to support software debug operations. The internal bus interconnect is
needed to route transfers from the processor and the debugger to various parts of the
design.

FIGURE 3.3
Block diagram of the Cortex-M3 and Cortex-M4 processor

The Cortex-M3 and Cortex-M4 processors are highly configurable. For example,
the debug features are optional, allowing system-on-chip designers to remove debug
components if debug support is not required in the product. This allows the silicon
area of the design to be reduced significantly. In some cases, silicon designers can
also choose to reduce the number of hardware instruction breakpoint and data watchpoint comparators to reduce the gate count. Many system features like the number of
interrupt inputs, number of interrupt priority levels supported, and the MPU are also
configurable.
The integration level in Figure 3.3 is a reference design ARM® provides
to silicon designers. This integration level can be modified by the silicon vendors to customize debug support such as the debug interface and to support
device-specific low-power features (e.g., adding a customized Wake-up Interrupt
Controller).
The top level of the Cortex-M3 and Cortex-M4 processors has a number of bus
interfaces, as shown in Table 3.2.
² In many ARM documents, and in command line option switches for some C compilers, a Cortex®-M4
processor with the floating point unit is referred to as Cortex-M4F.

Table 3.2 Various Bus Interfaces on the Cortex-M3 and Cortex-M4 Processors

Bus Interface   Descriptions
I-CODE          Primarily for program memory: Instruction fetch and vector fetch for
                address 0x0 to 0x1FFFFFFF. Based on AMBA 3.0 AHB Lite bus
                protocol.
D-CODE          Primarily for program memory: Data and debugger accesses for
                address 0x0 to 0x1FFFFFFF. Based on AMBA 3.0 AHB Lite bus
                protocol.
System          Primarily for RAM and peripherals: Any accesses from address
                0x20000000 to 0xFFFFFFFF (apart from PPB regions). Based on
                AMBA 3.0 AHB Lite bus protocol.
PPB             External Private Peripheral Bus (PPB): For private debug components
                on system level from address 0xE0040000 to 0xE00FFFFF. Based on
                AMBA 3.0 APB protocol.
DAP             Debug Access Port (DAP) interface: For debugger accesses generated
                from the debug interface module to any memory locations including
                system memory and debug components. Based on the ARM
                CoreSight™ debug architecture.
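
The address ranges in Table 3.2 can be expressed as a simple decode function. This is a software sketch only, not a description of the hardware implementation. The enum and function names are hypothetical, and the internal PPB range (0xE0000000 to 0xE003FFFF) is an assumption taken from general ARM documentation rather than from the table.

```c
#include <stdint.h>

/* Hypothetical labels for where an access is routed */
typedef enum {
    BUS_CODE,          /* I-CODE for instruction/vector fetch, D-CODE for data */
    BUS_SYSTEM,        /* RAM and peripherals */
    BUS_INTERNAL_PPB,  /* components inside the processor (assumed range) */
    BUS_EXTERNAL_PPB   /* external Private Peripheral Bus */
} bus_region_t;

/* Map an address to a bus interface using the ranges in Table 3.2 */
static bus_region_t decode_bus(uint32_t addr)
{
    if (addr <= 0x1FFFFFFFu) {
        return BUS_CODE;
    }
    if (addr >= 0xE0040000u && addr <= 0xE00FFFFFu) {
        return BUS_EXTERNAL_PPB;
    }
    if (addr >= 0xE0000000u && addr <= 0xE003FFFFu) {
        return BUS_INTERNAL_PPB;   /* assumption: not listed in Table 3.2 */
    }
    return BUS_SYSTEM;             /* 0x20000000 to 0xFFFFFFFF apart from PPB */
}
```

Although the accesses are routed to different bus interfaces, software sees a single unified 4GB address space.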

3.1.5 Memory system
The Cortex®-M3 and M4 processors themselves do not include memories (i.e.,
they do not have program memory, SRAM, or cache). Instead, they come with a
generic on-chip bus interface, so microcontroller vendors can add their own memory system to their design. Typically, the microcontroller vendor will need to add
the following items to the memory system:

• Program memory, typically flash
• Data memory, typically SRAM
• Peripherals

In this way, different microcontroller products can have different memory configurations, different memory sizes and types, and different peripherals.
The bus interfaces on the Cortex-M processors are 32-bit, and based on
the Advanced Microcontroller Bus Architecture (AMBA®) standard. AMBA contains a collection of several bus protocol specifications. The AMBA specifications can be downloaded from the ARM® website, and any silicon designer
can freely use these protocol standards. Due to the low hardware cost,
efficiency, and openness of these standards, they are very popular among silicon
designers.
The main bus interface protocol used by the Cortex-M3 and M4 processors is
AHB Lite (Advanced High-performance Bus), which is used on the program
memory and system bus interfaces. The AHB Lite protocol is a pipelined bus
protocol, allowing high operation frequency and low hardware area cost.
Another bus protocol used is the Advanced Peripheral Bus (APB) interface,
commonly used in the peripheral systems of ARM®-based microcontrollers. In
addition, the APB protocol is used inside the Cortex-M3 and Cortex-M4 processors for debug support.
Unlike off-chip bus protocols, the AHB Lite and APB protocols are fairly simple,
as the hardware configuration inside a chip is fixed, so there is no need to have a
complex initialization protocol to handle various possible configurations (e.g., no
need for “plug and play” support as in computer technology).
The use of an open and generic bus architecture allows various silicon designers
to develop peripherals, memory controllers, and on-chip memory modules for
ARM processors. These designs are commonly referred to as IP, and microcontroller vendors can use their own peripheral designs as well as IP licensed from other
companies in their microcontroller products. With a standardized bus protocol,
these IPs can easily be integrated together in a large-scale design. Today, the
AMBA protocols are a de facto standard for on-chip bus systems. You can find
these designs in many system-on-chip devices, including those that use processors
from other processor design companies.
For a software developer writing software for the Cortex-M microcontrollers,
there is no need to understand the bus protocol details. However, their nature can
affect the programmer’s view in certain ways such as in data alignment and cycle
timing.

3.1.6 Interrupt and exception support
The Cortex®-M3 and Cortex-M4 processors include an interrupt controller called
the Nested Vectored Interrupt Controller (NVIC). It is programmable and its registers are memory mapped. The address location of the NVIC is fixed and the programmer’s model of the NVIC is consistent across all Cortex-M processors.
Besides interrupts from peripherals and other external inputs, the NVIC also supports a number of system exceptions, including a Non-Maskable Interrupt (NMI)
and other exception sources within the processor.
The Cortex-M3 and Cortex-M4 processors are configurable. Microcontroller vendors can determine how many interrupt signals the NVIC should provide, and how
many programmable interrupt priority levels are supported in the NVIC design.
Although some of the details of NVIC in different Cortex-M3/M4 microcontrollers
can be different, the handling of interrupt/exception and the programmer’s model
of NVIC are the same and are defined in the architecture reference manual.
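
The effect of a configurable number of priority levels can be sketched in C. The sketch assumes that each priority field is 8 bits wide and that a device implementing N priority bits uses the most significant N bits of the field, as described in the architecture reference manual; this detail is not covered in this chapter. The function names are hypothetical, and implemented_bits is assumed to be between 1 and 8.

```c
#include <stdint.h>

/* Number of programmable priority levels for a given number of
   implemented priority bits (assumed to be between 1 and 8) */
static uint32_t priority_level_count(uint32_t implemented_bits)
{
    return 1u << implemented_bits;
}

/* Value written to an 8-bit priority field for a given level. The
   implemented bits are assumed to be the most significant bits, so the
   level is shifted up and the unimplemented low bits are left as zero. */
static uint8_t priority_to_register(uint32_t level, uint32_t implemented_bits)
{
    return (uint8_t)((level << (8u - implemented_bits)) & 0xFFu);
}
```

For example, a device with 3 priority bits has 8 levels, and level 5 is stored as 0xA0. In application code you would normally let the CMSIS-Core access functions handle this shift rather than computing it by hand.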

3.2 Features of the Cortex®-M3 and Cortex-M4 processors
Today, most major microcontroller vendors ship microcontrollers based on the
ARM® Cortex®-M3/M4 processors. What are the advantages of the Cortex®-M
processors which have made them so popular? The strengths of the Cortex-M3/M4
processors and their benefits are summarized in this section.

3.2.1 Performance
The Cortex®-M processors deliver high performance in microcontroller products.

• The three-stage pipeline allows most instructions, including multiply, to execute
  in a single cycle, and at the same time allows high clock frequencies for microcontroller devices, typically over 100 MHz, and up to approx 200 MHz³ in
  modern semiconductor manufacturing processes. Even when running at the same
  clock frequency as most other processor products, the Cortex-M3 and Cortex-M4
  processors have a better Clock Per Instruction (CPI) ratio. This allows more work
  to be done per MHz or allows the designs to run at lower clock frequency for
  reduced power consumption.
• Multiple bus interfaces allow simultaneous instruction and data accesses to be
  performed.
• The pipelined bus interface allows a higher clock frequency in the memory system.
• The highly efficient instruction set allows complex operations to be carried out in
  a small number of instructions.
• Each instruction fetch is 32-bit, and most instructions are 16-bit. Therefore up to
  two instructions can be fetched at a time, allowing extra bandwidth on the
  memory interface for better performance and better energy efficiency.

With the current compiler technologies, the performance of the Cortex-M3 and
Cortex-M4 processors is given in Table 3.3.
This high performance makes it possible to develop products that previously
couldn’t be done with legacy 8-bit/16-bit low-cost microcontroller products.
For example, it is now possible to add low-cost graphical interfaces to embedded devices without switching to a high-end microprocessor.
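
The per-MHz figures in Table 3.3 can be scaled to a given clock frequency with simple arithmetic. The sketch below uses integer hundredths to avoid floating point rounding, and the function name is hypothetical. It assumes the score scales linearly with the clock frequency, which real devices may not achieve (for example, if the flash memory needs wait states at higher speeds).

```c
#include <stdint.h>

/* Estimated benchmark score in hundredths, given a per-MHz score in
   hundredths (e.g., 125 for 1.25 DMIPS/MHz) and a clock in MHz.
   Assumes ideal linear scaling with clock frequency. */
static uint32_t score_x100(uint32_t per_mhz_x100, uint32_t clock_mhz)
{
    return per_mhz_x100 * clock_mhz;
}
```

For example, score_x100(125, 100) gives 12500, which is 125 DMIPS at 100 MHz, and score_x100(338, 204) gives 68952, which is about 689.5 CoreMark for a Cortex-M4 at 204 MHz under the same assumption.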

Table 3.3 Performance of the Cortex-M Processors in Commonly Used Benchmarks⁴

Processor             Dhrystone 2.1/MHz    CoreMark/MHz
Cortex-M3             1.25 DMIPS/MHz       3.32
Cortex-M4             1.25 DMIPS/MHz       3.38
Cortex-M4 with FPU    1.25 DMIPS/MHz       3.38

³ When this book was written the maximum clock speed available was 204 MHz.
⁴ Certified CoreMark result in December 2012.

3.2.2 Code density
The Thumb instruction set used on the ARM® Cortex®-M processors provides
excellent code density compared to other processor architectures. Many software developers migrating from 8-bit microcontrollers will see a significant reduction in the
required program size, while performance will also be improved significantly. The
code density of the Cortex-M processors is also better than many commonly used
16-bit and 32-bit architectures. There are also additional advantages:

• Thumb-2 technology allows 16-bit instructions and 32-bit instructions to work
  together without any state switching overhead. Most simple operations can be
  carried out with a 16-bit instruction.
• Various memory addressing modes for efficient data accesses
• Multiple memory accesses can be carried out in a single instruction
• Support for hardware divide instructions and Multiply-and-Accumulate (MAC)
  instructions exist in both Cortex-M3 and Cortex-M4
• Instructions for bit field processing in Cortex-M3/M4
• Single Instruction, multiple data (SIMD) instruction support exists in Cortex-M4
• Optional single precision floating point instructions are available in Cortex-M4

Besides lower system cost, high code density also reduces power consumption,
because you can use a device with less flash memory. You can also copy some parts
of the program code (e.g., interrupt handlers) into SRAM for high speed execution
without worrying that this will take up too much SRAM space.
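
As an illustration of the bit field processing mentioned in the list above, the following C sketch models an unsigned bit field extract and a bit field insert. The function names are hypothetical, and the caller is assumed to pass lsb and width values such that the field fits inside a 32-bit word. On the Cortex-M3 and Cortex-M4, a compiler may be able to implement each function with a single instruction (such as UBFX or BFI) instead of a shift and mask sequence, though this depends on the toolchain.

```c
#include <stdint.h>

/* Build a mask with 'width' low bits set (width assumed to be 1..32) */
static uint32_t field_mask(uint32_t width)
{
    return (width >= 32u) ? 0xFFFFFFFFu : ((1u << width) - 1u);
}

/* Extract an unsigned bit field of 'width' bits starting at bit 'lsb' */
static uint32_t bitfield_extract(uint32_t value, uint32_t lsb, uint32_t width)
{
    return (value >> lsb) & field_mask(width);
}

/* Replace the bit field in 'dest' with the low 'width' bits of 'src' */
static uint32_t bitfield_insert(uint32_t dest, uint32_t src,
                                uint32_t lsb, uint32_t width)
{
    uint32_t mask = field_mask(width);
    return (dest & ~(mask << lsb)) | ((src & mask) << lsb);
}
```

For example, bitfield_extract(0x12345678, 8, 8) returns 0x56, and bitfield_insert(0, 0xAB, 16, 8) returns 0x00AB0000.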

3.2.3 Low power
The Cortex®-M processors are designed for low power implementations. Many
Cortex-M3 and Cortex-M4 microcontroller products can run at under 200 µA/MHz
(approximately 0.36 mW/MHz for a supply voltage of 1.8 V) and some of
them can even run at under 100 µA/MHz. Low power characteristics of Cortex-M
processors include:

• The Cortex-M3 is designed to target low-cost microcontrollers in which a
  small silicon area (low gate count) is essential. The Cortex-M4 is slightly
  larger due to the additional SIMD instructions and the optional floating point
  unit. The three-stage pipeline design provides a good balance between performance and silicon die size.
• The high code density of the Cortex-M processor allows software developers to
  use a microcontroller device with smaller program memory to implement their
  products to reduce power consumption.
• The Cortex-M processors provide a number of low power features. These include
  multiple sleep modes defined in the architecture, and integrated architectural
  clock gating support, which allows clock circuits for parts of the processor to be
  deactivated when the section is not in use.
• The fully static, synchronous, and synthesizable design enables the processors
  to be manufactured using any low power or standard semiconductor process
  technology. Starting in revision 2 of the Cortex-M3 design, and on all current
  revisions of Cortex-M4, the processors also have additional optional hardware
  support called the Wakeup Interrupt Controller (WIC) to enable advanced low
  power technologies such as State Retention Power Gating (SRPG). This is
  covered in Chapter 9.

With all of these low power features, the Cortex-M processors are very popular
with embedded product designers, who are constantly looking for new ways to
improve battery life in their portable products.
In addition to longer battery life, lower power in the microcontroller can also
help reduce Electro-Magnetic Interference (EMI), and potentially simplify the
power supply (or reduce the battery size) and hence reduce system cost.
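
The power figure quoted at the start of this section (200 µA/MHz, or approximately 0.36 mW/MHz at 1.8 V) follows from multiplying the current by the supply voltage. The sketch below shows the arithmetic using integer units; the function name is hypothetical, and the result is only a first-order estimate of active power, since it ignores sleep modes, peripherals, and other device-specific effects.

```c
#include <stdint.h>

/* Estimated active power in microwatts.
   ua_per_mhz : current consumption in microamps per MHz
   supply_mv  : supply voltage in millivolts
   clock_mhz  : clock frequency in MHz
   uA x mV gives nanowatts, so divide by 1000 to get microwatts. */
static uint32_t power_uw(uint32_t ua_per_mhz, uint32_t supply_mv,
                         uint32_t clock_mhz)
{
    return (ua_per_mhz * supply_mv / 1000u) * clock_mhz;
}
```

For example, power_uw(200, 1800, 1) returns 360 (0.36 mW for each MHz), and power_uw(200, 1800, 100) returns 36000, which is 36 mW at 100 MHz.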

3.2.4 Memory system
The Cortex®-M3/M4 processors support a wide range of memory features:

• Total of 4GB of addressable memory space with linear 32-bit addressing, with
  no need to use memory paging.
• Architectural memory map definition consistency across all Cortex-M processors. The predefined memory map allows processor designs to be optimized
  for Harvard bus architecture, and allows easy access to memory-mapped peripherals (such as the NVIC) inside the processors.
• Pipelined AHB Lite bus interface that allows high speed, low latency transfers.
  The AHB Lite interface supports efficient transfers of 32-bit, 16-bit, and 8-bit
  data. The bus protocol also allows insertion of wait states, supports bus error
  conditions, and allows multiple bus masters to share a bus.
• Optional bit band feature: two bit addressable regions in SRAM and peripheral
  regions. Bit value modifications via bit band alias addresses are converted into
  atomic Read-Modify-Write operations to bit band regions. (See section 6.7 for
  details.)
• Exclusive accesses for multi-processor system designs. This is important for
  semaphore operation in multi-processor systems.
• Support of little endian or big endian memory systems. The Cortex-M3/M4
  processors can operate in either little endian or big endian mode. However, almost
  all microcontrollers will be designed for either little endian or big endian, but not
  both. The majority of the Cortex-M microcontroller products use little endian.
• Optional Memory Protection Unit (MPU). (See the next section.)
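
The bit band feature in the list above maps each bit in a bit band region to one word in an alias region. The sketch below computes the alias address. It assumes the region addresses given in ARM documentation (bit band regions starting at 0x20000000 and 0x40000000, each 1MB in size, with alias regions starting at 0x22000000 and 0x42000000); these values are not stated in this section. The function name is hypothetical, and the caller is assumed to pass an address inside one of the bit band regions and a bit number from 0 to 7 for a byte address.

```c
#include <stdint.h>

/* Compute the bit band alias address for one bit.
   addr : byte address inside a bit band region (assumed 0x20000000-0x200FFFFF
          or 0x40000000-0x400FFFFF)
   bit  : bit number within that byte (0 to 7)
   Each bit maps to one 32-bit word in the alias region. */
static uint32_t bitband_alias(uint32_t addr, uint32_t bit)
{
    uint32_t region_base = addr & 0xF0000000u;        /* 0x20000000 or 0x40000000 */
    uint32_t alias_base  = region_base + 0x02000000u; /* 0x22000000 or 0x42000000 */
    uint32_t byte_offset = addr - region_base;

    return alias_base + (byte_offset * 32u) + (bit * 4u);
}
```

For example, bit 2 of the byte at 0x20000000 is accessed through the alias address 0x22000008. Writing to that word changes only the selected bit, because the processor converts the write into an atomic Read-Modify-Write operation.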

3.2.5 Memory protection unit
The MPU is an optional feature available on the Cortex®-M3 and Cortex-M4 processors. Microcontroller vendors can decide whether to include the MPU or not.
The MPU is a programmable device that monitors the bus transactions and needs
to be configured by software, typically an embedded OS. If an MPU is included,
applications can divide the memory space into a number of regions and define the
access permissions for each of them. When an access rule is violated, a fault exception is generated and the fault exception handler will be able to analyze the problem
and, if possible, correct it.
The MPU can be used in various ways. In common scenarios, an OS can set up
the MPU to protect data used by the OS kernel and other privileged tasks, preventing
untrusted user programs from corrupting them. Optionally, the OS can also isolate
memory regions between different user tasks. These measures allow better detection
of system failures and allow systems to be more robust in handling error conditions.
The MPU can also be used to make memory regions read-only, to prevent accidental erasure of data in SRAM or overwriting of instruction code.
By default the MPU is disabled and applications that do not require a memory
protection feature do not have to initialize it.
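
The idea of dividing the memory space into regions with access permissions can be modeled in a few lines of C. This is a simplified software model for illustration only and does not reflect the MPU register layout. The type and function names are hypothetical, and the model makes several assumptions: regions do not overlap, privileged code can access everything, and an unprivileged access that matches no region is rejected.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical description of one protected region */
typedef struct {
    uint32_t base;          /* start address of the region */
    uint32_t size;          /* size of the region in bytes */
    bool     unpriv_read;   /* unprivileged reads allowed */
    bool     unpriv_write;  /* unprivileged writes allowed */
} mpu_region_model_t;

/* Returns true if the access is permitted in this simplified model.
   A false result corresponds to the case where the real MPU would
   trigger a fault exception. */
static bool mpu_model_allows(const mpu_region_model_t *regions, uint32_t count,
                             uint32_t addr, bool is_write, bool privileged)
{
    if (privileged) {
        return true;   /* simplification: privileged code is unrestricted */
    }
    for (uint32_t i = 0; i < count; i++) {
        const mpu_region_model_t *r = &regions[i];
        if (addr >= r->base && (addr - r->base) < r->size) {
            return is_write ? r->unpriv_write : r->unpriv_read;
        }
    }
    return false;      /* no matching region: reject unprivileged access */
}
```

With one read/write region for task data and one read-only region for program code, an unprivileged write into the code region is rejected by the model, which is the kind of violation that a fault handler would then analyze.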

3.2.6 Interrupt handling
The Cortex®-M3 and Cortex-M4 processors come with a sophisticated interrupt
controller called the Nested Vectored Interrupt Controller (NVIC). The NVIC provides a number of features:

• Supports up to 240 interrupt inputs, a Non-Maskable Interrupt (NMI) input, and a
  number of system exceptions. Each interrupt (apart from the NMI) can be
  individually enabled or disabled.
• Programmable priority levels for interrupts and a number of system exceptions.
  In Cortex-M3 and Cortex-M4, the priority levels can be changed dynamically
  at run time (note: dynamic changing of priority level is not supported in the
  Cortex-M0/M0+).
• Automatic handling of interrupt/exception prioritization and nested interrupt/
  exception handling.
• Vectored interrupt/exception. This means the processor automatically fetches
  interrupt/exception vectors without the need for software to determine
  which interrupt/exception needs to be served.
• Vector table can be relocated to various areas in the memory.
• Low interrupt latency. With zero wait state memory system, the interrupt latency
  is only 12 cycles.
• Interrupts and a number of exceptions can be triggered by software.
• Various optimizations to reduce interrupt processing overhead when switching
  between different exception contexts.
• Interrupt/exception masking facilities allow all interrupts and exceptions (apart
  from the NMI) to be masked, or to mask interrupt/exceptions below a certain
  priority level.

In order to support these features, the NVIC has a number of programmable registers. These registers are memory mapped, and CMSIS-Core provides the required
register definitions and access functions (API) for most common interrupt control
tasks. These access functions are very easy to use and most can be used on other
Cortex-M processors such as the Cortex-M0.
The vector table, which holds the starting addresses of interrupts and system exceptions, is a part of the system memory. By default the vector table is located at the beginning of the memory space (address 0x0), but the vector table offset can be changed at
runtime if needed. In most applications, the vector table can be set up during compiletime as a part of the application program image and remain unchanged at runtime.

The number of interrupts supported by each Cortex-M3 or Cortex-M4 device is
determined by the microcontroller vendors when the chips are designed.
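
The location of a vector inside the vector table can be worked out with simple arithmetic. The sketch below assumes the layout defined by the architecture (not detailed in this section): each vector is one 32-bit word, the word at offset 0 holds the initial stack pointer value, exception number n has its vector at offset n x 4, and external interrupt 0 is exception number 16. The function names are hypothetical. A second helper converts the 12-cycle interrupt latency quoted above into nanoseconds for a given clock frequency.

```c
#include <stdint.h>

/* Address of the vector for a given exception number, relative to the
   start of the vector table (0x0 by default, or the relocated base) */
static uint32_t vector_address(uint32_t table_base, uint32_t exception_number)
{
    return table_base + (4u * exception_number);
}

/* Address of the vector for external interrupt 'irq'
   (assumes interrupt 0 is exception number 16) */
static uint32_t irq_vector_address(uint32_t table_base, uint32_t irq)
{
    return vector_address(table_base, 16u + irq);
}

/* Convert a latency in clock cycles to nanoseconds (integer result) */
static uint32_t latency_ns(uint32_t cycles, uint32_t clock_mhz)
{
    return (cycles * 1000u) / clock_mhz;
}
```

For example, with the vector table at address 0x0, the vector for interrupt 0 is at 0x40, and a 12-cycle latency at 100 MHz corresponds to 120 ns, assuming a zero wait state memory system.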

3.2.7 OS support and system level features
The Cortex®-M3 and Cortex-M4 processors are designed to support embedded OSs
efficiently. They have a built-in system tick timer called SysTick, which can be set
up to generate regular timer interrupts for OS timekeeping. Since the SysTick timer
is available in all Cortex-M3 and Cortex-M4 devices, source code for the embedded
OS can easily be used on all of these devices without modification for device-specific timers.
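
Setting up a regular OS tick mainly involves working out a reload value for the timer. The sketch below assumes that SysTick is a 24-bit down counter and that the reload value is the number of clock cycles per tick minus one; these details come from the architecture documentation rather than from this section. The function name is hypothetical. On a real device you would normally call the CMSIS-Core function SysTick_Config() instead of calculating the value yourself.

```c
#include <stdint.h>
#include <stdbool.h>

/* Calculate a SysTick reload value for the requested tick rate.
   Returns false if the tick rate is zero or the value does not fit in
   the 24-bit counter (assumed width); otherwise stores the value in
   *reload and returns true. */
static bool systick_reload(uint32_t core_clock_hz, uint32_t tick_hz,
                           uint32_t *reload)
{
    uint32_t cycles_per_tick;

    if (tick_hz == 0u) {
        return false;
    }
    cycles_per_tick = core_clock_hz / tick_hz;
    if (cycles_per_tick == 0u || (cycles_per_tick - 1u) > 0x00FFFFFFu) {
        return false;
    }
    *reload = cycles_per_tick - 1u;
    return true;
}
```

For example, a 100 MHz core clock and a 1 kHz tick rate give a reload value of 99999. A 1 Hz tick at the same clock speed is rejected because the value would not fit in 24 bits.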
The Cortex-M3 and Cortex-M4 processors also have banked stack pointers:
for OS kernel and interrupts, the Main Stack Pointer (MSP) is used; for application
tasks, the Process Stack Pointer (PSP) is used. In this way, the stack used by the OS
kernel can be separated from that used by application tasks, enabling better reliability
as well as allowing optimum stack space usage. For simple applications without an
OS, the MSP can be used all the time.
To improve system reliability further, the Cortex-M3 and Cortex-M4 processors
support the separation of privileged and non-privileged operation modes. By default,
the processors start in privileged mode. When an OS is used and user tasks are
executed, the execution of user tasks can be carried out in non-privileged operation
mode so that certain restrictions can be enforced, such as blocking access to some
NVIC registers. The separation of privileged and non-privileged operation modes
can also be used with the MPU to prevent non-privileged tasks from accessing
certain memory regions. In this way a user task cannot corrupt data used by the
OS kernel or other tasks, thus enhancing the system’s stability.
Most simple applications do not require the use of non-privileged mode at all.
But when building an embedded system that requires high reliability, the separation
of privileged and non-privileged tasks may allow the system to continue operation
even if a non-privileged task has failed.
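
The combination of privilege level and stack pointer selection can be modeled as a small decode function. The sketch assumes the CONTROL register layout defined in the architecture (bit 0 selects non-privileged execution for application code, and bit 1 selects the PSP), which is covered in detail elsewhere and is not described in this section. It also assumes that exception handlers always run privileged and always use the MSP. The type and function names are hypothetical.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical summary of the current execution state */
typedef struct {
    bool privileged;  /* true if the code runs with privileged access */
    bool uses_psp;    /* true if the Process Stack Pointer is selected */
} exec_state_t;

/* Decode the execution state from a CONTROL register value.
   handler_mode is true when an exception handler is executing. */
static exec_state_t decode_control(uint32_t control, bool handler_mode)
{
    exec_state_t state;

    /* Handlers are always privileged; otherwise bit 0 decides */
    state.privileged = handler_mode || ((control & 0x1u) == 0u);

    /* Handlers always use the MSP; otherwise bit 1 selects the PSP */
    state.uses_psp = !handler_mode && ((control & 0x2u) != 0u);

    return state;
}
```

A simple application without an OS stays at the reset state (privileged, MSP). An OS would typically run user tasks non-privileged with the PSP, while its kernel and interrupt handlers keep using the MSP.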
The Cortex-M processors also have a number of fault handlers. When a fault is
detected (e.g., accessing of invalid memory address), a fault exception will be triggered and this can be used as a measure to prevent further system failures, and to
diagnose the problem.

3.2.8 Cortex®-M4 specific features
The Cortex®-M4 processor is very similar to Cortex-M3 in many aspects. However,
it has a number of features that the Cortex-M3 does not. These include the DSP extensions and the optional single precision floating point unit.
The DSP extensions of the Cortex-M4 cover:
• 8-bit and 16-bit Single Instruction Multiple Data (SIMD) instructions. These
instructions allow multiple data operations to be carried out in parallel. The most
common application of SIMD is audio processing, where the calculations for the
left and right channel can be carried out at the same time. It can also be used in
image processing, where R-G-B or C-M-Y-K elements of image pixels can be
represented as an 8-bit SIMD data set and processed in parallel.
• A number of saturated arithmetic instructions including SIMD versions are
also supported. This prevents massive distortion of calculation results when
overflow/underflow occurs.
• Single-cycle 16-bit, dual 16-bit, and 32-bit Multiply and Accumulate (MAC).
While the Cortex-M3 also supports a couple of MAC instructions, the MAC
instructions in Cortex-M4 provide more options, including multiplication for
various combinations of upper and lower 16-bits in the registers and a SIMD
version of 16-bit MAC. In addition, the MAC operation can be carried out in a
single cycle in the Cortex-M4 processor, while in the Cortex-M3 it takes multiple
cycles.
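To make the difference between wrapping and saturating arithmetic concrete, the following is a minimal portable C sketch of what a saturating add and a dual 16-bit saturating add compute. It is a host-side model only: the function names (sat_add32, sat_add16x2, and the helpers) are hypothetical and chosen for illustration. On a Cortex-M4 this kind of work would normally be done by the DSP instructions themselves (for example through compiler or CMSIS intrinsics), not by code like this.

```c
#include <assert.h>
#include <stdint.h>

/* Saturating signed 32-bit add: the result is clamped to the int32_t range
   instead of wrapping around on overflow/underflow. */
static int32_t sat_add32(int32_t a, int32_t b)
{
    int64_t sum = (int64_t)a + (int64_t)b;  /* widen so the sum itself cannot overflow */
    if (sum > INT32_MAX) return INT32_MAX;
    if (sum < INT32_MIN) return INT32_MIN;
    return (int32_t)sum;
}

/* Clamp a wide intermediate value to the signed 16-bit range. */
static int16_t clamp16(int32_t v)
{
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

/* Reinterpret a 16-bit pattern as a signed value without relying on
   implementation-defined narrowing conversions. */
static int16_t to_s16(uint16_t bits)
{
    return (int16_t)((int32_t)(bits ^ 0x8000u) - 0x8000);
}

/* Pack two signed 16-bit lanes (e.g., left and right audio samples) into one word. */
static uint32_t pack16x2(int16_t hi, int16_t lo)
{
    return ((uint32_t)(uint16_t)hi << 16) | (uint32_t)(uint16_t)lo;
}

static int16_t lane_hi(uint32_t w) { return to_s16((uint16_t)(w >> 16)); }
static int16_t lane_lo(uint32_t w) { return to_s16((uint16_t)(w & 0xFFFFu)); }

/* Dual 16-bit saturating add: both lanes are handled by one call, and each
   lane saturates independently of the other. */
static uint32_t sat_add16x2(uint32_t x, uint32_t y)
{
    int32_t hi = (int32_t)lane_hi(x) + lane_hi(y);
    int32_t lo = (int32_t)lane_lo(x) + lane_lo(y);
    return pack16x2(clamp16(hi), clamp16(lo));
}
```

With plain wrapping arithmetic, adding 1 to the largest positive value would flip the result to a large negative number; the saturating version stays at the maximum, which is why the distortion in signal processing results is much smaller.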
The optional floating point unit (FPU) in the Cortex-M4 covers:

• A single precision floating point unit compliant with the IEEE 754 standard. In
order to support floating point operations, the Cortex-M4 processor supports a
number of floating point instructions. There are also a number of instructions to
convert between single precision and half precision floating point data.
• The floating point unit supports fused MAC operations; this allows better
precision in the MAC result.
• The floating point unit can be disabled when not in use, allowing for a reduction
in power consumption.

In order to support the additional instructions and the high performance DSP
requirements, the internal data path of the Cortex-M4 is different from that of the
Cortex-M3 processor. As a result of these differences, some of the instructions
take fewer clock cycles in the Cortex-M4.
In order to allow users access to the full potential of the Cortex-M4 DSP capability, ARM® provides a DSP library through the CMSIS-DSP project. This DSP
library is free and can be used on Cortex-M4, Cortex-M3 processors, and even the
Cortex-M0+ and Cortex-M0 processors. More details of the DSP library are covered
in Chapter 22.

3.2.9 Ease of use
Compared to other 32-bit processor architectures, the Cortex®-M processors are
very easy to use. The programmer’s model and the instruction set are very
C-friendly. Therefore you can develop your applications entirely in C code without
using any assembly and yet very easily get high performance. With the help of
CMSIS compliant device driver libraries, developing your application is even easier.
For example, system initialization code is provided by microcontroller vendors and
usually interrupt controller functions are embedded in the CMSIS-Core files as part
of the device driver libraries.
Most of the features of the Cortex-M3 and Cortex-M4 processors are controlled
by memory-mapped registers. Therefore you can access almost all of the features via
C pointers. Because there is no need to use compiler specific data types or directives
to access these features, the program code is extremely portable.
In the Cortex-M processors, interrupt handlers can be written as normal C
functions. Since interrupt prioritization and nesting of interrupts are handled by
the NVIC and the exception entry is vectored, there is no need for software
to check which interrupt needs to be served, or to handle nested interrupts explicitly. All you need to do is to assign a priority level to each interrupt and system
exception.
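As an illustration of the two points above (register access through C pointers, and interrupt handlers written as normal C functions), here is a minimal sketch. Everything named in it is an assumption made for the example: the REG32 macro, the demo_periph_t layout, and the handler name Demo_IRQHandler are hypothetical, and real register addresses and handler names come from the device header files and vector table supplied by the microcontroller vendor.

```c
#include <assert.h>
#include <stdint.h>

/* Access a 32-bit memory-mapped register at a given address. The 'volatile'
   qualifier keeps the compiler from caching or removing the access. */
#define REG32(addr) (*(volatile uint32_t *)(addr))

/* Hypothetical peripheral register layout (illustration only). */
typedef struct {
    volatile uint32_t CTRL;    /* bit 0: enable */
    volatile uint32_t STATUS;
} demo_periph_t;

#define DEMO_CTRL_ENABLE (1u << 0)

static void demo_periph_enable(demo_periph_t *p)
{
    p->CTRL |= DEMO_CTRL_ENABLE;   /* read-modify-write through an ordinary pointer */
}

/* An interrupt handler is an ordinary C function: no special keyword or
   compiler directive is needed. The name is hypothetical. */
static volatile uint32_t demo_irq_count;

void Demo_IRQHandler(void)
{
    demo_irq_count++;
}
```

On a real device the structure pointer would be created by casting a fixed peripheral base address, for example `((demo_periph_t *)BASE_ADDRESS)`, where the address is taken from the device documentation.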

3.2.10 Debug support
The Cortex®-M3 and Cortex-M4 processors come with comprehensive debug features to make software development much easier. Besides standard debug features
like halting and single stepping, you can also use various trace features to find out
details of the program execution without using expensive equipment.
To start with, the Cortex-M3 and Cortex-M4 processors support up to eight hardware comparators for breakpoints (six for instruction addresses, two for literal data
addresses) in the Flash Patch and BreakPoint Unit (FPB). When triggered, the processor can be halted or the transfers can be remapped to an SRAM location. The
remapping feature allows a read-only program memory location to be modified;
for example, to patch the program in a masked ROM with a small programmable
memory. This enables bugs to be rectified or enhancements made even when the
main code is in masked ROM. (See section 23.10 for details.)
The Cortex-M3 and Cortex-M4 processors also have up to four hardware data
watchpoint comparators in the Data Watchpoint and Trace (DWT) unit. These can
be used to generate watchpoint events to halt the processor when selected data is
accessed, or to generate trace information that can be collected by the trace interface
without stopping the processor. The data value and additional information can then
be presented by the debugger in an Integrated Development Environment (IDE) to
visualize the change of data values over time. The DWT can also be used to generate
exception event traces and basic profiling information, which is again output through
the trace interface.
The Cortex-M3 and Cortex-M4 processors also have an optional Embedded
Trace Macrocell (ETM) module that can be used to generate instruction traces.
This allows full visibility of the program flow during execution, which is very useful
for debugging complex software issues and also can be used for detailed profiling
and code coverage analysis.
Debugging the Cortex-M3 and Cortex-M4 processors can be handled by a JTAG
connection, or a two-wire interface called a Serial-Wire Debug (SWD) interface.
Both JTAG and SWD protocols are widely supported by many development tool vendors. Trace information can be collected using a single wire Serial-Wire Viewer
(SWV) interface, or a trace port interface (typically 5-pin) if high-trace bandwidth
is required (e.g., when instruction trace is used). The debug and trace interfaces
can be combined into a single connector (see Appendix H).

3.2.11 Scalability
The Cortex®-M processors are not just for low-cost microcontroller products.
Nowadays you can find various multi-processor products containing Cortex-M3 or
Cortex-M4 processors. These include:
• Microcontrollers with multiple Cortex-M processors, such as LPC4300 from NXP.
• High-end Digital Signal Processing devices with one or more Cortex-M processors as the main processor and an additional DSP for the data processing
engine. For example, the Concerto product series from Texas Instruments, which
combines a Cortex-M3 processor with a DSP core.
• Complex System-on-Chips with one or more Cortex-M processors as companion
processors. For example, the Texas Instruments OMAP5 combines a Cortex-A15
and two Cortex-M4 processors into a single device.
• Complex System-on-Chips with one or more Cortex-M processors for power
management and system control.
• Complex System-on-Chips with one or more Cortex-M processors for Finite
State Machine (FSM) replacement.

Using various AMBA® bus infrastructure solutions such as the Cortex-M System
Design Kit from ARM®, the bus system of the Cortex-M processors can be
expanded to support multi-processor systems. In addition, the Cortex-M3 and
Cortex-M4 support the following features to support multi-processor system design:
• Exclusive access instructions: The Cortex-M3 and Cortex-M4 processors
support a number of exclusive access instructions. These are special memory
access instructions that work in pairs for load and store operations of variables
for semaphore or mutually exclusive operations. With additional hardware
support in the bus infrastructure, the processor can determine if it has successfully carried out exclusive access to a shared data memory area (i.e., no other
processor has accessed the same area during the operation).
• Scalable debug support: The debug systems of the Cortex-M processors are
based on the CoreSight™ Architecture. This can be expanded to support multiple
processors, sharing just one debug connection and one trace interface.
• Event communication interface: The Cortex-M3 and Cortex-M4 processors
support a simple event communication interface. This allows multi-processor
systems to reduce power by having some of the processors enter sleep mode and
wake up when certain events have occurred, such as completion of semaphore
operations in one of the processors.
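The idea behind the paired exclusive load and store can be sketched in portable C with C11 atomics. This is only an analogy under stated assumptions: the demo_lock_t type and function names are hypothetical, and the code uses the standard compare-and-exchange operation rather than the exclusive access instructions directly. Compilers targeting ARMv7-M typically implement such atomic operations with an exclusive load/store pair, but that is a property of the toolchain, not something this sketch guarantees.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical lock object: 0 = free, 1 = taken. */
typedef struct {
    atomic_uint locked;
} demo_lock_t;

/* Try to take the lock. The update only succeeds if the value is still 0 at
   the moment of the store, i.e., nobody else changed it in between. */
static bool demo_try_lock(demo_lock_t *l)
{
    unsigned int expected = 0u;
    return atomic_compare_exchange_strong(&l->locked, &expected, 1u);
}

/* Release the lock so that another task or processor can take it. */
static void demo_unlock(demo_lock_t *l)
{
    atomic_store(&l->locked, 0u);
}
```

A caller that fails to get the lock can retry later; in a low-power multi-processor design it could sleep until the holder signals that the lock has been released.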

Another aspect of scalability is the range of microcontroller products you can
find with the Cortex-M processor. Since all the Cortex-M processors are very similar
to each other in terms of the programmer’s model, interrupt handling, and software
development including debug, you can easily switch between different processors
for your embedded systems to satisfy different performance requirements, system
level requirements, and price levels.

FIGURE 3.4
Cortex-M compatibility

3.2.12 Compatibility
One advantage of using the ARM® Cortex®-M3 and Cortex-M4 processors is that
they have great compatibility with a wide family of other ARM devices (Figure 3.4).
For instance, there are thousands of different Cortex-M3 and Cortex-M4 devices
from different microcontroller vendors for you to choose from. If you want to reduce
cost, you can easily transfer your program code to a microcontroller based on the
Cortex-M0 processor. If your application needs a bit more processing power, often
you can find a faster Cortex-M3/M4 microcontroller, or even migrate your design to
a Cortex-R or Cortex-A processor based product.
Besides easy migration between ARM Cortex processor families, you can
also reuse software developed for ARM7™ and ARM9™ microcontrollers. For C
source code, often you only need to recompile the code to target Cortex-M3 or
Cortex-M4. Some assembly code files can be reused with minor modifications.
This includes some codec applications developed for the ARM9E processor family,
which often contain optimized assembly code for Digital Signal Processing (DSP).
There is still some software porting work to be done when migrating from ARM7
or ARM9 to Cortex-M microcontrollers: due to differences in the processor architecture, such as processor modes and interrupt/exception models, interrupt handlers
need to be changed and some assembly code will need to be changed or removed.
Normally the migration to Cortex-M processors will simplify the application code
as the initialization is simpler and nested exceptions/interrupts are automatically
handled by hardware.
Besides migration of hardware, ARM has a well-established software architecture that allows different software development tools to work together. In addition,
the CMSIS ecosystem also enables applications developed for the Cortex-M
microcontrollers to be compiled using various different tool chains with no or little
modification, which further protects your software IP investment.

CHAPTER 4
Architecture

CHAPTER OUTLINE
4.1 Introduction to the architecture
4.2 Programmer’s model
    4.2.1 Operation modes and states
        Operation states
        Operation modes
    4.2.2 Registers
        R0-R12
        R13, stack pointer (SP)
        R14, link register (LR)
        R15, program counter (PC)
        Register names in programming
    4.2.3 Special registers
        Program status registers
        PRIMASK, FAULTMASK, and BASEPRI registers
        CONTROL register
    4.2.4 Floating point registers
        S0 to S31/D0 to D15
        Floating point status and control register (FPSCR)
        Memory-mapped floating point unit control registers
4.3 Behavior of the application program status register (APSR)
    4.3.1 Integer status flags
    4.3.2 Q status flag
    4.3.3 GE bits
4.4 Memory system
    4.4.1 Memory system features
    4.4.2 Memory map
    4.4.3 Stack memory
    4.4.4 Memory protection unit (MPU)
4.5 Exceptions and interrupts
    4.5.1 What are exceptions?
    4.5.2 Nested vectored interrupt controller (NVIC)
        Flexible exception and interrupt management
        Nested exception/interrupt support
        Vectored exception/interrupt entry
        Interrupt masking
    4.5.3 Vector table
    4.5.4 Fault handling
4.6 System control block (SCB)
4.7 Debug
4.8 Reset and reset sequence

The Definitive Guide to ARM® Cortex®-M3 and Cortex-M4 Processors. http://dx.doi.org/10.1016/B978-0-12-408082-9.00004-X

4.1 Introduction to the architecture

The Cortex®-M3 and Cortex-M4 Processors are based on the ARMv7-M architecture. The original ARMv7-M architecture was defined when the Cortex-M3 processor was developed, and when the Cortex-M4 was released, the architecture was
extended to include additional instructions and architectural features. The extended
architecture is sometimes called ARMv7E-M architecture. Both ARMv7-M and
ARMv7E-M features are documented in the same architecture specification document: the ARMv7-M Architecture Reference Manual (reference 1).
The ARMv7-M Architecture Reference Manual is a massive document of over
1000 pages. It provides very detailed architectural requirements of the processor’s
behavior, from instruction set, memory system to debug support. While it is useful
for experts like processor designers, designers of C compilers, and development tools,
this document is not easy to read, especially for readers new to the ARM® architecture.
To use a Cortex-M microcontroller in typical applications, there is no need to
have detailed knowledge of this architecture. You only need to have a basic understanding of the programmer’s model, how exceptions (such as interrupts) are
handled, the memory map, how to use the peripherals, and how to use the software
driver library files from the microcontroller vendors.
In the next few chapters of this book, we will look at the architecture from a software developer’s point of view. First, we will look at the programmer’s model of the
processor, which covers operation modes, register banks, and special registers.

4.2 Programmer’s model

4.2.1 Operation modes and states
The Cortex®-M3 and Cortex-M4 processors have two operation states and two
modes. In addition, the processors can have privileged and unprivileged access levels.
These are shown in Figure 4.1. The privileged access level can access all resources in
the processor, while unprivileged access level means some memory regions are inaccessible, and a few operations cannot be used. In some documents, the unprivileged
access level might also be referred to as “User” state, a term inherited from
ARM7TDMI™.

Operation states
• Debug state: When the processor is halted (e.g., by the debugger, or after hitting a
breakpoint), it enters debug state and stops executing instructions.
• Thumb state: If the processor is running program code (Thumb instructions), it
is in the Thumb state. Unlike classic ARM® processors like ARM7TDMI, there
is no ARM state because the Cortex-M processors do not support the ARM instruction set.

FIGURE 4.1
Operation states and modes

Operation modes
• Handler mode: Used when executing an exception handler such as an Interrupt Service Routine (ISR). When in handler mode, the processor always has privileged
access level.
• Thread mode: Used when executing normal application code. The processor can be
either in privileged access level or unprivileged access level. This is controlled by
a special register called “CONTROL.” We will cover this more in section 4.2.3.

Software can switch the processor in privileged Thread mode to unprivileged
Thread mode. However, it cannot switch itself back from unprivileged to privileged.
If this is needed, the processor has to use the exception mechanism to handle the switch.
The separation of privileged and unprivileged access levels allows system
designers to develop robust embedded systems by providing a mechanism to safeguard memory accesses to critical regions and by providing a basic security model.
For example, a system can contain an embedded OS kernel that executes in privileged access level, and application tasks which execute in unprivileged access level.
In this way, we can set up memory access permissions using the Memory Protection
Unit (MPU) to prevent an application task from corrupting memory and peripherals
used by the OS kernel and other tasks. If an application task crashes, the remaining
application tasks and the OS kernel can still continue to run.
Besides the differences in memory access permission and access to several
special instructions, the programmer’s model of the privileged access level and
unprivileged access level are almost the same. Note that almost all of the NVIC registers are privileged access only.
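The switching rules described above can be captured in a small host-side model. This is a sketch under assumptions, not processor code: the cpu_model_t type and function names are hypothetical, and the use of bit 0 of CONTROL to select unprivileged Thread mode anticipates the CONTROL register description in section 4.2.3 rather than anything stated in this passage.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CONTROL_NPRIV (1u << 0)  /* assumed: bit 0 set = Thread mode is unprivileged */

typedef struct {
    bool     handler_mode;  /* true while an exception handler is executing */
    uint32_t control;       /* modeled CONTROL register */
} cpu_model_t;

/* Handler mode is always privileged; Thread mode depends on CONTROL. */
static bool is_privileged(const cpu_model_t *cpu)
{
    return cpu->handler_mode || (cpu->control & CONTROL_NPRIV) == 0u;
}

/* Model of a software write to CONTROL: it only takes effect when the
   processor is currently privileged, so unprivileged code cannot promote itself. */
static void write_control(cpu_model_t *cpu, uint32_t value)
{
    if (is_privileged(cpu)) {
        cpu->control = value;
    }
}
```

In this model the only way back from unprivileged Thread mode is to enter Handler mode (an exception), change CONTROL there, and return, which mirrors the rule that the exception mechanism has to handle the switch.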

FIGURE 4.2
In simple applications, the unprivileged Thread mode can be unused

Similarly, Thread mode and Handler mode have very similar programmer’s
models. However, Thread mode can switch to using a separate shadowed Stack
Pointer (SP). Again, this allows the stack memory for application tasks to be separated from the stack used by the OS kernel, thus allowing better system reliability.
By default, the Cortex-M processors start in privileged Thread mode and in
Thumb state. In many simple applications, there is no need to use the unprivileged
Thread mode and the shadowed SP at all (see Figure 4.2). Unprivileged Thread
mode is not available in the Cortex-M0 processor, but is optional in the Cortex-M0+ processor.
The debug state is used for debugging operations only. This state is entered by a
halt request from the debugger, or by debug events generated from debug components
in the processor. This state allows the debugger to access or change the processor register values. The system memory, including peripherals inside and outside the processor, can be accessed by the debugger in either Thumb state or debug state.

4.2.2 Registers
Similarly to almost all other processors, the Cortex®-M3 and Cortex-M4 processors
have a number of registers inside the processor core to perform data processing and
control. Most of these registers are grouped in a unit called the register bank. Each
data processing instruction specifies the operation required, the source register(s),
and the destination register(s) if applicable. In the ARM® architecture, if data in
memory is to be processed, it has to be loaded from the memory to registers in
the register bank, processed inside the processor, and then written back to the memory, if needed. This is commonly called a “load-store architecture.” By having a sufficient number of registers in the register bank, this arrangement is easy to use, and
allows efficient program code to be generated using C compilers. For instance, a
number of data variables can be stored in the register bank for a short period of
time while other data processing takes place, without the need to be updated to
the system memory and read back every time they are used.

FIGURE 4.3
Registers in the register bank

The register bank in the Cortex-M3 and Cortex-M4 processors has 16 registers.
Thirteen of them are general purpose 32-bit registers, and the other three have special uses, as can be seen in Figure 4.3.

R0-R12

Registers R0 to R12 are general purpose registers. The first eight (R0-R7) are also
called low registers. Due to the limited available space in the instruction set, many
16-bit instructions can only access the low registers. The high registers (R8-R12)
can be used with 32-bit instructions, and a few with 16-bit instructions, like MOV
(move). The initial values of R0 to R12 are undefined.

R13, stack pointer (SP)
R13 is the Stack Pointer. It is used for accessing the stack memory via PUSH and POP
operations. Physically there are two different Stack Pointers: the Main Stack Pointer
(MSP, or SP_main in some ARM documentation) is the default Stack Pointer. It
is selected after reset, or when the processor is in Handler Mode. The other Stack
Pointer is called the Process Stack Pointer (PSP, or SP_process in some ARM
documentation). The PSP can only be used in Thread Mode. The selection of Stack
Pointer is determined by a special register called CONTROL, which will be explained
in section 4.2.3. In normal programs, only one of these Stack Pointers will be visible.
Both MSP and PSP are 32-bit, but the lowest two bits of the Stack Pointers (either
MSP or PSP) are always zero, and writes to these two bits are ignored. In ARM
Cortex-M processors, PUSH and POP are always 32-bit, and the addresses of the
transfers in stack operations must be aligned to 32-bit word boundaries.
For most cases, it is not necessary to use the PSP if the application doesn’t
require an embedded OS. Many simple applications can rely on the MSP completely.
The PSP is normally used when an embedded OS is involved, where the stack for the
OS kernel and application tasks are separated. The initial value of PSP is undefined,
and the initial value of MSP is taken from the first word of the memory during the
reset sequence.
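The alignment rule and the word-sized PUSH and POP behavior can be sketched as a small model. The names (stack_model_t, sp_write, push32, pop32) are hypothetical, and the model assumes a stack that grows toward lower addresses with SP pointing at the last pushed word, which is the arrangement described in the stack memory section later in this chapter.

```c
#include <assert.h>
#include <stdint.h>

#define STACK_BYTES 32u

typedef struct {
    uint32_t mem[STACK_BYTES / 4u];  /* modeled stack memory, one entry per word */
    uint32_t sp;                     /* modeled stack pointer (byte offset into mem) */
} stack_model_t;

/* Writes to the lowest two bits of the stack pointer are ignored,
   so the value is always word aligned. */
static void sp_write(stack_model_t *s, uint32_t value)
{
    s->sp = value & ~UINT32_C(3);
}

/* PUSH: move SP down by one word, then store the value there. */
static void push32(stack_model_t *s, uint32_t value)
{
    assert(s->sp >= 4u && s->sp <= STACK_BYTES);
    s->sp -= 4u;
    s->mem[s->sp / 4u] = value;
}

/* POP: read the word SP points to, then move SP up by one word. */
static uint32_t pop32(stack_model_t *s)
{
    assert(s->sp <= STACK_BYTES - 4u);
    uint32_t value = s->mem[s->sp / 4u];
    s->sp += 4u;
    return value;
}
```

Writing 0x23 to the modeled stack pointer yields 0x20, and values come back from pop32 in the reverse order of the push32 calls.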

R14, link register (LR)
R14 is also called the Link Register (LR). This is used for holding the return address
when calling a function or subroutine. At the end of the function or subroutine, the
program control can return to the calling program and resume by loading the value of
LR into the Program Counter (PC). When a function or subroutine call is made, the
value of LR is updated automatically. If a function needs to call another function or
subroutine, it needs to save the value of LR in the stack first. Otherwise, the current
value in LR will be lost when the function call is made.
During exception handling, the LR is also updated automatically to a special
EXC_RETURN (Exception Return) value, which is then used for triggering the
exception return at the end of the exception handler. This will be covered in more
depth in Chapter 8.
Although the return address values in the Cortex-M processors are always even
(bit 0 is zero because the instructions must be aligned to half-word addresses), bit
0 of LR is readable and writeable. Some of the branch/call operations require that
bit zero of LR (or any register being used) be set to 1 to indicate Thumb state.

R15, program counter (PC)
R15 is the Program Counter (PC). It is readable and writeable: a read returns the
current instruction address plus 4 (this is due to the pipeline nature of the design,
and compatibility requirement with the ARM7TDMI™ processor). Writing to PC
(e.g., using data transfer/processing instructions) causes a branch operation.
Since the instructions must be aligned to half-word or word addresses, the Least
Significant Bit (LSB) of the PC is zero. However, when using some of the branch/
memory read instructions to update the PC, you need to set the LSB of the new PC
value to 1 to indicate the Thumb state. Otherwise, a fault exception can be triggered,
as it indicates an attempt to switch to use ARM instructions (i.e., 32-bit ARM instructions as in ARM7TDMI), which is not supported. In high-level programming
languages (including C, C++), the setting of LSB in branch targets is handled by
the compiler automatically.
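The arithmetic behind these rules is simple enough to express as a few helper functions. This is a host-side sketch only; the function names are hypothetical and the addresses used with them are arbitrary example values.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Value returned when an instruction at instr_addr reads the PC. */
static uint32_t pc_read_value(uint32_t instr_addr)
{
    return instr_addr + 4u;
}

/* Form a value suitable for branch instructions that take the target from a
   register or memory: the code address with the LSB set to indicate Thumb state. */
static uint32_t to_branch_value(uint32_t code_addr)
{
    return code_addr | 1u;
}

/* A branch value with the LSB clear would request the unsupported ARM state. */
static bool selects_thumb(uint32_t branch_value)
{
    return (branch_value & 1u) != 0u;
}

/* The instruction address actually fetched: the LSB is not part of it. */
static uint32_t branch_destination(uint32_t branch_value)
{
    return branch_value & ~UINT32_C(1);
}
```

This is also why function pointers printed from C code on a Cortex-M device usually show odd values: the compiler has already set the LSB.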

Table 4.1 Allowed Register Names as Assembly Code
Register   Possible Register Names            Notes
R0-R12     R0, R1 ... R12, r0, r1 ... r12
R13        R13, r13, SP, sp                   Register names MSP and PSP are used in special register access instructions (MRS, MSR)
R14        R14, r14, LR, lr
R15        R15, r15, PC, pc

In most cases, branches and calls are handled by instructions dedicated to such
operations. It is less common to use data processing instructions to update the PC.
However, the value of PC is useful for accessing literal data stored in program memory. So you can frequently find memory read operations with PC as base address register with address offsets generated by immediate values in the instructions.
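As a worked example of PC-relative literal access, the address of a literal can be computed from the instruction address and the immediate offset. The sketch below is a host-side calculation with a hypothetical function name. It relies on two details: the PC reads as the instruction address plus 4, and for literal loads the architecture uses the PC value rounded down to a word boundary as the base. The second detail comes from the architecture definition of literal loads and is not stated in this passage.

```c
#include <assert.h>
#include <stdint.h>

/* Address accessed by a PC-relative literal load located at instr_addr
   with an immediate byte offset of imm_offset. */
static uint32_t literal_address(uint32_t instr_addr, uint32_t imm_offset)
{
    uint32_t pc   = instr_addr + 4u;      /* value the instruction sees in PC */
    uint32_t base = pc & ~UINT32_C(3);    /* word-aligned base address */
    return base + imm_offset;
}
```

For an instruction at 0x08000102 with an offset of 8, the PC reads as 0x08000106, the aligned base is 0x08000104, and the literal is fetched from 0x0800010C.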

Register names in programming
With most assembly tools, you can use a variety of names for accessing the registers
in the register bank. In some assembly tools, such as the ARM assembler (supported
in DS-5™ Professional, Keil™ MDK-ARM), you can use either uppercase, or lowercase, or mixed cases (Table 4.1).

4.2.3 Special registers
Besides the registers in the register bank, there are a number of special registers
(Figure 4.4). These registers contain the processor status and define the operation
states and interrupt/exception masking. In the development of simple applications
with high level programming languages such as C, there are not many scenarios
that require access to these registers. However, they are needed for development
of an embedded OS, or when advanced interrupt masking features are needed.
Special registers are not memory mapped, and can be accessed using special register access instructions such as MSR and MRS.
MRS <reg>, <special_reg> ; Read special register into register
MSR <special_reg>, <reg> ; Write to special register

CMSIS-Core also provides a number of C functions that can be used to access
special registers. Do not confuse these special registers with “special function registers (SFR)” in other microcontroller architectures, which are commonly referred to
as registers for I/O control.

Program status registers
The Program Status Register is composed of three status registers:
• Application PSR (APSR)
• Execution PSR (EPSR)
• Interrupt PSR (IPSR)

FIGURE 4.4
Special Registers

Register  Bit fields
APSR      N [31], Z [30], C [29], V [28], Q [27], GE* [19:16]
IPSR      Exception Number [8:0]
EPSR      ICI/IT [26:25], T [24], ICI/IT [15:10]
*GE is available in ARMv7E-M processors such as the Cortex-M4. It is not available in the Cortex-M3 processor.

FIGURE 4.5
APSR, IPSR, and EPSR

Register  Bit fields
xPSR      N [31], Z [30], C [29], V [28], Q [27], ICI/IT [26:25], T [24], GE* [19:16], ICI/IT [15:10], Exception Number [8:0]
*GE is available in ARMv7E-M processors such as the Cortex-M4. It is not available in the Cortex-M3 processor.

FIGURE 4.6
Combined xPSR

These three registers (Figure 4.5) can be accessed as one combined register,
referred to as xPSR in some documentation. In ARM® assembler, when accessing
xPSR (Figure 4.6), the symbol PSR is used. For example:
MRS r0, PSR ; Read the combined program status word
MSR PSR, r0 ; Write combined program state word

Table 4.2 Bit Fields in Program Status Registers
Bit               Description
N                 Negative flag
Z                 Zero flag
C                 Carry (or NOT borrow) flag
V                 Overflow flag
Q                 Sticky saturation flag (not available in ARMv6-M)
GE[3:0]           Greater-Than or Equal flags for each byte lane (ARMv7E-M only; not available in ARMv6-M or Cortex®-M3).
ICI/IT            Interrupt-Continuable Instruction (ICI) bits, IF-THEN instruction status bit for conditional execution (not available in ARMv6-M).
T                 Thumb state, always 1; trying to clear this bit will cause a fault exception.
Exception Number  Indicates which exception the processor is handling.

You can also access an individual PSR (Figure 4.5). For example:
MRS r0, APSR ; Read Flag state into R0
MRS r0, IPSR ; Read Exception/Interrupt state
MSR APSR, r0 ; Write Flag state

Please note:
• The EPSR cannot be accessed by software code directly using MRS (read as
zero) or MSR.
• The IPSR is read only and can be read from combined PSR (xPSR).

Figure 4.5 shows the definition of the various PSRs in ARMv7-M, and Table 4.2
lists the definition of the bit fields in the PSRs.
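To show how the bit fields fit together, here is a minimal sketch that splits a 32-bit xPSR value into its named fields. The type and function names are hypothetical, and the code runs on any host: it decodes a plain integer rather than reading the register. The ICI/IT bits are left out to keep the example short, and the GE field is only meaningful on ARMv7E-M processors such as the Cortex-M4.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool     n, z, c, v, q;      /* APSR flags, bits 31 to 27 */
    bool     t;                  /* EPSR Thumb bit, bit 24 */
    uint8_t  ge;                 /* APSR GE[3:0], bits 19:16 (ARMv7E-M only) */
    uint16_t exception_number;   /* IPSR, bits 8:0 */
} xpsr_fields_t;

static xpsr_fields_t decode_xpsr(uint32_t xpsr)
{
    xpsr_fields_t f;
    f.n  = ((xpsr >> 31) & 1u) != 0u;
    f.z  = ((xpsr >> 30) & 1u) != 0u;
    f.c  = ((xpsr >> 29) & 1u) != 0u;
    f.v  = ((xpsr >> 28) & 1u) != 0u;
    f.q  = ((xpsr >> 27) & 1u) != 0u;
    f.t  = ((xpsr >> 24) & 1u) != 0u;
    f.ge = (uint8_t)((xpsr >> 16) & 0xFu);
    f.exception_number = (uint16_t)(xpsr & 0x1FFu);
    return f;
}
```

A decoder like this can be handy in a fault handler or a debug log, where the stacked xPSR value is available as a plain word.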
Please note that some of the bit fields in the APSR and EPSR are not available in
the ARMv6-M architecture (e.g., the Cortex®-M0 processor). Also, it is quite different
from classic ARM processors such as the ARM7TDMI™. If you compare this with
the Current Program Status Register (CPSR) in ARM7™, you might find that some
of the bit fields used in ARM7 are gone. The Mode (M) bit field is gone because the
Cortex-M3 does not have the operation mode as defined in ARM7. Thumb-bit (T)
is moved to bit 24. Interrupt status (I and F) bits are replaced by the new interrupt
mask registers (PRIMASKs), which are separated from PSR. For comparison, the
CPSR in traditional ARM processors is shown in Figure 4.7.
Detailed behavior of the APSR is covered in a later part of this chapter (section 4.3).

PRIMASK, FAULTMASK, and BASEPRI registers
The PRIMASK, FAULTMASK, and BASEPRI registers are all used for exception or
interrupt masking. Each exception (including interrupts) has a priority level where a
smaller number is a higher priority and a larger number is a lower priority. These
special registers are used to mask exceptions based on priority levels. They can
only be accessed in the privileged access level (in unprivileged state writes to these
registers are ignored and reads return zero). By default, they are all zero, which
means the masking (disabling of exception/interrupt) is not active. Figure 4.8 shows
the programmer’s model of these registers.

Architecture              Bit fields
ARM general (Cortex-A/R)  N [31], Z [30], C [29], V [28], Q [27], IT [26:25], J [24], Reserved [23:20], GE[3:0] [19:16], IT [15:10], E [9], A [8], I [7], F [6], T [5], M[4:0] [4:0]
ARM7TDMI (ARMv4)          N [31], Z [30], C [29], V [28], I [7], F [6], T [5], M[4:0] [4:0]
ARMv7-M (Cortex-M3)       N [31], Z [30], C [29], V [28], Q [27], ICI/IT [26:25], T [24], ICI/IT [15:10], Exception Number [8:0]
ARMv7E-M (Cortex-M4)      N [31], Z [30], C [29], V [28], Q [27], ICI/IT [26:25], T [24], GE[3:0] [19:16], ICI/IT [15:10], Exception Number [8:0]
ARMv6-M (Cortex-M0)       N [31], Z [30], C [29], V [28], T [24], Exception Number [5:0]

FIGURE 4.7
Comparing PSR of various ARM architectures

Register   Bit fields
PRIMASK    bit 0
FAULTMASK  bit 0
BASEPRI    bits 7:0 (3 bits to 8 bits implemented, followed by 0 bit to 5 bits not implemented)

FIGURE 4.8
PRIMASK, FAULTMASK, and BASEPRI registers

The PRIMASK register is a 1-bit wide interrupt mask register. When set, it
blocks all exceptions (including interrupts) apart from the Non-Maskable Interrupt
(NMI) and the HardFault exception. Effectively it raises the current exception priority level to 0, which is the highest level for a programmable exception/interrupt.
The most common usage for PRIMASK is to disable all interrupts for a time-critical
process. After the time-critical process is completed, PRIMASK needs to be cleared
to re-enable interrupts. Details for using PRIMASK are given in section 7.10.1.
The FAULTMASK register is very similar to PRIMASK, but it also blocks
the HardFault exception, which effectively raises the current exception priority
level to -1. FAULTMASK can be used by fault handling code to suppress
the triggering of further faults (only certain types of them) during fault handling.
For example, FAULTMASK can be used to bypass the MPU or to suppress bus faults
(both behaviors are configurable). This potentially makes it easier for fault handling
code to carry out remedial actions. Unlike PRIMASK, FAULTMASK is cleared
automatically at exception return. Details for using FAULTMASK are given in
section 7.10.2.
In order to allow more flexible interrupt masking, the ARMv7-M architecture
also provides the BASEPRI register, which masks exceptions or interrupts based on priority
level. The width of the BASEPRI register depends on how many priority levels are
implemented in the design, which is determined by the microcontroller vendors.
Most Cortex-M3 or Cortex-M4 microcontrollers have eight programmable exception priority levels (3-bit width) or 16 levels, and in these cases the width of BASEPRI will be 3 bits or 4 bits, respectively. When BASEPRI is set to 0, it is disabled.
When it is set to a non-zero value, it blocks exceptions (including interrupts) that
have the same or lower priority level, while still allowing exceptions with a higher
priority level to be accepted by the processor. Details of using BASEPRI are covered
in section 7.10.3.
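The combined effect of the three masking registers can be summarized with a small host-side model. This is only a sketch: the function names are hypothetical, the NMI priority of -2 comes from the architecture rather than from this section, and preemption by an already active handler is not modeled. The helper that places a priority level in the upper implemented bits of BASEPRI follows the layout shown in Figure 4.8.

```c
#include <stdbool.h>
#include <stdint.h>

/* Fixed priority levels: HardFault is -1 (see text); NMI is -2
 * (architecture-defined, assumed here). Programmable levels start at 0. */
#define PRIO_NMI       (-2)
#define PRIO_HARDFAULT (-1)

/* Hypothetical helper: returns true when an exception with priority `prio`
 * is blocked by the masking registers alone. */
bool exception_is_masked(int prio, uint32_t primask,
                         uint32_t faultmask, uint8_t basepri)
{
    int threshold = 256;              /* below every 8-bit level: nothing masked */

    if (basepri != 0u)   threshold = basepri; /* block same or lower priority   */
    if (primask & 1u)    threshold = 0;       /* current priority raised to 0   */
    if (faultmask & 1u)  threshold = -1;      /* current priority raised to -1  */

    return prio >= threshold;         /* larger number = lower priority */
}

/* Hypothetical helper: place a priority level in the implemented (upper)
 * bits of the 8-bit BASEPRI field; `bits` is the implemented width (3 to 8). */
uint8_t basepri_from_level(unsigned level, unsigned bits)
{
    return (uint8_t)((level << (8u - bits)) & 0xFFu);
}
```

With 3 implemented bits, for example, priority level 3 corresponds to a BASEPRI value of 0x60, which masks that level and every numerically larger one.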
CMSIS-Core provides a number of functions for accessing the PRIMASK,
FAULTMASK, and BASEPRI registers in the C programming environment (note:
these registers can only be accessed in the privileged access level).
x = __get_BASEPRI();    // Read BASEPRI register
x = __get_PRIMASK();    // Read PRIMASK register
x = __get_FAULTMASK();  // Read FAULTMASK register
__set_BASEPRI(x);       // Set new value for BASEPRI
__set_PRIMASK(x);       // Set new value for PRIMASK
__set_FAULTMASK(x);     // Set new value for FAULTMASK
__disable_irq();        // Set PRIMASK, disable IRQ
__enable_irq();         // Clear PRIMASK, enable IRQ
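A common way to use these functions is a critical section that saves PRIMASK on entry and restores it on exit, so that nested sections do not re-enable interrupts too early. The sketch below runs on a host: the `sim_` functions are hypothetical stand-ins for `__get_PRIMASK()`, `__set_PRIMASK()`, and `__disable_irq()`, and `critical_enter`/`critical_exit` are illustrative names, not part of CMSIS.

```c
#include <stdint.h>

/* Host-side stand-ins (assumptions) for the CMSIS-Core intrinsics.
 * On a target, replace them with __get_PRIMASK(), __set_PRIMASK(),
 * and __disable_irq(). */
uint32_t sim_primask;                               /* models PRIMASK bit 0 */

uint32_t sim_get_PRIMASK(void)       { return sim_primask; }
void     sim_set_PRIMASK(uint32_t v) { sim_primask = v & 1u; }
void     sim_disable_irq(void)       { sim_primask = 1u; }

/* Enter a critical section and return the previous PRIMASK value. */
uint32_t critical_enter(void)
{
    uint32_t saved = sim_get_PRIMASK();
    sim_disable_irq();
    return saved;
}

/* Leave a critical section by restoring the saved PRIMASK value. */
void critical_exit(uint32_t saved)
{
    sim_set_PRIMASK(saved);
}
```

Restoring the saved value, rather than unconditionally clearing PRIMASK, keeps interrupts disabled when the section was entered with interrupts already masked.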

Alternatively, you can also access these exception masking registers with assembly code:
MRS r0, BASEPRI    ; Read BASEPRI register into R0
MRS r0, PRIMASK    ; Read PRIMASK register into R0
MRS r0, FAULTMASK  ; Read FAULTMASK register into R0
MSR BASEPRI, r0    ; Write R0 into BASEPRI register
MSR PRIMASK, r0    ; Write R0 into PRIMASK register
MSR FAULTMASK, r0  ; Write R0 into FAULTMASK register

In addition, the Change Processor State (CPS) instructions allow the value of the
PRIMASK and FAULTMASK to be set or cleared with a simple instruction.
CPSIE i  ; Enable interrupt (clear PRIMASK)
CPSID i  ; Disable interrupt (set PRIMASK)
CPSIE f  ; Enable interrupt (clear FAULTMASK)
CPSID f  ; Disable interrupt (set FAULTMASK)

Note: The FAULTMASK and BASEPRI registers are not available in ARMv6-M
(e.g., Cortex-M0).

FIGURE 4.9
CONTROL register in Cortex-M3, Cortex-M4, Cortex-M4 with FPU. The bit nPRIV is not
available in the Cortex-M0 and is optional in the Cortex-M0+ processor

Processor                   Bits 31:3   Bit 2      Bit 1   Bit 0
Cortex-M3, Cortex-M4        Reserved    Reserved   SPSEL   nPRIV
Cortex-M4 with FPU          Reserved    FPCA       SPSEL   nPRIV
ARMv6-M (e.g. Cortex-M0)    Reserved    Reserved   SPSEL   nPRIV

CONTROL register
The CONTROL register (Figure 4.9) defines:
• The selection of stack pointer (Main Stack Pointer/Process Stack Pointer)
• Access level in Thread mode (Privileged/Unprivileged)

In addition, for Cortex-M4 processor with a floating point unit, one bit of the
CONTROL register indicates if the current context (currently executed code) uses
the floating point unit or not.
Note: The CONTROL register for ARMv6-M (e.g., Cortex-M0) is also shown
for comparison. In ARMv6-M, support of nPRIV and unprivileged access level is
implementation dependent, and is not available in the first generation of the
Cortex-M0 products and Cortex-M1 products. It is optional in the Cortex-M0+
processor.
The CONTROL register can only be modified in the privileged access level and
can be read in both privileged and unprivileged access levels. The definition of each
bit field in the CONTROL register is shown in Table 4.3.
After reset, the CONTROL register is 0. This means that Thread mode uses the
Main Stack Pointer as the Stack Pointer and has privileged access. Programs in privileged Thread mode can switch the Stack Pointer selection or switch
to the unprivileged access level by writing to CONTROL (Figure 4.10). However,
once nPRIV (CONTROL bit 0) is set, the program running in Thread mode can no longer
write to the CONTROL register.

Table 4.3 Bit Fields in CONTROL Register

Bit             Function
nPRIV (bit 0)   Defines the privilege level in Thread mode:
                When this bit is 0 (default), it is privileged level when in Thread mode.
                When this bit is 1, it is unprivileged when in Thread mode.
                In Handler mode, the processor is always in privileged access level.
SPSEL (bit 1)   Defines the Stack Pointer selection:
                When this bit is 0 (default), Thread mode uses Main Stack Pointer
                (MSP).
                When this bit is 1, Thread mode uses Process Stack Pointer (PSP).
                In Handler mode, this bit is always 0 and writes to this bit are ignored.
FPCA (bit 2)    Floating Point Context Active – This bit is only available in Cortex-M4
                with floating point unit implemented. The exception handling
                mechanism uses this bit to determine if registers in the floating point
                unit need to be saved when an exception has occurred.
                When this bit is 0 (default), the floating point unit has not been used in
                the current context and therefore there is no need to save floating point
                registers.
                When this bit is 1, the current context has used floating point
                instructions and therefore the floating point registers need to be saved.
                The FPCA bit is set automatically when a floating point instruction is
                executed. This bit is cleared by hardware on exception entry.
                There are several options for handling saving of floating point registers.
                This will be covered in Chapter 13.

FIGURE 4.10
Stack Pointer selection
[State diagram, Thumb state: after Start, the processor is in Thread mode (executing normal code) with SPSEL = 0 (MSP selected) and can switch to SPSEL = 1 (PSP selected). An exception request enters Handler mode (executing exception handler, SPSEL = 0, MSP selected); an exception return goes back to Thread mode.]

A program in unprivileged access level cannot switch itself back to privileged
access level. This is essential in order to provide a basic security usage model.
For example, an embedded system might contain untrusted applications running
in unprivileged access level and the access permission of these applications must
be restricted to prevent security breaches or to prevent an unreliable application from
crashing the whole system.
If it is necessary to switch the processor back to using privileged access level in
Thread mode, then the exception mechanism is needed. During exception handling,
the exception handler can clear the nPRIV bit (Figure 4.11). When returning to
Thread mode, the processor will be in privileged access level.

FIGURE 4.11
Switching between privileged thread mode and unprivileged thread mode
When an embedded OS is used, the CONTROL register could be reprogrammed
at each context switch to allow some application tasks to run with privileged access
level and the others to run with unprivileged access level.
The settings of nPRIV and SPSEL are orthogonal. Four different combinations of nPRIV and SPSEL are possible, although only three of them are
commonly used in real world applications, as shown in Table 4.4.
In most simple applications without an embedded OS, there is no need to change
the value of the CONTROL register. The whole application can run in privileged access level and use only the MSP (Figure 4.12).
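The four nPRIV/SPSEL combinations can be decoded from a CONTROL value with a few bit tests. The following host-runnable sketch uses hypothetical macro and function names; on a target, the value passed in would come from `__get_CONTROL()`, and the result describes Thread mode only, since Handler mode is always privileged and always uses the MSP.

```c
#include <stdbool.h>
#include <stdint.h>

#define CONTROL_nPRIV (1u << 0)   /* 1 = unprivileged Thread mode */
#define CONTROL_SPSEL (1u << 1)   /* 1 = Thread mode uses PSP     */

bool thread_is_privileged(uint32_t control)
{
    return (control & CONTROL_nPRIV) == 0u;
}

bool thread_uses_psp(uint32_t control)
{
    return (control & CONTROL_SPSEL) != 0u;
}

/* Short description of the usage scenario for each combination. */
const char *control_usage(uint32_t control)
{
    switch (control & (CONTROL_nPRIV | CONTROL_SPSEL)) {
    case 0u:
        return "privileged Thread mode on MSP (simple application)";
    case CONTROL_SPSEL:
        return "privileged Thread mode on PSP (OS task)";
    case (CONTROL_SPSEL | CONTROL_nPRIV):
        return "unprivileged Thread mode on PSP (OS task)";
    default:
        return "unprivileged Thread mode on MSP (rarely used)";
    }
}
```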
To access the CONTROL register in C, the following functions are available in
CMSIS-compliant device-driver libraries:
x = __get_CONTROL();  // Read the current value of CONTROL
__set_CONTROL(x);     // Set the CONTROL value to x

There are two points that you need to be aware of when changing the value of the
CONTROL register:
• For the Cortex-M4 processor with floating point unit (FPU), or any variant of
  ARMv7-M processors with an FPU, the FPCA bit can be set automatically due to
  the presence of floating point instructions. If the program contains floating point
  operations and the FPCA bit is cleared accidentally, and subsequently an interrupt occurs, the data in registers in the floating point unit will not be saved by the
  exception entry sequence and could be overwritten by the interrupt handler. In
  this case, the program will not be able to continue correct processing when
  resuming the interrupted task.
• After modifying the CONTROL register, architecturally an Instruction Synchronization Barrier (ISB) instruction (or the __ISB() function in a CMSIS-compliant
  driver) should be used to ensure the effect of the change applies to subsequent
  code. Due to the simple nature of the Cortex-M3, Cortex-M4, Cortex-M0+,
  Cortex-M0, and Cortex-M1 pipelines, omission of the ISB instruction does not
  cause any problem.

Table 4.4 Different Combinations of nPRIV and SPSEL

nPRIV   SPSEL   Usage Scenario
0       0       Simple applications – the whole application is running in
                privileged access level. Only one stack is used by the main
                program and interrupt handlers. Only the Main Stack
                Pointer (MSP) is used.
0       1       Applications with an embedded OS, with the currently executing
                task running in privileged Thread mode. The Process Stack
                Pointer (PSP) is selected in the current task, and the MSP is
                used by the OS kernel and exception handlers.
1       1       Applications with an embedded OS, with the currently executing
                task running in unprivileged Thread mode. The Process
                Stack Pointer (PSP) is selected in the current task, and the
                MSP is used by the OS kernel and exception handlers.
1       0       Thread mode tasks running with unprivileged access level
                and using the MSP. This can be observed in Handler mode but is
                less likely to be used for user tasks because in most
                embedded OSs, the stack for application tasks is separated
                from the stack used by the OS kernel and exception handlers.

FIGURE 4.12
Simple applications do not require unprivileged Thread mode
[Timeline: the starting code and the application run as privileged thread; each exception runs an exception handler as privileged handler and then returns to privileged thread. The unprivileged thread level is not used.]

To access the CONTROL register in assembly, the MRS and MSR instructions are
used:
MRS r0, CONTROL  ; Read CONTROL register into R0
MSR CONTROL, r0  ; Write R0 into CONTROL register


You can detect if the current execution level is privileged by checking the value
of IPSR and CONTROL:
int in_privileged(void)
{
    if (__get_IPSR() != 0) return 1;                  // True (Handler mode)
    else if ((__get_CONTROL() & 0x1) == 0) return 1;  // True (privileged Thread mode)
    else return 0;                                    // False
}

4.2.4 Floating point registers
The Cortex-M4 processor has an optional floating point unit. This provides additional registers for floating point data processing, as well as a Floating Point Status
and Control Register (FPSCR) (Figure 4.13).

S0 to S31/D0 to D15
Each of the 32-bit registers S0 to S31 (“S” for single precision) can be accessed
using floating point instructions, or accessed as a pair, under the names D0 to
D15 (“D” for double-word/double-precision). For example, S1 and S0 are paired
together to become D0, and S3 and S2 are paired together to become D1. Although
the floating point unit in the Cortex-M4 does not support double precision floating
point calculations, you can still use floating point instructions for transferring double
precision data.

FIGURE 4.13
Registers in the floating point unit

FIGURE 4.14
Bit field in FPSCR

Bit(s)   Field
31       N
30       Z
29       C
28       V
27       Reserved
26       AHP
25       DN
24       FZ
23:22    RMode
21:8     Reserved
7        IDC
6:5      Reserved
4        IXC
3        UFC
2        OFC
1        DZC
0        IOC
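The pairing rule can be written down as simple index arithmetic: register Dn overlays S(2n) and S(2n+1). The sketch below models the register bank as a plain array on a host; the function names are hypothetical, and placing the even-numbered S register in the lower half of the double-word is the architectural layout, which this section does not spell out.

```c
#include <stdint.h>

/* Indices of the two S registers that make up Dn (n = 0..15). */
unsigned d_low_s(unsigned n)  { return 2u * n; }       /* D0 -> S0, D1 -> S2 */
unsigned d_high_s(unsigned n) { return 2u * n + 1u; }  /* D0 -> S1, D1 -> S3 */

/* Read Dn as a raw 64-bit value from an array holding S0..S31 as raw words. */
uint64_t read_d(const uint32_t s[32], unsigned n)
{
    return ((uint64_t)s[d_high_s(n)] << 32) | s[d_low_s(n)];
}
```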

Floating point status and control register (FPSCR)
The FPSCR contains various bit fields (Figure 4.14) for two purposes:
• To define some of the floating point operation behaviors
• To provide status information about the floating point operation results

By default, the behavior is configured to be compliant with IEEE 754 single precision operation. In normal applications there is no need to modify the settings of the
floating point operation control. Table 4.5 lists the descriptions for the bit fields in
FPSCR.
Note: The exception bits in FPSCR can be used by software to detect abnormalities in floating point operations. Bit fields in FPSCR are covered in Chapter 13.
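As an illustration of how software might inspect these fields, the following host-runnable sketch decodes a raw FPSCR value using the bit positions from Figure 4.14. The macro and function names are hypothetical; on a target with an FPU, the raw value would typically be obtained through the CMSIS-Core `__get_FPSCR()` function and written back with `__set_FPSCR()`.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit positions taken from Figure 4.14. */
#define FPSCR_AHP          (1u << 26)
#define FPSCR_DN           (1u << 25)
#define FPSCR_FZ           (1u << 24)
#define FPSCR_RMODE_SHIFT  22u
#define FPSCR_IDC          (1u << 7)
#define FPSCR_IXC          (1u << 4)
#define FPSCR_UFC          (1u << 3)
#define FPSCR_OFC          (1u << 2)
#define FPSCR_DZC          (1u << 1)
#define FPSCR_IOC          (1u << 0)
#define FPSCR_EXC_MASK     (FPSCR_IDC | FPSCR_IXC | FPSCR_UFC | \
                            FPSCR_OFC | FPSCR_DZC | FPSCR_IOC)

/* Rounding mode field: 0 = RN, 1 = RP, 2 = RM, 3 = RZ. */
unsigned fpscr_rounding_mode(uint32_t fpscr)
{
    return (fpscr >> FPSCR_RMODE_SHIFT) & 3u;
}

/* True when any cumulative exception bit is set. */
bool fpscr_any_exception(uint32_t fpscr)
{
    return (fpscr & FPSCR_EXC_MASK) != 0u;
}

/* Value to write back in order to clear all cumulative exception bits. */
uint32_t fpscr_clear_exceptions(uint32_t fpscr)
{
    return fpscr & ~FPSCR_EXC_MASK;
}
```

Because the exception bits are cumulative, a routine can clear them, run a sequence of floating point operations, and test them once at the end.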

Memory-mapped floating point unit control registers
In addition to the floating point register bank and FPSCR, the floating point unit also
introduces several additional memory-mapped registers into the system. For
example, the Coprocessor Access Control Register (CPACR) is used to enable or
disable the floating point unit. By default the floating point unit is disabled to reduce
power consumption. Before using any floating point instructions, the floating point
unit must be enabled by programming the CPACR register (Figure 4.15).
In the C programming environment with a CMSIS-compliant device-driver:
SCB->CPACR |= 0xF << 20; // Enable full access to the FPU

In assembly language programming environment, you can use the following
code:
LDR  R0, =0xE000ED88  ; R0 set to address of CPACR
LDR  R1, =0x00F00000  ; R1 = 0xF << 20
LDR  R2, [R0]         ; Read current value of CPACR
ORRS R2, R2, R1       ; Set bits
STR  R2, [R0]         ; Write back modified value to CPACR

The other memory-mapped floating point unit registers will be covered in
Chapter 13, which also covers details of the floating point unit.
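The constant 0xF << 20 used above sets both 2-bit access fields, CP10 and CP11, to 11 (full access). The following host-runnable sketch shows how such a value can be composed from the field encoding; the enum and function names are hypothetical, and the function only computes a value rather than touching the real register.

```c
#include <stdint.h>

/* Field encodings for the CP10 and CP11 access fields (10 is reserved). */
enum cp_access { CP_DENIED = 0, CP_PRIV_ONLY = 1, CP_FULL = 3 };

/* Return `cpacr` with the 2-bit access field of coprocessor `cp`
 * (10 or 11 for the FPU) replaced by `access`. */
uint32_t cpacr_set_access(uint32_t cpacr, unsigned cp, enum cp_access access)
{
    unsigned shift = 2u * cp;           /* CP10 -> bits 21:20, CP11 -> bits 23:22 */

    cpacr &= ~(3u << shift);            /* clear the old field */
    return cpacr | ((uint32_t)access << shift);
}
```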

Table 4.5 Bit Fields in FPSCR

Bit     Description
N       Negative flag (updated by floating point comparison operations)
Z       Zero flag (updated by floating point comparison operations)
C       Carry/borrow flag (updated by floating point comparison operations)
V       Overflow flag (updated by floating point comparison operations)
AHP     Alternate half-precision control bit:
        0 – IEEE half-precision format (default)
        1 – Alternative half-precision format
DN      Default NaN (Not a Number) mode control bit:
        0 – NaN operands propagate through to the output of a floating point
        operation (default)
        1 – Any operation involving one or more NaN(s) returns the default NaN
FZ      Flush-to-zero mode control bit:
        0 – Flush-to-zero mode disabled (default). (IEEE 754 standard compliant)
        1 – Flush-to-zero mode enabled
RMode   Rounding Mode Control field. The specified rounding mode is used by
        almost all floating-point instructions:
        00 – Round to Nearest (RN) mode (default)
        01 – Round towards Plus Infinity (RP) mode
        10 – Round towards Minus Infinity (RM) mode
        11 – Round towards Zero (RZ) mode
IDC     Input Denormal cumulative exception bit. Set to 1 when the floating point
        exception occurs; cleared by writing 0 to this bit. (Result not within
        normalized value range; see section 13.1.2.)
IXC     Inexact cumulative exception bit. Set to 1 when the floating point exception
        occurs; cleared by writing 0 to this bit.
UFC     Underflow cumulative exception bit. Set to 1 when the floating point exception
        occurs; cleared by writing 0 to this bit.
OFC     Overflow cumulative exception bit. Set to 1 when the floating point exception
        occurs; cleared by writing 0 to this bit.
DZC     Division by Zero cumulative exception bit. Set to 1 when the floating point
        exception occurs; cleared by writing 0 to this bit.
IOC     Invalid Operation cumulative exception bit. Set to 1 when the floating point
        exception occurs; cleared by writing 0 to this bit.

FIGURE 4.15
Bit field in CPACR

Bits     Field
31:24    Reserved
23:22    CP11
21:20    CP10
19:0     Reserved

Bit field encoding (CP10 and CP11):
00 – Access denied
01 – Privileged access only
10 – Reserved (unpredictable)
11 – Full access

4.3 Behavior of the application program status register (APSR)
The APSR contains several groups of status flags:
• Status flags for integer operations (N-Z-C-V bits)
• Status flags for saturation arithmetic (Q bit)
• Status flags for SIMD operations (GE bits)
4.3.1 Integer status flags
The integer status flags are very similar to ALU status flags in many other processor
architectures. These flags are affected by general data processing instructions, and
are essential for controlling conditional branches and conditional executions. In
addition, one of the APSR flags, the C (Carry) bit, can also be used as an input to add and subtract operations.
There are four integer flags in the Cortex®-M processors, shown in Table 4.6.
A few examples of the ALU flag results are shown in Table 4.7.
In the ARMv7-M and ARMv7E-M architectures, most of the 16-bit instructions affect these four ALU flags. In most of the 32-bit instructions, one of the
bits in the instruction encoding defines whether the APSR flags should be updated
or not. Note that some of these instructions do not update the V flag or the C
flag. For example, the MULS (multiply) instruction only changes the N flag
and the Z flag.

Table 4.6 ALU Flags on the Cortex-M Processors

Flag         Description
N (bit 31)   Set to bit[31] of the result of the executed instruction. When it
             is “1,” the result has a negative value (when interpreted as a
             signed integer). When it is “0,” the result has a positive value
             or is equal to zero.
Z (bit 30)   Set to “1” if the result of the executed instruction is zero. It can
             also be set to “1” after a compare instruction is executed if the
             two values are the same.
C (bit 29)   Carry flag of the result. For unsigned addition, this bit is set to “1”
             if an unsigned overflow occurred. For unsigned subtract
             operations, this bit is the inverse of the borrow output status.
             This bit is also updated by shift and rotate operations.
V (bit 28)   Overflow of the result. For signed addition or subtraction, this bit
             is set to “1” if a signed overflow occurred.

Table 4.7 ALU Flags Example

Operation                  Results, Flags
0x70000000 + 0x70000000    Result = 0xE0000000, N = 1, Z = 0, C = 0, V = 1
0x90000000 + 0x90000000    Result = 0x20000000, N = 0, Z = 0, C = 1, V = 1
0x80000000 + 0x80000000    Result = 0x00000000, N = 0, Z = 1, C = 1, V = 1
0x00001234 - 0x00001000    Result = 0x00000234, N = 0, Z = 0, C = 1, V = 0
0x00000004 - 0x00000005    Result = 0xFFFFFFFF, N = 1, Z = 0, C = 0, V = 0
0xFFFFFFFF - 0xFFFFFFFC    Result = 0x00000003, N = 0, Z = 0, C = 1, V = 0
0x80000005 - 0x80000004    Result = 0x00000001, N = 0, Z = 0, C = 1, V = 0
0x70000000 - 0xF0000000    Result = 0x80000000, N = 1, Z = 0, C = 0, V = 1
0xA0000000 - 0xA0000000    Result = 0x00000000, N = 0, Z = 1, C = 1, V = 0
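The flag rules in Table 4.6 can be checked on a host with a small model of 32-bit addition and subtraction. This is an illustrative sketch with hypothetical names, not a description of how the processor derives the flags internally.

```c
#include <stdint.h>

struct alu_result {
    uint32_t value;
    unsigned n, z, c, v;
};

/* Model of a flag-setting 32-bit addition (a + b). */
struct alu_result add_with_flags(uint32_t a, uint32_t b)
{
    struct alu_result r;

    r.value = a + b;
    r.n = r.value >> 31;                             /* bit 31 of the result   */
    r.z = (r.value == 0u);
    r.c = (r.value < a);                             /* unsigned overflow      */
    r.v = ((~(a ^ b) & (a ^ r.value)) >> 31) & 1u;   /* same-sign inputs, result
                                                        sign differs           */
    return r;
}

/* Model of a flag-setting 32-bit subtraction (a - b). */
struct alu_result sub_with_flags(uint32_t a, uint32_t b)
{
    struct alu_result r;

    r.value = a - b;
    r.n = r.value >> 31;
    r.z = (r.value == 0u);
    r.c = (a >= b);                                  /* inverse of borrow      */
    r.v = (((a ^ b) & (a ^ r.value)) >> 31) & 1u;    /* signed overflow        */
    return r;
}
```

For instance, `sub_with_flags(0x70000000, 0xF0000000)` yields 0x80000000 with N = 1, C = 0 (a borrow occurred), and V = 1 (a positive value minus a negative value produced a negative result).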

In addition to conditional branch or conditional execution code, the Carry bit
of APSR can also be used to extend add and subtract operations to over 32 bits.
For example, when adding two 64-bit integers together, we can use the carry bit
from the lower 32-bit add operation as an extra input for the upper 32-bit add
operation:
// Calculating Z = X + Y, where X, Y and Z are all 64-bit
Z[31:0]  = X[31:0]  + Y[31:0];          // Calculate lower word addition,
                                        // carry flag gets updated
Z[63:32] = X[63:32] + Y[63:32] + Carry; // Calculate upper word addition
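The same idea can be expressed in portable C, where the carry out of the lower word is detected by checking whether the 32-bit sum wrapped around. This is a host-runnable sketch with a hypothetical function name; index 0 of each array holds the lower word.

```c
#include <stdint.h>

/* z = x + y for 64-bit values stored as two 32-bit words each
 * (index 0 = bits 31:0, index 1 = bits 63:32). */
void add64(const uint32_t x[2], const uint32_t y[2], uint32_t z[2])
{
    uint32_t lo = x[0] + y[0];
    uint32_t carry = (lo < x[0]);   /* 1 when the lower-word addition wrapped */

    z[0] = lo;
    z[1] = x[1] + y[1] + carry;     /* upper-word addition plus carry in */
}
```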

The N-Z-C-V flags are available in all ARM® processors, including the Cortex-M0 processor.

4.3.2 Q status flag
The Q bit is used to indicate an occurrence of saturation during saturation arithmetic
operations or saturation adjustment operations. It is available in ARMv7-M (e.g.,
Cortex®-M3 and Cortex-M4 processors), but not ARMv6-M (e.g., Cortex-M0 processor). After this bit is set, it remains set until a software write to the APSR clears
the Q bit. Saturation arithmetic/adjustment operations do not clear this bit. As a
result, you can use this bit to determine if saturation occurred at the end of a
sequence of saturation arithmetic/adjustment operations, without the need to check
the saturation status during each step.
Saturation arithmetic is useful for digital signal processing. In some cases, the
destination register used to hold a calculation result might not have sufficient bit
width and as a result, overflow or underflow occurs. If normal data arithmetic
instructions are used, the MSB of the result would be lost and can cause a serious
distortion in the output. Instead of just cutting off the MSB, saturation arithmetic
forces the result to the maximum value (in case of overflow) or minimum value
(in case of underflow) to reduce the impact of signal distortion (Figure 4.16).

FIGURE 4.16
Signed saturation and unsigned saturation
[Two plots of the actual result against the result of calculation, one for signed and one for unsigned saturation. The actual result follows the ideal result between the minimum and maximum possible values of the result and is clamped (saturated, Q bit set) outside that range.]

The actual maximum and minimum values that trigger the saturation depend on
the instructions being used. In most cases, the instructions for saturation arithmetic
have mnemonics starting with “Q,” for example “QADD16.” If saturation occurred, the
Q bit is set; otherwise, the value of the Q bit is unchanged.
The Cortex-M3 processor provides a couple of saturation adjustment instructions, and the Cortex-M4 provides a full set of saturation arithmetic instructions,
as well as those saturation adjustment instructions available in the Cortex-M3
processor.
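The behavior of signed saturation and the sticky Q bit can be modeled on a host as follows. This is a sketch under stated assumptions: `q_flag`, `ssat_model`, and `qadd_model` are hypothetical names that imitate a signed saturation adjustment to a given bit width and a saturating 32-bit addition; they are not CMSIS functions.

```c
#include <stdint.h>

unsigned q_flag;   /* models the sticky Q bit: set on saturation, never
                      cleared by the saturating operations themselves */

/* Saturate `value` to a signed range of `bits` bits (1 to 32). */
int32_t ssat_model(int32_t value, unsigned bits)
{
    int64_t max = ((int64_t)1 << (bits - 1u)) - 1;
    int64_t min = -((int64_t)1 << (bits - 1u));

    if (value > max) { q_flag = 1u; return (int32_t)max; }
    if (value < min) { q_flag = 1u; return (int32_t)min; }
    return value;                    /* in range: Q is left unchanged */
}

/* Saturating signed 32-bit addition. */
int32_t qadd_model(int32_t a, int32_t b)
{
    int64_t sum = (int64_t)a + b;

    if (sum > INT32_MAX) { q_flag = 1u; return INT32_MAX; }
    if (sum < INT32_MIN) { q_flag = 1u; return INT32_MIN; }
    return (int32_t)sum;
}
```

Because the flag is sticky, software can clear it once, run a whole sequence of saturating operations, and test it a single time at the end.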

4.3.3 GE bits
The “Greater-Equal” (GE) is a 4-bit wide field in the APSR in the Cortex®-M4, and
is not available in the Cortex-M3 processor. It is updated by a number of SIMD
instructions where, in most cases, each bit represents positive or overflow of
SIMD operations for each byte (Table 4.8). For SIMD instructions with 16-bit
data, bit 0 and bit 1 are controlled by the result of the lower half-word, and bit 2 and
bit 3 are controlled by the result of the upper half-word.
The GE flags are used by the SEL instruction (Figure 4.17), which multiplexes
the byte values from two source registers based on each GE bit. When combining
SIMD instructions with the SEL instruction, simple conditional data selection can
be created in SIMD arrangement for better performance.
You can also read back the GE bits by reading APSR into a general purpose register for additional processing. More details of the SIMD and SEL instructions are
given in Chapter 5.
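To illustrate the combination of a SIMD instruction and SEL, the following host-runnable sketch models a per-byte unsigned maximum: a USUB8-style comparison produces the four GE bits, and a SEL-style multiplexer picks each byte from one of two inputs. The function names are hypothetical, and the model covers only the GE result of the subtraction, not the difference itself.

```c
#include <stdint.h>

/* GE[3:0] as produced by a USUB8-style operation: GE[i] is 1 when
 * byte i of a minus byte i of b is >= 0 (that is, a_i >= b_i). */
unsigned usub8_ge(uint32_t a, uint32_t b)
{
    unsigned ge = 0u;

    for (unsigned i = 0u; i < 4u; i++) {
        uint32_t ai = (a >> (8u * i)) & 0xFFu;
        uint32_t bi = (b >> (8u * i)) & 0xFFu;
        if (ai >= bi) {
            ge |= 1u << i;
        }
    }
    return ge;
}

/* SEL-style multiplexer: byte i comes from rn when GE[i] is 1,
 * otherwise from rm. */
uint32_t sel_model(unsigned ge, uint32_t rn, uint32_t rm)
{
    uint32_t result = 0u;

    for (unsigned i = 0u; i < 4u; i++) {
        uint32_t lane = 0xFFu << (8u * i);
        result |= ((ge >> i) & 1u) ? (rn & lane) : (rm & lane);
    }
    return result;
}

/* Per-byte unsigned maximum of two packed words. */
uint32_t umax8(uint32_t a, uint32_t b)
{
    return sel_model(usub8_ge(a, b), a, b);
}
```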

Table 4.8 GE Flags Results

SIMD Operation            Results
SADD16, SSUB16,           If lower half-word result >= 0 then GE[1:0] = 2’b11 else GE[1:0] = 2’b00
USUB16, SASX, SSAX        If upper half-word result >= 0 then GE[3:2] = 2’b11 else GE[3:2] = 2’b00
UADD16                    If lower half-word result >= 0x10000 then GE[1:0] = 2’b11 else GE[1:0] = 2’b00
                          If upper half-word result >= 0x10000 then GE[3:2] = 2’b11 else GE[3:2] = 2’b00
SADD8, SSUB8, USUB8       If byte 0 result >= 0 then GE[0] = 1’b1 else GE[0] = 1’b0
                          If byte 1 result >= 0 then GE[1] = 1’b1 else GE[1] = 1’b0
                          If byte 2 result >= 0 then GE[2] = 1’b1 else GE[2] = 1’b0
                          If byte 3 result >= 0 then GE[3] = 1’b1 else GE[3] = 1’b0
UADD8                     If byte 0 result >= 0x100 then GE[0] = 1’b1 else GE[0] = 1’b0
                          If byte 1 result >= 0x100 then GE[1] = 1’b1 else GE[1] = 1’b0
                          If byte 2 result >= 0x100 then GE[2] = 1’b1 else GE[2] = 1’b0
                          If byte 3 result >= 0x100 then GE[3] = 1’b1 else GE[3] = 1’b0
UASX                      If lower half-word result >= 0 then GE[1:0] = 2’b11 else GE[1:0] = 2’b00
                          If upper half-word result >= 0x10000 then GE[3:2] = 2’b11 else GE[3:2] = 2’b00
USAX                      If lower half-word result >= 0x10000 then GE[1:0] = 2’b11 else GE[1:0] = 2’b00
                          If upper half-word result >= 0x0 then GE[3:2] = 2’b11 else GE[3:2] = 2’b00

FIGURE 4.17
SEL operation
[Diagram for SEL R2, R1, R0: four byte-wide multiplexers, one per byte lane. GE[3] selects R2[31:24], GE[2] selects R2[23:16], GE[1] selects R2[15:8], and GE[0] selects R2[7:0]; each lane takes the R0 byte when its GE bit is 0 and the R1 byte when it is 1.]

4.4 Memory system

4.4.1 Memory system features
The Cortex®-M3 and Cortex-M4 processors have the following memory system
features:
• 4GB linear address space – With 32-bit addressing, the ARM® processors can
  access up to 4GB of memory space. While many embedded systems do not need
  more than 1MB of memory, the 32-bit addressing capability ensures future
  upgrade and expansion possibilities. The Cortex-M3 and Cortex-M4 processors
  provide 32-bit buses using a generic bus protocol called AHB Lite. The bus
  allows connections to 32/16/8-bit memory devices with suitable memory
  interface controllers.
• Architecturally defined memory map – The 4GB memory space is divided into a
  number of regions for various predefined memory and peripheral uses. This allows the processor design to be optimized for performance. For example, the
  Cortex-M3 and Cortex-M4 processors have multiple bus interfaces to allow
  simultaneous access from the CODE region for program code and data operations to SRAM or peripheral regions.
• Support for little endian and big endian memory systems – The Cortex-M3 and
  Cortex-M4 processors can work with either little endian or big endian memory
  systems. In practice, a microcontroller product is normally designed with just
  one endian configuration.
• Bit band accesses (optional) – When the bit-band feature is included (determined by microcontroller/System-on-Chip vendors), two 1MB regions in the
  memory map are bit addressable via two bit-band alias regions. This allows atomic
  access to individual bits in SRAM or peripheral address space.
• Write buffer – When a write transfer to a bufferable memory region will take
  multiple cycles, the transfer can be buffered by the internal write buffer in the
  Cortex-M3 or Cortex-M4 processor so that the processor can continue to execute
  the next instruction, if possible. This allows higher program execution speed.
• Memory Protection Unit (optional) – The MPU is a programmable unit which
  defines access permissions for various memory regions. The MPU in the Cortex-M3 and Cortex-M4 processors supports eight programmable regions, and can be
  used with an embedded OS to provide a robust system.
• Unaligned transfer support – All processors supporting ARMv7-M architecture
  (including Cortex-M3 and Cortex-M4 processors) support unaligned data
  transfers.
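As an illustration of the bit-band feature, each bit in a 1MB bit-band region is mapped to one word in the corresponding alias region. The sketch below computes the alias address on a host. The base addresses and the mapping formula are the standard ARMv7-M values and are not given in this section, so treat them as assumptions to confirm against the device documentation; the function name is hypothetical.

```c
#include <stdint.h>

/* Standard ARMv7-M bit-band bases (assumed; not stated in this section). */
#define SRAM_BITBAND_BASE    0x20000000u
#define SRAM_ALIAS_BASE      0x22000000u
#define PERIPH_BITBAND_BASE  0x40000000u
#define PERIPH_ALIAS_BASE    0x42000000u

/* Return the alias word address for bit `bit` at address `addr`,
 * or 0 when `addr` is outside both 1MB bit-band regions. */
uint32_t bitband_alias(uint32_t addr, unsigned bit)
{
    uint32_t region = addr & 0xFFF00000u;   /* 1MB-aligned region base */
    uint32_t offset = addr & 0x000FFFFFu;   /* byte offset in region   */
    uint32_t alias_base;

    if (bit > 31u) {
        return 0u;
    }
    if (region == SRAM_BITBAND_BASE) {
        alias_base = SRAM_ALIAS_BASE;
    } else if (region == PERIPH_BITBAND_BASE) {
        alias_base = PERIPH_ALIAS_BASE;
    } else {
        return 0u;
    }
    /* Each byte of the region expands to 8 alias words (32 bytes). */
    return alias_base + offset * 32u + bit * 4u;
}
```

Writing 1 or 0 to the returned alias address sets or clears just that one bit, which is what makes the access atomic from the software point of view.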

The bus interfaces on the Cortex-M processors are generic bus interfaces, and
can be connected to different types and sizes of memory via different memory controllers. The memory systems in microcontrollers often contain two or more types of
memories: flash memory for program code, static RAM (SRAM) for data, and in
some cases Electrically Erasable Programmable Read-Only Memory (EEPROM). In most cases, these
memories are on-chip and the actual memory interface details are transparent to software developers. Hence, software developers only need to know the address and size
of the program memory and SRAM.

4.4.2 Memory map
The 4GB address space of the Cortex®-M processors is partitioned into a number of
memory regions (Figure 4.18). The partitioning is based on typical usages so that
different areas are designed to be used primarily for:
• Program code accesses (e.g., CODE region)
• Data accesses (e.g., SRAM region)
• Peripherals (e.g., Peripheral region)
• Processor’s internal control and debug components (e.g., Private Peripheral
  Bus)

The architecture also allows high flexibility to allow memory regions to be used
for other purposes. For example, programs can be executed from the CODE as well
as the SRAM region, and a microcontroller can also integrate SRAM blocks in
CODE region.
In practice, many microcontroller devices only use a small portion of each region
for program flash, SRAM, and peripherals. Some of the regions can be unused.
Different microcontrollers have different memory sizes and peripheral address locations. This information is usually outlined in user manuals or datasheets from microcontroller vendors.

FIGURE 4.18
Memory map

Region            Address range            Size    Typical use
System            0xE0000000-0xFFFFFFFF    0.5GB   Private peripherals including built-in interrupt
                                                   controller (NVIC) and debug components
External Device   0xA0000000-0xDFFFFFFF    1GB     Mainly used for external peripherals
External RAM      0x60000000-0x9FFFFFFF    1GB     Mainly used for external memory
Peripherals       0x40000000-0x5FFFFFFF    0.5GB   Mainly used for peripherals
SRAM              0x20000000-0x3FFFFFFF    0.5GB   Mainly used for data memory (e.g., static RAM)
CODE              0x00000000-0x1FFFFFFF    0.5GB   Mainly used for program code. Also used for
                                                   exception vector table

Within the System region:
Private Peripheral Bus (PPB)    0xE0000000-0xE00FFFFF
System Control Space (SCS)      0xE000E000-0xE000EFFF

The memory map arrangement is consistent between all of the Cortex-M processors. For example, the PPB address space hosts the registers for the Nested Vectored
Interrupt Controller (NVIC), the processor’s configuration registers, as well as registers
for debug components. This is the same across all Cortex-M devices. This makes it
easier to port software from one Cortex-M device to another, and allows better
software reusability. It also makes it easier for tool vendors, as the debug control
for the Cortex-M3 and Cortex-M4 devices works in the same way.
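Because the region boundaries of the memory map are fixed by the architecture, an address can be classified from its top three bits alone. The following host-runnable sketch uses the boundaries from Figure 4.18; the enum and function names are hypothetical.

```c
#include <stdint.h>

enum mem_region {
    REGION_CODE,
    REGION_SRAM,
    REGION_PERIPHERAL,
    REGION_EXT_RAM,
    REGION_EXT_DEVICE,
    REGION_SYSTEM
};

/* Classify an address using its top three bits (eight 0.5GB blocks). */
enum mem_region classify_address(uint32_t addr)
{
    switch (addr >> 29) {
    case 0:  return REGION_CODE;         /* 0x00000000-0x1FFFFFFF */
    case 1:  return REGION_SRAM;         /* 0x20000000-0x3FFFFFFF */
    case 2:  return REGION_PERIPHERAL;   /* 0x40000000-0x5FFFFFFF */
    case 3:
    case 4:  return REGION_EXT_RAM;      /* 0x60000000-0x9FFFFFFF */
    case 5:
    case 6:  return REGION_EXT_DEVICE;   /* 0xA0000000-0xDFFFFFFF */
    default: return REGION_SYSTEM;       /* 0xE0000000-0xFFFFFFFF */
    }
}

/* True when the address lies in the Private Peripheral Bus range. */
int is_ppb(uint32_t addr)
{
    return addr >= 0xE0000000u && addr <= 0xE00FFFFFu;
}
```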

4.4.3 Stack memory
As in almost all processor architectures, the Cortex®-M processors need stack memory to operate and have stack pointers (R13). The stack is a memory usage mechanism that allows a portion of memory to be used as a Last-In-First-Out data storage
buffer. ARM® processors use the main system memory for stack memory operations, and have the PUSH instruction to store data in the stack and the POP instruction
to retrieve data from the stack. The currently selected stack pointer is automatically
adjusted for each PUSH and POP operation.
The stack can be used for:
• Temporary storage of original data when a function being executed needs to use
  registers (in the register bank) for data processing. The values can be restored at
  the end of the function so the program that called the function will not lose its
  data.
• Passing of information to functions or subroutines.
• Storing local variables.
• Holding processor status and register values in the case of exceptions such as an
  interrupt.

The Cortex-M processors use a stack memory model called “full-descending
stack.” When the processor is started, the SP is set to the end of the memory space
reserved for stack memory. For each PUSH operation, the processor first decrements
the SP, then stores the value in the memory location pointed to by the SP. During operations, the SP points to the memory location where the last data was pushed to the
stack (Figure 4.19).
In a POP operation, the value of the memory location pointed to by the SP is read, and
then the value of the SP is incremented automatically.
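The full-descending behavior can be modeled on a host with an array and a pointer. This is a minimal sketch with hypothetical names and no bounds checking: the pointer starts just past the end of the reserved space, is decremented before each store, and is incremented after each load.

```c
#include <stdint.h>

#define STACK_WORDS 8u

uint32_t stack_mem[STACK_WORDS];            /* memory reserved for the stack  */
uint32_t *sp = stack_mem + STACK_WORDS;     /* initial SP: end of that memory */

/* PUSH: decrement SP first, then store at the new SP. */
void push_model(uint32_t value)
{
    sp--;
    *sp = value;
}

/* POP: read the value at SP, then increment SP. */
uint32_t pop_model(void)
{
    uint32_t value = *sp;
    sp++;
    return value;
}
```

After two pushes, the pointer sits two words below its starting position and points at the most recently pushed value, which is the "full" part of the name; popping returns the values in reverse order.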
The most common uses for PUSH and POP instructions are to save contents
of register banks when a function/subroutine call is made. At the beginning of
the function call, the contents of some of the registers can be saved to the stack
using the PUSH instruction, and then restored to their original values at the end
of the function using the POP instruction. For example, in Figure 4.20 a simple
function/subroutine named function1 is called from the main program. Since function1 needs to use and modify R4, R5, and R6 for data processing, and these registers hold values that the main program needs later, they are saved to the stack
using PUSH, and restored using POP at the end of function1. In this way, the


FIGURE 4.19
Stack PUSH and POP

FIGURE 4.20
Simple PUSH and POP usage in functions: one register in each stack operation

program code that called the function will not lose any data and can continue to
execute. Note that for each PUSH (store to memory) operation, there must be
a corresponding POP (read from memory), and the address of the POP should match
that of the PUSH operation.
Each PUSH and POP instruction can transfer multiple data to/from the stack
memory. This is shown in Figure 4.21. Since the registers in the register bank are
32 bits, each memory transfer generated by stack PUSH and stack POP transfers
at least 1 word (4 bytes) of data, and the addresses are always aligned to 4-byte
boundaries. The lowest two bits of the SP are always zero.

FIGURE 4.21
Simple PUSH and POP usage in functions: multiple register stack operations

FIGURE 4.22
Combining stack POP and return

You can also combine the return with a POP operation. This is done by first pushing the value of LR (R14) to the stack memory, and popping it back to PC (R15) at
the end of the subroutine/function, as shown in Figure 4.22.
Physically there are two stack pointers in the Cortex-M processors. They are the:
• Main Stack Pointer (MSP): This is the default stack pointer used after reset, and is used for all exception handlers.
• Process Stack Pointer (PSP): This is an alternate stack pointer that can only be used in Thread mode. It is usually used for application tasks in embedded systems running an embedded OS.

As mentioned previously (Table 4.3 and Figure 4.10), the selection between MSP
and PSP can be controlled by the value of SPSEL in bit 1 of the CONTROL register.
If this bit is 0, Thread mode uses MSP for the stack operation. Otherwise, Thread

FIGURE 4.23
SPSEL = 0. Both Thread Level and Handler use the Main Stack Pointer
(Sequence over time: main program in Thread mode using MSP; interrupt event; Stacking; Interrupt Service Routine in Handler mode using MSP; interrupt exit; Unstacking; main program in Thread mode using MSP.)

mode uses the PSP. In addition, during exception return from Handler mode to
Thread mode, the selection can be controlled by the EXC_RETURN
(exception return) value. In that case the value of SPSEL will be updated by the processor hardware accordingly.
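The selection rule can be summarized as a small decision function. The following is a host-side sketch rather than code for the target: the names sp_select_t, active_sp, and CONTROL_SPSEL_MASK are hypothetical, and the CONTROL value is passed in as a plain integer instead of being read from the processor.

```c
#include <stdbool.h>
#include <stdint.h>

#define CONTROL_SPSEL_MASK (1u << 1)  /* SPSEL is bit 1 of the CONTROL register */

typedef enum { SP_MAIN, SP_PROCESS } sp_select_t;

/* Returns which stack pointer SP/R13 refers to in the given state. */
static sp_select_t active_sp(bool handler_mode, uint32_t control)
{
    if (handler_mode) {
        return SP_MAIN;               /* exception handlers always use the MSP */
    }
    /* Thread mode: SPSEL = 0 selects the MSP, SPSEL = 1 selects the PSP. */
    return (control & CONTROL_SPSEL_MASK) ? SP_PROCESS : SP_MAIN;
}
```

For example, active_sp(false, 0x2) yields SP_PROCESS, while active_sp(true, 0x2) still yields SP_MAIN because Handler mode cannot use the PSP.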
In simple applications without an OS, both Thread mode and Handler mode can
use MSP only. This is shown in Figure 4.23: After an interrupt event is triggered, the
processor first pushes a number of registers into the stack before entering the Interrupt Service Routine (ISR). This register state saving operation is called “Stacking,”
and at the end of the ISR, these registers are restored to the register bank and this
operation is called “Unstacking.”
When embedded systems use an embedded OS, they often use separate memory
areas for application stack and the kernel stack. As a result, the PSP is used and
switching of SP selection takes place in exception entry and exception exit. This
is shown in Figure 4.24. Note that the automatic “Stacking” and “Unstacking” stages
use PSP. The separating stack arrangement can prevent a stack corruption or error in

FIGURE 4.24
SPSEL = 1. Thread Level uses the Process Stack and Handler uses the Main Stack
(Sequence over time: main program in Thread mode using PSP; interrupt event; Stacking; Interrupt Service Routine in Handler mode using MSP; interrupt exit; Unstacking; main program in Thread mode using PSP.)


an application task from damaging the stack used by the OS. It also simplifies the OS
design and hence allows faster context switching.
Although only one of the SPs is visible at a time (when using SP or R13 to access
it), it is possible to read/write directly to the MSP and PSP, without any confusion
over which SP/R13 you are referring to. Provided that you are in privileged level,
you can access MSP and PSP using the following CMSIS functions:
x = __get_MSP(); // Read the value of MSP
__set_MSP(x); // Set the value of MSP
x = __get_PSP(); // Read the value of PSP
__set_PSP(x); // Set the value of PSP

In general it is not recommended to change the value of the currently selected SP in
a C function, as part of the stack memory could be used for storing local variables or
other data. To access MSP and PSP in assembly code, you can use the MSR and
MRS instructions:
MRS R0, MSP ; Read Main Stack Pointer to R0
MSR MSP, R0 ; Write R0 to Main Stack Pointer
MRS R0, PSP ; Read Process Stack Pointer to R0
MSR PSP, R0 ; Write R0 to Process Stack Pointer

Most application code does not need to access MSP and PSP explicitly. Access to
MSP and PSP is often required for embedded OSs. For example, by reading the PSP
value using an MRS instruction, the OS can read data pushed to the stack from API
calls in application tasks (such as register contents before execution of an SVC
instruction). Also, the value of PSP is updated by context switching code in the
OS during context switching.
After power up, the processor hardware automatically initializes the MSP by
reading the vector table. More information on vector tables will be covered in section 4.5.3. The PSP is not initialized automatically and must be initialized by the
software before being used.

4.4.4 Memory protection unit (MPU)
The MPU is optional in the Cortex-M3 and Cortex®-M4 processors. Therefore not
all Cortex-M3 or Cortex-M4 microcontrollers have the MPU feature. In the majority
of applications, the MPU is not used and can be ignored. In embedded systems that
require high reliability, the MPU can be used to protect memory regions by means of
defining access permissions in privileged and unprivileged access states.
The MPU is programmable, and the MPU design in the Cortex-M3 and Cortex-M4
processors supports eight programmable regions. The MPU can be used in different
ways. In some cases the MPU is controlled by an embedded OS, and memory permissions are configured for each task. In other cases the MPU is configured just to protect
a certain memory region; for example, to make a memory range read only.
More information about the MPU is covered in Chapter 11.


4.5 Exceptions and interrupts
4.5.1 What are exceptions?

Exceptions are events that cause changes to program flow. When one happens, the processor suspends the current executing task and executes a part of the program called
the exception handler. After the execution of the exception handler is completed, the
processor then resumes normal program execution. In the ARM® architecture, interrupts are one type of exception. Interrupts are usually generated from peripheral or
external inputs, and in some cases they can be triggered by software. The exception
handlers for interrupts are also referred to as Interrupt Service Routines (ISR).
In Cortex®-M processors, there are a number of exception sources (Figure 4.25).
Exceptions are processed by the NVIC. The NVIC can handle a number of Interrupt Requests (IRQs) and a Non-Maskable Interrupt (NMI) request. Usually IRQs are
generated by on-chip peripherals or from external interrupt inputs through I/O ports.
The NMI could be used by a watchdog timer or brownout detector (a voltage monitoring unit that warns the processor when the supply voltage drops below a certain
level). Inside the processor there is also a timer called SysTick, which can generate
a periodic timer interrupt request, which can be used by embedded OSs for timekeeping, or for simple timing control in applications that don’t require an OS.
The processor itself is also a source of exception events. These could be fault
events that indicate system error conditions, or exceptions generated by software
to support embedded OS operations. The exception types are listed in Table 4.9.
Each exception source has an exception number. Exception numbers 1 to 15 are
classified as system exceptions, and exceptions 16 and above are for interrupts. The
design of the NVIC in the Cortex-M3 and Cortex-M4 processors can support up to
240 interrupt inputs. However, in practice the number of interrupt inputs implemented in the design is far less, typically in the range of 16 to 100. In this way
the silicon size of the design can be reduced, which also reduces power
consumption.
The exception number is reflected in various registers, including the IPSR, and it
is used to determine the exception vector addresses. Exception vectors are stored in a
vector table, and the processor reads this table to determine the starting address of an

FIGURE 4.25
Various exception sources

Table 4.9 Exception Types

Exception   CMSIS Interrupt   Exception       Priority       Function
Number      Number            Type
1           NA                Reset           -3 (Highest)   Reset
2           -14               NMI             -2             Non-Maskable interrupt
3           -13               HardFault       -1             All classes of fault, when the corresponding fault handler cannot be activated because it is currently disabled or masked by exception masking
4           -12               MemManage       Settable       Memory Management fault; caused by MPU violation or invalid accesses (such as an instruction fetch from a non-executable region)
5           -11               BusFault        Settable       Error response received from the bus system; caused by an instruction prefetch abort or data access error
6           -10               Usage fault     Settable       Usage fault; typical causes are invalid instructions or invalid state transition attempts (such as trying to switch to ARM state in the Cortex-M3)
7-10        NA                NA              NA             Reserved
11          -5                SVC             Settable       Supervisor Call via SVC instruction
12          -4                Debug monitor   Settable       Debug monitor, for software based debug (often not used)
13          NA                NA              NA             Reserved
14          -2                PendSV          Settable       Pendable request for System Service
15          -1                SYSTICK         Settable       System Tick Timer
16-255      0-239             IRQ             Settable       IRQ input #0-239


exception handler during the exception entrance sequence. Note that the exception
number definitions are different from interrupt numbers in the CMSIS device-driver
library. In the CMSIS device-driver library, interrupt numbers start from 0, and system exception numbers have negative values.
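The relationship between the two numbering schemes is a fixed offset of 16. The sketch below shows the conversion in both directions; the function names and the FIRST_IRQ_EXCEPTION constant are hypothetical and are not part of the CMSIS library. Exception 1 (Reset) has no CMSIS interrupt number in Table 4.9, so the sketch rejects it.

```c
#include <assert.h>
#include <stdint.h>

/* Exception numbers 1 to 15 are system exceptions; 16 and above are IRQs. */
#define FIRST_IRQ_EXCEPTION 16

/* Exception number (2..255) to CMSIS-style interrupt number (-14..239). */
static int32_t exception_to_cmsis_irqn(uint32_t exception_number)
{
    assert(exception_number >= 2u && exception_number <= 255u);
    return (int32_t)exception_number - FIRST_IRQ_EXCEPTION;
}

/* CMSIS-style interrupt number (-14..239) back to exception number. */
static uint32_t cmsis_irqn_to_exception(int32_t irqn)
{
    assert(irqn >= -14 && irqn <= 239);
    return (uint32_t)(irqn + FIRST_IRQ_EXCEPTION);
}
```

With this mapping SysTick (exception 15) becomes -1, PendSV (exception 14) becomes -2, and the first external interrupt (exception 16) becomes 0.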
As opposed to classic ARM processors such as the ARM7TDMI™, there is no
FIQ (Fast Interrupt) in the Cortex-M processor. However, the interrupt latency of
the Cortex-M3 and Cortex-M4 is very low, only 12 clock cycles, so this does not
cause problems.
Reset is a special kind of exception. When the processor exits from a reset, it
executes the reset handler in Thread mode (rather than Handler mode as in other
exceptions). Also the exception number in IPSR is read as zero.

4.5.2 Nested vectored interrupt controller (NVIC)
The NVIC is a part of the Cortex®-M processor. It is programmable and its registers
are located in the System Control Space (SCS) of the memory map (see Figure 4.18).
The NVIC handles the exceptions and interrupt configurations, prioritization, and
interrupt masking. The NVIC has the following features:
• Flexible exception and interrupt management
• Nested exception/interrupt support
• Vectored exception/interrupt entry
• Interrupt masking

Flexible exception and interrupt management
Each interrupt (apart from the NMI) can be enabled or disabled and can have its
pending status set or cleared by software. The NVIC can handle various types of
interrupt sources:
• Pulsed interrupt request: the interrupt request is at least one clock cycle long. When the NVIC receives a pulse at its interrupt input, the pending status is set and held until the interrupt gets serviced.
• Level triggered interrupt request: the interrupt source holds the request high until the interrupt is serviced.

The signal level at the NVIC input is active high. However, the actual external
interrupt input on the microcontroller could be designed differently and is converted
to an active high signal level by on-chip logic.

Nested exception/interrupt support
Each exception has a priority level. Some exceptions, such as interrupts, have programmable priority levels and some others (e.g., NMI) have a fixed priority level.
When an exception occurs, the NVIC will compare the priority level of this exception to the current level. If the new exception has a higher priority, the current
running task will be suspended. Some of the registers will be stored on the stack

memory, and the processor will start executing the exception handler of the new
exception. This process is called “preemption.” When the higher priority exception
handler is complete, it is terminated with an exception return operation and the processor automatically restores the registers from stack and resumes the task that was
running previously. This mechanism allows nesting of exception services without
any software overhead.
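The preemption rule can be illustrated with a small host-side model. It assumes the usual Cortex-M convention that a lower priority value means a higher priority (consistent with Reset, NMI, and HardFault having the lowest values in Table 4.9). All names here, including nest_state_t, exception_request, and PRIORITY_THREAD, are hypothetical and only serve the illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_NESTING     8    /* hypothetical nesting limit of the model */
#define PRIORITY_THREAD 256  /* hypothetical: weaker than any exception priority */

typedef struct {
    int32_t active[MAX_NESTING]; /* priorities of the nested active handlers */
    int     depth;               /* 0 means Thread mode, no active exception */
} nest_state_t;

/* Priority level of whatever is currently executing. */
static int32_t current_priority(const nest_state_t *n)
{
    return (n->depth > 0) ? n->active[n->depth - 1] : PRIORITY_THREAD;
}

/* Returns true if the new exception preempts; false means it stays pending. */
static bool exception_request(nest_state_t *n, int32_t priority)
{
    if (n->depth < MAX_NESTING && priority < current_priority(n)) {
        n->active[n->depth++] = priority;  /* models stacking and handler entry */
        return true;
    }
    return false;
}

/* Exception return: resume whatever was running before. */
static void exception_return(nest_state_t *n)
{
    if (n->depth > 0) {
        n->depth--;
    }
}
```

In this model an exception of the same priority as the running handler does not preempt it; only a strictly lower priority value does.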

Vectored exception/interrupt entry
When an exception occurs, the processor will need to locate the starting point of the
corresponding exception handler. Traditionally, in ARM® processors such as the
ARM7TDMI™, software handles this step. The Cortex-M processors automatically
locate the starting point of the exception handler from a vector table in the memory.
As a result, the delays from the start of the exception to the execution of the exception handlers are reduced.

Interrupt masking
The NVIC in the Cortex-M3 and Cortex-M4 processors provides several interrupt
masking registers such as the PRIMASK special register. Using the PRIMASK register you can disable all exceptions, excluding HardFault and NMI. This masking is
useful for operations that should not be interrupted, like time critical control tasks or
real-time multimedia codecs. Alternatively you can also use the BASEPRI register
to selectively mask exceptions or interrupts which are below a certain priority level.
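The effect of the two masking registers can be sketched as a predicate. This is a simplified host-side model, not target code: it assumes that a lower priority value means a higher priority, that a BASEPRI value of 0 disables BASEPRI masking, and that BASEPRI blocks exceptions whose priority value is equal to or greater than the BASEPRI value. It also ignores how priority bits are packed into the real registers. The function name is_blocked is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns true if an exception with the given priority level is masked. */
static bool is_blocked(int32_t priority, bool primask, uint32_t basepri)
{
    if (priority < 0) {
        return false;  /* fixed negative levels (e.g., NMI, HardFault) are not masked */
    }
    if (primask) {
        return true;   /* PRIMASK set: every configurable-priority exception is blocked */
    }
    if (basepri != 0u && (uint32_t)priority >= basepri) {
        return true;   /* BASEPRI: blocks the same or a weaker priority level */
    }
    return false;
}
```

For instance, with BASEPRI set to 4 an interrupt at priority level 3 is still accepted, while interrupts at levels 4 and 5 are held pending.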
The CMSIS-Core provides a set of functions to make it easy to access various
interrupt control functions. The flexibility and capability of the NVIC also make
the Cortex-M processors very easy to use, and provide a better system response
by reducing the software overhead in interrupt processing, which also leads to
smaller code size.

4.5.3 Vector table
When an exception event takes place and is accepted by the processor core, the corresponding exception handler is executed. To determine the starting address of the
exception handler, a vector table mechanism is used. The vector table is an array
of word data inside the system memory, each representing the starting address of
one exception type (Figure 4.26). The vector table is relocatable and the relocation
is controlled by a programmable register in the NVIC called the Vector Table Offset
Register (VTOR). After reset, the VTOR is reset to 0; therefore, the vector table is
located at address 0x0 after reset.
For example, since the reset is exception type 1, the address of the reset vector is 1
times 4 (each word is 4 bytes), which equals 0x00000004, and the NMI vector (type
2) is located at 2 x 4 = 0x00000008. The address 0x00000000 is used to store the
starting value of the MSP.
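The address calculation above generalizes to any exception number and any vector table offset. A minimal sketch, with a hypothetical function name and the VTOR value passed in as a plain integer:

```c
#include <assert.h>
#include <stdint.h>

/* Address of the vector for a given exception number.
 * Exception number 0 is not an exception; that slot holds the initial MSP value. */
static uint32_t vector_address(uint32_t vtor, uint32_t exception_number)
{
    assert(exception_number <= 255u);
    return vtor + 4u * exception_number;  /* each vector is one 4-byte word */
}
```

With the vector table at address 0x0, this gives 0x04 for Reset, 0x08 for the NMI, 0x3C for SysTick, and 0x40 for IRQ #0.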
The LSB of each exception vector indicates whether the exception is to be
executed in the Thumb state. Since the Cortex®-M processors can support only
Thumb instructions, the LSB of all the exception vectors should be set to 1.

Exception   CMSIS Interrupt   Address Offset   Vector                 LSB
Type        Number
18 - 255    2 - 239           0x48 - 0x3FF     IRQ #2 - #239          1
17          1                 0x44             IRQ #1                 1
16          0                 0x40             IRQ #0                 1
15          -1                0x3C             SysTick                1
14          -2                0x38             PendSV                 1
NA          NA                0x34             Reserved
12          -4                0x30             Debug Monitor          1
11          -5                0x2C             SVC                    1
NA          NA                0x28             Reserved
NA          NA                0x24             Reserved
NA          NA                0x20             Reserved
NA          NA                0x1C             Reserved
6           -10               0x18             Usage fault            1
5           -11               0x14             Bus Fault              1
4           -12               0x10             MemManage Fault        1
3           -13               0x0C             HardFault              1
2           -14               0x08             NMI                    1
1           NA                0x04             Reset                  1
NA          NA                0x00             Initial value of MSP

FIGURE 4.26
Exception types (LSB of exception vectors should be set to 1 to indicate Thumb state)

4.5.4 Fault handling
Several types of exceptions in the Cortex-M3 and Cortex®-M4 processors are fault
handling exceptions. Fault exceptions are triggered when the processor detects an
error such as the execution of an undefined instruction, or when the bus system
returns an error response to a memory access. The fault exception mechanism allows
errors to be detected quickly, and potentially allows the software to carry out remedial actions (Figure 4.27).
By default the Bus Fault, Usage Fault, and Memory Management Fault are
disabled and all fault events trigger the HardFault exception. However, the configurations are programmable and you can enable the three programmable fault exceptions individually to handle different types of faults. The HardFault exception is
always enabled.
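The default routing of fault events can be sketched as follows. This host-side model only covers the enable/disable aspect described above; it does not model escalation caused by priority levels or exception masking. The types fault_source_t, fault_config_t, and handler_t and the function route_fault are hypothetical; the handler values are the exception numbers from Table 4.9.

```c
#include <stdbool.h>

typedef enum { FAULT_MEMMANAGE, FAULT_BUS, FAULT_USAGE } fault_source_t;

/* Exception numbers of the fault handlers, as listed in Table 4.9. */
typedef enum {
    HANDLER_HARDFAULT  = 3,
    HANDLER_MEMMANAGE  = 4,
    HANDLER_BUSFAULT   = 5,
    HANDLER_USAGEFAULT = 6
} handler_t;

typedef struct {
    bool memmanage_enabled;  /* all three are disabled by default */
    bool busfault_enabled;
    bool usagefault_enabled;
} fault_config_t;

/* Picks the handler for a fault event; disabled faults go to HardFault. */
static handler_t route_fault(fault_source_t source, const fault_config_t *cfg)
{
    switch (source) {
    case FAULT_MEMMANAGE:
        return cfg->memmanage_enabled ? HANDLER_MEMMANAGE : HANDLER_HARDFAULT;
    case FAULT_BUS:
        return cfg->busfault_enabled ? HANDLER_BUSFAULT : HANDLER_HARDFAULT;
    case FAULT_USAGE:
        return cfg->usagefault_enabled ? HANDLER_USAGEFAULT : HANDLER_HARDFAULT;
    }
    return HANDLER_HARDFAULT;  /* HardFault is always enabled */
}
```

With the reset-default configuration (all three flags false) every fault source is routed to the HardFault handler.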
Fault exceptions can also be useful for debugging software issues. For example,
the fault handler can automatically collect information and report to the user or other
systems that an error has occurred and provide debug information. A number of fault

FIGURE 4.27
Fault exceptions usages

status registers are available in the Cortex-M3 and Cortex-M4 processors, which
provide hints about the error sources. Software developers can also examine these
fault status registers using the debugger during software development.

4.6 System control block (SCB)
One part of the processor that is merged into the NVIC unit is the SCB. The SCB
contains various registers for:
• Controlling processor configurations (e.g., low power modes)
• Providing fault status information (fault status registers)
• Vector table relocation (VTOR)

The SCB is memory-mapped. Similar to the NVIC registers, the SCB registers
are accessible from the System Control Space (SCS). More information about
SCB registers is covered in Chapters 7 and 9.

4.7 Debug
As software gets more complex, debug features are becoming more and more
important in modern processor architectures. Although their designs are compact,
the Cortex®-M3 and Cortex-M4 processors include comprehensive debugging features such as program execution controls, including halting and stepping, instruction breakpoints, data watchpoints, registers and memory accesses, profiling, and
traces.
There are two types of interfaces provided in the Cortex-M processors: debug
and trace.
The debug interface allows a debug adaptor to connect to a Cortex-M microcontroller to control the debug features and access the memory space on the chip. The


Cortex-M processor supports the traditional JTAG protocol, which uses either 4 or 5
pins, or a newer 2-pin protocol called Serial Wire Debug (SWD). The SWD protocol
was developed by ARM®, and can handle the same debug features as in JTAG in just
two pins, without any loss of debug performance. Many commercially available
debug adaptors, such as the ULINK 2 or ULINK Pro products from Keil™, support
both protocols. The two protocols can use the same connector, with JTAG TCK
shared with the Serial Wire clock, and JTAG TMS shared with the Serial Wire
Data, which is bidirectional (Figure 4.28). Both protocols are widely supported by
different debug adaptors from different companies.
The trace interface is used to collect information from the processor during runtime such as data, event, profiling information, or even complete details of program
execution. Two types of trace interface are supported: a single pin protocol called
Serial Wire Viewer (SWV) and a multi-pin protocol called Trace Port (Figure 4.29).
SWV is a low-cost solution that has a lower trace data bandwidth limit. However,
the bandwidth is still large enough to handle capturing of selective data trace, event
trace, and basic profiling. The output signal, which is called Serial Wire Output
(SWO), can be shared with the JTAG TDO pin so that you only need one standard
JTAG/SWD connector for both debug and trace. (Obviously, the trace data can only
be captured when the two-pin SWD protocol is used for debugging.)
The Trace Port mode requires one clock pin and several data pins. The number of
data pins used is configurable, and in most cases the Cortex-M3 or Cortex-M4
microcontrollers support a maximum of four data pins (a total of five pins including
the clock). The Trace Port mode supports a much higher trace data bandwidth than
SWV. You can also use Trace Port mode with fewer pins if needed; for example,
when some of the Trace Data pins are multiplexed with I/O functions and you
need to use some of these I/O pins for your application.
The high trace data bandwidth of the Trace Port mode allows real-time
recording of program execution information, in addition to the other trace information you can collect using SWV. The real-time program trace requires a companion
component called Embedded Trace Macrocell (ETM) in the chip. This is an optional
component for the Cortex-M3 and Cortex-M4 processors. Some of the Cortex-M3
and Cortex-M4 microcontrollers do not have ETM and therefore do not provide program/instruction trace.
To capture trace data, you can use a low-cost debug adaptor such as Keil ULINK-2
or Segger J-Link, which can capture data through the SWV interface. Or you can use
advanced products such as Keil ULINK Pro or Segger J-Trace to capture trace data
in trace port mode.
There are a number of other debug components inside the Cortex-M3 and
Cortex-M4 processors. For example, the Instrumentation Trace Macrocell (ITM)
allows program code running on the microcontroller to generate data to be output
through the trace interface. The data can then be displayed on a debugger window.
More information about various debug features is covered in Chapter 14. Appendix H
also provides information about standard debug connectors used by various debug
adaptors.

FIGURE 4.28
Debug connection
(JTAG connection and the corresponding Serial-Wire connection on the same connector: nTRST, not used; TCK, Serial-Wire clock; TDI, not used; TMS, Serial-Wire data; TDO, not used.)

FIGURE 4.29
Trace connection (SWO or Trace Port mode)
(Trace using Serial Wire Viewer: Serial-Wire clock on TCK, Serial-Wire data on TMS, SWO on TDO; nTRST and TDI not used. Trace using Trace Port mode: TRACECLK and TRACEDATA[0] to TRACEDATA[3].)

4.8 Reset and reset sequence

In typical Cortex®-M microcontrollers, there can be three types of reset:
• Power on reset: reset everything in the microcontroller. This includes the processor and its debug support component and peripherals.
• System reset: reset just the processor and peripherals, but not the debug support component of the processor.
• Processor reset: reset the processor only.

During system debug or processor reset operations, the debug components in the
Cortex-M3 or Cortex-M4 processors are not reset so that the connection between the
debug host (e.g., debugger software running on a computer) and the microcontroller
can be maintained. The debug host can generate a system reset or processor reset via
a register in the System Control Block (SCB). This is covered in section 7.9.4.
The duration of Power on reset and System reset depends on the microcontroller
design. In some cases the reset lasts a number of milliseconds as the reset controller
needs to wait for a clock source such as a crystal oscillator to stabilize.
After reset and before the processor starts executing the program, the Cortex-M
processors read the first two words from the memory (Figure 4.30). The beginning of
the memory space contains the vector table, and the first two words in the vector table are the initial value for the Main Stack Pointer (MSP), and the reset vector, which
is the starting address of the reset handler (as described in section 4.5.3 and
Figure 4.26). After these two words are read by the processor, the processor then
sets up the MSP and the Program Counter (PC) with these values.
The setup of the MSP is necessary because some exceptions such as the NMI or
HardFault handler could potentially occur shortly after the reset, and the stack memory and hence the MSP will then be needed to push some of the processor status to
the stack before exception handling.
Note that for most C development environments, the C startup code will also update the value of the MSP before entering the main program main(). This two-step
stack initialization allows a microcontroller device with external memory to use the
external memory for the stack. For example, it can boot up with the stack placed in a

FIGURE 4.30
Reset sequence


small internal on-chip SRAM, and initialize an external memory controller while in
the reset handler, and then execute the C startup code, which then sets up the stack
memory to the external memory.
The Stack Pointer initialization behavior is different from classic ARM® processors such as the ARM7TDMI™, where upon reset the processor executes instructions
from address zero, and the stack pointers must be initialized by software. In classic
ARM processors, the vector table holds instruction code rather than address values.
Because the stack operations in the Cortex-M3 or Cortex-M4 processors are
based on full descending stack (SP decrement before store), the initial SP value
should be set to the first memory address after the top of the stack region. For example, if
you have a stack memory range from 0x20007C00 to 0x20007FFF (1Kbytes), the
initial stack value should be set to 0x20008000, as shown in Figure 4.31.
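The calculation in this example can be written out as a short sketch. The function names initial_sp and first_push_address are hypothetical; the alignment check reflects the statement in section 4.4.3 that the lowest two bits of the SP are always zero.

```c
#include <assert.h>
#include <stdint.h>

/* Initial SP for a full-descending stack: the first address after the region. */
static uint32_t initial_sp(uint32_t stack_base, uint32_t stack_size_bytes)
{
    uint32_t top = stack_base + stack_size_bytes;
    assert((top & 0x3u) == 0u);  /* SP must be aligned to a 4-byte boundary */
    return top;
}

/* Address written by the first single-word PUSH: SP is decremented, then stored to. */
static uint32_t first_push_address(uint32_t sp_initial)
{
    return sp_initial - 4u;
}
```

For the 1-Kbyte region starting at 0x20007C00, initial_sp returns 0x20008000 and the first pushed word lands at 0x20007FFC, which is the last word inside the stack region.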
Notice that in the Cortex-M processors, vector addresses in the vector table
should have their LSB set to 1 to indicate that they are Thumb code. For that reason,
the example in Figure 4.31 has 0x101 in the reset vector, whereas the boot code starts
at address 0x100. After the reset vector is fetched, the Cortex-M processor can then

FIGURE 4.31
Initial Stack Pointer value and Initial Program Counter value example
(Flash: address 0x00000000 holds the initial SP value 0x20008000; address 0x00000004 holds the reset vector 0x00000101; the other exception vectors follow; the boot code starts at 0x00000100. SRAM starts at 0x20000000; the stack memory spans 0x20007C00 up to 0x20008000 and grows downwards, with the 1st stacked item at 0x20007FFC and the 2nd stacked item at 0x20007FF8.)


start to execute the program from the reset vector address and begin normal
operations.
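The reset sequence described in this section can be summarized in a host-side sketch that reads the first two words of a vector table image. The names reset_state_t and reset_fetch are hypothetical, and the vector table is modeled as an ordinary array rather than memory at address 0x0.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint32_t msp;  /* initial Main Stack Pointer value */
    uint32_t pc;   /* address of the first instruction of the reset handler */
} reset_state_t;

/* Models the two word reads the processor performs after reset. */
static reset_state_t reset_fetch(const uint32_t *vector_table)
{
    reset_state_t state;
    uint32_t reset_vector = vector_table[1];  /* word at offset 0x04 */

    assert((reset_vector & 1u) == 1u);        /* LSB must be 1 to indicate Thumb state */

    state.msp = vector_table[0];              /* word at offset 0x00 */
    state.pc  = reset_vector & ~(uint32_t)1u; /* execution starts at the address without the LSB */
    return state;
}
```

With the values from Figure 4.31 (0x20008000 and 0x00000101) the model returns an MSP of 0x20008000 and starts execution at 0x00000100, where the boot code is located.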
Various software development tools might have different ways to specify the
starting stack pointer value and reset vector. If you need more information on this
topic, it’s best to look at project examples provided with the development tools.
Some information is provided in section 15.9 (for Keil™ MDK-ARM) and section
16.9 (for IAR toolchain) of this book.

CHAPTER 5
Instruction Set

CHAPTER OUTLINE
5.1 Background to the instruction set in ARM® Cortex®-M processors
5.2 Comparison of the instruction set in ARM® Cortex®-M processors
5.3 Understanding the assembly language syntax
5.4 Use of a suffix in instructions
5.5 Unified assembly language (UAL)
5.6 Instruction set
    5.6.1 Moving data within the processor
    5.6.2 Memory access instructions
        Immediate offset (pre-index)
        PC-related addressing (Literal)
        Register offset (pre-index)
        Post-index
        Multiple load and multiple store
        Stack push and pop
        SP-relative addressing
        Load and store with unprivileged access level
        Exclusive accesses
    5.6.3 Arithmetic operations
    5.6.4 Logic operations
    5.6.5 Shift and rotate instructions
    5.6.6 Data conversion operations (extend and reverse ordering)
    5.6.7 Bit-field processing instructions
    5.6.8 Compare and test
    5.6.9 Program flow control
        Branches
        Function calls
        Conditional branches
        Compare and branches
        Conditional execution (IF-THEN instruction)
        Table branches
    5.6.10 Saturation operations
    5.6.11 Exception-related instructions
    5.6.12 Sleep mode-related instructions
    5.6.13 Memory barrier instructions
    5.6.14 Other instructions
    5.6.15 Unsupported instructions
5.7 Cortex®-M4-specific instructions
    5.7.1 Overview of enhanced DSP extension in Cortex-M4
    5.7.2 SIMD and saturating instructions
    5.7.3 Multiply and MAC instructions
    5.7.4 Packing and unpacking
    5.7.5 Floating point instructions
5.8 Barrel shifter
5.9 Accessing special instructions and special registers in programming
    5.9.1 Overview
    5.9.2 Intrinsic functions
        CMSIS-core intrinsic functions
        Compiler-specific intrinsic functions
    5.9.3 Inline assembler and embedded assembler
    5.9.4 Using other compiler-specific features
    5.9.5 Access to special registers

The Definitive Guide to ARM® Cortex®-M3 and Cortex-M4 Processors. http://dx.doi.org/10.1016/B978-0-12-408082-9.00005-1
Copyright © 2014 Elsevier Inc. All rights reserved.

5.1 Background to the instruction set in ARM® Cortex®-M processors
The design of the instruction set is one of the most important parts of a processor’s
architecture. In ARM’s terminology, it is commonly referred to as the Instruction Set
Architecture (ISA). All the ARM® Cortex®-M processors are based on Thumb®-2
technology, which allows a mixture of 16-bit and 32-bit instructions to be used within
one operating state. This is different from classic ARM processors such as the
ARM7TDMI™. To help understand the differences between the different instruction
sets available in the ARM processors, we include a quick review of the history of the
ARM ISA.
Early ARM processors (prior to the ARM7TDMI processor) supported a 32-bit
instruction set called the ARM instruction set. It evolved for a few years, progressing
from ARM architecture version 1 to version 4. It is a powerful instruction set, which
supports conditional execution of most instructions and provides good performance.
However, it often requires more program memory when compared to 8-bit and 16-bit
architectures. As demand for 32-bit processors started to increase in mobile phone
applications, where power and cost are often both critical, a solution was needed
to reduce the program size.
In 1995, ARM introduced the ARM7TDMI processor, which supports a new operation state that runs a new 16-bit instruction set (Figure 5.1). This 16-bit instruction set
is called “Thumb” (it is a play on words to indicate that it has smaller size than the
ARM instruction set). The ARM7TDMI can operate in the ARM state, the default
state, and also in the Thumb state. During operation, the processor switches between
ARM state and Thumb state under software control. Parts of the application program
are compiled with ARM instructions for higher performance, and the remaining parts
are compiled as Thumb instructions for better code density. By providing this two-state
mechanism, the applications can be squeezed into a smaller program size, while
maintaining high performance when needed. In some cases, the Thumb code provides
a code size reduction of 30% compared to the equivalent ARM code.

FIGURE 5.1
Evolution of the ARM Instruction Set Architecture

The Thumb instruction set provides a subset of the ARM instruction set. In the
ARM7TDMI processor design, a mapping function is used to translate Thumb instructions into ARM instructions for decoding so that only one instruction decoder
is needed. The two states of operation are still supported in newer ARM processors,
such as the Cortex-A processor family and the Cortex-R processor family.
Although the Thumb instruction set can provide most of the same commonly
used functionality as the ARM instructions, it does have some limitations, such as
restrictions on the register choices for operations, available addressing modes, or
a reduced range of immediate values for data or addresses.
In 2003, ARM announced Thumb-2 technology, a method to combine 16-bit and
32-bit instruction sets in one operation state. In Thumb-2, a new superset of the Thumb
instructions was introduced, many of them 32 bits in size, so they can handle most of
the operations previously only possible in the ARM instruction set. However, they have
a different instruction encoding from the ARM instruction set. The first processor
supporting Thumb-2 technology was the ARM1156T2-S processor.
In 2006, ARM released the Cortex-M3 processor, which utilizes Thumb-2
technology and supports just the Thumb operation state. Unlike earlier ARM
processors, it does not support the ARM instruction set. Since then, more Cortex-M
processors have been introduced, implementing different ranges of the Thumb instruction
set for different markets. Since the Cortex-M processors do not support
ARM instructions, they are not backward compatible with classic ARM processors
such as the ARM7TDMI. In other words, you cannot run a binary image for
ARM7TDMI processors on a Cortex-M3 processor. Nevertheless, the Thumb
instruction set in the Cortex-M3 processor (ARMv7-M) is a superset of the Thumb
instructions in ARM7TDMI (ARMv4T), and many ARM instructions can be ported
to equivalent 32-bit Thumb instructions, making application porting fairly easy.
The evolution of the ARM ISA is a continuing process. In 2011, ARM
announced the ARMv8 architecture, which has a new instruction set for 64-bit operations. Currently the support for the ARMv8 architecture is limited to Cortex-A
processors only, and does not cover Cortex-M processors.

5.2 Comparison of the instruction set in ARM® Cortex®-M processors

One of the differences between the Cortex®-M processors is the instruction set features.
In order to reduce the circuit size to a minimum, the Cortex-M0, Cortex-M0+ and the
Cortex-M1 processors support most of the 16-bit Thumb instructions and only a few
32-bit Thumb instructions. The Cortex-M3 processor supports more 32-bit instructions,
and a few more 16-bit instructions. The Cortex-M4 processor supports the remaining
DSP enhancing instructions such as SIMD (Single Instruction Multiple Data), MAC
(Multiply Accumulate), and the optional floating point instructions. The instruction
set support in the current Cortex-M processors is illustrated in Figure 5.2.

FIGURE 5.2
Instruction set of the Cortex-M processors

As you can see in Figure 5.2, the instruction set design of the Cortex-M processors
is upward compatible from Cortex-M0, to Cortex-M3, and then to the
Cortex-M4. Therefore code compiled for the Cortex-M0/M0+/M1 processor can
run on the Cortex-M3 or Cortex-M4 processors, and code compiled for Cortex-M3
can also run on the Cortex-M4 processor.
Another observation which can be made about Figure 5.2 is that most of the instructions in ARMv6-M are 16-bit, and some are available in both 16-bit and 32-bit
format. When an operation can be carried out in 16-bit, the compiler will normally
choose the 16-bit version to give a smaller code size. The 32-bit version might support a greater choice of registers (e.g., high registers), larger immediate data, longer
address range, or a larger choice of addressing modes. However, for the same operation, the 16-bit version and the 32-bit version of an instruction will take the same
amount of time to execute.
As you can see, there are lots of instructions in the Thumb instruction set, and
different Cortex-M processors support different ranges of these instructions. So
what does this mean for embedded software developers? Figure 5.3 gives a simplified view of what it means to users.
FIGURE 5.3
Simplified view of the instruction sets supported by Cortex-M processors

For general data processing and I/O control tasks, the Cortex-M0 and the Cortex-M0+
processors are entirely adequate. For example, the Cortex-M0+ processor can
deliver 2.15 CoreMark/MHz, which is approximately double that of 16-bit
microcontrollers at the same operating frequency. If your application needs to
process more complex data, perform faster divide operations, or have the data
processing done faster, then you might need to upgrade to the Cortex-M3 or
Cortex-M4 processor. If you need to have the best performance in DSP applications
or floating point operations, then the Cortex-M4 is a better choice.
Although there are quite a lot of instructions in the Cortex-M processors, there is
no need to learn them all in detail, as C compilers are good enough to generate efficient code. Also, the free CMSIS-DSP library and various middleware (e.g., software libraries) help software developers to implement high-performance DSP
applications without the need to dig into the details of each instruction.
In the rest of this chapter we will briefly go through the instruction set, which can
be useful for helping you to understand a program when debugging your projects.
Appendix A also provides a summary of each of the instructions.

5.3 Understanding the assembly language syntax
In most situations, application code will be written in C or other high-level languages and therefore it is not necessary for most software developers to know the
details of the instruction set. However, it is still useful to have a general overview
of what instructions are available, and of assembly language syntax; for example,
knowledge in this area can be very useful for debugging. Most of the assembly
examples in this book are written in ARM® assembler (armasm), which is used in
the Keil™ Microcontroller Development Kit for ARM (MDK-ARM). Assembly
tools from different vendors (e.g., the GNU toolchain) have different syntaxes. In
most cases, the mnemonics of the assembly instructions are the same, but assembly
directives, definitions, labeling, and comment syntax can be different.
With ARM assembly (applies to ARM RealView® Compilation Toolchain,
DS-5™, and Keil Microcontroller Development Kit), the following instruction
formatting is used:
label   mnemonic operand1, operand2, ... ; Comments

The “label” is used as a reference to an address location. It is optional; some
instructions might have a label in front of them so that the address of the instruction
can be obtained by using the label. Labels can also be used to reference data addresses. For example, you can put a label for a lookup table inside the program. After
the “label” you can find the “mnemonic,” which is the name of the instruction, followed by a number of operands:
• For data processing instructions written for the ARM assembler, the first operand
  is the destination of the operation.
• For a memory read instruction (except multiple load instructions), the first
  operand is the register which data is loaded into.
• For a memory write instruction (except multiple store instructions), the first
  operand is the register that holds the data to be written to memory.

Instructions that handle multiple loads and stores have a different syntax.

The number of operands for each instruction depends on the instruction type.
Some instructions do not need any operand and some might need just one.
Note that some mnemonics can be used with different types of operands, and this
can result in different instruction encodings. For example, the MOV (move) instruction can be used to transfer data between two registers, or it can be used to put an
immediate constant value into a register.
The syntax for the operands can also differ between instruction types. For example,
immediate data are usually prefixed with “#”:
    MOVS R0, #0x12 ; Set R0 = 0x12 (hexadecimal)
    MOVS R1, #'A'  ; Set R1 = ASCII character A

The text after each semicolon “;” is a comment. Comments do not affect program
operation, but should make programs easier for humans to understand.
In the GNU toolchain, the common assembly syntax is:
label:
    mnemonic operand1, operand2, ... /* Comments */

The opcode and operands are the same as the ARM assembler syntax, but the
syntax for labels and comments is different. For the same instructions as above,
the GNU version is:
    MOVS R0, #0x12 /* Set R0 = 0x12 (hexadecimal) */
    MOVS R1, #'A'  /* Set R1 = ASCII character A */

An alternative way to insert comments in the GNU assembler is to make use of the
inline comment character “@”. For example:
    MOVS R0, #0x12 @ Set R0 = 0x12 (hexadecimal)
    MOVS R1, #'A'  @ Set R1 = ASCII character A

One of the commonly required features in assembly code is the ability to define
constants. By using constant definitions, the program code can be made more readable and this can make code maintenance much easier. In ARM assembly, an
example of defining a constant is:
NVIC_IRQ_SETEN    EQU 0xE000E100
NVIC_IRQ0_ENABLE  EQU 0x1
    ...
    LDR  R0,=NVIC_IRQ_SETEN    ; Put 0xE000E100 into R0
                               ; LDR here is a pseudo instruction that will be converted
                               ; to a PC relative literal data load by the assembler
    MOVS R1, #NVIC_IRQ0_ENABLE ; Put immediate data (0x1) into
                               ; register R1
    STR  R1, [R0]              ; Store 0x1 to 0xE000E100, this enables external
                               ; interrupt IRQ#0

In the code above, the address value of an NVIC register is loaded into register
R0 using the pseudo instruction LDR. The assembler will place the constant value
into a location in the program code, and insert a memory read instruction to read
the value into R0. The use of a pseudo instruction is needed because the value is
too large to be encoded in a single move immediate instruction. When using LDR
pseudo instructions to load a value into a register, the value requires an “=” prefix.
In the normal case of loading immediate data into a register (e.g., with MOV), the
value should be prefixed by “#”.
Similarly, the same code can be written in GNU toolchain assembler syntax:
    .equ NVIC_IRQ_SETEN,   0xE000E100
    .equ NVIC_IRQ0_ENABLE, 0x1
    ...
    LDR  R0,=NVIC_IRQ_SETEN    /* Put 0xE000E100 into R0
                                  LDR here is a pseudo instruction that will be
                                  converted to a PC relative load by the assembler */
    MOVS R1, #NVIC_IRQ0_ENABLE /* Put immediate data (0x1) into
                                  register R1 */
    STR  R1, [R0]              /* Store 0x1 to 0xE000E100, this enables
                                  external interrupt IRQ#0 */

Another typical feature of most assembly tools is allowing data to be inserted inside the program. For example, we can define data in a certain location in the program memory and access it with memory read instructions. In ARM assembler, an
example is:
    LDR R3,=MY_NUMBER  ; Get the memory location of MY_NUMBER
    LDR R4, [R3]       ; Read the value 0x12345678 into R4
    ...
    LDR R0,=HELLO_TEXT ; Get the starting address of HELLO_TEXT
    BL  PrintText      ; Call a function called PrintText to
                       ; display string
    ...
    ALIGN 4
MY_NUMBER  DCD 0x12345678
HELLO_TEXT DCB "Hello\n", 0 ; Null terminated string

In the above example, “DCD” is used to insert a word-sized data item, and
“DCB” is used to insert byte-size data into the program. When inserting word-size
data in a program, we should use the “ALIGN” directive before the data.
The number after the ALIGN directive determines the alignment size. In
this case, the value 4 forces the following data to be aligned to a word boundary.
By ensuring the data placed at MY_NUMBER is word aligned, the
program will be able to access the data with just a single bus transfer, and
the code can be more portable (unaligned accesses are not supported in the
Cortex®-M0/M0+/M1 processors).
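
The effect of an alignment directive on the current location can be modeled with a little arithmetic. The following C fragment is only a sketch for illustration: the function names align_up and align_padding are hypothetical, and it assumes the alignment is a power of two, as in the ALIGN 4 example above.

```c
#include <stdint.h>

/* Hypothetical helper: round addr up to the next multiple of alignment.
 * Assumes alignment is a power of two (e.g., 4 for a word boundary). */
static inline uint32_t align_up(uint32_t addr, uint32_t alignment)
{
    return (addr + alignment - 1u) & ~(alignment - 1u);
}

/* Hypothetical helper: number of padding bytes the assembler would need
 * to insert so that the next data item starts on an aligned address. */
static inline uint32_t align_padding(uint32_t addr, uint32_t alignment)
{
    return align_up(addr, alignment) - addr;
}
```

For instance, if the code before MY_NUMBER ended at address 0x1002, align_up(0x1002, 4) gives 0x1004, so two padding bytes would be needed before the word-sized data.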


Again, this example can be rewritten in GNU toolchain assembler syntax:
    LDR R3,=MY_NUMBER  /* Get the memory location of MY_NUMBER */
    LDR R4, [R3]       /* Read the value 0x12345678 into R4 */
    ...
    LDR R0,=HELLO_TEXT /* Get the starting address of
                          HELLO_TEXT */
    BL  PrintText      /* Call a function called PrintText to
                          display string */
    ...
    .align 4
MY_NUMBER:
    .word 0x12345678
HELLO_TEXT:
    .asciz "Hello\n"   /* Null terminated string */

A number of different directives are available in both ARM assembler and GNU
assembler for inserting data into a program. Table 5.1 gives a few commonly used
examples.

Table 5.1 Commonly Used Directives for Inserting Data Into a Program

Type of Data to Insert    ARM Assembler (e.g., Keil MDK-ARM)         GNU Assembler
Byte                      DCB                                        .byte
                          E.g., DCB 0x12                             E.g., .byte 0x012
Half-word                 DCW                                        .hword / .2byte
                          E.g., DCW 0x1234                           E.g., .hword 0x01234
Word                      DCD                                        .word / .4byte
                          E.g., DCD 0x01234567                       E.g., .word 0x01234567
Double-word               DCQ                                        .quad/.octa
                          E.g., DCQ 0x12345678FF0055AA               E.g., .quad 0x12345678FF0055AA
Floating point            DCFS                                       .float
(single precision)        E.g., DCFS 1E3                             E.g., .float 1E3
Floating point            DCFD                                       .double
(double precision)        E.g., DCFD 3.14159                         E.g., .double 3.14159
String                    DCB                                        .ascii / .asciz (with NULL termination)
                          E.g., DCB "Hello\n", 0                     E.g., .ascii "Hello\n"
                                                                           .byte 0 /* add NULL character */
                                                                     E.g., .asciz "Hello\n"
Instruction               DCI                                        .word / .hword
                          E.g., DCI 0xBE00 ; Breakpoint (BKPT 0)     E.g., .hword 0xBE00 /* Breakpoint (BKPT 0) */


In most cases, you can also add a label before the directive so that the addresses
of the data can be determined using the label.
There are a number of other useful directives that are often used in assembly language programming. For example, some of the ARM assembler directives given in
Table 5.2 are commonly used and some are used in the examples in this book.
Additional information about directives in ARM assembler can be found in the
“ARM Compiler Toolchain Assembler Reference” (reference 6, section 6.3, Data,
Data Definition Directives; http://infocenter.arm.com/help/topic/com.arm.doc.dui0489c/Cacgadfj.html).

Table 5.2 Commonly Used Directives
Each entry gives the ARM assembler directive, the GNU assembler equivalent in
parentheses where one is listed, and a description of the ARM assembler directive.

THUMB (.thumb)
    Specify assembly code as Thumb instructions in Unified Assembly Language (UAL) format.
CODE16 (.code 16)
    Specify assembly code as Thumb instructions in legacy pre-UAL syntax.
AREA (.section)
    Instructs the assembler to assemble a new code or data section. Sections are
    independent, named, indivisible chunks of code or data that are manipulated by the linker.
SPACE (.zero)
    Reserves a block of memory and fills it with zeros.
FILL (.fill)
    Reserves a block of memory and fills it with the specified value. The size of the
    value can be byte, half-word, or word, specified by value_sizes (1/2/4).
ALIGN (.align)
    Aligns the current location to a specified boundary by padding with zeros or NOP
    instructions. E.g.,
    ALIGN 8 ; make sure the next instruction or
            ; data is aligned to 8 byte boundary
EXPORT (.global)
    Declare a symbol that can be used by the linker to resolve symbol references in
    separate object or library files.
IMPORT
    Declare a symbol reference in separate object or library files that is to be
    resolved by the linker.
LTORG (.pool)
    Instructs the assembler to assemble the current literal pool immediately. The literal
    pool contains data such as constant values for the LDR pseudo instruction.


5.4 Use of a suffix in instructions

In assembler for ARM® processors, some instructions can be followed by suffixes.
For Cortex®-M processors, the available suffixes are shown in Table 5.3.
For the Cortex-M3/M4 processors, a data processing instruction can optionally
update the APSR (flags). If using the Unified Assembly Language (UAL) syntax,
we can specify if the APSR update should be carried out or not. For example,
when moving data from one register to another, it is possible to use
    MOVS R0, R1 ; Move R1 into R0 and update APSR

or
    MOV R0, R1 ; Move R1 into R0 without updating APSR

The second type of suffix is for conditional execution of instructions. The
Cortex-M3 and Cortex-M4 processors support conditional branches, as well as
conditional execution of instructions by putting the conditional instructions in an
IF-THEN (IT) instruction block. By updating the APSR using data operations, or
instructions like test (TST) or compare (CMP), the program flow can be controlled
based on conditions of operation results.
Table 5.3 Suffixes for Cortex-M Assembly Language

Suffixes and descriptions:

S
    Update APSR (Application Program Status Register, such as Carry, Overflow,
    Zero and Negative flags); for example:
    ADDS R0, R1 ; this ADD operation will update APSR
EQ, NE, CS, CC, MI, PL, VS, VC, HI, LS, GE, LT, GT, LE
    Conditional execution. EQ = Equal, NE = Not Equal, LT = Less Than,
    GT = Greater Than, etc. On the Cortex-M processors these conditions can be
    applied to conditional branches; for example:
    BEQ label ; Branch to label if previous operation result in
              ; equal status
    or conditionally executed instructions (see IF-THEN instruction in
    section 5.6.9); for example:
    ADDEQ R0, R1, R2 ; Carry out the add operation if the previous
                     ; operation results in equal status
.N, .W
    Specify the use of 16-bit (narrow) instruction or 32-bit (wide) instruction.
.32, .F32
    Specify the operation is for 32-bit single-precision data. In most toolchains,
    the .32 suffix is optional.
.64, .F64
    Specify the operation is for 64-bit double-precision data. In most toolchains,
    the .64 suffix is optional.
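
To make the meaning of the “S” suffix and the EQ/NE conditions more concrete, the following C fragment sketches what an ADDS operation does to the four APSR flags. It is a simplified model for illustration only, not a description of the hardware; the type apsr_flags_t and the functions adds, cond_eq and cond_ne are hypothetical names introduced here.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the four condition flags held in the APSR. */
typedef struct {
    bool n; /* Negative: bit 31 of the result           */
    bool z; /* Zero: the result is 0                    */
    bool c; /* Carry: unsigned overflow of the addition */
    bool v; /* oVerflow: signed overflow of the addition */
} apsr_flags_t;

/* Model of "ADDS Rd, Rn, Rm": returns Rn + Rm and updates the flags. */
static inline uint32_t adds(uint32_t a, uint32_t b, apsr_flags_t *flags)
{
    uint32_t result = a + b; /* wraps modulo 2^32 */

    flags->n = (result >> 31) != 0u;
    flags->z = (result == 0u);
    flags->c = (result < a); /* a carry out occurred if the sum wrapped */
    /* Signed overflow: operands have the same sign, result sign differs. */
    flags->v = ((~(a ^ b) & (a ^ result)) >> 31) != 0u;
    return result;
}

/* Condition suffixes EQ and NE only look at the Z flag. */
static inline bool cond_eq(const apsr_flags_t *flags) { return flags->z; }
static inline bool cond_ne(const apsr_flags_t *flags) { return !flags->z; }
```

In this model, adding 1 to 0xFFFFFFFF produces 0 with Z and C set, so a following BEQ would be taken. A plain ADD (without the S suffix) would produce the same result but leave the flags untouched.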

5.5 Unified assembly language (UAL)

Several years ago, before Thumb®-2 technology was developed, the features available
in the Thumb instruction set were limited, and the Thumb instruction syntax
was more relaxed. For example, in ARM7TDMI™, almost all data processing
instructions in Thumb mode will update the APSR anyway, so the “S” suffix is not
strictly required for the Thumb instruction, and omitting it would still result in an
instruction that updates the APSR.
When Thumb-2 technology arrived, almost all Thumb instructions were available
in a version that updates APSR and a version that does not. As a result, traditional
Thumb syntax can be problematic in Thumb-2 software development.
In order to allow better portability between architectures, and to use a single
Assembly language syntax in ARM® processors with various architectures, recent
ARM development tools have been updated to support the Unified Assembler
Language (UAL). For users who have been using ARM7TDMI in the past, the most
noticeable differences are:
• Some data operation instructions use three operands even when the destination
  register is the same as one of the source registers. In the past (pre-UAL), the
  syntax might only use two operands for these instructions.
• The “S” suffix becomes more explicit. In the past, when an assembly program
  file was assembled into Thumb code, most data operations were encoded as
  instructions that update the APSR. As a result, the “S” suffix was not essential.
  With the UAL syntax, instructions that update the APSR should have the “S”
  suffix to clearly indicate the expected operation. This prevents program code
  failing when being ported from one architecture to another.

For example, a pre-UAL ADD instruction for 16-bit Thumb code is
ADD R0, R1 ; R0 = R0 + R1, update APSR

In UAL syntax, this should be written as follows, being more specific about register usage and APSR update operations:
ADDS R0, R0, R1 ; R0 = R0 + R1, update APSR

However, in most cases (depending on the toolchain being used), you can still
write the instruction with a pre-UAL style (only two operands), but the use of “S”
suffix will be more explicit:
ADDS R0, R1 ; R0 = R0 + R1, update APSR

The pre-UAL syntax is currently still accepted by most development tools,
including the Keil™ Microcontroller Development Kit for ARM (MDK-ARM) and
the ARM Compiler toolchain. However, using UAL is recommended in new projects.
For assembly development with Keil MDK, you can specify the use of UAL syntax
with the “THUMB” directive, and pre-UAL syntax with the “CODE16” directive.

The choice of assembler syntax depends on which tool you use. Please refer to the
documentation of your development suite to determine which syntax is suitable.
One thing you need to be careful about when reusing code with traditional Thumb
is that some instructions change the flags in APSR, even if the S suffix is not used.
However, if you copy and paste the same instruction to a project using UAL syntax,
the instruction becomes one that does not change the flags in APSR. For example:
    CODE16
    ...
    AND R0, R1 ; R0=R0 AND R1, update APSR (Traditional Thumb syntax)

If this line of code is used in a project using UAL, the result will become R0=R0
AND R1 with no APSR update.
With the new instructions in Thumb-2 technology, some of the operations can be
handled by either a Thumb instruction or a Thumb-2 instruction. For example,
R0 = R0 + 1 can be implemented as a 16-bit Thumb instruction or a 32-bit Thumb-2
instruction. With UAL, you can specify which instruction you want by adding suffixes:
    ADDS   R0, #1 ; Use 16-bit Thumb instruction by default
                  ; for smaller size
    ADDS.N R0, #1 ; Use 16-bit Thumb instruction (N=Narrow)
    ADDS.W R0, #1 ; Use 32-bit Thumb-2 instruction (W=wide)

The .W (wide) suffix specifies a 32-bit instruction. If no suffix is given, the
assembler tool can choose either instruction but usually defaults to the smaller
option to get the best code density. Depending on tool support, you may also use
the .N (narrow) suffix to specify a 16-bit Thumb instruction.
Again, this syntax is for ARM assembler tools. Other assemblers might have
slightly different syntax.
In most cases, applications will be coded in C, and the C compilers will use
16-bit instructions if possible due to their smaller code size. However, when the immediate data exceeds a certain range, or when the operation can be better handled
with a 32-bit Thumb-2 instruction, the 32-bit instruction will be used. When the
compilation is optimized for speed, the C compiler might also use 32-bit instructions
to adjust the branch target addresses to 32-bit aligned for better performance.
32-bit Thumb-2 instructions can be half-word aligned. For example, you can
have a 32-bit instruction located at a half-word location (unaligned) (Figure 5.4):
    0x1000 : LDR r0,[r1] ; a 16-bit instruction (occupies 0x1000-0x1001)
    0x1002 : RBIT.W r0   ; a 32-bit Thumb-2 instruction (occupies
                         ; 0x1002-0x1005)

Most 16-bit instructions can only access registers R0 to R7; 32-bit Thumb-2
instructions do not have this limitation. However, use of PC (R15) might not be
allowed in some of the instructions. Refer to the ARMv7-M Architecture Reference
Manual (reference 1) or Cortex®-M3/M4 Devices Generic User Guides (section
3.3.2, references 2 and 3) if you need to find out more detail in this area.
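
Because 16-bit and 32-bit instructions can be mixed freely, a tool that walks through Thumb code (a disassembler, for example) has to work out the size of each instruction from its first half-word. In the Thumb-2 encoding, a first half-word whose top five bits are 0b11101, 0b11110 or 0b11111 is the start of a 32-bit instruction; anything else is a 16-bit instruction. The C fragment below is a minimal sketch of that check; the function names thumb_instr_size and next_instr_addr are hypothetical.

```c
#include <stdint.h>

/* Hypothetical helper: size in bytes (2 or 4) of a Thumb instruction,
 * decided from the top five bits of its first half-word. */
static inline uint32_t thumb_instr_size(uint16_t first_halfword)
{
    uint32_t top5 = (uint32_t)first_halfword >> 11;

    /* 0b11101, 0b11110 and 0b11111 mark the first half of a 32-bit
     * instruction; every other pattern is a complete 16-bit instruction. */
    return (top5 == 0x1Du || top5 == 0x1Eu || top5 == 0x1Fu) ? 4u : 2u;
}

/* Hypothetical helper: address of the instruction that follows. */
static inline uint32_t next_instr_addr(uint32_t addr, uint16_t first_halfword)
{
    return addr + thumb_instr_size(first_halfword);
}
```

With the layout shown above, a 16-bit instruction at 0x1000 is followed by an instruction at 0x1002, and a 32-bit instruction starting at 0x1002 is followed by one at 0x1006, which is only half-word aligned.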

FIGURE 5.4
An unaligned 32-bit instruction

5.6 Instruction set

The instructions in the Cortex®-M3 and Cortex-M4 processors can be divided into
various groups based on functionality:
• Moving data within the processor
• Memory accesses
• Arithmetic operations
• Logic operations
• Shift and Rotate operations
• Conversion (extend and reverse ordering) operations
• Bit field processing instructions
• Program flow control (branch, conditional branch, conditional execution, and
  function calls)
• Multiply accumulate (MAC) instructions
• Divide instructions
• Memory barrier instructions
• Exception-related instructions
• Sleep mode-related instructions
• Other functions

In addition, the Cortex-M4 processor supports the Enhanced DSP instructions:
• SIMD operations and packing instructions
• Additional fast multiply and MAC instructions
• Saturation algorithms
• Floating point instructions (if the floating point unit is present)

Details of each instruction are covered in the Cortex-M3/Cortex-M4 Devices
Generic User Guides (references 2 and 3, available on the ARM® website). In the
rest of this section we will look into some of the basic concepts of assembly
language programming.
To make it easier for beginners, in this part we will skip the conditional suffix for
now. Most of the instructions can be executed conditionally when used together with
the IF-THEN (IT) instruction, which will require the suffix to indicate the condition.

5.6.1 Moving data within the processor
The most basic operation in a microprocessor is to move data around inside the
processor. For example, you might want to:
• Move data from one register to another
• Move data between a register and a special register (e.g., CONTROL,
  PRIMASK, FAULTMASK, BASEPRI)
• Move an immediate constant into a register

For the Cortex®-M4 processor with the floating point unit, you can also:
• Move data between a register in the core register bank and a register in the
  floating point unit register bank
• Move data between registers in the floating point register bank
• Move data between a floating point system register (such as the FPSCR, the
  Floating point Status and Control Register) and a core register
• Move immediate data into a floating point register

Table 5.4 shows some examples of these operations.
The instructions in Table 5.5 are available for Cortex-M4 with floating point
unit only.
Table 5.4 Instructions for Transferring Data within the Processor

Instruction   Dest       Source     Operations
MOV           R4,        R0         ; Copy value from R0 to R4
MOVS          R4,        R0         ; Copy value from R0 to R4 with APSR (flags) update
MRS           R7,        PRIMASK    ; Copy value of PRIMASK (special register) to R7
MSR           CONTROL,   R2         ; Copy value of R2 into CONTROL (special register)
MOV           R3,        #0x34      ; Set R3 value to 0x34
MOVS          R3,        #0x34      ; Set R3 value to 0x34 with APSR update
MOVW          R6,        #0x1234    ; Set R6 to a 16-bit constant 0x1234
MOVT          R6,        #0x8765    ; Set the upper 16-bit of R6 to 0x8765
MVN           R3,        R7         ; Move the bitwise NOT (inverted value) of R7 into R3


Table 5.5 Instructions for Transferring Data between the Floating Point Unit and Core Registers

Instruction   Dest         Source   Operations
VMOV          R0,          S0       ; Copy floating point register S0 to general purpose register R0
VMOV          S0,          R0       ; Copy general purpose register R0 to floating point register S0
VMOV          S0,          S1       ; Copy floating point register S1 to S0 (single precision)
VMRS.F32      R0,          FPSCR    ; Copy value in FPSCR, a floating point unit system register, to R0
VMRS          APSR_nzcv,   FPSCR    ; Copy flags from FPSCR to the flags in APSR
VMSR          FPSCR,       R3       ; Copy R3 to FPSCR, a floating point unit system register
VMOV.F32      S0,          #1.0     ; Move single-precision value into floating point register S0

The MOVS instruction is similar to the MOV instruction, apart from the fact that
it updates the flags in the APSR, hence the “S” suffix is used. For setting a register in
the general purpose register bank to an 8-bit immediate value, the MOVS instruction
is sufficient and can be carried out with a 16-bit Thumb instruction if the destination
register is a low register (R0 to R7). For moving an immediate value into a high
register, or if the APSR must not be updated, the 32-bit version of the MOV/
MOVS instructions would be used.
To set a register to a larger immediate value (between 9-bit and 16-bit), the
MOVW instruction can be used. Depending on the assembler tool you are using,
it might automatically convert a MOV or MOVS instruction into MOVW if the
immediate data is between 9-bit and 16-bit.
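
The choice between these options can be summarized as a simple decision based on the size of the constant. The C fragment below is a deliberately simplified sketch: the enumeration and the function classify_immediate are hypothetical names, and the model ignores other cases such as the additional constant patterns that a 32-bit MOV instruction can encode, or the requirement for a low destination register in the 16-bit form.

```c
#include <stdint.h>

/* Hypothetical labels for the ways of loading a constant into a register. */
typedef enum {
    IMM_USE_MOVS_16BIT,   /* fits in 8 bits: 16-bit MOVS is sufficient     */
    IMM_USE_MOVW,         /* 9 to 16 bits: use MOVW                        */
    IMM_USE_LDR_OR_MOVT   /* wider: LDR pseudo instruction or MOVW + MOVT  */
} imm_strategy_t;

/* Simplified decision that follows the description in the text. */
static inline imm_strategy_t classify_immediate(uint32_t value)
{
    if (value <= 0xFFu) {
        return IMM_USE_MOVS_16BIT;
    }
    if (value <= 0xFFFFu) {
        return IMM_USE_MOVW;
    }
    return IMM_USE_LDR_OR_MOVT;
}
```

For example, 0x34 falls in the first group, 0x1234 in the second, and 0x12345678 in the third, which is covered in the rest of this section.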
If you need to set a register to a 32-bit immediate data value, there are several
ways of doing this.
The most common method is to use a pseudo instruction called “LDR”; for
example:
LDR R0, =0x12345678 ; Set R0 to 0x12345678

This is not a real instruction. The assembler converts this instruction into a
memory transfer instruction and a literal data item stored in the program image:
    LDR R0, [PC, #offset]
    ...
    DCD 0x12345678

The LDR instruction reads the memory at [PC+offset] and stores the value into
R0. Note that due to the pipeline nature of the processor, the value of PC is not
exactly the address of the LDR instruction. However, the assembler will calculate
the offset for you so you don’t have to worry about it.
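
If you are reading a disassembly listing and want to check which literal a PC-relative load refers to, the calculation can be sketched as follows. In Thumb state the PC value seen by the instruction is the instruction address plus 4, and for a literal load this value is rounded down to a word boundary before the offset is added. The function name literal_address is hypothetical and the fragment is for illustration only.

```c
#include <stdint.h>

/* Hypothetical helper: address read by "LDR Rt, [PC, #offset]" when the
 * LDR instruction itself is located at instr_addr (Thumb state). */
static inline uint32_t literal_address(uint32_t instr_addr, uint32_t offset)
{
    uint32_t pc = instr_addr + 4u; /* PC reads as instruction address + 4 */
    uint32_t base = pc & ~3u;      /* base is the PC aligned down to a word */

    return base + offset;
}
```

Note that an LDR at 0x1000 and an LDR at 0x1002 with the same offset of 8 both refer to the literal at 0x100C, because of the word alignment of the base address.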

LITERAL POOL
Usually the assembler groups various literal data (e.g., DCD 0x12345678 in the above example)
together into data blocks called literal pools. Since the value of the offset in the LDR instruction is
limited, a program will often need a number of literal pools so that the LDR instruction can access
the literal data. Therefore we need to insert assembler directives like LTORG (or .pool) to tell the
assembler where it can insert literal pools. Otherwise the assembler will try to put all the literal data
after the end of the program code, which might be too far away for the LDR instruction to access it.
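
As a rough illustration of why the placement of literal pools matters, the following C fragment checks whether a literal is within reach of a PC-relative LDR. It assumes the offset ranges of the two Thumb encodings of the literal load: the 16-bit encoding takes a forward offset of 0 to 1020 bytes in multiples of 4, and the 32-bit encoding takes an offset of up to 4095 bytes in either direction. The function names are hypothetical, and register restrictions of the 16-bit encoding are not modeled.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: can a 16-bit LDR (literal) at instr_addr reach a
 * literal at literal_addr? Forward only, word multiple, at most 1020 bytes. */
static inline bool literal_reachable_16bit(uint32_t instr_addr,
                                           uint32_t literal_addr)
{
    uint32_t base = (instr_addr + 4u) & ~3u; /* word-aligned PC value */

    if (literal_addr < base) {
        return false; /* the 16-bit encoding cannot use a negative offset */
    }
    uint32_t offset = literal_addr - base;
    return (offset % 4u == 0u) && (offset <= 1020u);
}

/* Hypothetical helper: same check for the 32-bit LDR (literal) encoding,
 * which allows an offset of up to 4095 bytes forwards or backwards. */
static inline bool literal_reachable_32bit(uint32_t instr_addr,
                                           uint32_t literal_addr)
{
    uint32_t base = (instr_addr + 4u) & ~3u;
    uint32_t distance = (literal_addr >= base) ? (literal_addr - base)
                                               : (base - literal_addr);
    return distance <= 4095u;
}
```

If neither check passes for a given LDR, the literal pool is too far away, and a directive such as LTORG (or .pool) would be needed closer to the instruction.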

If the operation needs to set the register to an address in the program code within
a certain address range, you can use the ADR pseudo instruction, which will be converted into a single instruction, or ADRL pseudo instruction, which can provide a
wider address range but needs two instructions to implement. For example:
ADR R0, DataTable
.
ALIGN
DataTable
DCD 0, 245, 132, .

The ADR instruction will be converted into an “add” or “subtract” operation
based on the program counter value.
Another way to generate a 32-bit immediate data value is to use a combination of
MOVW and MOVT instructions. For example:
MOVW R0, #0x789A ; Set R0 to 0x0000789A
MOVT R0, #0x3456 ; Set upper 16-bit of R0 to 0x3456,
; now R0 = 0x3456789A
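The effect of the two instructions on the register can be sketched in C as follows. This is only a behavioral model; the function names movw and movt are chosen here for illustration.

```c
#include <stdint.h>

/* MOVW: write a 16-bit immediate and clear the upper half-word. */
uint32_t movw(uint16_t imm16)
{
    return (uint32_t)imm16;
}

/* MOVT: replace only the upper half-word, keeping the lower one. */
uint32_t movt(uint32_t reg, uint16_t imm16)
{
    return (reg & 0x0000FFFFu) | ((uint32_t)imm16 << 16);
}
```

Calling movt(movw(0x789A), 0x3456) reproduces the sequence above and yields 0x3456789A. The order matters: MOVW must come first because it clears the upper half-word.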

When comparing this method to using the LDR pseudo instruction, the LDR
method gives better readability, and the assembler might be able to reduce code
size by reusing the same literal data, if the same constant value is used in several
places of the assembly code. However, depending on the memory system design,
in some cases the MOVW + MOVT method can result in faster code if a
system-level cache is used and if the LDR resulted in a data cache miss.

5.6.2 Memory access instructions
There are large numbers of memory access instructions in the Cortex®-M3 and
Cortex-M4 processors. This is due to the combination of support of various addressing modes, as well as data size and data transfer direction. For normal data transfers,
the instructions available are given in Table 5.6.

Table 5.6 Memory Access Instructions for Various Data Sizes

Data Type                   Load (Read from Memory)   Store (Write to Memory)
8-bit unsigned              LDRB                      STRB
8-bit signed                LDRSB                     STRB
16-bit unsigned             LDRH                      STRH
16-bit signed               LDRSH                     STRH
32-bit                      LDR                       STR
Multiple 32-bit             LDM                       STM
Double-word (64-bit)        LDRD                      STRD
Stack operations (32-bit)   POP                       PUSH

Note: The LDRSB and LDRSH instructions automatically perform a sign extend operation on the loaded data to convert it to a signed 32-bit value. For example, if 0x83 is
read by an LDRSB instruction, the value is converted into 0xFFFFFF83 before being
placed in the destination register.
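The difference between the unsigned and signed byte loads can be sketched in C. This is a minimal model, assuming memory is represented by a byte array; the function names are illustrative only.

```c
#include <stdint.h>

/* LDRB: zero-extend the loaded byte to 32 bits. */
uint32_t ldrb(const uint8_t *mem, uint32_t addr)
{
    return mem[addr];
}

/* LDRSB: copy bit 7 of the loaded byte into bits 31..8. */
uint32_t ldrsb(const uint8_t *mem, uint32_t addr)
{
    uint32_t v = mem[addr];
    return (v & 0x80u) ? (v | 0xFFFFFF00u) : v;
}
```

With the byte 0x83 in memory, ldrb returns 0x00000083 while ldrsb returns 0xFFFFFF83. A byte with bit 7 clear, such as 0x7F, is returned unchanged by both.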
If the floating point unit is present, the instructions in Table 5.7 are also available
to transfer data between the register bank in the floating point unit and memory.
There are also a number of addressing modes available. In some of these modes,
you can also optionally update the register holding the address (write back).

Immediate offset (pre-index)
The memory address of the data transfer is the sum of a register value and an immediate constant value (offset). Sometimes this is referred to as “pre-index” addressing.
For example:
LDRB R0, [R1, #0x3] ; Read a byte value from address R1+0x3, and
                    ; store the read data in R0.

The offset value can be positive or negative. Table 5.8 shows a list of commonly
used load and store instructions.

Table 5.7 Memory Access Instructions for the Floating Point Unit

Data Type                        Read from Memory (Load)   Write to Memory (Store)
Single-precision data (32-bit)   VLDR.32                   VSTR.32
Double-precision data (64-bit)   VLDR.64                   VSTR.64
Multiple data                    VLDM                      VSTM
Stack operations                 VPOP                      VPUSH

Table 5.8 Memory Access Instructions with Immediate Offset
Note: the #offset field is optional

Example of Pre-index Accesses    Description
LDRB  Rd, [Rn, #offset]          Read byte from memory location Rn + offset
LDRSB Rd, [Rn, #offset]          Read and sign extend byte from memory location Rn + offset
LDRH  Rd, [Rn, #offset]          Read half-word from memory location Rn + offset
LDRSH Rd, [Rn, #offset]          Read and sign extend half-word from memory location Rn + offset
LDR   Rd, [Rn, #offset]          Read word from memory location Rn + offset
LDRD  Rd1,Rd2, [Rn, #offset]     Read double-word from memory location Rn + offset
STRB  Rd, [Rn, #offset]          Store byte to memory location Rn + offset
STRH  Rd, [Rn, #offset]          Store half-word to memory location Rn + offset
STR   Rd, [Rn, #offset]          Store word to memory location Rn + offset
STRD  Rd1,Rd2, [Rn, #offset]     Store double-word to memory location Rn + offset

This addressing mode supports write back of the register holding the address. For
example:
LDR R0, [R1, #0x8]! ; After the access to memory[R1+0x8], R1 is
                    ; updated to R1+0x8

The exclamation mark (!) in the instruction specifies whether the register holding
the address should be updated (write back) when the instruction is completed. The
address used for the data transfer is the sum R1+0x8, regardless of
whether the exclamation mark (!) is stated. The write back operation can be used
with a number of load and store instructions as shown in Table 5.9.
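The two forms can be compared with a small C model. This is a sketch under simplifying assumptions: memory is a hypothetical word array indexed by byte address divided by 4, and the function name ldr_pre is not part of any toolchain.

```c
#include <stdint.h>

/* LDR Rd, [Rn, #offset]   when writeback == 0
 * LDR Rd, [Rn, #offset]!  when writeback != 0
 * 'mem' is a hypothetical word array indexed by byte address / 4. */
uint32_t ldr_pre(const uint32_t *mem, uint32_t *rn, int32_t offset,
                 int writeback)
{
    uint32_t addr = *rn + (uint32_t)offset; /* same address in both forms */
    if (writeback)
        *rn = addr;                         /* base register updated */
    return mem[addr / 4u];
}
```

Both forms read the same word; the only difference is whether the base register keeps its old value or takes the computed address.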
Please note that some of these instructions cannot be used with R15(PC) or
R13(SP). In addition, the 16-bit versions of these instructions only support low registers (R0-R7) and do not provide write back.
If the floating point unit is present, the instructions in Table 5.10 are also available to perform load and store operations with immediate offset on the registers in the floating point unit.
Note that many floating point instructions use the .32 and .64 suffixes to specify
the floating point data type. In most toolchains, the .32 and .64 suffixes are optional.

PC-related addressing (Literal)
A memory access can generate the address value from the current PC value and an
offset value (Table 5.11). This is commonly needed for loading immediate values
into a register, also known as literal pool accesses, as mentioned earlier in this chapter (LDR pseudo instruction).
If the floating point unit is present, the instructions in Table 5.12 are also available.

Table 5.9 Memory Access Instructions with Immediate Offset and Write Back
Note: the #offset field is optional

Example of Pre-index with Write Back   Description
LDRB  Rd, [Rn, #offset]!               Read byte with write back
LDRSB Rd, [Rn, #offset]!               Read and sign extend byte with write back
LDRH  Rd, [Rn, #offset]!               Read half-word with write back
LDRSH Rd, [Rn, #offset]!               Read and sign extend half-word with write back
LDR   Rd, [Rn, #offset]!               Read word with write back
LDRD  Rd1,Rd2, [Rn, #offset]!          Read double-word with write back
STRB  Rd, [Rn, #offset]!               Store byte to memory with write back
STRH  Rd, [Rn, #offset]!               Store half-word to memory with write back
STR   Rd, [Rn, #offset]!               Store word to memory with write back
STRD  Rd1,Rd2, [Rn, #offset]!          Store double-word to memory with write back

Table 5.10 Memory Access Instructions for Floating Point Unit
Note: the #offset field is optional

Examples                     Description
VLDR.32 Sd, [Rn, #offset]    Read single-precision data from memory to single-precision register Sd
VLDR.64 Dd, [Rn, #offset]    Read double-precision data from memory to double-precision register Dd
VSTR.32 Sd, [Rn, #offset]    Write single-precision data from single-precision register Sd to memory
VSTR.64 Dd, [Rn, #offset]    Write double-precision data from double-precision register Dd to memory

Table 5.11 Memory Access Instructions with PC Related Addressing

Example of Literal Read       Description
LDRB  Rt,[PC, #offset]        Load unsigned byte into Rt using PC offset
LDRSB Rt,[PC, #offset]        Load and sign extend a byte data into Rt using PC offset
LDRH  Rt,[PC, #offset]        Load unsigned half-word into Rt using PC offset
LDRSH Rt,[PC, #offset]        Load and sign extend a half-word data into Rt using PC offset
LDR   Rt,[PC, #offset]        Load a word data into Rt using PC offset
LDRD  Rt,Rt2,[PC, #offset]    Load a double-word into Rt and Rt2 using PC offset

Table 5.12 Floating Point Unit Memory Access Instructions with PC-related Addressing

Example of Literal Read      Description
VLDR.32 Sd,[PC, #offset]     Load single-precision data into single-precision register Sd using PC offset
VLDR.64 Dd,[PC, #offset]     Load double-precision data into double-precision register Dd using PC offset

Register offset (pre-index)
Another useful address mode is the register offset. This is often used in the processing of data arrays where the address is a combination of a base address and an offset
calculated from an index value. To make this address calculation even more efficient, the index value can be shifted by a distance of 0 to 3 bits before being added
to the base register. For example:
LDR R3, [R0, R2, LSL #2] ; Read memory[R0+(R2 << 2)] into R3

The shift operation is optional. You can have a simple operation like
STR R5, [R0,R7] ; Write R5 into memory[R0+R7]

Similarly to immediate offset, there are various forms for different data size, as
shown in Table 5.13.
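The address calculation for array accesses can be sketched in C as below. The helper names are hypothetical, and memory is modeled as a word array indexed by byte address divided by 4, which is an assumption made only for this illustration.

```c
#include <stdint.h>

/* Address formed by [Rn, Rm, LSL #shift]; the shift is limited to 0..3. */
uint32_t reg_offset_address(uint32_t rn, uint32_t rm, unsigned shift)
{
    return rn + (rm << (shift & 3u));
}

/* LDR R3, [R0, R2, LSL #2] on a word array: R0 = base, R2 = index. */
uint32_t load_element(const uint32_t *mem, uint32_t base, uint32_t index)
{
    return mem[reg_offset_address(base, index, 2u) / 4u];
}
```

A shift of 2 scales the index by 4, which matches the size of a word element, so no separate multiply is needed to step through an array of 32-bit values.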

Table 5.13 Memory Access Instructions with Register Offset

Example of Register Offset Accesses   Description
LDRB  Rd, [Rn, Rm{, LSL #n}]          Read byte from memory location Rn + (Rm << n)
LDRSB Rd, [Rn, Rm{, LSL #n}]          Read and sign extend byte from memory location Rn + (Rm << n)
LDRH  Rd, [Rn, Rm{, LSL #n}]          Read half-word from memory location Rn + (Rm << n)
LDRSH Rd, [Rn, Rm{, LSL #n}]          Read and sign extend half-word from memory location Rn + (Rm << n)
LDR   Rd, [Rn, Rm{, LSL #n}]          Read word from memory location Rn + (Rm << n)
STRB  Rd, [Rn, Rm{, LSL #n}]          Store byte to memory location Rn + (Rm << n)
STRH  Rd, [Rn, Rm{, LSL #n}]          Store half-word to memory location Rn + (Rm << n)
STR   Rd, [Rn, Rm{, LSL #n}]          Store word to memory location Rn + (Rm << n)

Post-index
Memory access instructions with post-index addressing mode also have an immediate offset value. However, the offset is not used during the memory access, but is
used to update the address register after the data transfer is completed. For
example:
LDR R0, [R1], #offset ; Read memory[R1], then R1 updated to R1+offset

When the post-index memory addressing mode is used, there is no need to use
the exclamation mark (!) sign because the base address register is always updated
if the data transfer is completed successfully. Table 5.14 lists various forms of
post-index memory access instructions.
The post-index address mode can be very useful for processing data in an array.
As soon as an element in the array is accessed, the address register can be adjusted to
the next element automatically to save code size and execution time.
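The array-processing pattern described above can be modeled in C as follows. This is a sketch only: ldr_post and sum_words are illustrative names, and memory is assumed to be a word array indexed by byte address divided by 4.

```c
#include <stdint.h>

/* LDR Rd, [Rn], #offset : access memory[Rn] first, then Rn = Rn + offset. */
uint32_t ldr_post(const uint32_t *mem, uint32_t *rn, int32_t offset)
{
    uint32_t value = mem[*rn / 4u];  /* offset is not used for the access */
    *rn += (uint32_t)offset;         /* base register is always updated */
    return value;
}

/* Sum 'count' words starting at byte address 'base'. */
uint32_t sum_words(const uint32_t *mem, uint32_t base, uint32_t count)
{
    uint32_t r1 = base;
    uint32_t sum = 0;
    while (count--)
        sum += ldr_post(mem, &r1, 4);
    return sum;
}
```

Each loop iteration needs only the load itself to fetch the element and advance the pointer, which is where the saving in code size and execution time comes from.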
Please note that post-index instructions cannot be used with R15(PC) or R13(SP).
The post-index memory access instructions are 32-bit. The offset value can be positive or negative.

Table 5.14 Memory Access Instructions with Post-Indexing

Example of Post Index Accesses   Description
LDRB  Rd,[Rn], #offset           Read byte from memory[Rn] to Rd, then update Rn to Rn+offset
LDRSB Rd,[Rn], #offset           Read and sign extend byte from memory[Rn] to Rd, then update Rn to Rn+offset
LDRH  Rd,[Rn], #offset           Read half-word from memory[Rn] to Rd, then update Rn to Rn+offset
LDRSH Rd,[Rn], #offset           Read and sign extend half-word from memory[Rn] to Rd, then update Rn to Rn+offset
LDR   Rd,[Rn], #offset           Read word from memory[Rn] to Rd, then update Rn to Rn+offset
LDRD  Rd1,Rd2,[Rn], #offset      Read double-word from memory[Rn] to Rd1, Rd2, then update Rn to Rn+offset
STRB  Rd,[Rn], #offset           Store byte to memory[Rn] then update Rn to Rn+offset
STRH  Rd,[Rn], #offset           Store half-word to memory[Rn] then update Rn to Rn+offset
STR   Rd,[Rn], #offset           Store word to memory[Rn] then update Rn to Rn+offset
STRD  Rd1,Rd2,[Rn], #offset      Store double-word to memory[Rn] then update Rn to Rn+offset

Multiple load and multiple store
One of the key advantages of the ARM architecture is that it allows you to read or
write multiple data that are contiguous in memory. The LDM (Load Multiple
registers) and STM (Store Multiple registers) instructions only support 32-bit data.
They support two addressing modes:
• IA: Increment address After each read/write
• DB: Decrement address Before each read/write

The LDM and STM instructions can be used without base address write back
(Table 5.15).
The <reg list> in Table 5.15 is the register list. It contains at least one register, and it should:
• Start with "{" and end with "}"
• Use "-" (hyphen) to indicate range. For example, R0-R4 means R0, R1, R2, R3
  and R4.
• Use "," (comma) to separate each register

For example, the following instructions read address 0x20000000 to
0x2000000F (four words) into R0 to R3:
LDR   R4,=0x20000000 ; Set R4 to 0x20000000 (address)
LDMIA R4, {R0-R3}    ; Read 4 words and store them to R0 - R3

The register list can be non-contiguous such as {R1, R3, R5-R7, R9, R11-R12},
which contains R1, R3, R5, R6, R7, R9, R11 and R12.
Similar to other load/store instructions, you can use write back with STM and
LDM. For example:
LDR   R8,=0x8000     ; Set R8 to 0x8000 (address)
STMIA R8!, {R0-R3}   ; R8 changes to 0x8010 after the store

The multiple load/store instructions with write back are listed in Table 5.16.
The 16-bit versions of the LDM and STM instructions
are limited to low registers only and always have write back enabled, except when
the base register is one of the destination registers to be updated by the memory read.
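The IA and DB behaviors, with and without write back, can be sketched in C. The model below is an illustration only: the function names are hypothetical, the register list is passed as a plain array in ascending register order, and memory is a word array indexed by byte address divided by 4. It relies on the architectural rule that the lowest-numbered register always corresponds to the lowest address.

```c
#include <stdint.h>

/* STMIA Rn{!}, {list}: store regs[0..count-1] at ascending addresses
 * starting at Rn. Returns the value Rn holds afterwards. */
uint32_t stmia(uint32_t *mem, uint32_t rn, const uint32_t *regs,
               unsigned count, int writeback)
{
    uint32_t addr = rn;
    for (unsigned i = 0; i < count; i++, addr += 4u)
        mem[addr / 4u] = regs[i];
    return writeback ? addr : rn;
}

/* LDMDB Rn{!}, {list}: the block ends just below Rn; the lowest-numbered
 * register is loaded from the lowest address. */
uint32_t ldmdb(const uint32_t *mem, uint32_t rn, uint32_t *regs,
               unsigned count, int writeback)
{
    uint32_t start = rn - 4u * count;
    for (unsigned i = 0; i < count; i++)
        regs[i] = mem[(start + 4u * i) / 4u];
    return writeback ? start : rn;
}
```

Storing four registers with write back advances the base by 0x10, the same step as the 0x8000 to 0x8010 change in the STMIA example above. An LDMDB with write back from the updated base reads the same four words back and returns the base to its starting value.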
If the floating point unit is present, the instructions in Table 5.17 are also available to perform load multiple and store multiple operations to the registers in the
floating point unit.
Table 5.15 Multiple Load/Store Memory Access Instructions

Examples of Multiple Load/Store   Description
LDMIA Rn, <reg list>              Read multiple words from memory location specified by Rn.
                                  Address Increment After (IA) each read.
LDMDB Rn, <reg list>              Read multiple words from memory location specified by Rn.
                                  Address Decrement Before (DB) each read.
STMIA Rn, <reg list>              Write multiple words to memory location specified by Rn.
                                  Address Increment After (IA) each write.
STMDB Rn, <reg list>              Write multiple words to memory location specified by Rn.
                                  Address Decrement Before (DB) each write.

Table 5.16 Multiple Load/Store Memory Access Instructions with Write Back

Example of Multiple Load/Store
with Write Back                   Description
LDMIA Rn!, <reg list>             Read multiple words from memory location specified by Rn.
                                  Address Increment After (IA) each read. Rn writes back
                                  after the transfer is done.
LDMDB Rn!, <reg list>             Read multiple words from memory location specified by Rn.
                                  Address Decrement Before (DB) each read. Rn writes back
                                  after the transfer is done.
STMIA Rn!, <reg list>             Write multiple words to memory location specified by Rn.
                                  Address Increment After (IA) each write. Rn writes back
                                  after the transfer is done.
STMDB Rn!, <reg list>             Write multiple words to memory location specified by Rn.
                                  Address Decrement Before (DB) each write. Rn writes back
                                  after the transfer is done.

Table 5.17 Multiple Load/Store Memory Access Instructions for Floating Point Unit
with Write Back

Example of Multiple Load/Store   Description
VLDMIA.32 Rn, <reg list>         Read multiple single-precision data. Address Increment After (IA) each read.
VLDMDB.32 Rn, <reg list>         Read multiple single-precision data. Address Decrement Before (DB) each read.
VLDMIA.64 Rn, <reg list>         Read multiple double-precision data. Address Increment After (IA) each read.
VLDMDB.64 Rn, <reg list>         Read multiple double-precision data. Address Decrement Before (DB) each read.
VSTMIA.32 Rn, <reg list>         Write multiple single-precision data. Address increment after each write.
VSTMDB.32 Rn, <reg list>         Write multiple single-precision data. Address decrement before each write.
VSTMIA.64 Rn, <reg list>         Write multiple double-precision data. Address increment after each write.
VSTMDB.64 Rn, <reg list>         Write multiple double-precision data. Address decrement before each write.
VLDMIA.32 Rn!, <reg list>        Read multiple single-precision data. Address Increment After (IA) each read.
                                 Rn writes back after the transfer is done.
VLDMDB.32 Rn!, <reg list>        Read multiple single-precision data. Address Decrement Before (DB) each read.
                                 Rn writes back after the transfer is done.
VLDMIA.64 Rn!, <reg list>        Read multiple double-precision data. Address Increment After (IA) each read.
                                 Rn writes back after the transfer is done.
VLDMDB.64 Rn!, <reg list>        Read multiple double-precision data. Address Decrement Before (DB) each read.
                                 Rn writes back after the transfer is done.
VSTMIA.32 Rn!, <reg list>        Write multiple single-precision data. Address increment after each write.
                                 Rn writes back after the transfer is done.
VSTMDB.32 Rn!, <reg list>        Write multiple single-precision data. Address decrement before each write.
                                 Rn writes back after the transfer is done.
VSTMIA.64 Rn!, <reg list>        Write multiple double-precision data. Address increment after each write.
                                 Rn writes back after the transfer is done.
VSTMDB.64 Rn!, <reg list>        Write multiple double-precision data. Address decrement before each write.
                                 Rn writes back after the transfer is done.

Table 5.18 Stack Push and Stack POP Instructions for Core Registers

Example of Stack Operations   Description
PUSH <reg list>               Store register(s) in stack.
POP  <reg list>               Restore register(s) from stack.

Stack push and pop
Stack push and pop are another form of the store multiple and load multiple. They
use the currently selected stack pointer for address generation. The currently
selected stack pointer can either be the Main Stack Pointer (MSP), or the Process
Stack Pointer (PSP), depending on the current mode of the processor and the value
in the CONTROL special register (see Chapter 4). Instructions for stack push and
stack pop are shown in Table 5.18.
The register list syntax is the same as LDM and STM. For example:
PUSH {R0, R4-R7, R9} ; PUSH R0, R4, R5, R6, R7, R9 into stack
POP  {R2, R3}        ; POP R2 and R3 from stack

Usually a PUSH instruction will have a corresponding POP with the same register list, but this is not always necessary. For example, a common exception is
when POP is used as a function return:
PUSH {R4-R6, LR} ; Save R4 to R6 and LR (Link Register) at the
                 ; beginning of a subroutine. LR contains the
                 ; return address
...              ; processing in the subroutine
POP  {R4-R6, PC} ; POP R4 to R6, and return address from stack.
                 ; the return address is stored into PC directly,
                 ; this triggers a branch (subroutine return)

Table 5.19 Stack Push and Stack POP Instructions for Floating Point Unit Registers

Example of Stack Operations   Description
VPUSH.32 <reg list>           Store single-precision register(s) in stack. (i.e., s0-s31)
VPUSH.64 <reg list>           Store double-precision register(s) in stack. (i.e., d0-d15)
VPOP.32  <reg list>           Restore single-precision register(s) from stack.
VPOP.64  <reg list>           Restore double-precision register(s) from stack.

Instead of popping the return address into LR, and then writing it to the program
counter (PC), we can write the return address directly to PC to save instruction count
and cycle count.
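The PUSH {R4-R6, LR} and POP {R4-R6, PC} pairing can be modeled in C as a rough sketch. The cpu_t structure, the stack size, and the use of a word index for SP are all assumptions made for this illustration; the model follows the full-descending stack behavior, where PUSH decrements SP before storing and POP increments it after loading.

```c
#include <stdint.h>

#define STACK_WORDS 16u

/* Hypothetical, reduced register file for illustration only. */
typedef struct {
    uint32_t r4, r5, r6, lr, pc;
    uint32_t sp;                  /* word index into stack[], full descending */
    uint32_t stack[STACK_WORDS];
} cpu_t;

/* PUSH {R4-R6, LR}: SP drops by 4 words; R4 lands at the lowest address. */
void push_r4_r6_lr(cpu_t *c)
{
    c->sp -= 4u;
    c->stack[c->sp + 0u] = c->r4;
    c->stack[c->sp + 1u] = c->r5;
    c->stack[c->sp + 2u] = c->r6;
    c->stack[c->sp + 3u] = c->lr;
}

/* POP {R4-R6, PC}: the saved LR value is loaded straight into PC. */
void pop_r4_r6_pc(cpu_t *c)
{
    c->r4 = c->stack[c->sp + 0u];
    c->r5 = c->stack[c->sp + 1u];
    c->r6 = c->stack[c->sp + 2u];
    c->pc = c->stack[c->sp + 3u];
    c->sp += 4u;
}
```

After the POP, the stack pointer is back at its original value and PC holds the return address that was in LR at function entry, so no separate branch instruction is required.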
The 16-bit versions of PUSH and POP are limited to low registers (R0 to R7), LR
(for PUSH), and PC (for POP). Therefore if a high register is modified in a function
and the contents of the register need to be saved, you need to use a pair of 32-bit
PUSH and POP instructions.
If the floating point unit is present, the instructions in Table 5.19 are also available to perform stack operations to the registers in the floating point unit.
Unlike PUSH and POP, VPUSH and VPOP instructions require that:
• The registers in the register list are consecutive
• The maximum number of registers stacked/unstacked for each VPUSH or VPOP
  is 16

If it is necessary to save more than 16 single-precision floating point registers,
you can use the double-precision instructions, or use two pairs of VPUSH and VPOP.

SP-relative addressing
Besides being used for the temporary storage of registers in functions or subroutines,
the stack memory is very often also used for local variables, and accessing these variables requires SP-relative addressing. There is no special 32-bit version of
SP-relative addressing as this is already covered by the load and store instructions
with immediate offset. However, most 16-bit Thumb instructions can only use
low registers. As a result, there is a pair of dedicated 16-bit versions of the LDR and
STR instructions with SP-relative addressing.
An example of using SP-relative addressing mode (Figure 5.5) can be: at the
beginning of a function the SP value can be decremented to reserve space for local
variables and then the local variables can be accessed using SP-relative addressing.
At the end of the function, the SP is incremented to return to the original value,
which frees the allocated stack space before returning to the calling code.


FIGURE 5.5
Local variable space allocation and accesses in stack

Load and store with unprivileged access level
There is a set of load and store instructions to allow program code executing in privileged access level to access memory with unprivileged access rights, as shown in
Table 5.20.
These instructions might be needed in some OS environments where an unprivileged application can access an API function (running within the privileged access
level) with a data pointer as an input parameter, and this API operates on memory
data specified by the pointer. If the data access is carried out using normal load
and store instructions, the unprivileged application task will then have the ability
to modify data that is used by other tasks or OS kernel using this API. By coding
the API using these special Load and Store instructions with unprivileged access
level, the API can only access the data which the application task can access.
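The protection this gives can be illustrated with a simplified permission model in C. Everything here is hypothetical: the region_t type, the two-level permission rule, and the function names are assumptions for the sketch and do not represent the processor's actual memory protection logic.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical memory map entry: an inclusive address range and whether
 * unprivileged code may access it. */
typedef struct {
    uint32_t start, end;
    bool     unpriv_ok;
} region_t;

/* Permission check for one access. In this simplified model privileged
 * accesses to a mapped region are always allowed. */
bool access_ok(const region_t *map, unsigned n, uint32_t addr,
               bool privileged)
{
    for (unsigned i = 0; i < n; i++)
        if (addr >= map[i].start && addr <= map[i].end)
            return privileged || map[i].unpriv_ok;
    return false;                 /* unmapped address: treated as a fault */
}

/* STR issued by privileged code is checked with privileged rights ... */
bool str_ok(const region_t *map, unsigned n, uint32_t addr)
{
    return access_ok(map, n, addr, true);
}

/* ... while STRT is checked as if the caller were unprivileged. */
bool strt_ok(const region_t *map, unsigned n, uint32_t addr)
{
    return access_ok(map, n, addr, false);
}
```

If an application passes a pointer into a privileged-only region, an API written with the normal store would succeed, whereas the STRT-style check rejects it, so the API cannot be used to reach data the caller could not access directly.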
Table 5.20 Memory Access Instructions with Unprivileged Access Level
Note: the #offset field is optional

Example of LDR/STR with
Unprivileged Access Level    Description
LDRBT  Rd, [Rn, #offset]     Read byte from memory location Rn + offset
LDRSBT Rd, [Rn, #offset]     Read and sign extend byte from memory location Rn + offset
LDRHT  Rd, [Rn, #offset]     Read half-word from memory location Rn + offset
LDRSHT Rd, [Rn, #offset]     Read and sign extend half-word from memory location Rn + offset
LDRT   Rd, [Rn, #offset]     Read word from memory location Rn + offset
STRBT  Rd, [Rn, #offset]     Store byte to memory location Rn + offset
STRHT  Rd, [Rn, #offset]     Store half-word to memory location Rn + offset
STRT   Rd, [Rn, #offset]     Store word to memory location Rn + offset

Exclusive accesses
The exclusive access instructions are a special group of memory access instructions
for implementing semaphores or MUTEX (Mutual Exclusion) operations. They are
normally used within an embedded OS where a resource (often hardware, but can also
be software) has to be shared between multiple application tasks, or even multiple
processors.
Exclusive access instructions include exclusive loads and exclusive stores.
Special hardware inside the processor and optionally in the bus interconnect are
needed to monitor exclusive accesses. Inside the processor, a single bit register
is present to record an on-going exclusive access sequence: we call it the local
exclusive access monitor. On the system bus level, an exclusive access monitor
might also be present to check if a memory location (or memory device) used
by an exclusive access sequence has been accessed by another processor or bus
master. The processor has extra signals in the bus interface to indicate that a transfer is an exclusive access and to receive a response from the system bus level exclusive access monitor.
In a semaphore or a MUTEX operation, a data variable in RAM is used
to represent a token. It can be used to indicate, for example, that a hardware
resource has been allocated to an application task. For example, assume that if
the variable is 0, it indicates the resource is available, and 1 indicates that it is
already allocated to a task. The exclusive access sequence for requesting the
resource might be:
1. The variable is accessed with an exclusive load (read). The local exclusive access
monitor inside the processor is updated to indicate an active exclusive access transfer
and, if a bus level exclusive access monitor is present, it will also be updated.
2. The variable is checked by the application code to determine whether the
hardware resource has already been allocated. If the value is 1 (already allocated), then it can retry later or give up. If the value is 0 (resource free), then it
can try to allocate the resource in the next step.
3. The task uses an exclusive store to write a value of 1 to the variable. If the local
exclusive access monitor is set and there is no error reported by the bus level
exclusive access monitor, the variable will be updated and the exclusive store
will get a success return status. If something happened between the exclusive
load and exclusive store that could affect the exclusiveness of the access to the
variable, the exclusive store will get a failed return status and the variable will
not be updated (either cancelled by the processor itself or the store is blocked by
the bus level exclusive access monitor).
4. From the return status, the application task knows whether it has allocated the
hardware resource successfully. If not, it can retry later or give up.
The exclusive store fails if:
• The bus level exclusive access monitor returns an exclusive fail response (e.g.,
  the memory location or memory range has been accessed by another processor)
• The local exclusive access monitor is not set. This can be caused by:
  a) Incorrect exclusive access sequence
  b) An interrupt entry/exit between the exclusive load and exclusive store (the
     memory location or memory range could have been accessed by an interrupt
     handler or another application task).
  c) Execution of a special instruction CLREX that clears the local exclusive
     access monitor.

Table 5.21 Exclusive Access Instructions

Example of Exclusive Access    Description
LDREXB Rt, [Rn]                Exclusive read byte from memory location Rn
LDREXH Rt, [Rn]                Exclusive read half-word from memory location Rn
LDREX  Rt, [Rn, #offset]       Exclusive read word from memory location Rn + offset
STREXB Rd, Rt, [Rn]            Exclusive store byte in Rt to memory location Rn.
                               Return status in Rd.
STREXH Rd, Rt, [Rn]            Exclusive store half-word in Rt to memory location Rn.
                               Return status in Rd.
STREX  Rd, Rt, [Rn, #offset]   Exclusive store word in Rt to memory location Rn + offset.
                               Return status in Rd.
CLREX                          Force the local exclusive access monitor to clear so that
                               the next exclusive store must fail. This is not a memory
                               access instruction, but is listed here due to its usage.
The instructions for exclusive accesses are given in Table 5.21.
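The request sequence and the role of the local monitor can be sketched in C. This is a single-processor behavioral model with no bus level monitor; cpu_t and the function names are hypothetical. The status values follow the instruction definition: the exclusive store returns 0 on success and 1 on failure.

```c
#include <stdbool.h>
#include <stdint.h>

/* Single-bit local exclusive access monitor (model only). */
typedef struct {
    bool local_monitor;
} cpu_t;

/* Exclusive load: read the variable and set the local monitor. */
uint32_t ldrex(cpu_t *c, const volatile uint32_t *addr)
{
    c->local_monitor = true;
    return *addr;
}

/* Exclusive store: returns 0 on success, 1 on failure. */
uint32_t strex(cpu_t *c, volatile uint32_t *addr, uint32_t value)
{
    if (!c->local_monitor)
        return 1u;                /* store cancelled, memory unchanged */
    *addr = value;
    c->local_monitor = false;
    return 0u;
}

/* CLREX: force the next exclusive store to fail. */
void clrex(cpu_t *c)
{
    c->local_monitor = false;
}

/* One attempt to take the token: 0 = free, 1 = allocated.
 * Returns true if this caller now owns the resource. */
bool try_lock(cpu_t *c, volatile uint32_t *lock)
{
    if (ldrex(c, lock) != 0u)
        return false;             /* already allocated: retry or give up */
    return strex(c, lock, 1u) == 0u;
}
```

In real code a failed attempt is typically retried in a loop. In this model, an interrupt between the load and the store would be represented by a call to clrex, which makes the following strex return 1 and leaves the variable unchanged.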

5.6.3 Arithmetic operations
The Cortex®-M3 and Cortex-M4 processors provide many different instructions for
arithmetic operations. A few basic ones are introduced here. Many data processing instructions can have multiple instruction formats. For example, an ADD instruction can
operate between two registers or between one register and an immediate data value:
ADD  R0, R0, R1    ; R0 = R0 + R1
ADDS R0, R0, #0x12 ; R0 = R0 + 0x12 with APSR (flags) update
ADC  R0, R1, R2    ; R0 = R1 + R2 + carry

These are all ADD instructions, but they have different syntaxes and binary
coding.
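One common use of the flag-updating add together with add-with-carry is multi-word arithmetic. The C sketch below models a 64-bit addition built from a 32-bit ADDS on the low words followed by an ADC on the high words. The function names are illustrative and only the carry flag is modeled.

```c
#include <stdint.h>

/* ADDS: 32-bit add that also produces the carry flag. */
uint32_t adds(uint32_t a, uint32_t b, unsigned *carry)
{
    uint32_t r = a + b;
    *carry = (r < a);             /* unsigned wrap-around means carry out */
    return r;
}

/* ADC: add including the incoming carry flag. */
uint32_t adc(uint32_t a, uint32_t b, unsigned carry)
{
    return a + b + carry;
}

/* 64-bit addition: ADDS on the low words, then ADC on the high words. */
uint64_t add64(uint64_t x, uint64_t y)
{
    unsigned c;
    uint32_t lo = adds((uint32_t)x, (uint32_t)y, &c);
    uint32_t hi = adc((uint32_t)(x >> 32), (uint32_t)(y >> 32), c);
    return ((uint64_t)hi << 32) | lo;
}
```

If the first add were done without updating the flags, the ADC would use whatever carry value was left by an earlier instruction, and the high word could be wrong.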
With the traditional Thumb instruction syntax (pre-UAL), when 16-bit Thumb
code is used, an ADD instruction can change the flags in the PSR. However, the
32-bit Thumb-2 instruction can either change the flags or leave them unchanged.
To distinguish between the two different operations, in Unified Assembly Language
(UAL) syntax, the S suffix should be used if the following operation depends on
the flags:
ADD  R0, R1, R2 ; Flag unchanged
ADDS R0, R1, R2 ; Flag change

Aside from ADD instructions, the arithmetic functions that the Cortex-M3 supports
include SUB (subtract), MUL (multiply), and UDIV/SDIV (unsigned and signed
divide). Table 5.22 shows some of the most commonly used arithmetic instructions.
These instructions can be used with or without the “S” suffix to specify whether
the APSR should be updated.
By default, if a divide by zero takes place, the result of the UDIV and SDIV instructions will be zero. You can set the DIV_0_TRP bit in the Configuration and Control Register (CCR) so that when a divide by zero occurs, a fault exception (usage
fault) takes place.
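The two divide-by-zero behaviors can be summarized with a small C model. This is a sketch only: the trap_enabled parameter stands in for the divide-by-zero trap configuration, and the fault output merely records that a usage fault would be raised.

```c
#include <stdbool.h>
#include <stdint.h>

/* UDIV Rd, Rn, Rm (model). With the trap disabled a divide by zero
 * yields 0. With the trap enabled *fault is set to represent the usage
 * fault, and the returned value is not meaningful. */
uint32_t udiv(uint32_t rn, uint32_t rm, bool trap_enabled, bool *fault)
{
    *fault = false;
    if (rm == 0u) {
        *fault = trap_enabled;
        return 0u;
    }
    return rn / rm;
}
```

Code that relies on the default behavior should therefore check the divisor itself if a zero result would be ambiguous.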
Both the Cortex-M3 and Cortex-M4 processors support 32-bit multiply instructions and multiply accumulate (MAC) instructions that give 32-bit and 64-bit results.

Table 5.22 Instructions for Arithmetic Data Operations

Commonly Used Arithmetic Instructions
(optional suffixes not shown)                     Operation
ADD  Rd, Rn, Rm     ; Rd = Rn + Rm                ADD operation
ADD  Rd, Rn, #immed ; Rd = Rn + #immed
ADC  Rd, Rn, Rm     ; Rd = Rn + Rm + carry        ADD with carry
ADC  Rd, #immed     ; Rd = Rd + #immed + carry
ADDW Rd, Rn, #immed ; Rd = Rn + #immed            ADD register with 12-bit immediate value
SUB  Rd, Rn, Rm     ; Rd = Rn - Rm                SUBTRACT
SUB  Rd, #immed     ; Rd = Rd - #immed
SUB  Rd, Rn, #immed ; Rd = Rn - #immed
SBC  Rd, Rn, #immed ; Rd = Rn - #immed - borrow   SUBTRACT with borrow (not carry)
SBC  Rd, Rn, Rm     ; Rd = Rn - Rm - borrow
SUBW Rd, Rn, #immed ; Rd = Rn - #immed            SUBTRACT register with 12-bit immediate value
RSB  Rd, Rn, #immed ; Rd = #immed - Rn            Reverse subtract
RSB  Rd, Rn, Rm     ; Rd = Rm - Rn
MUL  Rd, Rn, Rm     ; Rd = Rn * Rm                Multiply (32-bit result)
UDIV Rd, Rn, Rm     ; Rd = Rn / Rm                Unsigned and signed divide
SDIV Rd, Rn, Rm     ; Rd = Rn / Rm

Table 5.23 Instructions for Multiply and MAC (Multiply Accumulate)

Instruction (no "S" suffix because APSR is not updated)    Operation
MLA   Rd, Rn, Rm, Ra       ; Rd = Ra + Rn * Rm             32-bit MAC instruction, 32-bit result
MLS   Rd, Rn, Rm, Ra       ; Rd = Ra - Rn * Rm             32-bit multiply with subtract instruction,
                                                           32-bit result
SMULL RdLo, RdHi, Rn, Rm   ; {RdHi,RdLo} = Rn * Rm         32-bit multiply & MAC instructions for
SMLAL RdLo, RdHi, Rn, Rm   ; {RdHi,RdLo} += Rn * Rm        signed values, 64-bit result
UMULL RdLo, RdHi, Rn, Rm   ; {RdHi,RdLo} = Rn * Rm         32-bit multiply & MAC instructions for
UMLAL RdLo, RdHi, Rn, Rm   ; {RdHi,RdLo} += Rn * Rm        unsigned values, 64-bit result

These instructions support signed or unsigned values (Table 5.23). The APSR flags
are not affected by these instructions.
Additional MAC instructions are supported by the Cortex-M4 processor. These
are introduced later in section 5.7.3 Multiply and MAC instructions.
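The results produced by the 32-bit multiply and MAC instructions can be modeled in C as below. This is a behavioral sketch with illustrative function names. The 64-bit forms are shown with the register pair passed as two pointers, lower word first, matching the RdLo, RdHi operand order.

```c
#include <stdint.h>

/* MLA Rd, Rn, Rm, Ra : Rd = Ra + Rn * Rm (lower 32 bits of the result). */
uint32_t mla(uint32_t rn, uint32_t rm, uint32_t ra)
{
    return ra + rn * rm;
}

/* MLS Rd, Rn, Rm, Ra : Rd = Ra - Rn * Rm (lower 32 bits of the result). */
uint32_t mls(uint32_t rn, uint32_t rm, uint32_t ra)
{
    return ra - rn * rm;
}

/* UMLAL RdLo, RdHi, Rn, Rm : {RdHi,RdLo} += Rn * Rm (unsigned). */
void umlal(uint32_t *rdlo, uint32_t *rdhi, uint32_t rn, uint32_t rm)
{
    uint64_t acc = ((uint64_t)*rdhi << 32) | *rdlo;
    acc += (uint64_t)rn * rm;
    *rdlo = (uint32_t)acc;
    *rdhi = (uint32_t)(acc >> 32);
}

/* SMULL RdLo, RdHi, Rn, Rm : {RdHi,RdLo} = Rn * Rm (signed). */
void smull(uint32_t *rdlo, uint32_t *rdhi, int32_t rn, int32_t rm)
{
    int64_t p = (int64_t)rn * rm;
    *rdlo = (uint32_t)(uint64_t)p;
    *rdhi = (uint32_t)((uint64_t)p >> 32);
}
```

The signed and unsigned 64-bit forms differ only in how the operands are extended before the multiply. For example, -2 times 3 gives 0xFFFFFFFF in the upper word with SMULL.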

5.6.4 Logic operations
The Cortex®-M3 and Cortex-M4 processors support various instructions for logic
operations such as AND, OR, exclusive OR and so on. Like the arithmetic instructions, the 16-bit versions of these instructions update the flags in APSR. If the “S”
suffix is not specified, the assembler will convert them into 32-bit instructions.
The logic operation instructions are given in Table 5.24.
To use the 16-bit versions of these instructions, the operation must be between
two registers with the destination being one of the source registers. Also, the registers used must be low registers (R0-R7), and the S suffix should be used (APSR update). The ORN instruction is not available in 16-bit form.

5.6.5 Shift and rotate instructions
The Cortex®-M3 and Cortex-M4 processors support various shift and rotate instructions, as shown in Table 5.25, and illustrated in Figure 5.6.
If the S suffix is used, these rotate and shift instructions also update the Carry
flag in the APSR. If the shift or rotate operation shifts the register position by multiple bits, the value of the carry flag C will be the last bit that shifts out of the
register.
You might wonder why there are rotate right instructions but no instructions for
rotate left. Actually, a rotate left operation can be replaced by a rotate right operation

5.6 Instruction set

Table 5.24 Instructions for Logical Operations
Instruction (optional S suffix not shown)
AND Rd, Rn
AND Rd, Rn,#immed
AND Rd, Rn, Rm
ORR Rd, Rn
ORR Rd, Rn,#immed
ORR Rd, Rn, Rm
BIC Rd, Rn
BIC Rd, Rn,#immed
BIC Rd, Rn, Rm
ORN Rd, Rn,#immed
ORN Rd, Rn, Rm
EOR Rd, Rn
EOR Rd, Rn,#immed
EOR Rd, Rn, Rm

; Rd ¼ Rd & Rn
; Rd ¼ Rn & #immed
; Rd ¼ Rn & Rm
; Rd ¼ Rd j Rn
; Rd ¼ Rn j #immed
; Rd ¼ Rn j Rm
; Rd ¼ Rd & (wRn)
; Rd ¼ Rn &(w#immed)
; Rd ¼ Rn &(wRm)
; Rd ¼ Rn j (w#immed)
; Rd ¼ Rn j (wRm)
; Rd ¼ Rd ^ Rn
; Rd ¼ Rn j #immed
; Rd ¼ Rn j Rm

Operation
Bitwise AND

Bitwise OR

Bit clear

Bitwise OR NOT
Bitwise Exclusive OR

Table 5.25 Instructions for Shift and Rotate Operations
Instruction (optional "S" suffix not shown)      Operation
ASR Rd, Rn, #immed  ; Rd = Rn >> immed           Arithmetic shift right
ASR Rd, Rn          ; Rd = Rd >> Rn
ASR Rd, Rn, Rm      ; Rd = Rn >> Rm
LSL Rd, Rn, #immed  ; Rd = Rn << immed           Logical shift left
LSL Rd, Rn          ; Rd = Rd << Rn
LSL Rd, Rn, Rm      ; Rd = Rn << Rm
LSR Rd, Rn, #immed  ; Rd = Rn >> immed           Logical shift right
LSR Rd, Rn          ; Rd = Rd >> Rn
LSR Rd, Rn, Rm      ; Rd = Rn >> Rm
ROR Rd, Rn          ; Rd rot by Rn               Rotate right
ROR Rd, Rn, Rm      ; Rd = Rn rot by Rm
RRX Rd, Rn          ; {C, Rd} = {Rn, C}          Rotate right extended

with a different rotate amount. For example, a rotate left by 4 bits can be written as a
rotate right by 28 bits. This gives the same result in the destination register (note
that the C flag will differ from that of a true rotate left) and takes the same amount of time to
execute.
To use the 16-bit version of these instructions, the registers used must be low registers (R0-R7), and the S suffix should be used (APSR update). The RRX instruction
is not available in 16-bit form.
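The carry behavior and the rotate-left substitution described above can be sketched in C as follows. The type and function names (shift_result, lsls, lsrs, rors, rol_via_ror) are hypothetical, the shift amount is restricted to 1..31 for simplicity, and ASR and RRX are left out.

```c
#include <assert.h>
#include <stdint.h>

/* Result of a flag-setting shift: the new register value and the carry-out. */
typedef struct { uint32_t value; unsigned carry; } shift_result;

/* LSLS by n (1..31): carry is the last bit shifted out of bit 31. */
static shift_result lsls(uint32_t x, unsigned n) {
    assert(n >= 1 && n <= 31);
    shift_result r = { x << n, (x >> (32 - n)) & 1u };
    return r;
}

/* LSRS by n (1..31): carry is the last bit shifted out of bit 0. */
static shift_result lsrs(uint32_t x, unsigned n) {
    assert(n >= 1 && n <= 31);
    shift_result r = { x >> n, (x >> (n - 1)) & 1u };
    return r;
}

/* RORS by n (1..31): bits leaving bit 0 re-enter at bit 31,
 * so the carry equals bit 31 of the result. */
static shift_result rors(uint32_t x, unsigned n) {
    assert(n >= 1 && n <= 31);
    uint32_t v = (x >> n) | (x << (32 - n));
    shift_result r = { v, v >> 31 };
    return r;
}

/* Rotate left by n (1..31), written as a rotate right by 32 - n. */
static uint32_t rol_via_ror(uint32_t x, unsigned n) {
    return rors(x, 32 - n).value;
}
```

With these definitions, rol_via_ror(0x12345678, 4) performs a rotate right by 28 bits and returns 0x23456781.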

FIGURE 5.6
Shift and Rotate operations: Logical Shift Left (LSL), Logical Shift Right (LSR), Rotate Right (ROR), Arithmetic Shift Right (ASR), and Rotate Right eXtended (RRX)

5.6.6 Data conversion operations (extend and reverse ordering)
In the Cortex®-M3 and Cortex-M4 processors, a number of instructions are available
for handling signed and unsigned extensions of data; for example, to convert an 8-bit
value to 32-bit, or from 16-bit to 32-bit. The signed and unsigned instructions are
available in both 16-bit and 32-bit forms (Table 5.26). The 16-bit form of the instructions can only access low registers (R0 to R7).
Table 5.26 Signed and Unsigned Extension
Instruction                                      Operation
SXTB Rd, Rm  ; Rd = signed_extend(Rm[7:0])       Signed extend byte data into word
SXTH Rd, Rm  ; Rd = signed_extend(Rm[15:0])      Signed extend half-word data into word
UXTB Rd, Rm  ; Rd = unsigned_extend(Rm[7:0])     Unsigned extend byte data into word
UXTH Rd, Rm  ; Rd = unsigned_extend(Rm[15:0])    Unsigned extend half-word data into word

Table 5.27 Signed and Unsigned Extension with Optional Rotate
Instruction                                      Operation
SXTB Rd, Rm {, ROR #n}  ; n = 8 / 16 / 24        Signed extend byte data into word
SXTH Rd, Rm {, ROR #n}  ; n = 8 / 16 / 24        Signed extend half-word data into word
UXTB Rd, Rm {, ROR #n}  ; n = 8 / 16 / 24        Unsigned extend byte data into word
UXTH Rd, Rm {, ROR #n}  ; n = 8 / 16 / 24        Unsigned extend half-word data into word

The 32-bit form of these instructions can access high registers, and can optionally
rotate the input data before the extension operation, as shown in Table 5.27.
For SXTB/SXTH, the data are sign extended using bit[7]/bit[15] of the (optionally rotated) source value in Rm. With
UXTB and UXTH, the value is zero extended to 32 bits.
For example, if R0 is 0x55AA8765:
SXTB R1, R0 ; R1 = 0x00000065
SXTH R1, R0 ; R1 = 0xFFFF8765
UXTB R1, R0 ; R1 = 0x00000065
UXTH R1, R0 ; R1 = 0x00008765
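The four results above, and the optional rotate from Table 5.27, can be reproduced with a small C model. This is a sketch only; the function names mirror the mnemonics but are otherwise made up for this example, and a rotate argument of 0 stands for the form without ROR.

```c
#include <assert.h>
#include <stdint.h>

/* Rotate right by 0, 8, 16 or 24 bits (0 models "no rotate"). */
static uint32_t ror_bytes(uint32_t x, unsigned rot) {
    assert(rot == 0 || rot == 8 || rot == 16 || rot == 24);
    return rot ? (x >> rot) | (x << (32 - rot)) : x;
}

/* SXTB: take the low byte and replicate bit 7 into bits 31:8. */
static uint32_t sxtb(uint32_t rm, unsigned rot) {
    uint32_t b = ror_bytes(rm, rot) & 0xFFu;
    return (b ^ 0x80u) - 0x80u;
}

/* SXTH: take the low half-word and replicate bit 15 into bits 31:16. */
static uint32_t sxth(uint32_t rm, unsigned rot) {
    uint32_t h = ror_bytes(rm, rot) & 0xFFFFu;
    return (h ^ 0x8000u) - 0x8000u;
}

/* UXTB / UXTH: take the low byte / half-word and zero the upper bits. */
static uint32_t uxtb(uint32_t rm, unsigned rot) { return ror_bytes(rm, rot) & 0xFFu; }
static uint32_t uxth(uint32_t rm, unsigned rot) { return ror_bytes(rm, rot) & 0xFFFFu; }
```

The expression (b ^ 0x80) - 0x80 is a common portable way to sign extend in unsigned arithmetic. Using the rotate, sxtb(0x55AA8765, 8) selects byte 0x87 and returns 0xFFFFFF87.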

These instructions are useful for converting between different data types. In some cases the
signed or unsigned extension can take place on the fly
when loading data from memory (e.g., LDRB for unsigned data and LDRSB for
signed data).
Another group of data conversion operations is used for reversing data bytes in a
register, listed in Table 5.28 and illustrated in Figure 5.7. These instructions are usually used for converting data between little endian and big endian.
The 16-bit form of these instructions can only access low registers (R0 to R7).
REV reverses the byte order in a data word, and REV16 reverses the byte order
inside each half-word. For example, if R0 is 0x12345678, executing the following:
REV   R1, R0
REV16 R2, R0

gives R1 = 0x78563412 and R2 = 0x34127856.

Table 5.28 Instructions for Reversing Data
Instruction                                      Operation
REV   Rd, Rn ; Rd = rev(Rn)                      Reverse bytes in word
REV16 Rd, Rn ; Rd = rev16(Rn)                    Reverse bytes in each half-word
REVSH Rd, Rn ; Rd = revsh(Rn)                    Reverse bytes in bottom half-word and sign extend the result

FIGURE 5.7
Reverse operations

REVSH is similar to REV16 except that it only processes the lower half-word and
then sign extends the result. For example, if R0 is 0x33448899, after executing:
REVSH R1, R0

R1 will be 0xFFFF9988.
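The three reverse operations in Table 5.28 can be expressed in portable C as below. This is a minimal sketch with hypothetical function names; on a real Cortex-M target a compiler intrinsic would normally be used instead.

```c
#include <stdint.h>

/* REV: reverse the four bytes of a word. */
static uint32_t rev(uint32_t x) {
    return (x >> 24) | ((x >> 8) & 0x0000FF00u) |
           ((x << 8) & 0x00FF0000u) | (x << 24);
}

/* REV16: swap the two bytes inside each half-word. */
static uint32_t rev16(uint32_t x) {
    return ((x >> 8) & 0x00FF00FFu) | ((x << 8) & 0xFF00FF00u);
}

/* REVSH: swap the bytes of the bottom half-word, then sign extend bit 15. */
static uint32_t revsh(uint32_t x) {
    uint32_t h = ((x >> 8) & 0xFFu) | ((x & 0xFFu) << 8);
    return (h ^ 0x8000u) - 0x8000u;
}
```

These reproduce the worked values in the text: rev(0x12345678) is 0x78563412, rev16(0x12345678) is 0x34127856, and revsh(0x33448899) is 0xFFFF9988.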

5.6.7 Bit-field processing instructions
To make the Cortex®-M3 and Cortex-M4 processors well suited to control applications, they support a number of bit-field processing operations,
as listed in Table 5.29.
Table 5.29 Instructions for Bit-Field Processing
Instruction                        Operation
BFC  Rd, #<lsb>, #<width>          Clear bit field within a register
BFI  Rd, Rn, #<lsb>, #<width>      Insert bit field to a register
CLZ  Rd, Rm                        Count leading zero
RBIT Rd, Rn                        Reverse bit order in register
SBFX Rd, Rn, #<lsb>, #<width>      Copy bit field from source and sign extend it
UBFX Rd, Rn, #<lsb>, #<width>      Copy bit field from source register

BFC (Bit Field Clear) clears 1 to 32 adjacent bits in any position of a register. The
syntax of the instruction is:
BFC <Rd>, <#lsb>, <#width>

For example:
LDR R0,=0x1234FFFF
BFC R0, #4, #8

This will give R0 = 0x1234F00F.
BFI (Bit Field Insert) copies 1 to 32 bits (#width) from one register to any location (#lsb) in another register. The syntax is:
BFI <Rd>, <Rn>, <#lsb>, <#width>

For example:
LDR R0,=0x12345678
LDR R1,=0x3355AACC
BFI R1, R0, #8, #16 ; Insert R0[15:0] to R1[23:8]

This will give R1 = 0x335678CC.
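The two examples above can be checked with a mask-based C model of BFC and BFI. This is a sketch under the assumption that the field must fit inside the 32-bit register; the names field_mask, bfc and bfi are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Mask with `width` ones starting at bit `lsb`. */
static uint32_t field_mask(unsigned lsb, unsigned width) {
    assert(width >= 1 && lsb + width <= 32);
    uint32_t ones = (width == 32) ? 0xFFFFFFFFu
                                  : (((uint32_t)1 << width) - 1u);
    return ones << lsb;
}

/* BFC: clear the selected field, leave every other bit of Rd unchanged. */
static uint32_t bfc(uint32_t rd, unsigned lsb, unsigned width) {
    return rd & ~field_mask(lsb, width);
}

/* BFI: copy the low `width` bits of Rn into Rd at position `lsb`. */
static uint32_t bfi(uint32_t rd, uint32_t rn, unsigned lsb, unsigned width) {
    uint32_t m = field_mask(lsb, width);
    return (rd & ~m) | ((rn << lsb) & m);
}
```

Here bfc(0x1234FFFF, 4, 8) returns 0x1234F00F and bfi(0x3355AACC, 0x12345678, 8, 16) returns 0x335678CC, matching the assembly examples.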
The CLZ instruction counts the number of leading zeros. If no bits are set the
result is 32, and if all bits are set the result is 0. It is commonly used to determine
the number of bit shifts required to normalize a value so that the leading one is
shifted to bit 31. It is often used in floating point calculations.
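The normalization use case can be sketched in C as follows. The loop-based clz32 is a simple software model, not how the instruction is implemented in hardware, and both function names are hypothetical.

```c
#include <stdint.h>

/* Software model of CLZ: number of zero bits above the most significant 1.
 * Returns 32 when no bits are set, 0 when bit 31 is set. */
static unsigned clz32(uint32_t x) {
    unsigned n = 0;
    if (x == 0) {
        return 32;
    }
    while ((x & 0x80000000u) == 0) {
        x <<= 1;
        n++;
    }
    return n;
}

/* Shift x left so that its leading one lands in bit 31.
 * A zero input has no leading one and is returned unchanged. */
static uint32_t normalize(uint32_t x) {
    return x ? (x << clz32(x)) : 0;
}
```

For example, clz32(0x00012345) is 15, so normalize(0x00012345) shifts left by 15 bits and returns 0x91A28000.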
The RBIT instruction reverses the bit order in a data word. The syntax is:
RBIT <Rd>, <Rn>

This instruction is very useful for processing serial bit streams in data communications. For example, if R1 is 0xB4E10C23 (binary value 1011_0100_1110_0001_0000_1100_0010_0011), after executing:
RBIT R0, R1

R0 will be 0xC430872D (binary value 1100_0100_0011_0000_1000_0111_0010_1101).
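A portable C equivalent of RBIT can be written with the usual swap-adjacent-groups technique, shown here as a sketch with a hypothetical function name. Each step swaps neighboring groups of 1, 2, 4, 8 and finally 16 bits.

```c
#include <stdint.h>

/* Reverse the bit order of a 32-bit word. */
static uint32_t rbit(uint32_t x) {
    x = ((x >> 1) & 0x55555555u) | ((x & 0x55555555u) << 1);  /* swap single bits */
    x = ((x >> 2) & 0x33333333u) | ((x & 0x33333333u) << 2);  /* swap bit pairs   */
    x = ((x >> 4) & 0x0F0F0F0Fu) | ((x & 0x0F0F0F0Fu) << 4);  /* swap nibbles     */
    x = ((x >> 8) & 0x00FF00FFu) | ((x & 0x00FF00FFu) << 8);  /* swap bytes       */
    return (x >> 16) | (x << 16);                              /* swap half-words  */
}
```

With this model, rbit(0xB4E10C23) returns 0xC430872D, the same value as in the example above.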
UBFX and SBFX are the Unsigned and Signed Bit Field Extract instructions.
The syntax of these instructions is:
UBFX <Rd>, <Rn>, <#lsb>, <#width>
SBFX <Rd>, <Rn>, <#lsb>, <#width>

UBFX extracts a bit field from a register starting from any location (specified by
the <#lsb> operand) with any width (specified by the <#width> operand), zero extends it, and puts it in the destination register. For example:
LDR  R0,=0x5678ABCD
UBFX R1, R0, #4, #8

This will give R1 = 0x000000BC (zero extend of 0xBC).
Similarly, SBFX extracts a bit field but it sign extends it before putting it in a
destination register. For example:
LDR  R0,=0x5678ABCD
SBFX R1, R0, #4, #8

This will give R1 = 0xFFFFFFBC (signed extend of 0xBC).
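Both extract operations can be modeled in C with a shift and a mask, as in the sketch below. The function names are hypothetical, and the model assumes the field fits inside the 32-bit register.

```c
#include <assert.h>
#include <stdint.h>

/* UBFX: extract `width` bits starting at bit `lsb`, zero extended. */
static uint32_t ubfx(uint32_t rn, unsigned lsb, unsigned width) {
    assert(width >= 1 && lsb + width <= 32);
    uint32_t ones = (width == 32) ? 0xFFFFFFFFu
                                  : (((uint32_t)1 << width) - 1u);
    return (rn >> lsb) & ones;
}

/* SBFX: same field, but the top bit of the field is replicated upward. */
static uint32_t sbfx(uint32_t rn, unsigned lsb, unsigned width) {
    uint32_t field = ubfx(rn, lsb, width);
    uint32_t sign = (uint32_t)1 << (width - 1);
    return (field ^ sign) - sign;
}
```

For R0 = 0x5678ABCD, ubfx(R0, 4, 8) returns 0x000000BC and sbfx(R0, 4, 8) returns 0xFFFFFFBC, because bit 7 of the extracted field 0xBC is set.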

5.6.8 Compare and test
The compare and test instructions are used to update the flags in the APSR, which
may then be used by a conditional branch or conditional execution (this will be
covered in the next section). Table 5.30 lists these instructions.
Note that these instructions do not have the "S" suffix because the APSR is
always updated.
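To make the flag update concrete, the following C sketch models the flags produced by a register-to-register CMP, plus the Z flag produced by TST. The struct and function names are hypothetical. The sketch relies on the ARM convention that, for a subtraction, C = 1 means no borrow occurred; the barrel shifter and the immediate forms are not modeled.

```c
#include <stdint.h>

/* The four condition flags held in the APSR. */
typedef struct { unsigned n, z, c, v; } flags;

/* CMP Rn, Rm: flags are set from Rn - Rm; the difference is discarded. */
static flags cmp_flags(uint32_t rn, uint32_t rm) {
    uint32_t diff = rn - rm;
    flags f;
    f.n = diff >> 31;                        /* result negative (bit 31 set)     */
    f.z = (diff == 0);                       /* operands equal                   */
    f.c = (rn >= rm);                        /* no borrow (unsigned Rn >= Rm)    */
    f.v = ((rn ^ rm) & (rn ^ diff)) >> 31;   /* signed overflow in Rn - Rm       */
    return f;
}

/* TST Rn, Rm: Z is set when Rn & Rm has no bits in common. */
static unsigned tst_z(uint32_t rn, uint32_t rm) {
    return (rn & rm) == 0;
}
```

For example, cmp_flags(3, 5) gives N = 1, Z = 0, C = 0, V = 0, which an unsigned "lower" or signed "less than" condition would then test.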

Table 5.30 Instructions for Compare and Test
Instruction           Operation
CMP <Rn>, <Rm>        Compare: Calculate Rn - Rm. APSR is updated but the result is not stored.
CMP <Rn>, #<immed>    Compare: Calculate Rn - immediate data. APSR is updated but the result is not stored.
CMN <Rn>, <Rm>        Compare negative: Calculate Rn + Rm. APSR is updated but the result is not stored.
CMN <Rn>, #<immed>    Compare negative: Calculate Rn + immediate data. APSR is updated but the result is not stored.
TST <Rn>, <Rm>        Test (bitwise AND): Calculate AND result between Rn and Rm. N bit and Z bit in APSR are updated but the AND result is not stored. C bit can be updated if barrel shifter is used.
TST <Rn>, #<immed>    Test (bitwise AND): Calculate AND result between Rn and immediate data. N bit and Z bit in APSR are updated but the AND result is not stored.
TEQ <Rn>, <Rm>        Test (bitwise XOR): Calculate XOR result between Rn and Rm. N bit and Z bit in APSR are updated but the XOR result is not stored. C bit can be updated if barrel shifter is used.
TEQ <Rn>, #<immed>    Test (bitwise XOR): Calculate XOR result between Rn and immediate data. N bit and Z bit in APSR are updated but the XOR result is not stored.

5.6.9 Program flow control
There are several types of instructions for program flow control:
• Branch
• Function call
• Conditional branch
• Combined compare and conditional branch
• Conditional execution (IF-THEN instruction)
• Table branch

Branches
A number of instructions can cause branch operations:
• Branch instructions (e.g., B, BX)
• A data processing instruction that updates R15 (PC) (e.g., MOV, ADD)
• A memory read instruction that writes to PC (e.g., LDR, LDM, POP)

In general, although it is possible to use any of the above operations to create
branches, it is more common to use B (Branch), BX (Branch with Exchange), and
POP instructions (commonly used for function return). The other
methods were sometimes used to implement table branches on older ARM® processors; they are not
required on the Cortex®-M3/M4 because these processors have special instructions for table
branches.
In this section we will focus on just the branch instructions. The most basic
branch instructions are given in Table 5.31.

Function calls
To call a function, the Branch and Link (BL) instruction or Branch and Link with
eXchange (BLX) instructions can be used (Table 5.32). They execute the branch
and at the same time save the return address to the Link Register (LR), so that the processor can branch back to the original program after the function call is completed.

Table 5.31 Unconditional Branch Instructions
Instruction

Operation

B 
