NEON Programmer’s Guide DEN0018A
Page Count: 411
- NEON Programmer’s Guide
- Contents
- Preface
- 1: Introduction
- 2: Compiling NEON Instructions
- 2.1 Vectorization
- 2.1.1 Enabling auto-vectorization in ARM Compiler toolchain
- 2.1.2 Enabling auto-vectorization in GCC compiler
- 2.1.3 C pointer aliasing
- 2.1.4 Natural types
- 2.1.5 Array grouping
- 2.1.6 Inside knowledge
- 2.1.7 Enabling the NEON unit in bare-metal applications
- 2.1.8 Enabling the NEON unit in a Linux stock kernel
- 2.1.9 Enabling the NEON unit in a Linux custom kernel
- 2.1.10 Optimizing for vectorization
- 2.2 Generating NEON code using the vectorizing compiler
- 2.3 Vectorizing examples
- 2.4 NEON assembler and ABI restrictions
- 2.5 NEON libraries
- 2.6 Intrinsics
- 2.7 Detecting presence of a NEON unit
- 2.8 Writing code to imply SIMD
- 2.9 GCC command line options
- 3: NEON Instruction Set Architecture
- 4: NEON Intrinsics
- 4.1 Introduction
- 4.2 Vector data types for NEON intrinsics
- 4.3 Prototype of NEON Intrinsics
- 4.4 Using NEON intrinsics
- 4.5 Variables and constants in NEON code
- 4.6 Accessing vector types from C
- 4.7 Loading data from memory into vectors
- 4.8 Constructing a vector from a literal bit pattern
- 4.9 Constructing multiple vectors from interleaved memory
- 4.10 Loading a single lane of a vector from memory
- 4.11 Programming using NEON intrinsics
- 4.12 Instructions without an equivalent intrinsic
- 5: Optimizing NEON Code
- 5.1 Optimizing NEON assembler code
- 5.2 Scheduling
- 5.2.1 NEON instruction scheduling
- 5.2.2 Mixed ARM and NEON instruction sequences
- 5.2.3 Passing data between ARM general-purpose registers and NEON registers
- 5.2.4 Dual issue for NEON instructions
- 5.2.5 Example of how to read NEON instruction tables
- 5.2.6 Optimizations by variable spreading
- 5.2.7 Optimizations when using lengthening instructions
- 6: NEON Code Examples with Intrinsics
- 7: NEON Code Examples with Mixed Operations
- 8: NEON Code Examples with Optimization
- A: NEON Microarchitecture
- B: Operating System Support
- C: NEON and VFP Instruction Summary
- C.1 List of all NEON and VFP instructions
- C.2 List of doubling instructions
- C.3 List of halving instructions
- C.4 List of widening or long instructions
- C.5 List of narrowing instructions
- C.6 List of rounding instructions
- C.7 List of saturating instructions
- C.8 NEON general data processing instructions
- C.9 NEON shift instructions
- C.10 NEON logical and compare operations
- C.11 NEON arithmetic instructions
- C.11.1 VABA{L}
- C.11.2 VABD{L}
- C.11.3 V{Q}ABS
- C.11.4 V{Q}ADD, VADDL, VADDW
- C.11.5 V{R}ADDHN
- C.11.6 VCLS
- C.11.7 VCLZ
- C.11.8 VCNT
- C.11.9 V{R}HADD
- C.11.10 VHSUB
- C.11.11 VMAX and VMIN
- C.11.12 V{Q}NEG
- C.11.13 VPADD{L}, VPADAL
- C.11.14 VPMAX and VPMIN
- C.11.15 VRECPE
- C.11.16 VRECPS
- C.11.17 VRSQRTE
- C.11.18 VRSQRTS
- C.11.19 V{Q}SUB, VSUBL and VSUBW
- C.11.20 V{R}SUBHN
- C.12 NEON multiply instructions
- C.13 NEON load and store instructions
- C.13.1 Interleaving
- C.13.2 Alignment restrictions in load and store, element and structure instructions
- C.13.3 VLDn and VSTn (single n-element structure to one lane)
- C.13.4 VLDn (single n-element structure to all lanes)
- C.13.5 VLDn and VSTn (multiple n-element structures)
- C.13.6 VLDR and VSTR
- C.13.7 VLDM, VSTM, VPOP, and VPUSH
- C.13.8 VMOV (between two ARM registers and a NEON register)
- C.13.9 VMOV (between an ARM register and a NEON scalar)
- C.13.10 VMRS and VMSR (between an ARM register and a NEON or VFP system register)
- C.14 VFP instructions
- C.14.1 VABS
- C.14.2 VADD
- C.14.3 VCMP (Floating-point compare)
- C.14.4 VCVT (between single-precision and double-precision)
- C.14.5 VCVT (between floating-point and integer)
- C.14.6 VCVT (between floating-point and fixed-point)
- C.14.7 VCVTB, VCVTT (half-precision extension)
- C.14.8 VDIV
- C.14.9 VFMA, VFMS, VFNMA, VFNMS (Fused floating-point multiply accumulate and fused floating-point multiply subtract with optional negation)
- C.14.10 VMOV
- C.14.11 VMOV
- C.14.12 VMUL, VMLA, VMLS, VNMUL, VNMLA, and VNMLS
- C.14.13 VNEG
- C.14.14 VSQRT
- C.14.15 VSUB
- C.15 NEON and VFP pseudo-instructions
- D: NEON Intrinsics Reference
- D.1 NEON intrinsics description
- D.2 Intrinsics type conversion
- D.3 Arithmetic
- D.4 Multiply
- D.4.1 VMUL
- D.4.2 VMLA
- D.4.3 VMLAL
- D.4.4 VMLS
- D.4.5 VMLSL
- D.4.6 VQDMULH
- D.4.7 VQRDMULH
- D.4.8 VQDMLAL
- D.4.9 VQDMLSL
- D.4.10 VMULL
- D.4.11 VQDMULL
- D.4.12 VMLA_LANE
- D.4.13 VMLAL_LANE
- D.4.14 VQDMLAL_LANE
- D.4.15 VMLS_LANE
- D.4.16 VMLSL_LANE
- D.4.17 VQDMLSL_LANE
- D.4.18 VMUL_N
- D.4.19 VMULL_N
- D.4.20 VMULL_LANE
- D.4.21 VQDMULL_N
- D.4.22 VQDMULL_LANE
- D.4.23 VQDMULH_N
- D.4.24 VQDMULH_LANE
- D.4.25 VQRDMULH_N
- D.4.26 VQRDMULH_LANE
- D.4.27 VMLA_LANE
- D.4.28 VMLAL_N
- D.4.29 VQDMLAL_N
- D.4.30 VMLSL_N
- D.4.31 VQDMLSL_N
- D.5 Data processing
- D.5.1 VPADD
- D.5.2 VPADDL
- D.5.3 VPADAL
- D.5.4 VPMAX
- D.5.5 VPMIN
- D.5.6 VABD
- D.5.7 VABDL
- D.5.8 VABA
- D.5.9 VABAL
- D.5.10 VMAX
- D.5.11 VMIN
- D.5.12 VABS
- D.5.13 VQABS
- D.5.14 VNEG
- D.5.15 VQNEG
- D.5.16 VCLS
- D.5.17 VCLZ
- D.5.18 VCNT
- D.5.19 VRECPE
- D.5.20 VRECPS
- D.5.21 VRSQRTE
- D.5.22 VRSQRTS
- D.5.23 VMOVN
- D.5.24 VMOVL
- D.5.25 VQMOVN
- D.5.26 VQMOVUN
- D.6 Logical and compare
- D.7 Shift
- D.8 Floating-point
- D.9 Load and store
- D.10 Permutation
- D.11 Miscellaneous