Intel® 64 and IA-32 Architectures Optimization Reference Manual, Intel, April 2019 (248966-041)
Page Count: 825
- Chapter 1 Introduction
- Chapter 2 Intel® 64 and IA-32 Processor Architectures
- 2.1 The Skylake Server Microarchitecture
- 2.2 The Skylake Microarchitecture
- 2.3 The Haswell Microarchitecture
- 2.4 Intel® Microarchitecture Code Name Sandy Bridge
- 2.5 Intel® Core™ Microarchitecture and Enhanced Intel® Core™ Microarchitecture
- 2.6 Intel® Microarchitecture Code Name Nehalem
- 2.6.1 Microarchitecture Pipeline
- 2.6.2 Front End Overview
- 2.6.3 Execution Engine
- 2.6.4 Cache and Memory Subsystem
- 2.6.5 Load and Store Operation Enhancements
- 2.6.6 REP String Enhancement
- 2.6.7 Enhancements for System Software
- 2.6.8 Efficiency Enhancements for Power Consumption
- 2.6.9 Hyper-Threading Technology Support in Intel® Microarchitecture Code Name Nehalem
- 2.7 Intel® Hyper-Threading Technology
- 2.8 Intel® 64 Architecture
- 2.9 SIMD Technology
- 2.10 Summary of SIMD Technologies and Application Level Extensions
- 2.10.1 MMX™ Technology
- 2.10.2 Streaming SIMD Extensions
- 2.10.3 Streaming SIMD Extensions 2
- 2.10.4 Streaming SIMD Extensions 3
- 2.10.5 Supplemental Streaming SIMD Extensions 3
- 2.10.6 SSE4.1
- 2.10.7 SSE4.2
- 2.10.8 AESNI and PCLMULQDQ
- 2.10.9 Intel® Advanced Vector Extensions
- 2.10.10 Half-Precision Floating-Point Conversion (F16C)
- 2.10.11 RDRAND
- 2.10.12 Fused Multiply-Add (FMA) Extensions
- 2.10.13 Intel AVX2
- 2.10.14 General-Purpose Bit-Processing Instructions
- 2.10.15 Intel® Transactional Synchronization Extensions
- 2.10.16 RDSEED
- 2.10.17 ADCX and ADOX Instructions
- Chapter 3 General Optimization Guidelines
- 3.1 Performance Tools
- 3.2 Processor Perspectives
- 3.3 Coding Rules, Suggestions and Tuning Hints
- 3.4 Optimizing the Front End
- 3.4.1 Branch Prediction Optimization
- 3.4.2 Fetch and Decode Optimization
- 3.4.2.1 Optimizing for Micro-fusion
- 3.4.2.2 Optimizing for Macro-fusion
- 3.4.2.3 Length-Changing Prefixes (LCP)
- 3.4.2.4 Optimizing the Loop Stream Detector (LSD)
- 3.4.2.5 Exploit LSD Micro-op Emission Bandwidth in Intel® Microarchitecture Code Name Sandy Bridge
- 3.4.2.6 Optimization for Decoded ICache
- 3.4.2.7 Other Decoding Guidelines
- 3.5 Optimizing the Execution Core
- 3.5.1 Instruction Selection
- 3.5.1.1 Integer Divide
- 3.5.1.2 Using LEA
- 3.5.1.3 ADC and SBB in Intel® Microarchitecture Code Name Sandy Bridge
- 3.5.1.4 Bitwise Rotation
- 3.5.1.5 Variable Bit Count Rotation and Shift
- 3.5.1.6 Address Calculations
- 3.5.1.7 Clearing Registers and Dependency Breaking Idioms
- 3.5.1.8 Compares
- 3.5.1.9 Using NOPs
- 3.5.1.10 Mixing SIMD Data Types
- 3.5.1.11 Spill Scheduling
- 3.5.1.12 Zero-Latency MOV Instructions
- 3.5.2 Avoiding Stalls in Execution Core
- 3.5.3 Vectorization
- 3.5.4 Optimization of Partially Vectorizable Code
- 3.6 Optimizing Memory Accesses
- 3.6.1 Load and Store Execution Bandwidth
- 3.6.2 Minimize Register Spills
- 3.6.3 Enhance Speculative Execution and Memory Disambiguation
- 3.6.4 Alignment
- 3.6.5 Store Forwarding
- 3.6.6 Data Layout Optimizations
- 3.6.7 Stack Alignment
- 3.6.8 Capacity Limits and Aliasing in Caches
- 3.6.9 Mixing Code and Data
- 3.6.10 Write Combining
- 3.6.11 Locality Enhancement
- 3.6.12 Minimizing Bus Latency
- 3.6.13 Non-Temporal Store Bus Traffic
- 3.7 Prefetching
- 3.8 Floating-point Considerations
- 3.9 Maximizing PCIe Performance
- Chapter 4 Coding for SIMD Architectures
- 4.1 Checking for Processor Support of SIMD Technologies
- 4.1.1 Checking for MMX Technology Support
- 4.1.2 Checking for Streaming SIMD Extensions Support
- 4.1.3 Checking for Streaming SIMD Extensions 2 Support
- 4.1.4 Checking for Streaming SIMD Extensions 3 Support
- 4.1.5 Checking for Supplemental Streaming SIMD Extensions 3 Support
- 4.1.6 Checking for SSE4.1 Support
- 4.1.7 Checking for SSE4.2 Support
- 4.1.8 Detection of PCLMULQDQ and AESNI Instructions
- 4.1.9 Detection of AVX Instructions
- 4.1.10 Detection of VEX-Encoded AES and VPCLMULQDQ
- 4.1.11 Detection of F16C Instructions
- 4.1.12 Detection of FMA
- 4.1.13 Detection of AVX2
- 4.2 Considerations for Code Conversion to SIMD Programming
- 4.3 Coding Techniques
- 4.4 Stack and Data Alignment
- 4.5 Improving Memory Utilization
- 4.6 Instruction Selection
- 4.7 Tuning the Final Application
- Chapter 5 Optimizing for SIMD Integer Applications
- 5.1 General Rules on SIMD Integer Code
- 5.2 Using SIMD Integer with x87 Floating-point
- 5.3 Data Alignment
- 5.4 Data Movement Coding Techniques
- 5.4.1 Unsigned Unpack
- 5.4.2 Signed Unpack
- 5.4.3 Interleaved Pack with Saturation
- 5.4.4 Interleaved Pack without Saturation
- 5.4.5 Non-Interleaved Unpack
- 5.4.6 Extract Data Element
- 5.4.7 Insert Data Element
- 5.4.8 Non-Unit Stride Data Movement
- 5.4.9 Move Byte Mask to Integer
- 5.4.10 Packed Shuffle Word for 64-bit Registers
- 5.4.11 Packed Shuffle Word for 128-bit Registers
- 5.4.12 Shuffle Bytes
- 5.4.13 Conditional Data Movement
- 5.4.14 Unpacking/Interleaving 64-bit Data in 128-bit Registers
- 5.4.15 Data Movement
- 5.4.16 Conversion Instructions
- 5.5 Generating Constants
- 5.6 Building Blocks
- 5.6.1 Absolute Difference of Unsigned Numbers
- 5.6.2 Absolute Difference of Signed Numbers
- 5.6.3 Absolute Value
- 5.6.4 Pixel Format Conversion
- 5.6.5 Endian Conversion
- 5.6.6 Clipping to an Arbitrary Range [High, Low]
- 5.6.7 Packed Max/Min of Byte, Word and Dword
- 5.6.8 Packed Multiply Integers
- 5.6.9 Packed Sum of Absolute Differences
- 5.6.10 MPSADBW and PHMINPOSUW
- 5.6.11 Packed Average (Byte/Word)
- 5.6.12 Complex Multiply by a Constant
- 5.6.13 Packed 64-bit Add/Subtract
- 5.6.14 128-bit Shifts
- 5.6.15 PTEST and Conditional Branch
- 5.6.16 Vectorization of Heterogeneous Computations across Loop Iterations
- 5.6.17 Vectorization of Control Flows in Nested Loops
- 5.7 Memory Optimizations
- 5.8 Converting from 64-bit to 128-bit SIMD Integers
- 5.9 Tuning Partially Vectorizable Code
- 5.10 Parallel Mode AES Encryption and Decryption
- 5.11 Light-Weight Decompression and Database Processing
- Chapter 6 Optimizing for SIMD Floating-point Applications
- 6.1 General Rules for SIMD Floating-point Code
- 6.2 Planning Considerations
- 6.3 Using SIMD Floating-point with x87 Floating-point
- 6.4 Scalar Floating-point Code
- 6.5 Data Alignment
- 6.6 SIMD Optimizations and Microarchitectures
- Chapter 7 INT8 Deep Learning Inference
- Chapter 8 Optimizing Cache Usage
- 8.1 General Prefetch Coding Guidelines
- 8.2 Prefetch and Cacheability Instructions
- 8.3 Prefetch
- 8.4 Cacheability Control
- 8.5 Memory Optimization Using Prefetch
- 8.5.1 Software-Controlled Prefetch
- 8.5.2 Hardware Prefetch
- 8.5.3 Example of Effective Latency Reduction with Hardware Prefetch
- 8.5.4 Example of Latency Hiding with S/W Prefetch Instruction
- 8.5.5 Software Prefetching Usage Checklist
- 8.5.6 Software Prefetch Scheduling Distance
- 8.5.7 Software Prefetch Concatenation
- 8.5.8 Minimize Number of Software Prefetches
- 8.5.9 Mix Software Prefetch with Computation Instructions
- 8.5.10 Software Prefetch and Cache Blocking Techniques
- 8.5.11 Hardware Prefetching and Cache Blocking Techniques
- 8.5.12 Single-pass versus Multi-pass Execution
- 8.6 Memory Optimization using Non-Temporal Stores
- 8.6.1 Non-temporal Stores and Software Write-Combining
- 8.6.2 Cache Management
- 8.6.2.1 Video Encoder
- 8.6.2.2 Video Decoder
- 8.6.2.3 Conclusions from Video Encoder and Decoder Implementation
- 8.6.2.4 Optimizing Memory Copy Routines
- 8.6.2.5 Using the 8-byte Streaming Stores and Software Prefetch
- 8.6.2.6 Using 16-byte Streaming Stores and Hardware Prefetch
- 8.6.2.7 Performance Comparisons of Memory Copy Routines
- 8.6.3 Deterministic Cache Parameters
- Chapter 9 Introducing Sub-NUMA Clustering
- Chapter 10 Multicore and Hyper-Threading Technology
- 10.1 Performance and Usage Models
- 10.2 Programming Models and Multithreading
- 10.3 Optimization Guidelines
- 10.4 Thread Synchronization
- 10.5 System Bus Optimization
- 10.6 Memory Optimization
- 10.7 Front End Optimization
- 10.8 Affinities and Managing Shared Platform Resources
- 10.9 Optimization of Other Shared Resources
- Chapter 11 Intel® Optane™ DC Persistent Memory
- Chapter 12 64-bit Mode Coding Guidelines
- Chapter 13 SSE4.2 and SIMD Programming for Text-Processing/Lexing/Parsing
- Chapter 14 Optimizations for Intel® AVX, FMA and AVX2
- 14.1 Intel® AVX Intrinsics Coding
- 14.2 Non-Destructive Source (NDS)
- 14.3 Mixing AVX Code with SSE Code
- 14.4 128-Bit Lane Operation and AVX
- 14.5 Data Gather and Scatter
- 14.6 Data Alignment for Intel® AVX
- 14.7 L1D Cache Line Replacements
- 14.8 4K Aliasing
- 14.9 Conditional SIMD Packed Loads and Stores
- 14.10 Mixing Integer and Floating-Point Code
- 14.11 Handling Port 5 Pressure
- 14.12 Divide and Square Root Operations
- 14.13 Optimization of Array Sub Sum Example
- 14.14 Half-Precision Floating-Point Conversions
- 14.15 Fused Multiply-Add (FMA) Instructions Guidelines
- 14.16 AVX2 Optimization Guidelines
- 14.16.1 Multi-Buffering and AVX2
- 14.16.2 Modular Multiplication and AVX2
- 14.16.3 Data Movement Considerations
- 14.16.3.1 SIMD Heuristics to Implement Memcpy()
- 14.16.3.2 Memcpy() Implementation Using Enhanced REP MOVSB
- 14.16.3.3 Memset() Implementation Considerations
- 14.16.3.4 Hoisting Memcpy/Memset Ahead of Consuming Code
- 14.16.3.5 256-bit Fetch versus Two 128-bit Fetches
- 14.16.3.6 Mixing MULX and AVX2 Instructions
- 14.16.4 Considerations for Gather Instructions
- 14.16.5 AVX2 Conversion Remedy to MMX Instruction Throughput Limitation
- Chapter 15 Intel® TSX Recommendations
- 15.1 Introduction
- 15.2 Application-Level Tuning and Optimizations
- 15.3 Developing an Intel TSX Enabled Synchronization Library
- 15.3.1 Adding HLE Prefixes
- 15.3.2 Elision Friendly Critical Section Locks
- 15.3.3 Using HLE or RTM for Lock Elision
- 15.3.4 An Example Wrapper for Lock Elision Using RTM
- 15.3.5 Guidelines for the RTM Fallback Handler
- 15.3.6 Implementing Elision-Friendly Locks using Intel TSX
- 15.3.7 Eliding Application-Specific Meta-Locks using Intel TSX
- 15.3.8 Avoiding Persistent Non-Elided Execution
- 15.3.9 Reading the Value of an Elided Lock in RTM-based libraries
- 15.3.10 Intermixing HLE and RTM
- 15.4 Using the Performance Monitoring Support for Intel TSX
- 15.4.1 Measuring Transactional Success
- 15.4.2 Finding Locks to Elide and Verifying All Locks Are Elided
- 15.4.3 Sampling Transactional Aborts
- 15.4.4 Classifying Aborts using a Profiling Tool
- 15.4.5 XABORT Arguments for RTM Fallback Handlers
- 15.4.6 Call Graphs for Transactional Aborts
- 15.4.7 Last Branch Records and Transactional Aborts
- 15.4.8 Profiling and Testing Intel TSX Software using the Intel® SDE
- 15.4.9 HLE Specific Performance Monitoring Events
- 15.4.10 Computing Useful Metrics for Intel TSX
- 15.5 Performance Guidelines
- 15.6 Debugging Guidelines
- 15.7 Common Intrinsics for Intel TSX
- Chapter 16 Power Optimization for Mobile Usages
- 16.1 Overview
- 16.2 Mobile Usage Scenarios
- 16.3 ACPI C-States
- 16.4 Guidelines for Extending Battery Life
- 16.5 Tuning Software for Intelligent Power Consumption
- 16.5.1 Reduction of Active Cycles
- 16.5.2 PAUSE and Sleep(0) Loop Optimization
- 16.5.3 Spin-Wait Loops
- 16.5.4 Using Event Driven Service Instead of Polling in Code
- 16.5.5 Reducing Interrupt Rate
- 16.5.6 Reducing Privileged Time
- 16.5.7 Setting Context Awareness in the Code
- 16.5.8 Saving Energy by Optimizing for Performance
- 16.6 Processor Specific Power Management Optimization for System Software
- Chapter 17 Skylake Server Microarchitecture and Software Optimization for Intel® AVX-512
- 17.1 Basic Intel® AVX-512 vs. Intel® AVX2 Coding
- 17.2 Masking
- 17.3 Forwarding and Unmasked Operations
- 17.4 Forwarding and Memory Masking
- 17.5 Data Compress
- 17.6 Data Expand
- 17.7 Ternary Logic
- 17.8 New Shuffle Instructions
- 17.9 Broadcast
- 17.10 Embedded Rounding
- 17.11 Scatter Instruction
- 17.12 Static Rounding Modes, Suppress-All-Exceptions (SAE)
- 17.13 QWORD Instruction Support
- 17.14 Vector Length Orthogonality
- 17.15 New Intel® AVX-512 Instructions for Transcendental Support
- 17.15.1 VRCP14, VRSQRT14 - Software Sequences for 1/x, x/y, sqrt(x)
- 17.15.2 VGETMANT VGETEXP - Vector Get Mantissa and Vector Get Exponent
- 17.15.3 VRNDSCALE - Vector Round Scale
- 17.15.4 VREDUCE - Vector Reduce
- 17.15.5 VSCALEF - Vector Scale
- 17.15.6 VFPCLASS - Vector Floating Point Class
- 17.15.7 VPERM, VPERMI2, VPERMT2 - Small Table Lookup Implementation
- 17.16 Conflict Detection
- 17.17 FMA Latency
- 17.18 Mixing Intel® AVX Extensions or Intel® AVX-512 Extensions with Intel® Streaming SIMD Extensions (Intel® SSE) Code
- 17.19 Mixing zmm Vector Code with xmm/ymm
- 17.20 Servers With a Single FMA Unit
- 17.21 Gather/Scatter to Shuffle (G2S/STS)
- 17.22 Data Alignment
- 17.23 Dynamic Memory Allocation and Memory Alignment
- 17.24 Division and Square Root Operations
- 17.24.1 Divide and Square Root Approximation Methods
- 17.24.2 Divide and Square Root Performance
- 17.24.3 Approximation Latencies
- 17.24.4 Code Snippets
- 17.24.4.1 Single Precision, Divide, 24 Bits (IEEE)
- 17.24.4.2 Single Precision, Divide, 23 Bits
- 17.24.4.3 Single Precision, Divide, 14 Bits
- 17.24.4.4 Single Precision, Reciprocal Square Root, 22 Bits
- 17.24.4.5 Single Precision, Reciprocal Square Root, 23 Bits
- 17.24.4.6 Single Precision, Reciprocal Square Root, 14 Bits
- 17.24.4.7 Single Precision, Square Root, 24 Bits (IEEE)
- 17.24.4.8 Single Precision, Square Root, 23 Bits
- 17.24.4.9 Single Precision, Square Root, 14 Bits
- 17.24.4.10 Double Precision, Divide, 53 Bits (IEEE)
- 17.24.4.11 Double Precision, Divide, 52 Bits
- 17.24.4.12 Double Precision, Divide, 26 Bits
- 17.24.4.13 Double Precision, Divide, 14 Bits
- 17.24.4.14 Double Precision, Reciprocal Square Root, 51 Bits
- 17.24.4.15 Double Precision, Reciprocal Square Root, 52 Bits
- 17.24.4.16 Double Precision, Reciprocal Square Root, 50 Bits
- 17.24.4.17 Double Precision, Reciprocal Square Root, 26 Bits
- 17.24.4.18 Double Precision, Reciprocal Square Root, 14 Bits
- 17.24.4.19 Double Precision, Square Root, 53 Bits (IEEE)
- 17.24.4.20 Double Precision, Square Root, 52 Bits
- 17.24.4.21 Double Precision, Square Root, 26 Bits
- 17.24.4.22 Double Precision, Square Root, 14 Bits
- 17.25 Tips on Compiler Usage
- 17.26 Skylake Server Power Management
- Chapter 18 Software Optimization For Goldmont Plus, Goldmont, and Silvermont Microarchitectures
- 18.1 Microarchitectures of Recent Intel Atom Processor Generations
- 18.2 Coding Recommendations for Goldmont Plus, Goldmont and Silvermont Microarchitectures
- 18.2.1 Optimizing The Front End
- 18.2.2 Optimizing The Execution Core
- 18.2.2.1 Scheduling
- 18.2.2.2 Address Generation
- 18.2.2.3 FP Multiply-Accumulate-Store Execution
- 18.2.2.4 Integer Multiply Execution
- 18.2.2.5 Zeroing Idioms
- 18.2.2.6 NOP Idioms
- 18.2.2.7 Move Elimination and ESP Folding
- 18.2.2.8 Stack Manipulation Instruction
- 18.2.2.9 Flags Usage
- 18.2.2.10 SIMD Floating-Point and x87 Instructions
- 18.2.2.11 SIMD Integer Instructions
- 18.2.2.12 Vectorization Considerations
- 18.2.2.13 Other SIMD Instructions
- 18.2.2.14 Instruction Selection
- 18.2.2.15 Integer Division
- 18.2.2.16 Integer Shift
- 18.2.2.17 Pause Instruction
- 18.2.3 Optimizing Memory Accesses
- 18.3 Instruction Latency and Throughput
- Chapter 19 Knights Landing Microarchitecture and Software Optimization
- 19.1 Knights Landing Microarchitecture
- 19.2 Intel® AVX-512 Coding Recommendations for Knights Landing Microarchitecture
- 19.2.1 Using Gather and Scatter Instructions
- 19.2.2 Using Enhanced Reciprocal Instructions
- 19.2.3 Using AVX-512CD Instructions
- 19.2.4 Using Intel® Hyper-Threading Technology
- 19.2.5 Front End Considerations
- 19.2.6 Integer Execution Considerations
- 19.2.7 Optimizing FP and Vector Execution
- 19.2.8 Memory Optimization
- Appendix A Application Performance Tools
- A.1 Compilers
- A.2 Performance Libraries
- A.3 Performance Profilers
- A.4 Thread and Memory Checkers
- A.5 Vectorization Assistant
- A.6 Cluster Tools
- A.7 Intel® Academic Community
- Appendix B Using Performance Monitoring Events
- B.1 Top-Down Analysis Method
- B.2 Performance Monitoring and Microarchitecture
- B.3 Intel® Xeon® processor 5500 Series
- B.4 Performance Analysis Techniques for Intel® Xeon® Processor 5500 Series
- B.4.1 Cycle Accounting and Uop Flow Analysis
- B.4.2 Stall Cycle Decomposition and Core Memory Accesses
- B.4.3 Core PMU Precise Events
- B.4.3.1 Precise Memory Access Events
- B.4.3.2 Load Latency Event
- B.4.3.3 Precise Execution Events
- B.4.3.4 Last Branch Record (LBR)
- B.4.3.5 Measuring Core Memory Access Latency
- B.4.3.6 Measuring Per-Core Bandwidth
- B.4.3.7 Miscellaneous L1 and L2 Events for Cache Misses
- B.4.3.8 TLB Misses
- B.4.3.9 L1 Data Cache
- B.4.4 Front End Monitoring Events
- B.4.5 Uncore Performance Monitoring Events
- B.4.6 Intel QuickPath Interconnect Home Logic (QHL)
- B.4.7 Measuring Bandwidth From the Uncore
- B.5 Performance Tuning Techniques for Intel® Microarchitecture Code Name Sandy Bridge
- B.6 Using Performance Events of Intel® Core™ Solo and Intel® Core™ Duo Processors
- B.7 Drill-Down Techniques for Performance Analysis
- B.8 Event Ratios for Intel Core Microarchitecture
- B.8.1 Clocks Per Instructions Retired Ratio (CPI)
- B.8.2 Front End Ratios
- B.8.3 Branch Prediction Ratios
- B.8.4 Execution Ratios
- B.8.5 Memory Sub-System - Access Conflicts Ratios
- B.8.6 Memory Sub-System - Cache Misses Ratios
- B.8.7 Memory Sub-System - Prefetching
- B.8.8 Memory Sub-System - TLB Miss Ratios
- B.8.9 Memory Sub-System - Core Interaction
- B.8.10 Memory Sub-System - Bus Characterization
- Appendix C Instruction Latency and Throughput
- Appendix D Intel® Atom™ Microarchitecture and Software Optimization