Intel® 64 and IA-32 Architectures Optimization Reference Manual, Intel, April 2019 (248966-041)
Page Count: 825
- Chapter 1 Introduction
- Chapter 2 Intel® 64 and IA-32 Processor Architectures
- 2.1 The Skylake Server Microarchitecture
- 2.2 The Skylake Microarchitecture
- 2.3 The Haswell Microarchitecture
- 2.4 Intel® Microarchitecture Code Name Sandy Bridge
- 2.5 Intel® Core™ Microarchitecture and Enhanced Intel® Core™ Microarchitecture
- 2.6 Intel® Microarchitecture Code Name Nehalem
- 2.6.1 Microarchitecture Pipeline
- 2.6.2 Front End Overview
- 2.6.3 Execution Engine
- 2.6.4 Cache and Memory Subsystem
- 2.6.5 Load and Store Operation Enhancements
- 2.6.6 REP String Enhancement
- 2.6.7 Enhancements for System Software
- 2.6.8 Efficiency Enhancements for Power Consumption
- 2.6.9 Hyper-Threading Technology Support in Intel® Microarchitecture Code Name Nehalem
- 2.7 Intel® Hyper-Threading Technology
- 2.8 Intel® 64 Architecture
- 2.9 SIMD Technology
- 2.10 Summary of SIMD Technologies and Application Level Extensions
- 2.10.1 MMX™ Technology
- 2.10.2 Streaming SIMD Extensions
- 2.10.3 Streaming SIMD Extensions 2
- 2.10.4 Streaming SIMD Extensions 3
- 2.10.5 Supplemental Streaming SIMD Extensions 3
- 2.10.6 SSE4.1
- 2.10.7 SSE4.2
- 2.10.8 AESNI and PCLMULQDQ
- 2.10.9 Intel® Advanced Vector Extensions
- 2.10.10 Half-Precision Floating-Point Conversion (F16C)
- 2.10.11 RDRAND
- 2.10.12 Fused Multiply-Add (FMA) Extensions
- 2.10.13 Intel AVX2
- 2.10.14 General-Purpose Bit-Processing Instructions
- 2.10.15 Intel® Transactional Synchronization Extensions
- 2.10.16 RDSEED
- 2.10.17 ADCX and ADOX Instructions
- Chapter 3 General Optimization Guidelines
- 3.1 Performance Tools
- 3.2 Processor Perspectives
- 3.3 Coding Rules, Suggestions and Tuning Hints
- 3.4 Optimizing the Front End
- 3.4.1 Branch Prediction Optimization
- 3.4.2 Fetch and Decode Optimization
- 3.4.2.1 Optimizing for Micro-fusion
- 3.4.2.2 Optimizing for Macro-fusion
- 3.4.2.3 Length-Changing Prefixes (LCP)
- 3.4.2.4 Optimizing the Loop Stream Detector (LSD)
- 3.4.2.5 Exploit LSD Micro-op Emission Bandwidth in Intel® Microarchitecture Code Name Sandy Bridge
- 3.4.2.6 Optimization for Decoded ICache
- 3.4.2.7 Other Decoding Guidelines
- 3.5 Optimizing the Execution Core
- 3.5.1 Instruction Selection
- 3.5.1.1 Integer Divide
- 3.5.1.2 Using LEA
- 3.5.1.3 ADC and SBB in Intel® Microarchitecture Code Name Sandy Bridge
- 3.5.1.4 Bitwise Rotation
- 3.5.1.5 Variable Bit Count Rotation and Shift
- 3.5.1.6 Address Calculations
- 3.5.1.7 Clearing Registers and Dependency Breaking Idioms
- 3.5.1.8 Compares
- 3.5.1.9 Using NOPs
- 3.5.1.10 Mixing SIMD Data Types
- 3.5.1.11 Spill Scheduling
- 3.5.1.12 Zero-Latency MOV Instructions
- 3.5.2 Avoiding Stalls in Execution Core
- 3.5.3 Vectorization
- 3.5.4 Optimization of Partially Vectorizable Code
- 3.6 Optimizing Memory Accesses
- 3.6.1 Load and Store Execution Bandwidth
- 3.6.2 Minimize Register Spills
- 3.6.3 Enhance Speculative Execution and Memory Disambiguation
- 3.6.4 Alignment
- 3.6.5 Store Forwarding
- 3.6.6 Data Layout Optimizations
- 3.6.7 Stack Alignment
- 3.6.8 Capacity Limits and Aliasing in Caches
- 3.6.9 Mixing Code and Data
- 3.6.10 Write Combining
- 3.6.11 Locality Enhancement
- 3.6.12 Minimizing Bus Latency
- 3.6.13 Non-Temporal Store Bus Traffic
- 3.7 Prefetching
- 3.8 Floating-point Considerations
- 3.9 Maximizing PCIe Performance
- Chapter 4 Coding for SIMD Architectures
- 4.1 Checking for Processor Support of SIMD Technologies
- 4.1.1 Checking for MMX Technology Support
- 4.1.2 Checking for Streaming SIMD Extensions Support
- 4.1.3 Checking for Streaming SIMD Extensions 2 Support
- 4.1.4 Checking for Streaming SIMD Extensions 3 Support
- 4.1.5 Checking for Supplemental Streaming SIMD Extensions 3 Support
- 4.1.6 Checking for SSE4.1 Support
- 4.1.7 Checking for SSE4.2 Support
- 4.1.8 Detection of PCLMULQDQ and AESNI Instructions
- 4.1.9 Detection of AVX Instructions
- 4.1.10 Detection of VEX-Encoded AES and VPCLMULQDQ
- 4.1.11 Detection of F16C Instructions
- 4.1.12 Detection of FMA
- 4.1.13 Detection of AVX2
- 4.2 Considerations for Code Conversion to SIMD Programming
- 4.3 Coding Techniques
- 4.4 Stack and Data Alignment
- 4.5 Improving Memory Utilization
- 4.6 Instruction Selection
- 4.7 Tuning the Final Application
- Chapter 5 Optimizing for SIMD Integer Applications
- 5.1 General Rules on SIMD Integer Code
- 5.2 Using SIMD Integer with x87 Floating-point
- 5.3 Data Alignment
- 5.4 Data Movement Coding Techniques
- 5.4.1 Unsigned Unpack
- 5.4.2 Signed Unpack
- 5.4.3 Interleaved Pack with Saturation
- 5.4.4 Interleaved Pack without Saturation
- 5.4.5 Non-Interleaved Unpack
- 5.4.6 Extract Data Element
- 5.4.7 Insert Data Element
- 5.4.8 Non-Unit Stride Data Movement
- 5.4.9 Move Byte Mask to Integer
- 5.4.10 Packed Shuffle Word for 64-bit Registers
- 5.4.11 Packed Shuffle Word for 128-bit Registers
- 5.4.12 Shuffle Bytes
- 5.4.13 Conditional Data Movement
- 5.4.14 Unpacking/Interleaving 64-bit Data in 128-bit Registers
- 5.4.15 Data Movement
- 5.4.16 Conversion Instructions
- 5.5 Generating Constants
- 5.6 Building Blocks
- 5.6.1 Absolute Difference of Unsigned Numbers
- 5.6.2 Absolute Difference of Signed Numbers
- 5.6.3 Absolute Value
- 5.6.4 Pixel Format Conversion
- 5.6.5 Endian Conversion
- 5.6.6 Clipping to an Arbitrary Range [High, Low]
- 5.6.7 Packed Max/Min of Byte, Word and Dword
- 5.6.8 Packed Multiply Integers
- 5.6.9 Packed Sum of Absolute Differences
- 5.6.10 MPSADBW and PHMINPOSUW
- 5.6.11 Packed Average (Byte/Word)
- 5.6.12 Complex Multiply by a Constant
- 5.6.13 Packed 64-bit Add/Subtract
- 5.6.14 128-bit Shifts
- 5.6.15 PTEST and Conditional Branch
- 5.6.16 Vectorization of Heterogeneous Computations across Loop Iterations
- 5.6.17 Vectorization of Control Flows in Nested Loops
- 5.7 Memory Optimizations
- 5.8 Converting from 64-bit to 128-bit SIMD Integers
- 5.9 Tuning Partially Vectorizable Code
- 5.10 Parallel Mode AES Encryption and Decryption
- 5.11 Light-Weight Decompression and Database Processing
- Chapter 6 Optimizing for SIMD Floating-point Applications
- 6.1 General Rules for SIMD Floating-point Code
- 6.2 Planning Considerations
- 6.3 Using SIMD Floating-point with x87 Floating-point
- 6.4 Scalar Floating-point Code
- 6.5 Data Alignment
- 6.6 SIMD Optimizations and Microarchitectures
- Chapter 7 INT8 Deep Learning Inference
- Chapter 8 Optimizing Cache Usage
- 8.1 General Prefetch Coding Guidelines
- 8.2 Prefetch and Cacheability Instructions
- 8.3 Prefetch
- 8.4 Cacheability Control
- 8.5 Memory Optimization Using Prefetch
- 8.5.1 Software-Controlled Prefetch
- 8.5.2 Hardware Prefetch
- 8.5.3 Example of Effective Latency Reduction with Hardware Prefetch
- 8.5.4 Example of Latency Hiding with S/W Prefetch Instruction
- 8.5.5 Software Prefetching Usage Checklist
- 8.5.6 Software Prefetch Scheduling Distance
- 8.5.7 Software Prefetch Concatenation
- 8.5.8 Minimize Number of Software Prefetches
- 8.5.9 Mix Software Prefetch with Computation Instructions
- 8.5.10 Software Prefetch and Cache Blocking Techniques
- 8.5.11 Hardware Prefetching and Cache Blocking Techniques
- 8.5.12 Single-pass versus Multi-pass Execution
- 8.6 Memory Optimization using Non-Temporal Stores
- 8.6.1 Non-temporal Stores and Software Write-Combining
- 8.6.2 Cache Management
- 8.6.2.1 Video Encoder
- 8.6.2.2 Video Decoder
- 8.6.2.3 Conclusions from Video Encoder and Decoder Implementation
- 8.6.2.4 Optimizing Memory Copy Routines
- 8.6.2.5 Using the 8-byte Streaming Stores and Software Prefetch
- 8.6.2.6 Using 16-byte Streaming Stores and Hardware Prefetch
- 8.6.2.7 Performance Comparisons of Memory Copy Routines
- 8.6.3 Deterministic Cache Parameters
- Chapter 9 Introducing Sub-NUMA Clustering
- Chapter 10 Multicore and Hyper-Threading Technology
- 10.1 Performance and Usage Models
- 10.2 Programming Models and Multithreading
- 10.3 Optimization Guidelines
- 10.4 Thread Synchronization
- 10.5 System Bus Optimization
- 10.6 Memory Optimization
- 10.7 Front End Optimization
- 10.8 Affinities and Managing Shared Platform Resources
- 10.9 Optimization of Other Shared Resources
- Chapter 11 Intel® Optane™ DC Persistent Memory
- Chapter 12 64-bit Mode Coding Guidelines
- Chapter 13 SSE4.2 and SIMD Programming for Text-Processing/Lexing/Parsing
- Chapter 14 Optimizations for Intel® AVX, FMA and AVX2
- 14.1 Intel® AVX Intrinsics Coding
- 14.2 Non-Destructive Source (NDS)
- 14.3 Mixing AVX Code with SSE Code
- 14.4 128-Bit Lane Operation and AVX
- 14.5 Data Gather and Scatter
- 14.6 Data Alignment for Intel® AVX
- 14.7 L1D Cache Line Replacements
- 14.8 4K Aliasing
- 14.9 Conditional SIMD Packed Loads and Stores
- 14.10 Mixing Integer and Floating-Point Code
- 14.11 Handling Port 5 Pressure
- 14.12 Divide and Square Root Operations
- 14.13 Optimization of Array Sub Sum Example
- 14.14 Half-Precision Floating-Point Conversions
- 14.15 Fused Multiply-Add (FMA) Instructions Guidelines
- 14.16 AVX2 Optimization Guidelines
- 14.16.1 Multi-Buffering and AVX2
- 14.16.2 Modular Multiplication and AVX2
- 14.16.3 Data Movement Considerations
- 14.16.3.1 SIMD Heuristics to Implement Memcpy()
- 14.16.3.2 Memcpy() Implementation Using Enhanced REP MOVSB
- 14.16.3.3 Memset() Implementation Considerations
- 14.16.3.4 Hoisting Memcpy/Memset Ahead of Consuming Code
- 14.16.3.5 256-bit Fetch versus Two 128-bit Fetches
- 14.16.3.6 Mixing MULX and AVX2 Instructions
- 14.16.4 Considerations for Gather Instructions
- 14.16.5 AVX2 Conversion Remedy to MMX Instruction Throughput Limitation
- Chapter 15 Intel® TSX Recommendations
- 15.1 Introduction
- 15.2 Application-Level Tuning and Optimizations
- 15.3 Developing an Intel TSX Enabled Synchronization Library
- 15.3.1 Adding HLE Prefixes
- 15.3.2 Elision Friendly Critical Section Locks
- 15.3.3 Using HLE or RTM for Lock Elision
- 15.3.4 An Example Wrapper for Lock Elision Using RTM
- 15.3.5 Guidelines for the RTM Fallback Handler
- 15.3.6 Implementing Elision-Friendly Locks using Intel TSX
- 15.3.7 Eliding Application-Specific Meta-Locks using Intel TSX
- 15.3.8 Avoiding Persistent Non-Elided Execution
- 15.3.9 Reading the Value of an Elided Lock in RTM-based libraries
- 15.3.10 Intermixing HLE and RTM
- 15.4 Using the Performance Monitoring Support for Intel TSX
- 15.4.1 Measuring Transactional Success
- 15.4.2 Finding Locks to Elide and Verifying All Locks Are Elided
- 15.4.3 Sampling Transactional Aborts
- 15.4.4 Classifying Aborts using a Profiling Tool
- 15.4.5 XABORT Arguments for RTM Fallback Handlers
- 15.4.6 Call Graphs for Transactional Aborts
- 15.4.7 Last Branch Records and Transactional Aborts
- 15.4.8 Profiling and Testing Intel TSX Software using the Intel® SDE
- 15.4.9 HLE Specific Performance Monitoring Events
- 15.4.10 Computing Useful Metrics for Intel TSX
- 15.5 Performance Guidelines
- 15.6 Debugging Guidelines
- 15.7 Common Intrinsics for Intel TSX
- Chapter 16 Power Optimization for Mobile Usages
- 16.1 Overview
- 16.2 Mobile Usage Scenarios
- 16.3 ACPI C-States
- 16.4 Guidelines for Extending Battery Life
- 16.5 Tuning Software for Intelligent Power Consumption
- 16.5.1 Reduction of Active Cycles
- 16.5.2 PAUSE and Sleep(0) Loop Optimization
- 16.5.3 Spin-Wait Loops
- 16.5.4 Using Event Driven Service Instead of Polling in Code
- 16.5.5 Reducing Interrupt Rate
- 16.5.6 Reducing Privileged Time
- 16.5.7 Setting Context Awareness in the Code
- 16.5.8 Saving Energy by Optimizing for Performance
- 16.6 Processor Specific Power Management Optimization for System Software
- Chapter 17 Skylake Server Microarchitecture and Software Optimization for Intel® AVX-512
- 17.1 Basic Intel® AVX-512 vs. Intel® AVX2 Coding
- 17.2 Masking
- 17.3 Forwarding and Unmasked Operations
- 17.4 Forwarding and Memory Masking
- 17.5 Data Compress
- 17.6 Data Expand
- 17.7 Ternary Logic
- 17.8 New Shuffle Instructions
- 17.9 Broadcast
- 17.10 Embedded Rounding
- 17.11 Scatter Instruction
- 17.12 Static Rounding Modes, Suppress-All-Exceptions (SAE)
- 17.13 QWORD Instruction Support
- 17.14 Vector Length Orthogonality
- 17.15 New Intel® AVX-512 Instructions for Transcendental Support
- 17.15.1 VRCP14, VRSQRT14 - Software Sequences for 1/x, x/y, sqrt(x)
- 17.15.2 VGETMANT VGETEXP - Vector Get Mantissa and Vector Get Exponent
- 17.15.3 VRNDSCALE - Vector Round Scale
- 17.15.4 VREDUCE - Vector Reduce
- 17.15.5 VSCALEF - Vector Scale
- 17.15.6 VFPCLASS - Vector Floating Point Class
- 17.15.7 VPERM, VPERMI2, VPERMT2 - Small Table Lookup Implementation
- 17.16 Conflict Detection
- 17.17 FMA Latency
- 17.18 Mixing Intel® AVX Extensions or Intel® AVX-512 Extensions with Intel® Streaming SIMD Extensions (Intel® SSE) Code
- 17.19 Mixing zmm Vector Code with xmm/ymm
- 17.20 Servers With a Single FMA Unit
- 17.21 Gather/Scatter to Shuffle (G2S/STS)
- 17.22 Data Alignment
- 17.23 Dynamic Memory Allocation and Memory Alignment
- 17.24 Division and Square Root Operations
- 17.24.1 Divide and Square Root Approximation Methods
- 17.24.2 Divide and Square Root Performance
- 17.24.3 Approximation Latencies
- 17.24.4 Code Snippets
- 17.24.4.1 Single Precision, Divide, 24 Bits (IEEE)
- 17.24.4.2 Single Precision, Divide, 23 Bits
- 17.24.4.3 Single Precision, Divide, 14 Bits
- 17.24.4.4 Single Precision, Reciprocal Square Root, 22 Bits
- 17.24.4.5 Single Precision, Reciprocal Square Root, 23 Bits
- 17.24.4.6 Single Precision, Reciprocal Square Root, 14 Bits
- 17.24.4.7 Single Precision, Square Root, 24 Bits (IEEE)
- 17.24.4.8 Single Precision, Square Root, 23 Bits
- 17.24.4.9 Single Precision, Square Root, 14 Bits
- 17.24.4.10 Double Precision, Divide, 53 Bits (IEEE)
- 17.24.4.11 Double Precision, Divide, 52 Bits
- 17.24.4.12 Double Precision, Divide, 26 Bits
- 17.24.4.13 Double Precision, Divide, 14 Bits
- 17.24.4.14 Double Precision, Reciprocal Square Root, 51 Bits
- 17.24.4.15 Double Precision, Reciprocal Square Root, 52 Bits
- 17.24.4.16 Double Precision, Reciprocal Square Root, 50 Bits
- 17.24.4.17 Double Precision, Reciprocal Square Root, 26 Bits
- 17.24.4.18 Double Precision, Reciprocal Square Root, 14 Bits
- 17.24.4.19 Double Precision, Square Root, 53 Bits (IEEE)
- 17.24.4.20 Double Precision, Square Root, 52 Bits
- 17.24.4.21 Double Precision, Square Root, 26 Bits
- 17.24.4.22 Double Precision, Square Root, 14 Bits
- 17.25 Tips on Compiler Usage
- 17.26 Skylake Server Power Management
- Chapter 18 Software Optimization For Goldmont Plus, Goldmont, and Silvermont Microarchitectures
- 18.1 Microarchitectures of Recent Intel Atom Processor Generations
- 18.2 Coding Recommendations for Goldmont Plus, Goldmont and Silvermont Microarchitectures
- 18.2.1 Optimizing The Front End
- 18.2.2 Optimizing The Execution Core
- 18.2.2.1 Scheduling
- 18.2.2.2 Address Generation
- 18.2.2.3 FP Multiply-Accumulate-Store Execution
- 18.2.2.4 Integer Multiply Execution
- 18.2.2.5 Zeroing Idioms
- 18.2.2.6 NOP Idioms
- 18.2.2.7 Move Elimination and ESP Folding
- 18.2.2.8 Stack Manipulation Instruction
- 18.2.2.9 Flags Usage
- 18.2.2.10 SIMD Floating-Point and x87 Instructions
- 18.2.2.11 SIMD Integer Instructions
- 18.2.2.12 Vectorization Considerations
- 18.2.2.13 Other SIMD Instructions
- 18.2.2.14 Instruction Selection
- 18.2.2.15 Integer Division
- 18.2.2.16 Integer Shift
- 18.2.2.17 Pause Instruction
- 18.2.3 Optimizing Memory Accesses
- 18.3 Instruction Latency and Throughput
- Chapter 19 Knights Landing Microarchitecture and Software Optimization
- 19.1 Knights Landing Microarchitecture
- 19.2 Intel® AVX-512 Coding Recommendations for Knights Landing Microarchitecture
- 19.2.1 Using Gather and Scatter Instructions
- 19.2.2 Using Enhanced Reciprocal Instructions
- 19.2.3 Using AVX-512CD Instructions
- 19.2.4 Using Intel® Hyper-Threading Technology
- 19.2.5 Front End Considerations
- 19.2.6 Integer Execution Considerations
- 19.2.7 Optimizing FP and Vector Execution
- 19.2.8 Memory Optimization
- Appendix A Application Performance Tools
- A.1 Compilers
- A.2 Performance Libraries
- A.3 Performance Profilers
- A.4 Thread and Memory Checkers
- A.5 Vectorization Assistant
- A.6 Cluster Tools
- A.7 Intel® Academic Community
- Appendix B Using Performance Monitoring Events
- B.1 Top-Down Analysis Method
- B.2 Performance Monitoring and Microarchitecture
- B.3 Intel® Xeon® processor 5500 Series
- B.4 Performance Analysis Techniques for Intel® Xeon® Processor 5500 Series
- B.4.1 Cycle Accounting and Uop Flow Analysis
- B.4.2 Stall Cycle Decomposition and Core Memory Accesses
- B.4.3 Core PMU Precise Events
- B.4.3.1 Precise Memory Access Events
- B.4.3.2 Load Latency Event
- B.4.3.3 Precise Execution Events
- B.4.3.4 Last Branch Record (LBR)
- B.4.3.5 Measuring Core Memory Access Latency
- B.4.3.6 Measuring Per-Core Bandwidth
- B.4.3.7 Miscellaneous L1 and L2 Events for Cache Misses
- B.4.3.8 TLB Misses
- B.4.3.9 L1 Data Cache
- B.4.4 Front End Monitoring Events
- B.4.5 Uncore Performance Monitoring Events
- B.4.6 Intel QuickPath Interconnect Home Logic (QHL)
- B.4.7 Measuring Bandwidth From the Uncore
- B.5 Performance Tuning Techniques for Intel® Microarchitecture Code Name Sandy Bridge
- B.6 Using Performance Events of Intel® Core™ Solo and Intel® Core™ Duo Processors
- B.7 Drill-Down Techniques for Performance Analysis
- B.8 Event Ratios for Intel Core Microarchitecture
- B.8.1 Clocks Per Instructions Retired Ratio (CPI)
- B.8.2 Front End Ratios
- B.8.3 Branch Prediction Ratios
- B.8.4 Execution Ratios
- B.8.5 Memory Sub-System - Access Conflicts Ratios
- B.8.6 Memory Sub-System - Cache Misses Ratios
- B.8.7 Memory Sub-System - Prefetching
- B.8.8 Memory Sub-System - TLB Miss Ratios
- B.8.9 Memory Sub-System - Core Interaction
- B.8.10 Memory Sub-System - Bus Characterization
- Appendix C Instruction Latency and Throughput
- Appendix D Intel® Atom™ Microarchitecture and Software Optimization