Intel® 64 and IA-32 Architectures Optimization Reference Manual (Order Number 248966-033, June 2016)
- Chapter 1 Introduction
- Chapter 2 Intel® 64 and IA-32 Processor Architectures
- 2.1 The Skylake Microarchitecture
- 2.2 The Haswell Microarchitecture
- 2.3 Intel® Microarchitecture Code Name Sandy Bridge
- 2.4 Intel® Core™ Microarchitecture and Enhanced Intel® Core™ Microarchitecture
- 2.5 Intel® Microarchitecture Code Name Nehalem
- 2.5.1 Microarchitecture Pipeline
- 2.5.2 Front End Overview
- 2.5.3 Execution Engine
- 2.5.4 Cache and Memory Subsystem
- 2.5.5 Load and Store Operation Enhancements
- 2.5.6 REP String Enhancement
- 2.5.7 Enhancements for System Software
- 2.5.8 Efficiency Enhancements for Power Consumption
- 2.5.9 Hyper-Threading Technology Support in Intel® Microarchitecture Code Name Nehalem
- 2.6 Intel® Hyper-Threading Technology
- 2.7 Intel® 64 Architecture
- 2.8 SIMD Technology
- 2.9 Summary of SIMD Technologies and Application Level Extensions
- 2.9.1 MMX™ Technology
- 2.9.2 Streaming SIMD Extensions
- 2.9.3 Streaming SIMD Extensions 2
- 2.9.4 Streaming SIMD Extensions 3
- 2.9.5 Supplemental Streaming SIMD Extensions 3
- 2.9.6 SSE4.1
- 2.9.7 SSE4.2
- 2.9.8 AESNI and PCLMULQDQ
- 2.9.9 Intel® Advanced Vector Extensions
- 2.9.10 Half-Precision Floating-Point Conversion (F16C)
- 2.9.11 RDRAND
- 2.9.12 Fused Multiply-Add (FMA) Extensions
- 2.9.13 Intel AVX2
- 2.9.14 General-Purpose Bit-Processing Instructions
- 2.9.15 Intel® Transactional Synchronization Extensions
- 2.9.16 RDSEED
- 2.9.17 ADCX and ADOX Instructions
- Chapter 3 General Optimization Guidelines
- 3.1 Performance Tools
- 3.2 Processor Perspectives
- 3.3 Coding Rules, Suggestions and Tuning Hints
- 3.4 Optimizing the Front End
- 3.4.1 Branch Prediction Optimization
- 3.4.2 Fetch and Decode Optimization
- 3.4.2.1 Optimizing for Micro-fusion
- 3.4.2.2 Optimizing for Macro-fusion
- 3.4.2.3 Length-Changing Prefixes (LCP)
- 3.4.2.4 Optimizing the Loop Stream Detector (LSD)
- 3.4.2.5 Exploit LSD Micro-op Emission Bandwidth in Intel® Microarchitecture Code Name Sandy Bridge
- 3.4.2.6 Optimization for Decoded ICache
- 3.4.2.7 Other Decoding Guidelines
- 3.5 Optimizing the Execution Core
- 3.5.1 Instruction Selection
- 3.5.1.1 Use of the INC and DEC Instructions
- 3.5.1.2 Integer Divide
- 3.5.1.3 Using LEA
- 3.5.1.4 ADC and SBB in Intel® Microarchitecture Code Name Sandy Bridge
- 3.5.1.5 Bitwise Rotation
- 3.5.1.6 Variable Bit Count Rotation and Shift
- 3.5.1.7 Address Calculations
- 3.5.1.8 Clearing Registers and Dependency Breaking Idioms
- 3.5.1.9 Compares
- 3.5.1.10 Using NOPs
- 3.5.1.11 Mixing SIMD Data Types
- 3.5.1.12 Spill Scheduling
- 3.5.1.13 Zero-Latency MOV Instructions
- 3.5.2 Avoiding Stalls in Execution Core
- 3.5.3 Vectorization
- 3.5.4 Optimization of Partially Vectorizable Code
- 3.6 Optimizing Memory Accesses
- 3.6.1 Load and Store Execution Bandwidth
- 3.6.2 Minimize Register Spills
- 3.6.3 Enhance Speculative Execution and Memory Disambiguation
- 3.6.4 Alignment
- 3.6.5 Store Forwarding
- 3.6.6 Data Layout Optimizations
- 3.6.7 Stack Alignment
- 3.6.8 Capacity Limits and Aliasing in Caches
- 3.6.9 Mixing Code and Data
- 3.6.10 Write Combining
- 3.6.11 Locality Enhancement
- 3.6.12 Minimizing Bus Latency
- 3.6.13 Non-Temporal Store Bus Traffic
- 3.7 Prefetching
- 3.8 Floating-point Considerations
- 3.9 Maximizing PCIe Performance
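The prefetching guidance of section 3.7 (expanded in 7.5) amounts to issuing a software prefetch a fixed distance ahead of the load stream. A minimal C sketch, assuming an x86 target with SSE; the distance of 64 elements is illustrative, not a tuned value (tuning is the subject of 7.5.6):

```c
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#define PREFETCH(p) _mm_prefetch((const char *)(p), _MM_HINT_T0)
#else
#define PREFETCH(p) ((void)(p))   /* no-op on non-x86 targets */
#endif

#define PREFETCH_DIST 64  /* elements ahead of the loads; illustrative */

long sum_with_prefetch(const int *a, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DIST < n)
            PREFETCH(&a[i + PREFETCH_DIST]);  /* hint: pull a future line toward L1 */
        sum += a[i];
    }
    return sum;
}
```

The prefetch is a hint only; correctness does not depend on it, which is why the fallback can be a no-op.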
- Chapter 4 Coding for SIMD Architectures
- 4.1 Checking for Processor Support of SIMD Technologies
- 4.1.1 Checking for MMX Technology Support
- 4.1.2 Checking for Streaming SIMD Extensions Support
- 4.1.3 Checking for Streaming SIMD Extensions 2 Support
- 4.1.4 Checking for Streaming SIMD Extensions 3 Support
- 4.1.5 Checking for Supplemental Streaming SIMD Extensions 3 Support
- 4.1.6 Checking for SSE4.1 Support
- 4.1.7 Checking for SSE4.2 Support
- 4.1.8 Detection of PCLMULQDQ and AESNI Instructions
- 4.1.9 Detection of AVX Instructions
- 4.1.10 Detection of VEX-Encoded AES and VPCLMULQDQ
- 4.1.11 Detection of F16C Instructions
- 4.1.12 Detection of FMA
- 4.1.13 Detection of AVX2
- 4.2 Considerations for Code Conversion to SIMD Programming
- 4.3 Coding Techniques
- 4.4 Stack and Data Alignment
- 4.5 Improving Memory Utilization
- 4.6 Instruction Selection
- 4.7 Tuning the Final Application
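The detection sections 4.1.1 through 4.1.13 all follow the same CPUID pattern: query a feature bit in leaf 1, and for AVX additionally confirm OS support for YMM state via XGETBV. A minimal sketch using GCC's `cpuid.h` (function names here are my own, not the manual's):

```c
#if defined(__i386__) || defined(__x86_64__)
#include <cpuid.h>

int has_sse42(void) {
    unsigned a, b, c, d;
    if (!__get_cpuid(1, &a, &b, &c, &d)) return 0;
    return (c >> 20) & 1;              /* CPUID.1:ECX.SSE4_2[bit 20] */
}

int has_avx(void) {
    unsigned a, b, c, d;
    if (!__get_cpuid(1, &a, &b, &c, &d)) return 0;
    if (!((c >> 27) & 1) || !((c >> 28) & 1))   /* OSXSAVE and AVX bits */
        return 0;
    unsigned lo, hi;
    __asm__("xgetbv" : "=a"(lo), "=d"(hi) : "c"(0));  /* read XCR0 */
    return (lo & 0x6) == 0x6;          /* OS saves both XMM and YMM state */
}
#else
int has_sse42(void) { return 0; }      /* not an x86 target */
int has_avx(void)   { return 0; }
#endif
```

Note the ordering: XGETBV is only executed after the OSXSAVE bit confirms it is a legal instruction.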
- Chapter 5 Optimizing for SIMD Integer Applications
- 5.1 General Rules on SIMD Integer Code
- 5.2 Using SIMD Integer with x87 Floating-point
- 5.3 Data Alignment
- 5.4 Data Movement Coding Techniques
- 5.4.1 Unsigned Unpack
- 5.4.2 Signed Unpack
- 5.4.3 Interleaved Pack with Saturation
- 5.4.4 Interleaved Pack without Saturation
- 5.4.5 Non-Interleaved Unpack
- 5.4.6 Extract Data Element
- 5.4.7 Insert Data Element
- 5.4.8 Non-Unit Stride Data Movement
- 5.4.9 Move Byte Mask to Integer
- 5.4.10 Packed Shuffle Word for 64-bit Registers
- 5.4.11 Packed Shuffle Word for 128-bit Registers
- 5.4.12 Shuffle Bytes
- 5.4.13 Conditional Data Movement
- 5.4.14 Unpacking/interleaving 64-bit Data in 128-bit Registers
- 5.4.15 Data Movement
- 5.4.16 Conversion Instructions
- 5.5 Generating Constants
- 5.6 Building Blocks
- 5.6.1 Absolute Difference of Unsigned Numbers
- 5.6.2 Absolute Difference of Signed Numbers
- 5.6.3 Absolute Value
- 5.6.4 Pixel Format Conversion
- 5.6.5 Endian Conversion
- 5.6.6 Clipping to an Arbitrary Range [High, Low]
- 5.6.7 Packed Max/Min of Byte, Word and Dword
- 5.6.8 Packed Multiply Integers
- 5.6.9 Packed Sum of Absolute Differences
- 5.6.10 MPSADBW and PHMINPOSUW
- 5.6.11 Packed Average (Byte/Word)
- 5.6.12 Complex Multiply by a Constant
- 5.6.13 Packed 64-bit Add/Subtract
- 5.6.14 128-bit Shifts
- 5.6.15 PTEST and Conditional Branch
- 5.6.16 Vectorization of Heterogeneous Computations across Loop Iterations
- 5.6.17 Vectorization of Control Flows in Nested Loops
- 5.7 Memory Optimizations
- 5.8 Converting from 64-bit to 128-bit SIMD Integers
- 5.9 Tuning Partially Vectorizable Code
- 5.10 Parallel Mode AES Encryption and Decryption
- 5.11 Light-Weight Decompression and Database Processing
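The unsigned absolute-difference building block of 5.6.1 is the classic two-saturating-subtracts idiom. A sketch with SSE2 intrinsics rather than the manual's assembly listings, with a scalar fallback for non-x86 builds:

```c
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>

/* |a[i]-b[i]| per unsigned byte: PSUBUSB saturates the "wrong"
   direction to zero, so ORing both directions yields |a-b|. */
void abs_diff_u8(const uint8_t *a, const uint8_t *b, uint8_t *out) {
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    __m128i d  = _mm_or_si128(_mm_subs_epu8(va, vb), _mm_subs_epu8(vb, va));
    _mm_storeu_si128((__m128i *)out, d);
}
#else
void abs_diff_u8(const uint8_t *a, const uint8_t *b, uint8_t *out) {
    for (int i = 0; i < 16; i++)
        out[i] = a[i] > b[i] ? a[i] - b[i] : b[i] - a[i];
}
#endif
```

When only the sum of the differences is needed, PSADBW (5.6.9) does the subtract, absolute value, and horizontal add in a single instruction.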
- Chapter 6 Optimizing for SIMD Floating-point Applications
- 6.1 General Rules for SIMD Floating-point Code
- 6.2 Planning Considerations
- 6.3 Using SIMD Floating-point with x87 Floating-point
- 6.4 Scalar Floating-point Code
- 6.5 Data Alignment
- 6.6 SIMD Optimizations and Microarchitectures
- Chapter 7 Optimizing Cache Usage
- 7.1 General Prefetch Coding Guidelines
- 7.2 Prefetch and Cacheability Instructions
- 7.3 Prefetch
- 7.4 Cacheability Control
- 7.5 Memory Optimization Using Prefetch
- 7.5.1 Software-Controlled Prefetch
- 7.5.2 Hardware Prefetch
- 7.5.3 Example of Effective Latency Reduction with Hardware Prefetch
- 7.5.4 Example of Latency Hiding with S/W Prefetch Instruction
- 7.5.5 Software Prefetching Usage Checklist
- 7.5.6 Software Prefetch Scheduling Distance
- 7.5.7 Software Prefetch Concatenation
- 7.5.8 Minimize Number of Software Prefetches
- 7.5.9 Mix Software Prefetch with Computation Instructions
- 7.5.10 Software Prefetch and Cache Blocking Techniques
- 7.5.11 Hardware Prefetching and Cache Blocking Techniques
- 7.5.12 Single-pass versus Multi-pass Execution
- 7.6 Memory Optimization using Non-Temporal Stores
- 7.6.1 Non-temporal Stores and Software Write-Combining
- 7.6.2 Cache Management
- 7.6.2.1 Video Encoder
- 7.6.2.2 Video Decoder
- 7.6.2.3 Conclusions from Video Encoder and Decoder Implementation
- 7.6.2.4 Optimizing Memory Copy Routines
- 7.6.2.5 TLB Priming
- 7.6.2.6 Using the 8-byte Streaming Stores and Software Prefetch
- 7.6.2.7 Using 16-byte Streaming Stores and Hardware Prefetch
- 7.6.2.8 Performance Comparisons of Memory Copy Routines
- 7.6.3 Deterministic Cache Parameters
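The non-temporal store technique of 7.6 can be sketched as a streaming fill: MOVNTDQ writes bypass the cache hierarchy, and a trailing SFENCE makes the write-combining stores globally visible before the data is reused. Assumptions: `dst` is 16-byte aligned and `n` is a multiple of 4; a plain loop serves as the portable fallback.

```c
#include <stdint.h>
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h>

void stream_fill(int32_t *dst, size_t n, int32_t v) {
    __m128i val = _mm_set1_epi32(v);
    for (size_t i = 0; i < n; i += 4)
        _mm_stream_si128((__m128i *)(dst + i), val);  /* MOVNTDQ: cache-bypassing store */
    _mm_sfence();  /* drain write-combining buffers before any consumer runs */
}
#else
void stream_fill(int32_t *dst, size_t n, int32_t v) {
    for (size_t i = 0; i < n; i++) dst[i] = v;
}
#endif
```

Per 7.6.1, this pays off only when the destination will not be read again soon; otherwise ordinary cached stores are preferable.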
- Chapter 8 Multicore and Hyper-Threading Technology
- 8.1 Performance and Usage Models
- 8.2 Programming Models and Multithreading
- 8.3 Optimization Guidelines
- 8.4 Thread Synchronization
- 8.4.1 Choice of Synchronization Primitives
- 8.4.2 Synchronization for Short Periods
- 8.4.3 Optimization with Spin-Locks
- 8.4.4 Synchronization for Longer Periods
- 8.4.5 Prevent Sharing of Modified Data and False-Sharing
- 8.4.6 Placement of Shared Synchronization Variable
- 8.4.7 Pause Latency in Skylake Microarchitecture
- 8.5 System Bus Optimization
- 8.6 Memory Optimization
- 8.7 Front end Optimization
- 8.8 Affinities and Managing Shared Platform Resources
- 8.9 Optimization of Other Shared Resources
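The short-period synchronization advice of 8.4.2 and 8.4.3 (and the PAUSE latency note of 8.4.7) reduces to one rule: put PAUSE inside the spin-wait loop. A minimal C11 sketch, assuming an x86 target for `_mm_pause`:

```c
#include <stdatomic.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#define CPU_RELAX() _mm_pause()   /* PAUSE: spin-wait hint, saves power and
                                     avoids the memory-order mispredict penalty */
#else
#define CPU_RELAX() ((void)0)
#endif

typedef atomic_flag spinlock;
#define SPINLOCK_INIT ATOMIC_FLAG_INIT

void spin_lock(spinlock *l) {
    while (atomic_flag_test_and_set_explicit(l, memory_order_acquire))
        CPU_RELAX();              /* back off while the lock is held */
}

int spin_trylock(spinlock *l) {   /* returns 1 on success */
    return !atomic_flag_test_and_set_explicit(l, memory_order_acquire);
}

void spin_unlock(spinlock *l) {
    atomic_flag_clear_explicit(l, memory_order_release);
}
```

For waits longer than a few thousand cycles, 8.4.4 recommends yielding to the OS scheduler instead of spinning.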
- Chapter 9 64-bit Mode Coding Guidelines
- Chapter 10 SSE4.2 and SIMD Programming for Text Processing/Lexing/Parsing
- Chapter 11 Optimizations for Intel® AVX, FMA and AVX2
- 11.1 Intel® AVX Intrinsics Coding
- 11.2 Non-Destructive Source (NDS)
- 11.3 Mixing AVX Code with SSE Code
- 11.4 128-Bit Lane Operation and AVX
- 11.5 Data Gather and Scatter
- 11.6 Data Alignment for Intel® AVX
- 11.7 L1D Cache Line Replacements
- 11.8 4K Aliasing
- 11.9 Conditional SIMD Packed Loads and Stores
- 11.10 Mixing Integer and Floating-Point Code
- 11.11 Handling Port 5 Pressure
- 11.12 Divide and Square Root Operations
- 11.13 Optimization of Array Sub Sum Example
- 11.14 Half-Precision Floating-Point Conversions
- 11.15 Fused Multiply-Add (FMA) Instructions Guidelines
- 11.16 AVX2 Optimization Guidelines
- 11.16.1 Multi-Buffering and AVX2
- 11.16.2 Modular Multiplication and AVX2
- 11.16.3 Data Movement Considerations
- 11.16.3.1 SIMD Heuristics to implement Memcpy()
- 11.16.3.2 Memcpy() Implementation Using Enhanced REP MOVSB
- 11.16.3.3 Memset() Implementation Considerations
- 11.16.3.4 Hoisting Memcpy/Memset Ahead of Consuming Code
- 11.16.3.5 256-bit Fetch versus Two 128-bit Fetches
- 11.16.3.6 Mixing MULX and AVX2 Instructions
- 11.16.4 Considerations for Gather Instructions
- 11.16.5 AVX2 Conversion Remedy to MMX Instruction Throughput Limitation
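The enhanced REP MOVSB technique of 11.16.3.2 is simply a bare `rep movsb` used as the whole copy loop; on processors advertising ERMSB it is competitive with hand-tuned SIMD memcpy for medium and large copies. An inline-assembly sketch, falling back to `memcpy` off x86:

```c
#include <stddef.h>
#include <string.h>

void *copy_ermsb(void *dst, const void *src, size_t n) {
#if defined(__x86_64__) || defined(__i386__)
    void *ret = dst;
    /* RDI/RSI/RCX are updated in place by REP MOVSB, hence "+" constraints */
    __asm__ volatile("rep movsb"
                     : "+D"(dst), "+S"(src), "+c"(n)
                     :
                     : "memory");
    return ret;
#else
    return memcpy(dst, src, n);
#endif
}
```

Real code should gate this on the ERMSB feature flag (CPUID.(EAX=7,ECX=0):EBX bit 9) and keep a SIMD path for the small-copy sizes where REP MOVSB startup cost dominates.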
- Chapter 12 Intel® TSX Recommendations
- 12.1 Introduction
- 12.2 Application-Level Tuning and Optimizations
- 12.3 Developing an Intel TSX Enabled Synchronization Library
- 12.3.1 Adding HLE Prefixes
- 12.3.2 Elision Friendly Critical Section Locks
- 12.3.3 Using HLE or RTM for Lock Elision
- 12.3.4 An example wrapper for lock elision using RTM
- 12.3.5 Guidelines for the RTM fallback handler
- 12.3.6 Implementing Elision-Friendly Locks using Intel TSX
- 12.3.7 Eliding Application-Specific Meta-Locks using Intel TSX
- 12.3.8 Avoiding Persistent Non-Elided Execution
- 12.3.9 Reading the Value of an Elided Lock in RTM-based libraries
- 12.3.10 Intermixing HLE and RTM
- 12.4 Using the Performance Monitoring Support for Intel TSX
- 12.4.1 Measuring Transactional Success
- 12.4.2 Finding locks to elide and verifying all locks are elided
- 12.4.3 Sampling Transactional Aborts
- 12.4.4 Classifying Aborts using a Profiling Tool
- 12.4.5 XABORT Arguments for RTM fallback handlers
- 12.4.6 Call Graphs for Transactional Aborts
- 12.4.7 Last Branch Records and Transactional Aborts
- 12.4.8 Profiling and Testing Intel TSX Software using the Intel SDE
- 12.4.9 HLE Specific Performance Monitoring Events
- 12.4.10 Computing Useful Metrics for Intel TSX
- 12.5 Performance Guidelines
- 12.6 Debugging Guidelines
- 12.7 Common Intrinsics for Intel TSX
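The RTM lock-elision wrapper of 12.3.3 through 12.3.5 has a fixed shape: try `_xbegin` a few times, read the lock inside the transaction and abort if it is held, otherwise fall back to really acquiring the lock; unlock ends the transaction if one is active. A compile-time-guarded sketch (the RTM fast path builds only under `-mrtm`; the retry count and names are illustrative):

```c
#include <stdatomic.h>
#if defined(__RTM__)
#include <immintrin.h>
#endif

#define ELIDE_RETRIES 3          /* illustrative; tune per 12.3.5 */

static atomic_int the_lock = 0;  /* 0 = free, 1 = held */

static void real_lock(void) {
    int expected = 0;
    while (!atomic_compare_exchange_weak(&the_lock, &expected, 1))
        expected = 0;
}
static void real_unlock(void) { atomic_store(&the_lock, 0); }

void elided_lock(void) {
#if defined(__RTM__)
    for (int i = 0; i < ELIDE_RETRIES; i++) {
        unsigned status = _xbegin();
        if (status == _XBEGIN_STARTED) {
            if (atomic_load(&the_lock) == 0)
                return;          /* lock is in the read-set and stays free */
            _xabort(0xff);       /* lock already held non-transactionally */
        }
        /* fell out of the transaction; a real fallback handler would
           inspect status bits here, we simply retry a bounded number of times */
    }
#endif
    real_lock();                 /* non-elided fallback path */
}

void elided_unlock(void) {
#if defined(__RTM__)
    if (_xtest()) { _xend(); return; }   /* commit the elided section */
#endif
    real_unlock();
}
```

Reading the lock inside the transaction is what makes elision safe: if another thread takes the lock for real, the transactional readers abort (12.3.9).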
- Chapter 13 Power Optimization for Mobile Usages
- 13.1 Overview
- 13.2 Mobile Usage Scenarios
- 13.3 ACPI C-States
- 13.4 Guidelines for Extending Battery Life
- 13.5 Tuning Software for Intelligent Power Consumption
- 13.5.1 Reduction of Active Cycles
- 13.5.2 PAUSE and Sleep(0) Loop Optimization
- 13.5.3 Spin-Wait Loops
- 13.5.4 Using Event Driven Service Instead of Polling in Code
- 13.5.5 Reducing Interrupt Rate
- 13.5.6 Reducing Privileged Time
- 13.5.7 Setting Context Awareness in the Code
- 13.5.8 Saving Energy by Optimizing for Performance
- 13.6 Processor Specific Power Management Optimization for System Software
- Chapter 14 Intel® Atom™ Microarchitecture and Software Optimization
- Chapter 15 Silvermont Microarchitecture and Software Optimization
- Chapter 16 Knights Landing Microarchitecture and Software Optimization
- 16.1 Knights Landing Microarchitecture
- 16.2 Intel® AVX-512 Coding Recommendations for Knights Landing Microarchitecture
- 16.2.1 Using Gather and Scatter Instructions
- 16.2.2 Using Enhanced Reciprocal Instructions
- 16.2.3 Using AVX-512CD Instructions
- 16.2.4 Using Intel® Hyper-Threading Technology
- 16.2.5 Front End Considerations
- 16.2.6 Integer Execution Considerations
- 16.2.7 Optimizing FP and Vector Execution
- 16.2.8 Memory Optimization
- Appendix A Application Performance Tools
- A.1 Compilers
- A.2 Performance Libraries
- A.3 Performance Profilers
- A.4 Thread and Memory Checkers
- A.5 Vectorization Assistant
- A.6 Cluster Tools
- A.7 Intel® Academic Community
- Appendix B Using Performance Monitoring Events
- B.1 Top-Down Analysis Method
- B.2 Intel® Xeon® processor 5500 Series
- B.3 Performance Analysis Techniques for Intel® Xeon® Processor 5500 Series
- B.3.1 Cycle Accounting and Uop Flow Analysis
- B.3.2 Stall Cycle Decomposition and Core Memory Accesses
- B.3.3 Core PMU Precise Events
- B.3.3.1 Precise Memory Access Events
- B.3.3.2 Load Latency Event
- B.3.3.3 Precise Execution Events
- B.3.3.4 Last Branch Record (LBR)
- B.3.3.5 Measuring Core Memory Access Latency
- B.3.3.6 Measuring Per-Core Bandwidth
- B.3.3.7 Miscellaneous L1 and L2 Events for Cache Misses
- B.3.3.8 TLB Misses
- B.3.3.9 L1 Data Cache
- B.3.4 Front End Monitoring Events
- B.3.5 Uncore Performance Monitoring Events
- B.3.6 Intel QuickPath Interconnect Home Logic (QHL)
- B.3.7 Measuring Bandwidth From the Uncore
- B.4 Performance Tuning Techniques for Intel® Microarchitecture Code Name Sandy Bridge
- B.5 Using Performance Events of Intel® Core™ Solo and Intel® Core™ Duo processors
- B.6 Drill-Down Techniques for Performance Analysis
- B.7 Event ratios for Intel Core microarchitecture
- B.7.1 Clocks Per Instructions Retired Ratio (CPI)
- B.7.2 Front End Ratios
- B.7.3 Branch Prediction Ratios
- B.7.4 Execution Ratios
- B.7.5 Memory Sub-System - Access Conflicts Ratios
- B.7.6 Memory Sub-System - Cache Misses Ratios
- B.7.7 Memory Sub-system - Prefetching
- B.7.8 Memory Sub-system - TLB Miss Ratios
- B.7.9 Memory Sub-system - Core Interaction
- B.7.10 Memory Sub-system - Bus Characterization
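The first of the B.7 ratios, CPI (B.7.1), is the quotient of two core events; assuming the event names of that section:

```c
#include <stdint.h>

/* B.7.1: CPI = CPU_CLK_UNHALTED.CORE / INST_RETIRED.ANY */
double cpi(uint64_t cpu_clk_unhalted_core, uint64_t inst_retired_any) {
    return (double)cpu_clk_unhalted_core / (double)inst_retired_any;
}
```

The remaining B.7 ratios follow the same pattern: one event count divided by another, usually normalized by INST_RETIRED.ANY or bus clocks.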
- Appendix C Instruction Latency and Throughput