Intel® 64 and IA-32 Architectures
Optimization Reference Manual

Order Number: 248966-033
June 2016

Intel technologies' features and benefits depend on system configuration and may require enabled hardware, software, or service activation. Learn more at intel.com, or from the OEM or retailer.

No computer system can be absolutely secure. Intel does not assume any liability for lost or stolen data or systems or any damages resulting from such losses.

You may not use or facilitate the use of this document in connection with any infringement or other legal analysis concerning Intel products described herein. You agree to grant Intel a non-exclusive, royalty-free license to any patent claim thereafter drafted which includes subject matter disclosed herein.

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.

The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest Intel product specifications and roadmaps.

Results have been estimated or simulated using internal Intel analysis or architecture simulation or modeling, and provided to you for informational purposes. Any differences in your system hardware, software or configuration may affect your actual performance.

Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or by visiting http://www.intel.com/design/literature.htm.

Intel, the Intel logo, Intel Atom, Intel Core, Intel SpeedStep, MMX, Pentium, VTune, and Xeon are trademarks of Intel Corporation in the U.S. and/or other countries.

*Other names and brands may be claimed as the property of others.

Copyright © 1997-2016, Intel Corporation. All Rights Reserved.

CONTENTS

CHAPTER 1
INTRODUCTION
1.1 TUNING YOUR APPLICATION . . . . . 1-1
1.2 ABOUT THIS MANUAL . . . . . 1-1
1.3 RELATED INFORMATION . . . . . 1-3

CHAPTER 2
INTEL® 64 AND IA-32 PROCESSOR ARCHITECTURES

2.1 THE SKYLAKE MICROARCHITECTURE . . . . . 2-2
2.1.1 The Front End . . . . . 2-3
2.1.2 The Out-of-Order Execution Engine . . . . . 2-3
2.1.3 Cache and Memory Subsystem . . . . . 2-5
2.2 THE HASWELL MICROARCHITECTURE . . . . . 2-6
2.2.1 The Front End . . . . . 2-7
2.2.2 The Out-of-Order Engine . . . . . 2-8
2.2.3 Execution Engine . . . . . 2-8
2.2.4 Cache and Memory Subsystem . . . . . 2-10
2.2.4.1 Load and Store Operation Enhancements . . . . . 2-11
2.2.5 The Haswell-E Microarchitecture . . . . . 2-11
2.2.6 The Broadwell Microarchitecture . . . . . 2-12
2.3 INTEL® MICROARCHITECTURE CODE NAME SANDY BRIDGE . . . . . 2-12
2.3.1 Intel® Microarchitecture Code Name Sandy Bridge Pipeline Overview . . . . . 2-13
2.3.2 The Front End . . . . . 2-14
2.3.2.1 Legacy Decode Pipeline . . . . . 2-15
2.3.2.2 Decoded ICache . . . . . 2-17
2.3.2.3 Branch Prediction . . . . . 2-18
2.3.2.4 Micro-op Queue and the Loop Stream Detector (LSD) . . . . . 2-18
2.3.3 The Out-of-Order Engine . . . . . 2-19
2.3.3.1 Renamer . . . . . 2-19
2.3.3.2 Scheduler . . . . . 2-20
2.3.4 The Execution Core . . . . . 2-20
2.3.5 Cache Hierarchy . . . . . 2-22
2.3.5.1 Load and Store Operation Overview . . . . . 2-22
2.3.5.2 L1 DCache . . . . . 2-23
2.3.5.3 Ring Interconnect and Last Level Cache . . . . . 2-27
2.3.5.4 Data Prefetching . . . . . 2-28
2.3.6 System Agent . . . . . 2-29
2.3.7 Intel® Microarchitecture Code Name Ivy Bridge . . . . . 2-30
2.4 INTEL® CORE™ MICROARCHITECTURE AND ENHANCED INTEL® CORE™ MICROARCHITECTURE . . . . . 2-30
2.4.1 Intel® Core™ Microarchitecture Pipeline Overview . . . . . 2-31
2.4.2 Front End . . . . . 2-32
2.4.2.1 Branch Prediction Unit . . . . . 2-33
2.4.2.2 Instruction Fetch Unit . . . . . 2-33
2.4.2.3 Instruction Queue (IQ) . . . . . 2-34
2.4.2.4 Instruction Decode . . . . . 2-35
2.4.2.5 Stack Pointer Tracker . . . . . 2-35
2.4.2.6 Micro-fusion . . . . . 2-35
2.4.3 Execution Core . . . . . 2-36
2.4.3.1 Issue Ports and Execution Units . . . . . 2-36
2.4.4 Intel® Advanced Memory Access . . . . . 2-38
2.4.4.1 Loads and Stores . . . . . 2-39
2.4.4.2 Data Prefetch to L1 caches . . . . . 2-40
2.4.4.3 Data Prefetch Logic . . . . . 2-40

2.4.4.4 Store Forwarding . . . . . 2-41
2.4.4.5 Memory Disambiguation . . . . . 2-42
2.4.5 Intel® Advanced Smart Cache . . . . . 2-42
2.4.5.1 Loads . . . . . 2-43
2.4.5.2 Stores . . . . . 2-44
2.5 INTEL® MICROARCHITECTURE CODE NAME NEHALEM . . . . . 2-44
2.5.1 Microarchitecture Pipeline . . . . . 2-45
2.5.2 Front End Overview . . . . . 2-46
2.5.3 Execution Engine . . . . . 2-47
2.5.3.1 Issue Ports and Execution Units . . . . . 2-48
2.5.4 Cache and Memory Subsystem . . . . . 2-48
2.5.5 Load and Store Operation Enhancements . . . . . 2-49
2.5.5.1 Efficient Handling of Alignment Hazards . . . . . 2-49
2.5.5.2 Store Forwarding Enhancement . . . . . 2-50
2.5.6 REP String Enhancement . . . . . 2-52
2.5.7 Enhancements for System Software . . . . . 2-53
2.5.8 Efficiency Enhancements for Power Consumption . . . . . 2-53
2.5.9 Hyper-Threading Technology Support in Intel® Microarchitecture Code Name Nehalem . . . . . 2-53
2.6 INTEL® HYPER-THREADING TECHNOLOGY . . . . . 2-53
2.6.1 Processor Resources and HT Technology . . . . . 2-55
2.6.1.1 Replicated Resources . . . . . 2-55
2.6.1.2 Partitioned Resources . . . . . 2-55
2.6.1.3 Shared Resources . . . . . 2-55
2.6.2 Microarchitecture Pipeline and HT Technology . . . . . 2-56
2.6.3 Front End Pipeline . . . . . 2-56
2.6.4 Execution Core . . . . . 2-56
2.6.5 Retirement . . . . . 2-56
2.7 INTEL® 64 ARCHITECTURE . . . . . 2-56
2.8 SIMD TECHNOLOGY . . . . . 2-57
2.9 SUMMARY OF SIMD TECHNOLOGIES AND APPLICATION LEVEL EXTENSIONS . . . . . 2-59
2.9.1 MMX™ Technology . . . . . 2-59
2.9.2 Streaming SIMD Extensions . . . . . 2-59
2.9.3 Streaming SIMD Extensions 2 . . . . . 2-59
2.9.4 Streaming SIMD Extensions 3 . . . . . 2-60
2.9.5 Supplemental Streaming SIMD Extensions 3 . . . . . 2-60
2.9.6 SSE4.1 . . . . . 2-60
2.9.7 SSE4.2 . . . . . 2-61
2.9.8 AESNI and PCLMULQDQ . . . . . 2-61
2.9.9 Intel® Advanced Vector Extensions . . . . . 2-61
2.9.10 Half-Precision Floating-Point Conversion (F16C) . . . . . 2-62
2.9.11 RDRAND . . . . . 2-62
2.9.12 Fused-Multiply-ADD (FMA) Extensions . . . . . 2-62
2.9.13 Intel AVX2 . . . . . 2-62
2.9.14 General-Purpose Bit-Processing Instructions . . . . . 2-62
2.9.15 Intel® Transactional Synchronization Extensions . . . . . 2-62
2.9.16 RDSEED . . . . . 2-62
2.9.17 ADCX and ADOX Instructions . . . . . 2-63

CHAPTER 3
GENERAL OPTIMIZATION GUIDELINES
3.1 PERFORMANCE TOOLS . . . . . 3-1
3.1.1 Intel® C++ and Fortran Compilers . . . . . 3-1
3.1.2 General Compiler Recommendations . . . . . 3-2
3.1.3 VTune™ Performance Analyzer . . . . . 3-2
3.2 PROCESSOR PERSPECTIVES . . . . . 3-2
3.2.1 CPUID Dispatch Strategy and Compatible Code Strategy . . . . . 3-3
3.2.2 Transparent Cache-Parameter Strategy . . . . . 3-3
3.2.3 Threading Strategy and Hardware Multithreading Support . . . . . 3-3
3.3 CODING RULES, SUGGESTIONS AND TUNING HINTS . . . . . 3-3
3.4 OPTIMIZING THE FRONT END . . . . . 3-4
3.4.1 Branch Prediction Optimization . . . . . 3-4
3.4.1.1 Eliminating Branches . . . . . 3-5
3.4.1.2 Spin-Wait and Idle Loops . . . . . 3-6
3.4.1.3 Static Prediction . . . . . 3-6

3.4.1.4 Inlining, Calls and Returns . . . . . 3-8
3.4.1.5 Code Alignment . . . . . 3-8
3.4.1.6 Branch Type Selection . . . . . 3-9
3.4.1.7 Loop Unrolling . . . . . 3-10
3.4.1.8 Compiler Support for Branch Prediction . . . . . 3-11
3.4.2 Fetch and Decode Optimization . . . . . 3-12
3.4.2.1 Optimizing for Micro-fusion . . . . . 3-12
3.4.2.2 Optimizing for Macro-fusion . . . . . 3-12
3.4.2.3 Length-Changing Prefixes (LCP) . . . . . 3-16
3.4.2.4 Optimizing the Loop Stream Detector (LSD) . . . . . 3-17
3.4.2.5 Exploit LSD Micro-op Emission Bandwidth in Intel® Microarchitecture Code Name Sandy Bridge . . . . . 3-18
3.4.2.6 Optimization for Decoded ICache . . . . . 3-19
3.4.2.7 Other Decoding Guidelines . . . . . 3-20
3.5 OPTIMIZING THE EXECUTION CORE . . . . . 3-20
3.5.1 Instruction Selection . . . . . 3-20
3.5.1.1 Use of the INC and DEC Instructions . . . . . 3-21
3.5.1.2 Integer Divide . . . . . 3-21
3.5.1.3 Using LEA . . . . . 3-22
3.5.1.4 ADC and SBB in Intel® Microarchitecture Code Name Sandy Bridge . . . . . 3-23
3.5.1.5 Bitwise Rotation . . . . . 3-24
3.5.1.6 Variable Bit Count Rotation and Shift . . . . . 3-24
3.5.1.7 Address Calculations . . . . . 3-25
3.5.1.8 Clearing Registers and Dependency Breaking Idioms . . . . . 3-25
3.5.1.9 Compares . . . . . 3-27
3.5.1.10 Using NOPs . . . . . 3-27
3.5.1.11 Mixing SIMD Data Types . . . . . 3-28
3.5.1.12 Spill Scheduling . . . . . 3-28
3.5.1.13 Zero-Latency MOV Instructions . . . . . 3-28
3.5.2 Avoiding Stalls in Execution Core . . . . . 3-30
3.5.2.1 ROB Read Port Stalls . . . . . 3-30
3.5.2.2 Writeback Bus Conflicts . . . . . 3-31
3.5.2.3 Bypass between Execution Domains . . . . . 3-31
3.5.2.4 Partial Register Stalls . . . . . 3-32
3.5.2.5 Partial XMM Register Stalls . . . . . 3-33
3.5.2.6 Partial Flag Register Stalls . . . . . 3-33
3.5.2.7 Floating-Point/SIMD Operands . . . . . 3-34
3.5.3 Vectorization . . . . . 3-35
3.5.4 Optimization of Partially Vectorizable Code . . . . . 3-36
3.5.4.1 Alternate Packing Techniques . . . . . 3-37
3.5.4.2 Simplifying Result Passing . . . . . 3-38
3.5.4.3 Stack Optimization . . . . . 3-38
3.5.4.4 Tuning Considerations . . . . . 3-39
3.6 OPTIMIZING MEMORY ACCESSES . . . . . 3-40
3.6.1 Load and Store Execution Bandwidth . . . . . 3-41
3.6.1.1 Make Use of Load Bandwidth in Intel® Microarchitecture Code Name Sandy Bridge . . . . . 3-41
3.6.1.2 L1D Cache Latency in Intel® Microarchitecture Code Name Sandy Bridge . . . . . 3-42
3.6.1.3 Handling L1D Cache Bank Conflict . . . . . 3-43
3.6.2 Minimize Register Spills . . . . . 3-44
3.6.3 Enhance Speculative Execution and Memory Disambiguation . . . . . 3-44
3.6.4 Alignment . . . . . 3-45
3.6.5 Store Forwarding . . . . . 3-47
3.6.5.1 Store-to-Load-Forwarding Restriction on Size and Alignment . . . . . 3-47
3.6.5.2 Store-forwarding Restriction on Data Availability . . . . . 3-51
3.6.6 Data Layout Optimizations . . . . . 3-52
3.6.7 Stack Alignment . . . . . 3-54
3.6.8 Capacity Limits and Aliasing in Caches . . . . . 3-54
3.6.8.1 Capacity Limits in Set-Associative Caches . . . . . 3-55
3.6.8.2 Aliasing Cases in the Pentium® M, Intel® Core™ Solo, Intel® Core™ Duo and Intel® Core™ 2 Duo Processors . . . . . 3-55
3.6.9 Mixing Code and Data . . . . . 3-56
3.6.9.1 Self-modifying Code . . . . . 3-57
3.6.9.2 Position Independent Code . . . . . 3-57
3.6.10 Write Combining . . . . . 3-57
3.6.11 Locality Enhancement . . . . . 3-58
3.6.12 Minimizing Bus Latency . . . . . 3-59
3.6.13 Non-Temporal Store Bus Traffic . . . . . 3-60

3.7 PREFETCHING . . . . . 3-60
3.7.1 Hardware Instruction Fetching and Software Prefetching . . . . . 3-61
3.7.2 Hardware Prefetching for First-Level Data Cache . . . . . 3-61
3.7.3 Hardware Prefetching for Second-Level Cache . . . . . 3-63
3.7.4 Cacheability Instructions . . . . . 3-63
3.7.5 REP Prefix and Data Movement . . . . . 3-64
3.7.6 Enhanced REP MOVSB and STOSB operation (ERMSB) . . . . . 3-66
3.7.6.1 Memcpy Considerations . . . . . 3-66
3.7.6.2 Memmove Considerations . . . . . 3-67
3.7.6.3 Memset Considerations . . . . . 3-68
3.8 FLOATING-POINT CONSIDERATIONS . . . . . 3-68
3.8.1 Guidelines for Optimizing Floating-point Code . . . . . 3-68
3.8.2 Microarchitecture Specific Considerations . . . . . 3-69
3.8.2.1 Long-Latency FP Instructions . . . . . 3-69
3.8.2.2 Miscellaneous Instructions . . . . . 3-69
3.8.3 Floating-point Modes and Exceptions . . . . . 3-69
3.8.3.1 Floating-point Exceptions . . . . . 3-69
3.8.3.2 Dealing with floating-point exceptions in x87 FPU code . . . . . 3-70
3.8.3.3 Floating-point Exceptions in SSE/SSE2/SSE3 Code . . . . . 3-70
3.8.4 Floating-point Modes . . . . . 3-70
3.8.4.1 Rounding Mode . . . . . 3-71
3.8.4.2 Precision . . . . . 3-72
3.8.5 x87 vs. Scalar SIMD Floating-point Trade-offs . . . . . 3-73
3.8.5.1 Scalar SSE/SSE2 . . . . . 3-73
3.8.5.2 Transcendental Functions . . . . . 3-73
3.9 MAXIMIZING PCIE PERFORMANCE . . . . . 3-74

CHAPTER 4
CODING FOR SIMD ARCHITECTURES
4.1 CHECKING FOR PROCESSOR SUPPORT OF SIMD TECHNOLOGIES . . . . . 4-1
4.1.1 Checking for MMX Technology Support . . . . . 4-2
4.1.2 Checking for Streaming SIMD Extensions Support . . . . . 4-2
4.1.3 Checking for Streaming SIMD Extensions 2 Support . . . . . 4-2
4.1.4 Checking for Streaming SIMD Extensions 3 Support . . . . . 4-3
4.1.5 Checking for Supplemental Streaming SIMD Extensions 3 Support . . . . . 4-3
4.1.6 Checking for SSE4.1 Support . . . . . 4-4
4.1.7 Checking for SSE4.2 Support . . . . . 4-4
4.1.8 Detection of PCLMULQDQ and AESNI Instructions . . . . . 4-4
4.1.9 Detection of AVX Instructions . . . . . 4-5
4.1.10 Detection of VEX-Encoded AES and VPCLMULQDQ . . . . . 4-7
4.1.11 Detection of F16C Instructions . . . . . 4-7
4.1.12 Detection of FMA . . . . . 4-8
4.1.13 Detection of AVX2 . . . . . 4-9
4.2 CONSIDERATIONS FOR CODE CONVERSION TO SIMD PROGRAMMING . . . . . 4-10
4.2.1 Identifying Hot Spots . . . . . 4-12
4.2.2 Determine If Code Benefits by Conversion to SIMD Execution . . . . . 4-12
4.3 CODING TECHNIQUES . . . . . 4-12
4.3.1 Coding Methodologies . . . . . 4-13
4.3.1.1 Assembly . . . . . 4-14
4.3.1.2 Intrinsics . . . . . 4-14
4.3.1.3 Classes . . . . . 4-15
4.3.1.4 Automatic Vectorization . . . . . 4-16
4.4 STACK AND DATA ALIGNMENT . . . . . 4-17
4.4.1 Alignment and Contiguity of Data Access Patterns . . . . . 4-17
4.4.1.1 Using Padding to Align Data . . . . . 4-17
4.4.1.2 Using Arrays to Make Data Contiguous . . . . . 4-17
4.4.2 Stack Alignment For 128-bit SIMD Technologies . . . . . 4-18
4.4.3 Data Alignment for MMX Technology . . . . . 4-18
4.4.4 Data Alignment for 128-bit data . . . . . 4-19
4.4.4.1 Compiler-Supported Alignment . . . . . 4-19
4.5 IMPROVING MEMORY UTILIZATION . . . . . 4-20
4.5.1 Data Structure Layout . . . . . 4-20
4.5.2 Strip-Mining . . . . . 4-23
4.5.3 Loop Blocking . . . . . 4-24


4.6 INSTRUCTION SELECTION . . . . . 4-26
4.6.1 SIMD Optimizations and Microarchitectures . . . . . 4-27
4.7 TUNING THE FINAL APPLICATION . . . . . 4-28

CHAPTER 5
OPTIMIZING FOR SIMD INTEGER APPLICATIONS
5.1 GENERAL RULES ON SIMD INTEGER CODE . . . . . 5-1
5.2 USING SIMD INTEGER WITH X87 FLOATING-POINT . . . . . 5-2
5.2.1 Using the EMMS Instruction . . . . . 5-2
5.2.2 Guidelines for Using EMMS Instruction . . . . . 5-2
5.3 DATA ALIGNMENT . . . . . 5-3
5.4 DATA MOVEMENT CODING TECHNIQUES . . . . . 5-5
5.4.1 Unsigned Unpack . . . . . 5-5
5.4.2 Signed Unpack . . . . . 5-5
5.4.3 Interleaved Pack with Saturation . . . . . 5-6
5.4.4 Interleaved Pack without Saturation . . . . . 5-7
5.4.5 Non-Interleaved Unpack . . . . . 5-8
5.4.6 Extract Data Element . . . . . 5-9
5.4.7 Insert Data Element . . . . . 5-10
5.4.8 Non-Unit Stride Data Movement . . . . . 5-11
5.4.9 Move Byte Mask to Integer . . . . . 5-12
5.4.10 Packed Shuffle Word for 64-bit Registers . . . . . 5-12
5.4.11 Packed Shuffle Word for 128-bit Registers . . . . . 5-13
5.4.12 Shuffle Bytes . . . . . 5-13
5.4.13 Conditional Data Movement . . . . . 5-14
5.4.14 Unpacking/interleaving 64-bit Data in 128-bit Registers . . . . . 5-14
5.4.15 Data Movement . . . . . 5-14
5.4.16 Conversion Instructions . . . . . 5-14
5.5 GENERATING CONSTANTS . . . . . 5-14
5.6 BUILDING BLOCKS . . . . . 5-15
5.6.1 Absolute Difference of Unsigned Numbers . . . . . 5-15
5.6.2 Absolute Difference of Signed Numbers . . . . . 5-16
5.6.3 Absolute Value . . . . . 5-16
5.6.4 Pixel Format Conversion . . . . . 5-17
5.6.5 Endian Conversion . . . . . 5-18
5.6.6 Clipping to an Arbitrary Range [High, Low] . . . . . 5-19
5.6.6.1 Highly Efficient Clipping . . . . . 5-19
5.6.6.2 Clipping to an Arbitrary Unsigned Range [High, Low] . . . . . 5-21
5.6.7 Packed Max/Min of Byte, Word and Dword . . . . . 5-21
5.6.8 Packed Multiply Integers . . . . . 5-21
5.6.9 Packed Sum of Absolute Differences . . . . . 5-22
5.6.10 MPSADBW and PHMINPOSUW . . . . . 5-22
5.6.11 Packed Average (Byte/Word) . . . . . 5-22
5.6.12 Complex Multiply by a Constant . . . . . 5-22
5.6.13 Packed 64-bit Add/Subtract . . . . . 5-23
5.6.14 128-bit Shifts . . . . . 5-23
5.6.15 PTEST and Conditional Branch . . . . . 5-23
5.6.16 Vectorization of Heterogeneous Computations across Loop Iterations . . . . . 5-24
5.6.17 Vectorization of Control Flows in Nested Loops . . . . . 5-25
5.7 MEMORY OPTIMIZATIONS . . . . . 5-27
5.7.1 Partial Memory Accesses . . . . . 5-28
5.7.1.1 Supplemental Techniques for Avoiding Cache Line Splits . . . . . 5-29
5.7.2 Increasing Bandwidth of Memory Fills and Video Fills . . . . . 5-30
5.7.2.1 Increasing Memory Bandwidth Using the MOVDQ Instruction . . . . . 5-30
5.7.2.2 Increasing Memory Bandwidth by Loading and Storing to and from the Same DRAM Page . . . . . 5-30
5.7.2.3 Increasing UC and WC Store Bandwidth by Using Aligned Stores . . . . . 5-31
5.7.3 Reverse Memory Copy . . . . . 5-31
5.8 CONVERTING FROM 64-BIT TO 128-BIT SIMD INTEGERS . . . . . 5-34
5.8.1 SIMD Optimizations and Microarchitectures . . . . . 5-34
5.8.1.1 Packed SSE2 Integer versus MMX Instructions . . . . . 5-34
5.8.1.2 Work-around for False Dependency Issue . . . . . 5-35
5.9 TUNING PARTIALLY VECTORIZABLE CODE . . . . . 5-35
5.10 PARALLEL MODE AES ENCRYPTION AND DECRYPTION . . . . . 5-38
5.10.1 AES Counter Mode of Operation . . . . . 5-38

5.10.2 AES Key Expansion Alternative . . . . . 5-46
5.10.3 Enhancement in Intel Microarchitecture Code Name Haswell . . . . . 5-48
5.10.3.1 AES and Multi-Buffer Cryptographic Throughput . . . . . 5-48
5.10.3.2 PCLMULQDQ Improvement . . . . . 5-48
5.11 LIGHT-WEIGHT DECOMPRESSION AND DATABASE PROCESSING . . . . . 5-48
5.11.1 Reduced Dynamic Range Datasets . . . . . 5-49
5.11.2 Compression and Decompression Using SIMD Instructions . . . . . 5-49

CHAPTER 6
OPTIMIZING FOR SIMD FLOATING-POINT APPLICATIONS
6.1        GENERAL RULES FOR SIMD FLOATING-POINT CODE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-1
6.2        PLANNING CONSIDERATIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-1
6.3        USING SIMD FLOATING-POINT WITH X87 FLOATING-POINT . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-2
6.4        SCALAR FLOATING-POINT CODE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-2
6.5        DATA ALIGNMENT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-2
6.5.1      Data Arrangement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-2
6.5.1.1    Vertical versus Horizontal Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-3
6.5.1.2    Data Swizzling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-5
6.5.1.3    Data Deswizzling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-7
6.5.1.4    Horizontal ADD Using SSE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-8
6.5.2      Use of CVTTPS2PI/CVTTSS2SI Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-10
6.5.3      Flush-to-Zero and Denormals-are-Zero Modes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-10
6.6        SIMD OPTIMIZATIONS AND MICROARCHITECTURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-11
6.6.1      SIMD Floating-point Programming Using SSE3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-11
6.6.1.1    SSE3 and Complex Arithmetics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-12
6.6.1.2    Packed Floating-Point Performance in Intel Core Duo Processor . . . . . . . . . . . . . . . . . . . . . 6-14
6.6.2      Dot Product and Horizontal SIMD Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-14
6.6.3      Vector Normalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-16
6.6.4      Using Horizontal SIMD Instruction Sets and Data Layout . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-18
6.6.4.1    SOA and Vector Matrix Multiplication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-20

CHAPTER 7
OPTIMIZING CACHE USAGE
7.1        GENERAL PREFETCH CODING GUIDELINES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-1
7.2        PREFETCH AND CACHEABILITY INSTRUCTIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-2
7.3        PREFETCH . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-3
7.3.1      Software Data Prefetch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-3
7.3.2      Prefetch Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-3
7.3.3      Prefetch and Load Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-5
7.4        CACHEABILITY CONTROL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-5
7.4.1      The Non-temporal Store Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-5
7.4.1.1    Fencing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-6
7.4.1.2    Streaming Non-temporal Stores . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-6
7.4.1.3    Memory Type and Non-temporal Stores . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-6
7.4.1.4    Write-Combining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-6
7.4.2      Streaming Store Usage Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-7
7.4.2.1    Coherent Requests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-7
7.4.2.2    Non-coherent requests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-7
7.4.3      Streaming Store Instruction Descriptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-8
7.4.4      The Streaming Load Instruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-8
7.4.5      FENCE Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-8
7.4.5.1    SFENCE Instruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-8
7.4.5.2    LFENCE Instruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-9
7.4.5.3    MFENCE Instruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-9
7.4.6      CLFLUSH Instruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-9
7.4.7      CLFLUSHOPT Instruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-10
7.5        MEMORY OPTIMIZATION USING PREFETCH . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-12
7.5.1      Software-Controlled Prefetch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-12
7.5.2      Hardware Prefetch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-12
7.5.3      Example of Effective Latency Reduction with Hardware Prefetch . . . . . . . . . . . . . . . . . . . . 7-13
7.5.4      Example of Latency Hiding with S/W Prefetch Instruction . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-14
7.5.5      Software Prefetching Usage Checklist . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-15
7.5.6      Software Prefetch Scheduling Distance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-16
7.5.7      Software Prefetch Concatenation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-16
7.5.8      Minimize Number of Software Prefetches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-17
7.5.9      Mix Software Prefetch with Computation Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-19
7.5.10     Software Prefetch and Cache Blocking Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-19
7.5.11     Hardware Prefetching and Cache Blocking Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-23
7.5.12     Single-pass versus Multi-pass Execution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-24
7.6        MEMORY OPTIMIZATION USING NON-TEMPORAL STORES . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-25
7.6.1      Non-temporal Stores and Software Write-Combining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-25
7.6.2      Cache Management . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-26
7.6.2.1    Video Encoder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-26
7.6.2.2    Video Decoder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-26
7.6.2.3    Conclusions from Video Encoder and Decoder Implementation . . . . . . . . . . . . . . . . . . . . . . 7-27
7.6.2.4    Optimizing Memory Copy Routines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-27
7.6.2.5    TLB Priming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-28
7.6.2.6    Using the 8-byte Streaming Stores and Software Prefetch . . . . . . . . . . . . . . . . . . . . . . . . . 7-29
7.6.2.7    Using 16-byte Streaming Stores and Hardware Prefetch . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-29
7.6.2.8    Performance Comparisons of Memory Copy Routines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-30
7.6.3      Deterministic Cache Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-31
7.6.3.1    Cache Sharing Using Deterministic Cache Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-32
7.6.3.2    Cache Sharing in Single-Core or Multicore . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-32
7.6.3.3    Determine Prefetch Stride . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-32

CHAPTER 8
MULTICORE AND HYPER-THREADING TECHNOLOGY
8.1        PERFORMANCE AND USAGE MODELS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-1
8.1.1      Multithreading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-1
8.1.2      Multitasking Environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-2
8.2        PROGRAMMING MODELS AND MULTITHREADING . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-3
8.2.1      Parallel Programming Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-4
8.2.1.1    Domain Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-4
8.2.2      Functional Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-4
8.2.3      Specialized Programming Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-4
8.2.3.1    Producer-Consumer Threading Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-5
8.2.4      Tools for Creating Multithreaded Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-7
8.2.4.1    Programming with OpenMP Directives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-8
8.2.4.2    Automatic Parallelization of Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-8
8.2.4.3    Supporting Development Tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-8
8.3        OPTIMIZATION GUIDELINES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-8
8.3.1      Key Practices of Thread Synchronization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-8
8.3.2      Key Practices of System Bus Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-9
8.3.3      Key Practices of Memory Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-9
8.3.4      Key Practices of Execution Resource Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-9
8.3.5      Generality and Performance Impact . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-10
8.4        THREAD SYNCHRONIZATION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-10
8.4.1      Choice of Synchronization Primitives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-10
8.4.2      Synchronization for Short Periods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-11
8.4.3      Optimization with Spin-Locks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-13
8.4.4      Synchronization for Longer Periods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-13
8.4.4.1    Avoid Coding Pitfalls in Thread Synchronization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-14
8.4.5      Prevent Sharing of Modified Data and False-Sharing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-14
8.4.6      Placement of Shared Synchronization Variable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-15
8.4.7      Pause Latency in Skylake Microarchitecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-16
8.5        SYSTEM BUS OPTIMIZATION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-17
8.5.1      Conserve Bus Bandwidth . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-17
8.5.2      Understand the Bus and Cache Interactions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-18
8.5.3      Avoid Excessive Software Prefetches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-18
8.5.4      Improve Effective Latency of Cache Misses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-18
8.5.5      Use Full Write Transactions to Achieve Higher Data Rate . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-19
8.6        MEMORY OPTIMIZATION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-19
8.6.1      Cache Blocking Technique . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-19
8.6.2      Shared-Memory Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-20
8.6.2.1    Minimize Sharing of Data between Physical Processors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-20
8.6.2.2    Batched Producer-Consumer Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-20
8.6.3      Eliminate 64-KByte Aliased Data Accesses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-22
8.7        FRONT END OPTIMIZATION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-22
8.7.1      Avoid Excessive Loop Unrolling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-22
8.8        AFFINITIES AND MANAGING SHARED PLATFORM RESOURCES . . . . . . . . . . . . . . . . . . . . . . . 8-22
8.8.1      Topology Enumeration of Shared Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-24
8.8.2      Non-Uniform Memory Access . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-24
8.9        OPTIMIZATION OF OTHER SHARED RESOURCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-25
8.9.1      Expanded Opportunity for HT Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-26

CHAPTER 9
64-BIT MODE CODING GUIDELINES
9.1        INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-1
9.2        CODING RULES AFFECTING 64-BIT MODE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-1
9.2.1      Use Legacy 32-Bit Instructions When Data Size Is 32 Bits . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-1
9.2.2      Use Extra Registers to Reduce Register Pressure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-1
9.2.3      Effective Use of 64-Bit by 64-Bit Multiplies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-2
9.2.4      Replace 128-bit Integer Division with 128-bit Multiplies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-2
9.2.5      Sign Extension to Full 64-Bits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-4
9.3        ALTERNATE CODING RULES FOR 64-BIT MODE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-5
9.3.1      Use 64-Bit Registers Instead of Two 32-Bit Registers for 64-Bit Arithmetic Result . . . . . . 9-5
9.3.2      CVTSI2SS and CVTSI2SD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-6
9.3.3      Using Software Prefetch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-6

CHAPTER 10
SSE4.2 AND SIMD PROGRAMMING FOR TEXT-PROCESSING/LEXING/PARSING
10.1       SSE4.2 STRING AND TEXT INSTRUCTIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-1
10.1.1     CRC32 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-4
10.2       USING SSE4.2 STRING AND TEXT INSTRUCTIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-5
10.2.1     Unaligned Memory Access and Buffer Size Management . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-5
10.2.2     Unaligned Memory Access and String Library . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-6
10.3       SSE4.2 APPLICATION CODING GUIDELINE AND EXAMPLES . . . . . . . . . . . . . . . . . . . . . . . . . . 10-6
10.3.1     Null Character Identification (Strlen equivalent) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-6
10.3.2     White-Space-Like Character Identification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-9
10.3.3     Substring Searches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-11
10.3.4     String Token Extraction and Case Handling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-18
10.3.5     Unicode Processing and PCMPxSTRy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-22
10.3.6     Replacement String Library Function Using SSE4.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-26
10.4       SSE4.2 ENABLED NUMERICAL AND LEXICAL COMPUTATION . . . . . . . . . . . . . . . . . . . . . . . . 10-28
10.5       NUMERICAL DATA CONVERSION TO ASCII FORMAT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-34
10.5.1     Large Integer Numeric Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-48
10.5.1.1   MULX Instruction and Large Integer Numeric Computation . . . . . . . . . . . . . . . . . . . . . . . . . 10-48

CHAPTER 11
OPTIMIZATIONS FOR INTEL® AVX, FMA AND AVX2
11.1       INTEL® AVX INTRINSICS CODING . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-2
11.1.1     Intel® AVX Assembly Coding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-4
11.2       NON-DESTRUCTIVE SOURCE (NDS) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-5
11.3       MIXING AVX CODE WITH SSE CODE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-7
11.3.1     Mixing Intel® AVX and Intel SSE in Function Calls . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-9
11.4       128-BIT LANE OPERATION AND AVX . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-10
11.4.1     Programming With the Lane Concept . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-11
11.4.2     Strided Load Technique . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-11
11.4.3     The Register Overlap Technique . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-14
11.5       DATA GATHER AND SCATTER . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-15
11.5.1     Data Gather . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-15
11.5.2     Data Scatter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-17
11.6       DATA ALIGNMENT FOR INTEL® AVX . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-19
11.6.1     Align Data to 32 Bytes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-19
11.6.2     Consider 16-Byte Memory Access when Memory is Unaligned . . . . . . . . . . . . . . . . . . . . . . 11-20
11.6.3     Prefer Aligned Stores Over Aligned Loads . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-22
11.7       L1D CACHE LINE REPLACEMENTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-22
11.8       4K ALIASING . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-22
11.9       CONDITIONAL SIMD PACKED LOADS AND STORES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-23
11.9.1     Conditional Loops . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-24
11.10      MIXING INTEGER AND FLOATING-POINT CODE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-25
11.11      HANDLING PORT 5 PRESSURE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-28
11.11.1    Replace Shuffles with Blends . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-28
11.11.2    Design Algorithm With Fewer Shuffles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-30
11.11.3    Perform Basic Shuffles on Load Ports . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-32
11.12      DIVIDE AND SQUARE ROOT OPERATIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-34
11.12.1    Single-Precision Divide . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-35
11.12.2    Single-Precision Reciprocal Square Root . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-37
11.12.3    Single-Precision Square Root . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-39
11.13      OPTIMIZATION OF ARRAY SUB SUM EXAMPLE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-41
11.14      HALF-PRECISION FLOATING-POINT CONVERSIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-43
11.14.1    Packed Single-Precision to Half-Precision Conversion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-43
11.14.2    Packed Half-Precision to Single-Precision Conversion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-44
11.14.3    Locality Consideration for using Half-Precision FP to Conserve Bandwidth . . . . . . . . . . . . 11-45
11.15      FUSED MULTIPLY-ADD (FMA) INSTRUCTIONS GUIDELINES . . . . . . . . . . . . . . . . . . . . . . . . . 11-46
11.15.1    Optimizing Throughput with FMA and Floating-Point Add/MUL . . . . . . . . . . . . . . . . . . . . . . 11-47
11.15.2    Optimizing Throughput with Vector Shifts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-48
11.16      AVX2 OPTIMIZATION GUIDELINES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-49
11.16.1    Multi-Buffering and AVX2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-54
11.16.2    Modular Multiplication and AVX2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-54
11.16.3    Data Movement Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-54
11.16.3.1  SIMD Heuristics to implement Memcpy() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-55
11.16.3.2  Memcpy() Implementation Using Enhanced REP MOVSB . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-55
11.16.3.3  Memset() Implementation Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-56
11.16.3.4  Hoisting Memcpy/Memset Ahead of Consuming Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-57
11.16.3.5  256-bit Fetch versus Two 128-bit Fetches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-57
11.16.3.6  Mixing MULX and AVX2 Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-57
11.16.4    Considerations for Gather Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-64
11.16.4.1  Strided Loads . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-67
11.16.4.2  Adjacent Loads . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-68
11.16.5    AVX2 Conversion Remedy to MMX Instruction Throughput Limitation . . . . . . . . . . . . . . . . 11-69

CHAPTER 12
INTEL® TSX RECOMMENDATIONS

12.1       INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-1
12.1.1     Optimization Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-2
12.2       APPLICATION-LEVEL TUNING AND OPTIMIZATIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-2
12.2.1     Existing TSX-enabled Locking Libraries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-3
12.2.1.1   Libraries allowing lock elision for unmodified programs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-3
12.2.1.2   Libraries requiring program modifications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-3
12.2.2     Initial Checks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-3
12.2.3     Run and Profile the Application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-3
12.2.4     Minimize Transactional Aborts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-4
12.2.4.1   Transactional Aborts due to Data Conflicts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-5
12.2.4.2   Transactional Aborts due to Limited Transactional Resources . . . . . . . . . . . . . . . . . . . . . . . 12-6
12.2.4.3   Lock Elision Specific Transactional Aborts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-7
12.2.4.4   HLE Specific Transactional Aborts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-7
12.2.4.5   Miscellaneous Transactional Aborts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-8
12.2.5     Using Transactional-Only Code Paths . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-9
12.2.6     Dealing with Transactional Regions or Paths that Abort at a High Rate . . . . . . . . . . . . . . . 12-9
12.2.6.1   Transitioning to Non-Elided Execution without Aborting . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-9
12.2.6.2   Forcing an Early Abort . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-10
12.2.6.3   Not Eliding Selected Locks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-10
12.3       DEVELOPING AN INTEL TSX ENABLED SYNCHRONIZATION LIBRARY . . . . . . . . . . . . . . . . 12-10
12.3.1     Adding HLE Prefixes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-10
12.3.2     Elision Friendly Critical Section Locks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-10
12.3.3     Using HLE or RTM for Lock Elision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-11
12.3.4     An example wrapper for lock elision using RTM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-11
12.3.5     Guidelines for the RTM fallback handler . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-12
12.3.6     Implementing Elision-Friendly Locks using Intel TSX . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-13
12.3.6.1   Implementing a Simple Spinlock using HLE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-13
12.3.6.2   Implementing Reader-Writer Locks using Intel TSX . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-15
12.3.6.3   Implementing Ticket Locks using Intel TSX . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-15
12.3.6.4   Implementing Queue-Based Locks using Intel TSX . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-15
12.3.7     Eliding Application-Specific Meta-Locks using Intel TSX . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-16
12.3.8     Avoiding Persistent Non-Elided Execution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-17
12.3.9     Reading the Value of an Elided Lock in RTM-based libraries . . . . . . . . . . . . . . . . . . . . . . . . 12-18
12.3.10    Intermixing HLE and RTM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-19
12.4       USING THE PERFORMANCE MONITORING SUPPORT FOR INTEL TSX . . . . . . . . . . . . . . . . . 12-20
12.4.1     Measuring Transactional Success . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-20
12.4.2     Finding locks to elide and verifying all locks are elided . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-20
12.4.3     Sampling Transactional Aborts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-21
12.4.4     Classifying Aborts using a Profiling Tool . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-21
12.4.5     XABORT Arguments for RTM fallback handlers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-22
12.4.6     Call Graphs for Transactional Aborts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-22
12.4.7     Last Branch Records and Transactional Aborts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-22
12.4.8     Profiling and Testing Intel TSX Software using the Intel SDE . . . . . . . . . . . . . . . . . . . . . . . 12-23
12.4.9     HLE Specific Performance Monitoring Events . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-23
12.4.10    Computing Useful Metrics for Intel TSX . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-24
12.5       PERFORMANCE GUIDELINES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-25
12.6       DEBUGGING GUIDELINES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-25
12.7       COMMON INTRINSICS FOR INTEL TSX . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-26
12.7.1     RTM C intrinsics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-26
12.7.1.1   Emulated RTM intrinsics on older gcc compatible compilers . . . . . . . . . . . . . . . . . . . . . . . . . 12-26
12.7.2     HLE intrinsics on gcc and other Linux compatible compilers . . . . . . . . . . . . . . . . . . . . . . . . . 12-27
12.7.2.1   Generating HLE intrinsics with gcc4.8 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-28
12.7.2.2   C++11 atomic support . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-28
12.7.2.3   Emulating HLE intrinsics with older gcc-compatible compilers . . . . . . . . . . . . . . . . . . . . . . . 12-28
12.7.3     HLE intrinsics on Windows C/C++ compilers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-29

CHAPTER 13
POWER OPTIMIZATION FOR MOBILE USAGES

13.1       OVERVIEW . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-1
13.2       MOBILE USAGE SCENARIOS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-1
13.2.1     Intelligent Energy Efficient Software . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-2
13.3       ACPI C-STATES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-3
13.3.1     Processor-Specific C4 and Deep C4 States . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-4
13.3.2     Processor-Specific Deep C-States and Intel® Turbo Boost Technology . . . . . . . . . . . . . . . . 13-4
13.3.3     Processor-Specific Deep C-States for Intel® Microarchitecture Code Name Sandy Bridge . 13-5
13.3.4     Intel® Turbo Boost Technology 2.0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-6
13.4       GUIDELINES FOR EXTENDING BATTERY LIFE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-6
13.4.1     Adjust Performance to Meet Quality of Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-6
13.4.2     Reducing Amount of Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-7
13.4.3     Platform-Level Optimizations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-7
13.4.4     Handling Sleep State Transitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-8
13.4.5     Using Enhanced Intel SpeedStep® Technology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-8
13.4.6     Enabling Intel® Enhanced Deeper Sleep . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-9
13.4.7     Multicore Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-10
13.4.7.1   Enhanced Intel SpeedStep® Technology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-10
13.4.7.2   Thread Migration Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-10
13.4.7.3   Multicore Considerations for C-States . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-11
13.5       TUNING SOFTWARE FOR INTELLIGENT POWER CONSUMPTION . . . . . . . . . . . . . . . . . . . . . 13-12
13.5.1     Reduction of Active Cycles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-12
13.5.1.1   Multi-threading to reduce Active Cycles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-12
13.5.1.2   Vectorization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-13
13.5.2     PAUSE and Sleep(0) Loop Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-14
13.5.3     Spin-Wait Loops . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-15
13.5.4     Using Event Driven Service Instead of Polling in Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-15
13.5.5     Reducing Interrupt Rate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-15
13.5.6     Reducing Privileged Time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-15
13.5.7     Setting Context Awareness in the Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-16
13.5.8     Saving Energy by Optimizing for Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-17
13.6       PROCESSOR SPECIFIC POWER MANAGEMENT OPTIMIZATION FOR SYSTEM SOFTWARE 13-17
13.6.1     Power Management Recommendation of Processor-Specific Inactive State Configurations 13-17
13.6.1.1   Balancing Power Management and Responsiveness of Inactive To Active State Transitions 13-19

CHAPTER 14
INTEL® ATOM™ MICROARCHITECTURE AND SOFTWARE OPTIMIZATION
14.1       OVERVIEW . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-1
14.2       INTEL® ATOM™ MICROARCHITECTURE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-1
14.2.1     Hyper-Threading Technology Support in Intel® Atom™ Microarchitecture . . . . . . . . . . . . . . 14-3
14.3       CODING RECOMMENDATIONS FOR INTEL® ATOM™ MICROARCHITECTURE . . . . . . . . . . . . . 14-3
14.3.1     Optimization for Front End of Intel® Atom™ Microarchitecture . . . . . . . . . . . . . . . . . . . . . . . 14-3
14.3.2     Optimizing the Execution Core . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-5
14.3.2.1   Integer Instruction Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-5
14.3.2.2   Address Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-6
14.3.2.3   Integer Multiply . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-6
14.3.2.4   Integer Shift Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-7
14.3.2.5   Partial Register Access . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-7
14.3.2.6   FP/SIMD Instruction Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-7
14.3.3     Optimizing Memory Access . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-9
14.3.3.1   Store Forwarding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-9
14.3.3.2   First-level Data Cache . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-10
14.3.3.3   Segment Base . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-10
14.3.3.4   String Moves . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-10
14.3.3.5   Parameter Passing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-11
14.3.3.6   Function Calls . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-11
14.3.3.7   Optimization of Multiply/Add Dependent Chains . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-11
14.3.3.8   Position Independent Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-13
14.4       INSTRUCTION LATENCY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-14

CHAPTER 15
SILVERMONT MICROARCHITECTURE AND SOFTWARE OPTIMIZATION
15.1       OVERVIEW . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-1
15.1.1     Intel Atom Processor Family Based on the Silvermont Microarchitecture . . . . . . . . . . . . . . 15-1
15.2       SILVERMONT MICROARCHITECTURE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-1
15.2.1     Integer Pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-4
15.2.2     Floating-Point Pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-4
15.3       CODING RECOMMENDATIONS FOR SILVERMONT MICROARCHITECTURE . . . . . . . . . . . . . . . 15-4
15.3.1     Optimizing The Front End . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-4
15.3.1.1   Instruction Decoder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-4
15.3.1.2   Front End High IPC Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-4
15.3.1.3   Loop Unrolling and Loop Stream Detector . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-5
15.3.2     Optimizing The Execution Core . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-6
15.3.2.1   Scheduling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-6
15.3.2.2   Address Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-6
15.3.2.3   FP Multiply-Accumulate-Store Execution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-6
15.3.2.4   Integer Multiply Execution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-7
15.3.2.5   Zeroing Idioms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-8
15.3.2.6   Flags usage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-8
15.3.2.7   Instruction Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-9
15.3.3     Optimizing Memory Accesses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-11
15.3.3.1   Memory Reissue/Sleep causes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-11
15.3.3.2   Store Forwarding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-11
15.3.3.3   PrefetchW Instruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-11
15.3.3.4   Cache Line Splits and Alignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-12
15.3.3.5   Segment Base . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-12
15.3.3.6   Copy and String Copy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-12
15.4       INSTRUCTION LATENCY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-12

CHAPTER 16
KNIGHTS LANDING MICROARCHITECTURE AND SOFTWARE OPTIMIZATION
16.1      KNIGHTS LANDING MICROARCHITECTURE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16-2
16.1.1      Front End . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16-3
16.1.2      Out-of-Order Engine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16-3
16.1.3      UnTile . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16-5
16.2      INTEL® AVX-512 CODING RECOMMENDATIONS FOR KNIGHTS LANDING MICROARCHITECTURE . . . . . . . . 16-7
16.2.1      Using Gather and Scatter Instructions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16-7
16.2.2      Using Enhanced Reciprocal Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16-8
16.2.3      Using AVX-512CD Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16-8
16.2.4      Using Intel® Hyper-Threading Technology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16-9
16.2.5      Front End Considerations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16-10
16.2.5.1      Instruction Decoder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16-10
16.2.5.2      Branching Across 4GB Boundary. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16-11
16.2.6      Integer Execution Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16-11
16.2.6.1      Flags usage. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16-11
16.2.6.2      Integer Division . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16-11
16.2.7      Optimizing FP and Vector Execution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16-11
16.2.7.1      Instruction Selection Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16-11
16.2.7.2      Porting Intrinsic From Prior Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16-13
16.2.7.3      Vectorization Trade-Off Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16-14
16.2.8      Memory Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16-16
16.2.8.1      Data Alignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16-16
16.2.8.2      Hardware Prefetcher . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16-17
16.2.8.3      Software Prefetch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16-17
16.2.8.4      Memory Execution Cluster . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16-17
16.2.8.5      Store Forwarding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16-18
16.2.8.6      Way, Set Conflicts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16-18
16.2.8.7      Streaming Store Versus Regular Store . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16-19
16.2.8.8      Compiler Switches and Directives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16-19
16.2.8.9      Direct Mapped MCDRAM Cache . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16-19

APPENDIX A
APPLICATION PERFORMANCE TOOLS
A.1      COMPILERS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-2
A.1.1      Recommended Optimization Settings for Intel® 64 and IA-32 Processors. . . . . . . . . . . . . . . . . . . . . . . . . . . A-2
A.1.2      Vectorization and Loop Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-2
A.1.2.1      Multithreading with OpenMP* . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-3
A.1.2.2      Automatic Multithreading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-3
A.1.3      Inline Expansion of Library Functions (/Oi, /Oi-) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-3
A.1.4      Interprocedural and Profile-Guided Optimizations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-3
A.1.4.1      Interprocedural Optimization (IPO) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-3
A.1.4.2      Profile-Guided Optimization (PGO) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-3
A.1.5      Intel® Cilk Plus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-4
A.2      PERFORMANCE LIBRARIES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-4
A.2.1      Intel® Integrated Performance Primitives (Intel® IPP) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-4
A.2.2      Intel® Math Kernel Library (Intel® MKL) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-5
A.2.3      Intel® Threading Building Blocks (Intel® TBB) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-5
A.2.4      Benefits Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-5
A.3      PERFORMANCE PROFILERS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-5
A.3.1      Intel® VTune™ Amplifier XE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-5
A.3.1.1      Hardware Event-Based Sampling Analysis. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-6
A.3.1.2      Algorithm Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-6
A.3.1.3      Platform Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-6
A.4      THREAD AND MEMORY CHECKERS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-6
A.4.1      Intel® Inspector . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-6
A.5      VECTORIZATION ASSISTANT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-7
A.5.1      Intel® Advisor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-7
A.6      CLUSTER TOOLS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-7
A.6.1      Intel® Trace Analyzer and Collector. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-7
A.6.1.1      MPI Performance Snapshot. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-7
A.6.2      Intel® MPI Library . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-7
A.6.3      Intel® MPI Benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-8
A.7      INTEL® ACADEMIC COMMUNITY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-8

APPENDIX B
USING PERFORMANCE MONITORING EVENTS
B.1      TOP-DOWN ANALYSIS METHOD. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-1
B.1.1      Top-Level . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-2
B.1.2      Front End Bound. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-3
B.1.3      Back End Bound . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-4
B.1.4      Memory Bound . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-4
B.1.5      Core Bound. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-5
B.1.6      Bad Speculation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-5
B.1.7      Retiring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-6
B.1.8      TMAM and Skylake Microarchitecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-6
B.1.8.1      TMAM Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-6
B.2      INTEL® XEON® PROCESSOR 5500 SERIES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-7
B.3      PERFORMANCE ANALYSIS TECHNIQUES FOR INTEL® XEON® PROCESSOR 5500 SERIES . . . . . . . . . . . . . . . . B-8
B.3.1      Cycle Accounting and Uop Flow Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-9
B.3.1.1      Cycle Drill Down and Branch Mispredictions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-10
B.3.1.2      Basic Block Drill Down. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-13
B.3.2      Stall Cycle Decomposition and Core Memory Accesses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-14
B.3.2.1      Measuring Costs of Microarchitectural Conditions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-14
B.3.3      Core PMU Precise Events. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-15
B.3.3.1      Precise Memory Access Events . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-16
B.3.3.2      Load Latency Event. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-17
B.3.3.3      Precise Execution Events . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-19
B.3.3.4      Last Branch Record (LBR) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-20
B.3.3.5      Measuring Core Memory Access Latency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-22
B.3.3.6      Measuring Per-Core Bandwidth . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-24
B.3.3.7      Miscellaneous L1 and L2 Events for Cache Misses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-25
B.3.3.8      TLB Misses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-25
B.3.3.9      L1 Data Cache . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-26
B.3.4      Front End Monitoring Events . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-26
B.3.4.1      Branch Mispredictions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-26
B.3.4.2      Front End Code Generation Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-26
B.3.5      Uncore Performance Monitoring Events . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-27
B.3.5.1      Global Queue Occupancy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-27
B.3.5.2      Global Queue Port Events . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-29
B.3.5.3      Global Queue Snoop Events . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-29
B.3.5.4      L3 Events . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-30
B.3.6      Intel QuickPath Interconnect Home Logic (QHL). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-30
B.3.7      Measuring Bandwidth From the Uncore . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-35
B.4      PERFORMANCE TUNING TECHNIQUES FOR INTEL® MICROARCHITECTURE CODE NAME SANDY BRIDGE . . B-36
B.4.1      Correlating Performance Bottleneck to Source Location . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-36
B.4.2      Hierarchical Top-Down Performance Characterization Methodology and Locating Performance
           Bottlenecks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-37
B.4.2.1      Back End Bound Characterization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-38
B.4.2.2      Core Bound Characterization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-38
B.4.2.3      Memory Bound Characterization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-38
B.4.3      Back End Stalls . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-39
B.4.4      Memory Sub-System Stalls . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-40
B.4.4.1      Accounting for Load Latency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-41
B.4.4.2      Cache-line Replacement Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-42
B.4.4.3      Lock Contention Analysis. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-43
B.4.4.4      Other Memory Access Issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-43
B.4.5      Execution Stalls. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-46
B.4.5.1      Longer Instruction Latencies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-46
B.4.5.2      Assists . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-46
B.4.6      Bad Speculation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-47
B.4.6.1      Branch Mispredicts. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-47
B.4.7      Front End Stalls. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-47
B.4.7.1      Understanding the Micro-op Delivery Rate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-47
B.4.7.2      Understanding the Sources of the Micro-op Queue . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-49
B.4.7.3      The Decoded ICache . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-50
B.4.7.4      Issues in the Legacy Decode Pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-51
B.4.7.5      Instruction Cache . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-51
B.5      USING PERFORMANCE EVENTS OF INTEL® CORE™ SOLO AND INTEL® CORE™ DUO PROCESSORS. . . . . . . . B-52

B.5.1      Understanding the Results in a Performance Counter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-52
B.5.2      Ratio Interpretation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-52
B.5.3      Notes on Selected Events . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-53
B.6      DRILL-DOWN TECHNIQUES FOR PERFORMANCE ANALYSIS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-53
B.6.1      Cycle Composition at Issue Port. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-55
B.6.2      Cycle Composition of OOO Execution. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-55
B.6.3      Drill-Down on Performance Stalls . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-56
B.7      EVENT RATIOS FOR INTEL CORE MICROARCHITECTURE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-57
B.7.1      Clocks Per Instructions Retired Ratio (CPI). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-57
B.7.2      Front End Ratios . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-58
B.7.2.1      Code Locality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-58
B.7.2.2      Branching and Front End . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-58
B.7.2.3      Stack Pointer Tracker . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-58
B.7.2.4      Macro-fusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-58
B.7.2.5      Length Changing Prefix (LCP) Stalls . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-59
B.7.2.6      Self Modifying Code Detection. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-59
B.7.3      Branch Prediction Ratios . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-59
B.7.3.1      Branch Mispredictions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-59
B.7.3.2      Virtual Tables and Indirect Calls. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-59
B.7.3.3      Mispredicted Returns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-60
B.7.4      Execution Ratios. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-60
B.7.4.1      Resource Stalls . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-60
B.7.4.2      ROB Read Port Stalls. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-60
B.7.4.3      Partial Register Stalls . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-60
B.7.4.4      Partial Flag Stalls . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-60
B.7.4.5      Bypass Between Execution Domains. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-60
B.7.4.6      Floating-Point Performance Ratios. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-60
B.7.5      Memory Sub-System - Access Conflicts Ratios . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-61
B.7.5.1      Loads Blocked by the L1 Data Cache . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-61
B.7.5.2      4K Aliasing and Store Forwarding Block Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-61
B.7.5.3      Load Block by Preceding Stores. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-61
B.7.5.4      Memory Disambiguation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-62
B.7.5.5      Load Operation Address Translation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-62
B.7.6      Memory Sub-System - Cache Misses Ratios. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-62
B.7.6.1      Locating Cache Misses in the Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-62
B.7.6.2      L1 Data Cache Misses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-62
B.7.6.3      L2 Cache Misses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-62
B.7.7      Memory Sub-system - Prefetching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-63
B.7.7.1      L1 Data Prefetching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-63
B.7.7.2      L2 Hardware Prefetching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-63
B.7.7.3      Software Prefetching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-63
B.7.8      Memory Sub-system - TLB Miss Ratios . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-63
B.7.9      Memory Sub-system - Core Interaction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-64
B.7.9.1      Modified Data Sharing. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-64
B.7.9.2      Fast Synchronization Penalty. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-64
B.7.9.3      Simultaneous Extensive Stores and Load Misses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-64
B.7.10      Memory Sub-system - Bus Characterization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-64
B.7.10.1      Bus Utilization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-64
B.7.10.2      Modified Cache Lines Eviction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-65

APPENDIX C
INSTRUCTION LATENCY AND THROUGHPUT
C.1      OVERVIEW . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . C-1
C.2      DEFINITIONS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . C-2
C.3      LATENCY AND THROUGHPUT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . C-2
C.3.1      Latency and Throughput with Register Operands . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . C-3
C.3.2      Table Footnotes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . C-19
C.3.3      Instructions with Memory Operands . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . C-20
C.3.3.1      Software Observable Latency of Memory References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . C-20

EXAMPLES
Example 3-1.      Assembly Code with an Unpredictable Branch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-5
Example 3-2.      Code Optimization to Eliminate Branches. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-5
Example 3-3.      Eliminating Branch with CMOV Instruction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-6
Example 3-4.      Use of PAUSE Instruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-6
Example 3-5.      Static Branch Prediction Algorithm. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-7
Example 3-6.      Static Taken Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-7
Example 3-7.      Static Not-Taken Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-7
Example 3-8.      Indirect Branch With Two Favored Targets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-10
Example 3-9.      A Peeling Technique to Reduce Indirect Branch Misprediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-10
Example 3-10.     Loop Unrolling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-11
Example 3-11.     Macro-fusion, Unsigned Iteration Count . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-14
Example 3-12.     Macro-fusion, If Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-14
Example 3-13.     Macro-fusion, Signed Variable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-15
Example 3-14.     Macro-fusion, Signed Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-15
Example 3-15.     Additional Macro-fusion Benefit in Intel Microarchitecture Code Name Sandy Bridge. . . . . . . . . 3-16
Example 3-16.     Avoiding False LCP Delays with 0xF7 Group Instructions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-17
Example 3-17.     Unrolling Loops in LSD to Optimize Emission Bandwidth. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-18
Example 3-18.     Independent Two-Operand LEA Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-22
Example 3-19.     Alternative to Three-Operand LEA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-22
Example 3-20.     Examples of 512-bit Additions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-23
Example 3-21.     Clearing Register to Break Dependency While Negating Array Elements . . . . . . . . . . . . . . . . . . . 3-26
Example 3-22.     Spill Scheduling Code. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-28
Example 3-23.     Zero-Latency MOV Instructions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-29
Example 3-24.     Byte-Granular Data Computation Technique. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-29
Example 3-25.     Re-ordering Sequence to Improve Effectiveness of Zero-Latency MOV Instructions . . . . . . . . . 3-30
Example 3-26.     Avoiding Partial Register Stalls in Integer Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-32
Example 3-27.     Avoiding Partial Register Stalls in SIMD Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-33
Example 3-28.     Avoiding Partial Flag Register Stalls. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-34
Example 3-29.     Partial Flag Register Accesses in Intel Microarchitecture Code Name Sandy Bridge . . . . . . . . . . 3-34
Example 3-30.     Reference Code Template for Partially Vectorizable Program . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-36
Example 3-31.     Three Alternate Packing Methods for Avoiding Store Forwarding Difficulty . . . . . . . . . . . . . . . . 3-37
Example 3-32.     Using Four Registers to Reduce Memory Spills and Simplify Result Passing . . . . . . . . . . . . . . . . 3-38
Example 3-33.     Stack Optimization Technique to Simplify Parameter Passing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-38
Example 3-34.     Base Line Code Sequence to Estimate Loop Overhead . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-39
Example 3-35.     Optimize for Load Port Bandwidth in Intel Microarchitecture Code Name Sandy Bridge . . . . . . 3-41
Example 3-36.     Index versus Pointers in Pointer-Chasing Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-42
Example 3-37.     Example of Bank Conflicts in L1D Cache and Remedy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-43
Example 3-38.     Using XMM Register in Lieu of Memory for Register Spills . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-44
Example 3-39.     Loads Blocked by Stores of Unknown Address . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-45
Example 3-40.     Code That Causes Cache Line Split . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-46
Example 3-41.     Situations Showing Small Loads After Large Store. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-49
Example 3-42.     Non-forwarding Example of Large Load After Small Store. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-49
Example 3-43.     A Non-forwarding Situation in Compiler Generated Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-49
Example 3-44.     Two Ways to Avoid Non-forwarding Situation in Example 3-43. . . . . . . . . . . . . . . . . . . . . . . . . . . 3-50
Example 3-45.     Large and Small Load Stalls. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-50
Example 3-46.     Loop-carried Dependence Chain . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-52
Example 3-47.     Rearranging a Data Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-52
Example 3-48.     Decomposing an Array . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-53
Example 3-49.     Examples of Dynamical Stack Alignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-54
Example 3-50.     Aliasing Between Loads and Stores Across Loop Iterations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-56
Example 3-51.     Instruction Pointer Query Techniques. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-57
Example 3-52.     Using Non-temporal Stores and 64-byte Bus Write Transactions . . . . . . . . . . . . . . . . . . . . . . . . . 3-60
Example 3-53.     Non-temporal Stores and Partial Bus Write Transactions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-60
Example 3-54.     Using DCU Hardware Prefetch. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-61
Example 3-55.     Avoid Causing DCU Hardware Prefetch to Fetch Un-needed Lines . . . . . . . . . . . . . . . . . . . . . . . . . 3-62
Example 3-56.     Technique For Using L1 Hardware Prefetch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-63

Example 3-57.     REP STOSD with Arbitrary Count Size and 4-Byte-Aligned Destination . . . . . . . . . . . . . . . . . . . . . 3-65
Example 3-58.     Algorithm to Avoid Changing Rounding Mode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-72
Example 4-1.      Identification of MMX Technology with CPUID . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-2
Example 4-2.      Identification of SSE with CPUID . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-2
Example 4-3.      Identification of SSE2 with cpuid . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-3
Example 4-4.      Identification of SSE3 with CPUID. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-3
Example 4-5.      Identification of SSSE3 with cpuid . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-3
Example 4-6.      Identification of SSE4.1 with cpuid. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-4
Example 4-7.      Identification of SSE4.2 with cpuid. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-4
Example 4-8.      Detection of AESNI Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-5
Example 4-9.      Detection of PCLMULQDQ Instruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-5
Example 4-10.     Detection of AVX Instruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-6
Example 4-11.     Detection of VEX-Encoded AESNI Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-7
Example 4-12.     Detection of VEX-Encoded AESNI Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-7
Example 4-13.     Simple Four-Iteration Loop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-14
Example 4-14.     Streaming SIMD Extensions Using Inlined Assembly Encoding . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-14
Example 4-15.     Simple Four-Iteration Loop Coded with Intrinsics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-15
Example 4-16.     C++ Code Using the Vector Classes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-16
Example 4-17.     Automatic Vectorization for a Simple Loop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-16
Example 4-18.     C Algorithm for 64-bit Data Alignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-18
Example 4-19.     AoS Data Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-21
Example 4-20.     SoA Data Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-21
Example 4-21.     AoS and SoA Code Samples. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-21
Example 4-22.     Hybrid SoA Data Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-22
Example 4-23.     Pseudo-code Before Strip Mining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-23
Example 4-24.     Strip Mined Code. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-24
Example 4-25.     Loop Blocking. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-24
Example 4-26.     Emulation of Conditional Moves. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-26
Example 5-1.      Resetting Register Between __m64 and FP Data Types Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-3
Example 5-2.      FIR Processing Example in C language Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-4
Example 5-3.      SSE2 and SSSE3 Implementation of FIR Processing Code. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-4
Example 5-4.      Zero Extend 16-bit Values into 32 Bits Using Unsigned Unpack Instructions Code . . . . . . . . . . . . . 5-5
Example 5-5.      Signed Unpack Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-5
Example 5-6.      Interleaved Pack with Saturation Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-7
Example 5-7.      Interleaved Pack without Saturation Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-7
Example 5-8.      Unpacking Two Packed-word Sources in Non-interleaved Way Code . . . . . . . . . . . . . . . . . . . . . . . . . 5-9
Example 5-9.      PEXTRW Instruction Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-10
Example 5-10.     PINSRW Instruction Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-10
Example 5-11.     Repeated PINSRW Instruction Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-11
Example 5-12.     Non-Unit Stride Load/Store Using SSE4.1 Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-11
Example 5-13.     Scatter and Gather Operations Using SSE4.1 Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-11
Example 5-14.     PMOVMSKB Instruction Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-12
Example 5-15.     Broadcast a Word Across XMM, Using 2 SSE2 Instructions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-13
Example 5-16.     Swap/Reverse words in an XMM, Using 3 SSE2 Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-13
Example 5-17.     Generating Constants . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-15
Example 5-18.     Absolute Difference of Two Unsigned Numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-15
Example 5-19.     Absolute Difference of Signed Numbers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-16
Example 5-20.     Computing Absolute Value . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-16
Example 5-21.     Basic C Implementation of RGBA to BGRA Conversion. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-17
Example 5-22.     Color Pixel Format Conversion Using SSE2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-17
Example 5-23.     Color Pixel Format Conversion Using SSSE3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-18
Example 5-24.     Big-Endian to Little-Endian Conversion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-19
Example 5-25.     Clipping to a Signed Range of Words [High, Low]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-20
Example 5-26.     Clipping to an Arbitrary Signed Range [High, Low] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-20
Example 5-27.     Simplified Clipping to an Arbitrary Signed Range . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-20
Example 5-28.     Clipping to an Arbitrary Unsigned Range [High, Low] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-21
Example 5-29.     Complex Multiply by a Constant. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-23
Example 5-30.     Using PTEST to Separate Vectorizable and non-Vectorizable Loop Iterations. . . . . . . . . . . . . . . 5-24
Example 5-31.     Using PTEST and Variable BLEND to Vectorize Heterogeneous Loops . . . . . . . . . . . . . . . . . . . . . 5-24

Example 5-32.     Baseline C Code for Mandelbrot Set Map Evaluation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-25
Example 5-33.     Vectorized Mandelbrot Set Map Evaluation Using SSE4.1 Intrinsics . . . . . . . . . . . . . . . . . . . . . . . 5-26
Example 5-34.     A Large Load after a Series of Small Stores (Penalty) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-28
Example 5-35.     Accessing Data Without Delay . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-28
Example 5-36.     A Series of Small Loads After a Large Store . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-28
Example 5-37.     Eliminating Delay for a Series of Small Loads after a Large Store . . . . . . . . . . . . . . . . . . . . . . . . . 5-29
Example 5-38.     An Example of Video Processing with Cache Line Splits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-29
Example 5-39.     Video Processing Using LDDQU to Avoid Cache Line Splits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-30
Example 5-40.     Un-optimized Reverse Memory Copy in C. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-31
Example 5-41.     Using PSHUFB to Reverse Byte Ordering 16 Bytes at a Time. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-33
Example 5-42.     PMOVSX/PMOVZX Work-around to Avoid False Dependency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-35
Example 5-43.     Table Look-up Operations in C Code. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-35
Example 5-44.     Shift Techniques on Non-Vectorizable Table Look-up . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-36
Example 5-45.     PEXTRD Techniques on Non-Vectorizable Table Look-up. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-37
Example 5-46.     Pseudo-Code Flow of AES Counter Mode Operation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-39
Example 5-47.     AES128-CTR Implementation with Eight Block in Parallel. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-39
Example 5-48.     AES128 Key Expansion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-46
Example 5-49.     Compress 32-bit Integers into 5-bit Buckets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-49
Example 5-50.     Decompression of a Stream of 5-bit Integers into 32-bit Elements . . . . . . . . . . . . . . . . . . . . . . . 5-51
Example 6-1.      Pseudocode for Horizontal (xyz, AoS) Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-4
Example 6-2.      Pseudocode for Vertical (xxxx, yyyy, zzzz, SoA) Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-5
Example 6-3.      Swizzling Data Using SHUFPS, MOVLHPS, MOVHLPS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-5
Example 6-4.      Swizzling Data Using UNPCKxxx Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-6
Example 6-5.      Deswizzling Single-Precision SIMD Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-7
Example 6-6.      Deswizzling Data Using SIMD Integer Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-8
Example 6-7.      Horizontal Add Using MOVHLPS/MOVLHPS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-9
Example 6-8.      Horizontal Add Using Intrinsics with MOVHLPS/MOVLHPS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-10
Example 6-9.      Multiplication of Two Pair of Single-precision Complex Number . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-12
Example 6-10.     Division of Two Pair of Single-precision Complex Numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-12
Example 6-11.     Double-Precision Complex Multiplication of Two Pairs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-13
Example 6-12.     Double-Precision Complex Multiplication Using Scalar SSE2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-13
Example 6-13.     Dot Product of Vector Length 4 Using SSE/SSE2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-14
Example 6-14.     Dot Product of Vector Length 4 Using SSE3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-15
Example 6-15.     Dot Product of Vector Length 4 Using SSE4.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-15
Example 6-16.     Unrolled Implementation of Four Dot Products . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-15
Example 6-17.     Normalization of an Array of Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-16
Example 6-18.     Normalize (x, y, z) Components of an Array of Vectors Using SSE2 . . . . . . . . . . . . . . . . . . . . . . . 6-17
Example 6-19.
Example 6-20.
Example 6-21.
Example 6-22.
Example 6-23.
Example 6-24.
Example 7-1.
Example 7-2.
Example 7-3.
Example 7-4.
Example 7-5.
Example 7-6.
Example 7-7.
Example 7-8.
Example 7-9.
Example 7-10.
Example 7-11.
Example 7-12.
Example 8-1.
Example 8-2.
Example 8-3.
Example 8-4.
Normalize (x, y, z) Components of an Array of Vectors Using SSE4.1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-18
Data Organization in Memory for AOS Vector-Matrix Multiplication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-19
AOS Vector-Matrix Multiplication with HADDPS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-19
AOS Vector-Matrix Multiplication with DPPS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-20
Data Organization in Memory for SOA Vector-Matrix Multiplication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-21
Vector-Matrix Multiplication with Native SOA Data Layout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-22
Pseudo-code Using CLFLUSH . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-10
Flushing Cache Lines Using CLFLUSH or CLFLUSHOPT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-12
Populating an Array for Circular Pointer Chasing with Constant Stride . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-13
Prefetch Scheduling Distance. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-16
Using Prefetch Concatenation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-17
Concatenation and Unrolling the Last Iteration of Inner Loop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-17
Data Access of a 3D Geometry Engine without Strip-mining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-21
Data Access of a 3D Geometry Engine with Strip-mining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-22
Using HW Prefetch to Improve Read-Once Memory Traffic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-23
Basic Algorithm of a Simple Memory Copy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-27
A Memory Copy Routine Using Software Prefetch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-28
Memory Copy Using Hardware Prefetch and Bus Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-29
Serial Execution of Producer and Consumer Work Items. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-5
Basic Structure of Implementing Producer Consumer Threads . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-6
Thread Function for an Interlaced Producer Consumer Model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-7
Spin-wait Loop and PAUSE Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-12
xix

CONTENTS
PAGE

Example 8-5. Coding Pitfall using Spin Wait Loop. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-14
Example 8-6. Placement of Synchronization and Regular Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-15
Example 8-7. Declaring Synchronization Variables without Sharing a Cache Line. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-16
Example 8-8. Batched Implementation of the Producer Consumer Threads . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-21
Example 8-9. Parallel Memory Initialization Technique Using OpenMP and NUMA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-25
Example 9-1. Compute 64-bit Quotient and Remainder with 64-bit Divisor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-3
Example 9-2. Quotient and Remainder of 128-bit Dividend with 64-bit Divisor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-4
Example 10-1. Hash Function Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-4
Example 10-2. Hash Function Using CRC32 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-4
Example 10-3. Strlen() Using General-Purpose Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-6
Example 10-4. Sub-optimal PCMPISTRI Implementation of EOS handling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-8
Example 10-5. Strlen() Using PCMPISTRI without Loop-Carry Dependency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-8
Example 10-6. WordCnt() Using C and Byte-Scanning Technique . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-9
Example 10-7. WordCnt() Using PCMPISTRM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-10
Example 10-8. KMP Substring Search in C. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-12
Example 10-9. Brute-Force Substring Search Using PCMPISTRI Intrinsic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-13
Example 10-10. Substring Search Using PCMPISTRI and KMP Overlap Table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-15
Example 10-11. Equivalent Strtok_s() Using PCMPISTRI Intrinsic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-19
Example 10-12. Equivalent Strupr() Using PCMPISTRM Intrinsic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-21
Example 10-13. UTF16 VerStrlen() Using C and Table Lookup Technique . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-22
Example 10-14. Assembly Listings of UTF16 VerStrlen() Using PCMPISTRI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-23
Example 10-15. Intrinsic Listings of UTF16 VerStrlen() Using PCMPISTRI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-25
Example 10-16. Replacement String Library Strcmp Using SSE4.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-27
Example 10-17. High-level flow of Character Subset Validation for String Conversion . . . . . . . . . . . . . . . . . . . . . . . . . 10-29
Example 10-18. Intrinsic Listings of atol() Replacement Using PCMPISTRI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-29
Example 10-19. Auxiliary Routines and Data Constants Used in sse4i_atol() listing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-31
Example 10-20. Conversion of 64-bit Integer to ASCII . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-34
Example 10-21. Conversion of 64-bit Integer to ASCII without Integer Division . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-35
Example 10-22. Conversion of 64-bit Integer to ASCII Using SSE4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-37
Example 10-23. Conversion of 64-bit Integer to Wide Character String Using SSE4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-43
Example 10-24. MULX and Carry Chain in Large Integer Numeric . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-48
Example 10-25. Building-block Macro Used in Binary Decimal Floating-point Operations . . . . . . . . . . . . . . . . . . . . . . . . . 10-49
Example 11-1. Cartesian Coordinate Transformation with Intrinsics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-3
Example 11-2. Cartesian Coordinate Transformation with Assembly . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-4
Example 11-3. Direct Polynomial Calculation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-6
Example 11-4. Function Calls and AVX/SSE transitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-10
Example 11-5. AoS to SoA Conversion of Complex Numbers in C Code. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-12
Example 11-6. Aos to SoA Conversion of Complex Numbers Using AVX . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-13
Example 11-7. Register Overlap Method for Median of 3 Numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-15
Example 11-8. Data Gather - AVX versus Scalar Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-16
Example 11-9. Scatter Operation Using AVX . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-18
Example 11-10. SAXPY using Intel AVX . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-19
Example 11-11. Using 16-Byte Memory Operations for Unaligned 32-Byte Memory Operation . . . . . . . . . . . . . . . . . 11-21
Example 11-12. SAXPY Implementations for Unaligned Data Addresses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-21
Example 11-13. Loop with Conditional Expression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-24
Example 11-14. Handling Loop Conditional with VMASKMOV . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-24
Example 11-15. Three-Tap Filter in C Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-25
Example 11-16. Three-Tap Filter with 128-bit Mixed Integer and FP SIMD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-26
Example 11-17. 256-bit AVX Three-Tap Filter Code with VSHUFPS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-26
Example 11-18. Three-Tap Filter Code with Mixed 256-bit AVX and 128-bit AVX Code . . . . . . . . . . . . . . . . . . . . . . . . . 11-27
Example 11-19. 8x8 Matrix Transpose - Replace Shuffles with Blends . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-29
Example 11-20. 8x8 Matrix Transpose Using VINSRTPS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-31
Example 11-21. Port 5 versus Load Port Shuffles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-33
Example 11-22. Divide Using DIVPS for 24-bit Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-36
Example 11-23. Divide Using RCPPS 11-bit Approximation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-36
Example 11-24. Divide Using RCPPS and Newton-Raphson Iteration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-36
Example 11-25. Reciprocal Square Root Using DIVPS+SQRTPS for 24-bit Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-38
Example 11-26. Reciprocal Square Root Using RCPPS 11-bit Approximation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-38
Example 11-27. Reciprocal Square Root Using RCPPS and Newton-Raphson Iteration . . . . . . . . . . . . . . . . . . . . . . . . . 11-38
Example 11-28. Square Root Using SQRTPS for 24-bit Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-39
Example 11-29. Square Root Using RCPPS 11-bit Approximation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-40
Example 11-30. Square Root Using RCPPS and One Taylor Series Expansion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-40
Example 11-31. Array Sub Sums Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-42
Example 11-32. Single-Precision to Half-Precision Conversion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-43
Example 11-33. Half-Precision to Single-Precision Conversion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-44
Example 11-34. Performance Comparison of Median3 using Half-Precision vs. Single-Precision . . . . . . . . . . . . . . . . . . 11-45
Example 11-35. FP Mul/FP Add Versus FMA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-47
Example 11-36. Unrolling to Hide Dependent FP Add Latency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-47
Example 11-37. FP Mul/FP Add Versus FMA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-49
Example 11-38. Macros for Separable KLT Intra-block Transformation Using AVX2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-50
Example 11-39. Separable KLT Intra-block Transformation Using AVX2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-52
Example 11-40. Macros for Parallel Moduli/Remainder Calculation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-57
Example 11-41. Signed 64-bit Integer Conversion Utility . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-58
Example 11-42. Unsigned 63-bit Integer Conversion Utility . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-60
Example 11-43. Access Patterns Favoring Non-VGATHER Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-64
Example 11-44. Access Patterns Likely to Favor VGATHER Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-65
Example 11-45. Software AVX Sequence Equivalent to Full-Mask VPGATHERD. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-66
Example 11-46. AOS to SOA Transformation Alternatives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-67
Example 11-47. Non-Strided AOS to SOA. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-68
Example 11-48. Conversion of Throughput-Reduced MMX Sequence to AVX2 Alternative . . . . . . . . . . . . . . . . . . . . . 11-70
Example 12-1. Reduce Data Conflict with Conditional Updates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-6
Example 12-2. Transition from Non-Elided Execution without Aborting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-10
Example 12-3. Exemplary Wrapper Using RTM for Lock/Unlock Primitives. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-12
Example 12-4. Spin Lock Example Using HLE in GCC 4.8 and Later . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-14
Example 12-5. Spin Lock Example Using HLE in Intel and Microsoft Compiler Intrinsic . . . . . . . . . . . . . . . . . . . . . . . . . . 12-14
Example 12-6. A Meta Lock Example. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-16
Example 12-7. A Meta Lock Example Using RTM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-17
Example 12-8. HLE-enabled Lock-Acquire/ Lock-Release Sequence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-17
Example 12-9. A Spin Wait Example Using HLE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-18
Example 12-10. A Conceptual Example of Intermixed HLE and RTM. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-19
Example 12-11. Emulated RTM intrinsic for Older GCC compilers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-27
Example 12-12. C++ Example of HLE Intrinsic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-28
Example 12-13. Emulated HLE Intrinsic with Older GCC compiler . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-28
Example 12-14. HLE Intrinsic Supported by Intel and Microsoft Compilers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-29
Example 13-1. Unoptimized Sleep Loop. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-14
Example 13-2. Power Consumption Friendly Sleep Loop Using PAUSE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-14
Example 14-1. Instruction Pairing and Alignment to Optimize Decode Throughput on Intel® Atom™ Microarchitecture . . . 14-4
Example 14-2. Alternative to Prevent AGU and Execution Unit Dependency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-6
Example 14-3. Pipelining Instruction Execution in Integer Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-7
Example 14-4. Memory Copy of 64-byte. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-11
Example 14-5. Examples of Dependent Multiply and Add Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-12
Example 14-6. Instruction Pointer Query Techniques. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-13
Example 15-1. Unrolled Loop Executes In-Order Due to Multiply-Store Port Conflict. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-7
Example 15-2. Grouping Store Instructions Eliminates Bubbles and Improves IPC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-7
Example 16-1. Gather Comparison Between AVX-512F and AVX2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16-7
Example 16-2. Gather Comparison Between AVX-512F and KNC Equivalent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16-8
Example 16-3. Using VRCP28SS for 32-bit Floating-Point Division . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16-8
Example 16-4. Vectorized Histogram Update Using AVX-512CD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16-9
Example 16-5. Replace VCOMIS* with VCMPSS/KORTEST. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16-12
Example 16-6. Using Software Sequence for Horizontal Reduction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16-13
Example 16-7. Optimized Inner Loop of DGEMM for Knights Landing Microarchitecture . . . . . . . . . . . . . . . . . . . . . . . . . 16-13
Example 16-8. Ordering of Memory Instruction for MEC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16-18


FIGURES
Figure 2-1. CPU Core Pipeline Functionality of the Skylake Microarchitecture . . . . . . . . . . . . . 2-2
Figure 2-2. CPU Core Pipeline Functionality of the Haswell Microarchitecture . . . . . . . . . . . . . 2-6
Figure 2-3. Four Core System Integration of the Haswell Microarchitecture . . . . . . . . . . . . . 2-7
Figure 2-4. An Example of the Haswell-E Microarchitecture Supporting 12 Processor Cores . . . . . . . . . . . . . 2-11
Figure 2-5. Intel Microarchitecture Code Name Sandy Bridge Pipeline Functionality . . . . . . . . . . . . . 2-14
Figure 2-6. Intel Core Microarchitecture Pipeline Functionality . . . . . . . . . . . . . 2-32
Figure 2-7. Execution Core of Intel Core Microarchitecture . . . . . . . . . . . . . 2-38
Figure 2-8. Store-Forwarding Enhancements in Enhanced Intel Core Microarchitecture . . . . . . . . . . . . . 2-41
Figure 2-9. Intel Advanced Smart Cache Architecture . . . . . . . . . . . . . 2-42
Figure 2-10. Intel Microarchitecture Code Name Nehalem Pipeline Functionality . . . . . . . . . . . . . 2-45
Figure 2-11. Front End of Intel Microarchitecture Code Name Nehalem . . . . . . . . . . . . . 2-46
Figure 2-12. Store-Forwarding Scenarios of 16-Byte Store Operations . . . . . . . . . . . . . 2-51
Figure 2-13. Store-Forwarding Enhancement in Intel Microarchitecture Code Name Nehalem . . . . . . . . . . . . . 2-52
Figure 2-14. Hyper-Threading Technology on an SMP . . . . . . . . . . . . . 2-54
Figure 2-15. Typical SIMD Operations . . . . . . . . . . . . . 2-57
Figure 2-16. SIMD Instruction Register Usage . . . . . . . . . . . . . 2-58
Figure 3-1. Generic Program Flow of Partially Vectorized Code . . . . . . . . . . . . . 3-36
Figure 3-2. Cache Line Split in Accessing Elements in an Array . . . . . . . . . . . . . 3-46
Figure 3-3. Size and Alignment Restrictions in Store Forwarding . . . . . . . . . . . . . 3-48
Figure 3-4. Memcpy Performance Comparison for Lengths up to 2KB . . . . . . . . . . . . . 3-66
Figure 4-1. General Procedural Flow of Application Detection of AVX . . . . . . . . . . . . . 4-6
Figure 4-2. General Procedural Flow of Application Detection of Float-16 . . . . . . . . . . . . . 4-8
Figure 4-3. Converting to Streaming SIMD Extensions Chart . . . . . . . . . . . . . 4-11
Figure 4-4. Hand-Coded Assembly and High-Level Compiler Performance Trade-offs . . . . . . . . . . . . . 4-13
Figure 4-5. Loop Blocking Access Pattern . . . . . . . . . . . . . 4-26
Figure 5-1. PACKSSDW mm, mm/mm64 Instruction . . . . . . . . . . . . . 5-6
Figure 5-2. Interleaved Pack with Saturation . . . . . . . . . . . . . 5-7
Figure 5-3. Result of Non-Interleaved Unpack Low in MM0 . . . . . . . . . . . . . 5-8
Figure 5-4. Result of Non-Interleaved Unpack High in MM1 . . . . . . . . . . . . . 5-8
Figure 5-5. PEXTRW Instruction . . . . . . . . . . . . . 5-9
Figure 5-6. PINSRW Instruction . . . . . . . . . . . . . 5-10
Figure 5-7. PMOVMSKB Instruction . . . . . . . . . . . . . 5-12
Figure 5-8. Data Alignment of Loads and Stores in Reverse Memory Copy . . . . . . . . . . . . . 5-32
Figure 5-9. A Technique to Avoid Cacheline Split Loads in Reverse Memory Copy Using Two Aligned Loads . . . . . . . . . . . . . 5-33
Figure 6-1. Homogeneous Operation on Parallel Data Elements . . . . . . . . . . . . . 6-3
Figure 6-2. Horizontal Computation Model . . . . . . . . . . . . . 6-3
Figure 6-3. Dot Product Operation . . . . . . . . . . . . . 6-4
Figure 6-4. Horizontal Add Using MOVHLPS/MOVLHPS . . . . . . . . . . . . . 6-9
Figure 6-5. Asymmetric Arithmetic Operation of the SSE3 Instruction . . . . . . . . . . . . . 6-11
Figure 6-6. Horizontal Arithmetic Operation of the SSE3 Instruction HADDPD . . . . . . . . . . . . . 6-11
Figure 7-1. CLFLUSHOPT versus CLFLUSH in Skylake Microarchitecture . . . . . . . . . . . . . 7-11
Figure 7-2. Effective Latency Reduction as a Function of Access Stride . . . . . . . . . . . . . 7-14
Figure 7-3. Memory Access Latency and Execution Without Prefetch . . . . . . . . . . . . . 7-14
Figure 7-4. Memory Access Latency and Execution With Prefetch . . . . . . . . . . . . . 7-15
Figure 7-5. Prefetch and Loop Unrolling . . . . . . . . . . . . . 7-18
Figure 7-6. Memory Access Latency and Execution With Prefetch . . . . . . . . . . . . . 7-18
Figure 7-7. Spread Prefetch Instructions . . . . . . . . . . . . . 7-19
Figure 7-8. Cache Blocking – Temporally Adjacent and Non-adjacent Passes . . . . . . . . . . . . . 7-20
Figure 7-9. Examples of Prefetch and Strip-mining for Temporally Adjacent and Non-Adjacent Passes Loops . . . . . . . . . . . . . 7-21
Figure 7-10. Single-Pass Vs. Multi-Pass 3D Geometry Engines . . . . . . . . . . . . . 7-25
Figure 8-1. Amdahl’s Law and MP Speed-up . . . . . . . . . . . . . 8-2
Figure 8-2. Single-threaded Execution of Producer-consumer Threading Model . . . . . . . . . . . . . 8-5
Figure 8-3. Execution of Producer-consumer Threading Model on a Multicore Processor . . . . . . . . . . . . . 8-5
Figure 8-4. Interlaced Variation of the Producer Consumer Model . . . . . . . . . . . . . 8-6
Figure 8-5. Batched Approach of Producer Consumer Model . . . . . . . . . . . . . 8-21
Figure 10-1. SSE4.2 String/Text Instruction Immediate Operand Control . . . . . . . . . . . . . 10-2

Figure 10-2. Retrace Inefficiency of Byte-Granular, Brute-Force Search . . . . . . . . . . . . . 10-12
Figure 10-3. SSE4.2 Speedup of SubString Searches . . . . . . . . . . . . . 10-18
Figure 10-4. Compute Four Remainders of Unsigned Short Integer in Parallel . . . . . . . . . . . . . 10-37
Figure 11-1. AVX-SSE Transitions in the Broadwell and Prior Generation Microarchitectures . . . . . . . . . . . . . 11-8
Figure 11-2. AVX-SSE Transitions in the Skylake Microarchitecture . . . . . . . . . . . . . 11-8
Figure 11-3. 4x4 Image Block Transformation . . . . . . . . . . . . . 11-50
Figure 11-4. Throughput Comparison of Gather Instructions . . . . . . . . . . . . . 11-65
Figure 11-5. Comparison of HW GATHER Versus Software Sequence in Skylake Microarchitecture . . . . . . . . . . . . . 11-66
Figure 13-1. Performance History and State Transitions . . . . . . . . . . . . . 13-2
Figure 13-2. Active Time Versus Halted Time of a Processor . . . . . . . . . . . . . 13-3
Figure 13-3. Application of C-states to Idle Time . . . . . . . . . . . . . 13-4
Figure 13-4. Profiles of Coarse Task Scheduling and Power Consumption . . . . . . . . . . . . . 13-9
Figure 13-5. Thread Migration in a Multicore Processor . . . . . . . . . . . . . 13-11
Figure 13-6. Progression to Deeper Sleep . . . . . . . . . . . . . 13-11
Figure 13-7. Energy Saving due to Performance Optimization . . . . . . . . . . . . . 13-13
Figure 13-8. Energy Saving due to Vectorization . . . . . . . . . . . . . 13-13
Figure 13-9. Energy Saving Comparison of Synchronization Primitives . . . . . . . . . . . . . 13-16
Figure 13-10. Power Saving Comparison of Power-Source-Aware Frame Rate Configurations . . . . . . . . . . . . . 13-17
Figure 14-1. Intel Atom Microarchitecture Pipeline . . . . . . . . . . . . . 14-2
Figure 15-1. Silvermont Microarchitecture Pipeline . . . . . . . . . . . . . 15-2
Figure 16-1. Tile-Mesh Topology of the Knights Landing Microarchitecture . . . . . . . . . . . . . 16-1
Figure 16-2. Processor Core Pipeline Functionality of the Knights Landing Microarchitecture . . . . . . . . . . . . . 16-2
Figure B-1. General TMAM Hierarchy for Out-of-Order Microarchitectures . . . . . . . . . . . . . B-2
Figure B-2. TMAM’s Top Level Drill Down Flowchart . . . . . . . . . . . . . B-3
Figure B-3. TMAM Hierarchy Supported by Skylake Microarchitecture . . . . . . . . . . . . . B-7
Figure B-4. System Topology Supported by Intel® Xeon® Processor 5500 Series . . . . . . . . . . . . . B-8
Figure B-5. PMU Specific Event Logic Within the Pipeline . . . . . . . . . . . . . B-10
Figure B-6. LBR Records and Basic Blocks . . . . . . . . . . . . . B-21
Figure B-7. Using LBR Records to Rectify Skewed Sample Distribution . . . . . . . . . . . . . B-21
Figure B-8. RdData Request after LLC Miss to Local Home (Clean Rsp) . . . . . . . . . . . . . B-32
Figure B-9. RdData Request after LLC Miss to Remote Home (Clean Rsp) . . . . . . . . . . . . . B-32
Figure B-10. RdData Request after LLC Miss to Remote Home (Hitm Response) . . . . . . . . . . . . . B-33
Figure B-11. RdData Request after LLC Miss to Local Home (Hitm Response) . . . . . . . . . . . . . B-33
Figure B-12. RdData Request after LLC Miss to Local Home (Hit Response) . . . . . . . . . . . . . B-34
Figure B-13. RdInvOwn Request after LLC Miss to Remote Home (Clean Res) . . . . . . . . . . . . . B-34
Figure B-14. RdInvOwn Request after LLC Miss to Remote Home (Hitm Res) . . . . . . . . . . . . . B-35
Figure B-15. RdInvOwn Request after LLC Miss to Local Home (Hit Res) . . . . . . . . . . . . . B-35
Figure B-16. Performance Events Drill-Down and Software Tuning Feedback Loop . . . . . . . . . . . . . B-54

TABLES
Table 2-1. Dispatch Port and Execution Stacks of the Skylake Microarchitecture . . . . . . . . . . . . . 2-3
Table 2-2. Skylake Microarchitecture Execution Units and Representative Instructions . . . . . . . . . . . . . 2-4
Table 2-3. Bypass Delay Between Producer and Consumer Micro-ops . . . . . . . . . . . . . 2-4
Table 2-4. Cache Parameters of the Skylake Microarchitecture . . . . . . . . . . . . . 2-5
Table 2-5. TLB Parameters of the Skylake Microarchitecture . . . . . . . . . . . . . 2-6
Table 2-6. Dispatch Port and Execution Stacks of the Haswell Microarchitecture . . . . . . . . . . . . . 2-8
Table 2-7. Haswell Microarchitecture Execution Units and Representative Instructions . . . . . . . . . . . . . 2-9
Table 2-8. Bypass Delay Between Producer and Consumer Micro-ops (cycles) . . . . . . . . . . . . . 2-9
Table 2-9. Cache Parameters of the Haswell Microarchitecture . . . . . . . . . . . . . 2-10
Table 2-10. TLB Parameters of the Haswell Microarchitecture . . . . . . . . . . . . . 2-10
Table 2-11. TLB Parameters of the Broadwell Microarchitecture . . . . . . . . . . . . . 2-12
Table 2-12. Components of the Front End of Intel Microarchitecture Code Name Sandy Bridge . . . . . . . . . . . . . 2-15
Table 2-13. ICache and ITLB of Intel Microarchitecture Code Name Sandy Bridge . . . . . . . . . . . . . 2-15
Table 2-14. Dispatch Port and Execution Stacks . . . . . . . . . . . . . 2-21
Table 2-15. Execution Core Writeback Latency (cycles) . . . . . . . . . . . . . 2-22
Table 2-16. Cache Parameters . . . . . . . . . . . . . 2-22
Table 2-17. Lookup Order and Load Latency . . . . . . . . . . . . . 2-23
Table 2-18. L1 Data Cache Components . . . . . . . . . . . . . 2-24
Table 2-19. Effect of Addressing Modes on Load Latency . . . . . . . . . . . . . 2-25
Table 2-20. DTLB and STLB Parameters . . . . . . . . . . . . . 2-25
Table 2-21. Store Forwarding Conditions (1 and 2 byte stores) . . . . . . . . . . . . . 2-26
Table 2-22. Store Forwarding Conditions (4-16 byte stores) . . . . . . . . . . . . . 2-26
Table 2-23. 32-byte Store Forwarding Conditions (0-15 byte alignment) . . . . . . . . . . . . . 2-26
Table 2-24. 32-byte Store Forwarding Conditions (16-31 byte alignment) . . . . . . . . . . . . . 2-27
Table 2-25. Components of the Front End . . . . . . . . . . . . . 2-32
Table 2-26. Issue Ports of Intel Core Microarchitecture and Enhanced Intel Core Microarchitecture . . . . . . . . . . . . . 2-37
Table 2-27. Cache Parameters of Processors based on Intel Core Microarchitecture . . . . . . . . . . . . . 2-43
Table 2-28. Characteristics of Load and Store Operations in Intel Core Microarchitecture . . . . . . . . . . . . . 2-43
Table 2-29. Bypass Delay Between Producer and Consumer Micro-ops (cycles) . . . . . . . . . . . . . 2-47
Table 2-30. Issue Ports of Intel Microarchitecture Code Name Nehalem . . . . . . . . . . . . . 2-48
Table 2-31. Cache Parameters of Intel Core i7 Processors . . . . . . . . . . . . . 2-49
Table 2-32. Performance Impact of Address Alignments of MOVDQU from L1 . . . . . . . . . . . . . 2-50
Table 3-1. Macro-Fusible Instructions in Intel Microarchitecture Code Name Sandy Bridge . . . . . . . . . . . . . 3-13
Table 3-2. Small Loop Criteria Detected by Sandy Bridge and Haswell Microarchitectures . . . . . . . . . . . . . 3-18
Table 3-3. Store Forwarding Restrictions of Processors Based on Intel Core Microarchitecture . . . . . . . . . . . . . 3-50
Table 3-4. Relative Performance of Memcpy() Using ERMSB Vs. 128-bit AVX . . . . . . . . . . . . . 3-67
Table 3-5. Effect of Address Misalignment on Memcpy() Performance . . . . . . . . . . . . . 3-67
Table 5-1. PSHUF Encoding . . . . . . . . . . . . . 5-13
Table 6-1. SoA Form of Representing Vertices Data . . . . . . . . . . . . . 6-4
Table 7-1. Software Prefetching Considerations into Strip-mining Code . . . . . . . . . . . . . 7-23
Table 7-2. Relative Performance of Memory Copy Routines . . . . . . . . . . . . . 7-30
Table 7-3. Deterministic Cache Parameters Leaf . . . . . . . . . . . . . 7-31
Table 8-1. Properties of Synchronization Objects . . . . . . . . . . . . . 8-11
Table 8-2. Design-Time Resource Management Choices . . . . . . . . . . . . . 8-23
Table 8-3. Microarchitectural Resources Comparisons of HT Implementations . . . . . . . . . . . . . 8-26
Table 10-1. SSE4.2 String/Text Instructions Compare Operation on N-elements . . . . . . . . . . . . . 10-2
Table 10-2. SSE4.2 String/Text Instructions Unary Transformation on IntRes1 . . . . . . . . . . . . . 10-3
Table 10-3. SSE4.2 String/Text Instructions Output Selection Imm[6] . . . . . . . . . . . . . 10-3
Table 10-4. SSE4.2 String/Text Instructions Element-Pair Comparison Definition . . . . . . . . . . . . . 10-3
Table 10-5. SSE4.2 String/Text Instructions Eflags Behavior . . . . . . . . . . . . . 10-3
Table 11-1. Features between 256-bit AVX, 128-bit AVX and Legacy SSE Extensions . . . . . . . . . . . . . 11-2
Table 11-2. State Transitions of Mixing AVX and SSE Code . . . . . . . . . . . . . 11-9
Table 11-3. Approximate Magnitude of AVX-SSE Transition Penalties in Different Microarchitectures . . . . . . . . . . . . . 11-9
Table 11-4. Effect of VZEROUPPER with Inter-Function Calls Between AVX and SSE Code . . . . . . . . . . . . . 11-10
Table 11-5. Comparison of Numeric Alternatives of Selected Linear Algebra in Skylake Microarchitecture . . . . . . . . . . . . . 11-34
Table 11-6. Single-Precision Divide and Square Root Alternatives . . . . . . . . . . . . . 11-35
Table 11-7. Comparison of Single-Precision Divide Alternatives . . . . . . . . . . . . . 11-37

Table 11-8. Comparison of Single-Precision Reciprocal Square Root Operation . . . . . . . . . . . . . 11-39
Table 11-9. Comparison of Single-Precision Square Root Operation . . . . . . . . . . . . . 11-41
Table 11-10. Comparison of AOS to SOA with Strided Access Pattern . . . . . . . . . . . . . 11-68
Table 11-11. Comparison of Indexed AOS to SOA Transformation . . . . . . . . . . . . . 11-69
Table 12-1. RTM Abort Status Definition . . . . . . . . . . . . . 12-22
Table 13-1. ACPI C-State Type Mappings to Processor Specific C-State for Mobile Processors Based on Intel Microarchitecture Code Name Nehalem . . . . . . . . . . . . . 13-5
Table 13-2. ACPI C-State Type Mappings to Processor Specific C-State of Intel Microarchitecture Code Name Sandy Bridge . . . . . . . . . . . . . 13-5
Table 13-3. C-State Total Processor Exit Latency for Client Systems (Core+ Package Exit Latency) with Slow VR . . . . . . . . . . . . . 13-18
Table 13-4. C-State Total Processor Exit Latency for Client Systems (Core+ Package Exit Latency) with Fast VR . . . . . . . . . . . . . 13-18
Table 13-5. C-State Core-Only Exit Latency for Client Systems with Slow VR . . . . . . . . . . . . . 13-18
Table 13-6. POWER_CTL MSR in Next Generation Intel Processor (Intel® Microarchitecture Code Name Sandy Bridge) . . . . . . . . . . . . . 13-19
Table 14-1. Instruction Latency/Throughput Summary of Intel® Atom™ Microarchitecture . . . . . . . . . . . . . 14-7
Table 14-2. Intel® Atom™ Microarchitecture Instructions Latency Data . . . . . . . . . . . . . 14-14
Table 15-1. Function Unit Mapping of the Silvermont Microarchitecture . . . . . . . . . . . . . 15-3
Table 15-2. Integer Multiply Operation Latency . . . . . . . . . . . . . 15-8
Table 15-3. Floating-Point and SIMD Integer Latency . . . . . . . . . . . . . 15-9
Table 15-4. Unsigned Integer Division Operation Latency . . . . . . . . . . . . . 15-10
Table 15-5. Signed Integer Division Operation Latency . . . . . . . . . . . . . 15-10
Table 15-6. Silvermont Microarchitecture Instructions Latency and Throughput . . . . . . . . . . . . . 15-13
Table 16-1. Integer Pipeline Characteristics of the Knights Landing Microarchitecture . . . . . . . . . . . . . 16-4
Table 16-2. Vector Pipeline Characteristics of the Knights Landing Microarchitecture . . . . . . . . . . . . . 16-4
Table 16-3. Characteristics of Caching Resources . . . . . . . . . . . . . 16-5
Table 16-4. Alternatives to MSROM Instructions . . . . . . . . . . . . . 16-10
Table 16-5. Cycle Cost Building Blocks for Vectorization Estimate for Knights Landing Microarchitecture . . . . . . . . . . . . . 16-14
Table A-1. Recommended Processor Optimization Options . . . . . . . . . . . . . A-2
Table B-1. Cycle Accounting and Micro-ops Flow Recipe . . . . . . . . . . . . . B-9
Table B-2. CMask/Inv/Edge/Thread Granularity of Events for Micro-op Flow . . . . . . . . . . . . . B-10
Table B-3. Cycle Accounting of Wasted Work Due to Misprediction . . . . . . . . . . . . . B-11
Table B-4. Cycle Accounting of Instruction Starvation . . . . . . . . . . . . . B-12
Table B-5. CMask/Inv/Edge/Thread Granularity of Events for Micro-op Flow . . . . . . . . . . . . . B-13
Table B-6. Approximate Latency of L2 Misses of Intel Xeon Processor 5500 . . . . . . . . . . . . . B-15
Table B-7. Load Latency Event Programming . . . . . . . . . . . . . B-18
Table B-8. Data Source Encoding for Load Latency PEBS Record . . . . . . . . . . . . . B-18
Table B-9. Core PMU Events to Drill Down L2 Misses . . . . . . . . . . . . . B-22
Table B-10. Core PMU Events for Super Queue Operation . . . . . . . . . . . . . B-23
Table B-11. Core PMU Event to Drill Down OFFCore Responses . . . . . . . . . . . . . B-23
Table B-12. OFFCORE_RSP_0 MSR Programming . . . . . . . . . . . . . B-23
Table B-13. Common Request and Response Types for OFFCORE_RSP_0 MSR . . . . . . . . . . . . . B-24
Table B-14. Uncore PMU Events for Occupancy Cycles . . . . . . . . . . . . . B-29
Table B-15. Common QHL Opcode Matching Facility Programming . . . . . . . . . . . . . B-31
Table C-1. CPUID Signature Values of Recent Intel Microarchitectures . . . . . . . . . . . . . C-3
Table C-2. Instruction Extensions Introduction by Microarchitectures (CPUID Signature) . . . . . . . . . . . . . C-4
Table C-3. BMI1, BMI2 and General Purpose Instructions . . . . . . . . . . . . . C-4
Table C-4. 256-bit AVX2 Instructions . . . . . . . . . . . . . C-4
Table C-5. Gather Timing Data from L1D* . . . . . . . . . . . . . C-6
Table C-6. BMI1, BMI2 and General Purpose Instructions . . . . . . . . . . . . . C-6
Table C-7. F16C, RDRAND Instructions . . . . . . . . . . . . . C-7
Table C-8. 256-bit AVX Instructions . . . . . . . . . . . . . C-7
Table C-9. AESNI and PCLMULQDQ Instructions . . . . . . . . . . . . . C-9
Table C-10. SSE4.2 Instructions . . . . . . . . . . . . . C-9
Table C-11. SSE4.1 Instructions . . . . . . . . . . . . . C-10
Table C-12. Supplemental Streaming SIMD Extension 3 Instructions . . . . . . . . . . . . . C-11
Table C-13. Streaming SIMD Extension 3 SIMD Floating-point Instructions . . . . . . . . . . . . . C-11
Table C-14. Streaming SIMD Extension 2 128-bit Integer Instructions . . . . . . . . . . . . . C-13
Table C-15. Streaming SIMD Extension 2 Double-precision Floating-point Instructions . . . . . . . . . . . . . C-14
Table C-16. Streaming SIMD Extension Single-precision Floating-point Instructions . . . . . . . . . . . . . C-15
Table C-17. General Purpose Instructions . . . . . . . . . . . . . C-17
Table C-18. Pointer-Chasing Variability of Software Measurable Latency of L1 Data Cache Latency . . . . . . . . . . . . . C-20

CHAPTER 1
INTRODUCTION
The Intel® 64 and IA-32 Architectures Optimization Reference Manual describes how to optimize software to take advantage of the performance characteristics of IA-32 and Intel 64 architecture processors.
Optimizations described in this manual apply to processors based on the Intel® Core™ microarchitecture, Enhanced Intel® Core™ microarchitecture, Intel® microarchitecture code name Nehalem, Intel®
microarchitecture code name Westmere, Intel® microarchitecture code name Sandy Bridge, Intel®
microarchitecture code name Ivy Bridge, Intel® microarchitecture code name Haswell, Intel NetBurst®
microarchitecture, and the Intel® Core™ Duo, Intel® Core™ Solo, and Pentium® M processor families.
The target audience for this manual includes software programmers and compiler writers. This manual
assumes that the reader is familiar with the basics of the IA-32 architecture and has access to the Intel®
64 and IA-32 Architectures Software Developer’s Manual (five volumes). A detailed understanding of Intel
64 and IA-32 processors is often required. In many cases, knowledge of the underlying microarchitectures is required.
The design guidelines discussed in this manual for developing high-performance software generally
apply to current as well as future IA-32 and Intel 64 processors. The coding rules and code optimization
techniques listed target the Intel Core microarchitecture, the Intel NetBurst microarchitecture, and the
Pentium M processor microarchitecture. In most cases, coding rules apply to software running in the
64-bit mode of Intel 64 architecture, the compatibility mode of Intel 64 architecture, and IA-32 modes
(IA-32 modes are supported in IA-32 and Intel 64 architectures). Coding rules specific to 64-bit modes
are noted separately.

1.1

TUNING YOUR APPLICATION

Tuning an application for high performance on any Intel 64 or IA-32 processor requires understanding
and basic skills in:

• Intel 64 and IA-32 architecture.
• C and Assembly language.
• Hot-spot regions in the application that have an impact on performance.
• Optimization capabilities of the compiler.
• Techniques used to evaluate application performance.

The Intel® VTune™ Performance Analyzer can help you analyze and locate hot-spot regions in your applications. On the Intel® Core™ i7, Intel® Core™2 Duo, Intel® Core™ Duo, Intel® Core™ Solo, Pentium®
4, Intel® Xeon® and Pentium® M processors, this tool can monitor an application through a selection of
performance monitoring events and analyze the performance event data that is gathered during code
execution.
This manual also describes information that can be gathered using the performance counters through
Pentium 4 processor’s performance monitoring events.
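Which performance events are available, and which tuning recommendations apply, depends on the
processor generation; software can identify the processor at run time with the CPUID instruction. The
following minimal C sketch is illustrative only (it is not taken from this manual and assumes a GCC- or
Clang-compatible compiler that provides <cpuid.h>); it decodes the DisplayFamily/DisplayModel
signature from CPUID leaf 1 using the documented extended family/model rules:

/* Identify the processor signature via CPUID leaf 1.
 * Illustrative sketch; assumes GCC/Clang <cpuid.h> on an x86 target. */
#include <stdio.h>
#include <cpuid.h>

int main(void)
{
    unsigned int eax, ebx, ecx, edx;

    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 1;                                 /* CPUID leaf 1 unsupported */

    unsigned int base_family = (eax >> 8) & 0xF;
    unsigned int base_model  = (eax >> 4) & 0xF;
    unsigned int family = base_family;
    unsigned int model  = base_model;

    if (base_family == 0xF)                       /* extended family applies */
        family += (eax >> 20) & 0xFF;
    if (base_family == 0x6 || base_family == 0xF) /* extended model applies */
        model += ((eax >> 16) & 0xF) << 4;

    printf("DisplayFamily_DisplayModel = %02X_%02XH\n", family, model);
    return 0;
}

The resulting DisplayFamily_DisplayModel value can then be matched against the CPUID signatures
listed in Appendix C (Table C-1) to select the appropriate microarchitecture-specific guidance.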

1.2

ABOUT THIS MANUAL

The Intel® Xeon® processor 3000, 3200, 5100, 5300, 7200 and 7300 series, Intel® Pentium® dual-core,
Intel® Core™2 Duo, Intel® Core™2 Quad, and Intel® Core™2 Extreme processors are based on Intel®
Core™ microarchitecture. In this document, references to the Core 2 Duo processor refer to processors
based on the Intel® Core™ microarchitecture.
The Intel® Xeon® processor 3100, 3300, 5200, 5400, 7400 series, Intel® Core™2 Quad processor
Q8000 series, and Intel® Core™2 Extreme processors QX9000 series are based on 45 nm Enhanced
Intel® Core™ microarchitecture.

The Intel® Core™ i7 processor and Intel® Xeon® processor 3400, 5500, 7500 series are based on 45 nm
Intel® microarchitecture code name Nehalem. Intel® microarchitecture code name Westmere is a 32 nm
version of Intel® microarchitecture code name Nehalem. Intel® Xeon® processor 5600 series, Intel Xeon
processor E7 and various Intel Core i7, i5, i3 processors are based on Intel® microarchitecture code
name Westmere.
The Intel® Xeon® processor E5 family, Intel® Xeon® processor E3-1200 family, Intel® Xeon® processor
E7-8800/4800/2800 product families, Intel® Core™ i7-3930K processor, and 2nd generation Intel®
Core™ i7-2xxx, Intel® Core™ i5-2xxx, Intel® Core™ i3-2xxx processor series are based on the Intel®
microarchitecture code name Sandy Bridge.
The 3rd generation Intel® Core™ processors and the Intel Xeon processor E3-1200 v2 product family are
based on Intel® microarchitecture code name Ivy Bridge. The Intel® Xeon® processor E5 v2 and E7 v2
families are based on the Ivy Bridge-E microarchitecture, support Intel 64 architecture and multiple
physical processor packages in a platform.
The 4th generation Intel® Core™ processors and the Intel® Xeon® processor E3-1200 v3 product family
are based on Intel® microarchitecture code name Haswell. The Intel® Xeon® processor E5 26xx v3
family is based on the Haswell-E microarchitecture, supports Intel 64 architecture and multiple physical
processor packages in a platform.
The Intel® Core™ M processor family and 5th generation Intel® Core™ processors are based on the
Intel® microarchitecture code name Broadwell and support Intel 64 architecture.
The 6th generation Intel® Core™ processors are based on the Intel® microarchitecture code name
Skylake and support Intel 64 architecture.
In this document, references to the Pentium 4 processor refer to processors based on the Intel NetBurst®
microarchitecture. This includes the Intel Pentium 4 processor and many Intel Xeon processors based on
Intel NetBurst microarchitecture. Where appropriate, differences are noted (for example, some Intel
Xeon processors have third level cache).
The Dual-core Intel® Xeon® processor LV is based on the same architecture as Intel® Core™ Duo and
Intel® Core™ Solo processors.
Intel® Atom™ processor is based on Intel® Atom™ microarchitecture.
The following bullets summarize chapters in this manual.

• Chapter 1: Introduction — Defines the purpose and outlines the contents of this manual.
• Chapter 2: Intel® 64 and IA-32 Processor Architectures — Describes the microarchitecture of
recent IA-32 and Intel 64 processor families, and other features relevant to software optimization.
• Chapter 3: General Optimization Guidelines — Describes general code development and optimization techniques that apply to all applications designed to take advantage of the common features
of the Intel Core microarchitecture, Enhanced Intel Core microarchitecture, Intel NetBurst microarchitecture, and Pentium M processor microarchitecture.
• Chapter 4: Coding for SIMD Architectures — Describes techniques and concepts for using the
SIMD integer and SIMD floating-point instructions provided by the MMX™ technology, Streaming
SIMD Extensions, Streaming SIMD Extensions 2, Streaming SIMD Extensions 3, SSSE3, and SSE4.1.
• Chapter 5: Optimizing for SIMD Integer Applications — Provides optimization suggestions and
common building blocks for applications that use the 128-bit SIMD integer instructions.
• Chapter 6: Optimizing for SIMD Floating-point Applications — Provides optimization
suggestions and common building blocks for applications that use the single-precision and
double-precision SIMD floating-point instructions.
• Chapter 7: Optimizing Cache Usage — Describes how to use the PREFETCH instruction, cache
control management instructions to optimize cache usage, and the deterministic cache parameters.
• Chapter 8: Multicore and Hyper-Threading Technology — Describes guidelines and techniques
for optimizing multithreaded applications to achieve optimal performance scaling. Use these when
targeting multicore processors, processors supporting Hyper-Threading Technology, or multiprocessor
(MP) systems.
• Chapter 9: 64-Bit Mode Coding Guidelines — Describes a set of additional coding guidelines for
application software written to run in 64-bit mode.
• Chapter 10: SSE4.2 and SIMD Programming for Text-Processing/Lexing/Parsing —
Describes SIMD techniques of using SSE4.2 along with other instruction extensions to improve
text/string processing and lexing/parsing applications.
• Chapter 11: Optimizations for Intel® AVX, FMA and AVX2 — Provides optimization suggestions
and common building blocks for applications that use Intel® Advanced Vector Extensions, FMA, and
AVX2.
• Chapter 12: Intel Transactional Synchronization Extensions — Tuning recommendations to
use lock elision techniques with Intel Transactional Synchronization Extensions to optimize multithreaded software with contended locks.
• Chapter 13: Power Optimization for Mobile Usages — Provides background on power saving
techniques in mobile processors and makes recommendations that developers can leverage to
provide longer battery life.
• Chapter 14: Intel® Atom™ Microarchitecture and Software Optimization — Describes the
microarchitecture of processor families based on Intel Atom microarchitecture, and software optimization techniques targeting Intel Atom microarchitecture.
• Chapter 15: Silvermont Microarchitecture and Software Optimization — Describes the microarchitecture of processor families based on the Silvermont microarchitecture, and software optimization techniques targeting Intel processors based on the Silvermont microarchitecture.
• Appendix A: Application Performance Tools — Introduces tools for analyzing and enhancing
application performance without having to write assembly code.
• Appendix B: Using Performance Monitoring Events — Provides information on the Top-Down
Analysis Method and information on how to use performance events specific to the Intel Xeon
processor 5500 series, processors based on Intel microarchitecture code name Sandy Bridge, and
Intel Core Solo and Intel Core Duo processors.
• Appendix C: IA-32 Instruction Latency and Throughput — Provides latency and throughput
data for the IA-32 instructions. Instruction timing data specific to recent processor families are
provided.

1.3 RELATED INFORMATION

For more information on the Intel® architecture, techniques, and the processor architecture terminology,
the following are of particular interest:

• Intel® 64 and IA-32 Architectures Software Developer’s Manual (in five volumes)
• Intel® Processor Identification with the CPUID Instruction, AP-485
• Developing Multi-threaded Applications: A Platform Consistent Approach
• Intel® C++ Compiler documentation and online help
• Intel® Fortran Compiler documentation and online help
• Intel® VTune™ Performance Analyzer documentation and online help
• Using Spin-Loops on Intel Pentium 4 Processor and Intel Xeon Processor MP

More relevant links are:

• Software network link:
http://softwarecommunity.intel.com/isn/home/
• Developer centers:
http://www3.intel.com/cd/ids/developer/asmo-na/eng/dc/index.htm
• Processor support general link:
http://www.intel.com/support/processors/
• Software products and packages:
http://www3.intel.com/cd/software/products/asmo-na/eng/index.htm
• Intel 64 and IA-32 processor manuals (printed or PDF downloads):
http://developer.intel.com/products/processor/manuals/index.htm
• Intel Multi-Core Technology:
http://developer.intel.com/technology/multi-core/index.htm
• Hyper-Threading Technology (HT Technology):
http://developer.intel.com/technology/hyperthread/
• SSE4.1 Application Note: Motion Estimation with Intel® Streaming SIMD Extensions 4:
http://softwarecommunity.intel.com/articles/eng/1246.htm
• SSE4.1 Application Note: Increasing Memory Throughput with Intel® Streaming SIMD Extensions 4:
http://softwarecommunity.intel.com/articles/eng/1248.htm
• Processor Topology and Cache Topology white paper and reference code:
http://software.intel.com/en-us/articles/intel-64-architecture-processor-topology-enumeration
• Multi-buffering techniques using SIMD extensions:
https://www-ssl.intel.com/content/www/us/en/communications/communications-ia-multi-buffer-paper.html
• Parallel hashing using multi-buffering techniques:
http://www.scirp.org/journal/PaperInformation.aspx?paperID=23995
http://eprint.iacr.org/2012/476.pdf
• AES Library of sample code:
http://software.intel.com/en-us/articles/download-the-intel-aesni-sample-library/
http://software.intel.com/file/26898
• PCLMULQDQ resources:
http://www.intel.com/content/www/us/en/intelligent-systems/intel-technology/fast-crc-computation-paper.html
https://www-ssl.intel.com/content/www/us/en/intelligent-systems/wireless-infrastructure/aes-ipsec-performance-linux-paper.html
• Modular exponentiation using redundant representation and AVX2:
http://rd.springer.com/chapter/10.1007%2F978-3-642-31662-3_9?LI=true

CHAPTER 2
INTEL® 64 AND IA-32 PROCESSOR ARCHITECTURES
This chapter gives an overview of features relevant to software optimization for current generations of
Intel 64 and IA-32 processors (processors based on Intel® microarchitecture code name Skylake, Intel®
microarchitecture code name Broadwell, Intel® microarchitecture code name Haswell, Intel microarchitecture code name Ivy Bridge, Intel microarchitecture code name Sandy Bridge, processors based on the
Intel Core microarchitecture, Enhanced Intel Core microarchitecture, Intel microarchitecture code name
Nehalem). These features are:

• Microarchitectures that enable executing instructions with high throughput at high clock rates, a high speed cache hierarchy and high speed system bus.
• Multicore architecture available across Intel Core processor and Intel Xeon processor families.
• Hyper-Threading Technology1 (HT Technology) support.
• Intel 64 architecture on Intel 64 processors.
• SIMD instruction extensions: MMX technology, Streaming SIMD Extensions (SSE), Streaming SIMD Extensions 2 (SSE2), Streaming SIMD Extensions 3 (SSE3), Supplemental Streaming SIMD Extensions 3 (SSSE3), SSE4.1, and SSE4.2.
• Intel® Advanced Vector Extensions (Intel® AVX).
• Half-precision floating-point conversion and RDRAND.
• Fused Multiply Add Extensions.
• Intel® Advanced Vector Extensions 2 (Intel® AVX2).
• ADX and RDSEED.
The Intel Core 2, Intel Core 2 Extreme, Intel Core 2 Quad processor family, Intel Xeon processor 3000,
3200, 5100, 5300, 7300 series are based on the high-performance and power-efficient Intel Core microarchitecture. Intel Xeon processor 3100, 3300, 5200, 5400, 7400 series, Intel Core 2 Extreme processor
QX9600, QX9700 series, Intel Core 2 Quad Q9000 series, Q8000 series are based on the enhanced Intel
Core microarchitecture. Intel Core i7 processor is based on Intel microarchitecture code name Nehalem.
Intel® Xeon® processor 5600 series, Intel Xeon processor E7 and Intel Core i7, i5, i3 processors are
based on Intel microarchitecture code name Westmere.
The Intel® Xeon® processor E5 family, Intel® Xeon® processor E3-1200 family, Intel® Xeon® processor
E7-8800/4800/2800 product families, Intel® Core™ i7-3930K processor, and 2nd generation Intel®
Core™ i7-2xxx, Intel® Core™ i5-2xxx, Intel® Core™ i3-2xxx processor series are based on the Intel®
microarchitecture code name Sandy Bridge.
The Intel® Xeon® processor E3-1200 v2 product family and the 3rd generation Intel® Core™ processors
are based on the Ivy Bridge microarchitecture and support Intel 64 architecture. The Intel® Xeon®
processor E5 v2 and E7 v2 families are based on the Ivy Bridge-E microarchitecture and support Intel 64
architecture and multiple physical processor packages in a platform.
The Intel® Xeon® processor E3-1200 v3 product family and 4th Generation Intel® Core™ processors are
based on the Haswell microarchitecture and support Intel 64 architecture. The Intel® Xeon® processor
E5 26xx v3 family is based on the Haswell-E microarchitecture and supports Intel 64 architecture and
multiple physical processor packages in a platform.
Intel® Core™ M processors, 5th generation Intel Core processors and Intel Xeon processor E3-1200 v4
series are based on the Broadwell microarchitecture and support Intel 64 architecture.
The 6th generation Intel Core processors, Intel Xeon processor E3-1500m v5 are based on the Skylake
microarchitecture and support Intel 64 architecture.
1. Hyper-Threading Technology requires a computer system with an Intel processor supporting HT Technology and an HT
Technology enabled chipset, BIOS and operating system. Performance varies depending on the hardware and software
used.


2.1 THE SKYLAKE MICROARCHITECTURE

The Skylake microarchitecture builds on the successes of the Haswell and Broadwell microarchitectures.
The basic pipeline functionality of the Skylake microarchitecture is depicted in Figure 2-1.

[Figure 2-1. CPU Core Pipeline Functionality of the Skylake Microarchitecture — diagram: the 32K L1 instruction cache and BPU feed the Decoded ICache (DSB, 6 uops/cycle), MSROM (4 uops/cycle) and Legacy Decode Pipeline (5 uops/cycle) into the Instruction Decode Queue (IDQ, or micro-op queue); Allocate/Rename/Retire/MoveElimination/ZeroIdiom; the Scheduler dispatches to ports 0, 1, 5, 6 (Int ALU, fast/slow LEA, Vec FMA/MUL/Add/ALU/Shft/SHUF, CVT, Int MUL, divide, branch units) and ports 2, 3, 4, 7 (LD/STA, STD, STA), backed by the 32K L1 data cache and 256K unified L2 cache.]
The Skylake microarchitecture offers the following enhancements:
• Larger internal buffers to enable deeper OOO execution and higher cache bandwidth.
• Improved front end throughput.
• Improved branch predictor.
• Improved divider throughput and latency.
• Lower power consumption.
• Improved SMT performance with Hyper-Threading Technology.
• Balanced floating-point ADD, MUL, FMA throughput and latency.

The microarchitecture supports flexible integration of multiple processor cores with a shared uncore subsystem consisting of a number of components including a ring interconnect to multiple slices of L3 (an
off-die L4 is optional), processor graphics, integrated memory controller, interconnect fabrics, etc. A
four-core configuration can be supported similar to the arrangement shown in Figure 2-3.


2.1.1 The Front End

The front end in the Skylake microarchitecture provides the following improvements over previous
generation microarchitectures:
• Legacy Decode Pipeline delivery of 5 uops per cycle to the IDQ compared to 4 uops in previous generations.
• The DSB delivers 6 uops per cycle to the IDQ compared to 4 uops in previous generations.
• The IDQ can hold 64 uops per logical processor vs. 28 uops per logical processor in previous generations when two sibling logical processors in the same core are active (2x64 vs. 2x28 per core). If only one logical processor is active in the core, the IDQ can hold 64 uops (64 vs. 56 uops in ST operation).
• The LSD in the IDQ can detect loops of up to 64 uops per logical processor, irrespective of ST or SMT operation.
• Improved Branch Predictor.

2.1.2 The Out-of-Order Execution Engine

The out-of-order and execution engine changes in the Skylake microarchitecture include:
• Larger buffers enable deeper OOO execution compared to previous generations.
• Improved throughput and latency for divide/sqrt and approximate reciprocals.
• Identical latency and throughput for all operations running on FMA units.
• Longer pause latency enables better power efficiency and better SMT performance resource utilization.

Table 2-1 summarizes the OOO engine’s capability to dispatch different types of operations to various
ports.

Table 2-1. Dispatch Port and Execution Stacks of the Skylake Microarchitecture
Port 0: ALU, Vec ALU, Vec Shft, Vec Add, Vec Mul, FMA, DIV, Branch2
Port 1: ALU, Fast LEA, Vec ALU, Vec Shft, Vec Add, Vec Mul, FMA, Slow Int, Slow LEA
Port 2, 3: LD, STA
Port 4: STD
Port 5: ALU, Fast LEA, Vec ALU, Vec Shuffle
Port 6: ALU, Shft, Branch1
Port 7: STA

Table 2-2 lists execution units and common representative instructions that rely on these units.
Throughput improvements across the SSE, AVX and general-purpose instruction sets are related to the
number of units for the respective operations, and the varieties of instructions that execute using a
particular unit.


Table 2-2. Skylake Microarchitecture Execution Units and Representative Instructions1
Execution Unit (# of units): Instructions
ALU (4): add, and, cmp, or, test, xor, movzx, movsx, mov, (v)movdqu, (v)movdqa, (v)movap*, (v)movup*
SHFT (2): sal, shl, rol, adc, sarx, adcx, adox, etc.
Slow Int (1): mul, imul, bsr, rcl, shld, mulx, pdep, etc.
BM (2): andn, bextr, blsi, blsmsk, bzhi, etc.
Vec ALU (3): (v)pand, (v)por, (v)pxor, (v)movq, (v)movq, (v)movap*, (v)movup*, (v)andp*, (v)orp*, (v)paddb/w/d/q, (v)blendv*, (v)blendp*, (v)pblendd
Vec_Shft (2): (v)psllv*, (v)psrlv*, vector shift count in imm8
Vec Add (2): (v)addp*, (v)cmpp*, (v)max*, (v)min*, (v)padds*, (v)paddus*, (v)psign, (v)pabs, (v)pavgb, (v)pcmpeq*, (v)pmax, (v)cvtps2dq, (v)cvtdq2ps, (v)cvtsd2si, (v)cvtss2si
Shuffle (1): (v)shufp*, vperm*, (v)pack*, (v)unpck*, (v)punpck*, (v)pshuf*, (v)pslldq, (v)alignr, (v)pmovzx*, vbroadcast*, (v)psrldq, (v)pblendw
Vec Mul (2): (v)mul*, (v)pmul*, (v)pmadd*
SIMD Misc (1): STTNI, (v)pclmulqdq, (v)psadw, vector shift count in xmm
FP Mov (1): (v)movsd/ss, (v)movd gpr
DIVIDE (1): divp*, divs*, vdiv*, sqrt*, vsqrt*, rcp*, vrcp*, rsqrt*, idiv
NOTES:
1. Execution unit mapping to MMX instructions is not covered in this table. See Section 11.16.5 on MMX instruction throughput remedy.
A significant portion of the SSE, AVX and general-purpose instructions also have latency improvements.
Appendix C lists the specific details. Software-visible latency exposure of an instruction sometimes may
include additional contributions that depend on the relationship between micro-ops flows of the producer
instruction and the micro-op flows of the ensuing consumer instruction. For example, a two-uop instruction like VPMULLD may experience two cumulative bypass delays of 1 cycle each from each of the two
micro-ops of VPMULLD.
Table 2-3 describes the bypass delay in cycles between a producer uop and the consumer uop. The leftmost column lists a variety of situations characteristic of the producer micro-op. The top row lists a
variety of situations characteristic of the consumer micro-op.


Table 2-3. Bypass Delay Between Producer and Consumer Micro-ops
Producer \ Consumer | SIMD/0,1/1 | FMA/0,1/4 | VIMUL/0,1/4 | SIMD/5/1,3 | SHUF/5/1,3 | V2I/0/3 | I2V/5/1
SIMD/0,1/1 | 0 | 1 | 1 | 0 | 0 | 0 | NA
FMA/0,1/4 | 1 | 0 | 1 | 0 | 0 | 0 | NA
VIMUL/0,1/4 | 1 | 0 | 1 | 0 | 0 | 0 | NA
SIMD/5/1,3 | 0 | 1 | 1 | 0 | 0 | 0 | NA
SHUF/5/1,3 | 0 | 0 | 1 | 0 | 0 | 0 | NA
V2I/0/3 | NA | NA | NA | NA | NA | NA | NA
I2V/5/1 | 0 | 0 | 1 | 0 | 0 | 0 | NA

The attributes that are relevant to the producer/consumer micro-ops for bypass are a triplet of abbreviation/one or more port numbers/latency cycles of the uop. For example:
• “SIMD/0,1/1” applies to a 1-cycle vector SIMD uop dispatched to either port 0 or port 1.
• “VIMUL/0,1/4” applies to a 4-cycle vector integer multiply uop dispatched to either port 0 or port 1.
• “SIMD/5/1,3” applies to either a 1-cycle or 3-cycle non-shuffle uop dispatched to port 5.

2.1.3 Cache and Memory Subsystem

The cache hierarchy of the Skylake microarchitecture has the following enhancements:
• Higher cache bandwidth compared to previous generations.
• Simultaneous handling of more loads and stores enabled by enlarged buffers.
• Processor can do two page walks in parallel compared to one in the Haswell microarchitecture and earlier generations.
• Page split load penalty down from 100 cycles in the previous generation to 5 cycles.
• L3 write bandwidth increased from 4 cycles per line in the previous generation to 2 per line.
• Support for the CLFLUSHOPT instruction to flush cache lines and manage memory ordering of flushed data using SFENCE (see the sketch after this list).
• Reduced performance penalty for a software prefetch that specifies a NULL pointer.
• L2 associativity changed from 8 ways to 4 ways.
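As an illustrative sketch (not an example from this manual), the following sequence flushes a buffer line by line with CLFLUSHOPT and orders the flushes with a single SFENCE; the buffer size, alignment and register assignments are assumptions:

        ; Flush a 4-KByte buffer (RDI = 64-byte aligned base) - hypothetical sketch.
        mov ecx, 64                 ; 64 cache lines of 64 bytes
flush_loop:
        clflushopt [rdi]            ; weakly-ordered flush of one line
        add rdi, 64                 ; advance to the next cache line
        dec ecx
        jnz flush_loop
        sfence                      ; order the flushes before subsequent stores

Because CLFLUSHOPT is weakly ordered, one fence after the loop suffices, whereas the strongly ordered CLFLUSH would serialize every iteration.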


Table 2-4. Cache Parameters of the Skylake Microarchitecture
Level | Capacity / Associativity | Line Size (bytes) | Fastest Latency1 | Peak Bandwidth (bytes/cyc) | Sustained Bandwidth (bytes/cyc) | Update Policy
First Level Data | 32 KB / 8 | 64 | 4 cycle | 96 (2x32B Load + 1x32B Store) | ~81 | Writeback
Instruction | 32 KB / 8 | 64 | N/A | N/A | N/A | N/A
Second Level | 256 KB / 4 | 64 | 12 cycle | 64 | ~29 | Writeback
Third Level (Shared L3) | Up to 2 MB per core / Up to 16 ways | 64 | 44 | 32 | ~18 | Writeback
NOTES:
1. Software-visible latency will vary depending on access patterns and other factors.
The TLB hierarchy consists of dedicated level one TLB for instruction cache, TLB for L1D, plus unified TLB
for L2. The partition column of Table 2-5 indicates the resource sharing policy when Hyper-Threading
Technology is active.

Table 2-5. TLB Parameters of the Skylake Microarchitecture
Level | Page Size | Entries | Associativity | Partition
Instruction | 4KB | 128 | 8 ways | dynamic
Instruction | 2MB/4MB | 8 per thread | — | fixed
First Level Data | 4KB | 64 | 4 | fixed
First Level Data | 2MB/4MB | 32 | 4 | fixed
First Level Data | 1GB | 4 | 4 | fixed
Second Level | Shared by 4KB and 2/4MB pages | 1536 | 12 | fixed
Second Level | 1GB | 16 | 4 | fixed

2.2 THE HASWELL MICROARCHITECTURE

The Haswell microarchitecture builds on the successes of the Sandy Bridge and Ivy Bridge microarchitectures. The basic pipeline functionality of the Haswell microarchitecture is depicted in Figure 2-2. In
general, most of the features described in Section 2.2.1 - Section 2.2.4 also apply to the Broadwell
microarchitecture. Enhancements of the Broadwell microarchitecture are summarized in Section 2.2.6.


[Figure 2-2. CPU Core Pipeline Functionality of the Haswell Microarchitecture — diagram: the 32K L1 instruction cache feeds pre-decode, the instruction queue, the decoders and MSROM, with the BPU and uop cache (DSB) supplying the IDQ; Allocate/Rename/Retire/MoveElimination/ZeroIdiom with load, store, and reorder buffers; the Scheduler dispatches to ports 0, 1, 5, 6 (ALU, shift, fast LEA, vector ALU/logic/shift/shuffle, FP mul, FMA, FP add, divide, STTNI, slow int, branch units) and ports 2, 3, 4, 7 (LD/STA, STD, STA) through memory control into the 32K L1 data cache, line fill buffers, and 256K unified L2 cache.]
The Haswell microarchitecture offers the following innovative features:
• Support for Intel Advanced Vector Extensions 2 (Intel AVX2), FMA.
• Support for new general-purpose instructions to accelerate integer numeric encryption.
• Support for Intel® Transactional Synchronization Extensions (Intel® TSX).
• Each core can dispatch up to 8 micro-ops per cycle.
• 256-bit data path for memory operation, FMA, AVX floating-point and AVX2 integer execution units.
• Improved L1D and L2 cache bandwidth.
• Two FMA execution pipelines.
• Four arithmetic logical units (ALUs).
• Three store address ports.
• Two branch execution units.
• Advanced power management features for IA processor core and uncore sub-systems.
• Support for optional fourth level cache.

The microarchitecture supports flexible integration of multiple processor cores with a shared uncore subsystem consisting of a number of components including a ring interconnect to multiple slices of L3 (an
off-die L4 is optional), processor graphics, integrated memory controller, interconnect fabrics, etc. An
example of the system integration view of four CPU cores with uncore components is illustrated in
Figure 2-3.


[Figure 2-3. Four Core System Integration of the Haswell Microarchitecture — diagram: four CPU cores, each with an L3 slice, on a ring with the processor graphics/media engine and the system agent (display engine, PEG/DMI/PCIe bridges, integrated memory controller to DRAM); everything outside the CPU cores forms the uncore.]

2.2.1 The Front End

The front end of Intel microarchitecture code name Haswell builds on that of Intel microarchitecture code name Sandy Bridge and Intel microarchitecture code name Ivy Bridge; see Section 2.3.2 and Section 2.3.7. Additional enhancements in the front end include:
• The uop cache (or decoded ICache) is partitioned equally between two logical processors.
• The instruction decoders will alternate between each active logical processor. If one sibling logical processor is idle, the active logical processor will use the decoders continuously.
• The LSD in the micro-op queue (or IDQ) can detect small loops up to 56 micro-ops. The 56-entry micro-op queue is shared by two logical processors if Hyper-Threading Technology is active (Intel microarchitecture Sandy Bridge provides a duplicated 28-entry micro-op queue in each core).

2.2.2 The Out-of-Order Engine

The key components and significant improvements to the out-of-order engine are summarized below:
Renamer: The Renamer moves micro-ops from the micro-op queue to bind to the dispatch ports in the Scheduler with execution resources. Zero-idiom, one-idiom and zero-latency register move operations are performed by the Renamer to free up the Scheduler and execution core for improved performance.
Scheduler: The Scheduler controls the dispatch of micro-ops onto the dispatch ports. There are eight dispatch ports to support the out-of-order execution core. Four of the eight ports provide execution resources for computational operations. The other four ports support memory operations of up to two 256-bit loads and one 256-bit store operation in a cycle.


Execution Core: The scheduler can dispatch up to eight micro-ops every cycle, one on each port. Of the four ports providing computational resources, each provides an ALU; two of these execution pipes provide dedicated FMA units. With the exception of the division/square-root and STTNI/AESNI units, most floating-point and integer SIMD execution units are 256-bit wide. The four dispatch ports servicing memory operations consist of two dual-use ports for load and store-address operations, plus a dedicated third store-address port and one dedicated store-data port. All memory ports can handle 256-bit memory micro-ops. Peak floating-point throughput, at 32 single-precision operations per cycle and 16 double-precision operations per cycle using FMA, is twice that of Intel microarchitecture code name Sandy Bridge.
The out-of-order engine can handle 192 uops in flight compared to 168 in Intel microarchitecture code name Sandy Bridge.
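The peak rate assumes both 256-bit FMA pipes are kept busy. A minimal sketch, with register roles and trip count as assumptions, uses several independent accumulators so that FMA latency does not serialize the two pipes:

        ; Hypothetical sketch: feeding both FMA pipes with independent chains.
        ; YMM0-YMM3 are independent accumulators; YMM4/YMM5 hold source operands.
fma_loop:
        vfmadd231ps ymm0, ymm4, ymm5   ; 8 SP multiplies + 8 SP adds
        vfmadd231ps ymm1, ymm4, ymm5   ; independent chain for the second pipe
        vfmadd231ps ymm2, ymm4, ymm5   ; extra chains help cover FMA latency;
        vfmadd231ps ymm3, ymm4, ymm5   ; a real kernel may need more accumulators
        dec ecx
        jnz fma_loop

Two FMA micro-ops per cycle, each performing eight single-precision multiplies and eight adds, account for the quoted 32 single-precision operations per cycle.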

2.2.3 Execution Engine

Table 2-6 summarizes which operations can be dispatched on which port.

Table 2-6. Dispatch Port and Execution Stacks of the Haswell Microarchitecture
Port 0: ALU, Shift, SIMD_Log, SIMD misc, SIMD_Shifts, FMA/FP_mul, Divide, 2nd_Jeu
Port 1: ALU, Fast LEA, BM, SIMD_ALU, SIMD_Log, FMA/FP_mul, FP_add, slow_int, FP mov, AES
Port 2, 3: Load_Addr, Store_addr
Port 4: Store_data
Port 5: ALU, Fast LEA, BM, SIMD_ALU, SIMD_Log, Shuffle
Port 6: ALU, Shift, JEU
Port 7: Store_addr, Simple_AGU

Table 2-7 lists execution units and common representative instructions that rely on these units. Table 2-7
also includes some instructions that are available only on processors based on the Broadwell microarchitecture.


Table 2-7. Haswell Microarchitecture Execution Units and Representative Instructions
Execution Unit (# of ports): Instructions
ALU (4): add, and, cmp, or, test, xor, movzx, movsx, mov, (v)movdqu, (v)movdqa
SHFT (2): sal, shl, rol, adc, sarx, (adcx, adox)1, etc.
Slow Int (1): mul, imul, bsr, rcl, shld, mulx, pdep, etc.
BM (2): andn, bextr, blsi, blsmsk, bzhi, etc.
SIMD Log (3): (v)pand, (v)por, (v)pxor, (v)movq, (v)movq, (v)blendp*, vpblendd
SIMD_Shft (1): (v)psl*, (v)psr*
SIMD ALU (2): (v)padd*, (v)psign, (v)pabs, (v)pavgb, (v)pcmpeq*, (v)pmax, (v)pcmpgt*
Shuffle (1): (v)shufp*, vperm*, (v)pack*, (v)unpck*, (v)punpck*, (v)pshuf*, (v)pslldq, (v)alignr, (v)pmovzx*, vbroadcast*, (v)pblendw
SIMD Misc (1): (v)pmul*, (v)pmadd*, STTNI, (v)pclmulqdq, (v)psadw, (v)pcmpgtq, vpsllvd, (v)blendv*, (v)pblendw
FP Add (1): (v)addp*, (v)cmpp*, (v)max*, (v)min*
FP Mov (1): (v)movap*, (v)movup*, (v)movsd/ss, (v)movd gpr, (v)andp*, (v)orp*
DIVIDE (1): divp*, divs*, vdiv*, sqrt*, vsqrt*, rcp*, vrcp*, rsqrt*, idiv
NOTES:
1. Only available in processors based on the Broadwell microarchitecture that support the CPUID ADX feature flag.
The reservation station (RS) is expanded to 60 entries deep (compared to 54 entries in Intel microarchitecture code name Sandy Bridge). It can dispatch up to eight micro-ops in one cycle if the micro-ops are ready to execute. The RS dispatches a micro-op through an issue port to a specific execution cluster, arranged in several stacks to handle specific data types or granularity of data.
When a source of a micro-op executed in one stack comes from a micro-op executed in another stack, a delay can occur. The delay occurs also for transitions between Intel SSE integer and Intel SSE floating-point operations. In some of the cases the data transition is done using a micro-op that is added to the instruction flow. Table 2-8 describes how data, written back after execution, can bypass to micro-op execution in the following cycles.


Table 2-8. Bypass Delay Between Producer and Consumer Micro-ops (cycles)
From/To | INT | SSE-INT/AVX-INT | SSE-FP/AVX-FP_LOW | X87/AVX-FP_High
INT | | micro-op (port 1) | micro-op (port 1) | micro-op (port 1) + 3 cycle delay
SSE-INT/AVX-INT | micro-op (port 5) or micro-op (port 6) + 1 cycle | | 1 cycle delay | micro-op (port 5) + 1 cycle delay
SSE-FP/AVX-FP_LOW | micro-op (port 5) or micro-op (port 6) + 1 cycle | 1 cycle delay | | micro-op (port 5) + 1 cycle delay
X87/AVX-FP_High | micro-op (port 5) + 3 cycle delay | 1 cycle delay | 1 cycle delay |
Load | | 1 cycle delay | 1 cycle delay | 2 cycle delay

2.2.4 Cache and Memory Subsystem

The cache hierarchy is similar to prior generations, including an instruction cache, a first-level data cache and a second-level unified cache in each core, and a 3rd-level unified cache with size dependent on specific product configuration. The 3rd-level cache is organized as multiple cache slices, the size of each slice may depend on product configurations, connected by a ring interconnect. The exact details of the cache topology are reported by CPUID leaf 4. The 3rd level cache resides in the “uncore” sub-system that is shared by all the processor cores. In some product configurations, a fourth level cache is also supported. Table 2-9 provides more details of the cache hierarchy.

Table 2-9. Cache Parameters of the Haswell Microarchitecture
Level | Capacity / Associativity | Line Size (bytes) | Fastest Latency1 | Throughput (clocks) | Peak Bandwidth (bytes/cyc) | Update Policy
First Level Data | 32 KB / 8 | 64 | 4 cycle | 0.5 (see Note 2) | 64 (Load) + 32 (Store) | Writeback
Instruction | 32 KB / 8 | 64 | N/A | N/A | N/A | N/A
Second Level | 256 KB / 8 | 64 | 11 cycle | Varies | 64 | Writeback
Third Level (Shared L3) | Varies | 64 | ~34 | Varies | Varies | Writeback
NOTES:
1. Software-visible latency will vary depending on access patterns and other factors. L3 latency can vary due to clock ratios between the processor core and uncore.
2. First level data cache supports two load micro-ops each cycle; each micro-op can fetch up to 32 bytes of data.


The TLB hierarchy consists of dedicated level one TLB for instruction cache, TLB for L1D, plus unified TLB
for L2.

Table 2-10. TLB Parameters of the Haswell Microarchitecture
Level | Page Size | Entries | Associativity | Partition
Instruction | 4KB | 128 | 4 ways | dynamic
Instruction | 2MB/4MB | 8 per thread | — | fixed
First Level Data | 4KB | 64 | 4 | fixed
First Level Data | 2MB/4MB | 32 | 4 | fixed
First Level Data | 1GB | 4 | 4 | fixed
Second Level | Shared by 4KB and 2/4MB pages | 1024 | 8 | fixed

2.2.4.1 Load and Store Operation Enhancements

The L1 data cache can handle two 256-bit load and one 256-bit store operations each cycle. The unified
L2 can service one cache line (64 bytes) each cycle. Additionally, there are 72 load buffers and 42 store
buffers available to support micro-ops execution in-flight.
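A minimal sketch of a kernel shaped to that bandwidth, with array pointers and trip count as assumptions, issues two 256-bit loads and one 256-bit store per iteration:

        ; Hypothetical sketch: two 256-bit loads + one 256-bit store per iteration.
add_loop:
        vmovups ymm0, [rsi]         ; first 256-bit load
        vaddps  ymm0, ymm0, [rdx]   ; second 256-bit load folded into the add
        vmovups [rdi], ymm0         ; 256-bit store
        add rsi, 32
        add rdx, 32
        add rdi, 32
        dec ecx
        jnz add_loop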

2.2.5 The Haswell-E Microarchitecture

Intel processors based on the Haswell-E microarchitecture comprise the same processor cores as described in the Haswell microarchitecture, but provide more advanced uncore and integrated I/O capabilities. Processors based on the Haswell-E microarchitecture support platforms with multiple sockets.
The Haswell-E microarchitecture supports versatile processor architectures and platform configurations for scalability and high performance. Some of the capabilities provided by the uncore and integrated I/O sub-system of the Haswell-E microarchitecture include:
• Support for multiple Intel QPI interconnects in multi-socket configurations.
• Up to two integrated memory controllers per physical processor.
• Up to 40 lanes of PCI Express* 3.0 links per physical processor.
• Up to 18 processor cores connected by two ring interconnects to the L3 in each physical processor.

An example of a possible 12-core processor implementation using the Haswell-E microarchitecture is
illustrated in Figure 2-4. The capabilities of the uncore and integrated I/O sub-system vary across the
processor family implementing the Haswell-E microarchitecture. For details, please consult the data
sheets of respective Intel Xeon E5 v3 processors.


[Figure 2-4. An Example of the Haswell-E Microarchitecture Supporting 12 Processor Cores — diagram: PCIe and QPI links feed the integrated I/O; 12 cores, each with an L3 slice, are connected by two ring interconnects with Sbox units; two home agents with memory controllers connect to DRAM; everything outside the CPU cores forms the uncore.]

2.2.6 The Broadwell Microarchitecture

Intel Core M processors are based on the Broadwell microarchitecture. The Broadwell microarchitecture builds on the Haswell microarchitecture and provides several enhancements. This section covers the enhanced features of the Broadwell microarchitecture:
• Floating-point multiply instruction latency is improved from 5 cycles in the prior generation to 3 cycles in the Broadwell microarchitecture. This applies to the AVX, SSE and FP instruction sets.
• The throughput of gather instructions has been improved significantly; see Table C-5.
• The PCLMULQDQ instruction implementation is a single uop in the Broadwell microarchitecture with improved latency and throughput.

The TLB hierarchy consists of dedicated level one TLB for instruction cache, TLB for L1D, plus unified TLB
for L2.

Table 2-11. TLB Parameters of the Broadwell Microarchitecture
Level | Page Size | Entries | Associativity | Partition
Instruction | 4KB | 128 | 4 ways | dynamic
Instruction | 2MB/4MB | 8 per thread | — | fixed
First Level Data | 4KB | 64 | 4 | fixed
First Level Data | 2MB/4MB | 32 | 4 | fixed
First Level Data | 1GB | 4 | 4 | fixed
Second Level | Shared by 4KB and 2MB pages | 1536 | 6 | fixed
Second Level | 1GB pages | 16 | 4 | fixed


2.3 INTEL® MICROARCHITECTURE CODE NAME SANDY BRIDGE

Intel® microarchitecture code name Sandy Bridge builds on the successes of Intel® Core™ microarchitecture and Intel microarchitecture code name Nehalem. It offers the following innovative features:
• Intel Advanced Vector Extensions (Intel AVX)
— 256-bit floating-point instruction set extensions to the 128-bit Intel Streaming SIMD Extensions, providing up to 2X performance benefits relative to 128-bit code.
— Non-destructive destination encoding offers more flexible coding techniques.
— Supports flexible migration and co-existence between 256-bit AVX code, 128-bit AVX code and legacy 128-bit SSE code.
• Enhanced front end and execution engine
— New decoded ICache component that improves front end bandwidth and reduces branch misprediction penalty.
— Advanced branch prediction.
— Additional macro-fusion support.
— Larger dynamic execution window.
— Multi-precision integer arithmetic enhancements (ADC/SBB, MUL/IMUL).
— LEA bandwidth improvement.
— Reduction of general execution stalls (read ports, writeback conflicts, bypass latency, partial stalls).
— Fast floating-point exception handling.
— XSAVE/XRSTORE performance improvements and XSAVEOPT new instruction.
• Cache hierarchy improvements for wider data path
— Doubling of bandwidth enabled by two symmetric ports for memory operation.
— Simultaneous handling of more in-flight loads and stores enabled by increased buffers.
— Internal bandwidth of two loads and one store each cycle.
— Improved prefetching.
— High bandwidth low latency LLC architecture.
— High bandwidth ring architecture of on-die interconnect.
• System-on-a-chip support
— Integrated graphics and media engine in second generation Intel Core processors.
— Integrated PCIe controller.
— Integrated memory controller.
• Next generation Intel Turbo Boost Technology
— Leverages TDP headroom to boost performance of CPU cores and the integrated graphics unit.

2.3.1 Intel® Microarchitecture Code Name Sandy Bridge Pipeline Overview

Figure 2-5 depicts the pipeline and major components of a processor core that’s based on Intel microarchitecture code name Sandy Bridge. The pipeline consists of:
• An in-order issue front end that fetches instructions and decodes them into micro-ops (micro-operations). The front end feeds the next pipeline stages with a continuous stream of micro-ops from the most likely path that the program will execute.
• An out-of-order, superscalar execution engine that dispatches up to six micro-ops to execution per cycle. The allocate/rename block reorders micro-ops to "dataflow" order so they can execute as soon as their sources are ready and execution resources are available.
• An in-order retirement unit that ensures that the results of execution of the micro-ops, including any exceptions they may have encountered, are visible according to the original program order.

The flow of an instruction in the pipeline can be summarized in the following progression:
1. The Branch Prediction Unit chooses the next block of code to execute from the program. The processor searches for the code in the following resources, in this order:
a. Decoded ICache.
b. Instruction Cache, via activating the legacy decode pipeline.
c. L2 cache, last level cache (LLC) and memory, as necessary.

[Figure 2-5. Intel Microarchitecture Code Name Sandy Bridge Pipeline Functionality — diagram: the 32K L1 instruction cache feeds pre-decode, the instruction queue and decoders, with the branch predictor and 1.5K uop cache supplying the in-order Allocate/Rename/Retire block backed by load, store, and reorder buffers; the out-of-order Scheduler dispatches to ports 0, 1, 5 (ALU, JMP, vector multiply/shuffle/add, Fdiv, 256-bit FP MUL/Add/Shuf/Bool/Blend) and ports 2, 3, 4 (load/store-address and store-data) through memory control (48 bytes/cycle) to the 32K L1 data cache, line fill buffers, and 256K unified L2 cache.]
2. The micro-ops corresponding to this code are sent to the Rename/retirement block. They enter into the scheduler in program order, but execute and are de-allocated from the scheduler according to data-flow order. For simultaneously ready micro-ops, FIFO ordering is nearly always maintained. Micro-ops execute using execution resources arranged in three stacks. The execution units in each stack are associated with the data type of the instruction.
Branch mispredictions are signaled at branch execution. The branch execution unit re-steers the front end, which delivers micro-ops from the correct path. The processor can overlap work preceding the branch misprediction with work from the following corrected path.


3. Memory operations are managed and reordered to achieve parallelism and maximum performance.
Misses to the L1 data cache go to the L2 cache. The data cache is non-blocking and can handle
multiple simultaneous misses.
4. Exceptions (Faults, Traps) are signaled at retirement (or attempted retirement) of the faulting
instruction.
Each processor core based on Intel microarchitecture code name Sandy Bridge can support two logical processors if Intel Hyper-Threading Technology is enabled.

2.3.2 The Front End

This section describes the key characteristics of the front end. Table 2-12 lists the components of the
front end, their functions, and the problems they address.

Table 2-12. Components of the Front End of Intel Microarchitecture Code Name Sandy Bridge
Component | Functions | Performance Challenges
Instruction Cache | 32-Kbyte backing store of instruction bytes | Fast access to hot code instruction bytes
Legacy Decode Pipeline | Decode instructions to micro-ops, delivered to the micro-op queue and the Decoded ICache. | Provides the same decode latency and bandwidth as prior Intel processors. Decoded ICache warm-up.
Decoded ICache | Provide stream of micro-ops to the micro-op queue. | Provides higher micro-op bandwidth at lower latency and lower power than the legacy decode pipeline.
MSROM | Complex instruction micro-op flow store, accessible from both the Legacy Decode Pipeline and the Decoded ICache. |
Branch Prediction Unit (BPU) | Determine next block of code to be executed and drive lookup of Decoded ICache and legacy decode pipelines. | Improves performance and energy efficiency through reduced branch mispredictions.
Micro-op queue | Queues micro-ops from the Decoded ICache and the legacy decode pipeline. | Hide front end bubbles; provide execution micro-ops at a constant rate.

2.3.2.1 Legacy Decode Pipeline

The Legacy Decode Pipeline comprises the instruction translation lookaside buffer (ITLB), the instruction
cache (ICache), instruction predecode, and instruction decode units.
Instruction Cache and ITLB
An instruction fetch is a 16-byte aligned lookup through the ITLB and into the instruction cache. The instruction cache can deliver 16 bytes to the instruction pre-decoder every cycle. Table 2-13 compares the ICache and ITLB with the prior generation.

Table 2-13. ICache and ITLB of Intel Microarchitecture Code Name Sandy Bridge
Component | Intel microarchitecture code name Sandy Bridge | Intel microarchitecture code name Nehalem
ICache Size | 32-Kbyte | 32-Kbyte
ICache Ways | 8 | 4
ITLB 4K page entries | 128 | 128
ITLB large page (2M or 4M) entries | 8 | 7


Upon ITLB miss there is a lookup to the Second level TLB (STLB) that is common to the DTLB and the
ITLB. The penalty of an ITLB miss and a STLB hit is seven cycles.
Instruction PreDecode
The predecode unit accepts the 16 bytes from the instruction cache and determines the length of the
instructions.
The following length changing prefixes (LCPs) imply instruction length that is different from the default
length of instructions. Therefore they cause an additional penalty of three cycles per LCP during length
decoding. Previous processors incur a six-cycle penalty for each 16-byte chunk that has one or more
LCPs in it. Since usually there is no more than one LCP in a 16-byte chunk, in most cases, Intel microarchitecture code name Sandy Bridge introduces an improvement over previous processors.

• Operand Size Override (66H) preceding an instruction with a word/double immediate data. This prefix might appear when the code uses 16-bit data types, Unicode processing, and image processing (see the sketch after this list).
• Address Size Override (67H) preceding an instruction with a modr/m in real, big real, 16-bit protected or 32-bit protected modes. This prefix may appear in boot code sequences.
• The REX prefix (4xh) in the Intel® 64 instruction set can change the size of two classes of instructions: MOV offset and MOV immediate. Despite this capability, it does not cause an LCP penalty and hence is not considered an LCP.
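A minimal sketch of the 66H case (the memory operand and immediate value are illustrative): the word-sized immediate form carries the LCP, while a dword operation of the same value avoids it when the wider access is semantically acceptable:

        ; LCP-prone form: 66H operand-size override ahead of a word immediate.
        add word ptr [rsi], 1234    ; imm16 -> length-changing prefix penalty
        ; Alternative without an LCP (assumes a 32-bit access is acceptable here):
        add dword ptr [rsi], 1234   ; imm32 is the default length; no LCP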

Instruction Decode
There are four decoding units that decode instructions into micro-ops. The first can decode all IA-32 and Intel 64 instructions up to four micro-ops in size. The remaining three decoding units handle single-micro-op instructions. All four decoding units support the common cases of single micro-op flows, including micro-fusion and macro-fusion.
Micro-ops emitted by the decoders are directed to the micro-op queue and to the Decoded ICache. Instructions longer than four micro-ops generate their micro-ops from the MSROM. The MSROM bandwidth is four micro-ops per cycle. Instructions whose micro-ops come from the MSROM can start from either the legacy decode pipeline or from the Decoded ICache.
MicroFusion
Micro-fusion fuses multiple micro-ops from the same instruction into a single complex micro-op. The
complex micro-op is dispatched in the out-of-order execution core as many times as it would if it were
not micro-fused.
Micro-fusion enables you to use memory-to-register operations, also known as the complex instruction
set computer (CISC) instruction set, to express the actual program operation without worrying about a
loss of decode bandwidth. Micro-fusion improves instruction bandwidth delivered from decode to retirement and saves power.
Coding an instruction sequence by using single-uop instructions will increase the code size, which can decrease fetch bandwidth from the legacy pipeline.
The following are examples of micro-fused micro-ops that can be handled by all decoders.
• All stores to memory, including store immediate. Stores execute internally as two separate functions, store-address and store-data.
• All instructions that combine load and computation operations (load+op), for example:
— ADDPS XMM9, OWORD PTR [RSP+40]
— FADD DOUBLE PTR [RDI+RSI*8]
— XOR RAX, QWORD PTR [RBP+32]
• All instructions of the form "load and jump," for example:
— JMP [RDI+200]
— RET
• CMP and TEST with immediate operand and memory

An instruction with RIP relative addressing is not micro-fused in the following cases:
• An additional immediate is needed, for example:
— CMP [RIP+400], 27
— MOV [RIP+3000], 142
• The instruction is a control flow instruction with an indirect target specified using RIP-relative addressing, for example:
— JMP [RIP+5000000]
In these cases, an instruction that cannot be micro-fused requires decoder 0 to issue two micro-ops, resulting in a slight loss of decode bandwidth.
In 64-bit code, the usage of RIP-relative addressing is common for global data. Since there is no micro-fusion in these cases, performance may be reduced when porting 32-bit code to 64-bit code.
Macro-Fusion
Macro-fusion merges two instructions into a single micro-op. In Intel Core microarchitecture, this hardware optimization is limited to specific conditions on the first and second instruction of the macro-fusable pair:
• The first instruction of the macro-fused pair modifies the flags. The following instructions can be macro-fused:
— In Intel microarchitecture code name Nehalem: CMP, TEST.
— In Intel microarchitecture code name Sandy Bridge: CMP, TEST, ADD, SUB, AND, INC, DEC.
— These instructions can fuse if:
• The first source / destination operand is a register.
• The second source operand (if it exists) is one of: immediate, register, or non-RIP-relative memory.
• The second instruction of the macro-fusable pair is a conditional branch. Table 3-1 describes, for each instruction, which branches it can fuse with.
Macro-fusion does not happen if the first instruction ends on byte 63 of a cache line, and the second instruction is a conditional branch that starts at byte 0 of the next cache line.
Since these pairs are common in many types of applications, macro-fusion improves performance even on non-recompiled binaries. Each macro-fused instruction executes with a single dispatch. This reduces latency and frees execution resources. You also gain increased rename and retire bandwidth, increased virtual storage, and power savings from representing more work in fewer bits.
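A sketch of a loop whose flag-writing instruction and conditional branch can macro-fuse on Intel microarchitecture code name Sandy Bridge (the register roles are assumptions):

        ; DEC writes flags and is immediately followed by a conditional branch,
        ; so the pair decodes as a single micro-op (macro-fusion).
loop_top:
        add rax, [rsi]              ; load+op, micro-fused
        add rsi, 8
        dec rcx                     ; first instruction of the fusable pair
        jnz loop_top                ; fuses with DEC into one micro-op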

2.3.2.2 Decoded ICache

The Decoded ICache is essentially an accelerator of the legacy decode pipeline. By storing decoded instructions, the Decoded ICache enables the following features:
• Reduced latency on branch mispredictions.
• Increased micro-op delivery bandwidth to the out-of-order engine.
• Reduced front end power consumption.

The Decoded ICache caches the output of the instruction decoder. The next time the micro-ops are
consumed for execution the decoded micro-ops are taken from the Decoded ICache. This enables skipping the fetch and decode stages for these micro-ops and reduces power and latency of the Front End.
The Decoded ICache provides average hit rates of above 80% of the micro-ops; furthermore, "hot spots"
typically have hit rates close to 100%.
Typical integer programs average less than four bytes per instruction, and the front end is able to race
ahead of the back end, filling in a large window for the scheduler to find instruction level parallelism.
However, for high performance code with a basic block consisting of many instructions, for example, Intel
SSE media algorithms or excessively unrolled loops, the 16 instruction bytes per cycle is occasionally a
limitation. The 32-byte orientation of the Decoded ICache helps such code to avoid this limitation.
The Decoded ICache automatically improves performance of programs with temporal and spatial locality.
However, to fully utilize the Decoded ICache potential, you might need to understand its internal organization.
The Decoded ICache consists of 32 sets. Each set contains eight Ways. Each Way can hold up to six
micro-ops. The Decoded ICache can ideally hold up to 1536 micro-ops.
The following are some of the rules on how the Decoded ICache is filled with micro-ops:
• All micro-ops in a Way represent instructions which are statically contiguous in the code and have their EIPs within the same aligned 32-byte region.
• Up to three Ways may be dedicated to the same 32-byte aligned chunk, allowing a total of 18 micro-ops to be cached per 32-byte region of the original IA program.
• A multi micro-op instruction cannot be split across Ways.
• Up to two branches are allowed per Way.
• An instruction which turns on the MSROM consumes an entire Way.
• A non-conditional branch is the last micro-op in a Way.
• Micro-fused micro-ops (load+op and stores) are kept as one micro-op.
• A pair of macro-fused instructions is kept as one micro-op.
• Instructions with a 64-bit immediate require two slots to hold the immediate.

When micro-ops cannot be stored in the Decoded ICache due to these restrictions, they are delivered from the legacy decode pipeline. Once micro-ops are delivered from the legacy pipeline, fetching micro-ops from the Decoded ICache can resume only after the next branch micro-op. Frequent switches can incur a penalty.
The Decoded ICache is virtually included in the Instruction cache and ITLB. That is, any instruction with micro-ops in the Decoded ICache has its original instruction bytes present in the instruction cache. Instructions evicted from the instruction cache must also be evicted from the Decoded ICache, which evicts only the necessary lines.
There are cases where the entire Decoded ICache is flushed. One reason for this can be an ITLB entry eviction. Other reasons are not usually visible to the application programmer, as they occur when important controls are changed, for example, mapping in CR3, or feature and mode enabling in CR0 and CR4. There are also cases where the Decoded ICache is disabled, for instance, when the CS base address is NOT set to zero.

2.3.2.3 Branch Prediction

Branch prediction predicts the branch target and enables the processor to begin executing instructions long before the branch’s true execution path is known. All branches utilize the branch prediction unit (BPU) for prediction. This unit predicts the target address not only based on the EIP of the branch but also based on the execution path through which execution reached this EIP. The BPU can efficiently predict the following branch types:
• Conditional branches.
• Direct calls and jumps.
• Indirect calls and jumps.
• Returns.

2.3.2.4 Micro-op Queue and the Loop Stream Detector (LSD)

The micro-op queue decouples the front end and the out-of-order engine. It sits between the micro-op generation and the renamer, as shown in Figure 2-5. This queue helps to hide bubbles which are introduced between the various sources of micro-ops in the front end and ensures that four micro-ops are delivered for execution each cycle.
The micro-op queue provides post-decode functionality for certain instruction types. In particular, loads combined with computational operations and all stores, when used with indexed addressing, are represented as a single micro-op in the decoder or Decoded ICache. In the micro-op queue they are fragmented into two micro-ops through a process called un-lamination; one does the load and the other does the operation. A typical example is the following "load plus operation" instruction:
ADD RAX, [RBP+RSI]; rax := rax + LD( RBP+RSI )
Similarly, the following store instruction has three register sources and is broken into "generate store address" and "generate store data" sub-components.
MOV [ESP+ECX*4+12345678], AL
The additional micro-ops generated by un-lamination use the rename and retirement bandwidth. However, there is an overall power benefit. For code that is dominated by indexed addressing (as often happens with array processing), recoding algorithms to use base (or base+displacement) addressing can sometimes improve performance by keeping the load plus operation and store instructions fused; a recoding sketch follows.
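In this hypothetical sketch (array layout and register roles are assumptions), the indexed form is un-laminated in the micro-op queue, while the pointer-increment form keeps the load+op fused:

        ; Indexed addressing: the ADD with [rbp+rsi*8] is un-laminated
        ; into two micro-ops in the micro-op queue.
sum_indexed:
        add rax, [rbp+rsi*8]
        inc rsi
        cmp rsi, rcx
        jne sum_indexed
        ; Base addressing: the load+op stays micro-fused; the pointer
        ; is advanced instead of an index.
sum_base:
        add rax, [rbp]
        add rbp, 8
        cmp rbp, rdx                ; rdx = end pointer (assumed)
        jne sum_base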
The Loop Stream Detector (LSD)
The Loop Stream Detector was introduced in Intel® Core microarchitectures. The LSD detects small loops that fit in the micro-op queue and locks them down. The loop streams from the micro-op queue, with no more fetching, decoding, or reading micro-ops from any of the caches, until a branch misprediction inevitably ends it.
Loops with the following attributes qualify for LSD/micro-op queue replay:
• Up to eight chunk fetches of 32 instruction bytes.
• Up to 28 micro-ops (~28 instructions).
• All micro-ops are also resident in the Decoded ICache.
• Can contain no more than eight taken branches, and none of them can be a CALL or RET.
• Cannot have mismatched stack operations, for example, more PUSH than POP instructions.
Many calculation-intensive loops, searches and software string moves match these characteristics.
Use the loop cache functionality opportunistically. For high performance code, loop unrolling is generally preferable for performance even when it overflows the LSD capability. A loop of the kind the LSD can replay is sketched below.
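A sketch of a loop with the qualifying shape (registers and the scale factor are illustrative): few micro-ops, one taken branch, no CALL/RET, and balanced stack use:

        ; Small copy-and-scale loop that fits the LSD limits
        ; (well under 28 micro-ops).
copy_scale:
        mov edx, [rsi+rax*4]        ; load element
        lea edx, [rdx+rdx*2]        ; scale by 3 without a multiply
        mov [rdi+rax*4], edx        ; store result
        inc eax
        cmp eax, ecx
        jne copy_scale              ; single taken branch per iteration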

2.3.3 The Out-of-Order Engine

The Out-of-Order engine provides improved performance over prior generations with excellent power
characteristics. It detects dependency chains and sends them to execution out-of-order while maintaining the correct data flow. When a dependency chain is waiting for a resource, such as a second-level
data cache line, it sends micro-ops from another chain to the execution core. This increases the overall
rate of instructions executed per cycle (IPC).
The out-of-order engine consists of two blocks, shown in Figure 2-5: the Rename/retirement block and the Scheduler. It contains the following major components:
Renamer. The Renamer component moves micro-ops from the front end to the execution core. It eliminates false dependencies among micro-ops, thereby enabling out-of-order execution of micro-ops.
Scheduler. The Scheduler component queues micro-ops until all source operands are ready. It schedules and dispatches ready micro-ops to the available execution units in as close to a first in first out (FIFO) order as possible.
Retirement. The Retirement component retires instructions and micro-ops in order and handles faults and exceptions.


2.3.3.1 Renamer

The Renamer is the bridge between the in-order part in Figure 2-5 and the dataflow world of the Scheduler. It moves up to four micro-ops every cycle from the micro-op queue to the out-of-order engine. Although the renamer can send up to four micro-ops (unfused, micro-fused, or macro-fused) per cycle, this is equivalent to the issue ports dispatching up to six micro-ops per cycle. In this process, the out-of-order core carries out the following steps:
• Renames architectural sources and destinations of the micro-ops to micro-architectural sources and destinations.
• Allocates resources to the micro-ops. For example, load or store buffers.
• Binds the micro-op to an appropriate dispatch port.

Some micro-ops can execute to completion during rename and are removed from the pipeline at that point, effectively costing no execution bandwidth. These include:
• Zero idioms (dependency breaking idioms).
• NOP.
• VZEROUPPER.
• FXCHG.

The renamer can allocate two branches each cycle, compared to one branch each cycle in the previous
microarchitecture. This can eliminate some bubbles in execution.
Micro-fused load and store operations that use an index register are decomposed to two micro-ops,
hence consume two out of the four slots the Renamer can use every cycle.
Dependency Breaking Idioms
Instruction parallelism can be improved by using common instructions to clear register contents to zero. The renamer can detect them on the zero evaluation of the destination register.
Use one of these dependency breaking idioms to clear a register when possible:
• XOR REG,REG
• SUB REG,REG
• PXOR/VPXOR XMMREG,XMMREG
• PSUBB/W/D/Q XMMREG,XMMREG
• VPSUBB/W/D/Q XMMREG,XMMREG
• XORPS/PD XMMREG,XMMREG
• VXORPS/PD YMMREG,YMMREG

Since zero idioms are detected and removed by the renamer, they have no execution latency.
There is another dependency breaking idiom - the "ones idiom":
• CMPEQ XMM1, XMM1; "ones idiom" sets all elements to all "ones"
In this case, the micro-op must execute; however, since it is known that, regardless of the input data, the output data is always "all ones," the micro-op's dependency upon its sources does not exist as it does with the zero idiom, and it can execute as soon as it finds a free execution port. A usage sketch follows.
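A sketch of typical usage (register choices are illustrative): the zero idioms start fresh dependency chains at no execution cost, while the ones idiom still occupies an execution port but does not wait on its sources:

        xor eax, eax                ; zero idiom; removed at rename
        pxor xmm0, xmm0             ; zero idiom for a SIMD accumulator
        pcmpeqd xmm1, xmm1          ; ones idiom; executes, but without
                                    ; waiting on prior writes to XMM1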

2.3.3.2 Scheduler

The scheduler controls the dispatch of micro-ops onto their execution ports. In order to do this, it must identify which micro-ops are ready and where their sources come from: a register file entry, or a bypass directly from an execution unit. Depending on the availability of dispatch ports and writeback buses, and the priority of ready micro-ops, the scheduler selects which micro-ops are dispatched every cycle.


2.3.4 The Execution Core

The execution core is superscalar and can process instructions out of order. The execution core optimizes
overall performance by handling the most common operations efficiently, while minimizing potential
delays.
The out-of-order execution core improves execution unit organization over the prior generation in the following ways:
• Reduction in read port stalls.
• Reduction in writeback conflicts and delays.
• Reduction in power.
• Reduction of SIMD FP assists dealing with denormal inputs and underflow outputs.

Some high precision FP algorithms need to operate with FTZ=0 and DAZ=0, i.e. permitting underflowed
intermediate results and denormal inputs to achieve higher numerical precision at the expense of
reduced performance on prior generation microarchitectures due to SIMD FP assists. The reduction of
SIMD FP assists in Intel microarchitecture code name Sandy Bridge applies to the following SSE instructions (and AVX variants): ADDPD/ADDPS, MULPD/MULPS, DIVPD/DIVPS, and CVTPD2PS.
The out-of-order core consists of three execution stacks, where each stack encapsulates a certain type of data. The execution core contains the following execution stacks:
• General purpose integer.
• SIMD integer and floating-point.
• X87.

The execution core also contains connections to and from the cache hierarchy. The loaded data is fetched
from the caches and written back into one of the stacks.
The scheduler can dispatch up to six micro-ops every cycle, one on each port. The following table
summarizes which operations can be dispatched on which port.

Table 2-14. Dispatch Port and Execution Stacks

Stack                 | Port 0                               | Port 1                       | Port 2                | Port 3                | Port 4     | Port 5
----------------------|--------------------------------------|------------------------------|-----------------------|-----------------------|------------|----------------------------------
Integer               | ALU, Shift                           | ALU, Fast LEA, Slow LEA, MUL | Load_Addr, Store_addr | Load_Addr, Store_addr | Store_data | ALU, Shift, Branch, Fast LEA
SSE-Int, AVX-Int, MMX | Mul, Shift, STTNI, Int-Div, 128b-Mov | ALU, Shuf, Blend, 128b-Mov   |                       |                       | Store_data | ALU, Shuf, Shift, Blend, 128b-Mov
SSE-FP, AVX-FP_low    | Mul, Div, Blend, 256b-Mov            | Add, CVT                     |                       |                       | Store_data | Shuf, Blend, 256b-Mov
X87, AVX-FP_High      | Mul, Div, Blend, 256b-Mov            | Add, CVT                     |                       |                       | Store_data | Shuf, Blend, 256b-Mov

After execution, the data is written back on a writeback bus corresponding to the dispatch port and the data type of the result. Micro-ops that are dispatched on the same port but have different latencies may need the writeback bus in the same cycle. In these cases the execution of one of the micro-ops is delayed until the writeback bus is available. For example, MULPS (five cycles) and BLENDPS (one cycle) may collide if both are ready for execution on port 0: first the MULPS, and four cycles later the BLENDPS. Intel microarchitecture code name Sandy Bridge eliminates such collisions as long as the micro-ops write


the results to different stacks. For example, integer ADD (one cycle) can be dispatched four cycles after
MULPS (five cycles) since the integer ADD uses the integer stack while the MULPS uses the FP stack.
When a source of a micro-op executed in one stack comes from a micro-op executed in another stack, a one- or two-cycle delay can occur. The delay also occurs for transitions between Intel SSE integer and Intel SSE floating-point operations. In some of the cases the data transition is done using a micro-op that is added to the instruction flow. The following table describes how data, written back after execution, can bypass to micro-op execution in the following cycles.

Table 2-15. Execution Core Writeback Latency (cycles)

                      | Integer                                          | SSE-Int, AVX-Int, MMX | SSE-FP, AVX-FP_low          | X87, AVX-FP_High
----------------------|--------------------------------------------------|-----------------------|-----------------------------|----------------------------
Integer               | 0                                                | micro-op (port 0)     | micro-op (port 0)           | micro-op (port 0) + 1 cycle
SSE-Int, AVX-Int, MMX | micro-op (port 5) or micro-op (port 5) + 1 cycle | 0                     | 1 cycle delay               | 0
SSE-FP, AVX-FP_low    | micro-op (port 5) or micro-op (port 5) + 1 cycle | 1 cycle delay         | 0                           | micro-op (port 5) + 1 cycle
X87, AVX-FP_High      | micro-op (port 5) + 1 cycle                      | 0                     | micro-op (port 5) + 1 cycle | 0
Load                  | 0                                                | 1 cycle delay         | 1 cycle delay               | 2 cycle delay
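As an illustration of these bypass delays (a hand-written sketch, not taken from this manual), mixing integer-SIMD and FP-SIMD operations on the same data crosses stacks and adds a transition penalty:

    PXOR  XMM0, XMM1     ; executes in the SSE integer stack
    ADDPS XMM0, XMM2     ; SSE-FP consumer of an SSE-integer result: 1-cycle bypass delay

When the consumer is floating-point, an FP-domain idiom such as XORPS XMM0, XMM1 keeps the dependency chain in one stack and avoids the delay.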

2.3.5 Cache Hierarchy

The cache hierarchy contains a first level instruction cache, a first level data cache (L1 DCache) and a second level (L2) cache in each core. The L1D cache may be shared by two logical processors if the processor supports Intel HyperThreading Technology. The L2 cache is shared by instructions and data. All cores in a physical processor package connect to a shared last level cache (LLC) via a ring connection.
The caches use the services of the Instruction Translation Lookaside Buffer (ITLB), Data Translation Lookaside Buffer (DTLB) and Shared Translation Lookaside Buffer (STLB) to translate linear addresses to physical addresses. Data coherency in all cache levels is maintained using the MESI protocol. For more information, see the Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3. Cache hierarchy details can be obtained at run-time using the CPUID instruction; see the Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 2A.

Table 2-16. Cache Parameters

Level             | Capacity                   | Associativity (ways)   | Line Size (bytes) | Write Update Policy | Inclusive
------------------|----------------------------|------------------------|-------------------|---------------------|----------
L1 Data           | 32 KB                      | 8                      | 64                | Writeback           | -
Instruction       | 32 KB                      | 8                      | N/A               | N/A                 | -
L2 (Unified)      | 256 KB                     | 8                      | 64                | Writeback           | No
Third Level (LLC) | Varies, query CPUID leaf 4 | Varies with cache size | 64                | Writeback           | Yes


2.3.5.1 Load and Store Operation Overview

This section provides an overview of the load and store operations.
Loads
When an instruction reads data from a memory location that has write-back (WB) type, the processor
looks for it in the caches and memory. Table 2-17 shows the access lookup order and best case latency.
The actual latency can vary depending on the cache queue occupancy, LLC ring occupancy, memory
components, and their parameters.

Table 2-17. Lookup Order and Load Latency

Level                                          | Latency (cycles)               | Bandwidth (per core per cycle)
-----------------------------------------------|--------------------------------|-------------------------------
L1 Data                                        | 4 (1)                          | 2 x 16 bytes
L2 (Unified)                                   | 12                             | 1 x 32 bytes
Third Level (LLC)                              | 26-31 (2)                      | 1 x 32 bytes
L2 and L1 DCache in other cores, if applicable | 43 - clean hit; 60 - dirty hit |

NOTES:
1. Subject to execution core bypass restrictions shown in Table 2-15.
2. Latency of the L3 varies with product segment and SKU. The values apply to second generation Intel Core processor families.

The LLC is inclusive of all cache levels above it - data contained in the core caches must also reside in the LLC. Each cache line in the LLC holds an indication of the cores that may have this line in their L2 and L1 caches. If there is an indication in the LLC that other cores may hold the line of interest and its state may have to be modified, there is a lookup into the L1 DCache and L2 of these cores too. The lookup is called "clean" if it does not require fetching data from the other core caches. The lookup is called "dirty" if modified data has to be fetched from the other core caches and transferred to the loading core.
The latencies shown above are the best-case scenarios. Sometimes a modified cache line has to be
evicted to make space for a new cache line. The modified cache line is evicted in parallel to bringing the
new data and does not require additional latency. However, when data is written back to memory, the
eviction uses cache bandwidth and possibly memory bandwidth as well. Therefore, when multiple cache
misses require the eviction of modified lines within a short time, there is an overall degradation in cache
response time. Memory access latencies vary based on occupancy of the memory controller queues,
DRAM configuration, DDR parameters, and DDR paging behavior (if the requested page is a page-hit,
page-miss or page-empty).
Stores
When an instruction writes data to a memory location that has a write back memory type, the processor first ensures that it has the line containing this memory location in its L1 DCache, in Exclusive or Modified MESI state. If the cache line is not present in the right state, the processor fetches it from the next levels of the memory hierarchy using a Read for Ownership request. The processor looks for the cache line in the following locations, in the specified order:
1. L1 DCache
2. L2
3. Last Level Cache
4. L2 and L1 DCache in other cores, if applicable
5. Memory
Once the cache line is in the L1 DCache, the new data is written to it, and the line is marked as Modified.
Reading for ownership and storing the data happens after instruction retirement and follows the order of
store instruction retirement. Therefore, the store latency usually does not affect the store instruction
itself. However, several sequential stores that miss the L1 DCache may have cumulative latency that can


affect performance. As long as the store does not complete, its entry remains occupied in the store
buffer. When the store buffer becomes full, new micro-ops cannot enter the execution pipe and execution
might stall.

2.3.5.2 L1 DCache

The L1 DCache is the first level data cache. It manages all load and store requests from all types through its internal data structures. The L1 DCache:

• Enables loads and stores to issue speculatively and out of order.
• Ensures that retired loads and stores have the correct data upon retirement.
• Ensures that loads and stores follow the memory ordering rules of the IA-32 and Intel 64 instruction set architectures.

Table 2-18. L1 Data Cache Components

Component               | Intel microarchitecture code name Sandy Bridge | Intel microarchitecture code name Nehalem
------------------------|------------------------------------------------|------------------------------------------
Data Cache Unit (DCU)   | 32KB, 8 ways                                   | 32KB, 8 ways
Load buffers            | 64 entries                                     | 48 entries
Store buffers           | 36 entries                                     | 32 entries
Line fill buffers (LFB) | 10 entries                                     | 10 entries

The DCU is organized as 32 KBytes, eight-way set associative. Cache line size is 64 bytes, arranged in eight banks.
Internally, accesses are up to 16 bytes, with 256-bit Intel AVX instructions utilizing two 16-byte accesses. Two load operations and one store operation can be handled each cycle.
The L1 DCache maintains requests which cannot be serviced immediately to completion. Some reasons for delayed requests: cache misses, unaligned accesses that split across cache lines, data not ready to be forwarded from a preceding store, loads experiencing bank collisions, and load blocks due to cache line replacement.
The L1 DCache can maintain up to 64 load micro-ops from allocation until retirement. It can maintain up
to 36 store operations from allocation until the store value is committed to the cache, or written to the
line fill buffers (LFB) in the case of non-temporal stores.
The L1 DCache can handle multiple outstanding cache misses and continue to service incoming stores
and loads. Up to 10 requests of missing cache lines can be managed simultaneously using the LFB.
The L1 DCache is a write-back write-allocate cache. Stores that hit in the DCU do not update the lower
levels of the memory hierarchy. Stores that miss the DCU allocate a cache line.
Loads
The L1 DCache architecture can service two loads per cycle, each of which can be up to 16 bytes. Up to
32 loads can be maintained at different stages of progress, from their allocation in the out of order engine
until the loaded value is returned to the execution core.
Loads can:

• Read data before preceding stores when the load address and store address ranges are known not to conflict.
• Be carried out speculatively, before preceding branches are resolved.
• Take cache misses out of order and in an overlapped manner.

Loads cannot:

• Speculatively take any sort of fault or trap.
• Speculatively access uncacheable memory.


The common load latency is five cycles. When using a simple addressing mode, base plus offset that is smaller than 2048, the load latency can be four cycles. This technique is especially useful for pointer-chasing code. However, overall latency varies depending on the target register data type due to stack bypass. See Section 2.3.4 for more information.
The following table lists overall load latencies. These latencies assume the common case of a flat segment, that is, the segment base address is zero. If the segment base is not zero, load latency increases.

Table 2-19. Effect of Addressing Modes on Load Latency

Data Type/Addressing Mode | Base + Offset > 2048; Base + Index [+ Offset] | Base + Offset < 2048
--------------------------|-----------------------------------------------|---------------------
Integer                   | 5                                             | 4
MMX, SSE, 128-bit AVX     | 6                                             | 5
X87                       | 7                                             | 6
256-bit AVX               | 7                                             | 7
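A minimal sketch (not from this manual) of the faster simple-addressing form in pointer-chasing code:

    MOV RAX, [RAX + 16]      ; base + small offset (< 2048): 4-cycle integer load latency
    MOV RAX, [RAX + RBX*8]   ; base + index: 5-cycle latency on this microarchitecture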

Stores
Stores to memory are executed in two phases:

• Execution phase. Fills the store buffers with linear and physical address and data. Once store address and data are known, the store data can be forwarded to the following load operations that need it.
• Completion phase. After the store retires, the L1 DCache moves its data from the store buffers to the DCU, up to 16 bytes per cycle.

Address Translation
The DTLB can perform three linear to physical address translations every cycle, two for load addresses and one for a store address. If the address is missing in the DTLB, the processor looks for it in the STLB, which holds data and instruction address translations. The penalty of a DTLB miss that hits the STLB is seven cycles. Large page support includes 1G byte pages, in addition to 4K and 2M/4M pages.
The DTLB and STLB are four way set associative. The following table specifies the number of entries in the DTLB and STLB.

Table 2-20. DTLB and STLB Parameters

TLB  | Page Size | Entries
-----|-----------|--------
DTLB | 4KB       | 64
     | 2MB/4MB   | 32
     | 1GB       | 4
STLB | 4KB       | 512

Store Forwarding
If a load follows a store and reloads the data that the store writes to memory, the data can forward directly from the store operation to the load. This process, called store to load forwarding, saves cycles by enabling the load to obtain the data directly from the store operation instead of through memory. You can take advantage of store forwarding to quickly move complex structures without losing the ability to forward the subfields. The memory control unit can handle store forwarding situations with fewer restrictions compared to previous microarchitectures.
The following rules must be met to enable store to load forwarding:

• The store must be the last store to that address, prior to the load.
• The store must contain all data being loaded.
• The load is from a write-back memory type and neither the load nor the store are non-temporal accesses.

2-26

INTEL® 64 AND IA-32 PROCESSOR ARCHITECTURES

Stores cannot forward to loads in the following cases:

• Four byte and eight byte loads that cross an eight byte boundary, relative to the preceding 16- or 32-byte store.
• Any load that crosses a 16-byte boundary of a 32-byte store.

Table 2-21 to Table 2-24 detail the store to load forwarding behavior. For a given store size, all the loads that may overlap are shown and specified by 'F'. Forwarding from a 32 byte store is similar to forwarding from each of the 16 byte halves of the store. Cases that cannot forward are shown as 'N'.

Table 2-21. Store Forwarding Conditions (1 and 2 byte stores)

Store | Load |                      Load Alignment
Size  | Size | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15
------|------|---|---|---|---|---|---|---|---|---|---|----|----|----|----|----|---
1     | 1    | F |   |   |   |   |   |   |   |   |   |    |    |    |    |    |
2     | 1    | F | F |   |   |   |   |   |   |   |   |    |    |    |    |    |
2     | 2    | F | N |   |   |   |   |   |   |   |   |    |    |    |    |    |

Table 2-22. Store Forwarding Conditions (4-16 byte stores)

Store | Load |                      Load Alignment
Size  | Size | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15
------|------|---|---|---|---|---|---|---|---|---|---|----|----|----|----|----|---
4     | 1    | F | F | F | F |   |   |   |   |   |   |    |    |    |    |    |
4     | 2    | F | F | F | N |   |   |   |   |   |   |    |    |    |    |    |
4     | 4    | F | N | N | N |   |   |   |   |   |   |    |    |    |    |    |
8     | 1    | F | F | F | F | F | F | F | F |   |   |    |    |    |    |    |
8     | 2    | F | F | F | F | F | F | F | N |   |   |    |    |    |    |    |
8     | 4    | F | F | F | F | F | N | N | N |   |   |    |    |    |    |    |
8     | 8    | F | N | N | N | N | N | N | N |   |   |    |    |    |    |    |
16    | 1    | F | F | F | F | F | F | F | F | F | F | F  | F  | F  | F  | F  | F
16    | 2    | F | F | F | F | F | F | F | F | F | F | F  | F  | F  | F  | F  | N
16    | 4    | F | F | F | F | F | N | N | N | F | F | F  | F  | F  | N  | N  | N
16    | 8    | F | N | N | N | N | N | N | N | F | N | N  | N  | N  | N  | N  | N
16    | 16   | F | N | N | N | N | N | N | N | N | N | N  | N  | N  | N  | N  | N


Table 2-23. 32-byte Store Forwarding Conditions (0-15 byte alignment)

Store | Load |                      Load Alignment
Size  | Size | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15
------|------|---|---|---|---|---|---|---|---|---|---|----|----|----|----|----|---
32    | 1    | F | F | F | F | F | F | F | F | F | F | F  | F  | F  | F  | F  | F
32    | 2    | F | F | F | F | F | F | F | F | F | F | F  | F  | F  | F  | F  | N
32    | 4    | F | F | F | F | F | N | N | N | F | F | F  | F  | F  | N  | N  | N
32    | 8    | F | N | N | N | N | N | N | N | F | N | N  | N  | N  | N  | N  | N
32    | 16   | F | N | N | N | N | N | N | N | N | N | N  | N  | N  | N  | N  | N
32    | 32   | F | N | N | N | N | N | N | N | N | N | N  | N  | N  | N  | N  | N

Table 2-24. 32-byte Store Forwarding Conditions (16-31 byte alignment)

Store | Load |                        Load Alignment
Size  | Size | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31
------|------|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|---
32    | 1    | F  | F  | F  | F  | F  | F  | F  | F  | F  | F  | F  | F  | F  | F  | F  | F
32    | 2    | F  | F  | F  | F  | F  | F  | F  | F  | F  | F  | F  | F  | F  | F  | F  | N
32    | 4    | F  | F  | F  | F  | F  | N  | N  | N  | F  | F  | F  | F  | F  | N  | N  | N
32    | 8    | F  | N  | N  | N  | N  | N  | N  | N  | F  | N  | N  | N  | N  | N  | N  | N
32    | 16   | F  | N  | N  | N  | N  | N  | N  | N  | N  | N  | N  | N  | N  | N  | N  | N
32    | 32   | N  | N  | N  | N  | N  | N  | N  | N  | N  | N  | N  | N  | N  | N  | N  | N

Memory Disambiguation
A load operation may depend on a preceding store. Many microarchitectures block loads until all
preceding store addresses are known. The memory disambiguator predicts which loads will not depend
on any previous stores. When the disambiguator predicts that a load does not have such a dependency,
the load takes its data from the L1 data cache even when the store address is unknown. This hides the
load latency. Eventually, the prediction is verified. If an actual conflict is detected, the load and all
succeeding instructions are re-executed.
The following loads are not disambiguated. The execution of these loads is stalled until the addresses of all previous stores are known.

• Loads that cross the 16-byte boundary.
• 32-byte Intel AVX loads that are not 32-byte aligned.

The memory disambiguator always assumes dependency between loads and earlier stores that have the same address bits 0:11.
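As a sketch (not from this manual) of the bits 0:11 rule, a load whose address matches an earlier store in the low 12 bits is treated as dependent even when the full addresses differ:

    MOV [RSI], EAX        ; store to some address X
    MOV EBX, [RSI+4096]   ; load from X + 4 KBytes: bits 0:11 match the store,
                          ; so the disambiguator assumes a dependency and the load waits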
Bank Conflict
Since 16-byte loads can cover up to three banks, and two loads can happen every cycle, it is possible that six of the eight banks may be accessed per cycle, for loads. A bank conflict happens when two load accesses need the same bank (their addresses have the same value in bits 2-4) in different sets, at the same time. When a bank conflict occurs, one of the load accesses is recycled internally.
In many cases two loads access exactly the same bank in the same cache line, as may happen when popping operands off the stack, or with any sequential accesses. In these cases, no conflict occurs and the loads are serviced simultaneously.


2.3.5.3 Ring Interconnect and Last Level Cache

The system-on-a-chip design provides a high bandwidth bi-directional ring bus to connect between the IA cores and various sub-systems in the uncore. In the second generation Intel Core processor 2xxx series, the uncore subsystem includes a system agent, the graphics unit (GT) and the last level cache (LLC).
The LLC consists of multiple cache slices. The number of slices is equal to the number of IA cores. Each slice has a logic portion and a data array portion. The logic portion handles data coherency, memory ordering, access to the data array portion, LLC misses and writeback to memory, and more. The data array portion stores cache lines. Each slice contains a full cache port that can supply 32 bytes/cycle.
The physical addresses of data kept in the LLC data arrays are distributed among the cache slices by a hash function, such that addresses are uniformly distributed. The data array in a cache block may have 4/8/12/16 ways corresponding to 0.5M/1M/1.5M/2M block size. However, due to the address distribution among the cache blocks, from the software point of view this does not appear as a normal N-way cache.
From the point of view of the processor cores and the GT, the LLC acts as one shared cache with multiple ports and bandwidth that scales with the number of cores. The LLC hit latency, ranging between 26-31 cycles, depends on the core location relative to the LLC block, and how far the request needs to travel on the ring.
The number of cache slices increases with the number of cores, therefore the ring and LLC are not likely to be a bandwidth limiter to core operation.
The GT sits on the same ring interconnect, and uses the LLC for its data operations as well. In this respect it is very similar to an IA core. Therefore, high bandwidth graphic applications using cache bandwidth and a significant cache footprint can interfere, to some extent, with core operations.
All the traffic that cannot be satisfied by the LLC, such as LLC misses, dirty line writeback, non-cacheable
operations, and MMIO/IO operations, still travels through the cache-slice logic portion and the ring, to
the system agent.
In the Intel Xeon Processor E5 Family, the uncore subsystem does not include the graphics unit (GT).
Instead, the uncore subsystem contains many more components, including an LLC with larger capacity
and snooping capabilities to support multiple processors, Intel® QuickPath Interconnect interfaces that
can support multi-socket platforms, power management control hardware, and a system agent capable
of supporting high-bandwidth traffic from memory and I/O devices.
In the Intel Xeon processor E5 2xxx or 4xxx families, the LLC capacity generally scales with the number
of processor cores with 2.5 MBytes per core.

2.3.5.4 Data Prefetching

Data can be speculatively loaded to the L1 DCache using software prefetching, hardware prefetching, or
any combination of the two.
You can use the four Streaming SIMD Extensions (SSE) prefetch instructions to enable software-controlled prefetching. These instructions are hints to bring a cache line of data into the desired levels of the cache hierarchy. The software-controlled prefetch is intended for prefetching data, but not for prefetching code.
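A minimal sketch (not from this manual) of software prefetching in a streaming loop; the 256-byte prefetch distance is an arbitrary illustration and should be tuned per workload, and RSI is assumed 16-byte aligned:

process_loop:
    PREFETCHT0 [RSI+256]    ; hint: fetch the line four iterations ahead into all cache levels
    MOVAPS XMM0, [RSI]      ; process part of the current 64-byte line
    MOVAPS XMM1, [RSI+16]
    ADD    RSI, 64
    SUB    RCX, 1
    JNZ    process_loop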
The rest of this section describes the various hardware prefetching mechanisms provided by Intel microarchitecture code name Sandy Bridge and their improvement over previous processors. The goal of the prefetchers is to automatically predict which data the program is about to consume. If this data is not close to the execution core or inner cache, the prefetchers bring it from the next levels of the cache hierarchy and memory. Prefetching has the following effects:

• Improves performance if data is arranged sequentially in the order used in the program.
• May cause slight performance degradation due to bandwidth issues, if access patterns are sparse instead of local.
• On rare occasions, if the algorithm's working set is tuned to occupy most of the cache and unneeded prefetches evict lines required by the program, the hardware prefetcher may cause severe performance degradation due to the limited capacity of the L1 cache.


Data Prefetch to L1 Data Cache
Data prefetching is triggered by load operations when the following conditions are met:

• Load is from writeback memory type.
• The prefetched data is within the same 4K byte page as the load instruction that triggered it.
• No fence is in progress in the pipeline.
• Not many other load misses are in progress.
• There is not a continuous stream of stores.

Two hardware prefetchers load data to the L1 DCache:

• Data cache unit (DCU) prefetcher. This prefetcher, also known as the streaming prefetcher, is triggered by an ascending access to very recently loaded data. The processor assumes that this access is part of a streaming algorithm and automatically fetches the next line.
• Instruction pointer (IP)-based stride prefetcher. This prefetcher keeps track of individual load instructions. If a load instruction is detected to have a regular stride, then a prefetch is sent to the next address which is the sum of the current address and the stride. This prefetcher can prefetch forward or backward and can detect strides of up to 2K bytes.
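For illustration (a hand-written sketch, not taken from this manual), a load that walks memory with a constant stride is a candidate for the IP-based prefetcher:

stride_loop:
    MOV  EAX, [RSI]     ; the same load instruction executes each iteration
    ADD  RSI, 128       ; constant 128-byte stride: after a few iterations the
                        ; IP-based prefetcher starts fetching the next address ahead of use
    SUB  RCX, 1
    JNZ  stride_loop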

Data Prefetch to the L2 and Last Level Cache
The following two hardware prefetchers fetch data from memory to the L2 cache and last level cache:
Spatial Prefetcher: This prefetcher strives to complete every cache line fetched to the L2 cache with the pair line that completes it to a 128-byte aligned chunk.
Streamer: This prefetcher monitors read requests from the L1 cache for ascending and descending sequences of addresses. Monitored read requests include L1 DCache requests initiated by load and store operations and by the hardware prefetchers, and L1 ICache requests for code fetch. When a forward or backward stream of requests is detected, the anticipated cache lines are prefetched. Prefetched cache lines must be in the same 4K page.
The streamer and spatial prefetcher prefetch the data to the last level cache. Typically data is brought also to the L2 unless the L2 cache is heavily loaded with missing demand requests.
Enhancements to the streamer include the following features:

• The streamer may issue two prefetch requests on every L2 lookup. The streamer can run up to 20 lines ahead of the load request.
• Adjusts dynamically to the number of outstanding requests per core. If there are not many outstanding requests, the streamer prefetches further ahead. If there are many outstanding requests it prefetches to the LLC only and less far ahead.
• When cache lines are far ahead, it prefetches to the last level cache only and not to the L2. This method avoids replacement of useful cache lines in the L2 cache.
• Detects and maintains up to 32 streams of data accesses. For each 4K byte page, one forward and one backward stream can be maintained.

2.3.6 System Agent

The system agent implemented in the second generation Intel Core processor family contains the following components:

• An arbiter that handles all accesses from the ring domain and from I/O (PCIe* and DMI) and routes the accesses to the right place.
• PCIe controllers connect to external PCIe devices. The PCIe controllers have different configuration possibilities that vary with product segment specifics: x16+x4, x8+x8+x4, x8+x4+x4+x4.
• DMI controller connects to the PCH chipset.
• Integrated display engine, Flexible Display Interconnect, and Display Port, for the internal graphic operations.
• Memory controller.

All main memory traffic is routed from the arbiter to the memory controller. The memory controller in the second generation Intel Core processor 2xxx series supports two channels of DDR, with data rates of 1066MHz, 1333MHz and 1600MHz, and 8 bytes per cycle, depending on the unit type, system configuration and DRAMs. Addresses are distributed between memory channels based on a local hash function that attempts to balance the load between the channels in order to achieve maximum bandwidth and minimum hotspot collisions.
For best performance, populate both channels with equal amounts of memory, preferably the exact same types of DIMMs. In addition, using more ranks for the same amount of memory results in somewhat better memory bandwidth, since more DRAM pages can be open simultaneously. For best performance, populate the system with the highest supported speed DRAM (1333MHz or 1600MHz data rates, depending on the max supported frequency) with the best DRAM timings.
The two channels have separate resources and handle memory requests independently. The memory controller contains a high-performance out-of-order scheduler that attempts to maximize memory bandwidth while minimizing latency. Each memory channel contains a 32 cache-line write-data buffer. Writes to the memory controller are considered completed when they are written to the write-data buffer. The write-data buffer is flushed out to main memory at a later time, not impacting write latency.
Partial writes are not handled efficiently on the memory controller and may result in read-modify-write operations on the DDR channel if the partial writes do not complete a full cache line in time. Software should avoid creating partial write transactions whenever possible and consider alternatives, such as buffering the partial writes into full cache line writes.
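As one illustration (a hand-written sketch, not taken from this manual), four 16-byte streaming stores covering a full 64-byte line avoid a partial-write read-modify-write sequence; RDI is assumed 64-byte aligned:

    MOVNTDQ [RDI],    XMM0   ; four 16-byte non-temporal stores cover the
    MOVNTDQ [RDI+16], XMM1   ; full 64-byte line, so the write can be issued
    MOVNTDQ [RDI+32], XMM2   ; as one full cache line transaction instead of
    MOVNTDQ [RDI+48], XMM3   ; a read-modify-write on the DDR channel
    SFENCE                   ; order the streaming stores before later stores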
The memory controller also supports high-priority isochronous requests (such as USB isochronous, and
Display isochronous requests). High bandwidth of memory requests from the integrated display engine
takes up some of the memory bandwidth and impacts core access latency to some degree.

2.3.7 Intel® Microarchitecture Code Name Ivy Bridge

Third generation Intel Core processors are based on Intel microarchitecture code name Ivy Bridge. Most
of the features described in Section 2.3.1 - Section 2.3.6 also apply to Intel microarchitecture code name
Ivy Bridge. This section covers feature differences in microarchitecture that can affect coding and performance.
Support for new instructions includes:

• Numeric conversion to and from half-precision floating-point values.
• Hardware-based random number generator compliant with NIST SP 800-90A.
• Reading and writing to FS/GS base registers in any ring to improve user-mode threading support.

For details about using the hardware based random number generator instruction RDRAND, please refer
to the article available from Intel Software Network at http://software.intel.com/en-us/articles/download-the-latest-bull-mountain-software-implementation-guide/?wapkw=bull+mountain.
A small number of microarchitectural enhancements can be beneficial to software:

• Hardware prefetch enhancement: A next-page prefetcher (NPP) is added in Intel microarchitecture code name Ivy Bridge. The NPP is triggered by sequential accesses to cache lines approaching the page boundary, either upwards or downwards.
• Zero-latency register move operation: A subset of register-to-register MOV instructions are executed at the front end, conserving scheduling and execution resources in the out-of-order engine.
• Front end enhancement: In Intel microarchitecture code name Sandy Bridge, the micro-op queue is statically partitioned to provide 28 entries for each logical processor, irrespective of software executing in single thread or multiple threads. If one logical processor is not active in Intel microarchitecture code name Ivy Bridge, then a single thread executing on that processor core can use the 56 entries in the micro-op queue. In this case, the LSD can handle larger loop structures that would require more than 28 entries.


• The latency and throughput of some instructions have been improved over those of Intel microarchitecture code name Sandy Bridge. For example, 256-bit packed floating-point divide and square root operations are faster; ROL and ROR instructions are also improved.

2.4 INTEL® CORE™ MICROARCHITECTURE AND ENHANCED INTEL® CORE™ MICROARCHITECTURE

Intel Core microarchitecture introduces the following features that enable high performance and power-efficient performance for single-threaded as well as multi-threaded workloads:

• Intel® Wide Dynamic Execution enables each processor core to fetch, dispatch, execute with high bandwidths and retire up to four instructions per cycle. Features include:
— Fourteen-stage efficient pipeline.
— Three arithmetic logical units.
— Four decoders to decode up to five instructions per cycle.
— Macro-fusion and micro-fusion to improve front end throughput.
— Peak issue rate of dispatching up to six micro-ops per cycle.
— Peak retirement bandwidth of up to four micro-ops per cycle.
— Advanced branch prediction.
— Stack pointer tracker to improve efficiency of executing function/procedure entries and exits.
• Intel® Advanced Smart Cache delivers higher bandwidth from the second level cache to the core, and optimal performance and flexibility for single-threaded and multi-threaded applications. Features include:
— Optimized for multicore and single-threaded execution environments.
— 256 bit internal data path to improve bandwidth from L2 to first-level data cache.
— Unified, shared second-level cache of 4 Mbyte, 16 way (or 2 MByte, 8 way).
• Intel® Smart Memory Access prefetches data from memory in response to data access patterns and reduces cache-miss exposure of out-of-order execution. Features include:
— Hardware prefetchers to reduce effective latency of second-level cache misses.
— Hardware prefetchers to reduce effective latency of first-level data cache misses.
— Memory disambiguation to improve efficiency of the speculative execution engine.
• Intel® Advanced Digital Media Boost improves most 128-bit SIMD instructions with single-cycle throughput and floating-point operations. Features include:
— Single-cycle throughput of most 128-bit SIMD instructions (except 128-bit shuffle, pack, unpack operations).
— Up to eight floating-point operations per cycle.
— Three issue ports available to dispatching SIMD instructions for execution.

The Enhanced Intel Core microarchitecture supports all of the features of Intel Core microarchitecture and provides a comprehensive set of enhancements.

• Intel® Wide Dynamic Execution includes several enhancements:
— A radix-16 divider replacing the previous radix-4 based divider to speed up long-latency operations such as divisions and square roots.
— Improved system primitives to speed up long-latency operations such as RDTSC, STI, CLI, and VM exit transitions.
• Intel® Advanced Smart Cache provides up to 6 MBytes of second-level cache shared between two processor cores (quad-core processors have up to 12 MBytes of L2); up to 24 way/set associativity.


• Intel® Smart Memory Access supports a high-speed system bus up to 1600 MHz and provides more efficient handling of memory operations such as split cache line loads and store-to-load forwarding situations.
• Intel® Advanced Digital Media Boost provides a 128-bit shuffler unit to speed up shuffle, pack, unpack operations; adds support for 47 SSE4.1 instructions.

In the sub-sections of 2.4.x, most of the descriptions of Intel Core microarchitecture also apply to Enhanced Intel Core microarchitecture. Differences between them are noted explicitly.

2.4.1 Intel® Core™ Microarchitecture Pipeline Overview

The pipeline of the Intel Core microarchitecture contains:

• An in-order issue front end that fetches instruction streams from memory, with four instruction decoders to supply decoded instructions (micro-ops) to the out-of-order execution core.
• An out-of-order superscalar execution core that can issue up to six micro-ops per cycle (see Table 2-26) and reorder micro-ops to execute as soon as sources are ready and execution resources are available.
• An in-order retirement unit that ensures the results of execution of micro-ops are processed and architectural states are updated according to the original program order.

Intel Core 2 Extreme processor X6800, Intel Core 2 Duo processors and Intel Xeon processor 3000, 5100 series implement two processor cores based on the Intel Core microarchitecture. Intel Core 2 Extreme quad-core processor, Intel Core 2 Quad processors and Intel Xeon processor 3200 series, 5300 series implement four processor cores. Each physical package of these quad-core processors contains two processor dies, each die containing two processor cores. The functionality of the subsystems in each core is depicted in Figure 2-6.

[Figure: pipeline block diagram showing Instruction Fetch and PreDecode, Instruction Queue, Microcode ROM, Decode, Shared L2 Cache (up to 10.7 GB/s FSB), Rename/Alloc, Retirement Unit (Re-Order Buffer), Scheduler, the execution units (ALU/Branch/MMX/SSE/FP Move; ALU/FAdd/MMX/SSE; ALU/FMul/MMX/SSE; Load; Store), and the L1D Cache and DTLB.]

Figure 2-6. Intel Core Microarchitecture Pipeline Functionality


2.4.2 Front End

The front end needs to supply decoded instructions (micro-ops) and sustain the stream to a six-issue wide out-of-order engine. The components of the front end, their functions, and the performance challenges to microarchitectural design are described in Table 2-25.

Table 2-25. Components of the Front End

Component: Branch Prediction Unit (BPU)
  Functions: Helps the instruction fetch unit fetch the most likely instruction to be executed by predicting the various branch types: conditional, indirect, direct, call, and return. Uses dedicated hardware for each type.
  Performance Challenges: Enables speculative execution. Improves speculative execution efficiency by reducing the amount of code in the "non-architected path" (1) to be fetched into the pipeline.

Component: Instruction Fetch Unit
  Functions: Prefetches instructions that are likely to be executed. Caches frequently-used instructions. Predecodes and buffers instructions, maintaining a constant bandwidth despite irregularities in the instruction stream.
  Performance Challenges: Variable length instruction format causes unevenness (bubbles) in decode bandwidth. Taken branches and misaligned targets cause disruptions in the overall bandwidth delivered by the fetch unit.

Component: Instruction Queue and Decode Unit
  Functions: Decodes up to four instructions, or up to five with macro-fusion. Stack pointer tracker algorithm for efficient procedure entry and exit. Implements the Macro-Fusion feature, providing higher performance and efficiency. The Instruction Queue is also used as a loop cache, enabling some loops to be executed with both higher bandwidth and lower power.
  Performance Challenges: Varying amounts of work per instruction requires expansion into variable numbers of micro-ops. Prefixes add a dimension of decoding complexity. Length Changing Prefix (LCP) can cause front end bubbles.

NOTES:
1. Code paths that the processor thought it should execute but then found out it should go in another path and therefore reverted from its initial intention.

2.4.2.1 Branch Prediction Unit

Branch prediction enables the processor to begin executing instructions long before the branch outcome is decided. All branches utilize the BPU for prediction. The BPU contains the following features:

• 16-entry Return Stack Buffer (RSB). It enables the BPU to accurately predict RET instructions.
• Front end queuing of BPU lookups. The BPU makes branch predictions for 32 bytes at a time, twice the width of the fetch engine. This enables taken branches to be predicted with no penalty.
Even though this BPU mechanism generally eliminates the penalty for taken branches, software should still regard taken branches as consuming more resources than do not-taken branches.

The BPU makes the following types of predictions:

• Direct Calls and Jumps. Targets are read as a target array, without regarding the taken or not-taken prediction.
• Indirect Calls and Jumps. These may either be predicted as having a monotonic target or as having targets that vary in accordance with recent program behavior.
• Conditional branches. Predicts the branch target and whether or not the branch will be taken.

For information about optimizing software for the BPU, see Section 3.4, “Optimizing the Front End.”


2.4.2.2 Instruction Fetch Unit

The instruction fetch unit comprises the instruction translation lookaside buffer (ITLB), an instruction
prefetcher, the instruction cache and the predecode logic of the instruction queue (IQ).

Instruction Cache and ITLB
An instruction fetch is a 16-byte aligned lookup through the ITLB into the instruction cache and instruction prefetch buffers. A hit in the instruction cache causes 16 bytes to be delivered to the instruction
predecoder. Typical programs average slightly less than 4 bytes per instruction, depending on the code
being executed. Since most instructions can be decoded by all decoders, an entire fetch can often be
consumed by the decoders in one cycle.
A misaligned target reduces the number of instruction bytes by the amount of offset into the 16 byte
fetch quantity. A taken branch reduces the number of instruction bytes delivered to the decoders since
the bytes after the taken branch are not decoded. Branches are taken approximately every 10 instructions in typical integer code, which translates into a “partial” instruction fetch every 3 or 4 cycles.
Due to stalls in the rest of the machine, front end starvation does not usually cause performance degradation. For extremely fast code with larger instructions (such as SSE2 integer media kernels), it may be
beneficial to use targeted alignment to prevent instruction starvation.

Instruction PreDecode
The predecode unit accepts the sixteen bytes from the instruction cache or prefetch buffers and carries
out the following tasks:

• Determine the length of the instructions.
• Decode all prefixes associated with instructions.
• Mark various properties of instructions for the decoders (for example, "is branch.").

The predecode unit can write up to six instructions per cycle into the instruction queue. If a fetch contains
more than six instructions, the predecoder continues to decode up to six instructions per cycle until all
instructions in the fetch are written to the instruction queue. Subsequent fetches can only enter predecoding after the current fetch completes.
For a fetch of seven instructions, the predecoder decodes the first six in one cycle, and then only one in
the next cycle. This process would support decoding 3.5 instructions per cycle. Even if the instruction per
cycle (IPC) rate is not fully optimized, it is higher than the performance seen in most applications. In
general, software usually does not have to take any extra measures to prevent instruction starvation.
The following instruction prefixes cause problems during length decoding. These prefixes can dynamically change the length of instructions and are known as length changing prefixes (LCPs):

• Operand Size Override (66H) preceding an instruction with word immediate data.
• Address Size Override (67H) preceding an instruction with a modR/M byte in real, 16-bit protected or 32-bit protected modes.

When the predecoder encounters an LCP in the fetch line, it must use a slower length decoding algorithm.
With the slower length decoding algorithm, the predecoder decodes the fetch in 6 cycles, instead of the
usual 1 cycle.
Normal queuing within the processor pipeline usually cannot hide LCP penalties.
The REX prefix (4xh) in the Intel 64 architecture instruction set can change the size of two classes of
instruction: MOV offset and MOV immediate. Nevertheless, it does not cause an LCP penalty and hence
is not considered an LCP.
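A sketch (not from this manual) of an LCP-triggering instruction and an equivalent encoding that avoids it:

    ADD AX, 5555H       ; 66H operand-size prefix + 16-bit immediate: LCP, slow length decode
    ADD EAX, 5555H      ; 32-bit immediate: no LCP penalty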

2.4.2.3 Instruction Queue (IQ)

The instruction queue is 18 instructions deep. It sits between the instruction predecode unit and the
instruction decoders. It sends up to five instructions per cycle, and supports one macro-fusion per cycle.
It also serves as a loop cache for loops smaller than 18 instructions. The loop cache operates as described
below.

A Loop Stream Detector (LSD) resides in the BPU. The LSD attempts to detect loops which are candidates
for streaming from the instruction queue (IQ). When such a loop is detected, the instruction bytes are
locked down and the loop is allowed to stream from the IQ until a misprediction ends it. When the loop
plays back from the IQ, it provides higher bandwidth at reduced power (since much of the rest of the
front end pipeline is shut off).
The LSD provides the following benefits:

• No loss of bandwidth due to taken branches.
• No loss of bandwidth due to misaligned instructions.
• No LCP penalties, as the pre-decode stage has already been passed.
• Reduced front end power consumption, because the instruction cache, BPU and predecode unit can be idle.

Software should use the loop cache functionality opportunistically. Loop unrolling and other code optimizations may make the loop too big to fit into the LSD. For high performance code, loop unrolling is generally preferable for performance even when it overflows the loop cache capability.

2.4.2.4 Instruction Decode

The Intel Core microarchitecture contains four instruction decoders. The first, Decoder 0, can decode
Intel 64 and IA-32 instructions up to 4 micro-ops in size. Three other decoders handle single micro-op
instructions. The microsequencer can provide up to 3 micro-ops per cycle, and helps decode instructions
larger than 4 micro-ops.
All decoders support the common cases of single micro-op flows, including: micro-fusion, stack pointer
tracking and macro-fusion. Thus, the three simple decoders are not limited to decoding single micro-op
instructions. Packing instructions into a 4-1-1-1 template is not necessary and not recommended.
Macro-fusion merges two instructions into a single micro-op. Intel Core microarchitecture is capable of one macro-fusion per cycle in 32-bit operation (including compatibility sub-mode of the Intel 64 architecture), but not in 64-bit mode, because code that more often uses longer instructions (length in bytes) is less likely to take advantage of hardware support for macro-fusion.
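For illustration (a hand-written sketch, not taken from this manual), a typical fusible compare-and-branch pair in 32-bit code:

    CMP ECX, EDX        ; in 32-bit operation, CMP and the following conditional
    JNE loop_top        ; jump can be macro-fused and decoded as a single micro-op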

2.4.2.5 Stack Pointer Tracker

The Intel 64 and IA-32 architectures have several commonly used instructions for parameter passing and
procedure entry and exit: PUSH, POP, CALL, LEAVE and RET. These instructions implicitly update the
stack pointer register (RSP), maintaining a combined control and parameter stack without software
intervention. These instructions are typically implemented by several micro-ops in previous microarchitectures.
The Stack Pointer Tracker moves all these implicit RSP updates to logic contained in the decoders themselves. The feature provides the following benefits:

• Improves decode bandwidth, as PUSH, POP and RET are single micro-op instructions in Intel Core microarchitecture.
• Conserves execution bandwidth as the RSP updates do not compete for execution resources.
• Improves parallelism in the out of order execution engine as the implicit serial dependencies between micro-ops are removed.
• Improves power efficiency as the RSP updates are carried out on small, dedicated hardware.

2.4.2.6 Micro-fusion

Micro-fusion fuses multiple micro-ops from the same instruction into a single complex micro-op. The complex micro-op is dispatched in the out-of-order execution core. Micro-fusion provides the following performance advantages:

• Improves instruction bandwidth delivered from decode to retirement.
• Reduces power consumption as the complex micro-op represents more work in a smaller format (in terms of bit density), reducing overall "bit-toggling" in the machine for a given amount of work and virtually increasing the amount of storage in the out-of-order execution engine.

Many instructions provide register flavors and memory flavors. The flavor involving a memory operand decodes into a longer flow of micro-ops than the register version. Micro-fusion enables software to use memory-to-register operations to express the actual program behavior without worrying about a loss of decode bandwidth.
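As a sketch (not from this manual), the memory flavor of an instruction relies on micro-fusion to occupy a single slot through most of the pipeline:

    ADD EAX, [RSI]      ; load + add micro-ops fused: one slot in decode and retirement
    ; equivalent unfused sequence:
    MOV EBX, [RSI]      ; separate load micro-op
    ADD EAX, EBX        ; separate add micro-op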

2.4.3 Execution Core

The execution core of the Intel Core microarchitecture is superscalar and can process instructions out of
order. When a dependency chain causes the machine to wait for a resource (such as a second-level data
cache line), the execution core executes other instructions. This increases the overall rate of instructions
executed per cycle (IPC).
The execution core contains the following three major components:

• Renamer — Moves micro-ops from the front end to the execution core. Architectural registers are renamed to a larger set of microarchitectural registers. Renaming eliminates false dependencies known as write-after-write and write-after-read hazards.
• Reorder buffer (ROB) — Holds micro-ops in various stages of completion, buffers completed micro-ops, updates the architectural state in order, and manages ordering of exceptions. The ROB has 96 entries to handle instructions in flight.
• Reservation station (RS) — Queues micro-ops until all source operands are ready, schedules and dispatches ready micro-ops to the available execution units. The RS has 32 entries.

The initial stages of the out of order core move the micro-ops from the front end to the ROB and RS. In this process, the out of order core carries out the following steps:

• Allocates resources to micro-ops (for example: these resources could be load or store buffers).
• Binds the micro-op to an appropriate issue port.
• Renames sources and destinations of micro-ops, enabling out of order execution.
• Provides data to the micro-op when the data is either an immediate value or a register value that has already been calculated.

The following list describes various types of common operations and how the core executes them efficiently:

• Micro-ops with single-cycle latency — Most micro-ops with single-cycle latency can be executed by multiple execution units, enabling multiple streams of dependent operations to be executed quickly.
• Frequently-used micro-ops with longer latency — These micro-ops have pipelined execution units so that multiple micro-ops of these types may be executing in different parts of the pipeline simultaneously.
• Operations with data-dependent latencies — Some operations, such as division, have data dependent latencies. Integer division parses the operands to perform the calculation only on significant portions of the operands, thereby speeding up common cases of dividing by small numbers.
• Floating-point operations with fixed latency for operands that meet certain restrictions — Operands that do not fit these restrictions are considered exceptional cases and are executed with higher latency and reduced throughput. The lower-throughput cases do not affect latency and throughput for more common cases.
• Memory operands with variable latency, even in the case of an L1 cache hit — Loads that are not known to be safe from forwarding may wait until a store-address is resolved before executing. The memory order buffer (MOB) accepts and processes all memory operations. See Section 2.4.4 for more information about the MOB.


2.4.3.1 Issue Ports and Execution Units

The scheduler can dispatch up to six micro-ops per cycle through the issue ports. The issue ports of Intel
Core microarchitecture and Enhanced Intel Core microarchitecture are depicted in Table 2-26, the former
is denoted by its CPUID signature of DisplayFamily_DisplayModel value of 06_0FH, the latter denoted by
the corresponding signature value of 06_17H. The table provides latency and throughput data of
common integer and floating-point (FP) operations for each issue port in cycles.

Table 2-26. Issue Ports of Intel Core Microarchitecture and Enhanced Intel Core Microarchitecture

Executable operations       | Latency, Throughput | Latency, Throughput | Comment (1)
                            | Signature = 06_0FH  | Signature = 06_17H  |
----------------------------|---------------------|---------------------|---------------------------------------
Integer ALU                 | 1, 1                | 1, 1                | Includes 64-bit mode integer MUL;
Integer SIMD ALU            | 1, 1                | 1, 1                | Issue port 0; Writeback port 0;
FP/SIMD/SSE2 Move and Logic | 1, 1                | 1, 1                |
Single-precision (SP) FP MUL| 4, 1                | 4, 1                | Issue port 0; Writeback port 0
Double-precision FP MUL     | 5, 1                | 5, 1                |
FP MUL (X87)                | 5, 2                | 5, 2                | Issue port 0; Writeback port 0
FP Shuffle                  | 1, 1                | 1, 1                | FP shuffle does not handle QW shuffle.
DIV/SQRT                    |                     |                     |
Integer ALU                 | 1, 1                | 1, 1                | Excludes 64-bit mode integer MUL;
Integer SIMD ALU            | 1, 1                | 1, 1                | Issue port 1; Writeback port 1;
FP/SIMD/SSE2 Move and Logic | 1, 1                | 1, 1                |
FP ADD                      | 3, 1                | 3, 1                | Issue port 1; Writeback port 1;
QW Shuffle                  | 1, 1 (2)            | 1, 1 (3)            |
Integer loads               | 3, 1                | 3, 1                | Issue port 2; Writeback port 2;
FP loads                    | 4, 1                | 4, 1                |
Store address (4)           | 3, 1                | 3, 1                | Issue port 3;
Store data (5)              |                     |                     | Issue Port 4;
Integer ALU                 | 1, 1                | 1, 1                | Issue port 5; Writeback port 5;
Integer SIMD ALU            | 1, 1                | 1, 1                |
FP/SIMD/SSE2 Move and Logic | 1, 1                | 1, 1                |
QW shuffles                 | 1, 1 (2)            | 1, 1 (3)            | Issue port 5; Writeback port 5;
128-bit Shuffle/Pack/Unpack | 2-4, 2-4 (6)        | 1-3, 1 (7)          |

NOTES:
1. Mixing operations of different latencies that use the same port can result in writeback bus conflicts; this can reduce overall throughput.
2. 128-bit instructions execute with longer latency and reduced throughput.
3. Uses the 128-bit shuffle unit in port 5.
4. Prepares the store forwarding and store retirement logic with the address of the data being stored.
5. Prepares the store forwarding and store retirement logic with the data being stored.
6. Varies with instructions; 128-bit instructions are executed using the QW shuffle units.
7. Varies with instructions; the 128-bit shuffle unit replaces the QW shuffle units in Enhanced Intel Core microarchitecture.
In each cycle, the RS can dispatch up to six micro-ops. Each cycle, up to 4 results may be written back to
the RS and ROB, to be used as early as the next cycle by the RS. This high execution bandwidth enables
execution bursts to keep up with the functional expansion of the micro-fused micro-ops that are decoded
and retired.


The execution core contains the following three execution stacks:

• SIMD integer.
• Regular integer.
• x87/SIMD floating-point.

The execution core also contains connections to and from the memory cluster. See Figure 2-7.

[Figure: execution core diagram showing the SIMD integer, integer/SIMD MUL, integer, and floating-point stacks on ports 0, 1 and 5; the data cache unit with DTLB, memory ordering and store forwarding; and the Load (port 2), Store address (port 3) and Store data (port 4) paths.]
Figure 2-7. Execution Core of Intel Core Microarchitecture
Notice the two dark squares inside the execution block (in grey color) that appear in the path connecting the integer and SIMD integer stacks to the floating-point stack. This delay shows up as an extra cycle called a bypass delay. Data from the L1 cache has one extra cycle of latency to the floating-point unit. The dark-colored squares in Figure 2-7 represent the extra cycle of latency.

2.4.4 Intel® Advanced Memory Access

The Intel Core microarchitecture contains an instruction cache and a first-level data cache in each core.
The two cores share a 2 or 4-MByte L2 cache. All caches are writeback and non-inclusive. Each core
contains:

• L1 data cache, known as the data cache unit (DCU) — The DCU can handle multiple outstanding cache misses and continue to service incoming stores and loads. It supports maintaining cache coherency. The DCU has the following specifications:
— 32-KBytes size.
— 8-way set associative.
— 64-bytes line size.
• Data translation lookaside buffer (DTLB) — The DTLB in Intel Core microarchitecture implements two levels of hierarchy. Each level of the DTLB has multiple entries and can support either 4-KByte pages or large pages. The entries of the inner level (DTLB0) are used for loads. The entries in the outer level (DTLB1) support store operations and loads that missed DTLB0. All entries are 4-way associative. Here is a list of entries in each DTLB:
— DTLB1 for large pages: 32 entries.
— DTLB1 for 4-KByte pages: 256 entries.
— DTLB0 for large pages: 16 entries.
— DTLB0 for 4-KByte pages: 16 entries.
A DTLB0 miss and DTLB1 hit causes a penalty of 2 cycles. Software only pays this penalty if the DTLB0 is used in some dispatch cases. The delays associated with a miss to the DTLB1 and PMH are largely non-blocking due to the design of Intel Smart Memory Access.

• Page miss handler (PMH).
• A memory ordering buffer (MOB) — Which:
— Enables loads and stores to issue speculatively and out of order.
— Ensures retired loads and stores have the correct data upon retirement.
— Ensures loads and stores follow the memory ordering rules of the Intel 64 and IA-32 architectures.

The memory cluster of the Intel Core microarchitecture uses the following to speed up memory operations:

• 128-bit load and store operations.
• Data prefetching to L1 caches.
• Data prefetch logic for prefetching to the L2 cache.
• Store forwarding.
• Memory disambiguation.
• 8 fill buffer entries.
• 20 store buffer entries.
• Out of order execution of memory operations.
• Pipelined read-for-ownership operation (RFO).

For information on optimizing software for the memory cluster, see Section 3.6, “Optimizing Memory
Accesses.”

2.4.4.1 Loads and Stores

The Intel Core microarchitecture can execute up to one 128-bit load and up to one 128-bit store per
cycle, each to different memory locations. The microarchitecture enables execution of memory operations out of order with respect to other instructions and with respect to other memory operations.
Loads can:

• Issue before preceding stores when the load address and store address are known not to conflict.
• Be carried out speculatively, before preceding branches are resolved.
• Take cache misses out of order and in an overlapped manner.
• Issue before preceding stores, speculating that the store is not going to be to a conflicting address.

Loads cannot:

• Speculatively take any sort of fault or trap.
• Speculatively access the uncacheable memory type.

Faulting or uncacheable loads are detected and wait until retirement, when they update the programmer
visible state. x87 and floating-point SIMD loads add 1 additional clock latency.
Stores to memory are executed in two phases:

• Execution phase — Prepares the store buffers with address and data for store forwarding. Consumes dispatch ports, which are ports 3 and 4.
• Completion phase — The store is retired to programmer-visible memory. It may compete for cache banks with executing loads. Store retirement is maintained as a background task by the memory order buffer, moving the data from the store buffers to the L1 cache.

2.4.4.2 Data Prefetch to L1 Caches

Intel Core microarchitecture provides two hardware prefetchers to speed up data accessed by a program by prefetching to the L1 data cache:

• Data cache unit (DCU) prefetcher — This prefetcher, also known as the streaming prefetcher, is triggered by an ascending access to very recently loaded data. The processor assumes that this access is part of a streaming algorithm and automatically fetches the next line.
• Instruction pointer (IP)-based strided prefetcher — This prefetcher keeps track of individual load instructions. If a load instruction is detected to have a regular stride, then a prefetch is sent to the next address which is the sum of the current address and the stride. This prefetcher can prefetch forward or backward and can detect strides of up to half of a 4KB-page, or 2 KBytes.

Data prefetching works on loads only when the following conditions are met:

• Load is from writeback memory type.
• The prefetch request is within the page boundary of 4 KBytes.
• No fence or lock is in progress in the pipeline.
• Not many other load misses are in progress.
• The bus is not very busy.
• There is not a continuous stream of stores.

DCU prefetching has the following effects:

•   Improves performance if data in large structures is arranged sequentially in the order used in the program.
•   May cause slight performance degradation due to bandwidth issues if access patterns are sparse instead of local.
•   On rare occasions, if the algorithm's working set is tuned to occupy most of the cache and unneeded prefetches evict lines required by the program, the hardware prefetcher may cause severe performance degradation due to the limited capacity of the L1 cache.

In contrast to hardware prefetchers, which rely on hardware to anticipate data traffic, software prefetch instructions rely on the programmer to anticipate cache miss traffic. Software prefetch instructions act as hints to bring a cache line of data into the desired levels of the cache hierarchy. The software-controlled prefetch is intended for prefetching data, but not for prefetching code.
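As a minimal illustration (our own sketch, not taken from this manual), the loop below issues PREFETCHT0 hints ahead of a streaming read via the _mm_prefetch intrinsic. The prefetch distance PF_DIST is an assumed tuning parameter, not a value prescribed here:

#include <xmmintrin.h>

#define PF_DIST 256   /* assumed prefetch distance in bytes; tune per platform */

float sum(const float *a, int n)
{
    float s = 0.0f;
    for (int i = 0; i < n; i++) {
        /* Hint: bring the line PF_DIST bytes ahead into all cache levels. */
        _mm_prefetch((const char *)(a + i) + PF_DIST, _MM_HINT_T0);
        s += a[i];
    }
    return s;
}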

2.4.4.3  Data Prefetch Logic

Data prefetch logic (DPL) prefetches data to the second-level (L2) cache based on past request patterns of the DCU from the L2. The DPL maintains two independent arrays to store addresses from the DCU: one for upstreams (12 entries) and one for downstreams (4 entries). The DPL tracks accesses to one 4-KByte page in each entry. If an accessed page is not in any of these arrays, then an array entry is allocated.
The DPL monitors DCU reads for incremental sequences of requests, known as streams. Once the DPL
detects the second access of a stream, it prefetches the next cache line. For example, when the DCU
requests the cache lines A and A+1, the DPL assumes the DCU will need cache line A+2 in the near
future. If the DCU then reads A+2, the DPL prefetches cache line A+3. The DPL works similarly for
“downward” loops.
The Intel Pentium M processor introduced DPL. The Intel Core microarchitecture added the following
features to DPL:

•   The DPL can detect more complicated streams, such as when the stream skips cache lines. The DPL may issue 2 prefetch requests on every L2 lookup. The DPL in the Intel Core microarchitecture can run up to 8 lines ahead of the load request.
•   The DPL in the Intel Core microarchitecture adjusts dynamically to bus bandwidth and the number of requests. The DPL prefetches far ahead if the bus is not busy, and less far ahead if the bus is busy.
•   The DPL adjusts to various applications and system configurations.

Entries for the two cores are handled separately.

2.4.4.4  Store Forwarding

If a load follows a store and reloads the data that the store writes to memory, the Intel Core microarchitecture can forward the data directly from the store to the load. This process, called store to load
forwarding, saves cycles by enabling the load to obtain the data directly from the store operation instead
of through memory.
The following rules must be met for store to load forwarding to occur:

•   The store must be the last store to that address prior to the load.
•   The store must be equal or greater in size than the size of data being loaded.
•   The load cannot cross a cache line boundary.
•   The load cannot cross an 8-Byte boundary. 16-Byte loads are an exception to this rule.
•   The load must be aligned to the start of the store address, except for the following exceptions:
    — An aligned 64-bit store may forward either of its 32-bit halves.
    — An aligned 128-bit store may forward any of its 32-bit quarters.
    — An aligned 128-bit store may forward either of its 64-bit halves.

Software can use the exceptions to the last rule to move complex structures without losing the ability to forward the subfields.
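To make the rules concrete, here is a minimal C sketch (the union type and function names are ours, purely for illustration). The first access pattern satisfies the rules above and can forward; the second reloads a wide value written by two narrower stores and therefore cannot:

#include <stdint.h>

typedef union {          /* illustrative 8-byte, naturally aligned packet */
    uint64_t u64;
    uint32_t u32[2];
} pack_t;

uint32_t forwarding_ok(pack_t *p, uint64_t v)
{
    p->u64 = v;          /* one aligned 64-bit store ...                 */
    return p->u32[0];    /* ... may forward its aligned 32-bit low half  */
}

uint64_t forwarding_blocked(pack_t *p, uint32_t lo, uint32_t hi)
{
    p->u32[0] = lo;      /* two narrow stores ...                        */
    p->u32[1] = hi;
    return p->u64;       /* ... cannot forward to a wider load; the load
                            must wait for the stores to commit           */
}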
In Enhanced Intel Core microarchitecture, the alignment restrictions to permit store forwarding to proceed have been relaxed. Enhanced Intel Core microarchitecture permits store-forwarding to proceed in several situations in which the succeeding load is not aligned to the preceding store. Figure 2-8 shows six situations (in gradient-filled background) of store-forwarding that are permitted in Enhanced Intel Core microarchitecture but not in Intel Core microarchitecture. The cases with backward slash background depict store-forwarding that can proceed in both Intel Core microarchitecture and Enhanced Intel Core microarchitecture.


[Figure 2-8 is a byte-level diagram: for a 32-bit store with 7-byte misalignment and a 64-bit store with 1-byte misalignment, laid out against 8-byte boundaries, it shows which 8-bit, 16-bit, 32-bit and 64-bit loads can forward. Legend: store-forwarding (SF) cannot proceed; SF proceeds in Enhanced Intel Core microarchitecture only; SF proceeds in both.]

Figure 2-8. Store-Forwarding Enhancements in Enhanced Intel Core Microarchitecture

2.4.4.5  Memory Disambiguation

A load instruction micro-op may depend on a preceding store. Many microarchitectures block loads until all preceding store addresses are known.
The memory disambiguator predicts which loads will not depend on any previous stores. When the
disambiguator predicts that a load does not have such a dependency, the load takes its data from the L1
data cache.
Eventually, the prediction is verified. If an actual conflict is detected, the load and all succeeding instructions are re-executed.

2.4.5  Intel® Advanced Smart Cache

The Intel Core microarchitecture optimized a number of features for two processor cores on a single die.
The two cores share a second-level cache and a bus interface unit, collectively known as Intel Advanced
Smart Cache. This section describes the components of Intel Advanced Smart Cache. Figure 2-9 illustrates the architecture of the Intel Advanced Smart Cache.


[Figure 2-9 shows two cores (Core 0 and Core 1), each with Branch Prediction, Fetch/Decode, Execution, Retirement, an L1 Data Cache and an L1 Instruction Cache, sharing a single L2 Cache and a Bus Interface Unit connected to the System Bus.]

Figure 2-9. Intel Advanced Smart Cache Architecture
Table 2-27 details the parameters of caches in the Intel Core microarchitecture. For information on enumerating the cache hierarchy identification using the deterministic cache parameter leaf of the CPUID instruction, see the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 2A.
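As a minimal sketch of that enumeration (assuming a GCC/Clang toolchain providing the cpuid.h builtins), the loop below walks the sub-leaves of CPUID leaf 4 and derives each cache's size from the reported ways, partitions, line size, and sets:

#include <cpuid.h>
#include <stdio.h>

int main(void)
{
    for (unsigned sub = 0; ; sub++) {
        unsigned eax, ebx, ecx, edx;
        __cpuid_count(4, sub, eax, ebx, ecx, edx);
        unsigned type = eax & 0x1F;              /* 0 = no more caches  */
        if (type == 0)
            break;
        unsigned level = (eax >> 5) & 0x7;
        unsigned line  = (ebx & 0xFFF) + 1;      /* coherency line size */
        unsigned parts = ((ebx >> 12) & 0x3FF) + 1;
        unsigned ways  = ((ebx >> 22) & 0x3FF) + 1;
        unsigned sets  = ecx + 1;
        printf("L%u (type %u): %u KBytes\n", level, type,
               ways * parts * line * sets / 1024);
    }
    return 0;
}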

Table 2-27. Cache Parameters of Processors based on Intel Core Microarchitecture

Level                      | Capacity     | Associativity (ways) | Line Size (bytes) | Access Latency (clocks) | Access Throughput (clocks) | Write Update Policy
First Level                | 32 KB        | 8                    | 64                | 3                       | 1                          | Writeback
Instruction                | 32 KB        | 8                    | N/A               | N/A                     | N/A                        | N/A
Second Level (Shared L2)^1 | 2, 4 MB      | 8 or 16              | 64                | 14                      | 2                          | Writeback
Second Level (Shared L2)^3 | 3, 6 MB      | 12 or 24             | 64                | 15^2                    | 2                          | Writeback
Third Level^4              | 8, 12, 16 MB | 16                   | 64                | ~110                    | 12                         | Writeback

NOTES:
1. Intel Core microarchitecture (CPUID signature DisplayFamily = 06H, DisplayModel = 0FH).
2. Software-visible latency will vary depending on access patterns and other factors.
3. Enhanced Intel Core microarchitecture (CPUID signature DisplayFamily = 06H, DisplayModel = 17H or 1DH).
4. Enhanced Intel Core microarchitecture (CPUID signature DisplayFamily = 06H, DisplayModel = 1DH).


2.4.5.1  Loads

When an instruction reads data from a memory location that has write-back (WB) type, the processor
looks for the cache line that contains this data in the caches and memory in the following order:
1. DCU of the initiating core.
2. DCU of the other core and second-level cache.
3. System memory.
The cache line is taken from the DCU of the other core only if it is modified, ignoring the cache line availability or state in the L2 cache.
Table 2-28 shows the characteristics of fetching the first four bytes of different localities from the memory cluster. The latency column provides an estimate of access latency. However, the actual latency can vary depending on the load on the cache and memory components, and on their parameters.

Table 2-28. Characteristics of Load and Store Operations in Intel Core Microarchitecture

Data Locality                           | Load Latency                 | Load Throughput               | Store Latency                | Store Throughput
DCU                                     | 3                            | 1                             | 2                            | 1
DCU of the other core in modified state | 14 + 5.5 bus cycles          | 14 + 5.5 bus cycles           | 14 + 5.5 bus cycles          |
2nd-level cache                         | 14                           | 3                             | 14                           | 3
Memory                                  | 14 + 5.5 bus cycles + memory | Depends on bus read protocol  | 14 + 5.5 bus cycles + memory | Depends on bus write protocol

Sometimes a modified cache line has to be evicted to make space for a new cache line. The modified cache line is evicted in parallel with bringing in the new data and does not require additional latency. However, when data is written back to memory, the eviction uses cache bandwidth and possibly bus bandwidth as well. Therefore, when multiple cache misses require the eviction of modified lines within a short time, there is an overall degradation in cache response time.

2.4.5.2  Stores

When an instruction writes data to a memory location that has WB memory type, the processor first
ensures that the line is in Exclusive or Modified state in its own DCU. The processor looks for the cache
line in the following locations, in the specified order:
1. DCU of initiating core.
2. DCU of the other core and L2 cache.
3. System memory.
The cache line is taken from the DCU of the other core only if it is modified, ignoring the cache line availability or state in the L2 cache. After the read for ownership is completed, the data is written to the first-level data cache and the line is marked as modified.
Reading for ownership and storing the data happens after instruction retirement and follows the order of retirement. Therefore, the store latency does not affect the store instruction itself. However, several sequential stores may have cumulative latency that can affect performance. Table 2-28 presents store latencies depending on the initial cache line location.


2.5  INTEL® MICROARCHITECTURE CODE NAME NEHALEM

Intel microarchitecture code name Nehalem provides the foundation for many innovative features of
Intel Core i7 processors and Intel Xeon processor 3400, 5500, and 7500 series. It builds on the success
of 45 nm enhanced Intel Core microarchitecture and provides the following feature enhancements:

•   Enhanced processor core:
    — Improved branch prediction and recovery from misprediction.
    — Enhanced loop streaming to improve front end performance and reduce power consumption.
    — Deeper buffering in out-of-order engine to extract parallelism.
    — Enhanced execution units to provide acceleration in CRC, string/text processing and data shuffling.
•   Hyper-Threading Technology:
    — Provides two hardware threads (logical processors) per core.
    — Takes advantage of 4-wide execution engine, large L3, and massive memory bandwidth.
•   Smart Memory Access:
    — Integrated memory controller provides low-latency access to system memory and scalable memory bandwidth.
    — New cache hierarchy organization with shared, inclusive L3 to reduce snoop traffic.
    — Two level TLBs and increased TLB size.
    — Fast unaligned memory access.
•   Dedicated power management innovations:
    — Integrated microcontroller with optimized embedded firmware to manage power consumption.
    — Embedded real-time sensors for temperature, current, and power.
    — Integrated power gate to turn off/on per-core power consumption.
    — Versatility to reduce power consumption of memory, link subsystems.

Intel microarchitecture code name Westmere is a 32 nm version of Intel microarchitecture code name Nehalem. All of the features of the latter also apply to the former.

2.5.1  Microarchitecture Pipeline

Intel microarchitecture code name Nehalem continues the four-wide microarchitecture pipeline pioneered by the 65 nm Intel Core microarchitecture. Figure 2-10 illustrates the basic components of the pipeline of Intel microarchitecture code name Nehalem as implemented in the Intel Core i7 processor; only two of the four cores are sketched in the Figure 2-10 pipeline diagram.


[Figure 2-10 sketches two cores. Each core comprises: Instruction Fetch and PreDecode → Instruction Queue → Decode (assisted by the Microcode ROM) → Rename/Alloc → Scheduler feeding execution unit clusters 0, 1 and 5 plus Load and Store ports → L1D Cache and DTLB → L2 Cache, with a Retirement Unit (Re-Order Buffer). All cores share an inclusive L3 cache and Intel QPI link logic.]

Figure 2-10. Intel Microarchitecture Code Name Nehalem Pipeline Functionality
The length of the pipeline in Intel microarchitecture code name Nehalem is two cycles longer than its predecessor in the 45 nm Intel Core 2 processor family, as measured by branch misprediction delay. The front end can decode up to 4 instructions in one cycle and supports two hardware threads by decoding the instruction streams of the two logical processors in alternate cycles. The front end includes enhancements in branch handling, loop detection, MSROM throughput, etc. These are discussed in subsequent sections.
The scheduler (or reservation station) can dispatch up to six micro-ops in one cycle through six issue ports (five issue ports are shown in Figure 2-10; store operations involve separate ports for store address and store data but are depicted as one in the diagram).
The out-of-order engine has many execution units that are arranged in three execution clusters shown in Figure 2-10. It can retire four micro-ops in one cycle, the same as its predecessor.


2.5.2  Front End Overview

Figure 2-11 depicts the key components of the front end of the microarchitecture. The instruction fetch
unit (IFU) can fetch up to 16 bytes of aligned instruction bytes each cycle from the instruction cache to
the instruction length decoder (ILD). The instruction queue (IQ) buffers the ILD-processed instructions
and can deliver up to four instructions in one cycle to the instruction decoder.

[Figure 2-11 shows the front end components: the ICache feeds the instruction fetch unit (IFU), which delivers bytes to the instruction length decoder (ILD); the instruction queue (IQ) feeds the instruction decoder, assisted by the MSROM (4 micro-ops per cycle) and the branch prediction unit; decoded micro-ops enter the instruction decoder queue (IDQ, up to 4 micro-ops per cycle), which contains the loop stream detector (LSD).]

Figure 2-11. Front End of Intel Microarchitecture Code Name Nehalem
The instruction decoder has three decoder units that can decode one simple instruction per cycle per unit. The other decoder unit can decode one instruction every cycle, either a simple instruction or a complex instruction made up of several micro-ops. Instructions made up of more than four micro-ops are delivered from the MSROM. Up to four micro-ops can be delivered each cycle to the instruction decoder queue (IDQ).
The loop stream detector is located inside the IDQ to improve power consumption and front end efficiency for loops with a short sequence of instructions.
The instruction decoder supports micro-fusion to improve front end throughput and increase the effective size of queues in the scheduler and re-order buffer (ROB). The rules for micro-fusion are similar to those of Intel Core microarchitecture.
The instruction queue also supports macro-fusion to combine adjacent instructions into one micro-op where possible. In previous generations of Intel Core microarchitecture, macro-fusion support for the CMP/Jcc sequence is limited to the CF and ZF flags, and macro-fusion is not supported in 64-bit mode.
In Intel microarchitecture code name Nehalem, macro-fusion is supported in 64-bit mode, and the following instruction sequences are supported:

•   CMP or TEST can be fused when comparing (unchanged):
    REG-REG. For example: CMP EAX,ECX; JZ label
    REG-IMM. For example: CMP EAX,0x80; JZ label
    REG-MEM. For example: CMP EAX,[ECX]; JZ label
    MEM-REG. For example: CMP [EAX],ECX; JZ label
•   TEST can be fused with all conditional jumps (unchanged).
•   CMP can be fused with the following conditional jumps. These conditional jumps check the carry flag (CF) or zero flag (ZF). The macro-fusion-capable conditional jumps are (unchanged):
    JA or JNBE
    JAE or JNB or JNC
    JE or JZ
    JNA or JBE
    JNAE or JC or JB
    JNE or JNZ
•   CMP can be fused with the following conditional jumps in Intel microarchitecture code name Nehalem (this is an enhancement):
    JL or JNGE
    JGE or JNL
    JLE or JNG
    JG or JNLE

The hardware improves branch handling in several ways. The branch target buffer has been increased in size to improve the accuracy of branch predictions. Renaming is supported with the return stack buffer to reduce mispredictions of return instructions in the code. Furthermore, hardware enhancements improve the handling of branch misprediction by expediting resource reclamation, so that the front end does not wait to decode instructions in the architected code path (the code path in which instructions will reach retirement) while resources are still allocated to the mispredicted code path. Instead, a new micro-op stream can start forward progress as soon as the front end decodes the instructions in the architected code path.

2.5.3  Execution Engine

The IDQ (Figure 2-11) delivers the micro-op stream to the allocation/renaming stage (Figure 2-10) of the pipeline. The out-of-order engine supports up to 128 micro-ops in flight. Each micro-op must be allocated the following resources: an entry in the re-order buffer (ROB), an entry in the reservation station (RS), and a load/store buffer if a memory access is required.
The allocator also renames the register file entry of each micro-op in flight. The input data associated with a micro-op are generally either read from the ROB or from the retired register file.
The RS is expanded to 36 entries (compared to 32 entries in the previous generation). It can dispatch up to six micro-ops in one cycle if the micro-ops are ready to execute. The RS dispatches a micro-op through an issue port to a specific execution cluster; each cluster may contain a collection of integer/FP/SIMD execution units.
The result from the execution unit executing a micro-op is written back to the register file, or forwarded through a bypass network to a micro-op in flight that needs the result. Intel microarchitecture code name Nehalem can support a write-back throughput of one register file write per cycle per port. The bypass network consists of three domains: integer, FP and SIMD. Forwarding the result within the same bypass domain from a producer micro-op to a consumer micro-op is done efficiently in hardware without delay. Forwarding the result across different bypass domains may be subject to additional bypass delays. The bypass delays may be visible to software in addition to the latency and throughput characteristics of individual execution units. The bypass delays between a producer micro-op and a consumer micro-op across different bypass domains are shown in Table 2-29.


Table 2-29. Bypass Delay Between Producer and Consumer Micro-ops (cycles)

From \ To | FP | Integer | SIMD
FP        | 0  | 2       | 2
Integer   | 2  | 0       | 1
SIMD      | 2  | 1       | 0

2.5.3.1  Issue Ports and Execution Units

Table 2-30 summarizes the key characteristics of the issue ports and the execution unit latency/throughput for common operations in the microarchitecture.

Table 2-30. Issue Ports of Intel Microarchitecture Code Name Nehalem

Port   | Executable operations        | Latency | Throughput | Domain
Port 0 | Integer ALU                  | 1       | 1          | Integer
       | Integer Shift                | 1       | 1          |
Port 0 | Integer SIMD ALU             | 1       | 1          | SIMD
       | Integer SIMD Shuffle         | 1       | 1          |
Port 0 | Single-precision (SP) FP MUL | 4       | 1          | FP
       | Double-precision FP MUL      | 5       | 1          |
       | FP MUL (X87)                 | 5       | 1          |
       | FP/SIMD/SSE2 Move and Logic  | 1       | 1          |
       | FP Shuffle                   | 1       | 1          |
       | DIV/SQRT                     |         |            |
Port 1 | Integer ALU                  | 1       | 1          | Integer
       | Integer LEA                  | 1       | 1          |
       | Integer Mul                  | 3       | 1          |
Port 1 | Integer SIMD MUL             | 1       | 1          | SIMD
       | Integer SIMD Shift           | 1       | 1          |
       | PSAD                         | 3       | 1          |
       | StringCompare                |         |            |
Port 1 | FP ADD                       | 3       | 1          | FP
Port 2 | Integer loads                | 4       | 1          | Integer
Port 3 | Store address                | 5       | 1          | Integer
Port 4 | Store data                   |         |            | Integer
Port 5 | Integer ALU                  | 1       | 1          | Integer
       | Integer Shift                | 1       | 1          |
       | Jmp                          | 1       | 1          |
Port 5 | Integer SIMD ALU             | 1       | 1          | SIMD
       | Integer SIMD Shuffle         | 1       | 1          |
Port 5 | FP/SIMD/SSE2 Move and Logic  | 1       | 1          | FP

2.5.4  Cache and Memory Subsystem

Intel microarchitecture code name Nehalem contains an instruction cache, a first-level data cache and a second-level unified cache in each core (see Figure 2-10). Each physical processor may contain several processor cores and a shared collection of sub-systems that are referred to as the “uncore”. Specifically, in the Intel Core i7 processor, the uncore provides a unified third-level cache shared by all cores in the physical processor, Intel QuickPath Interconnect links and associated logic. The L1 and L2 caches are writeback and non-inclusive.
The shared L3 cache is writeback and inclusive, such that a cache line that exists in the L1 data cache, the L1 instruction cache, or the unified L2 cache also exists in the L3. The L3 is designed to use this inclusive nature to minimize snoop traffic between processor cores. Table 2-31 lists characteristics of the cache hierarchy. The latency of L3 access may vary as a function of the frequency ratio between the processor and the uncore sub-system.

Table 2-31. Cache Parameters of Intel Core i7 Processors

Level                     | Capacity | Associativity (ways) | Line Size (bytes) | Access Latency (clocks) | Access Throughput (clocks) | Write Update Policy
First Level Data          | 32 KB    | 8                    | 64                | 4                       | 1                          | Writeback
Instruction               | 32 KB    | 4                    | N/A               | N/A                     | N/A                        | N/A
Second Level              | 256 KB   | 8                    | 64                | 10^1                    | Varies                     | Writeback
Third Level (Shared L3)^2 | 8 MB     | 16                   | 64                | 35-40+^2                | Varies                     | Writeback

NOTES:
1. Software-visible latency will vary depending on access patterns and other factors.
2. Minimal L3 latency is 35 cycles if the frequency ratio between core and uncore is unity.
Intel microarchitecture code name Nehalem implements two levels of translation lookaside buffer (TLB). The first level consists of separate TLBs for data and code. The DTLB0 handles address translation for data accesses; it provides 64 entries to support 4-KByte pages and 32 entries for large pages. The ITLB provides 64 entries (per thread) for 4-KByte pages and 7 entries (per thread) for large pages.
The second level TLB (STLB) handles both code and data accesses for 4-KByte pages. It supports 4-KByte page translation operations that missed the DTLB0 or ITLB. All entries are 4-way associative. Here is a list of entries in each TLB:

•   STLB for 4-KByte pages: 512 entries (services both data and instruction look-ups).
•   DTLB0 for large pages: 32 entries.
•   DTLB0 for 4-KByte pages: 64 entries.

A DTLB0 miss and STLB hit causes a penalty of 7 cycles. Software only pays this penalty if the DTLB0 is used in some dispatch cases. The delays associated with a miss to the STLB and PMH are largely non-blocking.


2.5.5  Load and Store Operation Enhancements

The memory cluster of Intel microarchitecture code name Nehalem provides the following enhancements to speed up memory operations:

•   Peak issue rate of one 128-bit load and one 128-bit store operation per cycle.
•   Deeper buffers for load and store operations: 48 load buffers, 32 store buffers and 10 fill buffers.
•   Fast unaligned memory access and robust handling of memory alignment hazards.
•   Improved store-forwarding for aligned and non-aligned scenarios.
•   Store forwarding for most address alignments.

2.5.5.1  Efficient Handling of Alignment Hazards

The cache and memory subsystems handle a significant percentage of instructions in every workload. Different address alignment scenarios will produce varying performance impacts for memory and cache operations. For example, the 1-cycle throughput of L1 (see Table 2-32) generally applies to naturally-aligned loads from the L1 cache. But using unaligned load instructions (e.g. MOVUPS, MOVUPD, MOVDQU, etc.) to access data from L1 will experience varying amounts of delay depending on the specific microarchitecture and alignment scenario.

Table 2-32. Performance Impact of Address Alignments of MOVDQU from L1

                                   Throughput (cycle)
Alignment Scenario               | Intel Core i7 Processor (06_1AH) | 45 nm Intel Core Microarchitecture (06_17H) | 65 nm Intel Core Microarchitecture (06_0FH)
16B aligned                      | 1                                | 2                                           | 2
Not-16B aligned, not cache split | 1                                | ~2                                          | ~2
Split cache line boundary        | ~4.5                             | ~20                                         | ~20

Table 2-32 lists the approximate throughput of issuing MOVDQU instructions with different address alignment scenarios to load data from the L1 cache. If a 16-byte load spans a cache line boundary, previous microarchitecture generations will experience significant software-visible delays.
Intel microarchitecture code name Nehalem provides hardware enhancements to reduce the delays of handling different address alignment scenarios, including cache line splits.
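As a minimal sketch (our own example, not from this manual), the function below issues MOVDQU through the _mm_loadu_si128 intrinsic; on the Intel Core i7 processor it sustains 1-cycle throughput when the access does not split a cache line, per Table 2-32:

#include <emmintrin.h>
#include <stddef.h>
#include <stdint.h>

/* XOR-accumulate n 16-byte blocks from a possibly unaligned buffer. */
__m128i xor_blocks(const uint8_t *p, size_t n)
{
    __m128i acc = _mm_setzero_si128();
    for (size_t i = 0; i < n; i++) {
        /* MOVDQU: no alignment fault; throughput per Table 2-32. */
        acc = _mm_xor_si128(acc,
                            _mm_loadu_si128((const __m128i *)(p + 16 * i)));
    }
    return acc;
}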

2.5.5.2  Store Forwarding Enhancement

When a load follows a store and reloads the data that the store writes to memory, the microarchitecture
can forward the data directly from the store to the load in many cases. This situation, called store to load
forwarding, saves several cycles by enabling the load to obtain the data directly from the store operation
instead of through the memory system.
Several general rules must be met for store to load forwarding to proceed without delay:

•   The store must be the last store to that address prior to the load.
•   The store must be equal or greater in size than the size of data being loaded.
•   The load data must be completely contained in the preceding store.

Specific address alignments and data sizes between the store and load operations will determine whether a store-forward situation may proceed with data forwarding or experience a delay via the cache/memory sub-system. The 45 nm Enhanced Intel Core microarchitecture offers more flexible address alignment and data size requirements than previous microarchitectures. Intel microarchitecture code name Nehalem offers additional enhancements, allowing more situations to forward data expeditiously.

The store-forwarding situations with respect to store operations of 16 bytes are illustrated in Figure 2-12.

Figure 2-12. Store-Forwarding Scenarios of 16-Byte Store Operations

Intel microarchitecture code name Nehalem allows store-to-load forwarding to proceed regardless of store address alignment (the white space in the diagram does not correspond to an applicable store-to-load scenario). Figure 2-13 illustrates situations for store operations of 8 bytes or less.


Figure 2-13. Store-Forwarding Enhancement in Intel Microarchitecture Code Name Nehalem

2.5.6  REP String Enhancement

The REP prefix in conjunction with a MOVS/STOS instruction and a count value in ECX is frequently used to implement library functions such as memcpy()/memset(). These are referred to as "REP string" instructions. Each iteration of these instructions can copy/write a value at byte/word/dword/qword granularity. The performance characteristics of using REP string can be attributed to two components: startup overhead and data transfer throughput.
The two components of the performance characteristics of REP string vary further depending on granularity, alignment, and/or count values. Generally, MOVSB is used to handle very small chunks of data. Therefore, the processor implementation of REP MOVSB is optimized to handle ECX < 4. Using REP MOVSB with ECX > 3 will achieve low data throughput due to not only byte-granular data transfer but also additional startup overhead. The latency for MOVSB is 9 cycles if ECX < 4; otherwise REP MOVSB with ECX > 9 has a 50-cycle startup cost.
For REP string of larger granularity data transfer, as the ECX value increases, the startup overhead of REP string exhibits a step-wise increase:

•   Short string (ECX <= 12): the latency of REP MOVSW/MOVSD/MOVSQ is about 20 cycles.
•   Fast string (ECX >= 76; excluding REP MOVSB): the processor implementation provides hardware optimization by moving as many pieces of data in 16 bytes as possible. The latency of REP string will vary if one of the 16-byte data transfers spans across a cache line boundary:
    — Split-free: the latency consists of a startup cost of about 40 cycles and each 64 bytes of data adds 4 cycles.
    — Cache splits: the latency consists of a startup cost of about 35 cycles and each 64 bytes of data adds 6 cycles.
•   Intermediate string lengths: the latency of REP MOVSW/MOVSD/MOVSQ has a startup cost of about 15 cycles plus one cycle for each iteration of the data movement in word/dword/qword.

Intel microarchitecture code name Nehalem improves the performance of REP strings significantly over previous microarchitectures in several ways:

•   Startup overhead has been reduced in most cases relative to previous microarchitectures.
•   Data transfer throughput is improved over the previous generation.


•   In order for a REP string to operate in “fast string” mode, previous microarchitectures required address alignment. In Intel microarchitecture code name Nehalem, REP string can operate in “fast string” mode even if the address is not aligned to 16 bytes.
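As a minimal sketch (GCC/Clang inline assembly assumed; the helper name is ours), the routine below emits REP MOVSB directly, which the enhancement above lets run in “fast string” mode even for buffers not aligned to 16 bytes:

#include <stddef.h>

static void rep_movsb_copy(void *dst, const void *src, size_t n)
{
    /* REP MOVSB: RDI = destination, RSI = source, RCX = byte count. */
    __asm__ volatile("rep movsb"
                     : "+D"(dst), "+S"(src), "+c"(n)
                     :
                     : "memory");
}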

2.5.7  Enhancements for System Software

In addition to microarchitectural enhancements that can benefit both application-level and system-level
software, Intel microarchitecture code name Nehalem enhances several operations that primarily benefit
system software.
Lock primitives: Synchronization primitives using the LOCK prefix (e.g. XCHG, CMPXCHG8B) execute with significantly reduced latency relative to previous microarchitectures.
VMM overhead improvements: VMX transitions between a Virtual Machine (VM) and its supervisor (the VMM) can take thousands of cycles each time on previous microarchitectures. The latency of VMX transitions has been reduced in processors based on Intel microarchitecture code name Nehalem.

2.5.8  Efficiency Enhancements for Power Consumption

Intel microarchitecture code name Nehalem is not only designed for high performance and power-efficient performance under a wide range of loading situations; it also features enhancements for low power consumption while the system idles. Intel microarchitecture code name Nehalem supports processor-specific C6 states, which have the lowest leakage power consumption and which the OS can manage through ACPI and OS power management mechanisms.

2.5.9  Hyper-Threading Technology Support in Intel® Microarchitecture Code Name Nehalem

Intel microarchitecture code name Nehalem supports Hyper-Threading Technology (HT). Its implementation of HT provides two logical processors sharing most execution/cache resources in each core. The HT implementation in Intel microarchitecture code name Nehalem differs from previous generations of HT implementations using Intel NetBurst microarchitecture in several areas:

•   Intel microarchitecture code name Nehalem provides a four-wide execution engine, with more functional execution units coupled to three issue ports capable of issuing computational operations.
•   Intel microarchitecture code name Nehalem supports an integrated memory controller that can provide peak memory bandwidth of up to 25.6 GB/sec in the Intel Core i7 processor.
•   Deeper buffering and enhanced resource sharing/partition policies:
    — Replicated resources for HT operation: register state, renamed return stack buffer, large-page ITLB.
    — Partitioned resources for HT operation: load buffers, store buffers, re-order buffers, small-page ITLB are statically allocated between two logical processors.
    — Competitively-shared resources during HT operation: the reservation station, cache hierarchy, fill buffers, both DTLB0 and STLB.
    — Alternating during HT operation: front end operation generally alternates between the two logical processors to ensure fairness.
    — HT unaware resources: execution units.

2.6  INTEL® HYPER-THREADING TECHNOLOGY

Intel® Hyper-Threading Technology (HT Technology) enables software to take advantage of task-level, or
thread-level parallelism by providing multiple logical processors within a physical processor package, or
within each processor core in a physical processor package. In its first implementation in the Intel Xeon

processor, Hyper-Threading Technology makes a single physical processor (or a processor core) appear as two or more logical processors. Intel Xeon Phi processors based on the Knights Landing microarchitecture support 4 logical processors in each processor core; see Chapter 16 for detailed information on Hyper-Threading Technology as implemented in the Knights Landing microarchitecture.
Most Intel Architecture processor families support Hyper-Threading Technology with two logical processors in each processor core, or in a physical processor in early implementations. The rest of this section
describes features of the early implementation of Hyper-Threading Technology. Most of the descriptions
also apply to later Hyper-Threading Technology implementations supporting two logical processors. The
microarchitecture sections in this chapter provide additional details to individual microarchitecture and
enhancements to Hyper-Threading Technology.
The two logical processors each have a complete set of architectural registers while sharing one single
physical processor's resources. By maintaining the architecture state of two processors, an HT Technology capable processor looks like two processors to software, including operating system and application code.
By sharing resources needed for peak demands between two logical processors, HT Technology is well
suited for multiprocessor systems to provide an additional performance boost in throughput when
compared to traditional MP systems.
Figure 2-14 shows a typical bus-based symmetric multiprocessor (SMP) based on processors supporting
HT Technology. Each logical processor can execute a software thread, allowing a maximum of two software threads to execute simultaneously on one physical processor. The two software threads execute
simultaneously, meaning that in the same clock cycle an “add” operation from logical processor 0 and
another “add” operation and load from logical processor 1 can be executed simultaneously by the execution engine.
In the first implementation of HT Technology, the physical execution resources are shared and the architecture state is duplicated for each logical processor. This minimizes the die area cost of implementing HT
Technology while still achieving performance gains for multithreaded applications or multitasking workloads.

[Figure 2-14 shows two physical processors on a system bus; each contains two Architectural State blocks and two Local APICs on top of a shared Execution Engine and a shared Bus Interface.]

Figure 2-14. Hyper-Threading Technology on an SMP


The performance potential due to HT Technology is due to:

•   The fact that operating systems and user programs can schedule processes or threads to execute simultaneously on the logical processors in each physical processor.
•   The ability to use on-chip execution resources at a higher level than when only a single thread is consuming the execution resources; a higher level of resource utilization can lead to higher system throughput.

2.6.1  Processor Resources and HT Technology

The majority of microarchitecture resources in a physical processor are shared between the logical
processors. Only a few small data structures were replicated for each logical processor. This section
describes how resources are shared, partitioned or replicated.

2.6.1.1  Replicated Resources

The architectural state is replicated for each logical processor. The architecture state consists of registers
that are used by the operating system and application code to control program behavior and store data
for computations. This state includes the eight general-purpose registers, the control registers, machine
state registers, debug registers, and others. There are a few exceptions, most notably the memory type
range registers (MTRRs) and the performance monitoring resources. For a complete list of the architecture state and exceptions, see the Intel® 64 and IA-32 Architectures Software Developer’s Manual,
Volumes 3A, 3B & 3C.
Other resources such as instruction pointers and register renaming tables were replicated to simultaneously track execution and state changes of the two logical processors. The return stack predictor is replicated to improve branch prediction of return instructions.
In addition, a few buffers (for example, the 2-entry instruction streaming buffers) were replicated to
reduce complexity.

2.6.1.2  Partitioned Resources

Several buffers are shared by limiting the use of each logical processor to half the entries. These are referred to as partitioned resources. Reasons for this partitioning include:

•   Operational fairness.
•   Permitting the ability to allow operations from one logical processor to bypass operations of the other logical processor that may have stalled.

For example: a cache miss, a branch misprediction, or instruction dependencies may prevent a logical
processor from making forward progress for some number of cycles. The partitioning prevents the stalled
logical processor from blocking forward progress.
In general, the buffers for staging instructions between major pipe stages are partitioned. These buffers
include µop queues after the execution trace cache, the queues after the register rename stage, the
reorder buffer which stages instructions for retirement, and the load and store buffers.
In the case of load and store buffers, partitioning also provided an easier implementation to maintain
memory ordering for each logical processor and detect memory ordering violations.


2.6.1.3  Shared Resources

Most resources in a physical processor are fully shared to improve the dynamic utilization of the
resource, including caches and all the execution units. Some shared resources which are linearly
addressed, like the DTLB, include a logical processor ID bit to distinguish whether the entry belongs to
one logical processor or the other.
The first level cache can operate in two modes depending on a context-ID bit:

•   Shared mode: The L1 data cache is fully shared by two logical processors.
•   Adaptive mode: In adaptive mode, memory accesses using the page directory are mapped identically across logical processors sharing the L1 data cache.

The other resources are fully shared.

2.6.2  Microarchitecture Pipeline and HT Technology

This section describes the HT Technology microarchitecture and how instructions from the two logical
processors are handled between the front end and the back end of the pipeline.
Although instructions originating from two programs or two threads execute simultaneously and not
necessarily in program order in the execution core and memory hierarchy, the front end and back end
contain several selection points to select between instructions from the two logical processors. All selection points alternate between the two logical processors unless one logical processor cannot make use of
a pipeline stage. In this case, the other logical processor has full use of every cycle of the pipeline stage.
Reasons why a logical processor may not use a pipeline stage include cache misses, branch mispredictions, and instruction dependencies.

2.6.3  Front End Pipeline

The execution trace cache is shared between two logical processors. Execution trace cache access is arbitrated by the two logical processors every clock. If a cache line is fetched for one logical processor in one
clock cycle, the next clock cycle a line would be fetched for the other logical processor provided that both
logical processors are requesting access to the trace cache.
If one logical processor is stalled or is unable to use the execution trace cache, the other logical processor
can use the full bandwidth of the trace cache until the initial logical processor’s instruction fetches return
from the L2 cache.
After fetching the instructions and building traces of µops, the µops are placed in a queue. This queue
decouples the execution trace cache from the register rename pipeline stage. As described earlier, if both
logical processors are active, the queue is partitioned so that both logical processors can make independent forward progress.

2.6.4  Execution Core

The core can dispatch up to six µops per cycle, provided the µops are ready to execute. Once the µops
are placed in the queues waiting for execution, there is no distinction between instructions from the two
logical processors. The execution core and memory hierarchy is also oblivious to which instructions
belong to which logical processor.
After execution, instructions are placed in the re-order buffer. The re-order buffer decouples the execution stage from the retirement stage. The re-order buffer is partitioned such that each logical processor uses half the entries.


2.6.5  Retirement

The retirement logic tracks when instructions from the two logical processors are ready to be retired. It
retires the instruction in program order for each logical processor by alternating between the two logical
processors. If one logical processor is not ready to retire any instructions, then all retirement bandwidth
is dedicated to the other logical processor.
Once stores have retired, the processor needs to write the store data into the level-one data cache.
Selection logic alternates between the two logical processors to commit store data to the cache.

2.7  INTEL® 64 ARCHITECTURE

Intel 64 architecture supports almost all features in the IA-32 Intel architecture and extends support to
run 64-bit OS and 64-bit applications in 64-bit linear address space. Intel 64 architecture provides a new
operating mode, referred to as IA-32e mode, and increases the linear address space for software to 64
bits and supports physical address space up to 40 bits.
IA-32e mode consists of two sub-modes: (1) compatibility mode enables a 64-bit operating system to
run most legacy 32-bit software unmodified, (2) 64-bit mode enables a 64-bit operating system to run
applications written to access 64-bit linear address space.
In the 64-bit mode of Intel 64 architecture, software may access:

•   64-bit flat linear addressing.
•   8 additional general-purpose registers (GPRs).
•   8 additional registers (XMM) for streaming SIMD extensions (SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, AESNI, PCLMULQDQ).
    — Sixteen 256-bit YMM registers (whose lower 128 bits are overlaid on the respective XMM registers) if AVX, F16C, AVX2 or FMA are supported.
•   64-bit-wide GPRs and instruction pointers.
•   Uniform byte-register addressing.
•   Fast interrupt-prioritization mechanism.
•   A new instruction-pointer relative-addressing mode.

2.8  SIMD TECHNOLOGY

SIMD computations (see Figure 2-15) were introduced to the architecture with MMX technology. MMX
technology allows SIMD computations to be performed on packed byte, word, and doubleword integers.
The integers are contained in a set of eight 64-bit registers called MMX registers (see Figure 2-16).
The Pentium III processor extended the SIMD computation model with the introduction of the Streaming
SIMD Extensions (SSE). SSE allows SIMD computations to be performed on operands that contain four
packed single-precision floating-point data elements. The operands can be in memory or in a set of eight
128-bit XMM registers (see Figure 2-16). SSE also extended SIMD computational capability by adding
additional 64-bit MMX instructions.
Figure 2-15 shows a typical SIMD computation. Two sets of four packed data elements (X1, X2, X3, and X4, and Y1, Y2, Y3, and Y4) are operated on in parallel, with the same operation being performed on each corresponding pair of data elements (X1 and Y1, X2 and Y2, X3 and Y3, and X4 and Y4). The results of the four parallel computations are stored as a set of four packed data elements.


[Figure 2-15 depicts four parallel OP lanes: X4/Y4, X3/Y3, X2/Y2 and X1/Y1 each feed an OP, producing X4 op Y4, X3 op Y3, X2 op Y2 and X1 op Y1.]

Figure 2-15. Typical SIMD Operations
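As a minimal C sketch of the operation in Figure 2-15 (our own illustration using SSE intrinsics; function and variable names are assumptions):

#include <xmmintrin.h>

/* One packed-single add performs the four lane-wise operations of
   Figure 2-15 in a single instruction. */
void simd_add4(const float *x, const float *y, float *result)
{
    __m128 vx = _mm_loadu_ps(x);               /* X1..X4 */
    __m128 vy = _mm_loadu_ps(y);               /* Y1..Y4 */
    _mm_storeu_ps(result, _mm_add_ps(vx, vy)); /* Xi op Yi per lane */
}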
The Pentium 4 processor further extended the SIMD computation model with the introduction of Streaming SIMD Extensions 2 (SSE2) and Streaming SIMD Extensions 3 (SSE3); the Intel Xeon processor 5100 series introduced Supplemental Streaming SIMD Extensions 3 (SSSE3).
SSE2 works with operands in either memory or the XMM registers. The technology extends SIMD computations to process packed double-precision floating-point data elements and 128-bit packed integers. There are 144 instructions in SSE2 that operate on two packed double-precision floating-point data elements or on 16 packed byte, 8 packed word, 4 doubleword, and 2 quadword integers.
SSE3 enhances x87, SSE and SSE2 by providing 13 instructions that can accelerate application performance in specific areas. These include video processing, complex arithmetic, and thread synchronization. SSE3 complements SSE and SSE2 with instructions that process SIMD data asymmetrically, facilitate horizontal computation, and help avoid loading cache line splits. See Figure 2-16.
SSSE3 provides additional enhancements for SIMD computation with 32 instructions for digital video and signal processing.
SSE4.1, SSE4.2 and AESNI are additional SIMD extensions that provide acceleration for applications in media processing, text/lexical processing, and block encryption/decryption.
The SIMD extensions operate the same way in Intel 64 architecture as in IA-32 architecture, with the following enhancements:

•   128-bit SIMD instructions referencing an XMM register can access 16 XMM registers in 64-bit mode.
•   Instructions that reference 32-bit general purpose registers can access 16 general purpose registers in 64-bit mode.


[Figure 2-16 shows the eight 64-bit MMX registers (MM0-MM7) alongside the eight 128-bit XMM registers (XMM0-XMM7).]

Figure 2-16. SIMD Instruction Register Usage
SIMD improves the performance of 3D graphics, speech recognition, image processing, scientific applications and applications that have the following characteristics:

•   Inherently parallel.
•   Recurring memory access patterns.
•   Localized recurring operations performed on the data.
•   Data-independent control flow.

2.9  SUMMARY OF SIMD TECHNOLOGIES AND APPLICATION LEVEL EXTENSIONS

SIMD floating-point instructions fully support the IEEE Standard 754 for Binary Floating-Point Arithmetic.
They are accessible from all IA-32 execution modes: protected mode, real address mode, and Virtual
8086 mode.
SSE, SSE2, and MMX technologies are architectural extensions. Existing software will continue to run
correctly, without modification on Intel microprocessors that incorporate these technologies. Existing
software will also run correctly in the presence of applications that incorporate SIMD technologies.
SSE and SSE2 instructions also introduced cacheability and memory ordering instructions that can
improve cache usage and application performance.
For more on SSE, SSE2, SSE3 and MMX technologies, see the following chapters in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1:

•   Chapter 9, “Programming with Intel® MMX™ Technology”.
•   Chapter 10, “Programming with Streaming SIMD Extensions (SSE)”.
•   Chapter 11, “Programming with Streaming SIMD Extensions 2 (SSE2)”.
•   Chapter 12, “Programming with SSE3, SSSE3 and SSE4”.
•   Chapter 14, “Programming with AVX, FMA and AVX2”.
•   Chapter 15, “Programming with Intel® Transactional Synchronization Extensions”.


2.9.1  MMX™ Technology

MMX Technology introduced:

•   64-bit MMX registers.
•   Support for SIMD operations on packed byte, word, and doubleword integers.

MMX instructions are useful for multimedia and communications software.

2.9.2  Streaming SIMD Extensions

Streaming SIMD extensions introduced:

•   128-bit XMM registers.
•   128-bit data type with four packed single-precision floating-point operands.
•   Data prefetch instructions.
•   Non-temporal store instructions and other cacheability and memory ordering instructions.
•   Extra 64-bit SIMD integer support.

SSE instructions are useful for 3D geometry, 3D rendering, speech recognition, and video encoding and
decoding.

2.9.3  Streaming SIMD Extensions 2

Streaming SIMD extensions 2 add the following:

•   128-bit data type with two packed double-precision floating-point operands.
•   Support for SIMD arithmetic on 64-bit integer operands.
•   128-bit data types for SIMD integer operations on 16-byte, 8-word, 4-doubleword, or 2-quadword integers.
•   Instructions for converting between new and existing data types.
•   Extended support for data shuffling.
•   Extended support for cacheability and memory ordering operations.

SSE2 instructions are useful for 3D graphics, video decoding/encoding, and encryption.
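For instance, a minimal sketch (ours, not from this manual) of a packed double-precision add using the SSE2 intrinsics:

#include <emmintrin.h>

/* ADDPD: two double-precision adds in one instruction. */
void add2(const double *a, const double *b, double *c)
{
    _mm_storeu_pd(c, _mm_add_pd(_mm_loadu_pd(a), _mm_loadu_pd(b)));
}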

2.9.4  Streaming SIMD Extensions 3

Streaming SIMD extensions 3 add the following:

•   SIMD floating-point instructions for asymmetric and horizontal computation.
•   A special-purpose 128-bit load instruction to avoid cache line splits.
•   An x87 FPU instruction to convert to integer independent of the floating-point control word (FCW).
•   Instructions to support thread synchronization.

SSE3 instructions are useful for scientific, video and multi-threaded applications.
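As an illustrative sketch (ours) of the horizontal computation, HADDPS sums adjacent lanes, so two applications reduce a vector to a scalar sum:

#include <pmmintrin.h>

/* Horizontal sum of four packed floats using SSE3 HADDPS. */
float hsum4(__m128 v)
{
    v = _mm_hadd_ps(v, v);   /* (a+b, c+d, a+b, c+d) */
    v = _mm_hadd_ps(v, v);   /* total in every lane   */
    return _mm_cvtss_f32(v);
}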

2.9.5  Supplemental Streaming SIMD Extensions 3

The Supplemental Streaming SIMD Extensions 3 introduces 32 new instructions to accelerate eight types of computations on packed integers. These include:

•   12 instructions that perform horizontal addition or subtraction operations.
•   6 instructions that evaluate absolute values.
•   2 instructions that perform multiply and add operations and speed up the evaluation of dot products.
•   2 instructions that accelerate packed-integer multiply operations and produce integer values with scaling.
•   2 instructions that perform a byte-wise, in-place shuffle according to the second shuffle control operand.
•   6 instructions that negate packed integers in the destination operand if the sign of the corresponding element in the source operand is less than zero.
•   2 instructions that align data from the composite of two operands.
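A minimal sketch (ours) of the byte-wise in-place shuffle, PSHUFB, here reversing the 16 bytes of a register with an illustrative control mask:

#include <tmmintrin.h>

/* PSHUFB: each control byte selects a source byte; 15..0 reverses. */
__m128i reverse_bytes(__m128i src)
{
    const __m128i ctl = _mm_setr_epi8(15, 14, 13, 12, 11, 10, 9, 8,
                                       7,  6,  5,  4,  3,  2, 1, 0);
    return _mm_shuffle_epi8(src, ctl);
}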

2.9.6  SSE4.1

SSE4.1 introduces 47 new instructions to accelerate video, imaging and 3D applications. SSE4.1 also improves compiler vectorization and significantly increases support for packed dword computation. These include:

•   Two instructions perform packed dword multiplies.
•   Two instructions perform floating-point dot products with input/output selects.
•   One instruction provides a streaming hint for WC loads.
•   Six instructions simplify packed blending.
•   Eight instructions expand support for packed integer MIN/MAX.
•   Four instructions support floating-point round with selectable rounding mode and precision exception override.
•   Seven instructions improve data insertion and extraction from XMM registers.
•   Twelve instructions improve packed integer format conversions (sign and zero extensions).
•   One instruction improves SAD (sum absolute difference) generation for small block sizes.
•   One instruction aids horizontal searching operations of word integers.
•   One instruction improves masked comparisons.
•   One instruction adds qword packed equality comparisons.
•   One instruction adds dword packing with unsigned saturation.
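For example, a minimal sketch (ours) of the dot product with input/output selects: the immediate 0xF1 multiplies all four lanes and writes the sum to the low element only:

#include <smmintrin.h>

/* DPPS with imm8 = 0xF1: sum of all four products placed in lane 0. */
float dot4(__m128 a, __m128 b)
{
    return _mm_cvtss_f32(_mm_dp_ps(a, b, 0xF1));
}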

2.9.7  SSE4.2

SSE4.2 introduces 7 new instructions. These include:

•   A 128-bit SIMD integer instruction for comparing 64-bit integer data elements.
•   Four string/text processing instructions providing a rich set of primitives; these primitives can accelerate:
    — Basic and advanced string library functions, from strlen and strcmp to strcspn.
    — Delimiter processing, token extraction for lexing of text streams.
    — Parsers, schema validation including XML processing.
•   A general-purpose instruction for accelerating cyclic redundancy checksum signature calculations.
•   A general-purpose instruction for calculating the bit count population of integer numbers.
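A minimal sketch (ours) of the last two, accumulating a CRC32-C checksum and counting set bits with the SSE4.2 intrinsics:

#include <nmmintrin.h>
#include <stddef.h>
#include <stdint.h>

/* CRC32-C over a byte buffer, one byte per CRC32 instruction. */
uint32_t crc32c(uint32_t crc, const uint8_t *p, size_t n)
{
    for (size_t i = 0; i < n; i++)
        crc = _mm_crc32_u8(crc, p[i]);
    return crc;
}

/* POPCNT: number of bits set in x. */
int bits_set(uint32_t x)
{
    return _mm_popcnt_u32(x);
}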

2.9.8  AESNI and PCLMULQDQ

AESNI introduces 7 new instructions; six of them are primitives for accelerating algorithms based on the AES encryption/decryption standard and are referred to as AESNI.
The PCLMULQDQ instruction accelerates general-purpose block encryption by performing carry-less multiplication of two binary numbers up to 64 bits wide.


Typically, algorithms based on the AES standard involve transformation of block data over multiple iterations via several primitives. The AES standard supports cipher keys of size 128, 192, and 256 bits. The respective cipher key sizes correspond to 10, 12, and 14 rounds of iteration.
AES encryption involves processing 128-bit input data (plaintext) through a finite number of iterative operations, referred to as “AES rounds”, into a 128-bit encrypted block (ciphertext). Decryption follows the reverse direction of iterative operation using the “equivalent inverse cipher” instead of the “inverse cipher”.
The cryptographic processing at each round involves two input data: one is the “state”, the other is the “round key”. Each round uses a different “round key”. The round keys are derived from the cipher key using a “key schedule” algorithm. The “key schedule” algorithm is independent of the data processing of encryption/decryption, and can be carried out independently from the encryption/decryption phase.
The AES extensions provide two primitives to accelerate AES rounds on encryption, two primitives for AES rounds on decryption using the equivalent inverse cipher, and two instructions to support the AES key expansion procedure.
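A minimal sketch (ours) of the encryption flow with those primitives; the expanded key schedule rk[0..10] for AES-128 is assumed to have been prepared separately (e.g., with AESKEYGENASSIST):

#include <wmmintrin.h>

/* AES-128: initial whitening, nine full rounds, one final round. */
__m128i aes128_encrypt_block(__m128i block, const __m128i rk[11])
{
    block = _mm_xor_si128(block, rk[0]);
    for (int i = 1; i < 10; i++)
        block = _mm_aesenc_si128(block, rk[i]);
    return _mm_aesenclast_si128(block, rk[10]);
}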

2.9.9  Intel® Advanced Vector Extensions

Intel® Advanced Vector Extensions offers comprehensive architectural enhancements over previous generations of Streaming SIMD Extensions. Intel AVX introduces the following architectural enhancements:

•   Support for 256-bit wide vectors and SIMD register set.
•   256-bit floating-point instruction set enhancement with up to 2X performance gain relative to 128-bit Streaming SIMD extensions.
•   Instruction syntax support for generalized three-operand syntax to improve instruction programming flexibility and efficient encoding of new instruction extensions.
•   Enhancement of legacy 128-bit SIMD instruction extensions to support three-operand syntax and to simplify compiler vectorization of high-level language expressions.
•   Support for flexible deployment of 256-bit AVX code, 128-bit AVX code, legacy 128-bit code and scalar code.

The Intel AVX instruction set and 256-bit register state management details are described in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volumes 2A, 2B and 3A. Optimization techniques for Intel AVX are discussed in Chapter 11, “Optimization for Intel AVX, FMA, and AVX2”.
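A minimal sketch (ours) of 256-bit AVX with the non-destructive three-operand syntax, where the add writes a third register and leaves both sources intact:

#include <immintrin.h>

/* VADDPS ymm: eight single-precision adds; sources are preserved. */
void add8(const float *a, const float *b, float *c)
{
    __m256 va = _mm256_loadu_ps(a);
    __m256 vb = _mm256_loadu_ps(b);
    _mm256_storeu_ps(c, _mm256_add_ps(va, vb));
}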

2.9.10  Half-Precision Floating-Point Conversion (F16C)

VCVTPH2PS and VCVTPS2PH are two instructions supporting half-precision floating-point data type conversion to and from single-precision floating-point data types. These two instructions extend the same programming model as Intel AVX.
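A minimal sketch (ours): four half-precision values widened to single precision and narrowed back with round-to-nearest:

#include <immintrin.h>
#include <stdint.h>

/* VCVTPH2PS / VCVTPS2PH on four half-precision values. */
void half_roundtrip(const uint16_t in[4], uint16_t out[4])
{
    __m128  f = _mm_cvtph_ps(_mm_loadl_epi64((const __m128i *)in));
    __m128i h = _mm_cvtps_ph(f, _MM_FROUND_TO_NEAREST_INT |
                                _MM_FROUND_NO_EXC);
    _mm_storel_epi64((__m128i *)out, h);
}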

2.9.11  RDRAND

The RDRAND instruction retrieves a random number supplied by a cryptographically secure, deterministic random bit generator (DRBG). The DRBG is designed to meet the NIST SP 800-90A standard.
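A minimal usage sketch (ours): RDRAND can transiently fail to return a value, so the intrinsic's return value is checked and the request retried a bounded number of times:

#include <immintrin.h>

/* Returns 1 and writes *out on success, 0 if the DRBG is not ready. */
int rdrand32_retry(unsigned int *out)
{
    for (int tries = 0; tries < 10; tries++)
        if (_rdrand32_step(out))
            return 1;
    return 0;
}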

2.9.12  Fused-Multiply-ADD (FMA) Extensions

FMA extensions enhance Intel AVX with high-throughput arithmetic capabilities covering fused
multiply-add, fused multiply-subtract, fused multiply add/subtract interleave, and signed-reversed multiply
on fused multiply-add and multiply-subtract operations. FMA extensions provide 36 256-bit floating-point instructions to perform computation on 256-bit vectors and additional 128-bit and scalar FMA
instructions.
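
A minimal sketch of a fused multiply-add on eight packed single-precision elements, assuming FMA support (compile with -mfma):

    #include <immintrin.h>

    /* d[i] = a[i]*b[i] + c[i] for i = 0..7, with one fused operation. */
    void fma8(const float *a, const float *b, const float *c, float *d)
    {
        __m256 va = _mm256_loadu_ps(a);
        __m256 vb = _mm256_loadu_ps(b);
        __m256 vc = _mm256_loadu_ps(c);
        _mm256_storeu_ps(d, _mm256_fmadd_ps(va, vb, vc));  /* VFMADD213PS */
    }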


2.9.13 Intel AVX2

Intel AVX2 extends Intel AVX by promoting most of the 128-bit SIMD integer instructions to 256-bit
numeric processing capabilities. AVX2 instructions follow the same programming model as AVX instructions.
In addition, AVX2 provides enhanced functionality for broadcast/permute operations on data elements,
vector shift instructions with variable-shift count per data element, and instructions to fetch non-contiguous data elements from memory.
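
Two of these additions in a minimal C intrinsics sketch (assuming AVX2 support; compile with -mavx2): a gather of eight non-contiguous 32-bit elements, followed by a per-element variable shift:

    #include <immintrin.h>

    __m256i gather_and_shift(const int *table, __m256i indices, __m256i counts)
    {
        /* VPGATHERDD: load table[indices[i]] for i = 0..7 (scale = 4 bytes) */
        __m256i vals = _mm256_i32gather_epi32(table, indices, 4);
        /* VPSLLVD: shift each element left by its own count */
        return _mm256_sllv_epi32(vals, counts);
    }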

2.9.14 General-Purpose Bit-Processing Instructions

The fourth generation Intel Core processor family introduces a collection of bit-processing instructions
that operate on the general-purpose registers. The majority of these instructions use the VEX-prefix
encoding scheme to provide non-destructive source operand syntax.
These instructions are enumerated by three separate feature flags reported by CPUID. For details, see
Section 5.1 of Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1 and Chapters
3 and 4 of Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volumes 2A, 2B & 2C.
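
As a sketch of the style of these instructions (assuming BMI1 support; compile with -mbmi), the loop below walks the set bits of a mask with TZCNT and BLSR:

    #include <immintrin.h>

    void visit_set_bits(unsigned int mask, void (*visit)(unsigned int))
    {
        while (mask) {
            visit(_tzcnt_u32(mask));  /* TZCNT: index of lowest set bit */
            mask = _blsr_u32(mask);   /* BLSR: clear lowest set bit */
        }
    }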

2.9.15 Intel® Transactional Synchronization Extensions

The fourth generation Intel Core processor family introduces Intel® Transactional Synchronization
Extensions (Intel TSX), which aim to improve the performance of lock-protected critical sections of multithreaded applications while maintaining the lock-based programming model.
For background and details, see Chapter 15, “Programming with Intel® Transactional Synchronization
Extensions” of Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1.
Software tuning recommendations for using Intel TSX on lock-protected critical sections of multithreaded
applications are described in Chapter 12, “Intel® TSX Recommendations”.
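
A simplified sketch of the RTM programming pattern using the _xbegin/_xend intrinsics (compile with -mrtm); acquire_lock and release_lock are hypothetical placeholders for the application's own lock routines, and a production-quality elision path would also need to test the lock word inside the transaction:

    #include <immintrin.h>

    extern void acquire_lock(void);   /* hypothetical application lock */
    extern void release_lock(void);

    void locked_increment(long *counter)
    {
        if (_xbegin() == _XBEGIN_STARTED) {
            (*counter)++;             /* transactional critical section */
            _xend();
        } else {
            acquire_lock();           /* fallback: take the lock for real */
            (*counter)++;
            release_lock();
        }
    }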

2.9.16 RDSEED

The Intel Core M processor family introduces the RDSEED, ADCX and ADOX instructions.
The RDSEED instruction retrieves a random number supplied by a cryptographically secure, enhanced
non-deterministic random bit generator (Enhanced NRBG). The NRBG is designed to meet the NIST SP 800-90B and NIST SP 800-90C standards.
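
Like RDRAND, RDSEED can transiently report failure and must be retried; a minimal sketch assuming RDSEED support (compile with -mrdseed):

    #include <immintrin.h>

    /* Returns 1 on success and stores a seed-grade 32-bit value in *seed. */
    int get_seed_u32(unsigned int *seed)
    {
        for (int tries = 0; tries < 100; tries++)
            if (_rdseed32_step(seed))
                return 1;    /* CF was set: *seed is valid */
        return 0;            /* entropy temporarily unavailable */
    }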

2.9.17 ADCX and ADOX Instructions

The ADCX and ADOX instructions, in conjunction with the MULX instruction, enable software to speed up
calculations that require large integer numerics. Details can be found at https://www-ssl.intel.com/content/www/us/en/intelligent-systems/intel-technology/ia-large-integer-arithmetic-paper.html and http://www.intel.com/content/www/us/en/intelligent-systems/intel-technology/large-integer-squaring-ia-paper.html.
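
A minimal sketch of large-integer addition with the _addcarryx_u64 intrinsic (assuming ADX support; compile with -madx); the dual-carry-chain benefit of ADCX/ADOX shows up most in large multiplications, where two independent carry chains run in parallel:

    #include <immintrin.h>

    /* sum = a + b over four 64-bit limbs, propagating the carry. */
    void add256(const unsigned long long a[4],
                const unsigned long long b[4],
                unsigned long long sum[4])
    {
        unsigned char carry = 0;
        for (int i = 0; i < 4; i++)
            carry = _addcarryx_u64(carry, a[i], b[i], &sum[i]);
    }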


CHAPTER 3
GENERAL OPTIMIZATION GUIDELINES
This chapter discusses general optimization techniques that can improve the performance of applications
running on processors based on Intel microarchitecture code name Haswell, Ivy Bridge, Sandy Bridge,
Westmere, Nehalem, Enhanced Intel Core microarchitecture and Intel Core microarchitectures. These
techniques take advantage of the microarchitectural features described in Chapter 2, “Intel® 64 and IA-32 Processor
Architectures.” Optimization guidelines focusing on Intel multi-core processors, Hyper-Threading Technology and 64-bit mode applications are discussed in Chapter 8, “Multicore and Hyper-Threading Technology,” and Chapter 9, “64-bit Mode Coding Guidelines.”
Practices that optimize performance focus on three areas:

• Tools and techniques for code generation.
• Analysis of the performance characteristics of the workload and its interaction with microarchitectural sub-systems.
• Tuning code to the target microarchitecture (or families of microarchitecture) to improve performance.

Some hints on using tools are summarized first to simplify the first two tasks. The rest of the chapter
focuses on recommendations for code generation or code tuning for the target microarchitectures.
This chapter explains optimization techniques for the Intel C++ Compiler, the Intel Fortran Compiler, and
other compilers.

3.1 PERFORMANCE TOOLS

Intel offers several tools to help optimize application performance, including compilers, performance
analyzers and multithreading tools.

3.1.1 Intel® C++ and Fortran Compilers

Intel compilers support multiple operating systems (Windows*, Linux*, Mac OS* and embedded). The
Intel compilers optimize performance and give application developers access to advanced features:

• Flexibility to target 32-bit or 64-bit Intel processors for optimization.
• Compatibility with many integrated development environments or third-party compilers.
• Automatic optimization features to take advantage of the target processor’s architecture.
• Automatic compiler optimization reduces the need to write different code for different processors.
• Common compiler features that are supported across Windows, Linux and Mac OS include:
— General optimization settings.
— Cache-management features.
— Interprocedural optimization (IPO) methods.
— Profile-guided optimization (PGO) methods.
— Multithreading support.
— Floating-point arithmetic precision and consistency support.
— Compiler optimization and vectorization reports.


3.1.2 General Compiler Recommendations

Generally speaking, a compiler that has been tuned for the target microarchitecture can be expected to
match or outperform hand-coding. However, if performance problems are noted with the compiled code,
some compilers (like Intel C++ and Fortran Compilers) allow the coder to insert intrinsics or inline
assembly in order to exert control over what code is generated. If inline assembly is used, the user must
verify that the code generated is of good quality and yields good performance.
Default compiler switches are targeted for common cases. An optimization may be made to the compiler
default if it is beneficial for most programs. If the root cause of a performance problem is a poor choice
on the part of the compiler, using different switches or compiling the targeted module with a different
compiler may be the solution.

3.1.3 VTune™ Performance Analyzer

VTune uses performance monitoring hardware to collect statistics and coding information of your application and its interaction with the microarchitecture. This allows software engineers to measure performance characteristics of the workload for a given microarchitecture. VTune supports all current and past
Intel processor families.
The VTune Performance Analyzer provides two kinds of feedback:

• Indication of a performance improvement gained by using a specific coding recommendation or microarchitectural feature.
• Information on whether a change in the program has improved or degraded performance with respect to a particular metric.

The VTune Performance Analyzer also provides measures for a number of workload characteristics,
including:

• Retirement throughput of instruction execution as an indication of the degree of extractable instruction-level parallelism in the workload.
• Data traffic locality as an indication of the stress point of the cache and memory hierarchy.
• Data traffic parallelism as an indication of the degree of effectiveness of amortization of data access latency.

NOTE
Improving performance in one part of the machine does not necessarily bring significant
gains to overall performance. It is possible to degrade overall performance by improving
performance for some particular metric.
Where appropriate, coding recommendations in this chapter include descriptions of the VTune Performance Analyzer events that provide measurable data on the performance gain achieved by following the
recommendations. For more on using the VTune analyzer, refer to the application’s online help.

3.2 PROCESSOR PERSPECTIVES

Many coding recommendations work well across modern microarchitectures, from Intel Core microarchitecture to the Haswell microarchitecture. However, there are situations where a recommendation may
benefit one microarchitecture more than another. Some of these are:

• Instruction decode throughput is important. Additionally, taking advantage of the decoded ICache, Loop Stream Detector and macrofusion can further improve front end performance.
• Generating code to take advantage of the four decoders and employing micro-fusion and macro-fusion so that each of the three simple decoders is not restricted to handling simple instructions consisting of one micro-op.


• On processors based on Sandy Bridge, Ivy Bridge and Haswell microarchitectures, the code size for optimal front end performance is associated with the decoded ICache.
• Dependencies for partial register writes can incur varying degrees of penalties. To avoid false dependences from partial register updates, use full register updates and extended moves.
• Use appropriate instructions that support dependence-breaking (e.g. PXOR, SUB, XOR, XORPS).
• Hardware prefetching can reduce the effective memory latency for data and instruction accesses in general. But different microarchitectures may require some custom modifications to adapt to the specific hardware prefetch implementation of each microarchitecture.

3.2.1 CPUID Dispatch Strategy and Compatible Code Strategy

When optimum performance on all processor generations is desired, applications can take advantage of
the CPUID instruction to identify the processor generation and integrate processor-specific instructions
into the source code. The Intel C++ Compiler supports the integration of different versions of the code
for different target processors. The selection of which code to execute at runtime is made based on the
CPU identifiers. Binary code targeted for different processor generations can be generated under the
control of the programmer or by the compiler.
For applications that target multiple generations of microarchitectures, and where minimum binary code
size and a single code path are important, a compatible code strategy is best. Optimizing applications
using techniques developed for the Intel Core microarchitecture, combined with those for Intel microarchitecture code name Nehalem, is likely to improve code efficiency and scalability when running on processors
based on current and future generations of Intel 64 and IA-32 processors.
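
A minimal dispatch sketch using GCC/Clang's __builtin_cpu_supports; kernel_avx2 and kernel_sse2 are hypothetical application-provided code paths:

    extern void kernel_avx2(float *dst, const float *src, int n);
    extern void kernel_sse2(float *dst, const float *src, int n);

    void process(float *dst, const float *src, int n)
    {
        if (__builtin_cpu_supports("avx2"))
            kernel_avx2(dst, src, n);   /* processor-specific path */
        else
            kernel_sse2(dst, src, n);   /* compatible fallback path */
    }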

3.2.2 Transparent Cache-Parameter Strategy

If the CPUID instruction supports function leaf 4, also known as the deterministic cache parameter leaf, the
leaf reports cache parameters for each level of the cache hierarchy in a deterministic and forward-compatible manner across Intel 64 and IA-32 processor families.
For coding techniques that rely on specific parameters of a cache level, using the deterministic cache
parameter allows software to implement techniques in a way that is forward-compatible with future
generations of Intel 64 and IA-32 processors, and cross-compatible with processors equipped with
different cache sizes.
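
A sketch of reading the deterministic cache parameter leaf with GCC/Clang's <cpuid.h> helper; the size formula (ways x partitions x line size x sets) follows the leaf's documented field encodings:

    #include <cpuid.h>
    #include <stdio.h>

    int main(void)
    {
        for (unsigned sub = 0; ; sub++) {
            unsigned eax, ebx, ecx, edx;
            __get_cpuid_count(4, sub, &eax, &ebx, &ecx, &edx);
            unsigned type = eax & 0x1F;            /* 0 = no more caches */
            if (type == 0)
                break;
            unsigned level      = (eax >> 5) & 0x7;
            unsigned ways       = ((ebx >> 22) & 0x3FF) + 1;
            unsigned partitions = ((ebx >> 12) & 0x3FF) + 1;
            unsigned line_size  = (ebx & 0xFFF) + 1;
            unsigned sets       = ecx + 1;
            printf("L%u cache: %u bytes, %u-byte lines\n",
                   level, ways * partitions * line_size * sets, line_size);
        }
        return 0;
    }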

3.2.3 Threading Strategy and Hardware Multithreading Support

Intel 64 and IA-32 processor families offer hardware multithreading support in two forms: dual-core
technology and HT Technology.
To fully harness the performance potential of hardware multithreading in current and future generations
of Intel 64 and IA-32 processors, software must embrace a threaded approach in application design. At
the same time, to address the widest range of installed machines, multi-threaded software should be
able to run without failure on a single processor without hardware multithreading support and should
achieve performance on a single logical processor that is comparable to an unthreaded implementation
(if such comparison can be made). This generally requires architecting a multi-threaded application to
minimize the overhead of thread synchronization. Additional guidelines on multithreading are discussed
in Chapter 8, “Multicore and Hyper-Threading Technology.”

3.3 CODING RULES, SUGGESTIONS AND TUNING HINTS

This section includes rules, suggestions and hints. They are targeted for engineers who are:

• Modifying source code to enhance performance (user/source rules).
• Writing assemblers or compilers (assembly/compiler rules).
• Doing detailed performance tuning (tuning suggestions).

Coding recommendations are ranked in importance using two measures:

• Local impact (high, medium, or low) refers to a recommendation’s effect on the performance of a given instance of code.
• Generality (high, medium, or low) measures how often such instances occur across all application domains. Generality may also be thought of as “frequency”.

These recommendations are approximate. They can vary depending on coding style, application domain,
and other factors.
The purpose of the high, medium, and low (H, M, and L) priorities is to suggest the relative level of
performance gain one can expect if a recommendation is implemented.
Because it is not possible to predict the frequency of a particular code instance in applications, priority
hints cannot be directly correlated to application-level performance gain. In cases in which application-level performance gain has been observed, we have provided a quantitative characterization of the gain
(for information only). In cases in which the impact has been deemed inapplicable, no priority is
assigned.

3.4 OPTIMIZING THE FRONT END

Optimizing the front end covers two aspects:

• Maintaining steady supply of micro-ops to the execution engine — Mispredicted branches can disrupt streams of micro-ops, or cause the execution engine to waste execution resources on executing streams of micro-ops in the non-architected code path. Much of the tuning in this respect focuses on working with the Branch Prediction Unit. Common techniques are covered in Section 3.4.1, “Branch Prediction Optimization.”
• Supplying streams of micro-ops to utilize the execution bandwidth and retirement bandwidth as much as possible — For Intel Core microarchitecture and Intel Core Duo processor family, this aspect focuses on maintaining high decode throughput. In Intel microarchitecture code name Sandy Bridge, this aspect focuses on keeping the hot code running from the Decoded ICache. Techniques to maximize decode throughput for Intel Core microarchitecture are covered in Section 3.4.2, “Fetch and Decode Optimization.”

3.4.1 Branch Prediction Optimization

Branch optimizations have a significant impact on performance. By understanding the flow of branches
and improving their predictability, you can increase the speed of code significantly.
Optimizations that help branch prediction are:

• Keep code and data on separate pages. This is very important; see Section 3.6, “Optimizing Memory Accesses,” for more information.
• Eliminate branches whenever possible.
• Arrange code to be consistent with the static branch prediction algorithm.
• Use the PAUSE instruction in spin-wait loops.
• Inline functions and pair up calls and returns.
• Unroll as necessary so that repeatedly-executed loops have sixteen or fewer iterations (unless this causes an excessive code size increase).
• Avoid putting two conditional branch instructions in a loop so that both have the same branch target address and, at the same time, belong to (i.e. have their last bytes’ addresses within) the same 16-byte aligned code block.

3.4.1.1 Eliminating Branches

Eliminating branches improves performance because:

• It reduces the possibility of mispredictions.
• It reduces the number of required branch target buffer (BTB) entries. Conditional branches, which are never taken, do not consume BTB resources.

There are four principal ways of eliminating branches:

• Arrange code to make basic blocks contiguous.
• Unroll loops, as discussed in Section 3.4.1.7, “Loop Unrolling.”
• Use the CMOV instruction.
• Use the SETCC instruction.

The following rules apply to branch elimination:
Assembly/Compiler Coding Rule 1. (MH impact, M generality) Arrange code to make basic blocks
contiguous and eliminate unnecessary branches.
Assembly/Compiler Coding Rule 2. (M impact, ML generality) Use the SETCC and CMOV
instructions to eliminate unpredictable conditional branches where possible. Do not do this for
predictable branches. Do not use these instructions to eliminate all unpredictable conditional branches
(because using these instructions will incur execution overhead due to the requirement for executing
both paths of a conditional branch). In addition, converting a conditional branch to SETCC or CMOV
trades off control flow dependence for data dependence and restricts the capability of the out-of-order
engine. When tuning, note that all Intel 64 and IA-32 processors usually have very high branch
prediction rates. Consistently mispredicted branches are generally rare. Use these instructions only if
the increase in computation time is less than the expected cost of a mispredicted branch.
Consider a line of C code that has a condition dependent upon one of the constants:
X = (A < B) ? CONST1 : CONST2;
This code conditionally compares two values, A and B. If the condition is true, X is set to CONST1; otherwise it is set to CONST2. An assembly code sequence equivalent to the above C code can contain
branches that are not predictable if there is no correlation between the two values.
Example 3-1 shows the assembly code with unpredictable branches. The unpredictable branches can be
removed with the use of the SETCC instruction. Example 3-2 shows optimized code that has no
branches.
Example 3-1. Assembly Code with an Unpredictable Branch
cmp a, b            ; Condition
jbe L30             ; Conditional branch
mov ebx, const1     ; ebx holds X
jmp L31             ; Unconditional branch
L30:
mov ebx, const2
L31:

Example 3-2. Code Optimization to Eliminate Branches
xor ebx, ebx        ; Clear ebx (X in the C code)
cmp A, B
setge bl            ; When ebx = 0 or 1
                    ; OR the complement condition
sub ebx, 1          ; ebx=11...11 or 00...00
and ebx, CONST3     ; CONST3 = CONST1-CONST2
add ebx, CONST2     ; ebx=CONST1 or CONST2


The optimized code in Example 3-2 sets EBX to zero, then compares A and B. If A is greater than or equal
to B, EBX is set to one. Then EBX is decreased and AND’d with the difference of the constant values. This
sets EBX to either zero or the difference of the values. By adding CONST2 back to EBX, the correct value
is written to EBX. When CONST2 is equal to zero, the last instruction can be deleted.
Another way to remove branches is to use the CMOV and FCMOV instructions. Example 3-3 shows how
to change a TEST and branch instruction sequence using CMOV to eliminate a branch. If the TEST sets
the equal flag, the value in EBX will be moved to EAX. This branch is data-dependent, and is representative of an unpredictable branch.

Example 3-3. Eliminating Branch with CMOV Instruction
test ecx, ecx
jne  1H
mov  eax, ebx
1H:
; To optimize code, combine jne and mov into one cmovcc instruction that checks the equal flag
test   ecx, ecx     ; Test the flags
cmoveq eax, ebx     ; If the equal flag is set, move
                    ; ebx to eax - the 1H: tag no longer needed

3.4.1.2 Spin-Wait and Idle Loops

The Pentium 4 processor introduces a new PAUSE instruction; the instruction is architecturally a NOP on
Intel 64 and IA-32 processor implementations.
To the Pentium 4 and later processors, this instruction acts as a hint that the code sequence is a spin-wait
loop. Without a PAUSE instruction in such loops, the Pentium 4 processor may suffer a severe penalty
when exiting the loop because the processor may detect a possible memory order violation. Inserting the
PAUSE instruction significantly reduces the likelihood of a memory order violation and as a result
improves performance.
In Example 3-4, the code spins until memory location A matches the value stored in the register EAX.
Such code sequences are common when protecting a critical section, in producer-consumer sequences,
for barriers, or other synchronization.
Example 3-4. Use of PAUSE Instruction
lock: cmp eax, a
      jne loop
      ; Code in critical section:
loop: pause
      cmp eax, a
      jne loop
      jmp lock
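
The same pattern in C typically uses the _mm_pause intrinsic; a minimal sketch, assuming another thread eventually stores the expected value to the flag:

    #include <immintrin.h>
    #include <stdatomic.h>

    void spin_wait(atomic_int *flag, int expected)
    {
        while (atomic_load_explicit(flag, memory_order_acquire) != expected)
            _mm_pause();   /* spin-wait hint: reduces exit penalty and power */
    }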

3.4.1.3 Static Prediction

Branches that do not have a history in the BTB (see Section 3.4.1, “Branch Prediction Optimization”) are
predicted using a static prediction algorithm:

• Predict unconditional branches to be taken.
• Predict indirect branches to be NOT taken.

The following rule applies to static elimination:


Assembly/Compiler Coding Rule 3. (M impact, H generality) Arrange code to be consistent with
the static branch prediction algorithm: make the fall-through code following a conditional branch be the
likely target for a branch with a forward target, and make the fall-through code following a conditional
branch be the unlikely target for a branch with a backward target.
Example 3-5 illustrates the static branch prediction algorithm. The body of an IF-THEN conditional is
predicted.
Example 3-5. Static Branch Prediction Algorithm
//Forward condition branches not taken (fall through)
IF {....

↓
}

IF {...

↓
}

//Backward conditional branches are taken
LOOP {...
↑ −− }
//Unconditional branches taken
JMP
------→
Example 3-6 and Example 3-7 provide basic rules for a static prediction algorithm. In Example 3-6, the
backward branch (JC BEGIN) is not in the BTB the first time through; therefore, the BTB does not issue
a prediction. The static predictor, however, will predict the branch to be taken, so a misprediction will not
occur.

Example 3-6. Static Taken Prediction
Begin: mov  eax, mem32
       and  eax, ebx
       imul eax, edx
       shld eax, 7
       jc   Begin

The first branch instruction (JC BEGIN) in Example 3-7 is a conditional forward branch. It is not in the
BTB the first time through, but the static predictor will predict the branch to fall through. The static
prediction algorithm correctly predicts that the CALL CONVERT instruction will be taken, even before the
branch has any branch history in the BTB.

Example 3-7. Static Not-Taken Prediction
       mov  eax, mem32
       and  eax, ebx
       imul eax, edx
       shld eax, 7
       jc   Begin
       mov  eax, 0
Begin: call Convert


The Intel Core microarchitecture does not use the static prediction heuristic. However, to maintain
consistency across Intel 64 and IA-32 processors, software should maintain the static prediction heuristic
as the default.

3.4.1.4 Inlining, Calls and Returns

The return address stack mechanism augments the static and dynamic predictors to optimize specifically
for calls and returns. It holds 16 entries, which is large enough to cover the call depth of most programs.
If there is a chain of more than 16 nested calls and more than 16 returns in rapid succession, performance may degrade.
The trace cache in Intel NetBurst microarchitecture maintains branch prediction information for calls and
returns. As long as the trace with the call or return remains in the trace cache and the call and return
targets remain unchanged, the depth limit of the return address stack described above will not impede
performance.
To enable the use of the return stack mechanism, calls and returns must be matched in pairs. If this is
done, the likelihood of exceeding the stack depth in a manner that will impact performance is very low.
The following rules apply to inlining, calls, and returns:
Assembly/Compiler Coding Rule 4. (MH impact, MH generality) Near calls must be matched with
near returns, and far calls must be matched with far returns. Pushing the return address on the stack
and jumping to the routine to be called is not recommended since it creates a mismatch in calls and
returns.
Calls and returns are expensive; use inlining for the following reasons:

• Parameter passing overhead can be eliminated.
• In a compiler, inlining a function exposes more opportunity for optimization.
• If the inlined routine contains branches, the additional context of the caller may improve branch prediction within the routine.
• A mispredicted branch can lead to performance penalties inside a small function that are larger than those that would occur if that function is inlined.

Assembly/Compiler Coding Rule 5. (MH impact, MH generality) Selectively inline a function if
doing so decreases code size or if the function is small and the call site is frequently executed.
Assembly/Compiler Coding Rule 6. (H impact, H generality) Do not inline a function if doing so
increases the working set size beyond what will fit in the trace cache.
Assembly/Compiler Coding Rule 7. (ML impact, ML generality) If there are more than 16 nested
calls and returns in rapid succession, consider transforming the program with inlining to reduce the call
depth.
Assembly/Compiler Coding Rule 8. (ML impact, ML generality) Favor inlining small functions that
contain branches with poor prediction rates. If a branch misprediction results in a RETURN being
prematurely predicted as taken, a performance penalty may be incurred.
Assembly/Compiler Coding Rule 9. (L impact, L generality) If the last statement in a function is
a call to another function, consider converting the call to a jump. This will save the call/return overhead
as well as an entry in the return stack buffer.
Assembly/Compiler Coding Rule 10. (M impact, L generality) Do not put more than four
branches in a 16-byte chunk.
Assembly/Compiler Coding Rule 11. (M impact, L generality) Do not put more than two end loop
branches in a 16-byte chunk.

3.4.1.5 Code Alignment

Careful arrangement of code can enhance cache and memory locality. Likely sequences of basic blocks
should be laid out contiguously in memory. This may involve removing unlikely code, such as code to
handle error conditions, from the sequence. See Section 3.7, “Prefetching,” on optimizing the instruction
prefetcher.

Assembly/Compiler Coding Rule 12. (M impact, H generality) All branch targets should be 16-byte aligned.
Assembly/Compiler Coding Rule 13. (M impact, H generality) If the body of a conditional is not
likely to be executed, it should be placed in another part of the program. If it is highly unlikely to be
executed and code locality is an issue, it should be placed on a different code page.

3.4.1.6 Branch Type Selection

The default predicted target for indirect branches and calls is the fall-through path. Fall-through prediction is overridden if and when a hardware prediction is available for that branch. The predicted branch
target from branch prediction hardware for an indirect branch is the previously executed branch target.
The default prediction to the fall-through path is only a significant issue if no branch prediction is available, due to poor code locality or pathological branch conflict problems. For indirect calls, predicting the
fall-through path is usually not an issue, since execution will likely return to the instruction after the
associated return.
Placing data immediately following an indirect branch can cause a performance problem. If the data
consists of all zeros, it looks like a long stream of ADDs to memory destinations and this can cause
resource conflicts and slow down branch recovery. Also, data immediately following indirect branches
may appear as branches to the branch prediction hardware, which can branch off to execute other data
pages. This can lead to subsequent self-modifying code problems.
Assembly/Compiler Coding Rule 14. (M impact, L generality) When indirect branches are
present, try to put the most likely target of an indirect branch immediately following the indirect
branch. Alternatively, if indirect branches are common but they cannot be predicted by branch
prediction hardware, then follow the indirect branch with a UD2 instruction, which will stop the
processor from decoding down the fall-through path.
Indirect branches resulting from code constructs (such as switch statements, computed GOTOs or calls
through pointers) can jump to an arbitrary number of locations. If the code sequence is such that the
target destination of a branch goes to the same address most of the time, then the BTB will predict accurately most of the time. Since only one taken (non-fall-through) target can be stored in the BTB, indirect
branches with multiple taken targets may have lower prediction rates.
The effective number of targets stored may be increased by introducing additional conditional branches.
Adding a conditional branch to a target is fruitful if:

• The branch direction is correlated with the branch history leading up to that branch; that is, not just the last target, but how it got to this branch.
• The source/target pair is common enough to warrant using the extra branch prediction capacity. This may increase the number of overall branch mispredictions, while improving the misprediction of indirect branches. The profitability is lower if the number of mispredicting branches is very large.

User/Source Coding Rule 1. (M impact, L generality) If an indirect branch has two or more
common taken targets and at least one of those targets is correlated with branch history leading up to
the branch, then convert the indirect branch to a tree where one or more indirect branches are
preceded by conditional branches to those targets. Apply this “peeling” procedure to the common
target of an indirect branch that correlates to branch history.
The purpose of this rule is to reduce the total number of mispredictions by enhancing the predictability of
branches (even at the expense of adding more branches). The added branches must be predictable for
this to be worthwhile. One reason for such predictability is a strong correlation with preceding branch
history. That is, the directions taken on preceding branches are a good indicator of the direction of the
branch under consideration.


Example 3-8 shows a simple example of the correlation between a target of a preceding conditional
branch and a target of an indirect branch.
Example 3-8. Indirect Branch With Two Favored Targets
function ()
{
    int n = rand();       // random integer 0 to RAND_MAX
    if ( ! (n & 0x01) ) { // n will be 0 half the times
        n = 0;            // updates branch history to predict taken
    }
    // indirect branches with multiple taken targets
    // may have lower prediction rates
    switch (n) {
        case 0: handle_0(); break;   // common target, correlated with
                                     // branch history that is forward taken
        case 1: handle_1(); break;   // uncommon
        case 3: handle_3(); break;   // uncommon
        default: handle_other();     // common target
    }
}
Correlation can be difficult to determine analytically, for a compiler and for an assembly language
programmer. It may be fruitful to evaluate performance with and without peeling to get the best performance from a coding effort.
An example of peeling out the most favored target of an indirect branch with correlated branch history is
shown in Example 3-9.

Example 3-9. A Peeling Technique to Reduce Indirect Branch Misprediction
function ()
{
    int n = rand();          // Random integer 0 to RAND_MAX
    if ( ! (n & 0x01) ) THEN // n will be 0 half the times
        n = 0;
    if (!n) THEN             // Peel out the most common target
        handle_0();          // with correlated branch history
    {
        switch (n) {
            case 1: handle_1(); break;   // Uncommon
            case 3: handle_3(); break;   // Uncommon
            default: handle_other();     // Make the favored target in
                                         // the fall-through path
        }
    }
}

3.4.1.7 Loop Unrolling

Benefits of unrolling loops are:

• Unrolling amortizes the branch overhead, since it eliminates branches and some of the code to manage induction variables.
• Unrolling allows one to aggressively schedule (or pipeline) the loop to hide latencies. This is useful if you have enough free registers to keep variables live as you stretch out the dependence chain to expose the critical path.
• Unrolling exposes the code to various other optimizations, such as removal of redundant loads, common subexpression elimination, and so on.

The potential costs of unrolling loops are:

• Excessive unrolling or unrolling of very large loops can lead to increased code size. This can be harmful if the unrolled loop no longer fits in the trace cache (TC).
• Unrolling loops whose bodies contain branches increases demand on BTB capacity. If the number of iterations of the unrolled loop is 16 or fewer, the branch predictor should be able to correctly predict branches in the loop body that alternate direction.

Assembly/Compiler Coding Rule 15. (H impact, M generality) Unroll small loops until the
overhead of the branch and induction variable accounts (generally) for less than 10% of the execution
time of the loop.
Assembly/Compiler Coding Rule 16. (H impact, M generality) Avoid unrolling loops excessively;
this may thrash the trace cache or instruction cache.
Assembly/Compiler Coding Rule 17. (M impact, M generality) Unroll loops that are frequently
executed and have a predictable number of iterations to reduce the number of iterations to 16 or fewer.
Do this unless it increases code size so that the working set no longer fits in the trace or instruction
cache. If the loop body contains more than one conditional branch, then unroll so that the number of
iterations is 16/(# conditional branches).
Example 3-10 shows how unrolling enables other optimizations.
Example 3-10. Loop Unrolling
Before unrolling:
do i = 1, 100
if ( i mod 2 == 0 ) then a( i ) = x
else a( i ) = y
enddo
After unrolling
do i = 1, 100, 2
a( i ) = y
a( i+1 ) = x
enddo
In this example, the loop that executes 100 times assigns X to every even-numbered element and Y to
every odd-numbered element. By unrolling the loop you can make assignments more efficiently,
removing one branch in the loop body.

3.4.1.8 Compiler Support for Branch Prediction

Compilers generate code that improves the efficiency of branch prediction in Intel processors. The Intel
C++ Compiler accomplishes this by:

• Keeping code and data on separate pages.
• Using conditional move instructions to eliminate branches.
• Generating code consistent with the static branch prediction algorithm.
• Inlining where appropriate.
• Unrolling if the number of iterations is predictable.

With profile-guided optimization, the compiler can lay out basic blocks to eliminate branches for the most
frequently executed paths of a function or at least improve their predictability. Branch prediction need
not be a concern at the source level. For more information, see Intel C++ Compiler documentation.

3.4.2 Fetch and Decode Optimization

Intel Core microarchitecture provides several mechanisms to increase front end throughput. Techniques
to take advantage of some of these features are discussed below.

3.4.2.1 Optimizing for Micro-fusion

An instruction that operates on a register and a memory operand decodes into more micro-ops than its
corresponding register-register version. Replacing the equivalent work of the former instruction using
the register-register version usually requires a sequence of two instructions. The latter sequence is likely
to result in reduced fetch bandwidth.
Assembly/Compiler Coding Rule 18. (ML impact, M generality) To improve fetch/decode
throughput, give preference to the memory flavor of an instruction over the register-only flavor of the
same instruction, if such an instruction can benefit from micro-fusion.
The following examples are some of the types of micro-fusions that can be handled by all decoders:

• All stores to memory, including store immediate. Stores execute internally as two separate micro-ops: store-address and store-data.
• All “read-modify” (load+op) instructions between register and memory, for example:
ADDPS XMM9, OWORD PTR [RSP+40]
FADD DOUBLE PTR [RDI+RSI*8]
XOR RAX, QWORD PTR [RBP+32]
• All instructions of the form “load and jump,” for example:
JMP [RDI+200]
RET
• CMP and TEST with immediate operand and memory.

An Intel 64 instruction with RIP relative addressing is not micro-fused in the following cases:

• When an additional immediate is needed, for example:
CMP [RIP+400], 27
MOV [RIP+3000], 142
• When an RIP is needed for control flow purposes, for example:
JMP [RIP+5000000]

In these cases, Intel Core microarchitecture and Intel microarchitecture code name Sandy Bridge
provide a 2 micro-op flow from decoder 0, resulting in a slight loss of decode bandwidth, since the 2 micro-op flow must be steered to decoder 0 from the decoder with which it was aligned.
RIP addressing may be common in accessing global data. Since it will not benefit from micro-fusion,
the compiler may consider accessing global data with other means of memory addressing.

3.4.2.2 Optimizing for Macro-fusion

Macro-fusion merges two instructions to a single micro-op. Intel Core microarchitecture performs this
hardware optimization under limited circumstances.
The first instruction of the macro-fused pair must be a CMP or TEST instruction. This instruction can be
REG-REG, REG-IMM, or a micro-fused REG-MEM comparison. The second instruction (adjacent in the
instruction stream) should be a conditional branch.
Since these pairs are a common ingredient in basic iterative programming sequences, macro-fusion
improves performance even on un-recompiled binaries. All of the decoders can decode one macro-fused

pair per cycle, with up to three other instructions, resulting in a peak decode bandwidth of 5 instructions
per cycle.
Each macro-fused instruction executes with a single dispatch. This process reduces latency, which in this
case shows up as a cycle removed from the branch mispredict penalty. Software also gains all other fusion
benefits: increased rename and retire bandwidth, more storage for instructions in-flight, and power
savings from representing more work in fewer bits.
The following list details when you can use macro-fusion:

• CMP or TEST can be fused when comparing:
REG-REG. For example: CMP EAX,ECX; JZ label
REG-IMM. For example: CMP EAX,0x80; JZ label
REG-MEM. For example: CMP EAX,[ECX]; JZ label
MEM-REG. For example: CMP [EAX],ECX; JZ label
• TEST can be fused with all conditional jumps.
• CMP can be fused with only the following conditional jumps in Intel Core microarchitecture. These conditional jumps check the carry flag (CF) or the zero flag (ZF). The list of macro-fusion-capable conditional jumps is:
JA or JNBE
JAE or JNB or JNC
JE or JZ
JNA or JBE
JNAE or JC or JB
JNE or JNZ

CMP and TEST cannot be fused when comparing MEM-IMM (e.g. CMP [EAX],0x80; JZ label). Macro-fusion is not supported in 64-bit mode for Intel Core microarchitecture.

• Intel microarchitecture code name Nehalem supports the following enhancements in macro-fusion:
— CMP can be fused with the following conditional jumps (that were not supported in Intel Core microarchitecture):
• JL or JNGE
• JGE or JNL
• JLE or JNG
• JG or JNLE
— Macro-fusion is supported in 64-bit mode.

• Enhanced macro-fusion support in Intel microarchitecture code name Sandy Bridge is summarized in Table 3-1 with additional information in Section 2.3.2.1 and Example 3-15:

Table 3-1. Macro-Fusible Instructions in Intel Microarchitecture Code Name Sandy Bridge

Instructions                          TEST   AND   CMP   ADD   SUB   INC   DEC
JO/JNO                                Y      Y     N     N     N     N     N
JC/JB/JAE/JNB                         Y      Y     Y     Y     Y     N     N
JE/JZ/JNE/JNZ                         Y      Y     Y     Y     Y     Y     Y
JNA/JBE/JA/JNBE                       Y      Y     Y     Y     Y     N     N
JS/JNS/JP/JPE/JNP/JPO                 Y      Y     N     N     N     N     N
JL/JNGE/JGE/JNL/JLE/JNG/JG/JNLE       Y      Y     Y     Y     Y     Y     Y


Assembly/Compiler Coding Rule 19. (M impact, ML generality) Employ macro-fusion where
possible using instruction pairs that support macro-fusion. Prefer TEST over CMP if possible. Use
unsigned variables and unsigned jumps when possible. Try to logically verify that a variable is non-negative at the time of comparison. Avoid CMP or TEST of MEM-IMM flavor when possible. However, do
not add other instructions to avoid using the MEM-IMM flavor.

Example 3-11. Macro-fusion, Unsigned Iteration Count

Without Macro-fusion:
C code:
    for (int i = 0; i < 1000; i++)   // (1)
        a++;
Disassembly:
    mov  dword ptr [ i ], 0
    jmp  First
Loop:
    mov  eax, dword ptr [ i ]
    add  eax, 1
    mov  dword ptr [ i ], eax
First:
    cmp  dword ptr [ i ], 3E8H      ; (3)
    jge  End
    ; a++
    mov  eax, dword ptr [ a ]
    add  eax, 1
    mov  dword ptr [ a ], eax
    jmp  Loop
End:

With Macro-fusion:
C code:
    for ( unsigned int i = 0; i < 1000; i++)   // (2)
        a++;
Disassembly:
    xor  eax, eax
    mov  dword ptr [ i ], eax
    jmp  First
Loop:
    mov  eax, dword ptr [ i ]
    add  eax, 1
    mov  dword ptr [ i ], eax
First:
    cmp  eax, 3E8H                  ; (4)
    jae  End
    ; a++
    mov  eax, dword ptr [ a ]
    add  eax, 1
    mov  dword ptr [ a ], eax
    jmp  Loop
End:

NOTES:
1. Signed iteration count inhibits macro-fusion.
2. Unsigned iteration count is compatible with macro-fusion.
3. CMP MEM-IMM, JGE inhibit macro-fusion.
4. CMP REG-IMM, JAE permits macro-fusion.

Example 3-12. Macro-fusion, If Statement

Without Macro-fusion:
C code:
    int a = 7;   // (1)
    if ( a < 77 )
        a++;
    else
        a--;
Disassembly:
    mov  dword ptr [ a ], 7
    ; if ( a < 77 )
    cmp  dword ptr [ a ], 4DH   ; (3)
    jge  Dec
    ; a++
    mov  eax, dword ptr [ a ]
    add  eax, 1
    mov  dword ptr [ a ], eax
    ; else
    jmp  End
Dec:
    ; a--
    mov  eax, dword ptr [ a ]
    sub  eax, 1
    mov  dword ptr [ a ], eax
End:

With Macro-fusion:
C code:
    unsigned int a = 7;   // (2)
    if ( a < 77 )
        a++;
    else
        a--;
Disassembly:
    mov  dword ptr [ a ], 7
    ; if ( a < 77 )
    mov  eax, dword ptr [ a ]
    cmp  eax, 4DH
    jae  Dec
    ; a++
    add  eax, 1
    mov  dword ptr [ a ], eax
    ; else
    jmp  End
Dec:
    ; a--
    sub  eax, 1
    mov  dword ptr [ a ], eax
End:

NOTES:
1. Signed iteration count inhibits macro-fusion.
2. Unsigned iteration count is compatible with macro-fusion.
3. CMP MEM-IMM, JGE inhibit macro-fusion.
Assembly/Compiler Coding Rule 20. (M impact, ML generality) Software can enable macro-fusion when it can be logically determined that a variable is non-negative at the time of comparison;
use TEST appropriately to enable macro-fusion when comparing a variable with 0.
Example 3-13. Macro-fusion, Signed Variable

Without Macro-fusion:
    test ecx, ecx
    jle  OutSideTheIF
    cmp  ecx, 64H
    jge  OutSideTheIF
OutSideTheIF:

With Macro-fusion:
    test ecx, ecx
    jle  OutSideTheIF
    cmp  ecx, 64H
    jae  OutSideTheIF
OutSideTheIF:

For either a signed or an unsigned variable ‘a’, “CMP a,0” and “TEST a,a” produce the same result as far as the
flags are concerned. Since TEST can be macro-fused more often, software can use “TEST a,a” to replace
“CMP a,0” for the purpose of enabling macro-fusion.
Example 3-14. Macro-fusion, Signed Comparison

C code: if (a == 0)
Without Macro-fusion:
    cmp  a, 0
    jne  lbl
    ...
lbl:
With Macro-fusion:
    test a, a
    jne  lbl
    ...
lbl:

C code: if ( a >= 0)
Without Macro-fusion:
    cmp  a, 0
    jl   lbl
    ...
lbl:
With Macro-fusion:
    test a, a
    jl   lbl
    ...
lbl:

Intel microarchitecture code name Sandy Bridge enables more arithmetic and logic instructions to
macro-fuse with conditional branches. In loops where the ALU ports are already congested, performing
one of these macro-fusions can relieve the pressure, as the macro-fused instruction consumes only port
5, instead of an ALU port plus port 5.
In Example 3-15, the “add/cmp/jnz” loop contains two ALU instructions that can be dispatched via either
port 0, 1 or 5. So there is a higher probability that port 5 binds to one of the ALU instructions, causing JNZ to
wait a cycle. In the “sub/jnz” loop, the likelihood that ADD/SUB/JNZ can be dispatched in the same cycle is
increased because only SUB is free to bind with either port 0, 1 or 5.

Example 3-15. Additional Macro-fusion Benefit in Intel Microarchitecture Code Name Sandy Bridge

Add + cmp + jnz alternative:
    lea  rdx, buff
    xor  rcx, rcx
    xor  eax, eax
loop:
    add  eax, [rdx + 4 * rcx]
    add  rcx, 1
    cmp  rcx, LEN
    jnz  loop

Loop control with sub + jnz:
    lea  rdx, buff - 4
    xor  rcx, LEN
    xor  eax, eax
loop:
    add  eax, [rdx + 4 * rcx]
    sub  rcx, 1
    jnz  loop

3.4.2.3 Length-Changing Prefixes (LCP)

An instruction can be up to 15 bytes in length. Some prefixes can dynamically change the
length of an instruction that the decoder must recognize. Typically, the pre-decode unit will estimate the
length of an instruction in the byte stream assuming the absence of LCP. When the predecoder encounters an LCP in the fetch line, it must use a slower length decoding algorithm. With the slower length
decoding algorithm, the predecoder decodes the fetch in 6 cycles, instead of the usual 1 cycle. Normal
queuing throughout the machine pipeline generally cannot hide LCP penalties.
The prefixes that can dynamically change the length of an instruction include:
• Operand size prefix (0x66).
• Address size prefix (0x67).

The instruction MOV DX, 01234h is subject to LCP stalls in processors based on Intel Core microarchitecture, and in Intel Core Duo and Intel Core Solo processors. Instructions that contain imm16 as part of
their fixed encoding but do not require LCP to change the immediate size are not subject to LCP stalls.
The REX prefix (4xh) in 64-bit mode can change the size of two classes of instruction, but does not cause
an LCP penalty.
If the LCP stall happens in a tight loop, it can cause significant performance degradation. When decoding
is not a bottleneck, as in floating-point heavy code, isolated LCP stalls usually do not cause performance
degradation.
Assembly/Compiler Coding Rule 21. (MH impact, MH generality) Favor generating code using
imm8 or imm32 values instead of imm16 values.
If imm16 is needed, load equivalent imm32 into a register and use the word value in the register instead.

Double LCP Stalls
Instructions that are subject to LCP stalls and cross a 16-byte fetch line boundary can cause the LCP stall
to trigger twice. The following alignment situations can cause LCP stalls to trigger twice:

• An instruction is encoded with a MODR/M and SIB byte, and the fetch line boundary crossing is between the MODR/M and the SIB bytes.
• An instruction starts at offset 13 of a fetch line and references a memory location using register and immediate byte offset addressing mode.

The first stall is for the 1st fetch line, and the 2nd stall is for the 2nd fetch line. A double LCP stall causes
a decode penalty of 11 cycles.


The following examples cause LCP stall once, regardless of their fetch-line location of the first byte of the
instruction:
ADD DX, 01234H
ADD word ptr [EDX], 01234H
ADD word ptr 012345678H[EDX], 01234H
ADD word ptr [012345678H], 01234H
The following instructions cause a double LCP stall when starting at offset 13 of a fetch line:
ADD word ptr [EDX+ESI], 01234H
ADD word ptr 012H[EDX], 01234H
ADD word ptr 012345678H[EDX+ESI], 01234H
To avoid double LCP stalls, do not use instructions subject to LCP stalls that use SIB byte encoding or
addressing mode with byte displacement.

False LCP Stalls
False LCP stalls have the same characteristics as LCP stalls, but occur on instructions that do not have
any imm16 value.
False LCP stalls occur when instructions with LCP that are encoded using the F7 opcodes are
located at offset 14 of a fetch line. These instructions are: not, neg, div, idiv, mul, and imul. A false LCP stall
incurs a delay because the instruction length decoder cannot determine the length of the instruction
before the next fetch line, which holds the exact opcode of the instruction in its MODR/M byte.
The following techniques can help avoid false LCP stalls:

• Upcast all short operations from the F7 group of instructions to long, using the full 32 bit version.
• Ensure that the F7 opcode never starts at offset 14 of a fetch line.

Assembly/Compiler Coding Rule 22. (M impact, ML generality) Ensure instructions using the 0xF7
opcode byte do not start at offset 14 of a fetch line; and avoid using these instructions to operate on
16-bit data, upcasting short data to 32 bits.
Example 3-16. Avoiding False LCP Delays with 0xF7 Group Instructions

A Sequence Causing Delay in the Decoder:
    neg word ptr a

Alternate Sequence to Avoid Delay:
    movsx eax, word ptr a
    neg   eax
    mov   word ptr a, AX

3.4.2.4 Optimizing the Loop Stream Detector (LSD)

Loops that fit the following criteria are detected by the LSD and replayed from the instruction queue to
feed the decoder in Intel Core microarchitecture:

• Must be less than or equal to four 16-byte fetches.
• Must be less than or equal to 18 instructions.
• Can contain no more than four taken branches and none of them can be a RET.
• Should usually have more than 64 iterations.

Loop Stream Detector in Intel microarchitecture code name Nehalem is improved by:

• Caching decoded micro-operations in the instruction decoder queue (IDQ, see Section 2.5.2) to feed the rename/alloc stage.
• The size of the LSD is increased to 28 micro-ops.


The LSD and micro-op queue implementation continue to improve in Sandy Bridge and Haswell microarchitectures. They have the following characteristics:

Table 3-2. Small Loop Criteria Detected by Sandy Bridge and Haswell Microarchitectures

Sandy Bridge and Ivy Bridge microarchitectures          | Haswell microarchitecture
Up to 8 chunk fetches of 32 instruction bytes           | 8 chunk fetches if HTT active, 11 chunk fetches if HTT off
Up to 28 micro-ops                                      | 28 micro-ops if HTT active, 56 micro-ops if HTT off
All micro-ops resident in Decoded ICache (i.e. DSB),    | All micro-ops resident in DSB, including micro-ops from
but not from MSROM                                      | MSROM
No more than 8 taken branches                           | Relaxed
Exclude CALL and RET                                    | Exclude CALL and RET
Mismatched stack operation disqualify                   | Same

Many calculation-intensive loops, searches and software string moves match these characteristics. These
loops exceed the BPU prediction capacity and always terminate in a branch misprediction.
Assembly/Compiler Coding Rule 23. (MH impact, MH generality) Break up a long sequence
of instructions in a loop into loops of shorter instruction blocks of no more than the size of the LSD.
Assembly/Compiler Coding Rule 24. (MH impact, M generality) Avoid unrolling loops containing
LCP stalls, if the unrolled block exceeds the size of the LSD.

3.4.2.5 Exploit LSD Micro-op Emission Bandwidth in Intel® Microarchitecture Code Name Sandy Bridge

The LSD holds micro-ops that construct small “infinite” loops. Micro-ops from the LSD are allocated in the
out-of-order engine. The loop in the LSD ends with a taken branch to the beginning of the loop. The taken
branch at the end of the loop is always the last micro-op allocated in the cycle. The instruction at the
beginning of the loop is always allocated at the next cycle. If the code performance is bound by front end
bandwidth, unused allocation slots result in a bubble in allocation, and can cause performance degradation.
Allocation bandwidth in Intel microarchitecture code name Sandy Bridge is four micro-ops per cycle.
Performance is best when the number of micro-ops in the LSD results in the least number of unused allocation slots. You can use loop unrolling to control the number of micro-ops that are in the LSD.
In Example 3-17, the code sums all array elements. The original code adds one element per iteration.
It has three micro-ops per iteration, all allocated in one cycle. Code throughput is one load per cycle.
When unrolling the loop once there are five micro-ops per iteration, which are allocated in two cycles.
Code throughput is still one load per cycle. Therefore there is no performance gain.
When unrolling the loop twice there are seven micro-ops per iteration, still allocated in two cycles. Since
two loads can be executed in each cycle this code has a potential throughput of three load operations in
two cycles.
Example 3-17. Unrolling Loops in LSD to Optimize Emission Bandwidth

No Unrolling:
lp: add eax, [rsi + 4* rcx]
    dec rcx
    jnz lp

Unroll once:
lp: add eax, [rsi + 4* rcx]
    add eax, [rsi + 4* rcx + 4]
    add rcx, -2
    jnz lp

Unroll twice:
lp: add eax, [rsi + 4* rcx]
    add eax, [rsi + 4* rcx + 4]
    add eax, [rsi + 4* rcx + 8]
    add rcx, -3
    jnz lp


3.4.2.6 Optimization for Decoded ICache

The decoded ICache is a new feature in Intel microarchitecture code name Sandy Bridge. Running the
code from the Decoded ICache has two advantages:

• Higher bandwidth of micro-ops feeding the out-of-order engine.
• The front end does not need to decode the code that is in the Decoded ICache. This saves power.

There is overhead in switching between the Decoded ICache and the legacy decode pipeline. If your code
switches frequently between the front end and the Decoded ICache, the penalty may be higher than
running only from the legacy pipeline.
To ensure “hot” code is feeding from the decoded ICache:

• Make sure each hot code block is less than about 500 instructions. Specifically, do not unroll to more than 500 instructions in a loop. This should enable Decoded ICache residency even when hyper-threading is enabled.
• For applications with very large blocks of calculations inside a loop, consider loop-fission: split the loop into multiple loops that fit in the Decoded ICache, rather than a single loop that overflows.
• If an application can be sure to run with only one thread per core, it can increase hot code block size to about 1000 instructions.

Dense Read-Modify-Write Code
The Decoded ICache can hold only up to 18 micro-ops per each 32-byte aligned memory chunk. Therefore, code with a high concentration of instructions that are encoded in a small number of bytes, yet have
many micro-ops, may overflow the 18 micro-op limitation and not enter the Decoded ICache. Read-modify-write (RMW) instructions are a good example of such instructions.
RMW instructions accept one memory source operand, one register source operand, and use the source
memory operand as the destination. The same functionality can be achieved by two or three instructions:
the first reads the memory source operand, the second performs the operation with the second register
source operand, and the last writes the result back to memory. These instructions usually result in the
same number of micro-ops but use more bytes to encode the same functionality.
One case where RMW instructions may be used extensively is when the compiler optimizes aggressively
for code size.
Here are some possible solutions to fit the hot code in the Decoded ICache:

• Replace RMW instructions with two or three instructions that have the same functionality. For example, “adc [rdi], rcx“ is only three bytes long; the equivalent sequence “adc rax, [rdi]“ + “mov [rdi], rax“ has a footprint of six bytes.
• Align the code so that the dense part is broken down among two different 32-byte chunks. This solution is useful when using a tool that aligns code automatically, and is indifferent to code changes.
• Spread the code by adding multiple-byte NOPs in the loop. Note that this solution adds micro-ops for execution.

Align Unconditional Branches for Decoded ICache
For code entering the Decoded ICache, each unconditional branch is the last micro-op occupying a
Decoded ICache Way. Therefore, only three unconditional branches per 32-byte aligned chunk can
enter the Decoded ICache.
Unconditional branches are frequent in jump tables and switch declarations. Below are examples for
these constructs, and methods for writing them so that they fit in the Decoded ICache.
Compilers create jump tables for C++ virtual class methods or DLL dispatch tables. Each unconditional
branch consumes five bytes; therefore up to seven of them can be associated with a 32-byte chunk. Thus
jump tables may not fit in the Decoded ICache if the unconditional branches are too dense in each
32-byte aligned chunk. This can cause performance degradation for code executing before and after the
branch table.
The solution is to add multi-byte NOP instructions among the branches in the branch table. This may
increase code size and should be used cautiously. However, these NOPs are not executed and therefore
have no penalty in later pipe stages.

Switch-Case constructs represent a similar situation. Each evaluation of a case condition results in an
unconditional branch. The same solution of using multi-byte NOPs can apply for every three consecutive
unconditional branches that fit inside an aligned 32-byte chunk.
Two Branches in a Decoded ICache Way
The Decoded ICache can hold up to two branches in a way. Dense branches in a 32 byte aligned chunk,
or their ordering with other instructions may prohibit all the micro-ops of the instructions in the chunk
from entering the Decoded ICache. This does not happen often. When it does happen, you can space the
code with NOP instructions where appropriate. Make sure that these NOP instructions are not part of hot
code.
Assembly/Compiler Coding Rule 25. (M impact, M generality) Avoid putting explicit references to
ESP in a sequence of stack operations (POP, PUSH, CALL, RET).

3.4.2.7 Other Decoding Guidelines

Assembly/Compiler Coding Rule 26. (ML impact, L generality) Use simple instructions that are
less than eight bytes in length.
Assembly/Compiler Coding Rule 27. (M impact, MH generality) Avoid using prefixes to change
the size of immediate and displacement.
Long instructions (more than seven bytes) may limit the number of decoded instructions per cycle. Each
prefix adds one byte to the length of instruction, possibly limiting the decoder’s throughput. In addition,
multiple prefixes can only be decoded by the first decoder. These prefixes also incur a delay when
decoded. If multiple prefixes or a prefix that changes the size of an immediate or displacement cannot be
avoided, schedule them behind instructions that stall the pipe for some other reason.

3.5 OPTIMIZING THE EXECUTION CORE

The superscalar, out-of-order execution core(s) in recent generations of microarchitectures contain multiple execution hardware resources that can execute multiple micro-ops in parallel. These resources generally ensure that micro-ops execute efficiently and proceed with fixed latencies. General guidelines for making use of the available parallelism are:
• Follow the rules (see Section 3.4) to maximize useful decode bandwidth and front end throughput. These rules include favoring single micro-op instructions and taking advantage of micro-fusion, the stack pointer tracker, and macro-fusion.
• Maximize rename bandwidth. Guidelines are discussed in this section and include properly dealing with partial registers, ROB read ports, and instructions which cause side effects on flags.
• Schedule sequences of instructions so that multiple dependency chains are alive in the reservation station (RS) simultaneously, thus ensuring that your code utilizes maximum parallelism.
• Avoid hazards and minimize delays that may occur in the execution core, allowing the dispatched micro-ops to make progress and be ready for retirement quickly.

3.5.1 Instruction Selection

Some execution units are not pipelined; this means that micro-ops cannot be dispatched in consecutive cycles and the throughput is less than one per cycle.
It is generally a good starting point to select instructions by considering the number of micro-ops associated with each instruction, favoring, in order: single micro-op instructions, simple instructions with fewer than four micro-ops, and, last, instructions requiring microsequencer ROM (micro-ops executed out of the microsequencer involve extra overhead).


Assembly/Compiler Coding Rule 28. (M impact, H generality) Favor single-micro-operation instructions. Also favor instructions with shorter latencies.
A compiler may already be doing a good job on instruction selection; if so, user intervention usually is not necessary.
Assembly/Compiler Coding Rule 29. (M impact, L generality) Avoid prefixes, especially multiple
non-0F-prefixed opcodes.
Assembly/Compiler Coding Rule 30. (M impact, L generality) Do not use many segment
registers.
Assembly/Compiler Coding Rule 31. (M impact, M generality) Avoid using complex instructions
(for example, enter, leave, or loop) that have more than four µops and require multiple cycles to
decode. Use sequences of simple instructions instead.
Assembly/Compiler Coding Rule 32. (MH impact, M generality) Use push/pop to manage stack space and address adjustments between function calls/returns instead of enter/leave. Using the ENTER instruction with non-zero immediates can cause significant delays in the pipeline, in addition to misprediction.
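As a minimal sketch of this rule (the 16-byte local area is an arbitrary amount chosen for illustration), a prologue/epilogue written with PUSH/POP and explicit stack adjustment instead of ENTER/LEAVE:
push ebp
mov ebp, esp
sub esp, 16 ; reserve local stack space; replaces “enter 16, 0”
; ... function body ...
mov esp, ebp ; release local space; replaces “leave”
pop ebp
ret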
Theoretically, arranging instruction sequences to match the 4-1-1-1 template applies to processors based on Intel Core microarchitecture. However, with macro-fusion and micro-fusion capabilities in the front end, attempts to schedule instruction sequences using the 4-1-1-1 template will likely provide diminishing returns.
Instead, software should follow these additional decoder guidelines:

• If you need to use multiple micro-op, non-microsequenced instructions, try to separate them with a few single micro-op instructions. The following instructions are examples of multiple micro-op instructions not requiring the micro-sequencer:
ADC/SBB
CMOVcc
Read-modify-write instructions

• If a series of multiple micro-op instructions cannot be separated, try breaking the series into a different equivalent instruction sequence. For example, a series of read-modify-write instructions may go faster if sequenced as a series of read-modify + store instructions. This strategy could improve performance even if the new code sequence is larger than the original one.

3.5.1.1

Use of the INC and DEC Instructions

The INC and DEC instructions modify only a subset of the bits in the flag register. This creates a dependence on all previous writes of the flag register. This is especially problematic when these instructions are
on the critical path because they are used to change an address for a load on which many other instructions depend.
Assembly/Compiler Coding Rule 33. (M impact, H generality) INC and DEC instructions should
be replaced with ADD or SUB instructions, because ADD and SUB overwrite all flags, whereas INC and
DEC do not, therefore creating false dependencies on earlier instructions that set the flags.
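For example, assuming ECX is a loop counter (an illustrative use, not a required one), the substitution is mechanical:
; instead of: dec ecx (partial flag write; depends on earlier flag writers)
sub ecx, 1 ; writes all flags and breaks the false dependence
jnz loop_top ; ZF is set by SUB exactly as DEC would set it; the label is illustrative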

3.5.1.2

Integer Divide

Typically, an integer divide is preceded by a CWD or CDQ instruction. Depending on the operand size,
divide instructions use DX:AX or EDX:EAX for the dividend. The CWD or CDQ instructions sign-extend AX
or EAX into DX or EDX, respectively. These instructions have denser encoding than a shift and move
would be, but they generate the same number of micro-ops. If AX or EAX is known to be positive, replace
these instructions with:
xor dx, dx
or
xor edx, edx
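A complete divide sequence under this assumption might look as follows (the variable names are illustrative):
mov eax, dividend ; value known to be non-negative
xor edx, edx ; replaces cdq: zeroing EDX matches sign extension of a non-negative value
mov ecx, divisor
idiv ecx ; quotient in EAX, remainder in EDX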


Modern compilers can typically transform a high-level language expression involving integer division by a divisor that is a known integer constant at compile time into a faster sequence using the IMUL instruction instead. Thus programmers should minimize integer division expressions whose divisor values cannot be known at compile time.
Alternately, if certain known divisor values are favored over other unknown ranges, software may consider isolating the few favored, known divisor values into constant-divisor expressions.
Section 9.2.4 describes in more detail how to use MUL/IMUL to replace integer divisions.
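As a sketch of the technique, the well-known reciprocal multiply for an unsigned divide by the constant 10 uses the magic constant 0CCCCCCCDh (the ceiling of 2^35/10); the variable name n is illustrative:
mov eax, n ; unsigned 32-bit dividend
mov edx, 0CCCCCCCDh
mul edx ; edx:eax = n * 0CCCCCCCDh
shr edx, 3 ; edx = n / 10, exact for all 32-bit unsigned values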

3.5.1.3 Using LEA

In Intel microarchitecture code name Sandy Bridge, there are two significant changes to the performance characteristics of the LEA instruction:
• LEA can be dispatched via port 1 and port 5 in most cases, doubling the throughput over prior generations. However, this applies only to LEA instructions with one or two source operands.

Example 3-18. Independent Two-Operand LEA Example
mov edx, N
mov eax, X
mov ecx, Y
loop:
lea ecx, [ecx + ecx*2]
lea eax, [eax + eax*4]
and ecx, 0xff
and eax, 0xff
dec edx
jg loop

• For LEA instructions with three source operands and in some specific situations, the instruction latency has increased to 3 cycles, and the instruction must dispatch via port 1:
— LEA that has all three source operands: base, index, and offset.
— LEA that uses base and index registers where the base is EBP, RBP, or R13.
— LEA that uses RIP-relative addressing mode.
— LEA that uses 16-bit addressing mode.


Example 3-19. Alternative to Three-Operand LEA
Three-operand LEA is slower:
#define K 1
uint32 an = 0;
uint32 N = mi_N;
mov ecx, N
xor esi, esi;
xor edx, edx;
cmp ecx, 2;
jb finished;
dec ecx;
loop1:
mov edi, esi;
lea esi, [K+esi+edx];
and esi, 0xFF;
mov edx, edi;
dec ecx;
jnz loop1;
finished:
mov [an], esi;

Two-operand LEA alternative:
#define K 1
uint32 an = 0;
uint32 N = mi_N;
mov ecx, N
xor esi, esi;
xor edx, edx;
cmp ecx, 2;
jb finished;
dec ecx;
loop1:
mov edi, esi;
lea esi, [K+edx];
lea esi, [esi+edi];
and esi, 0xFF;
mov edx, edi;
dec ecx;
jnz loop1;
finished:
mov [an], esi;

Alternative 2:
#define K 1
uint32 an = 0;
uint32 N = mi_N;
mov ecx, N
xor esi, esi;
mov edx, K;
cmp ecx, 2;
jb finished;
mov eax, 2
dec ecx;
loop1:
mov edi, esi;
lea esi, [esi+edx];
and esi, 0xFF;
lea edx, [edi+K];
dec ecx;
jnz loop1;
finished:
mov [an], esi;

In some cases with processors based on Intel NetBurst microarchitecture, the LEA instruction or a sequence of LEA, ADD, SUB, and SHIFT instructions can replace constant multiply instructions. The LEA instruction can also be used as a multiple operand addition instruction, for example:
LEA ECX, [EAX + EBX + 4 + A]
Using LEA in this way may avoid register usage by not tying up registers for operands of arithmetic
instructions. This use may also save code space.
If the LEA instruction uses a shift by a constant amount then the latency of the sequence of µops is
shorter if adds are used instead of a shift, and the LEA instruction may be replaced with an appropriate
sequence of µops. This, however, increases the total number of µops, leading to a trade-off.
Assembly/Compiler Coding Rule 34. (ML impact, L generality) If an LEA instruction using the
scaled index is on the critical path, a sequence with ADDs may be better. If code density and bandwidth
out of the trace cache are the critical factor, then use the LEA instruction.

3.5.1.4 ADC and SBB in Intel® Microarchitecture Code Name Sandy Bridge

The throughput of ADC and SBB in Intel microarchitecture code name Sandy Bridge is 1 cycle, compared to 1.5-2 cycles in prior generations. These two instructions are useful in numeric handling of integer data types that are wider than the maximum width of native hardware.


Example 3-20. Examples of 512-bit Additions
//Add 64-bit to 512 Number
lea rsi, gLongCounter
lea rdi, gStepValue
mov rax, [rdi]
xor rcx, rcx
loop_start:
mov r10, [rsi+rcx]
add r10, rax
mov [rsi+rcx], r10
mov r10, [rsi+rcx+8]
adc r10, 0
mov [rsi+rcx+8], r10
mov r10, [rsi+rcx+16]
adc r10, 0
mov [rsi+rcx+16], r10
mov r10, [rsi+rcx+24]
adc r10, 0
mov [rsi+rcx+24], r10
mov r10, [rsi+rcx+32]
adc r10, 0
mov [rsi+rcx+32], r10
mov r10, [rsi+rcx+40]
adc r10, 0
mov [rsi+rcx+40], r10
mov r10, [rsi+rcx+48]
adc r10, 0
mov [rsi+rcx+48], r10
mov r10, [rsi+rcx+56]
adc r10, 0
mov [rsi+rcx+56], r10
add rcx, 64
cmp rcx, SIZE
jnz loop_start

// 512-bit Addition
loop1:
mov rax, [StepValue]
add rax, [LongCounter]
mov LongCounter, rax
mov rax, [StepValue+8]
adc rax, [LongCounter+8]
mov LongCounter+8, rax
mov rax, [StepValue+16]
adc rax, [LongCounter+16]
mov LongCounter+16, rax
mov rax, [StepValue+24]
adc rax, [LongCounter+24]
mov LongCounter+24, rax
mov rax, [StepValue+32]
adc rax, [LongCounter+32]
mov LongCounter+32, rax
mov rax, [StepValue+40]
adc rax, [LongCounter+40]
mov LongCounter+40, rax
mov rax, [StepValue+48]
adc rax, [LongCounter+48]
mov LongCounter+48, rax
mov rax, [StepValue+56]
adc rax, [LongCounter+56]
mov LongCounter+56, rax
dec rcx
jnz loop1

3.5.1.5 Bitwise Rotation

A bitwise rotation can specify its count in the CL register, as an immediate constant, or implicitly as a rotate by 1 bit. Generally, the rotate-by-immediate and rotate-by-register forms are slower than the rotate-by-1 form. A rotate by 1 has the same latency as a shift.


Assembly/Compiler Coding Rule 35. (ML impact, L generality) Avoid ROTATE by register or
ROTATE by immediate instructions. If possible, replace with a ROTATE by 1 instruction.
In Intel microarchitecture code name Sandy Bridge, ROL/ROR by immediate has 1-cycle throughput, and SHLD/SHRD using the same register as source and destination by an immediate constant has 1-cycle latency with 0.5-cycle throughput. The “ROL/ROR reg, imm8” instruction has two micro-ops, with a latency of 1 cycle for the rotated register result and 2 cycles for the flags, if used.
In Intel microarchitecture code name Ivy Bridge, the “ROL/ROR reg, imm8” instruction with immediate greater than 1 is one micro-op with one-cycle latency when the overflow flag result is used. When the immediate is one, a dependency on the overflow flag result of ROL/ROR by a subsequent instruction will see the ROL/ROR instruction with two-cycle latency.

3.5.1.6 Variable Bit Count Rotation and Shift

In Intel microarchitecture code name Sandy Bridge, the “ROL/ROR/SHL/SHR reg, cl” instruction has three micro-ops. When the flag result is not needed, one of these micro-ops may be discarded, providing better performance in many common usages. When these instructions update partial flag results that are subsequently used, the full three-micro-op flow must go through the execution and retirement pipeline, experiencing slower performance. In Intel microarchitecture code name Ivy Bridge, executing the full three-micro-op flow to use the updated partial flag result has additional delay. Consider the looped sequence below:
loop:
shl eax, cl
add ebx, eax
dec edx ; DEC does not update carry, causing SHL to execute slower three micro-ops flow
jnz loop
The DEC instruction does not modify the carry flag. Consequently, the SHL EAX, CL instruction needs to
execute the three micro-ops flow in subsequent iterations. The SUB instruction will update all flags. So
replacing DEC with SUB will allow SHL EAX, CL to execute the two micro-ops flow.
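The repaired loop, with SUB substituted for DEC, then becomes:
loop:
shl eax, cl
add ebx, eax
sub edx, 1 ; SUB updates all flags, so SHL can use the two micro-ops flow
jnz loop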

3.5.1.7 Address Calculations

For computing addresses, use the addressing modes rather than general-purpose computations. Internally, memory reference instructions can have four operands:
• Relocatable load-time constant.
• Immediate constant.
• Base register.
• Scaled index register.

Note that the latency and throughput of LEA with more than two operands are slower (see Section 3.5.1.3) in Intel microarchitecture code name Sandy Bridge. Addressing modes that use both base and index registers consume more read port resources in the execution engine and may experience more stalls due to the availability of read port resources. Software should take care to select the faster form of address calculation.
In the segmented model, a segment register may constitute an additional operand in the linear address
calculation. In many cases, several integer instructions can be eliminated by fully using the operands of
memory references.
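As a minimal sketch (the register roles are illustrative): instead of computing an element address with separate integer arithmetic, fold the base, scaled index, and displacement into a single memory operand:
; instead of:
; shl ebx, 2 ; index * 4
; add ebx, eax ; + base
; mov edx, [ebx + 12] ; + displacement
mov edx, [eax + ebx*4 + 12] ; one memory operand, no extra integer µops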


3.5.1.8 Clearing Registers and Dependency Breaking Idioms

Code sequences that modify partial registers can experience delays in their dependency chains; these delays can be avoided by using dependency breaking idioms.
In processors based on Intel Core microarchitecture, a number of instructions can help clear execution
dependency when software uses these instruction to clear register content to zero. The instructions
include:
XOR REG, REG
SUB REG, REG
XORPS/PD XMMREG, XMMREG
PXOR XMMREG, XMMREG
SUBPS/PD XMMREG, XMMREG
PSUBB/W/D/Q XMMREG, XMMREG
In processors based on Intel microarchitecture code name Sandy Bridge, the instructions listed above plus their equivalent AVX counterparts are also zero idioms that can be used to break dependency chains. Furthermore, they do not consume an issue port or an execution unit, so using zero idioms is preferable to moving 0 into the register. The AVX equivalent zero idioms are:
VXORPS/PD XMMREG, XMMREG
VXORPS/PD YMMREG, YMMREG
VPXOR XMMREG, XMMREG
VSUBPS/PD XMMREG, XMMREG
VSUBPS/PD YMMREG, YMMREG
VPSUBB/W/D/Q XMMREG, XMMREG
In Intel Core Solo and Intel Core Duo processors, the XOR, SUB, XORPS, or PXOR instructions can be
used to clear execution dependencies on the zero evaluation of the destination register.
The Pentium 4 processor provides special support for XOR, SUB, and PXOR operations when executed
within the same register. This recognizes that clearing a register does not depend on the old value of the
register. The XORPS and XORPD instructions do not have this special support. They cannot be used to
break dependence chains.
Assembly/Compiler Coding Rule 36. (M impact, ML generality) Use dependency-breaking-idiom
instructions to set a register to 0, or to break a false dependence chain resulting from re-use of
registers. In contexts where the condition codes must be preserved, move 0 into the register instead.
This requires more code space than using XOR and SUB, but avoids setting the condition codes.
Example 3-21 shows how to use PXOR to break the dependency on an XMM register when performing negation on the elements of an array:
int a[4096], b[4096], c[4096];
for (int i = 0; i < 4096; i++)
c[i] = -(a[i] + b[i]);


Example 3-21. Clearing Register to Break Dependency While Negating Array Elements
Negation (-x = (x XOR (-1)) - (-1)) without breaking the dependency:
lea eax, a
lea ecx, b
lea edi, c
xor edx, edx
movdqa xmm7, allone
lp:
movdqa xmm0, [eax + edx]
paddd xmm0, [ecx + edx]
pxor xmm0, xmm7
psubd xmm0, xmm7
movdqa [edi + edx], xmm0
add edx, 16
cmp edx, 4096
jl lp

Negation (-x = 0 - x) using PXOR reg, reg to break the dependency:
lea eax, a
lea ecx, b
lea edi, c
xor edx, edx
lp:
movdqa xmm0, [eax + edx]
paddd xmm0, [ecx + edx]
pxor xmm7, xmm7
psubd xmm7, xmm0
movdqa [edi + edx], xmm7
add edx, 16
cmp edx, 4096
jl lp

Assembly/Compiler Coding Rule 37. (M impact, MH generality) Break dependences on portions
of registers between instructions by operating on 32-bit registers instead of partial registers. For
moves, this can be accomplished with 32-bit moves or by using MOVZX.
Sometimes sign-extended semantics can be maintained by zero-extending operands. For example, the C
code in the following statements does not need sign extension, nor does it need prefixes for operand size
overrides:
static short INT a, b;
IF (a == b) {
...
}
Code for comparing these 16-bit operands might be:
MOVZX EAX, [a]
MOVZX EBX, [b]
CMP EAX, EBX
These circumstances tend to be common. However, the technique will not work if the compare is for
greater than, less than, greater than or equal, and so on, or if the values in eax or ebx are to be used in
another operation where sign extension is required.
Assembly/Compiler Coding Rule 38. (M impact, M generality) Try to use zero extension or
operate on 32-bit operands instead of using moves with sign extension.
The trace cache can be packed more tightly when instructions with operands that can only be represented as 32 bits are not adjacent.
Assembly/Compiler Coding Rule 39. (ML impact, L generality) Avoid placing instructions that
use 32-bit immediates which cannot be encoded as sign-extended 16-bit immediates near each other.
Try to schedule µops that have no immediate immediately before or after µops with 32-bit immediates.

3.5.1.9 Compares

Use TEST when comparing a value in a register with zero. TEST essentially ANDs operands together
without writing to a destination register. TEST is preferred over AND because AND produces an extra
result register. TEST is better than CMP ..., 0 because the instruction size is smaller.


Use TEST when comparing the result of a logical AND with an immediate constant for equality or
inequality if the register is EAX for cases such as:
IF (AVAR & 8) { }
The TEST instruction can also be used to detect rollover of modulo of a power of 2. For example, the C code:
IF ( (AVAR % 16) == 0 ) { }
can be implemented using:
TEST EAX, 0x0F
JNZ AfterIf

Using the TEST instruction between the instruction that may modify part of the flag register and the
instruction that uses the flag register can also help prevent partial flag register stall.
Assembly/Compiler Coding Rule 40. (ML impact, M generality) Use the TEST instruction instead of AND when the result of the logical AND is not used. This saves µops in execution. Use a TEST of a register with itself instead of a CMP of the register to zero; this saves the need to encode the zero and saves encoding space. Avoid comparing a constant to a memory operand. It is preferable to load the memory operand and compare the constant to a register.
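For instance (the branch target name is illustrative):
; instead of: cmp eax, 0 / je target (encodes an immediate zero)
test eax, eax ; same ZF/SF outcome for the zero test, smaller encoding
je target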
Often a produced value must be compared with zero, and then used in a branch. Because most Intel
architecture instructions set the condition codes as part of their execution, the compare instruction may
be eliminated. Thus the operation can be tested directly by a JCC instruction. The notable exceptions are
MOV and LEA. In these cases, use TEST.
Assembly/Compiler Coding Rule 41. (ML impact, M generality) Eliminate unnecessary compare
with zero instructions by using the appropriate conditional jump instruction when the flags are already
set by a preceding arithmetic instruction. If necessary, use a TEST instruction instead of a compare. Be
certain that any code transformations made do not introduce problems with overflow.
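For instance, when a preceding arithmetic instruction already produced the value being tested (the label name is illustrative):
sub ecx, ebx ; SUB sets ZF as part of its execution
jz equal_path ; no separate CMP or TEST is needed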

3.5.1.10 Using NOPs

Code generators generate a no-operation (NOP) to align instructions. Examples of NOPs of different
lengths in 32-bit mode are shown below:
1-byte: XCHG EAX, EAX
2-byte: 66 NOP
3-byte: LEA REG, 0 (REG) (8-bit displacement)
4-byte: NOP DWORD PTR [EAX + 0] (8-bit displacement)
5-byte: NOP DWORD PTR [EAX + EAX*1 + 0] (8-bit displacement)
6-byte: LEA REG, 0 (REG) (32-bit displacement)
7-byte: NOP DWORD PTR [EAX + 0] (32-bit displacement)
8-byte: NOP DWORD PTR [EAX + EAX*1 + 0] (32-bit displacement)
9-byte: NOP WORD PTR [EAX + EAX*1 + 0] (32-bit displacement)
These are all true NOPs, having no effect on the state of the machine except to advance the EIP. Because
NOPs require hardware resources to decode and execute, use the fewest number to achieve the desired
padding.
The one byte NOP:[XCHG EAX,EAX] has special hardware support. Although it still consumes a µop and
its accompanying resources, the dependence upon the old value of EAX is removed. This µop can be
executed at the earliest possible opportunity, reducing the number of outstanding instructions and is the
lowest cost NOP.
The other NOPs have no special hardware support. Their input and output registers are interpreted by the
hardware. Therefore, a code generator should arrange to use the register containing the oldest value as
input, so that the NOP will dispatch and release RS resources at the earliest possible opportunity.


Try to observe the following NOP generation priority:
• Select the smallest number of NOPs and pseudo-NOPs to provide the desired padding.
• Select NOPs that are least likely to execute on slower execution unit clusters.
• Select the register arguments of NOPs to reduce dependencies.

3.5.1.11 Mixing SIMD Data Types

Previous microarchitectures (before Intel Core microarchitecture) do not have explicit restrictions on mixing integer and floating-point (FP) operations on XMM registers. For Intel Core microarchitecture, mixing integer and floating-point operations on the content of an XMM register can degrade performance. Software should avoid mixed use of integer/FP operations on XMM registers. Specifically:
• Use SIMD integer operations to feed SIMD integer operations. Use PXOR for the zero idiom.
• Use SIMD floating-point operations to feed SIMD floating-point operations. Use XORPS for the zero idiom.
• When floating-point operations are bitwise equivalent, use the PS data type instead of the PD data type. MOVAPS and MOVAPD do the same thing, but MOVAPS takes one less byte to encode the instruction.

3.5.1.12 Spill Scheduling

The spill scheduling algorithm used by a code generator will be impacted by the memory subsystem. A
spill scheduling algorithm is an algorithm that selects what values to spill to memory when there are too
many live values to fit in registers. Consider the code in Example 3-22, where it is necessary to spill
either A, B, or C.
Example 3-22. Spill Scheduling Code
LOOP
C := ...
B := ...
A := A + ...
For modern microarchitectures, using dependence depth information in spill scheduling is even more
important than in previous processors. The loop-carried dependence in A makes it especially important
that A not be spilled. Not only would a store/load be placed in the dependence chain, but there would also
be a data-not-ready stall of the load, costing further cycles.
Assembly/Compiler Coding Rule 42. (H impact, MH generality) For small loops, placing loop
invariants in memory is better than spilling loop-carried dependencies.
A possibly counter-intuitive result is that in such a situation it is better to put loop invariants in memory
than in registers, since loop invariants never have a load blocked by store data that is not ready.

3.5.1.13 Zero-Latency MOV Instructions

In processors based on Intel microarchitecture code name Ivy Bridge, a subset of register-to-register
move operations are executed in the front end (similar to zero-idioms, see Section 3.5.1.8). This
conserves scheduling/execution resources in the out-of-order engine. Most forms of register-to-register


MOV instructions can benefit from zero-latency MOV. Example 3-23 lists the forms that qualify and a small set that do not.

Example 3-23. Zero-Latency MOV Instructions
MOV instructions whose latency can be eliminated:
MOV reg32, reg32
MOV reg64, reg64
MOVUPD/MOVAPD xmm, xmm
MOVUPD/MOVAPD ymm, ymm
MOVUPS/MOVAPS xmm, xmm
MOVUPS/MOVAPS ymm, ymm
MOVDQA/MOVDQU xmm, xmm
MOVDQA/MOVDQU ymm, ymm
MOVZX reg32, reg8 (if not AH/BH/CH/DH)
MOVZX reg64, reg8 (if not AH/BH/CH/DH)

MOV instructions whose latency cannot be eliminated:
MOV reg8, reg8
MOV reg16, reg16
MOVZX reg32, reg8 (if AH/BH/CH/DH)
MOVZX reg64, reg8 (if AH/BH/CH/DH)
MOVSX

Example 3-24 shows how to process 8-bit integers using MOVZX to take advantage of the zero-latency MOV enhancement. Consider:
X = (X * 3^N) MOD 256;
Y = (Y * 3^N) MOD 256;
When “MOD 256” is implemented using the “AND 0xff” technique, its latency is exposed in the result-dependency chain. Using a form of MOVZX on a truncated byte input, the code can take advantage of the zero-latency MOV enhancement and gain about 45% in speed.

Example 3-24. Byte-Granular Data Computation Technique
Use AND Reg32, 0xff:
mov rsi, N
mov rax, X
mov rcx, Y
loop:
lea rcx, [rcx+rcx*2]
lea rax, [rax+rax*4]
and rcx, 0xff
and rax, 0xff
lea rcx, [rcx+rcx*2]
lea rax, [rax+rax*4]
and rcx, 0xff
and rax, 0xff
sub rsi, 2
jg loop

Use MOVZX:
mov rsi, N
mov rax, X
mov rcx, Y
loop:
lea rbx, [rcx+rcx*2]
movzx rcx, bl
lea rbx, [rcx+rcx*2]
movzx rcx, bl
lea rdx, [rax+rax*4]
movzx rax, dl
lea rdx, [rax+rax*4]
movzx rax, dl
sub rsi, 2
jg loop

The effectiveness of coding a dense sequence of instructions to rely on a zero-latency MOV instruction
must also consider internal resource constraints in the microarchitecture.


Example 3-25. Re-ordering Sequence to Improve Effectiveness of Zero-Latency MOV Instructions
Needing more internal resources for zero-latency MOVs:
mov rsi, N
mov rax, X
mov rcx, Y
loop:
lea rbx, [rcx+rcx*2]
movzx rcx, bl
lea rdx, [rax+rax*4]
movzx rax, dl
lea rbx, [rcx+rcx*2]
movzx rcx, bl
lea rdx, [rax+rax*4]
movzx rax, dl
sub rsi, 2
jg loop

Needing less internal resources for zero-latency MOVs:
mov rsi, N
mov rax, X
mov rcx, Y
loop:
lea rbx, [rcx+rcx*2]
movzx rcx, bl
lea rbx, [rcx+rcx*2]
movzx rcx, bl
lea rdx, [rax+rax*4]
movzx rax, dl
lea rdx, [rax+rax*4]
movzx rax, dl
sub rsi, 2
jg loop

In Example 3-25, RBX/RCX and RDX/RAX are pairs of registers that are shared and continuously overwritten. In the second sequence, registers are overwritten with new results immediately, consuming fewer internal resources provided by the underlying microarchitecture. As a result, it is about 8% faster than the first sequence, where internal resources could only support 50% of the attempts to take advantage of zero-latency MOV instructions.

3.5.2 Avoiding Stalls in Execution Core

Although the design of the execution core is optimized to make common cases execute quickly, a micro-op may encounter various hazards, delays, or stalls while making forward progress from the front end to the ROB and RS. The significant cases are:
• ROB read port stalls.
• Partial register reference stalls.
• Partial updates to XMM register stalls.
• Partial flag register reference stalls.

3.5.2.1 ROB Read Port Stalls

As a micro-op is renamed, it determines whether its source operands have executed and been written to
the reorder buffer (ROB), or whether they will be captured “in flight” in the RS or in the bypass network.
Typically, the great majority of source operands are found to be “in flight” during renaming. Those that
have been written back to the ROB are read through a set of read ports.
Since the Intel Core microarchitecture is optimized for the common case where the operands are “in
flight”, it does not provide a full set of read ports to enable all renamed micro-ops to read all sources from
the ROB in the same cycle.
When not all sources can be read, a micro-op can stall in the rename stage until it can get access to
enough ROB read ports to complete renaming the micro-op. This stall is usually short-lived. Typically, a
micro-op will complete renaming in the next cycle, but it appears to the application as a loss of rename
bandwidth.


Some of the software-visible situations that can cause ROB read port stalls include:
• Registers that have become cold and require a ROB read port because execution units are doing other independent calculations.
• Constants inside registers.
• Pointer and index registers.

In rare cases, ROB read port stalls may lead to more significant performance degradations. There are a couple of heuristics that can help prevent over-subscribing the ROB read ports:
• Keep common register usage clustered together. Multiple references to the same written-back register can be “folded” inside the out-of-order execution core.
• Keep short dependency chains intact. This practice ensures that the registers will not have been written back when the new micro-ops are written to the RS.

These two scheduling heuristics may conflict with other more common scheduling heuristics. To reduce demand on the ROB read ports, use these two heuristics only if both of the following conditions are met:
• The operations involved are short-latency operations.
• Actual ROB read port stalls are indicated by measurements of the performance event (the relevant event is RAT_STALLS.ROB_READ_PORT, see Chapter 19 of the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B).

If the code has a long dependency chain, these two heuristics should not be used because they can cause
the RS to fill, causing damage that outweighs the positive effects of reducing demands on the ROB read
port.
Starting with Intel microarchitecture code name Sandy Bridge, the ROB read port stall no longer applies because data is read from the physical register file.

3.5.2.2 Writeback Bus Conflicts

The writeback bus inside the execution engine is a common resource needed to facilitate out-of-order execution of micro-ops in flight. When the writeback bus is needed at the same time by two micro-ops executing in the same stack of execution units (see Table 2-15), the younger micro-op has to wait for the writeback bus to become available. In this situation, a short-latency instruction is more likely to experience a delay when it might otherwise have been ready for dispatch into the execution engine.
Consider a repeating sequence of independent floating-point ADDs with a single-cycle MOV bound to the same dispatch port. When the MOV finds the dispatch port available, the writeback bus can be occupied by the ADD. This delays the MOV operation.
If this problem is detected, you can sometimes change the instruction selection to use a different dispatch port and reduce the writeback contention.

3.5.2.3 Bypass between Execution Domains

Floating-point (FP) loads have an extra cycle of latency. Moves between FP and SIMD stacks have another additional cycle of latency.
Example:
ADDPS XMM0, XMM1
PAND XMM0, XMM3
ADDPS XMM2, XMM0
The overall latency for the above calculation is 9 cycles:
• 3 cycles for each ADDPS instruction.
• 1 cycle for the PAND instruction.
• 1 cycle to bypass from the ADDPS floating-point domain to the PAND integer domain.
• 1 cycle to move the data from the PAND integer domain to the second ADDPS floating-point domain.


To avoid this penalty, you should organize code to minimize domain changes. Sometimes you cannot avoid bypasses.
Account for bypass cycles when counting the overall latency of your code. If your calculation is latency-bound, you can execute more instructions in parallel or break dependency chains to reduce total latency.
Code that has many bypass domains and is completely latency-bound may run slower on the Intel Core microarchitecture than it did on previous microarchitectures.
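As a sketch of minimizing domain changes in the example above: if the intermediate bitwise operation may be expressed with a floating-point-domain instruction (assuming ANDPS is acceptable in place of PAND for the data at hand), both bypass cycles disappear:
ADDPS XMM0, XMM1
ANDPS XMM0, XMM3 ; stays in the floating-point domain
ADDPS XMM2, XMM0 ; overall latency drops from 9 to 7 cycles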

3.5.2.4 Partial Register Stalls

General purpose registers can be accessed in granularities of bytes, words, doublewords; 64-bit mode
also supports quadword granularity. Referencing a portion of a register is referred to as a partial register
reference.
A partial register stall happens when an instruction refers to a register, portions of which were previously modified by other instructions. For example, partial register stalls occur with a read of AX while previous instructions stored AL and AH, or a read of EAX while previous instructions modified AX.
The delay of a partial register stall is small in processors based on Intel Core and NetBurst microarchitectures, and in Pentium M processor (with CPUID signature family 6, model 13), Intel Core Solo, and Intel
Core Duo processors. Pentium M processors (CPUID signature with family 6, model 9) and the P6 family
incur a large penalty.
Note that in Intel 64 architecture, an update to the lower 32 bits of a 64-bit integer register is architecturally defined to zero-extend the upper 32 bits. While this action may be logically viewed as a 32-bit update, it is really a 64-bit update (and therefore does not cause a partial stall).
Referencing partial registers frequently produces code sequences with either false or real dependencies.
Example 3-18 demonstrates a series of false and real dependencies caused by referencing partial registers.
If instructions 4 and 6 (in Example 3-18) are changed to use a movzx instruction instead of a mov, then
the dependences of instruction 4 on 2 (and transitively 1 before it), and instruction 6 on 5 are broken.
This creates two independent chains of computation instead of one serial one.
Example 3-26 illustrates the use of MOVZX to avoid a partial register stall when packing three byte
values into a register.
Example 3-26. Avoiding Partial Register Stalls in Integer Code
A sequence causing a partial register stall:
mov al, byte ptr a[2]
shl eax, 16
mov ax, word ptr a
movd mm0, eax
ret

Alternate sequence using MOVZX to avoid the delay:
movzx eax, byte ptr a[2]
shl eax, 16
movzx ecx, word ptr a
or eax, ecx
movd mm0, eax
ret

In Intel microarchitecture code name Sandy Bridge, partial register access is handled in hardware by inserting a micro-op that merges the partial register with the full register in the following cases:
• After a write to one of the registers AH, BH, CH, or DH and before a following read of the 2-, 4-, or 8-byte form of the same register. In these cases a merge micro-op is inserted. The insertion consumes a full allocation cycle in which other micro-ops cannot be allocated.
• After a micro-op with a destination register of 1 or 2 bytes, which is not a source of the instruction (or the register's bigger form), and before a following read of a 2-, 4-, or 8-byte form of the same register. In these cases the merge micro-op is part of the flow. For example:
— MOV AX, [BX]: when you want to load from memory to a partial register, consider using MOVZX or MOVSX to avoid the additional merge micro-op penalty.
— LEA AX, [BX+CX]

For optimal performance, use zero idioms before the use of the register; this eliminates the need for partial register merge micro-ops.
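A minimal sketch of this recommendation (the memory operand is illustrative):
xor eax, eax ; zero idiom: breaks the dependence on the prior EAX value
mov al, [esi] ; byte write; a later read of EAX needs no merge micro-op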

3.5.2.5 Partial XMM Register Stalls

Partial register stalls can also apply to XMM registers. The following SSE and SSE2 instructions update only part of the destination register:
MOVL/HPD XMM, MEM64
MOVL/HPS XMM, MEM32
MOVSS/SD between registers
Using these instructions creates a dependency chain between the unmodified part of the register and the modified part of the register. This dependency chain can cause performance loss.
Example 3-27 illustrates the use of MOVSD and MOVAPD to avoid this kind of stall when operating on double-precision data.
Follow these recommendations to avoid stalls from partial updates to XMM registers:
• Avoid using instructions which update only part of the XMM register.
• If a 64-bit load is needed, use the MOVSD or MOVQ instruction.
• If two 64-bit loads are required to the same register from non-continuous locations, use MOVSD/MOVHPD instead of MOVLPD/MOVHPD.
• When copying the XMM register, use the following instructions for full register copy, even if you only want to copy some of the source register data:
MOVAPS
MOVAPD
MOVDQA

Example 3-27. Avoiding Partial Register Stalls in SIMD Code
Using MOVLPD for memory transactions and MOVSD between register copies, causing a partial register stall:
mov edx, x
mov ecx, count
movlpd xmm3, _1_
movlpd xmm2, _1pt5_
align 16
lp:
movlpd xmm0, [edx]
addsd xmm0, xmm3
movsd xmm1, xmm2
subsd xmm1, [edx]
mulsd xmm0, xmm1
movsd [edx], xmm0
add edx, 8
dec ecx
jnz lp

Using MOVSD for memory and MOVAPD between register copies, avoiding the delay:
mov edx, x
mov ecx, count
movsd xmm3, _1_
movsd xmm2, _1pt5_
align 16
lp:
movsd xmm0, [edx]
addsd xmm0, xmm3
movapd xmm1, xmm2
subsd xmm1, [edx]
mulsd xmm0, xmm1
movsd [edx], xmm0
add edx, 8
dec ecx
jnz lp

3.5.2.6 Partial Flag Register Stalls

A “partial flag register stall” occurs when an instruction modifies a part of the flag register and the
following instruction is dependent on the outcome of the flags. This happens most often with shift
instructions (SAR, SAL, SHR, SHL). The flags are not modified in the case of a zero shift count, but the
shift count is usually known only at execution time. The front end stalls until the instruction is retired.

Other instructions that can modify some part of the flag register include CMPXCHG8B, various rotate
instructions, STC, and STD. An example of assembly with a partial flag register stall and alternative code
without the stall is shown in Example 3-28.
In processors based on Intel Core microarchitecture, shift immediate by 1 is handled by special hardware
such that it does not experience partial flag stall.
Example 3-28. Avoiding Partial Flag Register Stalls
Partial flag register stall:
xor eax, eax
mov ecx, a
sar ecx, 2
setz al ; SAR can update carry, causing a stall

Avoiding the partial flag register stall:
xor eax, eax
mov ecx, a
sar ecx, 2
test ecx, ecx ; TEST always updates all flags
setz al ; no partial register or flag stall

In Intel microarchitecture code name Sandy Bridge, the cost of partial flag access is replaced by the insertion of a micro-op instead of a stall. However, it is still recommended to reduce the use of instructions that write only some of the flags (such as INC, DEC, SETcc) before instructions that write the flags conditionally (such as a shift by CL).
Example 3-29 compares two techniques to implement the addition of very large integers (e.g. 1024 bits). The simplified code sequence in Example 3-29 will be faster than the sequence that saves the partial flag register on Intel microarchitecture code name Sandy Bridge, but it will experience partial flag stalls on prior microarchitectures.

Example 3-29. Partial Flag Register Accesses in Intel Microarchitecture Code Name Sandy Bridge
Saving the partial flag register to avoid a stall:
lea rsi, [A]
lea rdi, [B]
xor rax, rax
mov rcx, 16 ; 16*64 = 1024 bit
lp_64bit:
add rax, [rsi]
adc rax, [rdi]
mov [rdi], rax
setc al ; save carry for next iteration
movzx rax, al
add rsi, 8
add rdi, 8
dec rcx
jnz lp_64bit

Simplified code sequence:
lea rsi, [A]
lea rdi, [B]
xor rax, rax
mov rcx, 16
lp_64bit:
add rax, [rsi]
adc rax, [rdi]
mov [rdi], rax
lea rsi, [rsi+8]
lea rdi, [rdi+8]
dec rcx
jnz lp_64bit

3.5.2.7 Floating-Point/SIMD Operands

Moves that write a portion of a register can introduce unwanted dependences. The MOVSD REG, REG instruction writes only the bottom 64 bits of a register, not all 128 bits. This introduces a dependence on the preceding instruction that produces the upper 64 bits (even if those bits are no longer wanted). The dependence inhibits register renaming, and thereby reduces parallelism.
Use MOVAPD as an alternative; it writes all 128 bits. Even though this instruction has a longer latency, the µops for MOVAPD use a different execution port and this port is more likely to be free. The change can impact performance. There may be exceptional cases where the latency matters more than the dependence or the execution port.

3-35

GENERAL OPTIMIZATION GUIDELINES

Assembly/Compiler Coding Rule 43. (M impact, ML generality) Avoid introducing dependences
with partial floating-point register writes, e.g. from the MOVSD XMMREG1, XMMREG2 instruction. Use
the MOVAPD XMMREG1, XMMREG2 instruction instead.
The MOVSD XMMREG, MEM instruction writes all 128 bits and breaks a dependence.
The MOVUPD from memory instruction performs two 64-bit loads, but requires additional µops to adjust
the address and combine the loads into a single register. This same functionality can be obtained using
MOVSD XMMREG1, MEM; MOVSD XMMREG2, MEM+8; UNPCKLPD XMMREG1, XMMREG2, which uses
fewer µops and can be packed into the trace cache more effectively. The latter alternative has been found
to provide a several percent performance improvement in some cases. Its encoding requires more
instruction bytes, but this is seldom an issue for the Pentium 4 processor. The store version of MOVUPD
is complex and slow, so much so that the sequence with two MOVSD and a UNPCKHPD should always be
used.
Assembly/Compiler Coding Rule 44. (ML impact, L generality) Instead of using MOVUPD XMMREG1, MEM for an unaligned 128-bit load, use MOVSD XMMREG1, MEM; MOVSD XMMREG2, MEM+8; UNPCKLPD XMMREG1, XMMREG2. If the additional register is not available, then use MOVSD XMMREG1, MEM; MOVHPD XMMREG1, MEM+8.
Assembly/Compiler Coding Rule 45. (M impact, ML generality) Instead of using MOVUPD MEM, XMMREG1 for a store, use MOVSD MEM, XMMREG1; UNPCKHPD XMMREG1, XMMREG1; MOVSD MEM+8, XMMREG1.
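Written out as code, with “mem” standing in for an arbitrary, possibly unaligned address, the two rules are:
; Rule 44: unaligned 128-bit load without MOVUPD
movsd xmm1, [mem]
movsd xmm2, [mem+8]
unpcklpd xmm1, xmm2 ; xmm1 now holds all 128 bits
; Rule 45: unaligned 128-bit store without MOVUPD
movsd [mem], xmm1
unpckhpd xmm1, xmm1 ; copy the high quadword to the low half
movsd [mem+8], xmm1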

3.5.3 Vectorization

This section provides a brief summary of optimization issues related to vectorization. There is more detail
in the chapters that follow.
Vectorization is a program transformation that allows special hardware to perform the same operation on
multiple data elements at the same time. Successive processor generations have provided vector
support through the MMX technology, Streaming SIMD Extensions (SSE), Streaming SIMD Extensions 2
(SSE2), Streaming SIMD Extensions 3 (SSE3) and Supplemental Streaming SIMD Extensions 3 (SSSE3).
Vectorization is a special case of SIMD, a term defined in Flynn’s architecture taxonomy to denote a single instruction stream capable of operating on multiple data elements in parallel. The number of elements which can be operated on in parallel ranges from four single-precision floating-point data elements in Streaming SIMD Extensions and two double-precision floating-point data elements in Streaming SIMD Extensions 2 to sixteen byte operations in a 128-bit register in Streaming SIMD Extensions 2. Thus, vector length ranges from 2 to 16, depending on the instruction extensions used and on the data type.
The Intel C++ Compiler supports vectorization in three ways:
• The compiler may be able to generate SIMD code without intervention from the user.
• The user can insert pragmas to help the compiler realize that it can vectorize the code.
• The user can write SIMD code explicitly using intrinsics and C++ classes.

To help enable the compiler to generate SIMD code, avoid global pointers and global variables. These
issues may be less troublesome if all modules are compiled simultaneously, and whole-program optimization is used.
User/Source Coding Rule 2. (H impact, M generality) Use the smallest possible floating-point or
SIMD data type, to enable more parallelism with the use of a (longer) SIMD vector. For example, use
single precision instead of double precision where possible.
User/Source Coding Rule 3. (M impact, ML generality) Arrange the nesting of loops so that the
innermost nesting level is free of inter-iteration dependencies. Especially avoid the case where the
store of data in an earlier iteration happens lexically after the load of that data in a future iteration,
something which is called a lexically backward dependence.
The integer part of the SIMD instruction set extensions covers 8-bit, 16-bit, and 32-bit operands. Not all SIMD operations are supported for 32-bit operands, meaning that some source code will not be able to be vectorized at all unless smaller operands are used.


User/Source Coding Rule 4. (M impact, ML generality) Avoid the use of conditional branches
inside loops and consider using SSE instructions to eliminate branches.
User/Source Coding Rule 5. (M impact, ML generality) Keep induction (loop) variable expressions
simple.

3.5.4 Optimization of Partially Vectorizable Code

Frequently, a program contains a mixture of vectorizable code and some routines that are non-vectorizable. A common situation of partially vectorizable code involves a loop structure which includes a mixture of vectorized code and unvectorizable code. This situation is depicted in Figure 3-1.

Figure 3-1. Generic Program Flow of Partially Vectorized Code (a packed SIMD instruction feeds an unpacking stage, an unvectorizable serial routine processes the individual elements, and a packing stage feeds the next packed SIMD instruction)
It generally consists of five stages within the loop:
• Prolog.
• Unpacking vectorized data structure into individual elements.
• Calling a non-vectorizable routine to process each element serially.
• Packing individual results into vectorized data structure.
• Epilog.

This section discusses techniques that can reduce the cost of and bottlenecks associated with the packing/unpacking stages in such partially vectorizable code.
Example 3-30 shows a reference code template that is representative of partially vectorizable coding situations that also experience performance issues. The unvectorizable portion of code is represented generically by a sequence that calls a serial function named “foo” multiple times. This generic example is referred to as “shuffle with store forwarding”, because the problem generally involves an unpacking stage that shuffles data elements between register and memory, followed by a packing stage that can experience store forwarding issues.


There is more than one useful technique that can reduce the store-forwarding bottleneck between the serialized portion and the packing stage. The following sub-sections present alternate techniques for dealing with the packing, unpacking, and parameter passing to serialized function calls.
Example 3-30. Reference Code Template for Partially Vectorizable Program
// Prolog ///////////////////////////////
push ebp
mov ebp, esp
// Unpacking ////////////////////////////
sub ebp, 32
and ebp, 0xfffffff0
movaps [ebp], xmm0
// Serial operations on components ///////
sub ebp, 4
mov eax, [ebp+4]
mov [ebp], eax
call foo
mov [ebp+16+4], eax
mov eax, [ebp+8]
mov [ebp], eax
call foo
mov [ebp+16+4+4], eax
mov eax, [ebp+12]
mov [ebp], eax
call foo
mov [ebp+16+8+4], eax
mov eax, [ebp+12+4]
mov [ebp], eax
call foo
mov [ebp+16+12+4], eax
// Packing ///////////////////////////////
movaps xmm0, [ebp+16+4]
// Epilog ////////////////////////////////
pop ebp
ret

3.5.4.1 Alternate Packing Techniques

The packing method implemented in the reference code of Example 3-30 will experience delay as it assembles four doubleword results from memory into an XMM register, due to store-forwarding restrictions.


Three alternate techniques for packing, using different SIMD instructions to assemble contents in XMM registers, are shown in Example 3-31. All three techniques avoid store-forwarding delay by satisfying the restrictions on data sizes between a preceding store and subsequent load operations.

Example 3-31. Three Alternate Packing Methods for Avoiding Store Forwarding Difficulty
Packing Method 1:
movd xmm0, [ebp+16+4]
movd xmm1, [ebp+16+8]
movd xmm2, [ebp+16+12]
movd xmm3, [ebp+12+16+4]
punpckldq xmm0, xmm1
punpckldq xmm2, xmm3
punpckldq xmm0, xmm2

Packing Method 2:
movd xmm0, [ebp+16+4]
movd xmm1, [ebp+16+8]
movd xmm2, [ebp+16+12]
movd xmm3, [ebp+12+16+4]
psllq xmm3, 32
orps xmm2, xmm3
psllq xmm1, 32
orps xmm0, xmm1
movlhps xmm0, xmm2

Packing Method 3:
movd xmm0, [ebp+16+4]
movd xmm1, [ebp+16+8]
movd xmm2, [ebp+16+12]
movd xmm3, [ebp+12+16+4]
movlhps xmm1, xmm3
psllq xmm1, 32
movlhps xmm0, xmm2
orps xmm0, xmm1

3.5.4.2 Simplifying Result Passing

In Example 3-30, individual results were passed to the packing stage by storing to contiguous memory
locations. Instead of using memory spills to pass four results, result passing may be accomplished by
using either one or more registers. Using registers to simplify result passing and reduce memory spills
can improve performance by varying degrees depending on the register pressure at runtime.
Example 3-32 shows the coding sequence that uses four extra XMM registers to reduce all memory spills
of passing results back to the parent routine. However, software must observe the following conditions
when using this technique:

• There is no register shortage.
• If the loop does not have many stores or loads but has many computations, this technique does not help performance. This technique adds work to the computational units, while the store and load ports are idle.

Example 3-32. Using Four Registers to Reduce Memory Spills and Simplify Result Passing
mov eax, [ebp+4]
mov [ebp], eax
call foo
movd xmm0, eax
mov eax, [ebp+8]
mov [ebp], eax
call foo
movd xmm1, eax
mov eax, [ebp+12]
mov [ebp], eax
call foo
movd xmm2, eax
mov eax, [ebp+12+4]
mov [ebp], eax
call foo
movd xmm3, eax


3.5.4.3 Stack Optimization

In Example 3-30, an input parameter was copied in turn onto the stack and passed to the non-vectorizable routine for processing. The parameter passing from consecutive memory locations can be simplified
by a technique shown in Example 3-33.
Example 3-33. Stack Optimization Technique to Simplify Parameter Passing
call foo
mov [ebp+16], eax
add ebp, 4
call foo
mov [ebp+16], eax
add ebp, 4
call foo
mov [ebp+16], eax
add ebp, 4
call foo
Stack optimization can only be used when:
• The serial operations are function calls. The function “foo” is declared as INT FOO(INT A). The parameter is passed on the stack.
• The order of operation on the components is from last to first.
Note the call to FOO and the advance of EBP when passing the vector elements to FOO one by one from last to first.

3.5.4.4 Tuning Considerations

Tuning considerations for situations represented by looping of Example 3-30 include:
• Applying one or more of the following combinations:
— Choose an alternate packing technique.
— Consider a technique to simplify result-passing.
— Consider the stack optimization technique to simplify parameter passing.
• Minimizing the average number of cycles to execute one iteration of the loop.
• Minimizing the per-iteration cost of the unpacking and packing operations.

The speed improvement gained by using the techniques discussed in this section will vary, depending on the choice of combinations implemented and the characteristics of the non-vectorizable routine. For example, if the routine “foo” is short (representative of tight, short loops), the per-iteration cost of unpacking/packing tends to be smaller than in situations where the non-vectorizable code contains longer operations or many dependencies. This is because many iterations of a short, tight loop can be in flight in the execution core, so the per-iteration cost of packing and unpacking is only partially exposed and appears to cause very little performance degradation.
Evaluation of the per-iteration cost of packing/unpacking should be carried out in a methodical manner over a selected number of test cases, where each case may implement some combination of the techniques discussed in this section. The per-iteration cost can be estimated by:
• Evaluating the average cycles to execute one iteration of the test case.
• Evaluating the average cycles to execute one iteration of a baseline loop sequence of non-vectorizable code.


Example 3-34 shows the base line code sequence that can be used to estimate the average cost of a loop
that executes non-vectorizable routines.

Example 3-34. Base Line Code Sequence to Estimate Loop Overhead
push ebp
mov ebp, esp
sub ebp, 4
mov [ebp], edi
call foo
mov [ebp], edi
call foo
mov [ebp], edi
call foo
mov [ebp], edi
call foo
add ebp, 4
pop ebp
ret
The average per-iteration cost of packing/unpacking can be derived from measuring the execution times of a large number of iterations by:
((Cycles to run TestCase) - (Cycles to run equivalent baseline sequence)) / (Iteration count)
For example, using a simple function that returns an input parameter (representative of tight, short loops), the per-iteration cost of packing/unpacking may range from slightly more than 7 cycles (the shuffle with store forwarding case, Example 3-30) to ~0.9 cycles (accomplished by several test cases). Across 27 test cases (consisting of one of the alternate packing methods, no result-simplification or simplification of either 1 or 4 results, and no stack optimization or with stack optimization), the average per-iteration cost of packing/unpacking is about 1.7 cycles.
Generally speaking, packing methods 2 and 3 (see Example 3-31) tend to be more robust than packing method 1; the optimal choice of simplifying 1 or 4 results will be affected by register pressure at runtime and other relevant microarchitectural conditions.
Note that the numeric discussion of the per-iteration cost of packing/unpacking is illustrative only. It will vary with test cases using a different baseline code sequence and will generally increase if the non-vectorizable routine requires a longer time to execute, because the number of loop iterations that can reside in flight in the execution core decreases.


3.6 OPTIMIZING MEMORY ACCESSES

This section discusses guidelines for optimizing code and data memory accesses. The most important recommendations are:
• Execute load and store operations within available execution bandwidth.
• Enable forward progress of speculative execution.
• Enable store forwarding to proceed.
• Align data, paying attention to data layout and stack alignment.
• Place code and data on separate pages.
• Enhance data locality.
• Use prefetching and cacheability control instructions.
• Enhance code locality and align branch targets.
• Take advantage of write combining.

Alignment and forwarding problems are among the most common sources of large delays on processors
based on Intel NetBurst microarchitecture.

3.6.1 Load and Store Execution Bandwidth

Typically, loads and stores are the most frequent operations in a workload; it is not uncommon for up to 40% of the instructions in a workload to carry load or store intent. Each generation of microarchitecture provides multiple buffers to support executing load and store operations while there are instructions in flight.
Software can maximize memory performance by not exceeding the issue or buffering limitations of the machine. In the Intel Core microarchitecture, only 20 stores and 32 loads may be in flight at once. In Intel microarchitecture code name Nehalem, there are 32 store buffers and 48 load buffers. Since only one load can issue per cycle, algorithms which operate on two arrays are constrained to one operation every other cycle unless you use programming tricks to reduce the amount of memory usage.
Intel Core Duo and Intel Core Solo processors have fewer buffers. Nevertheless the general heuristic applies to all of them.

3.6.1.1 Make Use of Load Bandwidth in Intel® Microarchitecture Code Name Sandy Bridge

While prior microarchitectures have one load port (port 2), Intel microarchitecture code name Sandy Bridge can load from both port 2 and port 3. Thus two load operations can be performed every cycle, doubling the load throughput of the code. This improves code that reads a lot of data and does not need to write out results to memory very often (port 3 also handles store-address operations). To exploit this bandwidth, the data has to stay in the L1 data cache or it should be accessed sequentially, enabling the hardware prefetchers to bring the data to the L1 data cache in time.
Consider the following C code example of adding all the elements of an array:

int buff[BUFF_SIZE];
int sum = 0;
for (i = 0; i < BUFF_SIZE; i++){
    sum += buff[i];
}

Pointer-chasing code can be written either by traversing an index or by de-referencing a pointer. With an index, the compiler generates code that addresses memory using base+index with an offset. With pointer de-referencing, such as:

// C code example
node = node->pNext;

the compiler generates code that uses only a base register:

// ASM example
loop:
mov rdx, [rdx]
dec rax
cmp rax, -1
jne loop

The pointer de-referencing version is faster than the index version across Intel microarchitecture code name Sandy Bridge and prior microarchitectures. Moreover, code that traverses an index will be slower on Intel microarchitecture code name Sandy Bridge relative to prior microarchitectures.

3.6.1.3 Handling L1D Cache Bank Conflict

In Intel microarchitecture code name Sandy Bridge, the internal organization of the L1D cache can give rise to bank conflicts between two load micro-ops. When a bank conflict is present between two load operations, the more recent one will be delayed until the conflict is resolved. A bank conflict happens when two simultaneous load operations have the same bits 2-5 of their linear address but are not from the same set in the cache (bits 6-12).
Bank conflicts should be handled only if the code is bound by load bandwidth. Some bank conflicts do not cause any performance degradation since they are hidden by other performance limiters; eliminating such bank conflicts does not improve performance.
The following example demonstrates a bank conflict and how to modify the code to avoid it. It uses two source arrays with a size that is a multiple of the cache line size. When loading an element from A and the counterpart element from B, the elements have the same offset in their cache lines, and therefore a bank conflict may happen.
With the Haswell microarchitecture, the L1 DCache bank conflict issue does not apply.
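As a rough illustration of the conflict condition described above (a sketch inferred from the bit ranges in the text, not code from this manual), two linear addresses are candidates for a bank conflict when bits 2-5 match while bits 6-12 differ:

#include <stdint.h>

/* Sketch: returns nonzero if two simultaneous loads at linear addresses
   a and b could suffer an L1D bank conflict on Intel microarchitecture
   code name Sandy Bridge: same bank-select bits 2-5, different
   set-select bits 6-12. */
static int may_bank_conflict(uintptr_t a, uintptr_t b)
{
    uintptr_t diff = a ^ b;
    int same_bank     = (diff & 0x003C) == 0;   /* bits 2-5 equal   */
    int different_set = (diff & 0x1FC0) != 0;   /* bits 6-12 differ */
    return same_bank && different_set;
}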

Example 3-37. Example of Bank Conflicts in L1D Cache and Remedy
int A[128];
int B[128];
int C[128];
for (i = 0; i < 128; i += 4){
    C[i] = A[i] + B[i];          // the loads from A[i] and B[i] collide
    C[i+1] = A[i+1] + B[i+1];
    C[i+2] = A[i+2] + B[i+2];
    C[i+3] = A[i+3] + B[i+3];
}
// Code with Bank Conflicts
xor rcx, rcx
lea r11, A
lea r12, B
lea r13, C
loop:
lea esi, [rcx*4]
movsxd rsi, esi
mov edi, [r11+rsi*4]
add edi, [r12+rsi*4]
mov r8d, [r11+rsi*4+4]
add r8d, [r12+rsi*4+4]
mov r9d, [r11+rsi*4+8]
add r9d, [r12+rsi*4+8]
mov r10d, [r11+rsi*4+12]
add r10d, [r12+rsi*4+12]
mov [r13+rsi*4], edi
inc ecx
mov [r13+rsi*4+4], r8d
mov [r13+rsi*4+8], r9d
mov [r13+rsi*4+12], r10d
cmp ecx, LEN
jb loop

// Code without Bank Conflicts
xor rcx, rcx
lea r11, A
lea r12, B
lea r13, C
loop:
lea esi, [rcx*4]
movsxd rsi, esi
mov edi, [r11+rsi*4]
mov r8d, [r11+rsi*4+4]
add edi, [r12+rsi*4]
add r8d, [r12+rsi*4+4]
mov r9d, [r11+rsi*4+8]
mov r10d, [r11+rsi*4+12]
add r9d, [r12+rsi*4+8]
add r10d, [r12+rsi*4+12]
inc ecx
mov [r13+rsi*4], edi
mov [r13+rsi*4+4], r8d
mov [r13+rsi*4+8], r9d
mov [r13+rsi*4+12], r10d
cmp ecx, LEN
jb loop

3.6.2 Minimize Register Spills

When a piece of code has more live variables than the processor can keep in general purpose registers, a common method is to hold some of the variables in memory. This method is called register spill. L1D cache latency can negatively affect the performance of this code; the effect can be more pronounced if the register spills are addressed using the slower addressing modes.
One option is to spill general purpose registers to XMM registers. This method is likely to improve performance on previous processor generations as well. The following example shows how to spill a register to an XMM register rather than to memory.


Example 3-38. Using XMM Register in Lieu of Memory for Register Spills

// Register spills into memory
loop:
mov rdx, [rsp+0x18]
movdqa xmm0, [rdx]
movdqa xmm1, [rsp+0x20]
pcmpeqd xmm1, xmm0
pmovmskb eax, xmm1
test eax, eax
jne end_loop
movzx rcx, [rbx+0x60]
add qword ptr[rsp+0x18], 0x10
add rdi, 0x4
movzx rdx, di
sub rcx, 0x4
add rsi, 0x1d0
cmp rdx, rcx
jle loop

// Register spills into XMM
movq xmm4, [rsp+0x18]
mov rcx, 0x10
movq xmm5, rcx
loop:
movq rdx, xmm4
movdqa xmm0, [rdx]
movdqa xmm1, [rsp+0x20]
pcmpeqd xmm1, xmm0
pmovmskb eax, xmm1
test eax, eax
jne end_loop
movzx rcx, [rbx+0x60]
paddq xmm4, xmm5
add rdi, 0x4
movzx rdx, di
sub rcx, 0x4
add rsi, 0x1d0
cmp rdx, rcx
jle loop

3.6.3 Enhance Speculative Execution and Memory Disambiguation

Prior to the Intel Core microarchitecture, when code contains both stores and loads, a load cannot be issued before the address of the store is resolved. This rule ensures correct handling of load dependencies on preceding stores.
The Intel Core microarchitecture contains a mechanism that allows some loads to be issued early speculatively. The processor later checks if the load address overlaps with a store. If the addresses do overlap, then the processor re-executes the instructions.
Example 3-39 illustrates a situation in which the compiler cannot be sure that “Ptr->Array” does not change during the loop. Therefore, the compiler cannot keep “Ptr->Array” in a register as an invariant and must read it again in every iteration. Although this situation could be fixed in software by rewriting the code to require that the address of the pointer be invariant, memory disambiguation provides the performance gain without rewriting the code.


Example 3-39. Loads Blocked by Stores of Unknown Address

// C code
struct AA {
    AA ** Array;
};
void nullify_array ( AA *Ptr, DWORD Index, AA *ThisPtr )
{
    while ( Ptr->Array[--Index] != ThisPtr )
    {
        Ptr->Array[Index] = NULL ;
    };
};

// Assembly sequence
nullify_loop:
mov dword ptr [eax], 0
mov edx, dword ptr [edi]
sub ecx, 4
cmp dword ptr [ecx+edx], esi
lea eax, [ecx+edx]
jne nullify_loop
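For comparison, the software rewrite mentioned above would hoist the invariant into a local variable. A minimal sketch (assuming the programmer can assert that Ptr->Array is invariant; this code is not part of the original example):

/* Hoisting Ptr->Array into a local lets the compiler keep it in a
   register, so it need not be re-read after each store of unknown
   address. */
void nullify_array_hoisted ( AA *Ptr, DWORD Index, AA *ThisPtr )
{
    AA ** array = Ptr->Array;   /* asserted loop-invariant */
    while ( array[--Index] != ThisPtr )
    {
        array[Index] = NULL ;
    };
};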

3.6.4 Alignment

Alignment of data concerns all kinds of variables:
• Dynamically allocated variables.
• Members of a data structure.
• Global or local variables.
• Parameters passed on the stack.

Misaligned data access can incur significant performance penalties. This is particularly true for cache line splits. The size of a cache line is 64 bytes in the Pentium 4 and other recent Intel processors, including processors based on Intel Core microarchitecture.
An access to data unaligned on a 64-byte boundary leads to two memory accesses and requires several µops to be executed (instead of one). Accesses that span 64-byte boundaries are likely to incur a large performance penalty, and the cost of each stall is generally greater on machines with longer pipelines.
Double-precision floating-point operands that are eight-byte aligned have better performance than operands that are not eight-byte aligned, since they are less likely to incur penalties for cache and MOB splits. A floating-point operation on a memory operand requires that the operand be loaded from memory. This incurs an additional µop, which can have a minor negative impact on front end bandwidth. Additionally, memory operands may cause a data cache miss, causing a penalty.
Assembly/Compiler Coding Rule 46. (H impact, H generality) Align data on natural operand size
address boundaries. If the data will be accessed with vector instruction loads and stores, align the data
on 16-byte boundaries.
For best performance, align data as follows:
• Align 8-bit data at any address.
• Align 16-bit data to be contained within an aligned 4-byte word.
• Align 32-bit data so that its base address is a multiple of four.
• Align 64-bit data so that its base address is a multiple of eight.
• Align 80-bit data so that its base address is a multiple of sixteen.
• Align 128-bit data so that its base address is a multiple of sixteen.

A 64-byte or greater data structure or array should be aligned so that its base address is a multiple of 64. Sorting data in decreasing size order is one heuristic for assisting with natural alignment. As long as 16-byte boundaries (and cache lines) are never crossed, natural alignment is not strictly necessary (though it is an easy way to enforce this).
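In C, these alignments can be requested explicitly. The following is a minimal sketch assuming a C11 compiler (alignas and aligned_alloc):

#include <stdalign.h>
#include <stdlib.h>

/* 16-byte alignment for data accessed with vector loads and stores. */
static alignas(16) float vec_data[1024];

/* A 64-byte structure; aligning each instance to 64 bytes places it at
   the start of a cache line. */
struct Record { double values[8]; };          /* 64 bytes */
static alignas(64) struct Record rec;

/* Dynamically allocated storage aligned to a cache line. For
   aligned_alloc, the size must be a multiple of the alignment. */
static float *make_buffer(void)
{
    return aligned_alloc(64, 1024 * sizeof(float));
}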


Example 3-40 shows the type of code that can cause a cache line split. The code loads the addresses of two DWORD arrays. 029E70FEH is not a 4-byte-aligned address, so a 4-byte access at this address will get 2 bytes from the cache line this address is contained in, and 2 bytes from the cache line that starts at 029E7100H. On processors with 64-byte cache lines, a similar cache line split will occur every 8 iterations.
Example 3-40. Code That Causes Cache Line Split
mov esi, 029e70feh
mov edi, 05be5260h
Blockmove:
mov eax, DWORD PTR [esi]
mov ebx, DWORD PTR [esi+4]
mov DWORD PTR [edi], eax
mov DWORD PTR [edi+4], ebx
add esi, 8
add edi, 8
sub edx, 1
jnz Blockmove

Figure 3-2 illustrates the situation of accessing a data element that spans cache line boundaries.

Figure 3-2. Cache Line Split in Accessing Elements in an Array
(The figure shows an access at address 029e70feh spanning cache lines 029e70c0h and 029e7100h; every sixteenth array element straddles a cache line boundary in the same way.)
Alignment of code is less important for processors based on Intel NetBurst microarchitecture. Alignment
of branch targets to maximize bandwidth of fetching cached instructions is an issue only when not
executing out of the trace cache.
Alignment of code can be an issue for the Pentium M, Intel Core Duo and Intel Core 2 Duo processors.
Alignment of branch targets will improve decoder throughput.

3.6.5 Store Forwarding

The processor’s memory system only sends stores to memory (including cache) after store retirement.
However, store data can be forwarded from a store to a subsequent load from the same address to give
a much shorter store-load latency.
There are two kinds of requirements for store forwarding. If these requirements are violated, store
forwarding cannot occur and the load must get its data from the cache (so the store must write its data
back to the cache first). This incurs a penalty that is largely related to pipeline depth of the underlying
micro-architecture.


The first requirement pertains to the size and alignment of the store-forwarding data. This restriction is
likely to have high impact on overall application performance. Typically, a performance penalty due to
violating this restriction can be prevented. The store-to-load forwarding restrictions vary from one microarchitecture to another. Several examples of coding pitfalls that cause store-forwarding stalls and solutions to these pitfalls are discussed in detail in Section 3.6.5.1, “Store-to-Load-Forwarding Restriction on
Size and Alignment.” The second requirement is the availability of data, discussed in Section 3.6.5.2,
“Store-forwarding Restriction on Data Availability.” A good practice is to eliminate redundant load operations.
It may be possible to keep a temporary scalar variable in a register and never write it to memory. Generally, such a variable must not be accessible using indirect pointers. Moving a variable to a register eliminates all loads and stores of that variable and eliminates potential problems associated with store
forwarding. However, it also increases register pressure.
Load instructions tend to start chains of computation. Since the out-of-order engine is based on data
dependence, load instructions play a significant role in the engine’s ability to execute at a high rate. Eliminating loads should be given a high priority.
If a variable does not change between the time when it is stored and the time when it is used again, the
register that was stored can be copied or used directly. If register pressure is too high, or an unseen function is called before the store and the second load, it may not be possible to eliminate the second load.
Assembly/Compiler Coding Rule 47. (H impact, M generality) Pass parameters in registers
instead of on the stack where possible. Passing arguments on the stack requires a store followed by a
reload. While this sequence is optimized in hardware by providing the value to the load directly from
the memory order buffer without the need to access the data cache if permitted by store-forwarding
restrictions, floating-point values incur a significant latency in forwarding. Passing floating-point
arguments in (preferably XMM) registers should save this long latency operation.
Parameter passing conventions may limit the choice of which parameters are passed in registers versus which are passed on the stack. However, these limitations may be overcome if the compiler has control of the compilation of the whole binary (using whole-program optimization).

3.6.5.1 Store-to-Load-Forwarding Restriction on Size and Alignment

Data size and alignment restrictions for store-forwarding apply to processors based on Intel NetBurst
microarchitecture, Intel Core microarchitecture, Intel Core 2 Duo, Intel Core Solo and Pentium M processors. The performance penalty for violating store-forwarding restrictions is less for shorter-pipelined
machines than for Intel NetBurst microarchitecture.
Store-forwarding restrictions vary with each microarchitecture. Intel NetBurst microarchitecture places
more constraints than Intel Core microarchitecture on code generation to enable store-forwarding to
make progress instead of experiencing stalls. Fixing store-forwarding problems for Intel NetBurst microarchitecture generally also avoids problems on Pentium M, Intel Core Duo and Intel Core 2 Duo processors. The size and alignment restrictions for store-forwarding in processors based on Intel NetBurst
microarchitecture are illustrated in Figure 3-3.


Figure 3-3. Size and Alignment Restrictions in Store Forwarding
(The figure illustrates the cases: a load aligned with the store data will forward, while a penalty is incurred for (a) a small load after a large store that does not align with the store's start, (b) a load whose size is greater than or equal to the store's, (c) a load whose size is greater than or equal to that of multiple smaller stores, and (d) a 128-bit forward that is not 16-byte aligned.)
The following rules help satisfy size and alignment restrictions for store forwarding:
Assembly/Compiler Coding Rule 48. (H impact, M generality) A load that forwards from a store
must have the same address start point and therefore the same alignment as the store data.
Assembly/Compiler Coding Rule 49. (H impact, M generality) The data of a load which is
forwarded from a store must be completely contained within the store data.
A load that forwards from a store must wait for the store’s data to be written to the store buffer before
proceeding, but other, unrelated loads need not wait.


Assembly/Compiler Coding Rule 50. (H impact, ML generality) If it is necessary to extract a non-aligned portion of stored data, read out the smallest aligned portion that completely contains the data and shift/mask the data as necessary. This is better than incurring the penalties of a failed store-forward.
Assembly/Compiler Coding Rule 51. (MH impact, ML generality) Avoid several small loads after
large stores to the same area of memory by using a single large read and register copies as needed.
Example 3-41 depicts several store-forwarding situations in which small loads follow large stores. The
first three load operations illustrate the situations described in Rule 51. However, the last load operation
gets data from store-forwarding without problem.
Example 3-41. Situations Showing Small Loads After Large Store
mov [EBP],‘abcd’
mov AL, [EBP]        ; Not blocked - same alignment
mov BL, [EBP + 1]    ; Blocked
mov CL, [EBP + 2]    ; Blocked
mov DL, [EBP + 3]    ; Blocked
mov AL, [EBP]        ; Not blocked - same alignment
                     ; n.b. passes older blocked loads

Example 3-42 illustrates a store-forwarding situation in which a large load follows several small stores.
The data needed by the load operation cannot be forwarded because all of the data that needs to be
forwarded is not contained in the store buffer. Avoid large loads after small stores to the same area of
memory.
Example 3-42. Non-forwarding Example of Large Load After Small Store
mov [EBP], ‘a’
mov [EBP + 1], ‘b’
mov [EBP + 2], ‘c’
mov [EBP + 3], ‘d’
mov EAX, [EBP] ; Blocked
; The first 4 small stores can be consolidated into
; a single DWORD store to prevent this non-forwarding
; situation.
Example 3-43 illustrates a stalled store-forwarding situation that may appear in compiler generated
code. Sometimes a compiler generates code similar to that shown in Example 3-43 to handle a spilled
byte to the stack and convert the byte to an integer value.
Example 3-43. A Non-forwarding Situation in Compiler Generated Code
mov DWORD PTR [esp+10h], 00000000h
mov BYTE PTR [esp+10h], bl
mov eax, DWORD PTR [esp+10h] ; Stall
and eax, 0xff                ; Converting back to byte value


Example 3-44 offers two alternatives to avoid the non-forwarding situation shown in Example 3-43.
Example 3-44. Two Ways to Avoid Non-forwarding Situation in Example 3-43
; A. Use MOVZX instruction to avoid large load after small
; store, when spills are ignored.
movzx eax, bl                       ; Replaces the last three instructions

; B. Use MOVZX instruction and handle spills to the stack
mov DWORD PTR [esp+10h], 00000000h
mov BYTE PTR [esp+10h], bl
movzx eax, BYTE PTR [esp+10h]       ; Not blocked

When moving data that is smaller than 64 bits between memory locations, 64-bit or 128-bit SIMD
register moves are more efficient (if aligned) and can be used to avoid unaligned loads. Although
floating-point registers allow the movement of 64 bits at a time, floating-point instructions should not be
used for this purpose, as data may be inadvertently modified.
As an additional example, consider the cases in Example 3-45.
Example 3-45. Large and Small Load Stalls
; A. Large load stall
mov mem, eax        ; Store dword to address “MEM"
mov mem + 4, ebx    ; Store dword to address “MEM + 4"
fld mem             ; Load qword at address “MEM", stalls

; B. Small Load stall
fstp mem            ; Store qword to address “MEM"
mov bx, mem+2       ; Load word at address “MEM + 2", stalls
mov cx, mem+4       ; Load word at address “MEM + 4", stalls

In the first case (A), there is a large load after a series of small stores to the same area of memory
(beginning at memory address MEM). The large load will stall.
The FLD must wait for the stores to write to memory before it can access all the data it requires. This stall
can also occur with other data types (for example, when bytes or words are stored and then words or
doublewords are read from the same area of memory).
In the second case (B), there is a series of small loads after a large store to the same area of memory
(beginning at memory address MEM). The small loads will stall.
The word loads must wait for the quadword store to write to memory before they can access the data
they require. This stall can also occur with other data types (for example, when doublewords or words
are stored and then words or bytes are read from the same area of memory). This can be avoided by
moving the store as far from the loads as possible.


Store forwarding restrictions for processors based on Intel Core microarchitecture are listed in Table 3-3.
Table 3-3. Store Forwarding Restrictions of Processors Based on Intel Core Microarchitecture

Store Alignment          | Width of Store (bits) | Load Alignment (byte) | Width of Load (bits) | Store Forwarding Restriction
-------------------------|-----------------------|-----------------------|----------------------|-----------------------------
To Natural size          | 16                    | word aligned          | 8, 16                | not stalled
To Natural size          | 16                    | not word aligned      | 8                    | stalled
To Natural size          | 32                    | dword aligned         | 8, 32                | not stalled
To Natural size          | 32                    | not dword aligned     | 8                    | stalled
To Natural size          | 32                    | word aligned          | 16                   | not stalled
To Natural size          | 32                    | not word aligned      | 16                   | stalled
To Natural size          | 64                    | qword aligned         | 8, 16, 64            | not stalled
To Natural size          | 64                    | not qword aligned     | 8, 16                | stalled
To Natural size          | 64                    | dword aligned         | 32                   | not stalled
To Natural size          | 64                    | not dword aligned     | 32                   | stalled
To Natural size          | 128                   | dqword aligned        | 8, 16, 128           | not stalled
To Natural size          | 128                   | not dqword aligned    | 8, 16                | stalled
To Natural size          | 128                   | dword aligned         | 32                   | not stalled
To Natural size          | 128                   | not dword aligned     | 32                   | stalled
To Natural size          | 128                   | qword aligned         | 64                   | not stalled
To Natural size          | 128                   | not qword aligned     | 64                   | stalled
Unaligned, start byte 1  | 32                    | byte 0 of store       | 8, 16, 32            | not stalled
Unaligned, start byte 1  | 32                    | not byte 0 of store   | 8, 16                | stalled
Unaligned, start byte 1  | 64                    | byte 0 of store       | 8, 16, 32            | not stalled
Unaligned, start byte 1  | 64                    | not byte 0 of store   | 8, 16, 32            | stalled
Unaligned, start byte 1  | 64                    | byte 0 of store       | 64                   | stalled
Unaligned, start byte 7  | 32                    | byte 0 of store       | 8                    | not stalled
Unaligned, start byte 7  | 32                    | not byte 0 of store   | 8                    | not stalled
Unaligned, start byte 7  | 32                    | don’t care            | 16, 32               | stalled
Unaligned, start byte 7  | 64                    | don’t care            | 16, 32, 64           | stalled

3.6.5.2 Store-forwarding Restriction on Data Availability

The value to be stored must be available before the load operation can be completed. If this restriction is
violated, the execution of the load will be delayed until the data is available. This delay causes some
execution resources to be used unnecessarily, and that can lead to sizable but non-deterministic delays.
However, the overall impact of this problem is much smaller than that from violating size and alignment
requirements.


In modern microarchitectures, hardware predicts when loads are dependent on and get their data
forwarded from preceding stores. These predictions can significantly improve performance. However, if a
load is scheduled too soon after the store it depends on or if the generation of the data to be stored is
delayed, there can be a significant penalty.
There are several cases in which data is passed through memory, and the store may need to be separated from the load:
• Spills, save and restore registers in a stack frame.
• Parameter passing.
• Global and volatile variables.
• Type conversion between integer and floating-point.
• When compilers do not analyze code that is inlined, forcing variables that are involved in the interface with inlined code to be in memory, creating more memory variables and preventing the elimination of redundant loads.

Assembly/Compiler Coding Rule 52. (H impact, MH generality) Where it is possible to do so without incurring other penalties, prioritize the allocation of variables to registers, as in register allocation and for parameter passing, to minimize the likelihood and impact of store-forwarding problems. Try not to store-forward data generated from a long latency instruction - for example, MUL or DIV. Avoid store-forwarding data for variables with the shortest store-load distance. Avoid store-forwarding data for variables with many and/or long dependence chains, and especially avoid including a store forward on a loop-carried dependence chain.
Example 3-46 shows an example of a loop-carried dependence chain.
Example 3-46. Loop-carried Dependence Chain
for ( i = 0; i < MAX; i++ ) {
a[i] = b[i] * foo;
foo = a[i] / 3;
}
// foo is a loop-carried dependence.
Assembly/Compiler Coding Rule 53. (M impact, MH generality) Calculate store addresses as
early as possible to avoid having stores block loads.

3.6.6 Data Layout Optimizations

User/Source Coding Rule 6. (H impact, M generality) Pad data structures defined in the source
code so that every data element is aligned to a natural operand size address boundary.
If the operands are packed in a SIMD instruction, align to the packed element size (64-bit or 128-bit).
Align data by providing padding inside structures and arrays. Programmers can reorganize structures and
arrays to minimize the amount of memory wasted by padding. However, compilers might not have this
freedom. The C programming language, for example, specifies the order in which structure elements are
allocated in memory. For more information, see Section 4.4, “Stack and Data Alignment”.


Example 3-47 shows how a data structure could be rearranged to reduce its size.
Example 3-47. Rearranging a Data Structure
struct unpacked { /* Fits in 20 bytes due to padding */
    int   a;
    char  b;
    int   c;
    char  d;
    int   e;
};

struct packed { /* Fits in 16 bytes */
    int   a;
    int   c;
    int   e;
    char  b;
    char  d;
};
A cache line size of 64 bytes can impact streaming applications (for example, multimedia), which reference and use data only once before discarding it. Data accesses which sparsely utilize the data within a cache line can result in less efficient utilization of system memory bandwidth. For example, arrays of structures can be decomposed into several arrays to achieve better packing, as shown in Example 3-48.

Example 3-48. Decomposing an Array
struct {
/* 1600 bytes */
int a, c, e;
char b, d;
} array_of_struct [100];
struct {
/* 1400 bytes */
int a[100], c[100], e[100];
char b[100], d[100];
} struct_of_array;
struct {
/* 1200 bytes */
int a, c, e;
} hybrid_struct_of_array_ace[100];
struct {
/* 200 bytes */
char b, d;
} hybrid_struct_of_array_bd[100];

The efficiency of such optimizations depends on usage patterns. If the elements of the structure are all
accessed together but the access pattern of the array is random, then ARRAY_OF_STRUCT avoids unnecessary prefetch even though it wastes memory.
However, if the access pattern of the array exhibits locality (for example, if the array index is being swept
through) then processors with hardware prefetchers will prefetch data from STRUCT_OF_ARRAY, even if
the elements of the structure are accessed together.

3-55

GENERAL OPTIMIZATION GUIDELINES

When the elements of the structure are not accessed with equal frequency, such as when element A is accessed ten times more often than the other entries, then STRUCT_OF_ARRAY not only saves memory, but it also prevents fetching unnecessary data items B, C, D, and E.
Using STRUCT_OF_ARRAY also enables the use of the SIMD data types by the programmer and the compiler.
Note that STRUCT_OF_ARRAY can have the disadvantage of requiring more independent memory stream references. This can require the use of more prefetches and additional address generation calculations. It can also have an impact on DRAM page access efficiency. An alternative, HYBRID_STRUCT_OF_ARRAY, blends the two approaches. In this case, only 2 separate address streams are generated and referenced: 1 for HYBRID_STRUCT_OF_ARRAY_ACE and 1 for HYBRID_STRUCT_OF_ARRAY_BD. The second alternative also prevents fetching unnecessary data — assuming that (1) the variables A, C and E are always used together, and (2) the variables B and D are always used together, but not at the same time as A, C and E.
The hybrid approach ensures:
• Simpler/fewer address generations than STRUCT_OF_ARRAY.
• Fewer streams, which reduces DRAM page misses.
• Fewer prefetches due to fewer streams.
• Efficient cache line packing of data elements that are used concurrently.

Assembly/Compiler Coding Rule 54. (H impact, M generality) Try to arrange data structures
such that they permit sequential access.
If the data is arranged into a set of streams, the automatic hardware prefetcher can prefetch data that
will be needed by the application, reducing the effective memory latency. If the data is accessed in a nonsequential manner, the automatic hardware prefetcher cannot prefetch the data. The prefetcher can
recognize up to eight concurrent streams. See Chapter 7, “Optimizing Cache Usage,” for more information on the hardware prefetcher.
User/Source Coding Rule 7. (M impact, L generality) Beware of false sharing within a cache line
(64 bytes).
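As an illustration of this rule (a sketch, not an example from this manual), per-thread counters that would otherwise share one cache line can be padded so each occupies its own 64-byte line:

#include <stdalign.h>

/* Without the padding, counters[0] and counters[1] would share a cache
   line, and two threads incrementing them would bounce that line
   between cores (false sharing). */
struct PaddedCounter {
    volatile long value;
    char pad[64 - sizeof(long)];   /* fill out the 64-byte line */
};

static alignas(64) struct PaddedCounter counters[2];   /* one per thread */

void worker(int tid, long n)
{
    for (long i = 0; i < n; i++)
        counters[tid].value++;
}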

3.6.7 Stack Alignment

A performance penalty for unaligned access to the stack occurs when a memory reference splits a cache line. This means that one out of eight spatially consecutive unaligned quadword accesses is always penalized, and similarly one out of four consecutive non-aligned double-quadword accesses, etc.
Aligning the stack may be beneficial any time there are data objects that exceed the default stack alignment of the system. For example, on 32-bit and 64-bit Linux and on 64-bit Windows, the default stack alignment is 16 bytes, while on 32-bit Windows it is 4 bytes.
Assembly/Compiler Coding Rule 55. (H impact, M generality) Make sure that the stack is aligned
at the largest multi-byte granular data type boundary matching the register width.
Aligning the stack typically requires the use of an additional register to track across a padded area of unknown size. There is a trade-off between causing unaligned memory references that span a cache line and causing extra general purpose register spills.
The assembly-level technique to implement dynamic stack alignment may depend on the compiler and the specific OS environment. The reader may wish to study the assembly output from a compiler of interest.


Example 3-49. Examples of Dynamic Stack Alignment
// 32-bit environment
push ebp            ; save ebp
mov ebp, esp        ; ebp now points to incoming parameters
andl esp, $-N       ; align esp to N byte boundary
sub esp, $N         ; reserve space for new stack frame
.                   ; parameters must be referenced off of ebp
mov esp, ebp        ; restore esp
pop ebp             ; restore ebp

// 64-bit environment
sub esp, $N
mov r13, $N
andl r13, $-N       ; r13 points to aligned section in stack
.                   ; use r13 as base for aligned data

If for some reason it is not possible to align the stack for 64-bits, the routine should access the parameter
and save it into a register or known aligned storage, thus incurring the penalty only once.

3.6.8 Capacity Limits and Aliasing in Caches

There are cases in which addresses with a given stride will compete for some resource in the memory
hierarchy.
Typically, caches are implemented to have multiple ways of set associativity, with each way consisting of
multiple sets of cache lines (or sectors in some cases). Multiple memory references that compete for the
same set of each way in a cache can cause a capacity issue. There are aliasing conditions that apply to
specific microarchitectures. Note that first-level cache lines are 64 bytes. Thus, the least significant 6 bits
are not considered in alias comparisons. For processors based on Intel NetBurst microarchitecture, data
is loaded into the second level cache in a sector of 128 bytes, so the least significant 7 bits are not
considered in alias comparisons.

3.6.8.1 Capacity Limits in Set-Associative Caches

Capacity limits may be reached if the number of outstanding memory references that are mapped to the same set in each way of a given cache exceeds the number of ways of that cache. The conditions that apply to the first-level data cache and second level cache are listed below:
• L1 Set Conflicts — Multiple references map to the same first-level cache set. The conflicting condition is a stride determined by the size of the cache in bytes, divided by the number of ways. These competing memory references can cause excessive cache misses only if the number of outstanding memory references exceeds the number of ways in the working set:
  — On Pentium 4 and Intel Xeon processors with a CPUID signature of family encoding 15, model encoding of 0, 1, or 2; there will be an excess of first-level cache misses for more than 4 simultaneous competing memory references to addresses with 2-KByte modulus.
  — On Pentium 4 and Intel Xeon processors with a CPUID signature of family encoding 15, model encoding 3; there will be an excess of first-level cache misses for more than 8 simultaneous competing references to addresses that are apart by 2-KByte modulus.


  — On Intel Core 2 Duo, Intel Core Duo, Intel Core Solo, and Pentium M processors, there will be an excess of first-level cache misses for more than 8 simultaneous references to addresses that are apart by 4-KByte modulus.
• L2 Set Conflicts — Multiple references map to the same second-level cache set. The conflicting condition is also determined by the size of the cache or the number of ways:
  — On Pentium 4 and Intel Xeon processors, there will be an excess of second-level cache misses for more than 8 simultaneous competing references. The stride sizes that can cause capacity issues are 32 KBytes, 64 KBytes, or 128 KBytes, depending on the size of the second level cache.
  — On Pentium M processors, the stride sizes that can cause capacity issues are 128 KBytes or 256 KBytes, depending on the size of the second level cache. On Intel Core 2 Duo, Intel Core Duo, and Intel Core Solo processors, a stride size of 256 KBytes can cause a capacity issue if the number of simultaneous accesses exceeds the way associativity of the L2 cache.
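The conflicting stride described above is simply the cache size divided by the number of ways. A worked sketch with illustrative cache parameters (not figures quoted from the text):

#include <stdio.h>

int main(void)
{
    /* Illustrative parameters: a 32-KByte, 8-way set-associative cache. */
    unsigned cache_size = 32 * 1024;
    unsigned ways       = 8;
    /* Critical stride at which addresses map to the same set. */
    printf("critical stride = %u bytes\n", cache_size / ways);   /* 4096 */
    /* More simultaneous references at this stride than there are ways
       will cause excess misses. */
    return 0;
}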

3.6.8.2 Aliasing Cases in the Pentium® M, Intel® Core™ Solo, Intel® Core™ Duo and Intel® Core™ 2 Duo Processors

Pentium M, Intel Core Solo, Intel Core Duo and Intel Core 2 Duo processors have the following aliasing case:
• Store forwarding — If a store to an address is followed by a load from the same address, the load will not proceed until the store data is available. If a store is followed by a load and their addresses differ by a multiple of 4 KBytes, the load stalls until the store operation completes.

Assembly/Compiler Coding Rule 56. (H impact, M generality) Avoid having a store followed by a
non-dependent load with addresses that differ by a multiple of 4 KBytes. Also, lay out data or order
computation to avoid having cache lines that have linear addresses that are a multiple of 64 KBytes
apart in the same working set. Avoid having more than 4 cache lines that are some multiple of 2 KBytes
apart in the same first-level cache working set, and avoid having more than 8 cache lines that are some
multiple of 4 KBytes apart in the same first-level cache working set.
When declaring multiple arrays that are referenced with the same index and are each a multiple of 64
KBytes (as can happen with STRUCT_OF_ARRAY data layouts), pad them to avoid declaring them contiguously. Padding can be accomplished by either intervening declarations of other variables or by artificially
increasing the dimension.
User/Source Coding Rule 8. (H impact, ML generality) Consider using a special memory allocation
library with address offset capability to avoid aliasing. One way to implement a memory allocator to
avoid aliasing is to allocate more than enough space and pad. For example, allocate structures that are
68 KB instead of 64 KBytes to avoid the 64-KByte aliasing, or have the allocator pad and return random
offsets that are a multiple of 128 Bytes (the size of a cache line).
User/Source Coding Rule 9. (M impact, M generality) When padding variable declarations to
avoid aliasing, the greatest benefit comes from avoiding aliasing on second-level cache lines,
suggesting an offset of 128 bytes or more.
4-KByte memory aliasing occurs when the code accesses two different memory locations with a 4-KByte offset between them. The 4-KByte aliasing situation can manifest in a memory copy routine where the addresses of the source buffer and destination buffer maintain a constant offset and the constant offset happens to be a multiple of the byte increment from one iteration to the next.
Example 3-50 shows a routine that copies 16 bytes of memory in each iteration of a loop. If the offsets (modulo 4096) between the source buffer (EAX) and destination buffer (EDX) differ by 16, 32, 48, 64, or 80, loads have to wait until stores have been retired before they can continue. For example, at offset 16 the load of the next iteration is 4-KByte aliased to the current iteration's store, so the loop must wait until the store operation completes, making the entire loop serialized. The amount of time needed to wait decreases with larger offsets until an offset of 96 resolves the issue (as there are no pending stores by the time of the load with the same address).


The Intel Core microarchitecture provides a performance monitoring event (see
LOAD_BLOCK.OVERLAP_STORE in Intel® 64 and IA-32 Architectures Software Developer’s Manual,
Volume 3B) that allows software tuning effort to detect the occurrence of aliasing conditions.
Example 3-50. Aliasing Between Loads and Stores Across Loop Iterations
lp:
movaps xmm0, [eax+ecx]
movaps [edx+ecx], xmm0
add ecx, 16
jnz lp
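One way to apply User/Source Coding Rule 8 to such a copy loop is to place the destination so that its distance from the source, modulo 4096, falls outside the aliasing range. A hedged sketch (the 96-byte threshold follows the discussion above; the helper name is hypothetical, and the unshifted pointer would need to be retained for freeing):

#include <stdint.h>
#include <stdlib.h>

/* Allocate a destination buffer whose offset from src, modulo 4096, is
   at least 96 bytes, so that loads of one iteration do not alias
   pending stores from the previous iterations. */
void *alloc_nonaliasing_dst(const void *src, size_t size)
{
    unsigned char *raw = malloc(size + 4096);   /* room for shifting */
    if (raw == NULL) return NULL;
    uintptr_t delta = ((uintptr_t)raw - (uintptr_t)src) & 4095;
    size_t shift = (delta < 96) ? (96 - delta) : 0;
    return raw + shift;   /* keep 'raw' elsewhere if it must be freed */
}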

3.6.9 Mixing Code and Data

The aggressive prefetching and pre-decoding of instructions by Intel processors have two related effects:
• Self-modifying code works correctly, according to the Intel architecture processor requirements, but incurs a significant performance penalty. Avoid self-modifying code if possible.
• Writable data placed in the code segment might be impossible to distinguish from self-modifying code. Writable data in the code segment might suffer the same performance penalty as self-modifying code.

Assembly/Compiler Coding Rule 57. (M impact, L generality) If (hopefully read-only) data must occur on the same page as code, avoid placing it immediately after an indirect jump. For example, follow an indirect jump with its most likely target, and place the data after an unconditional branch.
Tuning Suggestion 1. In rare cases, a performance problem may be caused by executing data on a
code page as instructions. This is very likely to happen when execution is following an indirect branch
that is not resident in the trace cache. If this is clearly causing a performance problem, try moving the
data elsewhere, or inserting an illegal opcode or a PAUSE instruction immediately after the indirect
branch. Note that the latter two alternatives may degrade performance in some circumstances.
Assembly/Compiler Coding Rule 58. (H impact, L generality) Always put code and data on
separate pages. Avoid self-modifying code wherever possible. If code is to be modified, try to do it all at
once and make sure the code that performs the modifications and the code being modified are on
separate 4-KByte pages or on separate aligned 1-KByte subpages.

3.6.9.1 Self-modifying Code

Self-modifying code (SMC) that ran correctly on Pentium III processors and prior implementations will run
correctly on subsequent implementations. SMC and cross-modifying code (when multiple processors in a
multiprocessor system are writing to a code page) should be avoided when high performance is desired.
Software should avoid writing to a code page in the same 1-KByte subpage that is being executed, or fetching code in the same 2-KByte subpage that is being written. In addition, sharing a page containing directly or speculatively executed code with another processor as a data page can trigger an SMC condition that causes the entire pipeline of the machine and the trace cache to be cleared. This is due to the self-modifying code condition.
Dynamic code need not cause the SMC condition if the code written fills up a data page before that page
is accessed as code. Dynamically-modified code (for example, from target fix-ups) is likely to suffer from
the SMC condition and should be avoided where possible. Avoid the condition by introducing indirect
branches and using data tables on data pages (not code pages) using register-indirect calls.


3.6.9.2 Position Independent Code

Position independent code often needs to obtain the value of the instruction pointer. Example 3-51a
shows one technique to put the value of IP into the ECX register by issuing a CALL without a matching
RET. Example 3-51b shows an alternative technique to put the value of IP into the ECX register using a
matched pair of CALL/RET.

Example 3-51. Instruction Pointer Query Techniques
a) Using call without return to obtain IP does not corrupt the RSB
call _label; return address pushed is the IP of next instruction
_label:
pop ECX; IP of this instruction is now put into ECX
b) Using matched call/ret pair
call _lblcx;
... ; ECX now contains IP of this instruction
...
_lblcx:
mov ecx, [esp];
ret

3.6.10 Write Combining

Write combining (WC) improves performance in two ways:
• On a write miss to the first-level cache, it allows multiple stores to the same cache line to occur before that cache line is read for ownership (RFO) from further out in the cache/memory hierarchy. Then the rest of the line is read, and the bytes that have not been written are combined with the unmodified bytes in the returned line.
• Write combining allows multiple writes to be assembled and written further out in the cache hierarchy as a unit. This saves port and bus traffic. Saving traffic is particularly important for avoiding partial writes to uncached memory.

There are six write-combining buffers (on Pentium 4 and Intel Xeon processors with a CPUID signature of family encoding 15, model encoding 3, there are 8 write-combining buffers). Two of these buffers may be written out to higher cache levels and freed up for use on other write misses. Only four write-combining buffers are guaranteed to be available for simultaneous use. Write combining applies to memory type WC; it does not apply to memory type UC.
There are six write-combining buffers in each processor core in Intel Core Duo and Intel Core Solo processors. Processors based on Intel Core microarchitecture have eight write-combining buffers in each core. Starting with Intel microarchitecture code name Nehalem, there are 10 buffers available for write-combining.
Assembly/Compiler Coding Rule 59. (H impact, L generality) If an inner loop writes to more than
four arrays (four distinct cache lines), apply loop fission to break up the body of the loop such that only
four arrays are being written to in each iteration of each of the resulting loops.
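A sketch of loop fission as called for by this rule (the arrays and loop are illustrative):

/* Before: one loop writes six arrays, i.e. six distinct store streams,
   exceeding the four guaranteed write-combining buffers. */
void fill_unfissioned(int *a, int *b, int *c, int *d, int *e, int *f, int n)
{
    for (int i = 0; i < n; i++) {
        a[i] = i; b[i] = i; c[i] = i; d[i] = i; e[i] = i; f[i] = i;
    }
}

/* After: two loops, each writing at most four arrays, so each loop's
   store streams fit within the write-combining buffers. */
void fill_fissioned(int *a, int *b, int *c, int *d, int *e, int *f, int n)
{
    for (int i = 0; i < n; i++) { a[i] = i; b[i] = i; c[i] = i; }
    for (int i = 0; i < n; i++) { d[i] = i; e[i] = i; f[i] = i; }
}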
Write combining buffers are used for stores of all memory types. They are particularly important for writes to uncached memory: writes to different parts of the same cache line can be grouped into a single, full-cache-line bus transaction instead of going across the bus (since they are not cached) as several partial writes. Avoiding partial writes can have a significant impact on bus bandwidth-bound graphics applications, where graphics buffers are in uncached memory. Separating writes to uncached memory and writes to writeback memory into separate phases can assure that the write combining buffers can fill before getting evicted by other write traffic. Eliminating partial write transactions has been found to have a performance impact on the order of 20% for some applications. Because the cache lines are 64 bytes, a write to the bus for 63 bytes will result in 8 partial bus transactions.
When coding functions that execute simultaneously on two threads, reducing the number of writes that
are allowed in an inner loop will help take full advantage of write-combining store buffers. For writecombining buffer recommendations for Hyper-Threading Technology, see Chapter 8, “Multicore and
Hyper-Threading Technology.”
Store ordering and visibility are also important issues for write combining. When a write to a write-combining buffer for a previously-unwritten cache line occurs, there will be a read-for-ownership (RFO). If a subsequent write happens to another write-combining buffer, a separate RFO may be caused for that cache line. Subsequent writes to the first cache line and write-combining buffer will be delayed until the second RFO has been serviced to guarantee properly ordered visibility of the writes. If the memory type for the writes is write-combining, there will be no RFO since the line is not cached, and there is no such delay. For details on write-combining, see Chapter 7, “Optimizing Cache Usage.”

3.6.11 Locality Enhancement

Locality enhancement can reduce data traffic originating from an outer-level sub-system in the
cache/memory hierarchy. This is to address the fact that the access-cost in terms of cycle-count from an
outer level will be more expensive than from an inner level. Typically, the cycle-cost of accessing a given
cache level (or memory system) varies across different microarchitectures, processor implementations,
and platform components. It may be sufficient to recognize the relative data access cost trend by locality
rather than to follow a large table of numeric values of cycle-costs, listed per locality, per processor/platform implementations, etc. The general trend is typically that access cost from an outer sub-system may
be approximately 3-10X more expensive than accessing data from the immediate inner level in the
cache/memory hierarchy, assuming similar degrees of data access parallelism.
Thus locality enhancement should start with characterizing the dominant data traffic locality. Appendix A, “Application Performance Tools,” describes some techniques that can be used to determine the dominant data traffic locality for any workload.
Even if cache miss rates of the last level cache may be low relative to the number of cache references, processors typically spend a sizable portion of their execution time waiting for cache misses to be serviced. Reducing cache misses by enhancing a program’s locality is a key optimization. This can take several forms:
• Blocking to iterate over a portion of an array that will fit in the cache (with the purpose that subsequent references to the data-block [or tile] will be cache hit references); see the sketch after this list.
• Loop interchange to avoid crossing cache lines or page boundaries.
• Loop skewing to make accesses contiguous.
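A minimal sketch of cache blocking applied to a matrix transpose (the matrix and tile sizes are illustrative; the tile is chosen so that the working blocks of both matrices fit in the target cache level):

#define N 1024
#define B 64   /* illustrative tile size */

/* Blocked transpose: iterating over B x B tiles keeps the lines of
   'in' and 'out' touched by a tile cache-resident while the tile is
   being used, so subsequent references within the tile are cache hits. */
void transpose_blocked(const double in[N][N], double out[N][N])
{
    for (int ii = 0; ii < N; ii += B)
        for (int jj = 0; jj < N; jj += B)
            for (int i = ii; i < ii + B; i++)
                for (int j = jj; j < jj + B; j++)
                    out[j][i] = in[i][j];
}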

Locality enhancement to the last level cache can be accomplished with sequencing the data access pattern to take advantage of hardware prefetching. This can also take several forms:
• Transformation of a sparsely populated multi-dimensional array into a one-dimension array such that memory references occur in a sequential, small-stride pattern that is friendly to the hardware prefetch (see Section 2.3.5.4, “Data Prefetching”).
• Optimal tile size and shape selection can further improve temporal data locality by increasing hit rates into the last level cache and reduce memory traffic resulting from the actions of hardware prefetching (see Section 7.5.11, “Hardware Prefetching and Cache Blocking Techniques”).

It is important to avoid operations that work against locality-enhancing techniques. Using the lock prefix
heavily can incur large delays when accessing memory, regardless of whether the data is in the cache or
in system memory.
User/Source Coding Rule 10. (H impact, H generality) Optimization techniques such as blocking,
loop interchange, loop skewing, and packing are best done by the compiler. Optimize data structures
either to fit in one-half of the first-level cache or in the second-level cache; turn on loop optimizations
in the compiler to enhance locality for nested loops.


Optimizing for one-half of the first-level cache will bring the greatest performance benefit in terms of
cycle-cost per data access. If one-half of the first-level cache is too small to be practical, optimize for the
second-level cache. Optimizing for a point in between (for example, for the entire first-level cache) will
likely not bring a substantial improvement over optimizing for the second-level cache.

3.6.12 Minimizing Bus Latency

Each bus transaction includes the overhead of making requests and arbitrations. The average latency of
bus read and bus write transactions will be longer if reads and writes alternate. Segmenting reads and
writes into phases can reduce the average latency of bus transactions. This is because the number of
incidences of successive transactions involving a read following a write, or a write following a read, are
reduced.
User/Source Coding Rule 11. (M impact, ML generality) If there is a blend of reads and writes on
the bus, changing the code to separate these bus transactions into read phases and write phases can
help performance.
Note, however, that the order of read and write operations on the bus is not the same as it appears in the
program.
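An illustrative restructuring in the spirit of User/Source Coding Rule 11 (a sketch; CHUNK is an arbitrary batch size, and the actual bus behavior depends on caching and the platform):

#define CHUNK 256

/* Instead of alternating a read (src) and a write (dst) on every
   element, read a chunk into a local buffer, then write it out, so
   successive bus transactions are grouped into read and write phases. */
void copy_phased(int *dst, const int *src, int n)
{
    int tmp[CHUNK];
    for (int base = 0; base < n; base += CHUNK) {
        int len = (n - base < CHUNK) ? (n - base) : CHUNK;
        for (int i = 0; i < len; i++)    /* read phase  */
            tmp[i] = src[base + i];
        for (int i = 0; i < len; i++)    /* write phase */
            dst[base + i] = tmp[i];
    }
}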
Bus latency for fetching a cache line of data can vary as a function of the access stride of data references. In general, bus latency will increase in response to increasing values of the stride of successive cache misses. Independently, bus latency will also increase as a function of increasing bus queue depths (the number of outstanding bus requests of a given transaction type). The combination of these two trends can be highly non-linear, in that the bus latency of large-stride, bandwidth-sensitive situations is such that the effective throughput of the bus system for data-parallel accesses can be significantly less than the effective throughput of small-stride, bandwidth-sensitive situations.
To minimize the per-access cost of memory traffic or amortize raw memory latency effectively, software
should control its cache miss pattern to favor higher concentration of smaller-stride cache misses.
User/Source Coding Rule 12. (H impact, H generality) To achieve effective amortization of bus
latency, software should favor data access patterns that result in higher concentrations of cache miss
patterns, with cache miss strides that are significantly smaller than half the hardware prefetch trigger
threshold.

3.6.13 Non-Temporal Store Bus Traffic

Peak system bus bandwidth is shared by several types of bus activities, including reads (from memory),
reads for ownership (of a cache line), and writes. The data transfer rate for bus write transactions is
higher if 64 bytes are written out to the bus at a time.
Typically, bus writes to Writeback (WB) memory must share the system bus bandwidth with read-forownership (RFO) traffic. Non-temporal stores do not require RFO traffic; they do require care in
managing the access patterns in order to ensure 64 bytes are evicted at once (rather than evicting
several 8-byte chunks).


Although the data bandwidth of full 64-byte bus writes due to non-temporal stores is twice that of bus
writes to WB memory, transferring 8-byte chunks wastes bus request bandwidth and delivers significantly lower data bandwidth. This difference is depicted in Examples 3-52 and 3-53.
Example 3-52. Using Non-temporal Stores and 64-byte Bus Write Transactions
#define STRIDESIZE 256
lea ecx, p64byte_Aligned
mov edx, ARRAY_LEN
xor eax, eax
slloop:
movntps XMMWORD ptr [ecx + eax], xmm0
movntps XMMWORD ptr [ecx + eax+16], xmm0
movntps XMMWORD ptr [ecx + eax+32], xmm0
movntps XMMWORD ptr [ecx + eax+48], xmm0
; 64 bytes is written in one bus transaction
add eax, STRIDESIZE
cmp eax, edx
jl slloop

Example 3-53. Non-temporal Stores and Partial Bus Write Transactions
#define STRIDESIZE 256
lea ecx, p64byte_Aligned
mov edx, ARRAY_LEN
xor eax, eax
slloop:
movntps XMMWORD ptr [ecx + eax], xmm0
movntps XMMWORD ptr [ecx + eax+16], xmm0
movntps XMMWORD ptr [ecx + eax+32], xmm0
; Storing 48 bytes results in 6 bus partial transactions
add eax, STRIDESIZE
cmp eax, edx
jl slloop

3.7 PREFETCHING

Recent Intel processor families employ several prefetching mechanisms to accelerate the movement of data or code and improve performance:
• Hardware instruction prefetcher.
• Software prefetch for data.
• Hardware prefetch for cache lines of data or instructions.

3.7.1 Hardware Instruction Fetching and Software Prefetching

Software prefetching requires a programmer to use PREFETCH hint instructions and anticipate some suitable timing and location of cache misses.


Software PREFETCH operations work the same way as do load from memory operations, with the following exceptions:
• Software PREFETCH instructions retire after virtual to physical address translation is completed.
• If an exception, such as a page fault, is required to prefetch the data, then the software prefetch instruction retires without prefetching data.
• Avoid specifying a NULL address for software prefetches.

3.7.2 Hardware Prefetching for First-Level Data Cache

The hardware prefetching mechanism for L1 in Intel Core microarchitecture is discussed in Section 2.4.4.2.
Example 3-54 depicts a technique to trigger hardware prefetch. The code demonstrates traversing a linked list and performing some computational work on 2 members of each element that reside in 2 different cache lines. Each element is of size 192 bytes. The total size of all elements is larger than can fit in the L2 cache.

Example 3-54. Using DCU Hardware Prefetch

// Original code
mov ebx, DWORD PTR [First]
xor eax, eax
scan_list:
mov eax, [ebx+4]
mov ecx, 60
do_some_work_1:
add eax, eax
and eax, 6
sub ecx, 1
jnz do_some_work_1
mov eax, [ebx+64]
mov ecx, 30
do_some_work_2:
add eax, eax
and eax, 6
sub ecx, 1
jnz do_some_work_2
mov ebx, [ebx]
test ebx, ebx
jnz scan_list

// Modified sequence benefiting from prefetch
mov ebx, DWORD PTR [First]
xor eax, eax
scan_list:
mov eax, [ebx+4]
mov eax, [ebx+4]
mov eax, [ebx+4]
mov ecx, 60
do_some_work_1:
add eax, eax
and eax, 6
sub ecx, 1
jnz do_some_work_1
mov eax, [ebx+64]
mov ecx, 30
do_some_work_2:
add eax, eax
and eax, 6
sub ecx, 1
jnz do_some_work_2
mov ebx, [ebx]
test ebx, ebx
jnz scan_list

The additional instructions to load data from one member in the modified sequence can trigger the DCU hardware prefetch mechanisms to prefetch data in the next cache line, enabling the work on the second member to complete sooner.
Software can gain from the first-level data cache prefetchers in two cases:
• If data is not in the second-level cache, the first-level data cache prefetcher enables early trigger of the second-level cache prefetcher.
• If data is in the second-level cache and not in the first-level data cache, then the first-level data cache prefetcher triggers earlier data bring-up of sequential cache lines to the first-level data cache.


There are situations in which software should pay attention to a potential side effect of triggering unnecessary DCU hardware prefetches. If a large data structure with many members spanning many cache lines is accessed in ways such that only a few of its members are actually referenced, but there are multiple pair accesses to the same cache line, the DCU hardware prefetcher can trigger fetching of cache lines that are not needed. In Example 3-55, references to the “Pts” array and “AltPts” will trigger DCU prefetch to fetch additional cache lines that won’t be needed. If significant negative performance impact is detected due to DCU hardware prefetch on a portion of the code, software can try to reduce the size of that contemporaneous working set to be less than half of the L2 cache.

Example 3-55. Avoid Causing DCU Hardware Prefetch to Fetch Un-needed Lines
while ( CurrBond != NULL )
{
MyATOM *a1 = CurrBond->At1 ;
MyATOM *a2 = CurrBond->At2 ;
if ( a1->CurrStep <= a1->LastStep &&
a2->CurrStep <= a2->LastStep
)
{
a1->CurrStep++ ;
a2->CurrStep++ ;
double ux = a1->Pts[0].x - a2->Pts[0].x ;
double uy = a1->Pts[0].y - a2->Pts[0].y ;
double uz = a1->Pts[0].z - a2->Pts[0].z ;
a1->AuxPts[0].x += ux ;
a1->AuxPts[0].y += uy ;
a1->AuxPts[0].z += uz ;
a2->AuxPts[0].x += ux ;
a2->AuxPts[0].y += uy ;
a2->AuxPts[0].z += uz ;
};
CurrBond = CurrBond->Next ;
};

To fully benefit from these prefetchers, organize and access the data using one of the following methods:
Method 1:
• Organize the data so consecutive accesses can usually be found in the same 4-KByte page.
• Access the data in constant strides forward or backward (IP prefetcher).
Method 2:
• Organize the data in consecutive lines.
• Access the data in increasing addresses, in sequential cache lines.

Example 3-56 demonstrates accesses to sequential cache lines that can benefit from the first-level cache prefetcher.


Example 3-56. Technique For Using L1 Hardware Prefetch
unsigned int *p1, j, a, b;
for (j = 0; j < num; j += 16)
{
a = p1[j];
b = p1[j+1];
// Use these two values
}
By elevating the load operations from memory to the beginning of each iteration, it is likely that a significant part of the latency of the pair cache line transfer from memory to the second-level cache will be in
parallel with the transfer of the first cache line.
The IP prefetcher uses only the lower 8 bits of the address to distinguish a specific address. If the code
size of a loop is bigger than 256 bytes, two loads may appear similar in the lowest 8 bits and the IP
prefetcher will be restricted. Therefore, if you have a loop bigger than 256 bytes, make sure that no two
loads have the same lowest 8 bits in order to use the IP prefetcher.

3.7.3 Hardware Prefetching for Second-Level Cache

The Intel Core microarchitecture contains two second-level cache prefetchers:
• Streamer — Loads data or instructions from memory to the second-level cache. To use the streamer, organize the data or instructions in blocks of 128 bytes, aligned on 128 bytes. The first access to one of the two cache lines in this block while it is in memory triggers the streamer to prefetch the pair line. To software, the L2 streamer’s functionality is similar to the adjacent cache line prefetch mechanism found in processors based on Intel NetBurst microarchitecture.
• Data prefetch logic (DPL) — DPL and the L2 Streamer are triggered only by the writeback memory type. They prefetch only inside the page boundary (4 KBytes). Both L2 prefetchers can be triggered by software prefetch instructions and by prefetch requests from DCU prefetchers. DPL can also be triggered by read for ownership (RFO) operations. The L2 Streamer can also be triggered by DPL requests for L2 cache misses.

Software can gain from organizing data both according to the instruction pointer and according to line
strides. For example, for matrix calculations, columns can be prefetched by IP-based prefetches, and
rows can be prefetched by DPL and the L2 streamer.
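To illustrate, a minimal C sketch of the two access patterns (the 64x64 array size and function names are illustrative assumptions, not from this manual): the row traversal walks sequential cache lines the streamer can follow, while the column traversal has a constant 256-byte forward stride that the IP-based prefetcher can track within a 4-KByte page.

#define DIM 64                       /* illustrative size; 64 floats = 256 bytes per row */
static float m[DIM][DIM];

float row_sum(int r)                 /* ascending addresses in sequential cache lines */
{                                    /* -- friendly to the L2 streamer and DCU prefetcher */
    int j;
    float s = 0.0f;
    for (j = 0; j < DIM; j++)
        s += m[r][j];
    return s;
}

float col_sum(int c)                 /* constant forward stride of 256 bytes */
{                                    /* -- a pattern the IP-based prefetcher can lock onto */
    int i;
    float s = 0.0f;
    for (i = 0; i < DIM; i++)
        s += m[i][c];
    return s;
}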

3.7.4 Cacheability Instructions

SSE2 provides additional cacheability instructions that extend those provided in SSE. The new cacheability instructions include:

•  New streaming store instructions.
•  New cache line flush instruction.
•  New memory fencing instructions.

For more information, see Chapter 7, “Optimizing Cache Usage.”

3.7.5 REP Prefix and Data Movement

The REP prefix is commonly used with string move instructions for memory related library functions such
as MEMCPY (using REP MOVSD) or MEMSET (using REP STOS). These STRING/MOV instructions with the
REP prefixes are implemented in MS-ROM and have several implementation variants with different
performance levels.


The specific variant of the implementation is chosen at execution time based on data layout, alignment
and the counter (ECX) value. For example, MOVSB/STOSB with the REP prefix should be used with
counter value less than or equal to three for best performance.
String MOVE/STORE instructions have multiple data granularities. For efficient data movement, larger data granularities are preferable. This means better efficiency can be achieved by decomposing an arbitrary counter value into a number of doubleword moves plus single byte moves with a count value less than or equal to 3, as sketched below.
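A minimal sketch of this decomposition, using the MSVC-style __movsd/__movsb intrinsics (which compile to REP MOVSD/REP MOVSB); the function name and the use of <intrin.h> are assumptions for illustration, not part of this manual's example set:

#include <intrin.h>

void copy_decomposed(unsigned char *dst, const unsigned char *src, size_t n)
{
    size_t dwords = n >> 2;            /* move the bulk as doublewords */
    size_t tail   = n & 3;             /* remainder is at most 3 bytes */
    __movsd((unsigned long *)dst, (const unsigned long *)src, dwords);
    __movsb(dst + (dwords << 2), src + (dwords << 2), tail);
}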
Because software can use SIMD data movement instructions to move 16 bytes at a time, the following
paragraphs discuss general guidelines for designing and implementing high-performance library functions such as MEMCPY(), MEMSET(), and MEMMOVE(). Four factors are to be considered:

•  Throughput per iteration — If two pieces of code have approximately identical path lengths, efficiency favors choosing the instruction that moves larger pieces of data per iteration. Also, smaller code size per iteration will in general reduce overhead and improve throughput. Sometimes, this may involve a comparison of the relative overhead of an iterative loop structure versus using the REP prefix for iteration.
•  Address alignment — Data movement instructions with highest throughput usually have alignment restrictions, or they operate more efficiently if the destination address is aligned to its natural data size. Specifically, 16-byte moves need to ensure the destination address is aligned to 16-byte boundaries, and 8-byte moves perform better if the destination address is aligned to 8-byte boundaries. Frequently, moving at doubleword granularity performs better with addresses that are 8-byte aligned.
•  REP string move vs. SIMD move — Implementing general-purpose memory functions using SIMD extensions usually requires adding prolog code to ensure the availability of SIMD instructions, and preamble code to facilitate aligned data movement requirements at runtime. Throughput comparison must also take into consideration the overhead of the prolog when comparing a REP string implementation with a SIMD approach.
•  Cache eviction — If the amount of data to be processed by a memory routine approaches half the size of the last level on-die cache, temporal locality of the cache may suffer. Using streaming store instructions (for example: MOVNTQ, MOVNTDQ) can minimize the effect of flushing the cache. The threshold to start using a streaming store depends on the size of the last level cache. Determine the size using the deterministic cache parameter leaf of CPUID. Techniques for using streaming stores for implementing a MEMSET()-type library must also consider that the application can benefit from this technique only if it has no immediate need to reference the target addresses. This assumption is easily upheld when testing a streaming-store implementation on a micro-benchmark configuration, but violated in a full-scale application situation.

When applying general heuristics to the design of general-purpose, high-performance library routines, the following guidelines are useful when optimizing for an arbitrary counter value N and address alignment. Different techniques may be necessary for optimal performance, depending on the magnitude of N:

•  When N is less than some small count (where the small count threshold will vary between microarchitectures -- empirically, 8 may be a good value when optimizing for Intel NetBurst microarchitecture), each case can be coded directly without the overhead of a looping structure. For example, 11 bytes can be processed using two MOVSD instructions explicitly and a MOVSB with a REP counter equaling 3.
•  When N is not small but still less than some threshold value (which may vary for different microarchitectures, but can be determined empirically), an SIMD implementation using run-time CPUID and alignment prolog will likely deliver less throughput due to the overhead of the prolog. A REP string implementation should favor using a REP string of doublewords. To improve address alignment, a small piece of prolog code using MOVSB/STOSB with a count less than 4 can be used to peel off the non-aligned data moves before starting to use MOVSD/STOSD.
•  When N is less than half the size of the last level cache, throughput consideration may favor either:
   — An approach using a REP string with the largest data granularity, because a REP string has little overhead for loop iteration, and the branch misprediction overhead in the prolog/epilogue code to handle address alignment is amortized over many iterations.


   — An iterative approach using the instruction with the largest data granularity, where the overhead for SIMD feature detection, iteration overhead, and prolog/epilogue for alignment control can be minimized. The trade-off between these approaches may depend on the microarchitecture. An example of MEMSET() implemented using STOSD for an arbitrary counter value with the destination address aligned to a doubleword boundary in 32-bit mode is shown in Example 3-57.
•  When N is larger than half the size of the last level cache, using 16-byte granularity streaming stores with prolog/epilog for address alignment will likely be more efficient, if the destination addresses will not be referenced immediately afterwards (see the streaming-store sketch following this list).
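A minimal sketch of the streaming-store approach for a MEMSET()-style fill (the function name is illustrative; assumes a 16-byte aligned destination, a byte count that is a multiple of 16, and that the target addresses are not referenced soon afterwards):

#include <emmintrin.h>

void memset_nt(void *dst, int c, size_t size)
{
    __m128i v = _mm_set1_epi8((char)c);
    __m128i *p = (__m128i *)dst;
    size_t i;
    for (i = 0; i < size / 16; i++)
        _mm_stream_si128(p + i, v);   /* MOVNTDQ: write without polluting the caches */
    _mm_sfence();                     /* flush the WC buffers so the stores are globally visible */
}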

Example 3-57. REP STOSD with Arbitrary Count Size and 4-Byte-Aligned Destination
A 'C' example of Memset():
void memset(void *dst, int c, size_t size)
{
    char *d = (char *)dst;
    size_t i;
    for (i = 0; i < size; i++)
        *d++ = (char)c;
}
Equivalent Implementation Using REP STOSD
#include <xmmintrin.h>
void add(float *a, float *b, float *c)
{
    __m128 t0, t1;
    t0 = _mm_load_ps(a);
    t1 = _mm_load_ps(b);
    t0 = _mm_add_ps(t0, t1);
    _mm_store_ps(c, t0);
}

The intrinsics map one-to-one with actual Streaming SIMD Extensions assembly code. The
XMMINTRIN.H header file in which the prototypes for the intrinsics are defined is part of the Intel C++
Compiler included with the VTune Performance Enhancement Environment CD.
Intrinsics are also defined for the MMX technology ISA. These are based on the __m64 data type to
represent the contents of an mm register. You can specify values in bytes, short integers, 32-bit values,
or as a 64-bit object.
The intrinsic data types, however, are not basic ANSI C data types, and therefore you must observe the following usage restrictions:
•  Use intrinsic data types only on the left-hand side of an assignment, as a return value, or as a parameter. You cannot use them with other arithmetic expressions (for example, "+", ">>").
•  Use intrinsic data type objects in aggregates, such as unions, to access the byte elements and structures; the address of an __m64 object may also be used (see the sketch below).
•  Use intrinsic data type data only with the MMX technology intrinsics described in this guide.

For complete details of the hardware instructions, see the Intel Architecture MMX Technology
Programmer’s Reference Manual. For a description of data types, see the Intel® 64 and IA-32 Architectures Software Developer’s Manual.
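As a sketch of the aggregate-access restriction above (the union and variable names are illustrative assumptions, not from this manual):

#include <mmintrin.h>

union m64_view {
    __m64 m;                 /* the intrinsic data type itself */
    short word[4];           /* word elements accessed through the aggregate */
};

short first_word(__m64 a, __m64 b)
{
    union m64_view v;
    v.m = _mm_add_pi16(a, b);    /* 64-bit SIMD integer intrinsic */
    short w = v.word[0];         /* element access via the union, not via "+" or ">>" */
    _mm_empty();                 /* clear the x87 tag word before any subsequent x87 code */
    return w;
}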

4.3.1.3 Classes

A set of C++ classes has been defined and is available in the Intel C++ Compiler to provide both a higher-level abstraction and more flexibility for programming with MMX technology, Streaming SIMD Extensions and Streaming SIMD Extensions 2. These classes provide an easy-to-use and flexible interface to the intrinsic functions, allowing developers to write more natural C++ code without worrying about which intrinsic or assembly language instruction to use for a given operation. Since the intrinsic functions underlie the implementation of these C++ classes, the performance of applications using this methodology can approach that of one using the intrinsics. Further details on the use of these classes can be found in the Intel C++ Class Libraries for SIMD Operations User's Guide, order number 693500.


Example 4-16 shows the C++ code using a vector class library. The example assumes the arrays passed
to the routine are already aligned to 16-byte boundaries.
Example 4-16. C++ Code Using the Vector Classes
#include <fvec.h>
void add(float *a, float *b, float *c)
{
    F32vec4 *av = (F32vec4 *)a;
    F32vec4 *bv = (F32vec4 *)b;
    F32vec4 *cv = (F32vec4 *)c;
    *cv = *av + *bv;
}
Here, fvec.h is the class definition file and F32vec4 is the class representing an array of four floats. The "+" and "=" operators are overloaded so that the actual Streaming SIMD Extensions implementation in the previous example is abstracted out, or hidden, from the developer. Note how much more this resembles the original code, allowing for simpler and faster programming.
Again, the example assumes the arrays passed to the routine are already aligned to a 16-byte boundary.

4.3.1.4 Automatic Vectorization

The Intel C++ Compiler provides an optimization mechanism by which loops, such as the one in Example 4-13, can be automatically vectorized, or converted into Streaming SIMD Extensions code. The compiler uses similar techniques to those used by a programmer to identify whether a loop is suitable for conversion to SIMD. This involves determining whether the following might prevent vectorization:
•  The layout of the loop and the data structures used.
•  Dependencies amongst the data accesses in each iteration and across iterations.

Once the compiler has made such a determination, it can generate vectorized code for the loop, allowing
the application to use the SIMD instructions.
The caveat to this is that only certain types of loops can be automatically vectorized, and in most cases
user interaction with the compiler is needed to fully enable this.
Example 4-17 shows the code for automatic vectorization for the simple four-iteration loop (from
Example 4-13).
Example 4-17. Automatic Vectorization for a Simple Loop
void add(float *restrict a,
         float *restrict b,
         float *restrict c)
{
    int i;
    for (i = 0; i < 4; i++) {
        c[i] = a[i] + b[i];
    }
}

Compile this code using the -QAX and -QRESTRICT switches of the Intel C++ Compiler, version 4.0 or
later.
The RESTRICT qualifier in the argument list is necessary to let the compiler know that there are no other aliases to the memory to which the pointers point. In other words, the pointer for which it is used provides the only means of accessing the memory in question in the scope in which the pointers live. Without the restrict qualifier, the compiler will still vectorize this loop using runtime data dependence testing, where the generated code dynamically selects between sequential or vector execution of the loop, based on overlap of the parameters (see documentation for the Intel C++ Compiler). The restrict keyword avoids the associated overhead altogether.
See Intel C++ Compiler documentation for details.

4.4 STACK AND DATA ALIGNMENT

To get the most performance out of code written for SIMD technologies, data should be formatted in memory according to the guidelines described in this section. Assembly code with unaligned accesses is significantly slower than code with aligned accesses.

4.4.1 Alignment and Contiguity of Data Access Patterns

The 64-bit packed data types defined by MMX technology, and the 128-bit packed data types for Streaming SIMD Extensions and Streaming SIMD Extensions 2, create more potential for misaligned data accesses. The data access patterns of many algorithms are inherently misaligned when using MMX technology and Streaming SIMD Extensions. Several techniques for improving data access, such as padding, organizing data elements into arrays, etc., are described below. SSE3 provides a special-purpose instruction, LDDQU, that can avoid cache line splits; it is discussed in Section 5.7.1.1, "Supplemental Techniques for Avoiding Cache Line Splits."

4.4.1.1 Using Padding to Align Data

However, when accessing SIMD data using SIMD operations, access to data can be improved simply by a change in the declaration. For example, consider a declaration of a structure which represents a point in space plus an attribute.
typedef struct {short x,y,z; char a;} Point;
Point pt[N];
Assume we will be performing a number of computations on X, Y, Z in three of the four elements of a
SIMD word; see Section 4.5.1, “Data Structure Layout,” for an example. Even if the first element in array
PT is aligned, the second element will start 7 bytes later and not be aligned (3 shorts at two bytes each
plus a single byte = 7 bytes).
By adding the padding variable PAD, the structure is now 8 bytes, and if the first element is aligned to 8
bytes (64 bits), all following elements will also be aligned. The sample declaration follows:
typedef struct {short x,y,z; char a; char pad;} Point;
Point pt[N];
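A short sketch combining the padded declaration with an aligned array definition (the __declspec(align(...)) syntax follows its usage elsewhere in this document; the array size is illustrative):

typedef struct {short x,y,z; char a; char pad;} Point;   /* sizeof(Point) == 8 */
__declspec(align(8)) Point pt[1000];  /* first element is 8-byte aligned, so the
                                         padding keeps every element aligned */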

4.4.1.2 Using Arrays to Make Data Contiguous

In the following code,

Example 4-26. Emulation of Conditional Moves
C code:
for (i=0; i<MAX_ELEMENT; i++) {
    if (A[i] > B[i]) {
        C[i] = D[i];
    } else {
        C[i] = E[i];
    }
}
MMX assembly code processes 4 short values per iteration:
xor     eax, eax
top_of_loop:
movq    mm0, [A + eax]
pcmpgtw mm0, [B + eax]          ; create compare mask
movq    mm1, [D + eax]
pand    mm1, mm0                ; drop elements where A <= B
pandn   mm0, [E + eax]          ; drop elements where A > B
por     mm0, mm1                ; create a single word
movq    [C + eax], mm0
add     eax, 8
cmp     eax, MAX_ELEMENT*2
jle     top_of_loop

SSE4.1 assembly processes 8 short values per iteration:
xor     eax, eax
top_of_loop:
movdqa  xmm0, [A + eax]
pcmpgtw xmm0, [B + eax]         ; create compare mask
movdqa  xmm1, [E + eax]
pblendvb xmm1, [D + eax], xmm0  ; select D where A > B, else keep E
movdqa  [C + eax], xmm1
add     eax, 16
cmp     eax, MAX_ELEMENT*2
jle     top_of_loop
If there are multiple consumers of an instance of a register, group the consumers together as closely as
possible. However, the consumers should not be scheduled near the producer.
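The same conditional-move emulation can also be written with SSE4.1 intrinsics; a minimal sketch (the function name is illustrative; arrays are assumed 16-byte aligned and the count a multiple of 8):

#include <smmintrin.h>   /* SSE4.1 */

void cond_move(const short *A, const short *B, short *C,
               const short *D, const short *E, int count)
{
    int i;
    for (i = 0; i < count; i += 8) {
        __m128i a = _mm_load_si128((const __m128i *)(A + i));
        __m128i b = _mm_load_si128((const __m128i *)(B + i));
        __m128i d = _mm_load_si128((const __m128i *)(D + i));
        __m128i e = _mm_load_si128((const __m128i *)(E + i));
        __m128i mask = _mm_cmpgt_epi16(a, b);      /* all-ones per word where A > B */
        __m128i r = _mm_blendv_epi8(e, d, mask);   /* D where mask set, E elsewhere */
        _mm_store_si128((__m128i *)(C + i), r);
    }
}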

4.6.1 SIMD Optimizations and Microarchitectures

Pentium M, Intel Core Solo and Intel Core Duo processors have a different microarchitecture than Intel
NetBurst microarchitecture. The following sub-section discusses optimizing SIMD code targeting Intel
Core Solo and Intel Core Duo processors.
The register-register variant of the following instructions has improved performance on Intel Core Solo and Intel Core Duo processors relative to Pentium M processors. This is because the instructions consist of two micro-ops instead of three. Relevant instructions are: unpcklps, unpckhps, packsswb, packuswb, packssdw, pshufd, shufps and shufpd.
Recommendation: When targeting code generation for Intel Core Solo and Intel Core Duo processors,
favor instructions consisting of two micro-ops over those with more than two micro-ops.
Intel Core microarchitecture generally executes SIMD instructions more efficiently than previous microarchitectures in terms of latency and throughput; most 128-bit SIMD operations have 1 cycle throughput (except shuffle, pack, unpack operations). Many of the restrictions specific to Intel Core Duo and Intel Core Solo processors (such as 128-bit SIMD operations having 2 cycle throughput at a minimum) do not apply to Intel Core microarchitecture. The same is true of Intel Core microarchitecture relative to Intel NetBurst microarchitectures.
Enhanced Intel Core microarchitecture provides dedicated 128-bit shuffler and radix-16 divider hardware. These capabilities and SSE4.1 instructions will make vectorization using 128-bit SIMD instructions
even more efficient and effective.


Recommendation: With the proliferation of 128-bit SIMD hardware in Intel Core microarchitecture and
Enhanced Intel Core microarchitecture, integer SIMD code written using MMX instructions should
consider more efficient implementations using 128-bit SIMD instructions.

4.7 TUNING THE FINAL APPLICATION

The best way to tune your application once it is functioning correctly is to use a profiler that measures the
application while it is running on a system. Intel VTune Amplifier XE can help you determine where to
make changes in your application to improve performance. Using Intel VTune Amplifier XE can help you
with various phases required for optimized performance. See Appendix A.3.1, “Intel® VTune™ Amplifier
XE,” for details. After every effort to optimize, you should check the performance gains to see where you
are making your major optimization gains.


CHAPTER 5
OPTIMIZING FOR SIMD INTEGER APPLICATIONS
SIMD integer instructions provide performance improvements in applications that are integer-intensive
and can take advantage of SIMD architecture.
Guidelines in this chapter for using SIMD integer instructions (in addition to those described in Chapter
3) may be used to develop fast and efficient code that scales across processor generations.
The collection of 64-bit and 128-bit SIMD integer instructions supported by MMX technology, SSE, SSE2,
SSE3, SSSE3, SSE4.1, and PCMPEQQ in SSE4.2 are referred to as SIMD integer instructions.
Code sequences in this chapter demonstrate the use of basic 64-bit SIMD integer instructions and more efficient 128-bit SIMD integer instructions.
Processors based on Intel Core microarchitecture support MMX, SSE, SSE2, SSE3, and SSSE3. Processors based on Enhanced Intel Core microarchitecture support SSE4.1 and all previous generations of SIMD integer instructions. Processors based on Intel microarchitecture code name Nehalem support MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1 and SSE4.2.
Single-instruction, multiple-data techniques can be applied to text/string processing, lexing and parser applications. SIMD programming in string/text processing and lexing applications often requires sophisticated techniques beyond those commonly used in SIMD integer programming. This is covered in Chapter 10, "SSE4.2 and SIMD Programming For Text-Processing/Lexing/Parsing."
Execution of 128-bit SIMD integer instructions in Intel Core microarchitecture and Enhanced Intel Core microarchitecture is substantially more efficient than on previous microarchitectures. Thus newer SIMD capabilities introduced in SSE4.1 operate on 128-bit operands and do not introduce equivalent 64-bit SIMD capabilities. Conversion from 64-bit SIMD integer code to 128-bit SIMD integer code is highly recommended.
This chapter contains examples that will help you to get started with coding your application. The goal is
to provide simple, low-level operations that are frequently used. The examples use a minimum number
of instructions necessary to achieve best performance on the current generation of Intel 64 and IA-32
processors.
Each example includes a short description, sample code, and notes if necessary. These examples do not
address scheduling as it is assumed the examples will be incorporated in longer code sequences.
For planning considerations of using the SIMD integer instructions, refer to Section 4.1.3.

5.1 GENERAL RULES ON SIMD INTEGER CODE

General rules and suggestions are:
•  Do not intermix 64-bit SIMD integer instructions with x87 floating-point instructions. See Section 5.2, "Using SIMD Integer with x87 Floating-point." Note that all SIMD integer instructions can be intermixed without penalty.
•  Favor 128-bit SIMD integer code over 64-bit SIMD integer code. On microarchitectures prior to Intel Core microarchitecture, most 128-bit SIMD instructions have two-cycle throughput restrictions due to the underlying 64-bit data path in the execution engine. Intel Core microarchitecture executes most SIMD instructions (except shuffle, pack, unpack operations) with one-cycle throughput and provides three ports to execute multiple SIMD instructions in parallel. Enhanced Intel Core microarchitecture speeds up 128-bit shuffle, pack, unpack operations with 1 cycle throughput.
•  When writing SIMD code that works for both integer and floating-point data, use the subset of SIMD convert instructions or load/store instructions to ensure that the input operands in XMM registers contain data types that are properly defined to match the instruction. Code sequences containing cross-typed usage produce the same result across different implementations but incur a significant performance penalty. Using SSE/SSE2/SSE3/SSSE3/SSE4.1 instructions to operate on type-mismatched SIMD data in the XMM register is strongly discouraged.
•  Use the optimization rules and guidelines described in Chapter 3 and Chapter 4.
•  Take advantage of the hardware prefetcher where possible. Use the PREFETCH instruction only when data access patterns are irregular and prefetch distance can be pre-determined. See Chapter 7, "Optimizing Cache Usage."
•  Emulate conditional moves by using blend, masked compares and logicals instead of using conditional branches.

5.2 USING SIMD INTEGER WITH X87 FLOATING-POINT

All 64-bit SIMD integer instructions use MMX registers, which share register state with the x87 floating-point stack. Because of this sharing, certain rules and considerations apply. Instructions using MMX registers cannot be freely intermixed with x87 floating-point instructions. Take care when switching between 64-bit SIMD integer instructions and x87 floating-point instructions to ensure functional correctness. See Section 5.2.1.
Both Section 5.2.1 and Section 5.2.2 apply only to software that employs MMX instructions. As noted before, 128-bit SIMD integer instructions should be favored to replace MMX code and achieve higher performance. That also obviates the need to use EMMS, and the performance penalty of using EMMS when intermixing MMX and x87 instructions.
There is no performance penalty for intermixing SIMD floating-point operations, 128-bit SIMD integer operations, and x87 floating-point operations.

5.2.1 Using the EMMS Instruction

When generating 64-bit SIMD integer code, keep in mind that the eight MMX registers are aliased to the x87 floating-point registers. Switching from MMX instructions to x87 floating-point instructions incurs a finite delay, so it is best to minimize switching between these instruction types. But when switching, the EMMS instruction provides an efficient means to clear the x87 stack so that subsequent x87 code can operate properly.
As soon as an instruction makes reference to an MMX register, all valid bits in the x87 floating-point tag
word are set, which implies that all x87 registers contain valid values. In order for software to operate
correctly, the x87 floating-point stack should be emptied when starting a series of x87 floating-point
calculations after operating on the MMX registers.
Using EMMS clears all valid bits, effectively emptying the x87 floating-point stack and making it ready for
new x87 floating-point operations. The EMMS instruction ensures a clean transition between using operations on the MMX registers and using operations on the x87 floating-point stack. On the Pentium 4
processor, there is a finite overhead for using the EMMS instruction.
Failure to use the EMMS instruction (or the _MM_EMPTY() intrinsic) between operations on the MMX
registers and x87 floating-point registers may lead to unexpected results.

NOTE
Failure to reset the tag word for FP instructions after using an MMX instruction can result
in faulty execution or poor performance.

5.2.2 Guidelines for Using EMMS Instruction

When developing code with both x87 floating-point and 64-bit SIMD integer instructions, follow these
steps:


1. Always call the EMMS instruction at the end of 64-bit SIMD integer code when the code transitions to
x87 floating-point code.
2. Insert the EMMS instruction at the end of all 64-bit SIMD integer code segments to avoid an x87
floating-point stack overflow exception when an x87 floating-point instruction is executed.
When writing an application that uses both floating-point and 64-bit SIMD integer instructions, use the following guidelines to help you determine when to use EMMS:
•  If next instruction is x87 FP — Use _MM_EMPTY() after a 64-bit SIMD integer instruction if the next instruction is an x87 FP instruction; for example, before doing calculations on floats, doubles or long doubles.
•  Don't empty when already empty — If the next instruction uses an MMX register, _MM_EMPTY() incurs a cost with no benefit.
•  Group Instructions — Try to partition regions that use x87 FP instructions from those that use 64-bit SIMD integer instructions. This eliminates the need for an EMMS instruction within the body of a critical loop.
•  Runtime initialization — Use _MM_EMPTY() during runtime initialization of __m64 and x87 FP data types. This ensures resetting the register between data type transitions. See Example 5-1 for coding usage.

Example 5-1. Resetting Register Between __m64 and FP Data Types Code
Incorrect Usage:
__m64 x = _m_paddd(y, z);
float f = init();

Correct Usage:
__m64 x = _m_paddd(y, z);
float f = (_mm_empty(), init());

You must be aware that your code generates an MMX instruction, which uses MMX registers, with the Intel C++ Compiler in the following situations:
•  When using a 64-bit SIMD integer intrinsic from MMX technology, SSE/SSE2/SSSE3.
•  When using a 64-bit SIMD integer instruction from MMX technology, SSE/SSE2/SSSE3 through inline assembly.
•  When referencing the __m64 data type variable.

Additional information on the x87 floating-point programming model can be found in the Intel® 64 and
IA-32 Architectures Software Developer’s Manual, Volume 1. For more on EMMS, visit http://developer.intel.com.

5.3 DATA ALIGNMENT

Make sure that 64-bit SIMD integer data is 8-byte aligned and that 128-bit SIMD integer data is 16-byte aligned. Referencing unaligned 64-bit SIMD integer data can incur a performance penalty due to accesses that span 2 cache lines. Referencing unaligned 128-bit SIMD integer data results in an exception unless the MOVDQU (move double-quadword unaligned) instruction is used. Using the MOVDQU instruction on unaligned data can result in lower performance than using 16-byte aligned references. Refer to Section 4.4, "Stack and Data Alignment," for more information.
Loading 16 bytes of SIMD data efficiently requires data alignment on 16-byte boundaries. SSSE3 provides the PALIGNR instruction. It reduces overhead in situations that require software to process data elements from non-aligned addresses. The PALIGNR instruction is most valuable when loading or storing unaligned data where the address is shifted by a few bytes. You can replace a set of unaligned loads with aligned loads followed by PALIGNR instructions and simple register-to-register copies.
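A minimal sketch of the two-aligned-loads-plus-PALIGNR idea (the names are illustrative; PALIGNR's byte offset is an immediate and must be a compile-time constant, shown here as 5):

#include <tmmintrin.h>   /* SSSE3 */

#define OFS 5            /* illustrative misalignment, 1..15 bytes */

/* Returns the 16 bytes starting OFS bytes past the 16-byte aligned address "base". */
__m128i load_with_palignr(const unsigned char *base)
{
    __m128i lo = _mm_load_si128((const __m128i *)base);          /* aligned load */
    __m128i hi = _mm_load_si128((const __m128i *)(base + 16));   /* aligned load */
    return _mm_alignr_epi8(hi, lo, OFS);  /* bytes OFS..OFS+15 of the 32-byte pair */
}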


Using PALIGNRs to replace unaligned loads improves performance by eliminating cache line splits and
other penalties. In routines like MEMCPY( ), PALIGNR can boost the performance of misaligned cases.
Example 5-2 shows a situation that benefits by using PALIGNR.

Example 5-2. FIR Processing Example in C language Code
void FIR(float *in, float *out, float *coeff, int count)
{
    int i, j;
    for ( i=0; i

If (High - Low) >= 0X8000, simplify the algorithm as in Example 5-27.

Example 5-27. Simplified Clipping to an Arbitrary Signed Range
; Input:   MM0   signed source operands
; Output:  MM1   signed operands clipped to the unsigned
;                range [high, low]
paddsw  mm0, (packed_max - packed_high)
        ; in effect this clips to high
psubsw  mm0, (packed_usmax - packed_high + packed_low)
        ; clips to low
paddw   mm0, low
        ; undo the previous two offsets
This algorithm saves a cycle when it is known that (High - Low) >= 0x8000. The three-instruction algorithm does not work when (High - Low) < 0x8000, because 0xffff minus any number < 0x8000 will yield a number greater in magnitude than 0x8000 (which is a negative number).
When the second instruction, PSUBSW MM0, (0xffff - High + Low), in the three-step algorithm (Example 5-27) is executed, a negative number is subtracted. The result of this subtraction causes the values in MM0 to be increased instead of decreased, as should be the case, and an incorrect answer is generated.


5.6.6.2 Clipping to an Arbitrary Unsigned Range [High, Low]

Example 5-28 clips an unsigned value to the unsigned range [High, Low]. If the value is less than Low or greater than High, then clip to Low or High, respectively. This technique uses the packed-add and packed-subtract instructions with unsigned saturation, thus the technique can only be used on packed-byte and packed-word data types.
Example 5-28 illustrates the operation on word values.

Example 5-28. Clipping to an Arbitrary Unsigned Range [High, Low]
; Input:   MM0   unsigned source operands
; Output:  MM1   unsigned operands clipped to the unsigned
;                range [HIGH, LOW]
paddusw mm0, 0xffff - high
        ; in effect this clips to high
psubusw mm0, (0xffff - high + low)
        ; in effect this clips to low
paddw   mm0, low
        ; undo the previous two offsets

5.6.7 Packed Max/Min of Byte, Word and Dword

The PMAXSW instruction returns the maximum between four signed words in either of two SIMD registers, or one SIMD register and a memory location.
The PMINSW instruction returns the minimum between the four signed words in either of two SIMD
registers, or one SIMD register and a memory location.
The PMAXUB instruction returns the maximum between the eight unsigned bytes in either of two SIMD
registers, or one SIMD register and a memory location.
The PMINUB instruction returns the minimum between the eight unsigned bytes in either of two SIMD
registers, or one SIMD register and a memory location.
SSE2 extended PMAXSW/PMAXUB/PMINSW/PMINUB to 128-bit operations. SSE4.1 adds 128-bit operations for signed bytes, unsigned word, signed and unsigned dword.
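With these instructions, clipping to an arbitrary signed word range becomes a two-instruction sequence with no restriction on the magnitude of (High - Low); a minimal intrinsics sketch (the function name is an illustrative assumption):

#include <emmintrin.h>

__m128i clip_words(__m128i x, __m128i low, __m128i high)
{
    x = _mm_max_epi16(x, low);     /* PMAXSW: raise values below low */
    return _mm_min_epi16(x, high); /* PMINSW: lower values above high */
}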

5.6.8 Packed Multiply Integers

The PMULHUW/PMULHW instruction multiplies the unsigned/signed words in the destination operand
with the unsigned/signed words in the source operand. The high-order 16 bits of the 32-bit intermediate
results are written to the destination operand. The PMULLW instruction multiplies the signed words in the
destination operand with the signed words in the source operand. The low-order 16 bits of the 32-bit
intermediate results are written to the destination operand.
SSE2 extended PMULHUW/PMULHW/PMULLW to 128-bit operations and adds PMULUDQ.
The PMULUDQ instruction performs an unsigned multiply on the lower pair of double-word operands
within 64-bit chunks from the two sources; the full 64-bit result from each multiplication is returned to
the destination register.
This instruction is added in both a 64-bit and 128-bit version; the latter performs 2 independent operations, on the low and high halves of a 128-bit register.
SSE4.1 adds 128-bit operations of PMULDQ and PMULLD. The PMULLD instruction multiplies the signed
dwords in the destination operand with the signed dwords in the source operand. The low-order 32 bits
of the 64-bit intermediate results are written to the destination operand. The PMULDQ instruction multiplies the two low-order, signed dwords in the destination operand with the two low-order, signed dwords
in the source operand and stores two 64-bit results in the destination operand.
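A minimal intrinsics sketch of these multiplies (the function names are illustrative assumptions):

#include <smmintrin.h>   /* SSE4.1; PMULUDQ itself only requires SSE2 */

__m128i mul_even_u32(__m128i a, __m128i b)
{
    return _mm_mul_epu32(a, b);    /* PMULUDQ: two 64-bit products of the low dwords
                                      within each 64-bit chunk */
}

__m128i mul_lo32(__m128i a, __m128i b)
{
    return _mm_mullo_epi32(a, b);  /* PMULLD: low 32 bits of four signed products */
}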

5.6.9 Packed Sum of Absolute Differences

The PSADBW instruction computes the absolute value of the difference of unsigned bytes for either two
SIMD registers, or one SIMD register and a memory location. The differences of 8 pairs of unsigned bytes
are then summed to produce a word result in the lower 16-bit field, and the upper three words are set to
zero. With SSE2, PSADBW is extended to compute two word results.
The subtraction operation presented above is an absolute difference. That is, T = ABS(X-Y). Byte values
are stored in temporary space, all values are summed together, and the result is written to the lower
word of the destination register.
Motion estimation involves searching reference frames for best matches. The sum of absolute differences (SAD) on two blocks of pixels is a common ingredient in video processing algorithms to locate matching blocks of pixels. PSADBW can be used as a building block for finding best matches by way of calculating SAD results on 4x4, 8x4, 8x8 blocks of pixels.
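A minimal sketch of a 16-byte SAD with intrinsics (the function name is illustrative; pointers are assumed 16-byte aligned):

#include <emmintrin.h>

unsigned int sad16(const unsigned char *p, const unsigned char *q)
{
    __m128i x = _mm_load_si128((const __m128i *)p);
    __m128i y = _mm_load_si128((const __m128i *)q);
    __m128i s = _mm_sad_epu8(x, y);   /* PSADBW: word sums for bytes 0-7 and 8-15 */
    return (unsigned int)(_mm_cvtsi128_si32(s) +
                          _mm_cvtsi128_si32(_mm_srli_si128(s, 8)));
}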

5.6.10 MPSADBW and PHMINPOSUW

The MPSADBW instruction in SSE4.1 performs eight SAD operations. Each SAD operation produces a word result from 4 pairs of unsigned bytes. With 8 SAD results in an XMM register, PHMINPOSUW can help search for the best match between eight 4x4 pixel blocks.
For motion estimation algorithms, MPSADBW is likely to improve over PSADBW in several ways:
•  Simplified data movement to construct packed data format for SAD computation on pixel blocks.
•  Higher throughput in terms of SAD results per iteration (fewer iterations required per frame).
•  MPSADBW results are amenable to efficient search using PHMINPOSUW.

Examples of MPSADBW vs. PSADBW for 4x4 and 8x8 block search can be found in the white paper listed
in the reference section of Chapter 1.
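A minimal sketch of the MPSADBW/PHMINPOSUW pairing (the names are illustrative assumptions): eight SADs of the low 4-byte block of ref against the eight overlapping 4-byte windows of src, followed by a horizontal minimum search.

#include <smmintrin.h>   /* SSE4.1 */

unsigned int best_window(__m128i src, __m128i ref)
{
    __m128i sads = _mm_mpsadbw_epu8(src, ref, 0);  /* eight word SAD results */
    __m128i best = _mm_minpos_epu16(sads);         /* PHMINPOSUW: word 0 = minimum,
                                                      word 1 = its index */
    return ((unsigned int)_mm_cvtsi128_si32(best) >> 16) & 0x7;  /* window index 0-7 */
}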

5.6.11 Packed Average (Byte/Word)

The PAVGB and PAVGW instructions add the unsigned data elements of the source operand to the
unsigned data elements of the destination register, along with a carry-in. The results of the addition are
then independently shifted to the right by one bit position. The high order bits of each element are filled
with the carry bits of the corresponding sum.
The destination operand is an SIMD register. The source operand can either be an SIMD register or a
memory operand.
The PAVGB instruction operates on packed unsigned bytes and the PAVGW instruction operates on
packed unsigned words.
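In intrinsics form, the per-element computation is (a + b + 1) >> 1; a one-line sketch (the function name is an illustrative assumption):

#include <emmintrin.h>

__m128i avg_pixels(__m128i a, __m128i b)
{
    return _mm_avg_epu8(a, b);   /* PAVGB: (a + b + 1) >> 1 per unsigned byte,
                                    carry-in preserved, no overflow */
}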

5.6.12 Complex Multiply by a Constant

Complex multiplication is an operation which requires four multiplications and two additions. This is exactly how the PMADDWD instruction operates. In order to use this instruction, you need to format the data into multiple 16-bit values. The real and imaginary components should be 16 bits each. Consider Example 5-29, which assumes that the 64-bit MMX registers are being used:
•  Let the input data be DR and DI, where DR is the real component of the data and DI is the imaginary component of the data.
•  Format the constant complex coefficients in memory as four 16-bit values [CR -CI CI CR]. Remember to load the values into the MMX register using MOVQ.
•  The real component of the complex product is PR = DR*CR - DI*CI, and the imaginary component of the complex product is PI = DR*CI + DI*CR.


•  The output is a packed doubleword. If needed, a pack instruction can be used to convert the result to 16-bit (thereby matching the format of the input).

Example 5-29. Complex Multiply by a Constant
; Input:   MM0   complex value, Dr, Di
;          MM1   constant complex coefficient in the form
;                [Cr -Ci Ci Cr]
; Output:  MM0   two 32-bit dwords containing [Pr Pi]
punpckldq mm0, mm0    ; makes [dr di dr di]
pmaddwd   mm0, mm1    ; done, the result is
                      ; [(Dr*Cr-Di*Ci)(Dr*Ci+Di*Cr)]

5.6.13 Packed 64-bit Add/Subtract

The PADDQ/PSUBQ instructions add/subtract quad-word operands within each 64-bit chunk from the two
sources; the 64-bit result from each computation is written to the destination register. Like the integer
ADD/SUB instruction, PADDQ/PSUBQ can operate on either unsigned or signed (two’s complement notation) integer operands.
When an individual result is too large to be represented in 64-bits, the lower 64-bits of the result are
written to the destination operand and therefore the result wraps around. These instructions are added
in both a 64-bit and 128-bit version; the latter performs 2 independent operations, on the low and high
halves of a 128-bit register.

5.6.14 128-bit Shifts

The PSLLDQ/PSRLDQ instructions shift the first operand to the left/right by the number of bytes specified
by the immediate operand. The empty low/high-order bytes are cleared (set to zero).
If the value specified by the immediate operand is greater than 15, then the destination is set to all zeros.
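A brief intrinsics sketch (the function names are illustrative; the byte count is an immediate and must be a compile-time constant):

#include <emmintrin.h>

__m128i shift_right_4_bytes(__m128i x)
{
    return _mm_srli_si128(x, 4);   /* PSRLDQ: high 4 bytes become zero */
}

__m128i shift_left_4_bytes(__m128i x)
{
    return _mm_slli_si128(x, 4);   /* PSLLDQ: low 4 bytes become zero */
}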

5.6.15 PTEST and Conditional Branch

SSE4.1 offers the PTEST instruction, which can be used in vectorizing loops with conditional branches. PTEST is a 128-bit version of the general-purpose instruction TEST. The ZF and CF fields of the EFLAGS register are modified as a result of PTEST.
Example 5-30(a) depicts a loop that requires a conditional branch to handle the special case of divide-by-zero. In order to vectorize such a loop, any iteration that may encounter divide-by-zero must be treated outside the vectorizable iterations.


Example 5-30. Using PTEST to Separate Vectorizable and non-Vectorizable Loop Iterations
(a) /* Loops requiring infrequent exception handling */
float a[CNT];
unsigned int i;
for (i=0;i

if (a[i] > b[i])
{ a[i] += b[i]; }
else
{ a[i] -= b[i]; }
}

(b) /* Vectorize Condition Flow with PTEST, BLENDVPS */
xor     eax, eax
lp:
movaps  xmm0, a[eax]
movaps  xmm1, b[eax]
movaps  xmm2, xmm0
// compare a and b values
cmpgtps xmm0, xmm1
// xmm3 - will hold -b
movaps  xmm3, [SIGN_BIT_MASK]
xorps   xmm3, xmm1

Example 5-31. Using PTEST and Variable BLEND to Vectorize Heterogeneous Loops (Contd.)
// select values for the add operation,
// true condition produces a+b, false will become a+(-b)
// blend mask is xmm0
blendvps xmm1, xmm3, xmm0
addps   xmm2, xmm1
movaps  a[eax], xmm2
add     eax, 16
cmp     eax, CNT
jnz     lp
Example 5-31(b) depicts an assembly sequence that uses BLENDVPS and PTEST to vectorize the
handling of heterogeneous computations occurring across four consecutive loop iterations.

5.6.17 Vectorization of Control Flows in Nested Loops

The PTEST and BLENDVPx instructions can be used as building blocks to vectorize more complex control-flow statements, where each control-flow statement creates a "working" mask used as a predicate under which the conditional code operates.
The Mandelbrot-set map evaluation is useful to illustrate a situation with more complex control flows in nested loops. The Mandelbrot-set is a set of height values mapped to a 2-D grid. The height value is the number of Mandelbrot iterations (defined over the complex number space as In = In-1^2 + I0) needed to get |In| > 2. It is common to limit the map generation by setting some maximum threshold value of the height; all other points are assigned a height equal to the threshold. Example 5-32 shows an example of Mandelbrot map evaluation implemented in C.

Example 5-32. Baseline C Code for Mandelbrot Set Map Evaluation
#define DIMX (64)
#define DIMY (64)
#define X_STEP (0.5f/DIMX)
#define Y_STEP (0.4f/(DIMY/2))
int map[DIMX][DIMY];
void mandelbrot_C()
{ int i,j;
  float x,y;
  for (i=0,x=-1.8f;i<DIMX;i++,x+=X_STEP)
  {
    for (j=DIMY/2,y=-0.2f;j<DIMY;j++,y+=Y_STEP)
    {
      float sx,sy;
      int iter = 0;
      sx = x;
      sy = y;
      while (iter < 256)
      {
        if (sx*sx + sy*sy >= 4.0f)
            break;
        float old_sx = sx;
        sx = x + sx*sx - sy*sy;
        sy = y + 2*old_sx*sy;
        iter++;
      }
      map[i][j] = iter;
    }
  }
}
Example 5-33 shows a vectorized implementation of Mandelbrot map evaluation. Vectorization is not done on the inner-most loop, because the presence of the break statement implies the iteration count will vary from one pixel to the next. The vectorized version takes into account the parallel nature of 2-D, vectorizes over the Y values of 4 consecutive pixels, and conditionally handles three scenarios:
•  In the inner-most iteration, when all 4 pixels do not reach the break condition, vectorize 4 pixels.
•  When one or more pixels reached the break condition, use blend intrinsics to accumulate the complex height vector for the remaining pixels not reaching the break condition, and continue the inner iteration of the complex height vector.
•  When all four pixels reached the break condition, exit the inner loop.

Example 5-33. Vectorized Mandelbrot Set Map Evaluation Using SSE4.1 Intrinsics
__declspec(align(16)) float _INIT_Y_4[4] = {0,Y_STEP,2*Y_STEP,3*Y_STEP};
F32vec4 _F_STEP_Y(4*Y_STEP);
I32vec4 _I_ONE_ = _mm_set1_epi32(1);
F32vec4 _F_FOUR_(4.0f);
F32vec4 _F_TWO_(2.0f);
void mandelbrot_C()
{ int i,j;
F32vec4 x,y;
for (i = 0, x = F32vec4(-1.8f); i < DIMX; i ++, x += F32vec4(X_STEP))
{
for (j = DIMY/2, y = F32vec4(-0.2f) +
*(F32vec4*)_INIT_Y_4; j < DIMY; j += 4, y += _F_STEP_Y)
{
F32vec4 sx,sy;
I32vec4 iter = _mm_setzero_si128();
int scalar_iter = 0;
sx = x;
sy = y;

while (scalar_iter < 256)
{
int mask = 0;
F32vec4 old_sx = sx;
__m128 vmask = _mm_cmpnlt_ps(sx*sx + sy*sy,_F_FOUR_);
// if all data points in our vector are hitting the “exit” condition,
// the vectorized loop can exit
if (_mm_test_all_ones(_mm_castps_si128(vmask)))
break;
// if none of the data points are out, we don't need the extra code which blends the results
if (_mm_test_all_zeros(_mm_castps_si128(vmask),
_mm_castps_si128(vmask)))
{
sx = x + sx*sx - sy*sy;
sy = y + _F_TWO_*old_sx*sy;
iter += _I_ONE_;
}
else
{
// Blended flavour of the code, this code blends values from previous iteration with the values
// from current iteration. Only values which did not hit the “exit” condition are being stored;
// values which are already “out” are maintaining their value
sx = _mm_blendv_ps(x + sx*sx - sy*sy,sx,vmask);
sy = _mm_blendv_ps(y + _F_TWO_*old_sx*sy,sy,vmask);
iter = I32vec4(_mm_blendv_epi8(iter + _I_ONE_,
iter,_mm_castps_si128(vmask)));
}
scalar_iter++;
}
_mm_storeu_si128((__m128i*)&map[i][j],iter);
}
}
}

5.7 MEMORY OPTIMIZATIONS

You can improve memory access using the following techniques:
•  Avoiding partial memory accesses.
•  Increasing the bandwidth of memory fills and video fills.
•  Prefetching data with Streaming SIMD Extensions. See Chapter 7, "Optimizing Cache Usage."

MMX registers and XMM registers allow you to move large quantities of data without stalling the
processor. Instead of loading single array values that are 8, 16, or 32 bits long, consider loading the
values in a single quadword or double quadword and then incrementing the structure or array pointer
accordingly.
Any data that will be manipulated by SIMD integer instructions should be loaded using either:
•  An SIMD integer instruction that loads a 64-bit or 128-bit operand (for example: MOVQ MM0, M64).
•  The register-memory form of any SIMD integer instruction that operates on a quadword or double quadword memory operand (for example: PMADDW MM0, M64).
All SIMD data should be stored using an SIMD integer instruction that stores a 64-bit or 128-bit operand (for example: MOVQ M64, MM0).


The goal of the above recommendations is twofold. First, the loading and storing of SIMD data is more
efficient using the larger block sizes. Second, following the above recommendations helps to avoid
mixing of 8-, 16-, or 32-bit load and store operations with SIMD integer technology load and store operations to the same SIMD data.
This prevents situations in which small loads follow large stores to the same area of memory, or large
loads follow small stores to the same area of memory. The Pentium II, Pentium III, and Pentium 4 processors may stall in such situations. See Chapter 3 for details.

5.7.1 Partial Memory Accesses

Consider a case with a large load after a series of small stores to the same area of memory (beginning at
memory address MEM). The large load stalls in the case shown in Example 5-34.

Example 5-34. A Large Load after a Series of Small Stores (Penalty)
mov  mem, eax      ; store dword to address "mem"
mov  mem + 4, ebx  ; store dword to address "mem + 4"
:
:
movq mm0, mem      ; load qword at address "mem", stalls

MOVQ must wait for the stores to write memory before it can access all data it requires. This stall can also
occur with other data types (for example, when bytes or words are stored and then words or doublewords are read from the same area of memory). When you change the code sequence as shown in
Example 5-35, the processor can access the data without delay.

Example 5-35. Accessing Data Without Delay
movd  mm1, ebx     ; build data into a qword first
movd  mm2, eax     ; before storing it to memory
psllq mm1, 32
por   mm1, mm2
movq  mem, mm1     ; store SIMD variable to "mem" as a qword
:
:
movq  mm0, mem     ; load qword SIMD "mem", no stall

Consider a case with a series of small loads after a large store to the same area of memory (beginning at
memory address MEM), as shown in Example 5-36. Most of the small loads stall because they are not
aligned with the store. See Section 3.6.5, “Store Forwarding,” for details.

Example 5-36. A Series of Small Loads After a Large Store
movq mem, mm0      ; store qword to address "mem"
:
:
mov  bx, mem + 2   ; load word at "mem + 2" stalls
mov  cx, mem + 4   ; load word at "mem + 4" stalls

The word loads must wait for the quadword store to write to memory before they can access the data
they require. This stall can also occur with other data types (for example: when doublewords or words
are stored and then words or bytes are read from the same area of memory).

When you change the code sequence as shown in Example 5-37, the processor can access the data
without delay.

Example 5-37. Eliminating Delay for a Series of Small Loads after a Large Store
movq  mem, mm0     ; store qword to address "mem"
:
:
movq  mm1, mem     ; load qword at address "mem"
movd  eax, mm1     ; transfer "mem + 2" to eax from
                   ; MMX register, not memory
psrlq mm1, 32
shr   eax, 16
movd  ebx, mm1
and   ebx, 0ffffh  ; transfer "mem + 4" to bx from
                   ; MMX register, not memory

These transformations, in general, increase the number of instructions required to perform the desired
operation. For Pentium II, Pentium III, and Pentium 4 processors, the benefit of avoiding forwarding problems outweighs the performance penalty due to the increased number of instructions.

5.7.1.1 Supplemental Techniques for Avoiding Cache Line Splits

Video processing applications sometimes cannot avoid loading data from memory addresses that are not aligned to 16-byte boundaries. An example of this situation is when each line in a video frame is averaged by shifting horizontally half a pixel.
Example 5-38 shows a common operation in video processing that loads data from a memory address not aligned to a 16-byte boundary. As video processing traverses each line in the video frame, it experiences a cache line split for each 64-byte chunk loaded from memory.

Example 5-38. An Example of Video Processing with Cache Line Splits
// Average half-pels horizontally (on the "x" axis),
// from one reference frame only.
nextLinesLoop:
movdqu xmm0, XMMWORD PTR [edx]        // may not be 16B aligned
movdqu xmm1, XMMWORD PTR [edx+1]
movdqu xmm2, XMMWORD PTR [edx+eax]
movdqu xmm3, XMMWORD PTR [edx+eax+1]
pavgb  xmm0, xmm1
pavgb  xmm2, xmm3
movdqa XMMWORD PTR [ecx], xmm0
movdqa XMMWORD PTR [ecx+eax], xmm2
// (repeat ...)
SSE3 provides the LDDQU instruction for loading from memory addresses that are not 16-byte aligned. LDDQU is a special 128-bit unaligned load designed to avoid cache line splits. If the address of the load is aligned on a 16-byte boundary, LDDQU loads the 16 bytes requested. If the address of the load is not aligned on a 16-byte boundary, LDDQU loads a 32-byte block starting at the 16-byte aligned address immediately below the address of the load request. It then provides the requested 16 bytes. If the address is aligned on a 16-byte boundary, the effective number of memory requests is implementation dependent (one, or more).
LDDQU is designed for programming usage of loading data from memory without storing modified data back to the same address. Thus, the usage of LDDQU should be restricted to situations where no store-to-load forwarding is expected. For situations where store-to-load forwarding is expected, use regular store/load pairs (either aligned or unaligned based on the alignment of the data accessed).

Example 5-39. Video Processing Using LDDQU to Avoid Cache Line Splits
// Average half-pels horizontally (on the "x" axis),
// from one reference frame only.
nextLinesLoop:
lddqu xmm0, XMMWORD PTR [edx]         // may not be 16B aligned
lddqu xmm1, XMMWORD PTR [edx+1]
lddqu xmm2, XMMWORD PTR [edx+eax]
lddqu xmm3, XMMWORD PTR [edx+eax+1]
pavgb xmm0, xmm1
pavgb xmm2, xmm3
movdqa XMMWORD PTR [ecx], xmm0        // results stored elsewhere
movdqa XMMWORD PTR [ecx+eax], xmm2
// (repeat ...)

5.7.2 Increasing Bandwidth of Memory Fills and Video Fills

It is beneficial to understand how memory is accessed and filled. A memory-to-memory fill (for example
a memory-to-video fill) is defined as a 64-byte (cache line) load from memory which is immediately
stored back to memory (such as a video frame buffer).
The following are guidelines for obtaining higher bandwidth and shorter latencies for sequential memory
fills (video fills). These recommendations are relevant for all Intel architecture processors with MMX
technology and refer to cases in which the loads and stores do not hit in the first- or second-level cache.

5.7.2.1 Increasing Memory Bandwidth Using the MOVDQ Instruction

Loading any size data operand will cause an entire cache line to be loaded into the cache hierarchy. Thus, any size load looks more or less the same from a memory bandwidth perspective. However, using many smaller loads consumes more microarchitectural resources than using fewer, larger ones. Consuming too many resources can cause the processor to stall and reduce the bandwidth that the processor can request of the memory subsystem.
Using MOVDQ to store the data back to UC memory (or WC memory in some cases) instead of using 32-bit stores (for example, MOVD) will reduce by three-quarters the number of stores per memory fill cycle. As a result, using MOVDQ in memory fill cycles can achieve significantly higher effective bandwidth than using MOVD.
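A minimal sketch of a cache-line fill using 16-byte stores (the function name is illustrative; destination assumed 16-byte aligned): four MOVDQA stores cover a 64-byte line that would otherwise take sixteen MOVD-size stores.

#include <emmintrin.h>

void fill_lines(unsigned char *dst, __m128i v, size_t lines)
{
    size_t i;
    for (i = 0; i < lines; i++, dst += 64) {
        _mm_store_si128((__m128i *)(dst +  0), v);   /* four MOVDQA stores */
        _mm_store_si128((__m128i *)(dst + 16), v);   /* per 64-byte line */
        _mm_store_si128((__m128i *)(dst + 32), v);
        _mm_store_si128((__m128i *)(dst + 48), v);
    }
}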

5.7.2.2 Increasing Memory Bandwidth by Loading and Storing to and from the Same DRAM Page

DRAM is divided into pages, which are not the same as operating system (OS) pages. The size of a DRAM
page is a function of the total size of the DRAM and the organization of the DRAM. Page sizes of several
Kilobytes are common. Like OS pages, DRAM pages are constructed of sequential addresses. Sequential
memory accesses to the same DRAM page have shorter latencies than sequential accesses to different
DRAM pages.
In many systems the latency for a page miss (that is, an access to a different page instead of the page previously accessed) can be twice as large as the latency of a memory page hit (access to the same page as the previous access). Therefore, if the loads and stores of the memory fill cycle are to the same DRAM page, a significant increase in the bandwidth of the memory fill cycles can be achieved.

5.7.2.3 Increasing UC and WC Store Bandwidth by Using Aligned Stores

Using aligned stores to fill UC or WC memory will yield higher bandwidth than using unaligned stores. If a UC store or some WC stores cross a cache line boundary, a single store will result in two transactions on the bus, reducing the efficiency of the bus transactions. By aligning the stores to the size of the stores, you eliminate the possibility of crossing a cache line boundary, and the stores will not be split into separate transactions.

5.7.3 Reverse Memory Copy

Copying blocks of memory from a source location to a destination location in reverse order presents a challenge for software to make the most of the machine's capabilities while avoiding microarchitectural hazards. The basic, un-optimized C code is shown in Example 5-40.
The simple C code in Example 5-40 is sub-optimal, because it loads and stores one byte at a time (even in situations where the hardware prefetcher might have brought data in from system memory to cache).

Example 5-40. Un-optimized Reverse Memory Copy in C
unsigned char* src;
unsigned char* dst;
while (len > 0)
{
*dst-- = *src++;
--len;
}


Using MOVDQA or MOVDQU, software can load and store up to 16 bytes at a time, but must either ensure the 16-byte alignment requirement (if using MOVDQA) or minimize the delays MOVDQU may encounter if data spans a cache line boundary.

Figure 5-8. Data Alignment of Loads and Stores in Reverse Memory Copy
Given the general problem of an arbitrary byte count to copy, arbitrary offsets of the leading source and destination bytes, and address alignment relative to 16-byte and cache line boundaries, the alignment situations can be a bit complicated. Figure 5-8 (a) and (b) depict the alignment situations of reverse memory copy of N bytes.
The general guidelines for dealing with unaligned loads and stores are (in order of importance):
•  Avoid stores that span cache line boundaries.
•  Minimize the number of loads that span cache line boundaries.
•  Favor 16-byte aligned loads and stores over unaligned versions.

In Figure 5-8 (a), the guidelines above can be applied to the reverse memory copy problem as follows:
1. Peel off several leading destination bytes until the destination aligns on a 16-byte boundary, then the ensuing destination bytes can be written to using MOVAPS until the remaining byte count falls below 16 bytes.
2. After the leading source bytes have been peeled (corresponding to step 1 above), the source alignment in Figure 5-8 (a) allows loading 16 bytes at a time using MOVAPS until the remaining byte count falls below 16 bytes.
Switching the byte ordering of each 16 bytes of data can be accomplished using PSHUFB with a 16-byte mask. The pertinent code sequence is shown in Example 5-41.


Example 5-41. Using PSHUFB to Reverse Byte Ordering 16 Bytes at a Time
__declspec(align(16)) static const unsigned char BswapMask[16] = {15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0};
mov esi, src
mov edi, dst
mov ecx, len
movaps xmm7, BswapMask
start:
movdqa xmm0, [esi]
pshufb xmm0, xmm7
movdqa [edi-16], xmm0
sub edi, 16
add esi, 16
sub ecx, 16
cmp ecx, 32
jae start
//handle left-overs

In Figure 5-8 (b), we also start by peeling the destination bytes:
1. Peel off several leading destination bytes until the destination aligns on a 16-byte boundary, then the ensuing destination bytes can be written to using MOVAPS until the remaining byte count falls below 16 bytes. However, the remaining source bytes are not aligned on 16-byte boundaries; replacing MOVDQA with MOVDQU for the loads will inevitably run into cache line splits.
2. To achieve higher data throughput than loading unaligned bytes with MOVDQU, the 16 bytes of data targeted to each 16 bytes of aligned destination addresses can be assembled using two aligned loads. This technique is illustrated in Figure 5-9.

[Figure 5-9. A Technique to Avoid Cacheline Split Loads in Reverse Memory Copy Using Two Aligned Loads: Step 1 peels off leading bytes; Step 2 loads two aligned 16-byte blocks, reverses the byte order in a register, and stores 16 aligned bytes.]
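A minimal intrinsics sketch of the two-aligned-loads step in Figure 5-9 (illustrative, not the manual's code; it assumes a fixed source misalignment of 4 bytes, because the PALIGNR shift count must be a compile-time immediate):

#include <tmmintrin.h>    /* SSSE3: PALIGNR (_mm_alignr_epi8), PSHUFB */

__declspec(align(16)) static const unsigned char BswapMask[16] =
    {15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0};

#define MISALIGN 4        /* assumed source offset within its 16-byte block */

/* Assemble the 16 source bytes starting at base+MISALIGN from two
   aligned loads, then reverse their order for the aligned store. */
static __m128i load_reverse16(const unsigned char *base /* 16-byte aligned */)
{
    __m128i lo = _mm_load_si128((const __m128i *)base);         /* bytes 0..15  */
    __m128i hi = _mm_load_si128((const __m128i *)(base + 16));  /* bytes 16..31 */
    __m128i v  = _mm_alignr_epi8(hi, lo, MISALIGN);             /* bytes 4..19  */
    return _mm_shuffle_epi8(v, *(const __m128i *)BswapMask);    /* reversed     */
}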


5.8 CONVERTING FROM 64-BIT TO 128-BIT SIMD INTEGERS

SSE2 defines a superset of the 64-bit integer instructions available in MMX technology; the operation of
the extended instructions remains the same, but the superset simply operates on data that is twice as wide.
This simplifies porting of 64-bit integer applications. However, there are a few considerations:
• Computation instructions that use a memory operand which may not be aligned to a 16-byte
boundary must be replaced with an unaligned 128-bit load (MOVDQU) followed by the same
computation operation using register operands. Use of 128-bit integer computation instructions with
memory operands that are not 16-byte aligned results in a #GP. Unaligned 128-bit loads and stores
are not as efficient as the corresponding aligned versions; this can reduce the performance gain when
using the 128-bit SIMD integer extensions.
• General guidelines on the alignment of memory operands are:
— The greatest performance gains can be achieved when all memory streams are 16-byte aligned.
— Reasonable performance gains are possible if roughly half of all memory streams are 16-byte
aligned and the other half are not.
— Little or no performance gain may result if all memory streams are not aligned to 16 bytes. In
this case, use of the 64-bit SIMD integer instructions may be preferable.
• Loop counters need to be updated because each 128-bit integer instruction operates on twice the
amount of data as its 64-bit integer counterpart.
• Extension of the PSHUFW instruction (shuffle word across a 64-bit integer operand) to a full 128-bit
operand is emulated by a combination of PSHUFHW, PSHUFLW, and PSHUFD (see the sketch after
this list).
• The 64-bit shift-by-bit instructions (PSRLQ, PSLLQ) are extended to 128 bits by either:
— Use of PSRLQ and PSLLQ, along with masking logic operations.
— Rewriting the code sequence to use the PSRLDQ and PSLLDQ instructions (shift double quad-word
operand by bytes).
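As an illustration of the PSHUFW emulation (a minimal sketch, not from the manual; the helper names are hypothetical), PSHUFLW/PSHUFHW apply a 4-word shuffle pattern to each 64-bit half, and PSHUFD reorders dwords when the pattern must cross the halves:

#include <emmintrin.h>    /* SSE2 */

/* Apply the pattern of PSHUFW mm, mm, 0x1B (reverse the four words)
   independently to each 64-bit half of an XMM register. */
static __m128i pshufw128_reverse_words(__m128i x)
{
    x = _mm_shufflelo_epi16(x, 0x1B);   /* PSHUFLW: words 0..3 */
    x = _mm_shufflehi_epi16(x, 0x1B);   /* PSHUFHW: words 4..7 */
    return x;
}

/* PSHUFD handles movement across the two halves, e.g. swapping them. */
static __m128i swap_qwords(__m128i x)
{
    return _mm_shuffle_epi32(x, 0x4E);  /* dword order 2,3,0,1 */
}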

5.8.1 SIMD Optimizations and Microarchitectures

Pentium M, Intel Core Solo and Intel Core Duo processors have a different microarchitecture than Intel
NetBurst microarchitecture. The following sections discuss optimizing SIMD code that targets Intel Core
Solo and Intel Core Duo processors.
On Intel Core Solo and Intel Core Duo processors, LDDQU behaves identically to MOVDQU by loading 16
bytes of data irrespective of address alignment.

5.8.1.1 Packed SSE2 Integer versus MMX Instructions

In general, 128-bit SIMD integer instructions should be favored over 64-bit MMX instructions on Intel
Core Solo and Intel Core Duo processors. This is because:
• Decoder bandwidth and micro-op flows are improved relative to the Pentium M processor.
• The wider XMM registers can benefit code that is limited by either decoder bandwidth or execution
latency. XMM registers provide twice the space to store data for in-flight execution. Wider XMM
registers can facilitate loop unrolling, or reduce loop overhead by halving the number of loop
iterations.

In microarchitectures prior to Intel Core microarchitecture, the execution throughput of 128-bit SIMD
integer operations is basically the same as for 64-bit MMX operations. Some shuffle/unpack/shift operations
do not benefit from the front-end improvements. The net impact of using 128-bit SIMD integer instructions
on Intel Core Solo and Intel Core Duo processors is likely to be slightly positive overall, but there
may be a few situations where their use will generate an unfavorable performance impact.


Intel Core microarchitecture generally executes 128-bit SIMD instructions more efficiently than previous
microarchitectures in terms of latency and throughput, so many of the limitations specific to Intel Core Duo
and Intel Core Solo processors do not apply. The same is true of Intel Core microarchitecture relative to
Intel NetBurst microarchitecture.
Enhanced Intel Core microarchitecture provides even more powerful 128-bit SIMD execution capabilities
and a more comprehensive set of SIMD instruction extensions than Intel Core microarchitecture. The
integer SIMD instructions offered by SSE4.1 operate on 128-bit XMM registers only. All of this strongly
encourages software to favor 128-bit vectorizable code to take advantage of processors based on
Enhanced Intel Core microarchitecture and Intel Core microarchitecture.

5.8.1.2 Work-around for False Dependency Issue

In processors based on Intel microarchitecture code name Nehalem, using PMOVSX and PMOVZX instructions
to combine data type conversion and data movement in the same instruction will create a false
dependency due to hardware causes. A simple work-around to avoid the false dependency issue is to use
PMOVSX or PMOVZX solely for the data type conversion, and to issue a separate instruction to move the
data to the destination or from the origin.

Example 5-42. PMOVSX/PMOVZX Work-around to Avoid False Dependency
// Issuing the instruction below will create a false dependency on xmm0:
pmovzxbd xmm0, dword ptr [eax]
// The instruction above may be blocked if xmm0 is updated by other instructions in flight.
................................................................
// Alternate solution to avoid the false dependency:
movd xmm0, dword ptr [eax]   // OOO hardware can hoist loads to hide latency
pmovzxbd xmm0, xmm0
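The same work-around expressed with intrinsics (a minimal sketch; the helper name is illustrative):

#include <smmintrin.h>    /* SSE4.1 */

/* MOVD first (no dependency on the old xmm contents), then a
   register-to-register PMOVZXBD for the type conversion. */
static __m128i load_zext_bytes_to_dwords(const unsigned char *p)
{
    __m128i v = _mm_cvtsi32_si128(*(const int *)p);  /* movd     */
    return _mm_cvtepu8_epi32(v);                     /* pmovzxbd */
}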

5.9 TUNING PARTIALLY VECTORIZABLE CODE

Some loop-structured code is more difficult to vectorize than others. Example 5-43 depicts a loop
carrying out a table look-up operation and some arithmetic computation.

Example 5-43. Table Look-up Operations in C Code
// pIn1: integer input array.
// pOut: integer output array.
// count: size of array.
// LookUpTable: look-up table of integer values.
// TABLE_SIZE: size of the look-up table.
for (unsigned i=0; i < count; i++)
{
    pOut[i] = ( ( LookUpTable[pIn1[i] % TABLE_SIZE] + pIn1[i] + 17 ) | 17 ) % 256;
}
Although some of the arithmetic computations and the input/output data array accesses in each iteration
can be easily vectorized, the table look-up via an index array is not. This leads to different approaches to
tuning. A compiler can take a scalar approach and execute each iteration sequentially. Hand-tuning of such
loops may use a couple of different techniques to handle the non-vectorizable table look-up operation.
One vectorization technique is to load the input data for four iterations at once, then use SSE2 instructions
to shift individual indices out of an XMM register and carry out the table look-ups sequentially. The shift
technique is depicted in Example 5-44. Another technique is to use PEXTRD in SSE4.1 to extract each index
from an XMM register directly and then carry out the table look-ups sequentially. The PEXTRD technique is
depicted in Example 5-45.

Example 5-44. Shift Techniques on Non-Vectorizable Table Look-up
int modulo[4] = {256-1, 256-1, 256-1, 256-1};
int c[4] = {17, 17, 17, 17};
mov esi, pIn1
mov ebx, pOut
mov ecx, count
mov edx, pLookUpTablePTR
movaps xmm6, modulo
movaps xmm5, c
lloop:
// vectorizable multiple consecutive data accesses
movaps xmm4, [esi]                  // read 4 indices from pIn1
movaps xmm7, xmm4
pand xmm7, tableSize
// Table look-up is not vectorizable; shift out one data element to look up the table one by one
movd eax, xmm7                      // get first index
movd xmm0, dword ptr[edx + eax*4]
psrldq xmm7, 4
movd eax, xmm7                      // get second index
movd xmm1, dword ptr[edx + eax*4]
psrldq xmm7, 4
movd eax, xmm7                      // get third index
movd xmm2, dword ptr[edx + eax*4]
psrldq xmm7, 4
movd eax, xmm7                      // get fourth index
movd xmm3, dword ptr[edx + eax*4]
// end of scalar part
// packing
movlhps xmm1, xmm3
psllq xmm1, 32
movlhps xmm0, xmm2
orps xmm0, xmm1
// end of packing
// vectorizable computation operations
paddd xmm0, xmm4                    // + pIn1[i]
paddd xmm0, xmm5                    // + 17
por xmm0, xmm5                      // | 17
andps xmm0, xmm6                    // % 256 (mask with 255)
movaps [ebx], xmm0
// end of vectorizable operation
add ebx, 16
add esi, 16
add edi, 16
sub ecx, 1
test ecx, ecx
jne lloop

Example 5-45. PEXTRD Techniques on Non-Vectorizable Table Look-up
int modulo[4] = {256-1, 256-1, 256-1, 256-1};
int c[4] = {17, 17, 17, 17};
mov esi, pIn1
mov ebx, pOut
mov ecx, count
mov edx, pLookUpTablePTR
movaps xmm6, modulo
movaps xmm5, c
lloop:
// vectorizable multiple consecutive data accesses
movaps xmm4, [esi]                  // read 4 indices from pIn1
movaps xmm7, xmm4
pand xmm7, tableSize
// Table look-up is not vectorizable; extract one data element to look up the table one by one
movd eax, xmm7                      // get first index
mov eax, [edx + eax*4]
movd xmm0, eax
pextrd eax, xmm7, 1                 // extract second index
mov eax, [edx + eax*4]
pinsrd xmm0, eax, 1
pextrd eax, xmm7, 2                 // extract third index
mov eax, [edx + eax*4]
pinsrd xmm0, eax, 2
pextrd eax, xmm7, 3                 // extract fourth index
mov eax, [edx + eax*4]
pinsrd xmm0, eax, 3
// end of scalar part
// packing not needed
// vectorizable computation operations
paddd xmm0, xmm4                    // + pIn1[i]
paddd xmm0, xmm5                    // + 17
por xmm0, xmm5                      // | 17
andps xmm0, xmm6                    // % 256 (mask with 255)
movaps [ebx], xmm0
add ebx, 16
add esi, 16
add edi, 16
sub ecx, 1
test ecx, ecx
jne lloop

The effectiveness of these two hand-tuning techniques on partially vectorizable code depends on the
relative cost of transforming the data layout format using various forms of pack and unpack instructions.
The shift technique requires additional instructions to pack scalar table values into an XMM register before
transitioning into vectorized arithmetic computations. The net performance gain or loss of this technique
will vary with the characteristics of different microarchitectures. The alternative PEXTRD technique uses
fewer instructions to extract each index and does not require extraneous packing of scalar data into a
packed SIMD data format before beginning the vectorized arithmetic computation.
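For reference, a minimal intrinsics sketch of the PEXTRD technique's scalar gather (illustrative, assuming SSE4.1; names are hypothetical):

#include <smmintrin.h>    /* SSE4.1 */

/* Gather four table entries addressed by the four dword indices in
   idx4, building the packed result with extract/insert. */
static __m128i lookup4(__m128i idx4, const int *LookUpTable)
{
    __m128i r = _mm_cvtsi32_si128(LookUpTable[_mm_cvtsi128_si32(idx4)]);
    r = _mm_insert_epi32(r, LookUpTable[_mm_extract_epi32(idx4, 1)], 1);
    r = _mm_insert_epi32(r, LookUpTable[_mm_extract_epi32(idx4, 2)], 2);
    r = _mm_insert_epi32(r, LookUpTable[_mm_extract_epi32(idx4, 3)], 3);
    return r;
}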

5.10 PARALLEL MODE AES ENCRYPTION AND DECRYPTION

To deliver optimal encryption and decryption throughput using AES-NI, software can optimize by re-ordering
the computations and working on multiple blocks in parallel. This can speed up encryption (and
decryption) in parallel modes of operation such as ECB, CTR, and CBC-Decrypt (in contrast to CBC-Encrypt,
which is a serial mode of operation). See details in “Recommendation for Block Cipher Modes of
Operation”; the Related Documentation section provides a pointer to this document.
In Intel microarchitecture code name Sandy Bridge, the AES round instructions (AESENC / AESENCLAST
/ AESDEC / AESDECLAST) have a throughput of one cycle and a latency of eight cycles. This allows independent
AES instructions for multiple blocks to be dispatched every cycle, if data can be provided sufficiently
fast. Compared to the prior Intel microarchitecture code name Westmere, where these
instructions have a throughput of two cycles and a latency of six cycles, the AES encryption/decryption
throughput can be significantly increased for parallel modes of operation.
To achieve optimal parallel operation with multiple blocks, write the AES software sequence so that it
computes one AES round on multiple blocks, using one round key, and then continues with the
subsequent round for multiple blocks, using the next round key (a minimal sketch of this round-interleaved
pattern follows below).
For such software optimization, you need to define the number of blocks that are processed in parallel.
In Intel microarchitecture code name Sandy Bridge, the optimal parallelization parameter is eight blocks,
compared to four blocks on the prior microarchitecture.
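A minimal intrinsics sketch of the round-interleaved pattern (illustrative, not the manual's code; a 4-block width is shown for brevity, and rk[] is assumed to hold the expanded AES-128 key schedule):

#include <wmmintrin.h>    /* AES-NI */

static void aes128_encrypt4(__m128i b[4], const __m128i rk[11])
{
    int i, r;
    for (i = 0; i < 4; i++)
        b[i] = _mm_xor_si128(b[i], rk[0]);          /* whitening          */
    for (r = 1; r < 10; r++)                        /* one round key ...  */
        for (i = 0; i < 4; i++)
            b[i] = _mm_aesenc_si128(b[i], rk[r]);   /* ... for all blocks */
    for (i = 0; i < 4; i++)
        b[i] = _mm_aesenclast_si128(b[i], rk[10]);  /* final round        */
}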

5.10.1 AES Counter Mode of Operation

Example 5-47 shows a function that implements the Counter Mode (CTR mode) of operation while
operating on eight blocks in parallel. The pseudo-code in Example 5-46 encrypts n data blocks of 16 bytes
each (PT[i]):


Example 5-46. Pseudo-Code Flow of AES Counter Mode Operation
CTRBLK := NONCE || IV || ONE
FOR i := 1 to n-1 DO
    CT[i] := PT[i] XOR AES(CTRBLK)   // CT[i] is the i-th ciphertext block
    CTRBLK := CTRBLK + 1
END
CT[n] := PT[n] XOR TRUNC(AES(CTRBLK))
Example 5-47 on the following pages shows the assembly implementation of the above flow, optimized for
Intel microarchitecture code name Sandy Bridge.

Example 5-47. AES128-CTR Implementation with Eight Blocks in Parallel
/*****************************************************************************/
/* This function encrypts an input buffer using AES in CTR mode.            */
/* The parameters:                                                          */
/* const unsigned char *in - pointer to the plaintext for encryption or     */
/*     the ciphertext for decryption                                        */
/* unsigned char *out - pointer to the buffer where the encrypted/decrypted */
/*     data will be stored                                                  */
/* const unsigned char ivec[8] - 8 bytes of the initialization vector       */
/* const unsigned char nonce[4] - 4 bytes of the nonce                      */
/* const unsigned long length - the length of the input in bytes            */
/* int number_of_rounds - number of AES rounds.                             */
/*     10 = AES128, 12 = AES192, 14 = AES256                                */
/* unsigned char *key_schedule - pointer to the AES key schedule            */
/*****************************************************************************/
//void AES_128_CTR_encrypt_parallelize_8_blocks_unrolled (
//    const unsigned char *in,
//    unsigned char *out,
//    const unsigned char ivec[8],
//    const unsigned char nonce[4],
//    const unsigned long length,
//    unsigned char *key_schedule)
.align 16,0x90
.align 16
ONE:
.quad 0x00000000,0x00000001
.align 16
FOUR:
.quad 0x00000004,0x00000004
.align 16
EIGHT:
.quad 0x00000008,0x00000008
.align 16
TWO_N_ONE:
.quad 0x00000002,0x00000001
.align 16
TWO_N_TWO:
.quad 0x00000002,0x00000002
.align 16
LOAD_HIGH_BROADCAST_AND_BSWAP:
.byte 15,14,13,12,11,10,9,8
.byte 15,14,13,12,11,10,9,8
.align 16
BSWAP_EPI_64:
.byte 7,6,5,4,3,2,1,0
.byte 15,14,13,12,11,10,9,8
.globl AES_CTR_encrypt
AES_CTR_encrypt:
# parameter 1: %rdi
# parameter 2: %rsi
# parameter 3: %rdx
# parameter 4: %rcx
# parameter 5: %r8
# parameter 6: %r9
# parameter 7: 8 + %rsp
movq %r8, %r10
movl 8(%rsp), %r12d
shrq $4, %r8
shlq $60, %r10
je NO_PARTS
addq $1, %r8
NO_PARTS:
movq %r8, %r10
shlq $61, %r10
shrq $61, %r10
pinsrq $1, (%rdx), %xmm0
pinsrd $1, (%rcx), %xmm0
psrldq $4, %xmm0
movdqa %xmm0, %xmm4
pshufb (LOAD_HIGH_BROADCAST_AND_BSWAP), %xmm4
paddq (TWO_N_ONE), %xmm4
movdqa %xmm4, %xmm1
paddq (TWO_N_TWO), %xmm4
movdqa %xmm4, %xmm2
paddq (TWO_N_TWO), %xmm4
movdqa %xmm4, %xmm3
paddq (TWO_N_TWO), %xmm4
pshufb (BSWAP_EPI_64), %xmm1
pshufb (BSWAP_EPI_64), %xmm2
pshufb (BSWAP_EPI_64), %xmm3
pshufb (BSWAP_EPI_64), %xmm4
shrq $3, %r8
je REMAINDER
subq $128, %rsi
subq $128, %rdi
LOOP:
addq $128, %rsi
addq $128, %rdi
movdqa %xmm0, %xmm7
movdqa %xmm0, %xmm8
movdqa %xmm0, %xmm9
movdqa %xmm0, %xmm10
movdqa %xmm0, %xmm11
movdqa %xmm0, %xmm12
movdqa %xmm0, %xmm13
movdqa %xmm0, %xmm14
shufpd $2, %xmm1, %xmm7
shufpd $0, %xmm1, %xmm8
shufpd $2, %xmm2, %xmm9
shufpd $0, %xmm2, %xmm10
shufpd $2, %xmm3, %xmm11
shufpd $0, %xmm3, %xmm12
shufpd $2, %xmm4, %xmm13
shufpd $0, %xmm4, %xmm14
pshufb (BSWAP_EPI_64), %xmm1
pshufb (BSWAP_EPI_64), %xmm2
pshufb (BSWAP_EPI_64), %xmm3
pshufb (BSWAP_EPI_64), %xmm4
movdqa (%r9), %xmm5
movdqa 16(%r9), %xmm6
paddq (EIGHT), %xmm1
paddq (EIGHT), %xmm2
paddq (EIGHT), %xmm3
paddq (EIGHT), %xmm4
pxor %xmm5, %xmm7
pxor %xmm5, %xmm8
pxor %xmm5, %xmm9
pxor %xmm5, %xmm10
pxor %xmm5, %xmm11
pxor %xmm5, %xmm12
pxor %xmm5, %xmm13
pxor %xmm5, %xmm14
pshufb (BSWAP_EPI_64), %xmm1
pshufb (BSWAP_EPI_64), %xmm2
pshufb (BSWAP_EPI_64), %xmm3
pshufb (BSWAP_EPI_64), %xmm4
aesenc %xmm6, %xmm7
aesenc %xmm6, %xmm8
aesenc %xmm6, %xmm9
aesenc %xmm6, %xmm10
aesenc %xmm6, %xmm11
aesenc %xmm6, %xmm12
aesenc %xmm6, %xmm13
aesenc %xmm6, %xmm14
movdqa 32(%r9), %xmm5
movdqa 48(%r9), %xmm6
aesenc %xmm5, %xmm7
aesenc %xmm5, %xmm8
aesenc %xmm5, %xmm9
aesenc %xmm5, %xmm10
aesenc %xmm5, %xmm11
aesenc %xmm5, %xmm12
aesenc %xmm5, %xmm13
aesenc %xmm5, %xmm14
aesenc %xmm6, %xmm7
aesenc %xmm6, %xmm8
aesenc %xmm6, %xmm9
aesenc %xmm6, %xmm10
aesenc %xmm6, %xmm11
aesenc %xmm6, %xmm12
aesenc %xmm6, %xmm13
aesenc %xmm6, %xmm14
movdqa 64(%r9), %xmm5
movdqa 80(%r9), %xmm6
aesenc %xmm5, %xmm7
aesenc %xmm5, %xmm8
aesenc %xmm5, %xmm9
aesenc %xmm5, %xmm10
aesenc %xmm5, %xmm11
aesenc %xmm5, %xmm12
aesenc %xmm5, %xmm13
aesenc %xmm5, %xmm14
aesenc %xmm6, %xmm7
aesenc %xmm6, %xmm8
aesenc %xmm6, %xmm9
aesenc %xmm6, %xmm10
aesenc %xmm6, %xmm11
aesenc %xmm6, %xmm12
aesenc %xmm6, %xmm13
aesenc %xmm6, %xmm14
movdqa 96(%r9), %xmm5
movdqa 112(%r9), %xmm6
aesenc %xmm5, %xmm7
aesenc %xmm5, %xmm8
aesenc %xmm5, %xmm9
aesenc %xmm5, %xmm10
aesenc %xmm5, %xmm11
aesenc %xmm5, %xmm12
aesenc %xmm5, %xmm13
aesenc %xmm5, %xmm14
aesenc %xmm6, %xmm7
aesenc %xmm6, %xmm8
aesenc %xmm6, %xmm9
aesenc %xmm6, %xmm10
aesenc %xmm6, %xmm11
aesenc %xmm6, %xmm12
aesenc %xmm6, %xmm13
aesenc %xmm6, %xmm14
movdqa 128(%r9), %xmm5
movdqa 144(%r9), %xmm6
movdqa 160(%r9), %xmm15
cmp $12, %r12d
aesenc %xmm5, %xmm7
aesenc %xmm5, %xmm8
aesenc %xmm5, %xmm9
aesenc %xmm5, %xmm10
aesenc %xmm5, %xmm11
aesenc %xmm5, %xmm12
aesenc %xmm5, %xmm13
aesenc %xmm5, %xmm14
aesenc %xmm6, %xmm7
aesenc %xmm6, %xmm8
aesenc %xmm6, %xmm9
aesenc %xmm6, %xmm10
aesenc %xmm6, %xmm11
aesenc %xmm6, %xmm12
aesenc %xmm6, %xmm13
aesenc %xmm6, %xmm14
jb LAST
movdqa 160(%r9), %xmm5
movdqa 176(%r9), %xmm6
movdqa 192(%r9), %xmm15
cmp $14, %r12d
aesenc %xmm5, %xmm7
aesenc %xmm5, %xmm8
aesenc %xmm5, %xmm9
aesenc %xmm5, %xmm10
aesenc %xmm5, %xmm11
aesenc %xmm5, %xmm12
aesenc %xmm5, %xmm13
aesenc %xmm5, %xmm14
aesenc %xmm6, %xmm7
aesenc %xmm6, %xmm8
aesenc %xmm6, %xmm9
aesenc %xmm6, %xmm10
aesenc %xmm6, %xmm11
aesenc %xmm6, %xmm12
aesenc %xmm6, %xmm13
aesenc %xmm6, %xmm14
jb LAST
movdqa 192(%r9), %xmm5
movdqa 208(%r9), %xmm6
movdqa 224(%r9), %xmm15
aesenc %xmm5, %xmm7
aesenc %xmm5, %xmm8
aesenc %xmm5, %xmm9
aesenc %xmm5, %xmm10
aesenc %xmm5, %xmm11
aesenc %xmm5, %xmm12
aesenc %xmm5, %xmm13
aesenc %xmm5, %xmm14
aesenc %xmm6, %xmm7
aesenc %xmm6, %xmm8
aesenc %xmm6, %xmm9
aesenc %xmm6, %xmm10
aesenc %xmm6, %xmm11
aesenc %xmm6, %xmm12
aesenc %xmm6, %xmm13
aesenc %xmm6, %xmm14
LAST:
aesenclast %xmm15, %xmm7
aesenclast %xmm15, %xmm8
aesenclast %xmm15, %xmm9
aesenclast %xmm15, %xmm10
aesenclast %xmm15, %xmm11
aesenclast %xmm15, %xmm12
aesenclast %xmm15, %xmm13
aesenclast %xmm15, %xmm14
pxor (%rdi), %xmm7
pxor 16(%rdi), %xmm8
pxor 32(%rdi), %xmm9
pxor 48(%rdi), %xmm10
pxor 64(%rdi), %xmm11
pxor 80(%rdi), %xmm12
pxor 96(%rdi), %xmm13
pxor 112(%rdi), %xmm14
dec %r8
movdqu %xmm7, (%rsi)
movdqu %xmm8, 16(%rsi)
movdqu %xmm9, 32(%rsi)
movdqu %xmm10, 48(%rsi)
movdqu %xmm11, 64(%rsi)
movdqu %xmm12, 80(%rsi)
movdqu %xmm13, 96(%rsi)
movdqu %xmm14, 112(%rsi)
jne LOOP
addq $128,%rsi
addq $128,%rdi
REMAINDER:
cmp $0, %r10
je END
shufpd $2, %xmm1, %xmm0
IN_LOOP:
movdqa %xmm0, %xmm11
pshufb (BSWAP_EPI_64), %xmm0
pxor (%r9), %xmm11
paddq (ONE), %xmm0
aesenc 16(%r9), %xmm11
aesenc 32(%r9), %xmm11
pshufb (BSWAP_EPI_64), %xmm0
aesenc 48(%r9), %xmm11
aesenc 64(%r9), %xmm11
aesenc 80(%r9), %xmm11
aesenc 96(%r9), %xmm11
aesenc 112(%r9), %xmm11
aesenc 128(%r9), %xmm11
aesenc 144(%r9), %xmm11
movdqa 160(%r9), %xmm2
cmp $12, %r12d
jb IN_LAST
aesenc 160(%r9), %xmm11
aesenc 176(%r9), %xmm11
movdqa 192(%r9), %xmm2
cmp $14, %r12d
jb IN_LAST
aesenc 192(%r9), %xmm11
aesenc 208(%r9), %xmm11
movdqa 224(%r9), %xmm2
IN_LAST:
aesenclast %xmm2, %xmm11
pxor (%rdi), %xmm11
movdqu %xmm11, (%rsi)
addq $16, %rdi
addq $16, %rsi
dec %r10
jne IN_LOOP
END:
ret

5.10.2 AES Key Expansion Alternative

In Intel microarchitecture code name Sandy Bridge, the throughput of AESKEYGENASSIST is two cycles,
with higher latency than the AESENC/AESDEC instructions. Software may consider implementing the
AES key expansion by using the AESENCLAST instruction with the second operand (i.e., the round key)
being the RCON value, duplicated four times in the register. The AESENCLAST instruction performs the
SubBytes step and the xor-with-RCON step, while the ROTWORD step can be done using a PSHUFB
instruction. Following are code examples of AES128 key expansion using either method.

Example 5-48. AES128 Key Expansion
// Use AESKEYGENASSIST
.align 16,0x90
.globl AES_128_Key_Expansion
AES_128_Key_Expansion:
# parameter 1: %rdi
# parameter 2: %rsi
movl $10, 240(%rsi)
movdqu (%rdi), %xmm1
movdqa %xmm1, (%rsi)
aeskeygenassist $1, %xmm1, %xmm2
call PREPARE_ROUNDKEY_128
movdqa %xmm1, 16(%rsi)
aeskeygenassist $2, %xmm1, %xmm2
call PREPARE_ROUNDKEY_128
movdqa %xmm1, 32(%rsi)
aeskeygenassist $4, %xmm1, %xmm2
ASSISTS:
call PREPARE_ROUNDKEY_128
movdqa %xmm1, 48(%rsi)
aeskeygenassist $8, %xmm1, %xmm2
call PREPARE_ROUNDKEY_128
movdqa %xmm1, 64(%rsi)
aeskeygenassist $16, %xmm1, %xmm2
call PREPARE_ROUNDKEY_128
movdqa %xmm1, 80(%rsi)
aeskeygenassist $32, %xmm1, %xmm2
call PREPARE_ROUNDKEY_128
movdqa %xmm1, 96(%rsi)
aeskeygenassist $64, %xmm1, %xmm2
call PREPARE_ROUNDKEY_128
movdqa %xmm1, 112(%rsi)
aeskeygenassist $0x80, %xmm1, %xmm2
call PREPARE_ROUNDKEY_128
movdqa %xmm1, 128(%rsi)
aeskeygenassist $0x1b, %xmm1, %xmm2
call PREPARE_ROUNDKEY_128
movdqa %xmm1, 144(%rsi)
aeskeygenassist $0x36, %xmm1, %xmm2
call PREPARE_ROUNDKEY_128
movdqa %xmm1, 160(%rsi)
ret
PREPARE_ROUNDKEY_128:
pshufd $255, %xmm2, %xmm2
movdqa %xmm1, %xmm3
pslldq $4, %xmm3
pxor %xmm3, %xmm1
pslldq $4, %xmm3
pxor %xmm3, %xmm1
pslldq $4, %xmm3
pxor %xmm3, %xmm1
pxor %xmm2, %xmm1
ret

// Use AESENCLAST
mask:
.long 0x0c0f0e0d,0x0c0f0e0d,0x0c0f0e0d,0x0c0f0e0d
con1:
.long 1,1,1,1
con2:
.long 0x1b,0x1b,0x1b,0x1b
.align 16,0x90
.globl AES_128_Key_Expansion
AES_128_Key_Expansion:
# parameter 1: %rdi
# parameter 2: %rsi
movdqu (%rdi), %xmm1
movdqa %xmm1, (%rsi)
movdqa %xmm1, %xmm2
movdqa (con1), %xmm0
movdqa (mask), %xmm15
mov $8, %ax
LOOP1:
add $16, %rsi
dec %ax
pshufb %xmm15,%xmm2
aesenclast %xmm0, %xmm2
pslld $1, %xmm0
movdqa %xmm1, %xmm3
pslldq $4, %xmm3
pxor %xmm3, %xmm1
pslldq $4, %xmm3
pxor %xmm3, %xmm1
pslldq $4, %xmm3
pxor %xmm3, %xmm1
pxor %xmm2, %xmm1
movdqa %xmm1, (%rsi)
movdqa %xmm1, %xmm2
jne LOOP1
movdqa (con2), %xmm0
pshufb %xmm15,%xmm2
aesenclast %xmm0, %xmm2
pslld $1, %xmm0
movdqa %xmm1, %xmm3
pslldq $4, %xmm3
pxor %xmm3, %xmm1
pslldq $4, %xmm3
pxor %xmm3, %xmm1
pslldq $4, %xmm3
pxor %xmm3, %xmm1
pxor %xmm2, %xmm1
movdqa %xmm1, 16(%rsi)
movdqa %xmm1, %xmm2
pshufb %xmm15,%xmm2
aesenclast %xmm0, %xmm2
movdqa %xmm1, %xmm3
pslldq $4, %xmm3
pxor %xmm3, %xmm1
pslldq $4, %xmm3
pxor %xmm3, %xmm1
pslldq $4, %xmm3
pxor %xmm3, %xmm1
pxor %xmm2, %xmm1
movdqa %xmm1, 32(%rsi)
movdqa %xmm1, %xmm2
ret

5.10.3 Enhancement in Intel Microarchitecture Code Name Haswell

5.10.3.1 AES and Multi-Buffer Cryptographic Throughput

The AESENC/AESENCLAST and AESDEC/AESDECLAST instructions in Intel microarchitecture code name
Haswell have slightly improved latency and are one micro-op each. These improvements are expected to
benefit AES algorithms operating in parallel modes (e.g., CBC decryption) and multiple-buffer implementations
of AES algorithms. Several white papers provide more details and examples of using AES-NI. See
the following links:
• http://software.intel.com/en-us/articles/intel-advanced-encryption-standard-aes-instructions-set.
• http://software.intel.com/en-us/articles/download-the-intel-aesni-sample-library/.
• http://software.intel.com/file/26898.

5.10.3.2 PCLMULQDQ Improvement

The latency of PCLMULQDQ in Intel microarchitecture code name Haswell is reduced from 14 to 7 cycles,
and its throughput is improved from once every 8 cycles to once every other cycle, compared to prior
generations. This will speed up CRC calculations for generic polynomials. Details and examples can be
found at:
• http://www.intel.com/content/www/us/en/intelligent-systems/intel-technology/fast-crc-computation-paper.html.
• http://download.intel.com/embedded/processor/whitepaper/327889.pdf.
• http://www.intel.com/Assets/PDF/manual/323640.pdf.
AES-GCM implemented using PCLMULQDQ can be found in the OpenSSL project at:
• https://www-ssl.intel.com/content/www/us/en/intelligent-systems/wireless-infrastructure/aes-ipsec-performance-linux-paper.html.
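For orientation, a minimal sketch of the underlying primitive (illustrative, not from the papers above): the PCLMULQDQ instruction is exposed in C through the _mm_clmulepi64_si128 intrinsic.

#include <wmmintrin.h>    /* PCLMULQDQ */

/* Carry-less multiply of the low 64-bit halves of a and b into a
   128-bit product; the imm8 selects which qword of each operand
   participates (0x00 = low x low). */
static __m128i clmul_lo(__m128i a, __m128i b)
{
    return _mm_clmulepi64_si128(a, b, 0x00);
}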

5.11 LIGHT-WEIGHT DECOMPRESSION AND DATABASE PROCESSING

Traditionally, database storage relies on high compression ratios to cope with the finite disk I/O
bandwidth. In a row-optimized database architecture, the primary limitation on database processing
performance often correlates with the hardware constraints of the storage I/O bandwidth and the
locality issues of data records from rows in large tables that must be decompressed from their storage
format. Many recent database innovations center on columnar database architecture, where the
storage format is optimized for query operations that fetch data in a sequential manner.
Some of the recent advances in columnar databases (also known as in-memory databases) are light-weight
compression/decompression techniques and vectorized query operation primitives using SSE4.2
and other SIMD instructions. When a database engine combines those processing techniques with a
column-optimized storage system using solid state drives, a severalfold increase in query performance
has been reported.1 This section discusses the usage of SIMD instructions for light-weight compression/decompression
in columnar databases.
1. See published TPC-H non-clustered performance results at www.tpc.org.

The optimal objective for light-weight compression/decompression is to deliver high throughput at
reasonably low CPU utilization, so that the finite total compute bandwidth can be divided more favorably
between query processing and decompression to achieve maximal query throughput. SSE4.2 can
raise the compute bandwidth of some query operations to a significantly higher level (see Section
10.3.3), compared to query primitives implemented using general-purpose-register instructions. This
also places higher demand on the streaming data feed of decompressed columnar data.

5.11.1 Reduced Dynamic Range Datasets

One of the more successful approaches to compressing/decompressing columnar data at high speed is
based on the idea that an ensemble of integral values in a sequential data stream of fixed-size storage
width can be represented more compactly if the dynamic range of that ensemble is reduced by way of
partitioning, offset from a common reference value, and additional techniques2,3.
For example, a column that stores a 5-digit ZIPCODE as 32-bit integers only requires a dynamic range of
17 bits. The unique primary keys in a 2-billion-row table can be reduced by partitioning into sequential
blocks of 2^N entries, storing the common offset in the block header, and reducing the storage size of
each 32-bit integer to N bits.
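A minimal scalar sketch of choosing N for one block (illustrative, not from the manual):

#include <stdint.h>
#include <stddef.h>

/* Bits needed to encode each value of a block as an offset from the
   block minimum; the minimum is stored once in the block header. */
static unsigned bits_needed(const uint32_t *v, size_t n, uint32_t *base)
{
    uint32_t lo = v[0], hi = v[0];
    unsigned bits = 0;
    for (size_t i = 1; i < n; i++) {
        if (v[i] < lo) lo = v[i];
        if (v[i] > hi) hi = v[i];
    }
    *base = lo;
    for (uint32_t range = hi - lo; range; range >>= 1)
        bits++;                      /* ceil(log2(range + 1)) */
    return bits;
}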

5.11.2 Compression and Decompression Using SIMD Instructions

To illustrate the usage of SIMD instructions for reduced-dynamic-range compression/decompression in
which the compressed data elements are not byte-aligned, we consider an array of 32-bit integers whose
dynamic range only requires 5 bits per value.
To pack a stream of 32-bit integer values into consecutive 5-bit buckets, the SIMD technique illustrated
in Example 5-49 consists of the following phases:
• Dword-to-byte packing and byte-array sequencing: The stream of dword elements is reduced to byte
streams, with each iteration handling 32 elements. The two resulting 16-byte vectors are sequenced
to enable 4-way bit-stitching using PSLLD and PSRLD instructions.

Example 5-49. Compress 32-bit Integers into 5-bit Buckets
static __declspec(align(16)) short mask_dw_5b[8] =       // 5-bit mask for 4-way bit-packing via dword
{0x1f, 0x0, 0x1f, 0x0, 0x1f, 0x0, 0x1f, 0x0};            // packed shift
static __declspec(align(16)) short sprdb_0_5_10_15[8] =  // shuffle control to re-arrange
{ 0xff00, 0xffff, 0x04ff, 0xffff, 0xffff, 0xff08, 0xffff, 0x0cff}; // bytes 0, 4, 8, 12 to gap positions at 0, 5, 10, 15
void RDRpack32x4_sse(int *src, int cnt, char * out)
{
int i, j;
__m128i a0, a1, a2, a3, c0, c1, b0, b1, b2, b3, bb;
__m128i msk4;
__m128i sprd4 = _mm_loadu_si128( (__m128i*) &sprdb_0_5_10_15[0]);
switch( bucket_width) {
case 5: j = 0;
msk4 = _mm_loadu_si128( (__m128i*) &mask_dw_5b[0]);
// process 32 elements in each iteration
for (i = 0; i < cnt; i += 32) {
b0 = _mm_packus_epi32(_mm_loadu_si128( (__m128i*) &src[i]), _mm_loadu_si128( (__m128i*) &src[i+4]));
b1 = _mm_packus_epi32(_mm_loadu_si128( (__m128i*) &src[i+8]), _mm_loadu_si128( (__m128i*) &src[i+12]));
b2 = _mm_packus_epi32(_mm_loadu_si128( (__m128i*) &src[i+16]), _mm_loadu_si128( (__m128i*) &src[i+20]));
b3 = _mm_packus_epi32(_mm_loadu_si128( (__m128i*) &src[i+24]), _mm_loadu_si128( (__m128i*) &src[i+28]));
c0 = _mm_packus_epi16( _mm_unpacklo_epi64(b0, b1), _mm_unpacklo_epi64(b2, b3));
// c0 contains the bytes of elements 0-3, 8-11, 16-19, 24-27
c1 = _mm_packus_epi16( _mm_unpackhi_epi64(b0, b1), _mm_unpackhi_epi64(b2, b3));
// c1 contains the bytes of elements 4-7, 12-15, 20-23, 28-31
b0 = _mm_and_si128( c0, msk4);                           // keep lowest 5 bits in each way/dword
b1 = _mm_and_si128( _mm_srli_epi32(c0, 3), _mm_slli_epi32(msk4, 5));
b0 = _mm_or_si128( b0, b1);                              // add next 5 bits to each way/dword
b1 = _mm_and_si128( _mm_srli_epi32(c0, 6), _mm_slli_epi32(msk4, 10));
b0 = _mm_or_si128( b0, b1);
b1 = _mm_and_si128( _mm_srli_epi32(c0, 9), _mm_slli_epi32(msk4, 15));
b0 = _mm_or_si128( b0, b1);
b1 = _mm_and_si128( _mm_slli_epi32(c1, 20), _mm_slli_epi32(msk4, 20));
b0 = _mm_or_si128( b0, b1);
b1 = _mm_and_si128( _mm_slli_epi32(c1, 17), _mm_slli_epi32(msk4, 25));
b0 = _mm_or_si128( b0, b1);
b1 = _mm_and_si128( _mm_slli_epi32(c1, 14), _mm_slli_epi32(msk4, 30));
b0 = _mm_or_si128( b0, b1);                              // add next 2 bits from each dword channel, xmm full
*(int*)&out[j] = _mm_cvtsi128_si32( b0);                 // the first dword is compressed and ready
// re-distribute the remaining 3 dwords and add gap bytes to store the remaining bits
b0 = _mm_shuffle_epi8(b0, gap4x3);
b1 = _mm_and_si128( _mm_srli_epi32(c1, 18), _mm_srli_epi32(msk4, 2)); // 4-way packing of the next 3 bits
b2 = _mm_and_si128( _mm_srli_epi32(c1, 21), _mm_slli_epi32(msk4, 3));
b1 = _mm_or_si128( b1, b2);                              // 5th byte compressed at bytes 0, 4, 8, 12
// shuffle the fifth-byte result to byte offsets 0, 5, 10, 15
b0 = _mm_or_si128( b0, _mm_shuffle_epi8(b1, sprd4));
_mm_storeu_si128( (__m128i *) &out[j+4], b0);
j += bucket_width*4;
}
// handle the remainder if cnt is not a multiple of 32
break;
}
}

2. “SIMD-scan: ultra fast in-memory table scan using on-chip vector processing units,” T. Willhalm, et al., Proceedings of the VLDB Endowment, Vol. 2, #1, August 2009.
3. “Super-Scalar RAM-CPU Cache Compression,” M. Zukowski, et al., Data Engineering, International Conference, vol. 0, no. 0, pp. 59, 2006.
• Four-way bit stitching: In each way (dword) of the destination, 5 bits are packed consecutively from
the corresponding byte element that contains 5 non-zero bit patterns. Since each dword destination
will be completely filled by the contents of 7 consecutive elements, the remaining three bits of the
7th element and the 8th element are handled separately in a similar 4-way stitching operation, with
the assistance of shuffle operations.

Example 5-50 shows the reverse operation of decompressing consecutively packed 5-bit buckets into
32-bit data elements.


Example 5-50. Decompression of a Stream of 5-bit Integers into 32-bit Elements
static __declspec(align(16)) short mask_dw_5b[16] = // 5-bit mask for 4 way bit-packing via dword
{0x1f, 0x0, 0x1f, 0x0, 0x1f, 0x0, 0x1f, 0x0}; // packed shift
static __declspec(align(16)) short pack_dw_4x3[8] = // pack 3 dwords 1-4, 6-9, 11-14
{ 0xffff, 0xffff, 0x0201, 0x0403, 0x0706, 0x0908, 0xc0b, 0x0e0d}; // to vacate bytes 0-3
static __declspec(align(16)) short packb_0_5_10_15[8] = // shuffle control to re-arrange bytes
{ 0xffff, 0x0ff, 0xffff, 0x5ff, 0xffff, 0xaff, 0xffff, 0x0fff}; // 0, 5, 10, 15 to gap positions at 3, 7, 11, 15
void RDRunpack32x4_sse(char *src, int cnt, int * out)
{
int i, j;
__m128i a0, a1, a2, a3, c0, c1, b0, b1, b2, b3, bb, d0, d1, d2, d3;
__m128i msk4 ;
__m128i pck4 = _mm_loadu_si128( (__m128i*) &packb_0_5_10_15[0]);
__m128i pckdw3 = _mm_loadu_si128( (__m128i*) &pack_dw_4x3[0]);
switch( bucket_width) {
case 5:j= 0;
msk4 = _mm_loadu_si128( (__m128i*) &mask_dw_5b[0]);
for (i = 0; i < cnt; i+= 32) {
a1 = _mm_loadu_si128( (__m128i*) &src[j +4]);
// pick up bytes 4, 9, 14, 19 and shuffle into offset 3, 7, 11, 15
c0 = _mm_shuffle_epi8(a1, pck4);
b1 = _mm_and_si128( _mm_srli_epi32(c0, 3), _mm_slli_epi32(msk4, 24));
// put 3 unaligned dword 1-4, 6-9, 11-14 to vacate bytes 0-3
a1 = _mm_shuffle_epi8(a1, pckdw3);
b0 = _mm_and_si128( _mm_srli_epi32(c0, 6), _mm_slli_epi32(msk4, 16));
a0 = _mm_cvtsi32_si128( *(int *)&src[j ]);
b1 = _mm_or_si128( b0, b1); // finished decompress source bytes 4, 9, 14, 19
a0 = _mm_or_si128( a0, a1); // bytes 0-16 contain compressed bits
b0 = _mm_and_si128( _mm_srli_epi32(a0, 14), _mm_slli_epi32(msk4, 16));
b1 = _mm_or_si128( b0, b1);
b0 = _mm_and_si128( _mm_srli_epi32(a0, 17), _mm_slli_epi32(msk4, 8));
b1 = _mm_or_si128( b0, b1);
b0 = _mm_and_si128( _mm_srli_epi32(a0, 20), msk4);
b1 = _mm_or_si128( b0, b1);// b1 now full with decompressed 4-7,12-15,20-23,28-31
_mm_storeu_si128( (__m128i *) &out[i+4] , _mm_cvtepu8_epi32(b1));
b0 = _mm_and_si128( _mm_slli_epi32(a0, 9), _mm_slli_epi32(msk4, 24));
c0 = _mm_and_si128( _mm_slli_epi32(a0, 6), _mm_slli_epi32(msk4, 16));
b0 = _mm_or_si128( b0, c0);
_mm_storeu_si128( (__m128i *) &out[i+12] , _mm_cvtepu8_epi32(_mm_srli_si128(b1, 4)));
c0 = _mm_and_si128( _mm_slli_epi32(a0, 3), _mm_slli_epi32(msk4, 8));
_mm_storeu_si128( (__m128i *) &out[i+20] , _mm_cvtepu8_epi32(_mm_srli_si128(b1, 8)));
b0 = _mm_or_si128( b0, c0);
_mm_storeu_si128( (__m128i *) &out[i+28] , _mm_cvtepu8_epi32(_mm_srli_si128(b1, 12)));
c0 = _mm_and_si128( a0, msk4);
b0 = _mm_or_si128( b0, c0);// b0 now full with decompressed 0-3,8-11,16-19,24-27

_mm_storeu_si128( (__m128i *) &out[i] , _mm_cvtepu8_epi32(b0));
_mm_storeu_si128( (__m128i *) &out[i+8] , _mm_cvtepu8_epi32(_mm_srli_si128(b0, 4)));
_mm_storeu_si128( (__m128i *) &out[i+16] , _mm_cvtepu8_epi32(_mm_srli_si128(b0, 8)));
_mm_storeu_si128( (__m128i *) &out[i+24] , _mm_cvtepu8_epi32(_mm_srli_si128(b0, 12)));
j += g_bwidth*4;
}
break;
}
}
Compression/decompression of integers with a dynamic range that is not a power of 2 can generally use
a similar mask/packed-shift/stitch technique, with additional adaptation of the horizontal rearrangement of
partially stitched vectors. The increase in throughput relative to using general-purpose scalar instructions
will depend on the implementation and the bucket width.
When compiled with the “/O2” option of the Intel Compiler, the compression throughput can reach 6
bytes/cycle on Intel microarchitecture code name Sandy Bridge, and the throughput varies little across
working set sizes due to the streaming data access pattern and the effectiveness of the hardware
prefetchers. The decompression throughput of the above example is more than 5 bytes/cycle at full
utilization, allowing a database query engine to partition CPU utilization effectively and allocate a small
fraction to on-the-fly decompression to feed vectorized query computation.
The decompression throughput increase using a SIMD light-weight compression technique offers database
architects new degrees of freedom to relocate critical performance bottlenecks from a lower-throughput
technology (disk I/O, DRAM) to a faster pipeline.


CHAPTER 6
OPTIMIZING FOR SIMD FLOATING-POINT APPLICATIONS
This chapter discusses rules for optimizing for the single-instruction, multiple-data (SIMD) floating-point
instructions available in SSE, SSE2, SSE3, and SSE4.1. The chapter also provides examples that illustrate
the optimization techniques for single-precision and double-precision SIMD floating-point applications.

6.1 GENERAL RULES FOR SIMD FLOATING-POINT CODE

The rules and suggestions in this section help optimize floating-point code containing SIMD floating-point
instructions. Generally, it is important to understand and balance port utilization to create efficient
SIMD floating-point code. Basic rules and suggestions include the following:
• Follow all guidelines in Chapter 3 and Chapter 4.
• Mask exceptions to achieve higher performance. When exceptions are unmasked, software
performance is slower.
• Utilize the flush-to-zero and denormals-are-zero modes for higher performance to avoid the penalty
of dealing with denormals and underflows.
• Use the reciprocal instructions followed by iteration for increased accuracy. These instructions yield
reduced accuracy but execute much faster. Note the following (see the sketch after this list):
— If reduced accuracy is acceptable, use them with no iteration.
— If near full accuracy is needed, use a Newton-Raphson iteration.
— If full accuracy is needed, then use divide and square root, which provide more accuracy but slow
down performance.
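A minimal sketch of the reciprocal-with-iteration option (illustrative, not from the manual): one Newton-Raphson step, x1 = x0*(2 - d*x0), refines the roughly 12-bit RCPPS estimate to near single-precision accuracy.

#include <xmmintrin.h>    /* SSE */

static __m128 recip_nr(__m128 d)
{
    __m128 x0 = _mm_rcp_ps(d);                      /* ~12-bit estimate */
    return _mm_mul_ps(x0,
        _mm_sub_ps(_mm_set1_ps(2.0f), _mm_mul_ps(d, x0)));
}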

6.2 PLANNING CONSIDERATIONS

Whether adapting an existing application or creating a new one, using SIMD floating-point instructions to
achieve optimum performance gain requires programmers to consider several issues. In general, when
choosing candidates for optimization, look for code segments that are computationally intensive and
floating-point intensive. Also consider efficient use of the cache architecture.
The sections that follow answer the questions that should be raised before implementation:
• Which part of the code benefits from SIMD floating-point instructions?
• Is the current algorithm the most appropriate for SIMD floating-point instructions?
• Is the code floating-point intensive?
• Do either single-precision floating-point or double-precision floating-point computations provide
enough range and precision?
• Can data layout be arranged to increase parallelism or cache utilization?
• Is the result of computation affected by enabling the flush-to-zero or denormals-are-zero modes?
• Is the data arranged for efficient utilization of the SIMD floating-point registers?
• Is this application targeted for processors without SIMD floating-point instructions?

See also: Section 4.2, “Considerations for Code Conversion to SIMD Programming.”


6.3 USING SIMD FLOATING-POINT WITH X87 FLOATING-POINT

Because the XMM registers used for SIMD floating-point computations are separate registers and are not
mapped to the existing x87 floating-point stack, SIMD floating-point code can be mixed with x87
floating-point or 64-bit SIMD integer code.
With Intel Core microarchitecture, 128-bit SIMD integer instructions provide substantially higher efficiency
than 64-bit SIMD integer instructions. Software should favor using SIMD floating-point and
integer SIMD instructions with XMM registers where possible.

6.4 SCALAR FLOATING-POINT CODE

There are SIMD floating-point instructions that operate only on the lowest-order element in the SIMD
register. These instructions are known as scalar instructions. They allow the XMM registers to be used for
general-purpose floating-point computations.
In terms of performance, scalar floating-point code can be equivalent to or exceed x87 floating-point
code and has the following advantages:
• SIMD floating-point code uses a flat register model, whereas x87 floating-point code uses a stack
model. Using scalar floating-point code eliminates the need to use FXCH instructions. These have
performance limits on the Intel Pentium 4 processor.
• Mixing with MMX technology code without penalty.
• Flush-to-zero mode.
• Shorter latencies than x87 floating-point.
When using scalar floating-point instructions, it is not necessary to ensure that the data appears in
vector form. However, the optimizations regarding alignment, scheduling, instruction selection, and
other optimizations covered in Chapter 3 and Chapter 4 should be observed.

6.5 DATA ALIGNMENT

SIMD floating-point data is 16-byte aligned. Referencing unaligned 128-bit SIMD floating-point data will
result in an exception unless MOVUPS or MOVUPD (move unaligned packed single or unaligned packed
double) is used. The unaligned instructions, used on aligned or unaligned data, will also suffer a performance
penalty relative to aligned accesses.
See also: Section 4.4, “Stack and Data Alignment.”

6.5.1 Data Arrangement

Because SSE and SSE2 incorporate SIMD architecture, arranging data to fully use the SIMD registers
produces optimum performance. This implies contiguous data for processing, which leads to fewer cache
misses. Correct data arrangement can potentially quadruple data throughput when using SSE, or double
throughput when using SSE2. Performance gains can occur because four data elements can be loaded
with 128-bit load instructions into XMM registers using SSE (MOVAPS). Similarly, two data elements can
be loaded with 128-bit load instructions into XMM registers using SSE2 (MOVAPD).
Refer to Section 4.4, “Stack and Data Alignment,” for data arrangement recommendations. Duplicating
and padding techniques overcome misalignment problems that occur in some data structures and
arrangements. This increases the data space but avoids penalties for misaligned data access.
For some applications (for example, 3D geometry), traditional data arrangement requires some changes
to fully utilize the SIMD registers and parallel techniques. Traditionally, the data layout has been an array
of structures (AoS). To fully utilize the SIMD registers in such applications, a new data layout has been
proposed: a structure of arrays (SoA), resulting in more optimized performance.


6.5.1.1 Vertical versus Horizontal Computation

The majority of the floating-point arithmetic instructions in SSE/SSE2 provide greater performance gain
on vertical data processing of parallel data elements. This means each element of the destination is the
result of an arithmetic operation performed on source elements in the same vertical position
(Figure 6-1).
To supplement these homogeneous arithmetic operations on parallel data elements, SSE and SSE2
provide data movement instructions (e.g., SHUFPS, UNPCKLPS, UNPCKHPS, MOVLHPS, MOVHLPS, etc.)
that facilitate moving data elements horizontally.

[Figure 6-1. Homogeneous Operation on Parallel Data Elements: each destination element is Xi OP Yi, formed from source elements X3..X0 and Y3..Y0 in the same vertical position.]
The organization of structured data has a significant impact on SIMD programming efficiency and
performance. This can be illustrated using two common types of data structure organization:
• Array of Structures: This refers to the arrangement of an array of data structures. Within the data
structure, each member is a scalar. This is shown in Figure 6-2. Typically, a repetitive sequence of
computation is applied to each element of the array, i.e., a data structure. The computational sequence
for the scalar members of the structure is likely to be non-homogeneous within each iteration. AoS is
generally associated with a horizontal computation model.

[Figure 6-2. Horizontal Computation Model: one structure holds the scalar members X, Y, Z, W.]

• Structure of Arrays: Here, each member of the data structure is an array. Each element of the array is
a scalar. This is shown in Table 6-1. A repetitive computational sequence is applied to scalar elements,
and homogeneous operation can be easily achieved across consecutive iterations within the same
structural member. Consequently, SoA is generally amenable to the vertical computation model.


Table 6-1. SoA Form of Representing Vertices Data
Vx array:  X1  X2  X3  X4  .....  Xn
Vy array:  Y1  Y2  Y3  Y4  .....  Yn
Vz array:  Z1  Z2  Z3  Z4  .....  Zn
Vw array:  W1  W2  W3  W4  .....  Wn

Using SIMD instructions with vertical computation on an SoA arrangement can achieve higher efficiency
and performance than AoS and horizontal computation. This can be seen with the dot-product operation
on vectors. The dot-product operation on an SoA arrangement is shown in Figure 6-3.

[Figure 6-3. Dot Product Operation: R1..R4 = X1..X4*Fx + Y1..Y4*Fy + Z1..Z4*Fz + W1..W4*Fw, computed vertically for four vertices at once.]

Example 6-1 shows how one result would be computed using seven instructions if the data were organized
as AoS and using SSE alone; four results would require 28 instructions.
Example 6-1. Pseudocode for Horizontal (xyz, AoS) Computation
mulps    ; x*x', y*y', z*z'
movaps   ; reg->reg move, since next steps overwrite
shufps   ; get b,a,d,c from a,b,c,d
addps    ; get a+b,a+b,c+d,c+d
movaps   ; reg->reg move
shufps   ; get c+d,c+d,a+b,a+b from prior addps
addps    ; get a+b+c+d,a+b+c+d,a+b+c+d,a+b+c+d


Now consider the case when the data is organized as SoA. Example 6-2 demonstrates how four results
are computed using five instructions.
Example 6-2. Pseudocode for Vertical (xxxx, yyyy, zzzz, SoA) Computation
mulps    ; x*x' for all 4 x-components of 4 vertices
mulps    ; y*y' for all 4 y-components of 4 vertices
mulps    ; z*z' for all 4 z-components of 4 vertices
addps    ; x*x' + y*y'
addps    ; x*x' + y*y' + z*z'

For the most efficient use of the four-component-wide registers, reorganizing the data into the SoA
format yields increased throughput and hence much better performance for the instructions used.
As seen from this simple example, vertical computation can yield 100% use of the available SIMD registers
to produce four results. (The results may vary for other situations.) If the data structures are represented
in a format that is not “friendly” to vertical computation, it can be rearranged “on the fly” to
facilitate better utilization of the SIMD registers. This operation is referred to as “swizzling” and
the reverse operation is referred to as “deswizzling.”

6.5.1.2 Data Swizzling

Swizzling data from AoS to SoA format can apply to a number of application domains, including 3D
geometry, video and imaging. Two different swizzling techniques can be adapted to handle floating-point
and integer data. Example 6-3 illustrates a swizzle function that uses the SHUFPS, MOVLHPS, MOVHLPS
instructions.

Example 6-3. Swizzling Data Using SHUFPS, MOVLHPS, MOVHLPS
typedef struct _VERTEX_AOS {
float x, y, z, color;
} Vertex_aos;                       // AoS structure declaration
typedef struct _VERTEX_SOA {
float x[4], y[4], z[4];
float color[4];
} Vertex_soa;                       // SoA structure declaration
void swizzle_asm (Vertex_aos *in, Vertex_soa *out)
{
// in mem: x1y1z1w1-x2y2z2w2-x3y3z3w3-x4y4z4w4
// SWIZZLE XYZW --> XXXX
asm {
mov ebx, in                         // get structure addresses
mov edx, out
movaps xmm1, [ebx]                  // w1 z1 y1 x1
movaps xmm2, [ebx + 16]             // w2 z2 y2 x2
movaps xmm3, [ebx + 32]             // w3 z3 y3 x3
movaps xmm4, [ebx + 48]             // w4 z4 y4 x4
movaps xmm7, xmm4                   // xmm7= w4 z4 y4 x4
movhlps xmm7, xmm3                  // xmm7= w4 z4 w3 z3
movaps xmm6, xmm2                   // xmm6= w2 z2 y2 x2
movlhps xmm3, xmm4                  // xmm3= y4 x4 y3 x3
movhlps xmm2, xmm1                  // xmm2= w2 z2 w1 z1
movlhps xmm1, xmm6                  // xmm1= y2 x2 y1 x1
movaps xmm6, xmm2                   // xmm6= w2 z2 w1 z1
movaps xmm5, xmm1                   // xmm5= y2 x2 y1 x1
shufps xmm2, xmm7, 0xDD             // xmm2= w4 w3 w2 w1 => v4
shufps xmm1, xmm3, 0x88             // xmm1= x4 x3 x2 x1 => v1
shufps xmm5, xmm3, 0xDD             // xmm5= y4 y3 y2 y1 => v2
shufps xmm6, xmm7, 0x88             // xmm6= z4 z3 z2 z1 => v3
movaps [edx], xmm1                  // store X
movaps [edx+16], xmm5               // store Y
movaps [edx+32], xmm6               // store Z
movaps [edx+48], xmm2               // store W
}
}

Example 6-4 shows a similar data-swizzling algorithm using SIMD instructions in the integer domain.

Example 6-4. Swizzling Data Using UNPCKxxx Instructions
void swizzle_asm (Vertex_aos *in, Vertex_soa *out)
{
// in mem: x1y1z1w1-x2y2z2w2-x3y3z3w3-x4y4z4w4
// SWIZZLE XYZW --> XXXX
asm {
mov ebx, in                         // get structure addresses
mov edx, out
movdqa xmm1, [ebx + 0*16]           // w0 z0 y0 x0
movdqa xmm2, [ebx + 1*16]           // w1 z1 y1 x1
movdqa xmm3, [ebx + 2*16]           // w2 z2 y2 x2
movdqa xmm4, [ebx + 3*16]           // w3 z3 y3 x3
movdqa xmm5, xmm1
punpckldq xmm1, xmm2                // y1 y0 x1 x0
punpckhdq xmm5, xmm2                // w1 w0 z1 z0
movdqa xmm2, xmm3
punpckldq xmm3, xmm4                // y3 y2 x3 x2
punpckhdq xmm2, xmm4                // w3 w2 z3 z2
movdqa xmm4, xmm1
punpcklqdq xmm1, xmm3               // x3 x2 x1 x0
punpckhqdq xmm4, xmm3               // y3 y2 y1 y0
movdqa xmm3, xmm5
punpcklqdq xmm5, xmm2               // z3 z2 z1 z0
punpckhqdq xmm3, xmm2               // w3 w2 w1 w0
movdqa [edx+0*16], xmm1             // x3 x2 x1 x0
movdqa [edx+1*16], xmm4             // y3 y2 y1 y0
movdqa [edx+2*16], xmm5             // z3 z2 z1 z0
movdqa [edx+3*16], xmm3             // w3 w2 w1 w0
}
}
The technique in Example 6-3 (loading 16 bytes, using SHUFPS and copying halves of XMM registers) is
preferable to an alternative approach of loading halves of each vector using MOVLPS/MOVHPS on newer
microarchitectures. This is because loading 8 bytes using MOVLPS/MOVHPS can create a code dependency
and reduce the throughput of the execution engine.


The performance of Example 6-3 and Example 6-4 often depends on the characteristics of
each microarchitecture. For example, in Intel Core microarchitecture, executing a SHUFPS tends to be
slower than a PUNPCKxxx instruction. In Enhanced Intel Core microarchitecture, SHUFPS and
PUNPCKxxx instructions all execute with 1-cycle throughput due to the 128-bit shuffle execution unit.
The next important consideration is that there is only one port that can execute PUNPCKxxx, whereas
MOVLHPS/MOVHLPS can execute on multiple ports. The performance of both techniques improves on
Intel Core microarchitecture over previous microarchitectures due to the 3 ports for executing SIMD
instructions. Both techniques improve further on Enhanced Intel Core microarchitecture due to the 128-bit
shuffle unit.

6.5.1.3 Data Deswizzling

In the deswizzle operation, we want to arrange the SoA format back into AoS format so the XXXX, YYYY,
ZZZZ are rearranged and stored in memory as XYZ. Example 6-5 illustrates one deswizzle function for
floating-point data.

Example 6-5. Deswizzling Single-Precision SIMD Data
void deswizzle_asm(Vertex_soa *in, Vertex_aos *out)
{
__asm {
mov ecx, in                  // load structure addresses
mov edx, out
movaps xmm0, [ecx]           // x3 x2 x1 x0
movaps xmm1, [ecx + 16]      // y3 y2 y1 y0
movaps xmm2, [ecx + 32]      // z3 z2 z1 z0
movaps xmm3, [ecx + 48]      // w3 w2 w1 w0
movaps xmm5, xmm0
movaps xmm7, xmm2
unpcklps xmm0, xmm1          // xmm0= y1 x1 y0 x0
unpcklps xmm2, xmm3          // xmm2= w1 z1 w0 z0
movdqa xmm4, xmm2
movhlps xmm4, xmm0           // xmm4= w1 z1 y1 x1
movlhps xmm0, xmm2           // xmm0= w0 z0 y0 x0
unpckhps xmm5, xmm1          // xmm5= y3 x3 y2 x2
unpckhps xmm7, xmm3          // xmm7= w3 z3 w2 z2
movdqa xmm6, xmm7
movhlps xmm6, xmm5           // xmm6= w3 z3 y3 x3
movlhps xmm5, xmm7           // xmm5= w2 z2 y2 x2
movaps [edx+0*16], xmm0      // w0 z0 y0 x0
movaps [edx+1*16], xmm4      // w1 z1 y1 x1
movaps [edx+2*16], xmm5      // w2 z2 y2 x2
movaps [edx+3*16], xmm6      // w3 z3 y3 x3
}
}
Example 6-6 shows a similar deswizzle function using SIMD integer instructions. Both of these techniques demonstrate loading 16 bytes and performing horizontal data movement in registers. This
approach is likely to be more efficient than alternative techniques of storing 8-byte halves of XMM registers using MOVLPS and MOVHPS.


Example 6-6. Deswizzling Data Using SIMD Integer Instructions
void deswizzle_rgb(Vertex_soa *in, Vertex_aos *out)
{
//---deswizzle rgb---
// assume: xmm0=rrrr, xmm1=gggg, xmm2=bbbb, xmm3=aaaa
__asm {
mov ecx, in                  // load structure addresses
mov edx, out
movdqa xmm0, [ecx]           // load r4 r3 r2 r1 => xmm0
movdqa xmm1, [ecx+16]        // load g4 g3 g2 g1 => xmm1
movdqa xmm2, [ecx+32]        // load b4 b3 b2 b1 => xmm2
movdqa xmm3, [ecx+48]        // load a4 a3 a2 a1 => xmm3
// Start deswizzling here
movdqa xmm5, xmm0
movdqa xmm7, xmm2
punpckldq xmm0, xmm1         // g2 r2 g1 r1
punpckldq xmm2, xmm3         // a2 b2 a1 b1
movdqa xmm4, xmm0
punpcklqdq xmm0, xmm2        // a1 b1 g1 r1 => v1
punpckhqdq xmm4, xmm2        // a2 b2 g2 r2 => v2
punpckhdq xmm5, xmm1         // g4 r4 g3 r3
punpckhdq xmm7, xmm3         // a4 b4 a3 b3
movdqa xmm6, xmm5
punpcklqdq xmm5, xmm7        // a3 b3 g3 r3 => v3
punpckhqdq xmm6, xmm7        // a4 b4 g4 r4 => v4
movdqa [edx], xmm0           // v1
movdqa [edx+16], xmm4        // v2
movdqa [edx+32], xmm5        // v3
movdqa [edx+48], xmm6        // v4
// DESWIZZLING ENDS HERE
}
}

6.5.1.4 Horizontal ADD Using SSE

Although vertical computations generally make better use of SIMD performance than horizontal computations,
in some cases code must use a horizontal operation.
MOVLHPS/MOVHLPS and shuffle can be used to sum data horizontally. For example, starting with four
128-bit registers, to sum up each register horizontally while having the final results in one register, use
the MOVLHPS/MOVHLPS instructions to align the upper and lower parts of each register. This allows you to
use a vertical add. With the resulting partial horizontal summation, full summation follows easily.
Figure 6-4 presents a horizontal add using MOVHLPS/MOVLHPS. Example 6-7 and Example 6-8 provide
the code for this operation.

6-8

OPTIMIZING FOR SIMD FLOATING-POINT APPLICATIONS

[Figure 6-4. Horizontal Add Using MOVHLPS/MOVLHPS: xmm0..xmm3 hold A1..A4, B1..B4, C1..C4, D1..D4; MOVLHPS/MOVHLPS pair the register halves, ADDPS forms partial sums such as A1+A3 and A2+A4, SHUFPS gathers the matching partial sums, and a final ADDPS produces A1+A2+A3+A4, B1+B2+B3+B4, C1+C2+C3+C4, D1+D2+D3+D4.]

Example 6-7. Horizontal Add Using MOVHLPS/MOVLHPS
void horiz_add(Vertex_soa *in, float *out) {
__asm {
mov ecx, in                  // load structure addresses
mov edx, out
movaps xmm0, [ecx]           // load A1 A2 A3 A4 => xmm0
movaps xmm1, [ecx+16]        // load B1 B2 B3 B4 => xmm1
movaps xmm2, [ecx+32]        // load C1 C2 C3 C4 => xmm2
movaps xmm3, [ecx+48]        // load D1 D2 D3 D4 => xmm3
// START HORIZONTAL ADD
movaps xmm5, xmm0            // xmm5= A1,A2,A3,A4
movlhps xmm5, xmm1           // xmm5= A1,A2,B1,B2
movhlps xmm1, xmm0           // xmm1= A3,A4,B3,B4
addps xmm5, xmm1             // xmm5= A1+A3,A2+A4,B1+B3,B2+B4
movaps xmm4, xmm2
movlhps xmm2, xmm3           // xmm2= C1,C2,D1,D2
movhlps xmm3, xmm4           // xmm3= C3,C4,D3,D4
addps xmm3, xmm2             // xmm3= C1+C3,C2+C4,D1+D3,D2+D4
movaps xmm6, xmm5
shufps xmm6, xmm3, 0x88      // xmm6= A1+A3,B1+B3,C1+C3,D1+D3
shufps xmm5, xmm3, 0xDD      // xmm5= A2+A4,B2+B4,C2+C4,D2+D4
addps xmm6, xmm5             // xmm6= A1+A2+A3+A4,B1+B2+B3+B4,C1+C2+C3+C4,D1+D2+D3+D4
// END HORIZONTAL ADD
movaps [edx], xmm6
}
}

Example 6-8. Horizontal Add Using Intrinsics with MOVHLPS/MOVLHPS
void horiz_add_intrin(Vertex_soa *in, float *out)
{
__m128 tmm0,tmm1,tmm2,tmm3,tmm4,tmm5,tmm6;   // temporary variables
tmm0 = _mm_load_ps(in->x);                   // tmm0 = A1 A2 A3 A4
tmm1 = _mm_load_ps(in->y);                   // tmm1 = B1 B2 B3 B4
tmm2 = _mm_load_ps(in->z);                   // tmm2 = C1 C2 C3 C4
tmm3 = _mm_load_ps(in->w);                   // tmm3 = D1 D2 D3 D4
tmm5 = tmm0;                                 // tmm5 = A1 A2 A3 A4
tmm5 = _mm_movelh_ps(tmm5, tmm1);            // tmm5 = A1 A2 B1 B2
tmm1 = _mm_movehl_ps(tmm1, tmm0);            // tmm1 = A3 A4 B3 B4
tmm5 = _mm_add_ps(tmm5, tmm1);               // tmm5 = A1+A3 A2+A4 B1+B3 B2+B4
tmm4 = tmm2;
tmm2 = _mm_movelh_ps(tmm2, tmm3);            // tmm2 = C1 C2 D1 D2
tmm3 = _mm_movehl_ps(tmm3, tmm4);            // tmm3 = C3 C4 D3 D4
tmm3 = _mm_add_ps(tmm3, tmm2);               // tmm3 = C1+C3 C2+C4 D1+D3 D2+D4
tmm6 = _mm_shuffle_ps(tmm5, tmm3, 0x88);     // tmm6 = A1+A3 B1+B3 C1+C3 D1+D3
tmm5 = _mm_shuffle_ps(tmm5, tmm3, 0xDD);     // tmm5 = A2+A4 B2+B4 C2+C4 D2+D4
tmm6 = _mm_add_ps(tmm6, tmm5);               // tmm6 = A1+A2+A3+A4 B1+B2+B3+B4
                                             //        C1+C2+C3+C4 D1+D2+D3+D4
_mm_store_ps(out, tmm6);
}

6.5.2 Use of CVTTPS2PI/CVTTSS2SI Instructions

The CVTTPS2PI and CVTTSS2SI instructions encode the truncate/chop rounding mode implicitly in the instruction. They take precedence over the rounding mode specified in the MXCSR register. This behavior can eliminate the need to change the rounding mode from round-nearest, to truncate/chop, and then back to round-nearest to resume computation.
Avoid frequent changes to the MXCSR register since there is a penalty associated with writing this register. Typically, when using CVTTPS2PI/CVTTSS2SI, rounding control in MXCSR can be left set to round-nearest.
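
For compiler-based code, the corresponding intrinsic avoids MXCSR manipulation as well. The following is a minimal sketch; the function name is illustrative, not from this manual:

#include <xmmintrin.h>

/* _mm_cvttss_si32 maps to CVTTSS2SI: it truncates toward zero
   regardless of the MXCSR rounding-control field, so no MXCSR
   writes (and their penalties) are needed around the conversion. */
int float_to_int_trunc(float f)
{
    return _mm_cvttss_si32(_mm_set_ss(f));
}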

6.5.3 Flush-to-Zero and Denormals-are-Zero Modes

The flush-to-zero (FTZ) and denormals-are-zero (DAZ) modes are not compatible with the IEEE Standard 754. They are provided to improve performance for applications where underflow is common and where the generation of a denormalized result is not necessary.

See also: Section 3.8.3, “Floating-point Modes and Exceptions.”
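
As a minimal sketch of enabling these modes from C (assuming a compiler that provides the xmmintrin.h/pmmintrin.h MXCSR helper macros), the following can be run once per thread:

#include <xmmintrin.h>   /* _MM_SET_FLUSH_ZERO_MODE */
#include <pmmintrin.h>   /* _MM_SET_DENORMALS_ZERO_MODE */

/* Enable FTZ and DAZ through the MXCSR wrapper macros. Both modes
   deviate from IEEE 754, so enable them only when denormal precision
   is not required. */
void enable_ftz_daz(void)
{
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);         /* tiny results flush to zero */
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON); /* denormal inputs read as zero */
}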

6.6 SIMD OPTIMIZATIONS AND MICROARCHITECTURES

Pentium M, Intel Core Solo and Intel Core Duo processors have a different microarchitecture than Intel NetBurst microarchitecture. Intel Core microarchitecture offers significantly more efficient SIMD floating-point capability than previous microarchitectures. In addition, instruction latency and throughput of SSE3 instructions are significantly improved in Intel Core microarchitecture over previous microarchitectures.

6.6.1 SIMD Floating-point Programming Using SSE3

SSE3 enhances SSE and SSE2 with nine instructions targeted for SIMD floating-point programming. In contrast to many SSE/SSE2 instructions that offer homogeneous arithmetic operations on parallel data elements and favor the vertical computation model, SSE3 offers instructions that perform asymmetric arithmetic operations and arithmetic operations on horizontal data elements.
ADDSUBPS and ADDSUBPD are two instructions with asymmetric arithmetic processing capability (see Figure 6-5). HADDPS, HADDPD, HSUBPS and HSUBPD offer horizontal arithmetic processing capability (see Figure 6-6). In addition, MOVSLDUP, MOVSHDUP and MOVDDUP load data from memory (or an XMM register) and replicate data elements at once.

[Figure 6-5 diagram: from operands X1|X0 and Y1|Y0, the high element is computed as X1 + Y1 (ADD) and the low element as X0 - Y0 (SUB).]

Figure 6-5. Asymmetric Arithmetic Operation of the SSE3 Instruction

[Figure 6-6 diagram: from operands X1|X0 and Y1|Y0, HADDPD computes X0 + X1 and Y0 + Y1.]

Figure 6-6. Horizontal Arithmetic Operation of the SSE3 Instruction HADDPD


6.6.1.1 SSE3 and Complex Arithmetics

The flexibility of SSE3 in dealing with AOS-type data structures can be demonstrated by the example of multiplication and division of complex numbers. For example, a complex number can be stored in a structure consisting of its real and imaginary parts. This naturally leads to the use of an array of structures.
Example 6-9 demonstrates using SSE3 instructions to perform multiplication of single-precision complex numbers. Example 6-10 demonstrates using SSE3 instructions to perform division of complex numbers.

Example 6-9. Multiplication of Two Pairs of Single-precision Complex Numbers
// Multiplication of (ak + i bk ) * (ck + i dk )
// a + i b can be stored as a data structure
movsldup xmm0, Src1     ; load real parts into the destination,
                        ; a1, a1, a0, a0
movaps xmm1, Src2       ; load the 2nd pair of complex values,
                        ; i.e. d1, c1, d0, c0
mulps xmm0, xmm1        ; temporary results, a1d1, a1c1, a0d0, a0c0
shufps xmm1, xmm1, 0xB1 ; reorder the real and imaginary parts,
                        ; c1, d1, c0, d0
movshdup xmm2, Src1     ; load the imaginary parts into the
                        ; destination, b1, b1, b0, b0
mulps xmm2, xmm1        ; temporary results, b1c1, b1d1, b0c0, b0d0
addsubps xmm0, xmm2     ; b1c1+a1d1, a1c1-b1d1, b0c0+a0d0, a0c0-b0d0
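
The same sequence can be expressed with SSE3 intrinsics. The sketch below is illustrative (the function and variable names are not from this manual); each __m128 holds two complex numbers stored as (real, imaginary) pairs:

#include <pmmintrin.h>   /* SSE3 intrinsics */

/* Multiply two pairs of single-precision complex numbers:
   src1 = { a0, b0, a1, b1 }, src2 = { c0, d0, c1, d1 }. */
__m128 cmul_2pairs(__m128 src1, __m128 src2)
{
    __m128 re = _mm_moveldup_ps(src1);            /* a0, a0, a1, a1 */
    __m128 im = _mm_movehdup_ps(src1);            /* b0, b0, b1, b1 */
    __m128 t1 = _mm_mul_ps(re, src2);             /* a*c and a*d terms */
    __m128 sw = _mm_shuffle_ps(src2, src2, 0xB1); /* swap each (c,d) to (d,c) */
    __m128 t2 = _mm_mul_ps(im, sw);               /* b*d and b*c terms */
    return _mm_addsub_ps(t1, t2);                 /* (ac-bd, ad+bc) per pair */
}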

Example 6-10. Division of Two Pairs of Single-precision Complex Numbers
// Division of (ak + i bk ) / (ck + i dk )
movshdup xmm0, Src1     ; load imaginary parts into the
                        ; destination, b1, b1, b0, b0
movaps xmm1, Src2       ; load the 2nd pair of complex values,
                        ; i.e. d1, c1, d0, c0
mulps xmm0, xmm1        ; temporary results, b1d1, b1c1, b0d0, b0c0
shufps xmm1, xmm1, 0xB1 ; reorder the real and imaginary parts,
                        ; c1, d1, c0, d0
movsldup xmm2, Src1     ; load the real parts into the
                        ; destination, a1, a1, a0, a0
mulps xmm2, xmm1        ; temp results, a1c1, a1d1, a0c0, a0d0
addsubps xmm0, xmm2     ; a1c1+b1d1, b1c1-a1d1, a0c0+b0d0, b0c0-a0d0
mulps xmm1, xmm1        ; c1c1, d1d1, c0c0, d0d0
movaps xmm2, xmm1       ; c1c1, d1d1, c0c0, d0d0
shufps xmm2, xmm2, 0xB1 ; d1d1, c1c1, d0d0, c0c0
addps xmm2, xmm1        ; c1c1+d1d1, c1c1+d1d1, c0c0+d0d0, c0c0+d0d0
divps xmm0, xmm2
shufps xmm0, xmm0, 0xB1 ; (b1c1-a1d1)/(c1c1+d1d1),
                        ; (a1c1+b1d1)/(c1c1+d1d1),
                        ; (b0c0-a0d0)/(c0c0+d0d0),
                        ; (a0c0+b0d0)/(c0c0+d0d0)
In both examples, the complex numbers are stored in arrays of structures. MOVSLDUP, MOVSHDUP and the asymmetric ADDSUBPS allow performing complex arithmetic on two pairs of single-precision complex numbers simultaneously and without any unnecessary swizzling between data elements.
Due to microarchitectural differences, software should implement multiplication of complex double-precision numbers using SSE3 instructions on processors based on Intel Core microarchitecture. On Intel Core Duo and Intel Core Solo processors, software should use scalar SSE2 instructions to implement double-precision complex multiplication. This is because the data path between SIMD execution units is 128 bits in Intel Core microarchitecture, and only 64 bits in previous microarchitectures. Processors based on the Enhanced Intel Core microarchitecture generally execute SSE3 instructions more efficiently than previous microarchitectures; they also have a 128-bit shuffle unit that benefits complex arithmetic operations further than Intel Core microarchitecture did.
Example 6-11 shows two equivalent implementations of double-precision complex multiply of two pairs of complex numbers using vector SSE2 versus SSE3 instructions. Example 6-12 shows the equivalent scalar SSE2 implementation.

Example 6-11. Double-Precision Complex Multiplication of Two Pairs

SSE2 Vector Implementation:
movapd xmm0, [eax]      ;y x
movapd xmm1, [eax+16]   ;w z
unpcklpd xmm1, xmm1     ;z z
movapd xmm2, [eax+16]   ;w z
unpckhpd xmm2, xmm2     ;w w
mulpd xmm1, xmm0        ;z*y z*x
mulpd xmm2, xmm0        ;w*y w*x
xorpd xmm2, xmm7        ;-w*y +w*x
shufpd xmm2, xmm2, 1    ;w*x -w*y
addpd xmm2, xmm1        ;z*y+w*x z*x-w*y
movapd [ecx], xmm2

SSE3 Vector Implementation:
movapd xmm0, [eax]      ;y x
movapd xmm1, [eax+16]   ;w z
movapd xmm2, xmm1       ;w z
unpcklpd xmm1, xmm1     ;z z
unpckhpd xmm2, xmm2     ;w w
mulpd xmm1, xmm0        ;z*y z*x
mulpd xmm2, xmm0        ;w*y w*x
shufpd xmm2, xmm2, 1    ;w*x w*y
addsubpd xmm1, xmm2     ;w*x+z*y z*x-w*y
movapd [ecx], xmm1

Example 6-12. Double-Precision Complex Multiplication Using Scalar SSE2
movsd xmm0, [eax]      ;x
movsd xmm5, [eax+8]    ;y
movsd xmm1, [eax+16]   ;z
movsd xmm2, [eax+24]   ;w
movsd xmm3, xmm1       ;z
movsd xmm4, xmm2       ;w
mulsd xmm1, xmm0       ;z*x
mulsd xmm2, xmm0       ;w*x
mulsd xmm3, xmm5       ;z*y
mulsd xmm4, xmm5       ;w*y
subsd xmm1, xmm4       ;z*x - w*y
addsd xmm3, xmm2       ;z*y + w*x
movsd [ecx], xmm1
movsd [ecx+8], xmm3

6.6.1.2 Packed Floating-Point Performance in Intel Core Duo Processor

Most packed SIMD floating-point code will speed up on Intel Core Solo processors relative to Pentium M processors. This is due to improvement in decoding packed SIMD instructions.
The improvement of packed floating-point performance on the Intel Core Solo processor over the Pentium M processor depends on several factors. Generally, code that is decoder-bound and/or has a mixture of integer and packed floating-point instructions can expect significant gain. Code that is limited by execution latency and has a "cycles per instruction" ratio greater than one will not benefit from decoder improvement.
When targeting complex arithmetic on Intel Core Solo and Intel Core Duo processors, using single-precision SSE3 instructions can deliver higher performance than alternatives. On the other hand, tasks requiring double-precision complex arithmetic may perform better using scalar SSE2 instructions on Intel Core Solo and Intel Core Duo processors. This is because scalar SSE2 instructions can be dispatched through two ports and executed using two separate floating-point units.
Packed horizontal SSE3 instructions (HADDPS and HSUBPS) can simplify the code sequence for some tasks. However, these instructions consist of more than five micro-ops on Intel Core Solo and Intel Core Duo processors. Care must be taken to ensure the latency and decoding penalty of the horizontal instruction does not offset any algorithmic benefits.

6.6.2 Dot Product and Horizontal SIMD Instructions

The AOS type of data organization is often more natural in many algebraic formulas; one common example is the dot product operation. A dot product can be implemented using the SSE/SSE2 instruction sets. SSE3 added a few horizontal add/subtract instructions for applications that rely on the horizontal computation model. SSE4.1 provides additional enhancement with instructions that are capable of directly evaluating dot product operations of vectors of 2, 3 or 4 components.

Example 6-13. Dot Product of Vector Length 4 Using SSE/SSE2
Using SSE/SSE2 to compute one dot product
movaps xmm0, [eax] // a4, a3, a2, a1
mulps xmm0, [eax+16] // a4*b4, a3*b3, a2*b2, a1*b1
movhlps xmm1, xmm0 // X, X, a4*b4, a3*b3, upper half not needed
addps xmm0, xmm1 // X, X, a2*b2+a4*b4, a1*b1+a3*b3,
pshufd xmm1, xmm0, 1 // X, X, X, a2*b2+a4*b4
addss xmm0, xmm1 // a1*b1+a3*b3+a2*b2+a4*b4
movss [ecx], xmm0


Example 6-14. Dot Product of Vector Length 4 Using SSE3
Using SSE3 to compute one dot product
movaps xmm0, [eax]
mulps xmm0, [eax+16] // a4*b4, a3*b3, a2*b2, a1*b1
haddps xmm0, xmm0 // a4*b4+a3*b3, a2*b2+a1*b1, a4*b4+a3*b3, a2*b2+a1*b1
movaps xmm1, xmm0 // a4*b4+a3*b3, a2*b2+a1*b1, a4*b4+a3*b3, a2*b2+a1*b1
psrlq xmm0, 32 // 0, a4*b4+a3*b3, 0, a4*b4+a3*b3
addss xmm0, xmm1 // -, -, -, a1*b1+a3*b3+a2*b2+a4*b4
movss [eax], xmm0

Example 6-15. Dot Product of Vector Length 4 Using SSE4.1
Using SSE4.1 to compute one dot product
movaps xmm0, [eax]
dpps xmm0, [eax+16], 0xf1 // 0, 0, 0, a1*b1+a3*b3+a2*b2+a4*b4
movss [eax], xmm0
Example 6-13, Example 6-14, and Example 6-15 compare the basic code sequences to compute one dot-product result for a pair of vectors.
The selection of an optimal sequence in conjunction with an application's memory access patterns may favor different approaches. For example, if each dot product result is immediately consumed by additional computational sequences, it may be more optimal to compare the relative speed of these different approaches. If dot products can be computed for an array of vectors and kept in the cache for subsequent computations, then the more optimal choice may depend on the relative throughput of the sequence of instructions.
In Intel Core microarchitecture, Example 6-14 has higher throughput than Example 6-13. Due to the relatively longer latency of HADDPS, however, Example 6-14 is slightly slower than Example 6-13 in latency.
In Enhanced Intel Core microarchitecture, Example 6-15 is faster in both latency and throughput than Example 6-13 and Example 6-14. Although the latency of DPPS is also relatively long, it is compensated by the reduction in the number of instructions in Example 6-15 to do the same amount of work.
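
For reference, Example 6-15 can also be written with the SSE4.1 intrinsic; the following sketch (with illustrative names) assumes 16-byte-aligned inputs:

#include <smmintrin.h>   /* SSE4.1 */

/* Dot product of two 4-component vectors using DPPS: immediate 0xF1
   multiplies all four element pairs, sums them, and writes the result
   to element 0 of the destination. */
float dot4(const float *a, const float *b)
{
    __m128 va = _mm_load_ps(a);
    __m128 vb = _mm_load_ps(b);
    __m128 d  = _mm_dp_ps(va, vb, 0xF1);
    return _mm_cvtss_f32(d);
}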
Unrolling can further improve the throughput of each of the three dot product implementations. Example 6-16 shows two unrolled versions using the basic SSE2 and SSE3 sequences. The SSE4.1 version can also be unrolled, using INSERTPS to pack four dot-product results.

Example 6-16. Unrolled Implementation of Four Dot Products

SSE2 Implementation:
movaps xmm0, [eax]
mulps xmm0, [eax+16]    ;w0*w1 z0*z1 y0*y1 x0*x1
movaps xmm2, [eax+32]
mulps xmm2, [eax+16+32] ;w2*w3 z2*z3 y2*y3 x2*x3
movaps xmm3, [eax+64]
mulps xmm3, [eax+16+64] ;w4*w5 z4*z5 y4*y5 x4*x5
movaps xmm4, [eax+96]
mulps xmm4, [eax+16+96] ;w6*w7 z6*z7 y6*y7 x6*x7
movaps xmm1, xmm0
unpcklps xmm0, xmm2     ; y2*y3 y0*y1 x2*x3 x0*x1
unpckhps xmm1, xmm2     ; w2*w3 w0*w1 z2*z3 z0*z1
movaps xmm5, xmm3
unpcklps xmm3, xmm4     ; y6*y7 y4*y5 x6*x7 x4*x5
unpckhps xmm5, xmm4     ; w6*w7 w4*w5 z6*z7 z4*z5
addps xmm0, xmm1
addps xmm5, xmm3
movaps xmm1, xmm5
movhlps xmm1, xmm0
movlhps xmm0, xmm5
addps xmm0, xmm1
movaps [ecx], xmm0

SSE3 Implementation:
movaps xmm0, [eax]
mulps xmm0, [eax+16]
movaps xmm1, [eax+32]
mulps xmm1, [eax+16+32]
movaps xmm2, [eax+64]
mulps xmm2, [eax+16+64]
movaps xmm3, [eax+96]
mulps xmm3, [eax+16+96]
haddps xmm0, xmm1
haddps xmm2, xmm3
haddps xmm0, xmm2
movaps [ecx], xmm0

6.6.3 Vector Normalization

Normalizing vectors is a common operation in many floating-point applications. Example 6-17 shows an example in C of normalizing an array of (x, y, z) vectors.

Example 6-17. Normalization of an Array of Vectors
for (i = 0; i < g_array_aperture; i += stride) {
    if (i + stride >= g_array_aperture) {
        next = &pArray[0];
    }
    else {
        next = &pArray[i + stride];
    }
    *p = next; // populate the address of the next node
}
The effective latency reduction for several microarchitecture implementations is shown in Figure 7-2. For
a constant-stride access pattern, the benefit of the automatic hardware prefetcher begins at half the
trigger threshold distance and reaches maximum benefit when the cache-miss stride is 64 bytes.


[Figure 7-2 chart: "Upper Bound of Pointer-Chasing Latency Reduction" - effective latency reduction (0% to 120%) plotted against access stride in bytes (64 to 240), with one curve each for Family 15 Models 0,1,2; Family 15 Models 3,4; Family 15 Model 6; Family 6 Model 13; and Family 6 Model 14.]

Figure 7-2. Effective Latency Reduction as a Function of Access Stride

7.5.4 Example of Latency Hiding with S/W Prefetch Instruction

Achieving the highest level of memory optimization using PREFETCH instructions requires an understanding of the architecture of a given machine. This section translates the key architectural implications into several simple guidelines for programmers to use.
Figure 7-3 and Figure 7-4 show two scenarios of a simplified 3D geometry pipeline as an example. A 3D-geometry pipeline typically fetches one vertex record at a time and then performs transformation and lighting functions on it. Both figures show two separate pipelines, an execution pipeline and a memory pipeline (front-side bus).
Since the Pentium 4 processor (similar to the Pentium II and Pentium III processors) completely decouples the functionality of execution and memory access, the two pipelines can function concurrently.
Figure 7-3 shows "bubbles" in both the execution and memory pipelines. When loads are issued for accessing vertex data, the execution units sit idle and wait until data is returned. On the other hand, the memory bus sits idle while the execution units are processing vertices. This scenario severely decreases the advantage of having a decoupled architecture.

[Figure 7-3 diagram: without prefetch, the execution pipeline stalls while the loads for each vertex complete, and the front-side bus sits idle while each vertex is processed.]

Figure 7-3. Memory Access Latency and Execution Without Prefetch


[Figure 7-4 diagram: with prefetch, the memory latency for vertices n+1 and n+2 overlaps the execution of vertices n-2 through n, keeping both the execution pipeline and the front-side bus busy.]

Figure 7-4. Memory Access Latency and Execution With Prefetch

The performance loss caused by poor utilization of resources can be completely eliminated by correctly scheduling the PREFETCH instructions. As shown in Figure 7-4, prefetch instructions are issued two vertex iterations ahead. This assumes that only one vertex gets processed in one iteration and a new data cache line is needed for each iteration. As a result, when iteration n, vertex Vn, is being processed, the requested data is already brought into cache. In the meantime, the front-side bus is transferring the data needed for iteration n+1, vertex Vn+1. Because there is no dependence between Vn+1 data and the execution of Vn, the latency for data access of Vn+1 can be entirely hidden behind the execution of Vn. Under such circumstances, no "bubbles" are present in the pipelines and thus the best possible performance can be achieved.
Prefetching is useful for inner loops that have heavy computations, or are close to the boundary between being compute-bound and memory-bandwidth-bound. It is probably not very useful for loops which are predominately memory bandwidth-bound.
When data is already located in the first level cache, prefetching can be useless and could even slow down performance because the extra µops either back up waiting for outstanding memory accesses or may be dropped altogether. This behavior is platform-specific and may change in the future.

7.5.5 Software Prefetching Usage Checklist

The following checklist covers issues that need to be addressed and/or resolved to use the software PREFETCH instruction properly:

• Determine software prefetch scheduling distance.
• Use software prefetch concatenation.
• Minimize the number of software prefetches.
• Mix software prefetch with computation instructions.
• Use cache blocking techniques (for example, strip mining).
• Balance single-pass versus multi-pass execution.
• Resolve memory bank conflict issues.
• Resolve cache management issues.

Subsequent sections discuss the above items.


7.5.6 Software Prefetch Scheduling Distance

Determining the ideal prefetch placement in the code depends on many architectural parameters, including: the amount of memory to be prefetched, cache lookup latency, system memory latency, and an estimate of the computation cycles. The ideal distance for prefetching data is processor- and platform-dependent. If the distance is too short, the prefetch will not hide the latency of the fetch behind computation. If the prefetch is too far ahead, prefetched data may be flushed out of the cache by the time it is required.
Since prefetch distance is not a well-defined metric, for this discussion we define a new term, prefetch scheduling distance (PSD), which is represented by a number of iterations. For large loops, prefetch scheduling distance can be set to 1 (that is, schedule prefetch instructions one iteration ahead). For small loop bodies (that is, loop iterations with little computation), the prefetch scheduling distance must be more than one iteration.
A simplified equation to compute PSD is deduced from the mathematical model.
Example 7-4 illustrates the use of a prefetch within the loop body. The prefetch scheduling distance is set to 3; ESI is effectively the pointer to a line, EDX is the address of the data being referenced, and XMM1-XMM4 are the data used in computation. Example 7-4 uses two independent cache lines of data per iteration. The PSD would need to be increased/decreased if more/less than two cache lines are used per iteration.
Example 7-4. Prefetch Scheduling Distance
top_loop:
prefetchnta [edx + esi + 128*3]
prefetchnta [edx*4 + esi + 128*3]
.....
movaps xmm1, [edx + esi]
movaps xmm2, [edx*4 + esi]
movaps xmm3, [edx + esi + 16]
movaps xmm4, [edx*4 + esi + 16]
.....
.....
add esi, 128
cmp esi, ecx
jl top_loop

7.5.7 Software Prefetch Concatenation

Maximum performance can be achieved when the execution pipeline is at maximum throughput, without incurring any memory latency penalties. This can be achieved by prefetching data to be used in successive iterations in a loop. De-pipelining memory generates bubbles in the execution pipeline.
To explain this performance issue, a 3D geometry pipeline that processes 3D vertices in strip format is used as an example. A strip contains a list of vertices whose predefined vertex order forms contiguous triangles. It can be easily observed that the memory pipe is de-pipelined on the strip boundary due to ineffective prefetch arrangement. The execution pipeline is stalled for the first two iterations for each strip. As a result, the average latency for completing an iteration will be 165 clocks.
This memory de-pipelining creates inefficiency in both the memory pipeline and the execution pipeline. The de-pipelining effect can be removed by applying a technique called prefetch concatenation. With this technique, the memory access and execution can be fully pipelined and fully utilized.
For nested loops, memory de-pipelining could occur during the interval between the last iteration of an inner loop and the next iteration of its associated outer loop. Without paying special attention to prefetch insertion, loads from the first iteration of an inner loop can miss the cache and stall the execution pipeline waiting for data to be returned, thus degrading the performance.

In Example 7-5, the cache line containing a[ii][0] is not prefetched at all and always misses the cache. This assumes that no array a[][] footprint resides in the cache. The penalty of memory de-pipelining stalls can be amortized across the inner loop iterations. However, it may become very harmful when the inner loop is short. In addition, the last prefetch in the final PSD iterations is wasted and consumes machine resources. Prefetch concatenation is introduced here to eliminate the performance issue of memory de-pipelining.

Example 7-5. Using Prefetch Concatenation
for (ii = 0; ii < 100; ii++) {
    for (jj = 0; jj < 32; jj+=8) {
        prefetch a[ii][jj+8]
        computation a[ii][jj]
    }
}
Prefetch concatenation can bridge the execution pipeline bubbles between the boundary of an inner loop and its associated outer loop. Simply by unrolling the last iteration out of the inner loop and specifying the effective prefetch address for data used in the following iteration, the performance loss of memory de-pipelining can be completely removed. Example 7-6 gives the rewritten code.

Example 7-6. Concatenation and Unrolling the Last Iteration of Inner Loop
for (ii = 0; ii < 100; ii++) {
    for (jj = 0; jj < 24; jj+=8) { /* N-1 iterations */
        prefetch a[ii][jj+8]
        computation a[ii][jj]
    }
    prefetch a[ii+1][0]
    computation a[ii][jj] /* Last iteration */
}
With this improved code, only the first iteration of the outer loop suffers any memory access latency penalty, assuming the computation time is larger than the memory latency. Inserting a prefetch of the first data element needed prior to entering the nested loop computation would eliminate or reduce the start-up penalty for the very first iteration of the outer loop. This uncomplicated high-level code optimization can improve memory performance significantly.
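
A compilable rendering of the pattern in Example 7-6 might look like the following sketch, assuming a 100x32 array of floats and a placeholder compute() routine (both illustrative, not from this manual):

#include <xmmintrin.h>   /* _mm_prefetch */

extern void compute(float *block);   /* stands in for "computation a[ii][jj]" */

void process(float a[100][32])
{
    for (int ii = 0; ii < 100; ii++) {
        int jj;
        for (jj = 0; jj < 24; jj += 8) {   /* N-1 iterations */
            _mm_prefetch((const char *)&a[ii][jj + 8], _MM_HINT_NTA);
            compute(&a[ii][jj]);
        }
        /* Unrolled last iteration: the prefetch bridges into the next row.
           PREFETCH does not fault, so the address just past the last row
           on the final iteration is harmless. */
        _mm_prefetch((const char *)&a[ii + 1][0], _MM_HINT_NTA);
        compute(&a[ii][jj]);
    }
}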

7.5.8 Minimize Number of Software Prefetches

Prefetch instructions are not completely free in terms of bus cycles, machine cycles and resources, even though they require minimal clock and memory bandwidth.
Excessive prefetching may lead to performance penalties because of issue penalties in the front end of the machine and/or resource contention in the memory sub-system. This effect may be severe in cases where the target loops are small and/or cases where the target loop is issue-bound.
One approach to solve the excessive prefetching issue is to unroll and/or software-pipeline loops to reduce the number of prefetches required. Figure 7-5 presents a code example which implements prefetch and unrolls the loop to remove the redundant prefetch instructions whose prefetch addresses hit the previously issued prefetch instructions. In this particular example, unrolling the original loop once saves six prefetch instructions and nine instructions for conditional jumps in every other iteration.


[Figure 7-5 contents. Original loop (each iteration covers 16 bytes, with redundant prefetches):
top_loop:
prefetchnta [edx+esi+32]
prefetchnta [edx*4+esi+32]
.....
movaps xmm1, [edx+esi]
movaps xmm2, [edx*4+esi]
.....
add esi, 16
cmp esi, ecx
jl top_loop

Unrolled loop (one prefetch pair per 128 bytes):
top_loop:
prefetchnta [edx+esi+128]
prefetchnta [edx*4+esi+128]
.....
movaps xmm1, [edx+esi]
movaps xmm2, [edx*4+esi]
.....
movaps xmm1, [edx+esi+16]
movaps xmm2, [edx*4+esi+16]
.....
movaps xmm1, [edx+esi+96]
movaps xmm2, [edx*4+esi+96]
.....
add esi, 128
cmp esi, ecx
jl top_loop]

Figure 7-5. Prefetch and Loop Unrolling

Figure 7-6 demonstrates the effectiveness of software prefetches in latency hiding.


Figure 7-6. Memory Access Latency and Execution With Prefetch

The X axis in Figure 7-6 indicates the number of computation clocks per loop (each iteration is independent). The Y axis indicates the execution time measured in clocks per loop. The secondary Y axis indicates the percentage of bus bandwidth utilization. The tests vary by the following parameters:

• Number of load/store streams — Each load and store stream accesses one 128-byte cache line each per iteration.
• Amount of computation per loop — This is varied by increasing the number of dependent arithmetic operations executed.
• Number of software prefetches per loop — For example, one every 16 bytes, 32 bytes, 64 bytes, 128 bytes.


As expected, the leftmost portion of each of the graphs in Figure 7-6 shows that when there is not
enough computation to overlap the latency of memory access, prefetch does not help and that the
execution is essentially memory-bound. The graphs also illustrate that redundant prefetches do not
increase performance.

7.5.9 Mix Software Prefetch with Computation Instructions

It may seem convenient to cluster all of the PREFETCH instructions at the beginning of a loop body or before a loop, but this can lead to severe performance degradation. In order to achieve the best possible performance, PREFETCH instructions must be interspersed with other computational instructions in the instruction sequence rather than clustered together. If possible, they should also be placed apart from loads. This improves the instruction level parallelism and reduces potential instruction resource stalls. In addition, this mixing reduces the pressure on the memory access resources and in turn reduces the possibility of the prefetch retiring without fetching data.
Figure 7-7 illustrates distributing PREFETCH instructions. A simple and useful heuristic of prefetch spreading for a Pentium 4 processor is to insert a PREFETCH instruction every 20 to 25 clocks. Rearranging PREFETCH instructions could yield a noticeable speedup for code that stresses the cache resource.

[Figure 7-7 contents. Clustered prefetches:
top_loop:
prefetchnta [ebx+128]
prefetchnta [ebx+1128]
prefetchnta [ebx+2128]
prefetchnta [ebx+3128]
....
prefetchnta [ebx+17128]
prefetchnta [ebx+18128]
prefetchnta [ebx+19128]
prefetchnta [ebx+20128]
movaps xmm1, [ebx]
addps xmm2, [ebx+3000]
mulps xmm3, [ebx+4000]
addps xmm1, [ebx+1000]
addps xmm2, [ebx+3016]
mulps xmm1, [ebx+2000]
mulps xmm1, xmm2
........
add ebx, 128
cmp ebx, ecx
jl top_loop

Spread prefetches:
top_loop:
prefetchnta [ebx+128]
movaps xmm1, [ebx]
addps xmm2, [ebx+3000]
mulps xmm3, [ebx+4000]
prefetchnta [ebx+1128]
addps xmm1, [ebx+1000]
addps xmm2, [ebx+3016]
prefetchnta [ebx+2128]
mulps xmm1, [ebx+2000]
mulps xmm1, xmm2
prefetchnta [ebx+3128]
.......
prefetchnta [ebx+18128]
......
prefetchnta [ebx+19128]
......
prefetchnta [ebx+20128]
add ebx, 128
cmp ebx, ecx
jl top_loop]

Figure 7-7. Spread Prefetch Instructions

NOTE
To avoid instruction execution stalls due to over-utilization of the resource, PREFETCH instructions must be interspersed with computational instructions.

7.5.10 Software Prefetch and Cache Blocking Techniques

Cache blocking techniques (such as strip-mining) are used to improve temporal locality and the cache hit rate. Strip-mining is a one-dimensional temporal locality optimization for memory. When two-dimensional arrays are used in programs, the loop blocking technique (similar to strip-mining but in two dimensions) can be applied for better memory performance.


If an application uses a large data set that can be reused across multiple passes of a loop, it will benefit
from strip mining. Data sets larger than the cache will be processed in groups small enough to fit into
cache. This allows temporal data to reside in the cache longer, reducing bus traffic.
Data set size and temporal locality (data characteristics) fundamentally affect how PREFETCH instructions are applied to strip-mined code. Figure 7-8 shows two simplified scenarios for temporally-adjacent
data and temporally-non-adjacent data.

[Figure 7-8 diagram: in the temporally-adjacent case, passes 1 through 4 access Dataset A, A, B, B; in the temporally non-adjacent case, they access Dataset A, B, A, B.]

Figure 7-8. Cache Blocking – Temporally Adjacent and Non-adjacent Passes

In the temporally-adjacent scenario, subsequent passes use the same data and find it already in second-level cache. Prefetch issues aside, this is the preferred situation. In the temporally non-adjacent scenario, data used in pass m is displaced by pass (m+1), requiring data re-fetch into the first level cache and perhaps the second level cache if a later pass reuses the data. If both data sets fit into the second-level cache, load operations in passes 3 and 4 become less expensive.
Figure 7-9 shows how prefetch instructions and strip-mining can be applied to increase performance in both of these scenarios.


[Figure 7-9 diagram: for temporally adjacent passes, PREFETCHNTA is used for Dataset A and Dataset B with one way of the second-level cache strip-mined (SM1), and each dataset is then reused; for temporally non-adjacent passes, PREFETCHT0 is used (SM2) so that each dataset stays in the second-level cache for reuse.]

Figure 7-9. Examples of Prefetch and Strip-mining for Temporally Adjacent and Non-Adjacent Passes Loops

For Pentium 4 processors, the left scenario shows a graphical implementation of using PREFETCHNTA to prefetch data into selected ways of the second-level cache only (SM1 denotes strip-mining one way of the second-level cache), minimizing second-level cache pollution. Use PREFETCHNTA if the data is only touched once during the entire execution pass, in order to minimize cache pollution in the higher level caches. This provides instant availability when the read access is issued, assuming the prefetch was issued far enough ahead.
In the scenario to the right (see Figure 7-9), keeping the data in one way of the second-level cache does not improve cache locality. Therefore, use PREFETCHT0 to prefetch the data. This amortizes the latency of the memory references in passes 1 and 2, and keeps a copy of the data in second-level cache, which reduces memory traffic and latencies for passes 3 and 4. To further reduce the latency, it might be worth considering extra PREFETCHNTA instructions prior to the memory references in passes 3 and 4.
In Example 7-7, consider the data access patterns of a 3D geometry engine first without strip-mining and then incorporating strip-mining. Note that the 4-wide SIMD instructions of the Pentium III processor can process 4 vertices per iteration.
Without strip-mining, all the x,y,z coordinates for the four vertices must be re-fetched from memory in the second pass, that is, the lighting loop. This causes under-utilization of cache lines fetched during the transformation loop as well as bandwidth wasted in the lighting loop.
Example 7-7. Data Access of a 3D Geometry Engine without Strip-mining
while (nvtx < MAX_NUM_VTX) {
    prefetchnta vertexi data      // v =[x,y,z,nx,ny,nz,tu,tv]
    prefetchnta vertexi+1 data
    prefetchnta vertexi+2 data
    prefetchnta vertexi+3 data
    TRANSFORMATION code           // use only x,y,z,tu,tv of a vertex
    nvtx+=4
}
while (nvtx < MAX_NUM_VTX) {
    prefetchnta vertexi data      // v =[x,y,z,nx,ny,nz,tu,tv]
                                  // x,y,z fetched again
    prefetchnta vertexi+1 data
    prefetchnta vertexi+2 data
    prefetchnta vertexi+3 data
    compute the light vectors     // use only x,y,z
    LOCAL LIGHTING code           // use only nx,ny,nz
    nvtx+=4
}
Now consider the code in Example 7-8 where strip-mining has been incorporated into the loops.

Example 7-8. Data Access of a 3D Geometry Engine with Strip-mining
while (nstrip < NUM_STRIP) {
    /* Strip-mine the loop to fit data into one way of the second-level cache */
    while (nvtx < MAX_NUM_VTX_PER_STRIP) {
        prefetchnta vertexi data      // v=[x,y,z,nx,ny,nz,tu,tv]
        prefetchnta vertexi+1 data
        prefetchnta vertexi+2 data
        prefetchnta vertexi+3 data
        TRANSFORMATION code
        nvtx+=4
    }
    while (nvtx < MAX_NUM_VTX_PER_STRIP) {
        /* x y z coordinates are in the second-level cache, no prefetch is required */
        compute the light vectors
        POINT LIGHTING code
        nvtx+=4
    }
}

With strip-mining, all vertex data can be kept in the cache (for example, one way of second-level cache) during the strip-mined transformation loop and reused in the lighting loop. Keeping data in the cache reduces both bus traffic and the number of prefetches used.
Table 7-1 summarizes the steps of the basic usage model that incorporates only software prefetch with strip-mining. The steps are:

• Do strip-mining: partition loops so that the dataset fits into second-level cache.
• Use PREFETCHNTA if the data is only used once or the dataset fits into 32 KBytes (one way of second-level cache). Use PREFETCHT0 if the dataset exceeds 32 KBytes.

The above steps are platform-specific and provide an implementation example. The variables NUM_STRIP and MAX_NUM_VTX_PER_STRIP can be heuristically determined for peak performance for a specific application on a specific platform.


Table 7-1. Software Prefetching Considerations into Strip-mining Code

Read-Once Array References: Prefetchnta — evict one way; minimize pollution.
Read-Multiple-Times Array References, Adjacent Passes: Prefetcht0, SM1 — pay memory access cost for the first pass of each array; amortize the first pass with subsequent passes.
Read-Multiple-Times Array References, Non-Adjacent Passes: Prefetcht0, SM1 (2nd-level pollution) — pay memory access cost for the first pass of every strip; amortize the first pass with subsequent passes.

7.5.11 Hardware Prefetching and Cache Blocking Techniques

Tuning data access patterns for the automatic hardware prefetch mechanism can minimize the memory access costs of the first pass of the read-multiple-times memory references and some of the read-once memory references. An example of read-once memory references is a matrix or image transpose, reading from a column-first orientation and writing to a row-first orientation, or vice versa.
Example 7-9 shows a nested loop of data movement that represents a typical matrix/image transpose problem. If the dimensions of the array are large, not only will the footprint of the dataset exceed the last level cache, but cache misses will also occur at large strides. If the dimensions happen to be powers of 2, aliasing conditions due to the finite number of way-associativity (see "Capacity Limits and Aliasing in Caches") will exacerbate the likelihood of cache evictions.
Example 7-9. Using HW Prefetch to Improve Read-Once Memory Traffic
a) Un-optimized image transpose
// dest and src represent two-dimensional arrays
for( i = 0; i < NUMCOLS; i++) {
    // inner loop reads single column
    for( j = 0; j < NUMROWS; j++) {
        // Each read reference causes large-stride cache miss
        dest[i*NUMROWS + j] = src[j*NUMROWS + i];
    }
}
b)
// tilewidth = L2SizeInBytes/2/TileHeight/Sizeof(element)
for( i = 0; i < NUMCOLS; i += tilewidth) {
    for( j = 0; j < NUMROWS; j++) {
        // access multiple elements in the same row in the inner loop
        // access pattern friendly to hw prefetch and improves hit rate
        for( k = 0; k < tilewidth; k++)
            dest[j + (i+k)*NUMROWS] = src[i+k + j*NUMROWS];
    }
}
Example 7-9 (b) shows the technique of tiling with optimal selection of tile size and tile width to take advantage of hardware prefetch. With tiling, one can choose the size of two tiles to fit in the last level cache. Maximizing the width of each tile for memory read references enables the hardware prefetcher to initiate bus requests to read some cache lines before the code actually references the linear addresses.


7.5.12 Single-pass versus Multi-pass Execution

An algorithm can use single- or multi-pass execution, defined as follows:

• Single-pass, or unlayered execution passes a single data element through an entire computation pipeline.
• Multi-pass, or layered execution performs a single stage of the pipeline on a batch of data elements, before passing the batch on to the next stage.

A specific trade-off exists between single-pass and multi-pass execution, depending on an algorithm's implementation and its use of one approach or the other. See Figure 7-10.
Multi-pass execution is often easier to use when implementing a general purpose API, where the choice of code paths that can be taken depends on the specific combination of features selected by the application (for example, for 3D graphics, this might include the type of vertex primitives used and the number and type of light sources).
With such a broad range of permutations possible, a single-pass approach would be complicated in terms of code size and validation. In such cases, each possible permutation would require a separate code sequence. For example, an object with features A, B, C, D can have a subset of features enabled, say, A, B, D. This stage would use one code path; another combination of enabled features would have a different code path. It makes more sense to perform each pipeline stage as a separate pass, with conditional clauses to select different features that are implemented within each stage. By using strip-mining, the number of vertices processed by each stage (for example, the batch size) can be selected to ensure that the batch stays within the processor caches through all passes. An intermediate cached buffer is used to pass the batch of vertices from one stage or pass to the next one.
Single-pass execution can be better suited to applications which limit the number of features that may be used at a given time. A single-pass approach can reduce the amount of data copying that can occur with a multi-pass engine. See Figure 7-10.


[Figure 7-10 diagram: a single-pass engine runs culling, transform and lighting on each strip inside the vertex-processing inner loop, with the outer loop processing strips; a multi-pass engine runs each stage (culling, then transform, then lighting) over the whole strip list (80 vis, 60 invis, 40 vis) before moving to the next stage.]

Figure 7-10. Single-Pass Vs. Multi-Pass 3D Geometry Engines

The choice of single-pass or multi-pass can have a number of performance implications. For instance, in a multi-pass pipeline, stages that are limited by bandwidth (either input or output) will reflect more of this performance limitation in overall execution time. In contrast, for a single-pass approach, bandwidth limitations can be distributed/amortized across other computation-intensive stages. Also, the choice of which prefetch hints to use is impacted by whether a single-pass or multi-pass approach is used.

7.6 MEMORY OPTIMIZATION USING NON-TEMPORAL STORES

Non-temporal stores can also be used to manage data retention in the cache. Uses for non-temporal stores include:

• To combine many writes without disturbing the cache hierarchy.
• To manage which data structures remain in the cache and which are transient.

Detailed implementations of these usage models are covered in the following sections.

7.6.1 Non-temporal Stores and Software Write-Combining

Use non-temporal stores in the cases when the data to be stored is:

• Write-once (non-temporal).
• Too large to fit in the cache, and would thus cause cache thrashing.


Non-temporal stores do not invoke a cache line allocation, which means they are not write-allocate. As a result, caches are not polluted and no dirty writeback is generated to compete with useful data bandwidth. Without non-temporal stores, bus bandwidth will suffer when caches start to be thrashed because of dirty writebacks.
In the Streaming SIMD Extensions implementation, when non-temporal stores are written into writeback or write-combining memory regions, these stores are weakly-ordered and will be combined internally inside the processor's write-combining buffer and written out to memory as a line burst transaction. To achieve the best possible performance, it is recommended to align data along the cache line boundary and write it consecutively, a cache line at a time, while using non-temporal stores. If consecutive writes are prohibitive due to programming constraints, then software write-combining (SWWC) buffers can be used to enable line burst transactions.
You can declare small SWWC buffers (a cache line for each buffer) in your application to enable explicit write-combining operations. Instead of writing to non-temporal memory space immediately, the program writes data into SWWC buffers and combines it inside these buffers. The program only writes a SWWC buffer out using non-temporal stores when the buffer is filled up, that is, a cache line (128 bytes for the Pentium 4 processor). Although the SWWC method requires explicit instructions for performing temporary writes and reads, this ensures that the transaction on the front-side bus causes a line transaction rather than several partial transactions. Application performance can gain considerably from implementing this technique. These SWWC buffers can be maintained in the second-level cache and re-used throughout the program.
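
A minimal sketch of flushing such an SWWC buffer follows (all names are illustrative, and a 64-byte line is assumed; a Pentium 4 processor would use 128 bytes). Writes accumulate in a cache-line-sized buffer that is then written to the non-temporal destination as one full-line burst of streaming stores:

#include <emmintrin.h>   /* SSE2: _mm_stream_si128 */
#include <stdint.h>

#define SWWC_LINE 64     /* assumed cache-line size */

/* Flush one cache-line-sized software write-combining buffer to its
   destination with streaming stores, producing a single line burst
   on the bus instead of several partial transactions. Both the buffer
   and the destination are assumed to be 16-byte aligned. */
static void swwc_flush(const uint8_t buf[SWWC_LINE], void *dst)
{
    const __m128i *s = (const __m128i *)buf;
    __m128i *d = (__m128i *)dst;
    for (int i = 0; i < SWWC_LINE / 16; i++)
        _mm_stream_si128(&d[i], s[i]);
}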

7.6.2 Cache Management

Streaming instructions (PREFETCH and STORE) can be used to manage data and minimize disturbance of temporal data held within the processor's caches.
In addition, the Pentium 4 processor takes advantage of Intel C++ Compiler support for C++ language-level features for the Streaming SIMD Extensions. Streaming SIMD Extensions and MMX technology instructions provide intrinsics that allow you to optimize cache utilization. Examples of such Intel compiler intrinsics are _mm_prefetch, _mm_stream, _mm_load and _mm_sfence. For details, refer to the Intel C++ Compiler User's Guide documentation.
The following examples of using prefetching instructions in the operation of a video encoder and decoder, as well as in a simple 8-byte memory copy, illustrate the performance gain from using prefetching instructions for efficient cache management.

7.6.2.1 Video Encoder

In a video encoder, some of the data used during the encoding process is kept in the processor's second-level cache. This is done to minimize the number of reference streams that must be re-read from system memory. To ensure that other writes do not disturb the data in the second-level cache, streaming stores (MOVNTQ) are used to write around all processor caches.
The prefetching cache management implemented for the video encoder reduces the memory traffic. The second-level cache pollution reduction is ensured by preventing single-use video frame data from entering the second-level cache. Using a non-temporal PREFETCH (PREFETCHNTA) instruction brings data into only one way of the second-level cache, thus reducing pollution of the second-level cache.
If the data brought directly to second-level cache is not re-used, then there is a performance gain from the non-temporal prefetch over a temporal prefetch. The encoder uses non-temporal prefetches to avoid pollution of the second-level cache, increasing the number of second-level cache hits and decreasing the number of polluting write-backs to memory. The performance gain results from the more efficient use of the second-level cache, not only from the prefetch itself.

7.6.2.2 Video Decoder

In the video decoder example, completed frame data is written to local memory of the graphics card, which is mapped to WC (Write-combining) memory type. A copy of reference data is stored to the WB memory at a later time by the processor in order to generate future data. The assumption is that the size of the reference data is too large to fit in the processor's caches. A streaming store is used to write the data around the cache, to avoid displacing other temporal data held in the caches. Later, the processor re-reads the data using PREFETCHNTA, which ensures maximum bandwidth, yet minimizes disturbance of other cached temporal data by using the non-temporal (NTA) version of prefetch.

7.6.2.3 Conclusions from Video Encoder and Decoder Implementation

These two examples indicate that by using an appropriate combination of non-temporal prefetches and non-temporal stores, an application can be designed to lessen the overhead of memory transactions by preventing second-level cache pollution, keeping useful data in the second-level cache and reducing costly write-back transactions. Even if an application does not gain performance significantly from having data ready from prefetches, it can benefit from more efficient use of the second-level cache and memory. Such a design reduces the encoder's demand for critical resources such as the memory bus. This makes the system more balanced, resulting in higher performance.

7.6.2.4 Optimizing Memory Copy Routines

Creating memory copy routines for large amounts of data is a common task in software optimization. Example 7-10 presents a basic algorithm for a simple memory copy.

Example 7-10. Basic Algorithm of a Simple Memory Copy
#define N 512000
double a[N], b[N];
for (i = 0; i < N; i++) {
    b[i] = a[i];
}

This task can be optimized using various coding techniques. One technique uses software prefetch and streaming store instructions. It is discussed in the following paragraphs, and a code example is shown in Example 7-11.
The memory copy algorithm can be optimized using the Streaming SIMD Extensions with these considerations:

• Alignment of data.
• Proper layout of pages in memory.
• Cache size.
• Interaction of the translation lookaside buffer (TLB) with memory accesses.
• Combining prefetch and streaming-store instructions.

The guidelines discussed in this chapter come into play in this simple example. TLB priming is required
for the Pentium 4 processor just as it is for the Pentium III processor, since software prefetch instructions
will not initiate page table walks on either processor.


Example 7-11. A Memory Copy Routine Using Software Prefetch
#define PAGESIZE 4096
#define NUMPERPAGE 512
double a[N], b[N], temp;
for (kk = 0; kk < N; kk += NUMPERPAGE) {
...
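
A minimal compilable sketch of the same approach, software prefetch ahead of the copy loop plus streaming stores (assuming 64-byte lines, 16-byte-aligned arrays, and a length that is a multiple of 8 doubles; the names are illustrative, and this is not the manual's full routine, which also primes the TLB page by page), is:

#include <emmintrin.h>   /* SSE2 */
#include <stddef.h>

/* Copy with PREFETCHNTA one line ahead and MOVNTPD streaming stores,
   so the destination does not pollute the caches. */
void copy_nt(double *dst, const double *src, size_t n)
{
    for (size_t i = 0; i < n; i += 8) {               /* 8 doubles = 64 bytes */
        _mm_prefetch((const char *)&src[i + 8], _MM_HINT_NTA);
        __m128d x0 = _mm_load_pd(&src[i]);
        __m128d x1 = _mm_load_pd(&src[i + 2]);
        __m128d x2 = _mm_load_pd(&src[i + 4]);
        __m128d x3 = _mm_load_pd(&src[i + 6]);
        _mm_stream_pd(&dst[i],     x0);
        _mm_stream_pd(&dst[i + 2], x1);
        _mm_stream_pd(&dst[i + 4], x2);
        _mm_stream_pd(&dst[i + 6], x3);
    }
    _mm_sfence();   /* order the streaming stores before later stores */
}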

7.6.3 Deterministic Cache Parameters

NOTE: CPUID leaves > 3 and < 80000000H are visible only when IA32_MISC_ENABLE.BOOT_NT4 (bit 22) is clear (the default).

The deterministic cache parameter leaf provides a means to implement software with a degree of forward compatibility with respect to enumerating cache parameters. Deterministic cache parameters can be used in several situations, including:

• Determine the size of a cache level.
• Adapt cache blocking parameters to different sharing topologies of a cache level across Hyper-Threading Technology, multicore and single-core processors.
• Determine multithreading resource topology in an MP system (see Chapter 8, "Multiple-Processor Management," of the Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3A).
• Determine cache hierarchy topology in a platform using multicore processors (see the topology enumeration white paper and reference code listed at the end of Chapter 1).
• Manage threads and processor affinities.
• Determine prefetch stride.

The size of a given level of cache is given by:

(# of Ways) * (Partitions) * (Line_size) * (Sets) = (EBX[31:22] + 1) * (EBX[21:12] + 1) * (EBX[11:0] + 1) * (ECX + 1)
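
A sketch of applying this formula with a GCC/Clang-style CPUID helper (an assumption about the toolchain; the function name is illustrative):

#include <stdint.h>
#include <cpuid.h>   /* __cpuid_count (GCC/Clang) */

/* Query deterministic cache parameters (CPUID leaf 4, subleaf 'index')
   and apply the size formula above. Returns 0 when the subleaf reports
   no cache (cache type field EAX[4:0] == 0). */
uint64_t cache_size_bytes(unsigned index)
{
    unsigned eax, ebx, ecx, edx;
    __cpuid_count(4, index, eax, ebx, ecx, edx);
    if ((eax & 0x1F) == 0)
        return 0;
    uint64_t ways       = ((ebx >> 22) & 0x3FF) + 1;  /* EBX[31:22] + 1 */
    uint64_t partitions = ((ebx >> 12) & 0x3FF) + 1;  /* EBX[21:12] + 1 */
    uint64_t line_size  =  (ebx & 0xFFF) + 1;         /* EBX[11:0] + 1  */
    uint64_t sets       =  (uint64_t)ecx + 1;         /* ECX + 1        */
    return ways * partitions * line_size * sets;
}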

7.6.3.1 Cache Sharing Using Deterministic Cache Parameters

Improving cache locality is an important part of software optimization. For example, a cache blocking algorithm can be designed to optimize block size at runtime for single-processor implementations and a variety of multiprocessor execution environments (including processors supporting HT Technology, or multicore processors).
The basic technique is to place an upper limit on the block size, making it less than the size of the target cache level divided by the number of logical processors serviced by the target level of cache. This technique is applicable to multithreaded application programming. The technique can also benefit single-threaded applications that are part of a multi-tasking workload.
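
As a minimal illustration of this rule (names are illustrative; the cache size and sharing count would come from the deterministic cache parameter leaf, as sketched above):

#include <stdint.h>

/* Upper limit for a cache blocking size: the target cache's capacity
   divided by the number of logical processors that share it. */
uint64_t max_block_size(uint64_t cache_bytes, unsigned sharing_threads)
{
    return cache_bytes / (sharing_threads ? sharing_threads : 1);
}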

7.6.3.2 Cache Sharing in Single-Core or Multicore

Deterministic cache parameters are useful for managing a shared cache hierarchy in multithreaded applications in more sophisticated situations. A given cache level may be shared by the logical processors in a processor core, or it may be implemented to be shared by the logical processors in a physical processor package. Using the deterministic cache parameter leaf and the initial APIC_ID associated with each logical processor in the platform, software can extract information on the number and the topological relationship of logical processors sharing a cache level.

7.6.3.3 Determine Prefetch Stride

The prefetch stride (see the description of CPUID.01H.EBX) provides the length of the region that the processor will prefetch with the PREFETCHh instructions (PREFETCHT0, PREFETCHT1, PREFETCHT2 and PREFETCHNTA). Software will use the length as the stride when prefetching into a particular level of the cache hierarchy as identified by the instruction used. The prefetch size is relevant for cache types of Data Cache (1) and Unified Cache (3); it should be ignored for other cache types. Software should not assume that the coherency line size is the prefetch stride.
If the prefetch stride field is zero, then software should assume a default size of 64 bytes as the prefetch stride. Software should use the following algorithm to determine what prefetch size to use, depending on whether the deterministic cache parameter mechanism or the legacy mechanism is supported:

• If a processor supports the deterministic cache parameters and provides a non-zero prefetch size, then that prefetch size is used.
• If a processor supports the deterministic cache parameters and does not provide a prefetch size, then the default size for each level of the cache hierarchy is 64 bytes.
• If a processor does not support the deterministic cache parameters but provides a legacy prefetch size descriptor (0xF0 - 64 byte, 0xF1 - 128 byte), then the descriptor value will be the prefetch size for all levels of the cache hierarchy.
• If a processor does not support the deterministic cache parameters and does not provide a legacy prefetch size descriptor, then 32 bytes is the default size for all levels of the cache hierarchy.
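
The four cases above reduce to a small decision function; the following sketch (illustrative names, with the inputs gathered beforehand from CPUID) captures it:

/* Choose the prefetch stride per the algorithm above. 'det_stride' is
   the deterministic-parameters prefetch size field (0 if the field is
   zero); 'legacy_stride' is 64 or 128 bytes when a 0xF0/0xF1 descriptor
   is present. */
unsigned prefetch_stride(int has_det_params, unsigned det_stride,
                         int has_legacy_desc, unsigned legacy_stride)
{
    if (has_det_params)
        return det_stride ? det_stride : 64;  /* zero field -> 64-byte default */
    if (has_legacy_desc)
        return legacy_stride;                 /* 0xF0 -> 64, 0xF1 -> 128 */
    return 32;                                /* no information -> 32-byte default */
}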


CHAPTER 8
MULTICORE AND HYPER-THREADING TECHNOLOGY
This chapter describes software optimization techniques for multithreaded applications running in an environment using either multiprocessor (MP) systems or processors with hardware-based multithreading support. Multiprocessor systems are systems with two or more sockets, each mated with a physical processor package. Intel 64 and IA-32 processors that provide hardware multithreading support include dual-core processors, quad-core processors and processors supporting HT Technology1.
Computational throughput in a multithreading environment can increase as more hardware resources are added to take advantage of thread-level or task-level parallelism. Hardware resources can be added in the form of more than one physical processor, processor core per package, and/or logical processor per core. Therefore, there are some aspects of multithreading optimization that apply across MP, multicore, and HT Technology. There are also some specific microarchitectural resources that may be implemented differently in different hardware multithreading configurations (for example: execution resources are not shared across different cores but are shared by two logical processors in the same core if HT Technology is enabled). This chapter covers guidelines that apply to these situations.
This chapter covers:

• Performance characteristics and usage models.
• Programming models for multithreaded applications.
• Software optimization techniques in five specific areas.

8.1 PERFORMANCE AND USAGE MODELS

The performance gains of using multiple processors, multicore processors or HT Technology are greatly affected by the usage model and the amount of parallelism in the control flow of the workload. Two common usage models are:

• Multithreaded applications.
• Multitasking using single-threaded applications.

8.1.1 Multithreading

When an application employs multithreading to exploit task-level parallelism in a workload, the control flow of the multi-threaded software can be divided into two parts: parallel tasks and sequential tasks.
Amdahl's law describes an application's performance gain as it relates to the degree of parallelism in the control flow. It is a useful guide for selecting the code modules, functions, or instruction sequences that are most likely to realize the most gains from transforming sequential tasks and control flows into parallel code to take advantage of multithreading hardware support.
Figure 8-1 illustrates how performance gains can be realized for any workload according to Amdahl's law. The bar in Figure 8-1 represents an individual task unit or the collective workload of an entire application.

1. The presence of hardware multithreading support in Intel 64 and IA-32 processors can be detected by checking the feature flag CPUID.01H:EDX[28]. A return value of 1 in bit 28 indicates that at least one form of hardware multithreading is present in the physical processor package. The number of logical processors present in each package can also be obtained from CPUID. The application must check how many logical processors are enabled and made available to the application at runtime by making the appropriate operating system calls. See the Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 2A for information.


In general, the speed-up of running multiple threads on an MP system with N physical processors, over single-threaded execution, can be expressed as:

    RelativeResponse = Tparallel / Tsequential = (1 - P) + P/N + O

where P is the fraction of the workload that can be parallelized, and O represents the overhead of multithreading, which may vary between different operating systems. In this case, performance gain is the inverse of the relative response.

[Figure 8-1 diagram: a single-thread bar of total length Tsequential splits into a serial part (1-P) and a parallel part (P); on an MP system, the parallel part runs as P/2 on each of two processors, plus a multithreading overhead, for a total of Tparallel.]
Figure 8-1. Amdahl’s Law and MP Speed-up
When optimizing application performance in a multithreaded environment, control flow parallelism is
likely to have the largest impact on performance scaling with respect to the number of physical processors and to the number of logical processors per physical processor.
If the control flow of a multi-threaded application contains a workload in which only 50% can be executed
in parallel, the maximum performance gain using two physical processors is only 33%, compared to using
a single processor. Using four processors can deliver no more than a 60% speed-up over a single
processor. Thus, it is critical to maximize the portion of control flow that can take advantage of parallelism.
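
As a worked check of these figures (an instance of the formula above with the overhead term O = 0):

\[
(1 - 0.5) + \frac{0.5}{2} = 0.75, \qquad \frac{1}{0.75} \approx 1.33 \;(+33\%);
\qquad
(1 - 0.5) + \frac{0.5}{4} = 0.625, \qquad \frac{1}{0.625} = 1.60 \;(+60\%).
\]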
Improper implementation of thread synchronization can significantly increase the proportion of serial
control flow and further reduce the application’s performance scaling.
In addition to maximizing the parallelism of control flows, interaction between threads in the form of
thread synchronization and imbalance of task scheduling can also impact overall processor scaling significantly.
Excessive cache misses are one cause of poor performance scaling. In a multithreaded execution environment, they can occur from:

•
•
•

Aliased stack accesses by different threads in the same process.
Thread contentions resulting in cache line evictions.
False-sharing of cache lines between different processors.

Techniques that address each of these situations (and many other areas) are described in sections in this
chapter.

8.1.2

Multitasking Environment

Hardware multithreading capabilities in Intel 64 and IA-32 processors can exploit task-level parallelism
when a workload consists of several single-threaded applications and these applications are scheduled to
run concurrently under an MP-aware operating system. In this environment, hardware multithreading
capabilities can deliver higher throughput for the workload, although the relative performance of a single
8-2

MULTICORE AND HYPER-THREADING TECHNOLOGY

task (in terms of time of completion relative to the same task when in a single-threaded environment)
will vary, depending on how much shared execution resources and memory are utilized.
For development purposes, several popular operating systems (for example Microsoft Windows* XP
Professional and Home, Linux* distributions using kernel 2.4.19 or later2) include OS kernel code that
can manage the task scheduling and the balancing of shared execution resources within each physical
processor to maximize the throughput.
Because applications run independently under a multitasking environment, thread synchronization
issues are less likely to limit the scaling of throughput. This is because the control flow of the workload is
likely to be 100% parallel3 (if no inter-processor communication is taking place and if there are no
system bus constraints).
With a multitasking workload, however, bus activities and cache access patterns are likely to affect the
scaling of the throughput. Running two copies of the same application or same suite of applications in a
lock-step can expose an artifact in performance measuring methodology. This is because an access
pattern to the first level data cache can lead to excessive cache misses and produce skewed performance
results. Fix this problem by:

•
•

Including a per-instance offset at the start-up of an application.

•

Randomizing the sequence of start-up of applications when running multiple copies of the same suite.

Introducing heterogeneity in the workload by using different datasets with each instance of the application.

When two applications are employed as part of a multitasking workload, there is little synchronization
overhead between these two processes. It is also important to ensure each application has minimal
synchronization overhead within itself.
An application that uses lengthy spin loops for intra-process synchronization is less likely to benefit from
HT Technology in a multitasking workload. This is because critical resources will be consumed by the long
spin loops.

8.2

PROGRAMMING MODELS AND MULTITHREADING

Parallelism is the most important concept in designing a multithreaded application and realizing optimal
performance scaling with multiple processors. An optimized multithreaded application is characterized by
large degrees of parallelism or minimal dependencies in the following areas:

•
•
•

Workload.
Thread interaction.
Hardware utilization.

The key to maximizing workload parallelism is to identify multiple tasks that have minimal inter-dependencies within an application and to create separate threads for parallel execution of those tasks.
Concurrent execution of independent threads is the essence of deploying a multithreaded application on
a multiprocessing system. Managing the interaction between threads to minimize the cost of thread
synchronization is also critical to achieving optimal performance scaling with multiple processors.
Efficient use of hardware resources between concurrent threads requires optimization techniques in
specific areas to prevent contentions of hardware resources. Coding techniques for optimizing thread
synchronization and managing other hardware resources are discussed in subsequent sections.
Parallel programming models are discussed next.

2. This code is included in Red Hat* Linux Enterprise AS 2.1.
3. A software tool that attempts to measure the throughput of a multitasking workload is likely to introduce control flows
that are not parallel. Thread synchronization issues must be considered as an integral part of its performance measuring
methodology.
8-3

MULTICORE AND HYPER-THREADING TECHNOLOGY

8.2.1

Parallel Programming Models

Two common programming models for transforming independent task requirements into application
threads are:

•
•

Domain decomposition.
Functional decomposition.

8.2.1.1

Domain Decomposition

Usually large compute-intensive tasks use data sets that can be divided into a number of small subsets,
each having a large degree of computational independence. Examples include:

•

Computation of a discrete cosine transformation (DCT) on two-dimensional data by dividing the twodimensional data into several subsets and creating threads to compute the transform on each subset.

•

Matrix multiplication; here, threads can be created to handle the multiplication of half of matrix with
the multiplier matrix.

Domain Decomposition is a programming model based on creating identical or similar threads to process
smaller pieces of data independently. This model can take advantage of duplicated execution resources
present in a traditional multiprocessor system. It can also take advantage of shared execution resources
between two logical processors in HT Technology. This is because a data domain thread typically
consumes only a fraction of the available on-chip execution resources.
Section 8.3.4, “Key Practices of Execution Resource Optimization,” discusses additional guidelines that
can help data domain threads use shared execution resources cooperatively and avoid the pitfalls
creating contentions of hardware resources between two threads.

8.2.2

Functional Decomposition

Applications usually process a wide variety of tasks with diverse functions and many unrelated data sets.
For example, a video codec needs several different processing functions. These include DCT, motion estimation and color conversion. Using a functional threading model, applications can program separate
threads to do motion estimation, color conversion, and other functional tasks.
Functional decomposition will achieve more flexible thread-level parallelism if it is less dependent on the
duplication of hardware resources. For example, a thread executing a sorting algorithm and a thread
executing a matrix multiplication routine are not likely to require the same execution unit at the same
time. A design recognizing this could advantage of traditional multiprocessor systems as well as multiprocessor systems using processors supporting HT Technology.

8.2.3

Specialized Programming Models

Intel Core Duo processor and processors based on Intel Core microarchitecture offer a second-level
cache shared by two processor cores in the same physical package. This provides opportunities for two
application threads to access some application data while minimizing the overhead of bus traffic.
Multi-threaded applications may need to employ specialized programming models to take advantage of
this type of hardware feature. One such scenario is referred to as producer-consumer. In this scenario,
one thread writes data into some destination (hopefully in the second-level cache) and another thread
executing on the other core in the same physical package subsequently reads data produced by the first
thread.
The basic approach for implementing a producer-consumer model is to create two threads; one thread is
the producer and the other is the consumer. Typically, the producer and consumer take turns to work on
a buffer and inform each other when they are ready to exchange buffers. In a producer-consumer model,
there is some thread synchronization overhead when buffers are exchanged between the producer and
consumer. To achieve optimal scaling with the number of cores, the synchronization overhead must be
kept low. This can be done by ensuring the producer and consumer threads have comparable time
constants for completing each incremental task prior to exchanging buffers.
8-4

MULTICORE AND HYPER-THREADING TECHNOLOGY

Example 8-1 illustrates the coding structure of single-threaded execution of a sequence of task units,
where each task unit (either the producer or consumer) executes serially (shown in Figure 8-2). In the
equivalent scenario under multi-threaded execution, each producer-consumer pair is wrapped as a
thread function and two threads can be scheduled on available processor resources simultaneously.
Example 8-1. Serial Execution of Producer and Consumer Work Items
for (i = 0; i < number_of_iterations; i++) {
producer (i, buff); // pass buffer index and buffer address
consumer (i, buff);
}(

Main
Thread

P(1)

C(1)

P(1)

C(1)

P(1)

Figure 8-2. Single-threaded Execution of Producer-consumer Threading Model

8.2.3.1

Producer-Consumer Threading Models

Figure 8-3 illustrates the basic scheme of interaction between a pair of producer and consumer threads.
The horizontal direction represents time. Each block represents a task unit, processing the buffer
assigned to a thread.
The gap between each task represents synchronization overhead. The decimal number in the parenthesis
represents a buffer index. On an Intel Core Duo processor, the producer thread can store data in the
second-level cache to allow the consumer thread to continue work requiring minimal bus traffic.

Main
Thread

P: producer
C: consumer

P(1)

P(2)

P(1)

P(2)

P(1)

C(1)

C(2)

C(1)

C(2)

Figure 8-3. Execution of Producer-consumer Threading Model
on a Multicore Processor
The basic structure to implement the producer and consumer thread functions with synchronization to
communicate buffer index is shown in Example 8-2.

8-5

MULTICORE AND HYPER-THREADING TECHNOLOGY

Example 8-2. Basic Structure of Implementing Producer Consumer Threads
(a) Basic structure of a producer thread function
void producer_thread()
{
int iter_num = workamount - 1; // make local copy
int mode1 = 1; // track usage of two buffers via 0 and 1
produce(buffs[0],count); // placeholder function
while (iter_num--) {
Signal(&signal1,1); // tell the other thread to commence
produce(buffs[mode1],count); // placeholder function
WaitForSignal(&end1);
mode1 = 1 - mode1; // switch to the other buffer
}
}
b) Basic structure of a consumer thread
void consumer_thread()
{
int mode2 = 0; // first iteration start with buffer 0, than alternate
int iter_num = workamount - 1;
while (iter_num--) {
WaitForSignal(&signal1);
consume(buffs[mode2],count); // placeholder function
Signal(&end1,1);
mode2 = 1 - mode2;
}
consume(buffs[mode2],count);
}
It is possible to structure the producer-consumer model in an interlaced manner such that it can minimize bus traffic and be effective on multicore processors without shared second-level cache.
In this interlaced variation of the producer-consumer model, each scheduling quanta of an application
thread comprises of a producer task and a consumer task. Two identical threads are created to execute
in parallel. During each scheduling quanta of a thread, the producer task starts first and the consumer
task follows after the completion of the producer task; both tasks work on the same buffer. As each task
completes, one thread signals to the other thread notifying its corresponding task to use its designated
buffer. Thus, the producer and consumer tasks execute in parallel in two threads. As long as the data
generated by the producer reside in either the first or second level cache of the same core, the consumer
can access them without incurring bus traffic. The scheduling of the interlaced producer-consumer model
is shown in Figure 8-4.

Thread 0

Thread 1

P(1)

C(1)

P(1)

C(1)

P(1)

P(2)

C(2)

P(2)

C(2)

Figure 8-4. Interlaced Variation of the Producer Consumer Model

8-6

MULTICORE AND HYPER-THREADING TECHNOLOGY

Example 8-3 shows the basic structure of a thread function that can be used in this interlaced producerconsumer model.

Example 8-3. Thread Function for an Interlaced Producer Consumer Model
// master thread starts first iteration, other thread must wait
// one iteration
void producer_consumer_thread(int master)
{
int mode = 1 - master; // track which thread and its designated
// buffer index
unsigned int iter_num = workamount >> 1;
unsigned int i=0;
iter_num += master & workamount & 1;
if (master) // master thread starts the first iteration
{
produce(buffs[mode],count);
Signal(sigp[1-mode1],1); // notify producer task in follower
// thread that it can proceed
consume(buffs[mode],count);
Signal(sigc[1-mode],1);
i = 1;
}

for (; i < iter_num; i++)
{
WaitForSignal(sigp[mode]);
produce(buffs[mode],count); // notify the producer task in
// other thread
Signal(sigp[1-mode],1);
WaitForSignal(sigc[mode]);
consume(buffs[mode],count);
Signal(sigc[1-mode],1);
}
}

8.2.4

Tools for Creating Multithreaded Applications

Programming directly to a multithreading application programming interface (API) is not the only method
for creating multithreaded applications. New tools (such as the Intel compiler) have become available
with capabilities that make the challenge of creating multithreaded application easier.
Features available in the latest Intel compilers are:

•
•

Generating multithreaded code using OpenMP* directives4.
Generating multithreaded code automatically from unmodified high-level code5.

4. Intel Compiler 5.0 and later supports OpenMP directives. Visit http://developer.intel.com/software/products for
details.
8-7

MULTICORE AND HYPER-THREADING TECHNOLOGY

8.2.4.1

Programming with OpenMP Directives

OpenMP provides a standardized, non-proprietary, portable set of Fortran and C++ compiler directives
supporting shared memory parallelism in applications. OpenMP supports directive-based processing.
This uses special preprocessors or modified compilers to interpret parallelism expressed in Fortran
comments or C/C++ pragmas. Benefits of directive-based processing include:

•
•

The original source can be compiled unmodified.

•

Incremental code changes help programmers maintain serial consistency. When the code is run on
one processor, it gives the same result as the unmodified source code.

•
•

Offering directives to fine tune thread scheduling imbalance.

It is possible to make incremental code changes. This preserves algorithms in the original code and
enables rapid debugging.

Intel’s implementation of OpenMP runtime can add minimal threading overhead relative to handcoded multithreading.

8.2.4.2

Automatic Parallelization of Code

While OpenMP directives allow programmers to quickly transform serial applications into parallel applications, programmers must identify specific portions of the application code that contain parallelism and
add compiler directives. Intel Compiler 6.0 supports a new (-QPARALLEL) option, which can identify loop
structures that contain parallelism. During program compilation, the compiler automatically attempts to
decompose the parallelism into threads for parallel processing. No other intervention or programmer is
needed.

8.2.4.3

Supporting Development Tools

See Appendix A, “Application Performance Tools” for information on the various tools that Intel provides
for software development.

8.3

OPTIMIZATION GUIDELINES

This section summarizes optimization guidelines for tuning multithreaded applications. Five areas are
listed (in order of importance):

•
•
•
•
•

Thread synchronization.
Bus utilization.
Memory optimization.
Front end optimization.
Execution resource optimization.

Practices associated with each area are listed in this section. Guidelines for each area are discussed in
greater depth in sections that follow.
Most of the coding recommendations improve performance scaling with processor cores; and scaling
due to HT Technology. Techniques that apply to only one environment are noted.

8.3.1

Key Practices of Thread Synchronization

Key practices for minimizing the cost of thread synchronization are summarized below:

•

Insert the PAUSE instruction in fast spin loops and keep the number of loop repetitions to a minimum
to improve overall system performance.

5. Intel Compiler 6.0 supports auto-parallelization.
8-8

MULTICORE AND HYPER-THREADING TECHNOLOGY

•

Replace a spin-lock that may be acquired by multiple threads with pipelined locks such that no more
than two threads have write accesses to one lock. If only one thread needs to write to a variable
shared by two threads, there is no need to acquire a lock.

•
•
•

Use a thread-blocking API in a long idle loop to free up the processor.
Prevent “false-sharing” of per-thread-data between two threads.
Place each synchronization variable alone, separated by 128 bytes or in a separate cache line.

See Section 8.4, “Thread Synchronization,” for details.

8.3.2

Key Practices of System Bus Optimization

Managing bus traffic can significantly impact the overall performance of multithreaded software and MP
systems. Key practices of system bus optimization for achieving high data throughput and quick
response are:

•
•

Improve data and code locality to conserve bus command bandwidth.

•

Consider using overlapping multiple back-to-back memory reads to improve effective cache miss
latencies.

•

Use full write transactions to achieve higher data throughput.

Avoid excessive use of software prefetch instructions and allow the automatic hardware prefetcher to
work. Excessive use of software prefetches can significantly and unnecessarily increase bus
utilization if used inappropriately.

See Section 8.5, “System Bus Optimization,” for details.

8.3.3

Key Practices of Memory Optimization

Key practices for optimizing memory operations are summarized below:

•

Use cache blocking to improve locality of data access. Target one quarter to one half of cache size
when targeting processors supporting HT Technology.

•

Minimize the sharing of data between threads that execute on different physical processors sharing a
common bus.

•
•

Minimize data access patterns that are offset by multiples of 64-KBytes in each thread.

•

Add a per-instance stack offset when two instances of the same application are executing in lock
steps to avoid memory accesses that are offset by multiples of 64 KByte or 1 MByte when targeting
processors supporting HT Technology.

Adjust the private stack of each thread in an application so the spacing between these stacks is not
offset by multiples of 64 KBytes or 1 MByte (prevents unnecessary cache line evictions) when
targeting processors supporting HT Technology.

See Section 8.6, “Memory Optimization,” for details.

8.3.4

Key Practices of Execution Resource Optimization

Each physical processor has dedicated execution resources. Logical processors in physical processors
supporting HT Technology share specific on-chip execution resources. Key practices for execution
resource optimization include:

•
•

Optimize each thread to achieve optimal frequency scaling first.

•

Use on-chip execution resources cooperatively if two threads are sharing the execution resources in
the same physical processor package.

Optimize multithreaded applications to achieve optimal scaling with respect to the number of
physical processors.

8-9

MULTICORE AND HYPER-THREADING TECHNOLOGY

•

For each processor supporting HT Technology, consider adding functionally uncorrelated threads to
increase the hardware resource utilization of each physical processor package.

See Section 8.8, “Affinities and Managing Shared Platform Resources,” for details.

8.3.5

Generality and Performance Impact

The next five sections cover the optimization techniques in detail. Recommendations discussed in each
section are ranked by importance in terms of estimated local impact and generality.
Rankings are subjective and approximate. They can vary depending on coding style, application and
threading domain. The purpose of including high, medium and low impact ranking with each recommendation is to provide a relative indicator as to the degree of performance gain that can be expected when
a recommendation is implemented.
It is not possible to predict the likelihood of a code instance across many applications, so an impact
ranking cannot be directly correlated to application-level performance gain. The ranking on generality is
also subjective and approximate.
Coding recommendations that do not impact all three scaling factors are typically categorized as medium
or lower.

8.4

THREAD SYNCHRONIZATION

Applications with multiple threads use synchronization techniques in order to ensure correct operation.
However, thread synchronization that are improperly implemented can significantly reduce performance.
The best practice to reduce the overhead of thread synchronization is to start by reducing the application’s requirements for synchronization. Intel Thread Profiler can be used to profile the execution timeline
of each thread and detect situations where performance is impacted by frequent occurrences of synchronization overhead.
Several coding techniques and operating system (OS) calls are frequently used for thread synchronization. These include spin-wait loops, spin-locks, critical sections, to name a few. Choosing the optimal OS
call for the circumstance and implementing synchronization code with parallelism in mind are critical in
minimizing the cost of handling thread synchronization.
SSE3 provides two instructions (MONITOR/MWAIT) to help multithreaded software improve synchronization between multiple agents. In the first implementation of MONITOR and MWAIT, these instructions are
available to operating system so that operating system can optimize thread synchronization in different
areas. For example, an operating system can use MONITOR and MWAIT in its system idle loop (known as
C0 loop) to reduce power consumption. An operating system can also use MONITOR and MWAIT to implement its C1 loop to improve the responsiveness of the C1 loop. See Chapter 8 in the Intel® 64 and IA-32
Architectures Software Developer’s Manual, Volume 3A.

8.4.1

Choice of Synchronization Primitives

Thread synchronization often involves modifying some shared data while protecting the operation using
synchronization primitives. There are many primitives to choose from. Guidelines that are useful when
selecting synchronization primitives are:

•

Favor compiler intrinsics or an OS provided interlocked API for atomic updates of simple data
operation, such as increment and compare/exchange. This will be more efficient than other more
complicated synchronization primitives with higher overhead.
For more information on using different synchronization primitives, see the white paper Developing
Multi-threaded Applications: A Platform Consistent Approach. See
http://www3.intel.com/cd/ids/developer/asmo-na/eng/53797.htm.

•

When choosing between different primitives to implement a synchronization construct, using Intel
Thread Checker and Thread Profiler can be very useful in dealing with multithreading functional

8-10

MULTICORE AND HYPER-THREADING TECHNOLOGY

correctness issue and performance impact under multi-threaded execution. Additional information
on the capabilities of Intel Thread Checker and Thread Profiler are described in Appendix A.
Table 8-1 is useful for comparing the properties of three categories of synchronization objects available
to multi-threaded applications.
Table 8-1. Properties of Synchronization Objects
Operating System
Synchronization Objects

Light Weight User
Synchronization

Synchronization Object
based on MONITOR/MWAIT

Cycles to acquire and
release (if there is a
contention)

Thousands or Tens of thousands
cycles

Hundreds of cycles

Hundreds of cycles

Power consumption

Saves power by halting the core or
logical processor if idle

Some power saving if using
PAUSE

Saves more power than
PAUSE

Scheduling and
context switching

Returns to the OS scheduler if
contention exists (can be tuned
with earlier spin loop count)

Does not return to OS
scheduler voluntarily

Does not return to OS
scheduler voluntarily

Ring level

Ring 0

Ring 3

Ring 0

Miscellaneous

Some objects provide intra-process
synchronization and some are for
inter-process communication

Must lock accesses to
synchronization variable if
several threads may write
to it simultaneously.

Same as light weight.

Characteristics

Otherwise can write
without locks.
Recommended use
conditions

8.4.2

• Number of active threads is
greater than number of cores
• Waiting thousands of cycles for a
signal
• Synchronization among processes

• Number of active threads
is less than or equal to
number of cores
• Infrequent contention
• Need inter process
synchronization

Can be used only on
systems supporting
MONITOR/MWAIT

• Same as light weight
objects
• MONITOR/MWAIT available

Synchronization for Short Periods

The frequency and duration that a thread needs to synchronize with other threads depends application
characteristics. When a synchronization loop needs very fast response, applications may use a spin-wait
loop.
A spin-wait loop is typically used when one thread needs to wait a short amount of time for another
thread to reach a point of synchronization. A spin-wait loop consists of a loop that compares a synchronization variable with some pre-defined value. See Example 8-4(a).
On a modern microprocessor with a superscalar speculative execution engine, a loop like this results in
the issue of multiple simultaneous read requests from the spinning thread. These requests usually
execute out-of-order with each read request being allocated a buffer resource. On detection of a write by
a worker thread to a load that is in progress, the processor must guarantee no violations of memory
order occur. The necessity of maintaining the order of outstanding memory operations inevitably costs
the processor a severe penalty that impacts all threads.
This penalty occurs on the Pentium M processor, the Intel Core Solo and Intel Core Duo processors.
However, the penalty on these processors is small compared with penalties suffered on the Pentium 4
and Intel Xeon processors. There the performance penalty for exiting the loop is about 25 times more
severe.
On a processor supporting HT Technology, spin-wait loops can consume a significant portion of the
execution bandwidth of the processor. One logical processor executing a spin-wait loop can severely
impact the performance of the other logical processor.

8-11

MULTICORE AND HYPER-THREADING TECHNOLOGY

Example 8-4. Spin-wait Loop and PAUSE Instructions
(a) An un-optimized spin-wait loop experiences performance penalty when exiting the loop. It consumes execution
resources without contributing computational work.
do {
// This loop can run faster than the speed of memory access,
// other worker threads cannot finish modifying sync_var until
// outstanding loads from the spinning loops are resolved.
} while( sync_var != constant_value);
(b) Inserting the PAUSE instruction in a fast spin-wait loop prevents performance-penalty to the spinning thread and the
worker thread
do {
_asm pause
// Ensure this loop is de-pipelined, i.e. preventing more than one
// load request to sync_var to be outstanding,
// avoiding performance penalty when the worker thread updates
// sync_var and the spinning thread exiting the loop.
}
while( sync_var != constant_value);
(c) A spin-wait loop using a “test, test-and-set” technique to determine the availability of the synchronization variable.
This technique is recommended when writing spin-wait loops to run on Intel 64 and IA-32 architecture processors.
Spin_Lock:
CMP lockvar, 0 ;
JE Get_lock
PAUSE;
JMP Spin_Lock;
Get_Lock:
MOV EAX, 1;
XCHG EAX, lockvar;
CMP EAX, 0;
JNE Spin_Lock;
Critical_Section:

MOV lockvar, 0;

// Check if lock is free.
// Short delay.

// Try to get lock.
// Test if successful.

// Release lock.

User/Source Coding Rule 18. (M impact, H generality) Insert the PAUSE instruction in fast spin
loops and keep the number of loop repetitions to a minimum to improve overall system performance.
On processors that use the Intel NetBurst microarchitecture core, the penalty of exiting from a spin-wait
loop can be avoided by inserting a PAUSE instruction in the loop. In spite of the name, the PAUSE
instruction improves performance by introducing a slight delay in the loop and effectively causing the
memory read requests to be issued at a rate that allows immediate detection of any store to the synchronization variable. This prevents the occurrence of a long delay due to memory order violation.
One example of inserting the PAUSE instruction in a simplified spin-wait loop is shown in Example 8-4(b).
The PAUSE instruction is compatible with all Intel 64 and IA-32 processors. On IA-32 processors prior to
Intel NetBurst microarchitecture, the PAUSE instruction is essentially a NOP instruction. Additional
examples of optimizing spin-wait loops using the PAUSE instruction are available in Application note AP949, “Using Spin-Loops on Intel Pentium 4 Processor and Intel Xeon Processor.” See
http://www3.intel.com/cd/ids/developer/asmo-na/eng/dc/threading/knowledgebase/19083.htm.
Inserting the PAUSE instruction has the added benefit of significantly reducing the power consumed
during the spin-wait because fewer system resources are used.

8-12

MULTICORE AND HYPER-THREADING TECHNOLOGY

8.4.3

Optimization with Spin-Locks

Spin-locks are typically used when several threads needs to modify a synchronization variable and the
synchronization variable must be protected by a lock to prevent un-intentional overwrites. When the lock
is released, however, several threads may compete to acquire it at once. Such thread contention significantly reduces performance scaling with respect to frequency, number of discrete processors, and HT
Technology.
To reduce the performance penalty, one approach is to reduce the likelihood of many threads competing
to acquire the same lock. Apply a software pipelining technique to handle data that must be shared
between multiple threads.
Instead of allowing multiple threads to compete for a given lock, no more than two threads should have
write access to a given lock. If an application must use spin-locks, include the PAUSE instruction in the
wait loop. Example 8-4(c) shows an example of the “test, test-and-set” technique for determining the
availability of the lock in a spin-wait loop.
User/Source Coding Rule 19. (M impact, L generality) Replace a spin lock that may be acquired
by multiple threads with pipelined locks such that no more than two threads have write accesses to one
lock. If only one thread needs to write to a variable shared by two threads, there is no need to use a
lock.

8.4.4

Synchronization for Longer Periods

When using a spin-wait loop not expected to be released quickly, an application should follow these
guidelines:

•
•

Keep the duration of the spin-wait loop to a minimum number of repetitions.
Applications should use an OS service to block the waiting thread; this can release the processor so
that other runnable threads can make use of the processor or available execution resources.

On processors supporting HT Technology, operating systems should use the HLT instruction if one logical
processor is active and the other is not. HLT will allow an idle logical processor to transition to a halted
state; this allows the active logical processor to use all the hardware resources in the physical package.
An operating system that does not use this technique must still execute instructions on the idle logical
processor that repeatedly check for work. This “idle loop” consumes execution resources that could
otherwise be used to make progress on the other active logical processor.
If an application thread must remain idle for a long time, the application should use a thread blocking API
or other method to release the idle processor. The techniques discussed here apply to traditional MP
system, but they have an even higher impact on processors that support HT Technology.
Typically, an operating system provides timing services, for example Sleep(dwMilliseconds)6; such variables can be used to prevent frequent checking of a synchronization variable.
Another technique to synchronize between worker threads and a control loop is to use a thread-blocking
API provided by the OS. Using a thread-blocking API allows the control thread to use less processor
cycles for spinning and waiting. This gives the OS more time quanta to schedule the worker threads on
available processors. Furthermore, using a thread-blocking API also benefits from the system idle loop
optimization that OS implements using the HLT instruction.
User/Source Coding Rule 20. (H impact, M generality) Use a thread-blocking API in a long idle
loop to free up the processor.
Using a spin-wait loop in a traditional MP system may be less of an issue when the number of runnable
threads is less than the number of processors in the system. If the number of threads in an application is
expected to be greater than the number of processors (either one processor or multiple processors), use
a thread-blocking API to free up processor resources. A multithreaded application adopting one control
thread to synchronize multiple worker threads may consider limiting worker threads to the number of
processors in a system and use thread-blocking APIs in the control thread.
6. The Sleep() API is not thread-blocking, because it does not guarantee the processor will be released. Example 8-5(a)
shows an example of using Sleep(0), which does not always realize the processor to another thread.
8-13

MULTICORE AND HYPER-THREADING TECHNOLOGY

8.4.4.1

Avoid Coding Pitfalls in Thread Synchronization

Synchronization between multiple threads must be designed and implemented with care to achieve good
performance scaling with respect to the number of discrete processors and the number of logical
processor per physical processor. No single technique is a universal solution for every synchronization
situation.
The pseudo-code example in Example 8-5(a) illustrates a polling loop implementation of a control
thread. If there is only one runnable worker thread, an attempt to call a timing service API, such as
Sleep(0), may be ineffective in minimizing the cost of thread synchronization. Because the control thread
still behaves like a fast spinning loop, the only runnable worker thread must share execution resources
with the spin-wait loop if both are running on the same physical processor that supports HT Technology.
If there are more than one runnable worker threads, then calling a thread blocking API, such as Sleep(0),
could still release the processor running the spin-wait loop, allowing the processor to be used by another
worker thread instead of the spinning loop.
A control thread waiting for the completion of worker threads can usually implement thread synchronization using a thread-blocking API or a timing service, if the worker threads require significant time to
complete. Example 8-5(b) shows an example that reduces the overhead of the control thread in its
thread synchronization.

Example 8-5. Coding Pitfall using Spin Wait Loop
(a) A spin-wait loop attempts to release the processor incorrectly. It experiences a performance penalty if the only
worker thread and the control thread runs on the same physical processor package.
// Only one worker thread is running,
// the control loop waits for the worker thread to complete.
ResumeWorkThread(thread_handle);
While (!task_not_done ) {
Sleep(0) // Returns immediately back to spin loop.
…
}
(b) A polling loop frees up the processor correctly.
// Let a worker thread run and wait for completion.
ResumeWorkThread(thread_handle);
While (!task_not_done ) {
Sleep(FIVE_MILISEC)
// This processor is released for some duration, the processor
// can be used by other threads.
…
}
In general, OS function calls should be used with care when synchronizing threads. When using OSsupported thread synchronization objects (critical section, mutex, or semaphore), preference should be
given to the OS service that has the least synchronization overhead, such as a critical section.

8.4.5

Prevent Sharing of Modified Data and False-Sharing

Depending on the cache topology relative to processor/core topology and the specific underlying microarchitecture, sharing of modified data can incur some degree of performance penalty when a software
thread running on one core tries to read or write data that is currently present in modified state in the
local cache of another core. This will cause eviction of the modified cache line back into memory and
reading it into the first-level cache of the other core. The latency of such cache line transfer is much
higher than using data in the immediate first level cache or second level cache.
8-14

MULTICORE AND HYPER-THREADING TECHNOLOGY

False sharing applies to data used by one thread that happens to reside on the same cache line as
different data used by another thread. These situations can also incur a performance delay depending on
the topology of the logical processors/cores in the platform.
False sharing can experience a performance penalty when the threads are running on logical processors
reside on different physical processors or processor cores. For processors that support HT Technology,
false-sharing incurs a performance penalty when two threads run on different cores, different physical
processors, or on two logical processors in the physical processor package. In the first two cases, the
performance penalty is due to cache evictions to maintain cache coherency. In the latter case, performance penalty is due to memory order machine clear conditions.
A generic approach for multi-threaded software to prevent incurring false-sharing penalty is to allocate
separate critical data or locks with alignment granularity according to a “false-sharing threshold” size.
The following steps will allow software to determine the “false-sharing threshold” across Intel processors:
1. If the processor supports CLFLUSH instruction, i.e. CPUID.01H:EDX.CLFLUSH[bit 19] =1:
Use the CLFLUSH line size, i.e. the integer value of CPUID.01H:EBX[15:8], as the “false-sharing
threshold”.
2. If CLFLUSH line size is not available, use CPUID leaf 4 as described below:
Determine the “false-sharing threshold” by evaluating the largest system coherency line size among
valid cache types that are reported via the sub-leaves of CPUID leaf 4. For each sub-leaf n, its
associated system coherency line size is (CPUID.(EAX=4, ECX=n):EBX[11:0] + 1).
3. If neither CLFLUSH line size is available, nor CPUID leaf 4 is available, then software may choose the
“false-sharing threshold” from one of the following:
a. Query the descriptor tables of CPUID leaf 2 and choose from available descriptor entries.
b. A Family/Model-specific mechanism available in the platform or a Family/Model-specific known
value.
c.

Default to a safe value 64 bytes.

User/Source Coding Rule 21. (H impact, M generality) Beware of false sharing within a cache line
or within a sector. Allocate critical data or locks separately using alignment granularity not smaller than
the “false-sharing threshold”.
When a common block of parameters is passed from a parent thread to several worker threads, it is
desirable for each work thread to create a private copy (each copy aligned to multiples of the “falsesharing threshold”) of frequently accessed data in the parameter block.

8.4.6

Placement of Shared Synchronization Variable

On processors based on Intel NetBurst microarchitecture, bus reads typically fetch 128 bytes into a
cache, the optimal spacing to minimize eviction of cached data is 128 bytes. To prevent false-sharing,
synchronization variables and system objects (such as a critical section) should be allocated to reside
alone in a 128-byte region and aligned to a 128-byte boundary.
Example 8-6 shows a way to minimize the bus traffic required to maintain cache coherency in MP
systems. This technique is also applicable to MP systems using processors with or without HT Technology.
Example 8-6. Placement of Synchronization and Regular Variables
int
int
int
int

regVar;
padding[32];
SynVar[32*NUM_SYNC_VARS];
AnotherVar;

8-15

MULTICORE AND HYPER-THREADING TECHNOLOGY

On Pentium M, Intel Core Solo, Intel Core Duo processors, and processors based on Intel Core microarchitecture; a synchronization variable should be placed alone and in separate cache line to avoid falsesharing. Software must not allow a synchronization variable to span across page boundary.
User/Source Coding Rule 22. (M impact, ML generality) Place each synchronization variable
alone, separated by 128 bytes or in a separate cache line.
User/Source Coding Rule 23. (H impact, L generality) Do not place any spin lock variable to span
a cache line boundary.
At the code level, false sharing is a special concern in the following cases:

•

Global data variables and static data variables that are placed in the same cache line and are written
by different threads.

•

Objects allocated dynamically by different threads may share cache lines. Make sure that the
variables used locally by one thread are allocated in a manner to prevent sharing the cache line with
other threads.

Another technique to enforce alignment of synchronization variables and to avoid a cacheline being
shared is to use compiler directives when declaring data structures. See Example 8-7.
Example 8-7. Declaring Synchronization Variables without Sharing a Cache Line
__declspec(align(64)) unsigned __int64 sum;
struct sync_struct {…};
__declspec(align(64)) struct sync_struct sync_var;

Other techniques that prevent false-sharing include:

•

Organize variables of different types in data structures (because the layout that compilers give to
data variables might be different than their placement in the source code).

•

When each thread needs to use its own copy of a set of variables, declare the variables with:
— Directive threadprivate, when using OpenMP.
— Modifier __declspec (thread), when using Microsoft compiler.

•

In managed environments that provide automatic object allocation, the object allocators and
garbage collectors are responsible for layout of the objects in memory so that false sharing through
two objects does not happen.

•

Provide classes such that only one thread writes to each object field and close object fields, in order
to avoid false sharing.

One should not equate the recommendations discussed in this section as favoring a sparsely populated
data layout. The data-layout recommendations should be adopted when necessary and avoid unnecessary bloat in the size of the work set.

8.4.7

Pause Latency in Skylake Microarchitecture

The PAUSE instruction is typically used with software threads executing on two logical processors located
in the same processor core, waiting for a lock to be released. Such short wait loops tend to last between
tens and a few hundreds of cycles, so performance-wise it is more beneficial to wait while occupying the
CPU than yielding to the OS. When the wait loop is expected to last for thousands of cycles or more, it is
preferable to yield to the operating system by calling one of the OS synchronization API functions, such
as WaitForSingleObject on Windows* OS.
The PAUSE instruction is intended to:

•

Temporarily provide the sibling logical processor (ready to make forward progress exiting the spin
loop) with competitively shared hardware resources. The competitively-shared microarchitectural
resources that the sibling logical processor can utilize in the Skylake microarchitecture are:
— More front end slots in the Decode ICache, LSD and IDQ.

8-16

MULTICORE AND HYPER-THREADING TECHNOLOGY

— More execution slots in the RS.

•

Save power consumed by the processor core compared to executing equivalent spin loop instruction
sequence in the following configurations:
— One logical processor is inactive (e.g. entering a C-state).
— Both logical processors in the same core execute the PAUSE instruction.
— HT is disabled (e.g. using BIOS options).

The latency of PAUSE instruction in prior generation microarchitecture is about 10 cycles, whereas on
Skylake microarchitecture it has been extended to as many as 140 cycles.
The increased latency (allowing more effective utilization of competitively-shared microarchitectural
resources to the logical processor ready to make forward progress) has a small positive performance
impact of 1-2% on highly threaded applications. It is expected to have negligible impact on less threaded
applications if forward progress is not blocked on executing a fixed number of looped PAUSE instructions.
There's also a small power benefit in 2-core and 4-core systems.
As the PAUSE latency has been increased significantly, workloads that are sensitive to PAUSE latency will
suffer some performance loss.

8.5

SYSTEM BUS OPTIMIZATION

The system bus services requests from bus agents (e.g. logical processors) to fetch data or code from
the memory sub-system. The performance impact due data traffic fetched from memory depends on the
characteristics of the workload, and the degree of software optimization on memory access, locality
enhancements implemented in the software code. A number of techniques to characterize memory traffic
of a workload is discussed in Appendix A. Optimization guidelines on locality enhancement is also
discussed in Section 3.6.11, “Locality Enhancement,” and Section 7.5.11, “Hardware Prefetching and
Cache Blocking Techniques.”
The techniques described in Chapter 3 and Chapter 7 benefit application performance in a platform
where the bus system is servicing a single-threaded environment. In a multi-threaded environment, the
bus system typically services many more logical processors, each of which can issue bus requests independently. Thus, techniques on locality enhancements, conserving bus bandwidth, reducing large-stridecache-miss-delay can have strong impact on processor scaling performance.

8.5.1

Conserve Bus Bandwidth

In a multithreading environment, bus bandwidth may be shared by memory traffic originated from
multiple bus agents (These agents can be several logical processors and/or several processor cores).
Preserving the bus bandwidth can improve processor scaling performance. Also, effective bus bandwidth
typically will decrease if there are significant large-stride cache-misses. Reducing the amount of largestride cache misses (or reducing DTLB misses) will alleviate the problem of bandwidth reduction due to
large-stride cache misses.
One way for conserving available bus command bandwidth is to improve the locality of code and data.
Improving the locality of data reduces the number of cache line evictions and requests to fetch data. This
technique also reduces the number of instruction fetches from system memory.
User/Source Coding Rule 24. (M impact, H generality) Improve data and code locality to
conserve bus command bandwidth.
Using a compiler that supports profiler-guided optimization can improve code locality by keeping
frequently used code paths in the cache. This reduces instruction fetches. Loop blocking can also improve
the data locality. Other locality enhancement techniques can also be applied in a multithreading environment to conserve bus bandwidth (see Section 7.5, “Memory Optimization Using Prefetch”).
Because the system bus is shared between many bus agents (logical processors or processor cores),
software tuning should recognize symptoms of the bus approaching saturation. One useful technique is
to examine the queue depth of bus read traffic. When the bus queue depth is high, locality enhancement

8-17

MULTICORE AND HYPER-THREADING TECHNOLOGY

to improve cache utilization will benefit performance more than other techniques, such as inserting more
software prefetches or masking memory latency with overlapping bus reads. An approximate working
guideline for software to operate below bus saturation is to check if bus read queue depth is significantly
below 5.
Some MP and workstation platforms may have a chipset that provides two system buses, with each bus
servicing one or more physical processors. The guidelines for conserving bus bandwidth described above
also applies to each bus domain.

8.5.2

Understand the Bus and Cache Interactions

Be careful when parallelizing code sections with data sets that results in the total working set exceeding
the second-level cache and /or consumed bandwidth exceeding the capacity of the bus. On an Intel Core
Duo processor, if only one thread is using the second-level cache and / or bus, then it is expected to get
the maximum benefit of the cache and bus systems because the other core does not interfere with the
progress of the first thread. However, if two threads use the second-level cache concurrently, there may
be performance degradation if one of the following conditions is true:

•
•
•

Their combined working set is greater than the second-level cache size.
Their combined bus usage is greater than the capacity of the bus.
They both have extensive access to the same set in the second-level cache, and at least one of the
threads writes to this cache line.

To avoid these pitfalls, multithreading software should try to investigate parallelism schemes in which
only one of the threads access the second-level cache at a time, or where the second-level cache and the
bus usage does not exceed their limits.

8.5.3

Avoid Excessive Software Prefetches

Pentium 4 and Intel Xeon Processors have an automatic hardware prefetcher. It can bring data and
instructions into the unified second-level cache based on prior reference patterns. In most situations, the
hardware prefetcher is likely to reduce system memory latency without explicit intervention from software prefetches. It is also preferable to adjust data access patterns in the code to take advantage of the
characteristics of the automatic hardware prefetcher to improve locality or mask memory latency.
Processors based on Intel Core microarchitecture also provides several advanced hardware prefetching
mechanisms. Data access patterns that can take advantage of earlier generations of hardware prefetch
mechanism generally can take advantage of more recent hardware prefetch implementations.
Using software prefetch instructions excessively or indiscriminately will inevitably cause performance
penalties. This is because excessively or indiscriminately using software prefetch instructions wastes the
command and data bandwidth of the system bus.
Using software prefetches delays the hardware prefetcher from starting to fetch data needed by the
processor core. It also consumes critical execution resources and can result in stalled execution. In some
cases, it may be fruitful to evaluate the reduction or removal of software prefetches to migrate towards
more effective use of hardware prefetch mechanisms. The guidelines for using software prefetch instructions are described in Chapter 3. The techniques for using automatic hardware prefetcher is discussed in
Chapter 7.
User/Source Coding Rule 25. (M impact, L generality) Avoid excessive use of software prefetch
instructions and allow automatic hardware prefetcher to work. Excessive use of software prefetches can
significantly and unnecessarily increase bus utilization if used inappropriately.

8.5.4

Improve Effective Latency of Cache Misses

System memory access latency due to cache misses is affected by bus traffic. This is because bus read
requests must be arbitrated along with other requests for bus transactions. Reducing the number of
outstanding bus transactions helps improve effective memory access latency.

8-18

MULTICORE AND HYPER-THREADING TECHNOLOGY

One technique to improve effective latency of memory read transactions is to use multiple overlapping
bus reads to reduce the latency of sparse reads. In situations where there is little locality of data or when
memory reads need to be arbitrated with other bus transactions, the effective latency of scattered
memory reads can be improved by issuing multiple memory reads back-to-back to overlap multiple
outstanding memory read transactions. The average latency of back-to-back bus reads is likely to be
lower than the average latency of scattered reads interspersed with other bus transactions. This is
because only the first memory read needs to wait for the full delay of a cache miss.
User/Source Coding Rule 26. (M impact, M generality) Consider using overlapping multiple backto-back memory reads to improve effective cache miss latencies.
Another technique to reduce effective memory latency is possible if one can adjust the data access
pattern such that the access strides causing successive cache misses in the last-level cache is predominantly less than the trigger threshold distance of the automatic hardware prefetcher. See Section 7.5.3,
“Example of Effective Latency Reduction with Hardware Prefetch.”
User/Source Coding Rule 27. (M impact, M generality) Consider adjusting the sequencing of
memory references such that the distribution of distances of successive cache misses of the last level
cache peaks towards 64 bytes.

8.5.5

Use Full Write Transactions to Achieve Higher Data Rate

Write transactions across the bus can result in write to physical memory either using the full line size of
64 bytes or less than the full line size. The latter is referred to as a partial write. Typically, writes to writeback (WB) memory addresses are full-size and writes to write-combine (WC) or uncacheable (UC) type
memory addresses result in partial writes. Both cached WB store operations and WC store operations
utilize a set of six WC buffers (64 bytes wide) to manage the traffic of write transactions. When
competing traffic closes a WC buffer before all writes to the buffer are finished, this results in a series of
8-byte partial bus transactions rather than a single 64-byte write transaction.
User/Source Coding Rule 28. (M impact, M generality) Use full write transactions to achieve
higher data throughput.
Frequently, multiple partial writes to WC memory can be combined into full-sized writes using a software
write-combining technique to separate WC store operations from competing with WB store traffic. To
implement software write-combining, uncacheable writes to memory with the WC attribute are written to
a small, temporary buffer (WB type) that fits in the first level data cache. When the temporary buffer is
full, the application copies the content of the temporary buffer to the final WC destination.
When partial-writes are transacted on the bus, the effective data rate to system memory is reduced to
only 1/8 of the system bus bandwidth.

8.6

MEMORY OPTIMIZATION

Efficient operation of caches is a critical aspect of memory optimization. Efficient operation of caches
needs to address the following:

•
•
•
•

Cache blocking.
Shared memory optimization.
Eliminating 64-KByte aliased data accesses.
Preventing excessive evictions in first-level cache.

8.6.1

Cache Blocking Technique

Loop blocking is useful for reducing cache misses and improving memory access performance. The selection of a suitable block size is critical when applying the loop blocking technique. Loop blocking is applicable to single-threaded applications as well as to multithreaded applications running on processors with
or without HT Technology. The technique transforms the memory access pattern into blocks that efficiently fit in the target cache size.
8-19

MULTICORE AND HYPER-THREADING TECHNOLOGY

When targeting Intel processors supporting HT Technology, the loop blocking technique for a unified
cache can select a block size that is no more than one half of the target cache size if there are two logical
processors sharing that cache. The upper limit of the block size for loop blocking should be determined
by dividing the target cache size by the number of logical processors available in a physical processor
package. Typically, some cache lines are needed to access data that are not part of the source or
destination buffers used in cache blocking, so the block size can be chosen between one quarter and one half of
the target cache (see Chapter 3, “General Optimization Guidelines”).
Software can use the deterministic cache parameter leaf of CPUID to discover which subset of logical
processors are sharing a given cache (see Chapter 7, “Optimizing Cache Usage”). Therefore, the guideline
above can be extended to allow all the logical processors serviced by a given cache to use the cache
simultaneously, by placing an upper limit on the block size equal to the total size of the cache divided by the
number of logical processors serviced by that cache. This technique can also be applied to
single-threaded applications that will be used as part of a multitasking workload.
User/Source Coding Rule 29. (H impact, H generality) Use cache blocking to improve locality of
data access. Target one quarter to one half of the cache size when targeting Intel processors
supporting HT Technology, or target a block size that allows all the logical processors serviced by a cache
to share that cache simultaneously.
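A minimal C sketch of the blocking arithmetic in Rule 29 follows. The cache size and the number of logical processors sharing the cache are placeholder constants in this sketch; real values should come from the CPUID deterministic cache parameter leaf. The two passes over each block illustrate the data reuse that blocking makes cache-resident.

#include <stddef.h>

#define CACHE_BYTES (512 * 1024)   /* placeholder: query CPUID leaf 4 in practice */
#define SHARERS     2              /* placeholder: logical processors sharing the cache */
#define BLOCK       (CACHE_BYTES / SHARERS / sizeof(float))

void two_pass_blocked(float *data, size_t n)
{
    for (size_t base = 0; base < n; base += BLOCK) {
        size_t end = (base + BLOCK < n) ? base + BLOCK : n;
        for (size_t i = base; i < end; i++)   /* pass 1 brings the block into cache */
            data[i] = data[i] * 2.0f + 1.0f;
        for (size_t i = base; i < end; i++)   /* pass 2 reuses the cache-resident block */
            data[i] = data[i] * data[i];
    }
}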

8.6.2 Shared-Memory Optimization

Maintaining cache coherency between discrete processors frequently involves moving data across a bus
that operates at a clock rate substantially slower than the processor frequency.

8.6.2.1 Minimize Sharing of Data between Physical Processors

When two threads are executing on two physical processors and sharing data, reading from or writing to
shared data usually involves several bus transactions (including snooping, request for ownership
changes, and sometimes fetching data across the bus). A thread accessing a large amount of shared
memory is likely to have poor processor-scaling performance.
User/Source Coding Rule 30. (H impact, M generality) Minimize the sharing of data between
threads that execute on different bus agents sharing a common bus. On a platform
consisting of multiple bus domains, software should also minimize data sharing across bus domains.
One technique to minimize sharing of data is to copy data to local stack variables if it is to be accessed
repeatedly over an extended period. If necessary, results from multiple threads can be combined later by
writing them back to a shared memory location. This approach can also minimize time spent synchronizing
access to shared data. A sketch of this technique follows.
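A minimal sketch using POSIX threads: each thread accumulates into a thread-private variable and touches the shared location exactly once at the end. The names shared_sum and shared_lock and the fixed element count are illustrative placeholders, not part of the manual's text.

#include <pthread.h>

double shared_sum = 0.0;                      /* shared result, written once per thread */
pthread_mutex_t shared_lock = PTHREAD_MUTEX_INITIALIZER;

void *worker(void *arg)
{
    const double *data = (const double *)arg; /* caller supplies at least 100000 elements */
    double local = 0.0;                       /* thread-private; generates no coherency traffic */
    for (int i = 0; i < 100000; i++)
        local += data[i];
    pthread_mutex_lock(&shared_lock);         /* touch the shared cache line exactly once */
    shared_sum += local;
    pthread_mutex_unlock(&shared_lock);
    return NULL;
}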

8.6.2.2 Batched Producer-Consumer Model

The key benefit of a threaded producer-consumer design, shown in Figure 8-5, is to minimize bus traffic
while sharing data between the producer and the consumer using a shared second-level cache. On an
Intel Core Duo processor, when the work buffers are small enough to fit within the first-level cache,
re-ordering of producer and consumer tasks is necessary to achieve optimal performance. This is
because fetching data from L2 to L1 is much faster than having a cache line in one core invalidated and
fetched from the bus.
Figure 8-5 illustrates a batched producer-consumer model that can be used to overcome the drawback of
using small work buffers in a standard producer-consumer model. In a batched producer-consumer
model, each scheduling quantum batches two or more producer tasks, each producer working on a designated
buffer. The number of tasks to batch is determined by the criterion that the total working set be
greater than the first-level cache but smaller than the second-level cache.


(Figure: a main thread dispatches batched producer tasks P(1) through P(6) interleaved ahead of consumer tasks C(1) through C(4). P: producer, C: consumer.)

Figure 8-5. Batched Approach of Producer Consumer Model
Example 8-8 shows the batched implementation of the producer and consumer thread functions.
Example 8-8. Batched Implementation of the Producer Consumer Threads
void producer_thread()
{   int iter_num = workamount - batchsize;
    int mode1;
    for (mode1 = 0; mode1 < batchsize; mode1++)
    {   produce(buffs[mode1], count);   // placeholder function
    }
    while (iter_num--)
    {   Signal(&signal1, 1);
        produce(buffs[mode1], count);   // placeholder function
        WaitForSignal(&end1);
        mode1++;
        if (mode1 > batchsize)
            mode1 = 0;
    }
}

void consumer_thread()
{   int mode2 = 0;
    int iter_num = workamount - batchsize;
    while (iter_num--)
    {   WaitForSignal(&signal1);
        consume(buffs[mode2], count);   // placeholder function
        Signal(&end1, 1);
        mode2++;
        if (mode2 > batchsize)
            mode2 = 0;
    }
    for (i = 0; i < batchsize; i++)     // drain the remaining batched buffers
    {   consume(buffs[mode2], count);
        mode2++;
        if (mode2 > batchsize)
            mode2 = 0;
    }
}


8.6.3 Eliminate 64-KByte Aliased Data Accesses

The 64-KByte aliasing condition is discussed in Chapter 3. Memory accesses that satisfy the 64-KByte
aliasing condition can cause excessive evictions of the first-level data cache. Eliminating 64-KByte
aliased data accesses originating from each thread helps improve frequency scaling in general. Furthermore,
it enables the first-level data cache to perform efficiently when HT Technology is fully utilized by
software applications.
User/Source Coding Rule 31. (H impact, H generality) Minimize data access patterns that are
offset by multiples of 64 KBytes in each thread.
The presence of 64-KByte aliased data access can be detected using Pentium 4 processor performance
monitoring events. Appendix B includes an updated list of Pentium 4 processor performance metrics.
These metrics are based on events accessed using the Intel VTune Performance Analyzer.
Performance penalties associated with 64-KByte aliasing are applicable mainly to current processor
implementations of HT Technology or Intel NetBurst microarchitecture. The next section discusses
memory optimization techniques that are applicable to multithreaded applications running on processors
supporting HT Technology.

8.7 FRONT END OPTIMIZATION

For dual-core processors where the second-level unified cache is shared by two processor cores (Intel
Core Duo processor and processors based on Intel Core microarchitecture), multi-threaded software
should consider the increase in code working set due to two threads fetching code from the unified cache
as part of front end and cache optimization. For quad-core processors based on Intel Core
microarchitecture, the considerations that apply to Intel Core 2 Duo processors also apply to quad-core processors.

8.7.1 Avoid Excessive Loop Unrolling

Unrolling loops can reduce the number of branches and improve the branch predictability of application
code. Loop unrolling is discussed in detail in Chapter 3. Loop unrolling must be used judiciously. Be sure
to consider the benefit of improved branch predictability and the cost of under-utilization of the loop
stream detector (LSD).
User/Source Coding Rule 32. (M impact, L generality) Avoid excessive loop unrolling to ensure
the LSD is operating efficiently.

8.8 AFFINITIES AND MANAGING SHARED PLATFORM RESOURCES

Modern OSes provide APIs and/or data constructs (e.g. affinity masks) that allow applications to
manage certain shared resources, e.g. logical processors and Non-Uniform Memory Access (NUMA) memory
sub-systems.
Before multithreaded software considers using affinity APIs, it should consider the recommendations in
Table 8-2.


Table 8-2. Design-Time Resource Management Choices

Runtime Environment:
  A single-threaded application.
Thread Scheduling/Processor Affinity Consideration:
  Support OS scheduler objectives on system response and throughput by letting the OS scheduler manage scheduling. The OS provides facilities for the end user to optimize the runtime-specific environment.
Memory Affinity Consideration:
  Not relevant; let the OS do its job.

Runtime Environment:
  A multi-threaded application requiring: i) less than all processor resources in the system, ii) sharing system resources with other concurrent applications, iii) other concurrent applications may have higher priority.
Thread Scheduling/Processor Affinity Consideration:
  Rely on the OS default scheduler policy. Hard-coded affinity-binding will likely harm system response and throughput, and/or in some cases hurt application performance.
Memory Affinity Consideration:
  Rely on the OS default scheduler policy. Use APIs that provide transparent NUMA benefit without managing NUMA explicitly.

Runtime Environment:
  A multi-threaded application requiring: i) foreground and higher priority, ii) using less than all processor resources in the system, iii) sharing system resources with other concurrent applications, iv) but other concurrent applications have lower priority.
Thread Scheduling/Processor Affinity Consideration:
  If an application-customized thread binding policy is considered, a cooperative approach with the OS scheduler should be taken instead of a hard-coded thread affinity binding policy. For example, the use of SetThreadIdealProcessor() can provide a floating base to anchor a next-free-core binding policy for a locality-optimized application binding policy, and cooperate with the default OS policy.
Memory Affinity Consideration:
  Use APIs that provide transparent NUMA benefit without managing NUMA explicitly. Use performance events to diagnose non-local memory access issues if the default OS policy causes performance issues.

Runtime Environment:
  A multi-threaded application that runs in the foreground, requires all processor resources in the system, and does not share system resources with concurrent applications; MPI-based multi-threading.
Thread Scheduling/Processor Affinity Consideration:
  An application-customized thread binding policy can be more efficient than the default OS policy. Use performance events to help optimize locality and cache transfer opportunities. A multi-threaded application that employs its own explicit thread affinity-binding policy should deploy with some form of opt-in choice granted by the end-user or administrator, e.g. permission to deploy the explicit thread affinity-binding policy can be activated after permission is granted after installation.
Memory Affinity Consideration:
  An application-customized memory affinity binding policy can be more efficient than the default OS policy. Use performance events to diagnose non-local memory access issues related to either the OS or the custom policy.


8.8.1 Topology Enumeration of Shared Resources

Whether multithreaded software rides on the OS scheduling policy or needs to use affinity APIs for customized
resource management, understanding the topology of the shared platform resources is essential. The
processor topology of logical processors (SMT), processor cores, and physical processors in the platform
can be enumerated using information provided by CPUID. This is discussed in Chapter 8, “Multiple-Processor
Management,” of Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A.
A white paper and reference code are also available from Intel.
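The following sketch, assuming a GCC/Clang environment where <cpuid.h> provides __get_cpuid_count and a processor supporting CPUID leaf 0BH (extended topology enumeration), walks the sub-leaves to report the SMT and core level widths. It is an illustration, not Intel's reference code.

#include <cpuid.h>
#include <stdio.h>

void print_topology_levels(void)
{
    for (unsigned int sub = 0; ; sub++) {
        unsigned int eax, ebx, ecx, edx;
        if (!__get_cpuid_count(0x0B, sub, &eax, &ebx, &ecx, &edx))
            break;                                    /* leaf 0BH not supported */
        unsigned int level_type = (ecx >> 8) & 0xFF;  /* 1 = SMT, 2 = core */
        if (level_type == 0)
            break;                                    /* no more topology levels */
        printf("level %u: type %u, x2APIC shift %u, logical processors %u\n",
               sub, level_type, eax & 0x1F, ebx & 0xFFFF);
    }
}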

8.8.2 Non-Uniform Memory Access

Platforms using two or more Intel Xeon processors based on Intel microarchitecture code name Nehalem
support a non-uniform memory access (NUMA) topology because each physical processor provides its own
local memory controller. NUMA offers system memory bandwidth that can scale with the number of
physical processors. System memory latency will exhibit asymmetric behavior depending on whether the memory
transaction occurs locally in the same socket or remotely from another socket. Additionally, OS-specific
constructs and/or implementation behavior may present additional complexity at the API level,
so multi-threaded software may need to pay attention to memory allocation/initialization in a
NUMA environment.
Generally, latency-sensitive workloads favor keeping memory traffic local rather than remote. If multiple
threads share a buffer, the programmer will need to pay attention to the OS-specific behavior of memory
allocation/initialization on a NUMA system.
Bandwidth-sensitive workloads will find it convenient to employ a data-composition threading model that
aggregates application threads executing in each socket to favor local traffic on a per-socket basis,
achieving overall bandwidth that scales with the number of physical processors.
The OS construct that provides the programming interface to manage local/remote NUMA traffic is
referred to as memory affinity. Because the OS manages the mapping between physical addresses (populated
by system RAM) and linear addresses (accessed by application software), and paging allows a physical page
to be dynamically re-assigned to map to a different linear address, proper use of memory affinity
requires a great deal of OS-specific knowledge.
To simplify application programming, an OS may implement certain APIs and physical/linear address
mappings to take advantage of NUMA characteristics transparently in certain situations. One common
technique is for the OS to delay commit of a physical memory page assignment until the first memory reference
to that physical page is made in the linear address space by an application thread. This means
that the allocation of a memory buffer in the linear address space by an application thread does not
necessarily determine which socket will service local memory traffic when the memory allocation API
returns to the program. However, the memory allocation APIs that support this level of NUMA
transparency vary across different OSes. For example, the portable C-language API “malloc” provides some
degree of transparency on Linux*, whereas the API “VirtualAlloc” behaves similarly on Windows*.
Different OSes may also provide memory allocation APIs that require explicit NUMA information, such
that the mapping between linear addresses and local/remote memory traffic is fixed at allocation.
Example 8-9 shows how a multi-threaded application can take advantage of NUMA hardware capability
with the least amount of effort dealing with OS-specific APIs. This parallel
approach to memory buffer initialization is conducive to having each worker thread keep memory traffic
local on NUMA systems.

Example 8-9. Parallel Memory Initialization Technique Using OpenMP and NUMA
#ifdef _LINUX   // Linux implements malloc to commit physical pages at first touch/access
    buf1 = (char *) malloc(DIM*(sizeof (double))+1024);
    buf2 = (char *) malloc(DIM*(sizeof (double))+1024);
    buf3 = (char *) malloc(DIM*(sizeof (double))+1024);
#endif
#ifdef windows
    // Windows implements malloc to commit physical pages at allocation, so use VirtualAlloc
    buf1 = (char *) VirtualAlloc(NULL, DIM*(sizeof (double))+1024, fAllocType, fProtect);
    buf2 = (char *) VirtualAlloc(NULL, DIM*(sizeof (double))+1024, fAllocType, fProtect);
    buf3 = (char *) VirtualAlloc(NULL, DIM*(sizeof (double))+1024, fAllocType, fProtect);
#endif
    a = (double *) buf1;
    b = (double *) buf2;
    c = (double *) buf3;
#pragma omp parallel
{   // use OpenMP threads to execute each iteration of the loop
    // number of OpenMP threads can be specified by default or via environment variable
    #pragma omp for private(num)
    // each loop iteration is dispatched to execute in different OpenMP threads using private iterator
    for (num = 0; num < DIM; num++)
    {   // each thread touches, and thereby commits locally, its own portion of the buffers
        a[num] = 10.;
        b[num] = 10.;
        c[num] = 10.;
    }
}

CHAPTER 9
64-BIT MODE CODING GUIDELINES

The quotient Q of an unsigned division of Dividend by a constant Divisor can be obtained from a multiply
by a congruent constant followed by a right shift. Writing Dividend = Q*Divisor + R and choosing C2 = 2^N:
Q = (Dividend * C2/Divisor) >> N = ((Q*C2) >> N) + ((R*C2/Divisor) >> N).
If “divisor” is known at compile time, (C2/Divisor) can be pre-computed into a congruent constant Cx =
Ceil(C2/divisor); the quotient can then be computed by an integer multiply followed by a shift:
Q = (Dividend * Cx) >> N;
R = Dividend - ((Dividend * Cx) >> N) * divisor;
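As a concrete illustration of this multiply-shift idiom, the following C sketch divides a 32-bit unsigned dividend by the compile-time constant 10, using N = 35 and Cx = Ceil(2^35/10). The constant/shift pair shown is specific to this divisor and dividend range; it is a sketch of the technique, not the 64-bit sequence of the examples below.

#include <stdint.h>

/* Quotient of dividend/10 for any 32-bit unsigned dividend:
   Cx = Ceil(2^35 / 10) = 0xCCCCCCCD, N = 35. */
static inline uint32_t quotient10(uint32_t dividend)
{
    const uint64_t Cx = (((uint64_t)1 << 35) / 10) + 1;   /* congruent constant */
    return (uint32_t)(((uint64_t)dividend * Cx) >> 35);   /* multiply, then shift */
}

static inline uint32_t remainder10(uint32_t dividend)
{
    return dividend - quotient10(dividend) * 10;          /* R = Dividend - Q*divisor */
}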
The 128-bit IDIV/DIV instructions restrict the range of divisor, quotient, and remainder to be within 64
bits to avoid causing numerical exceptions. This presents a challenge for situations with any of the


three having a value near the upper bound of 64 bits, or for dividend values nearing the upper bound of
128 bits.
This challenge can be overcome by choosing a larger shift count N and extending the (Dividend * Cx)
operation from the 128-bit range to the next computing-efficient range. For example, if (Dividend * Cx) is
greater than 128 bits and N is greater than 63 bits, one can compute bits 191:64 of
the 192-bit result using 128-bit MUL without implementing a full 192-bit multiply.
A convenient way to choose the congruent constant Cx is as follows:

• If the range of the dividend is within 64 bits: Nmin ~ BSR(Divisor) + 63.
• In situations of disparate dynamic range of quotient/remainder relative to the range of the divisor, raise
N accordingly so that quotient/remainder can be computed efficiently.

Consider the computation of quotient/remainder for the divisor 10^16 on unsigned dividends near the
range of 64 bits. Example 9-1 illustrates using the “MUL r64” instruction to handle a 64-bit
dividend with a 64-bit divisor.

Example 9-1. Compute 64-bit Quotient and Remainder with 64-bit Divisor
_Cx10to16:                       ; Congruent constant for 10^16 with shift count ‘N’ = 117
    DD  0c44de15ch               ; floor( (2^117 / 10^16) + 1)
    DD  0e69594beh               ; Optimize length of Cx to reduce # of 128-bit multiplies
_tento16:                        ; 10^16
    DD  6fc10000h
    DD  002386f2h

    mov  r9, qword ptr [rcx]     ; load 64-bit dividend value
    mov  rax, r9
    mov  rsi, _Cx10to16          ; Congruent Constant for 10^16 with shift count 117
    mul  [rsi]                   ; 128-bit multiply
    mov  r10, qword ptr 8[rsi]   ; load divisor 10^16
    shr  rdx, 53                 ; shift right by N - 64 = 53
    mov  r8, rdx                 ; approximate quotient
    mov  rax, r8
    mul  r10                     ; 128-bit multiply
    sub  r9, rax                 ; dividend - quotient*divisor
    jae  remain
    sub  r8, 1                   ; this may be off by one due to round up
    mov  rax, r8
    mul  r10                     ; 128-bit multiply
    sub  r9, rax
remain:
    mov  rdx, r8                 ; quotient
    mov  rax, r9                 ; remainder

Example 9-2 shows a similar technique to handle a 128-bit dividend with a 64-bit divisor.


Example 9-2. Quotient and Remainder of 128-bit Dividend with 64-bit Divisor
    mov  rax, qword ptr [rcx]    ; load bits 63:0 of 128-bit dividend from memory
    mov  rsi, _Cx10to16          ; Congruent Constant for 10^16 with shift count 117
    mov  r9, qword ptr [rsi]     ; load Congruent Constant
    mul  r9                      ; 128-bit multiply
    xor  r11, r11                ; clear accumulator
    mov  rax, qword ptr 8[rcx]   ; load bits 127:64 of 128-bit dividend
    shr  rdx, 53
    mov  r10, rdx                ; initialize bits 127:64 of 192-bit result
    mul  r9                      ; accumulate to bits 191:128
    add  rax, r10
    adc  rdx, r11
    shr  rax, 53
    shl  rdx, 11
    or   rdx, rax
    mov  r8, qword ptr 8[rsi]    ; load Divisor 10^16
    mov  r9, rdx                 ; approximate quotient, may be off by 1
    mov  rax, r8
    mul  r9                      ; will quotient * divisor > dividend?
    sub  rax, qword ptr [rcx]
    sbb  rdx, qword ptr 8[rcx]
    jb   remain
    sub  r9, 1                   ; this may be off by one due to round up
    mov  rax, r8                 ; retrieve Divisor 10^16
    mul  r9                      ; final quotient * divisor
    sub  rax, qword ptr [rcx]
    sbb  rdx, qword ptr 8[rcx]
remain:
    mov  rdx, r9                 ; quotient
    neg  rax                     ; remainder

The techniques illustrated in Example 9-1 and Example 9-2 can increase the speed of remainder/quotient
calculation of 128-bit dividends to at or below the cost of a 32-bit integer division.
Extending the technique above to deal with a divisor greater than 64 bits is relatively straightforward. One
optimization worth considering is to choose a shift count N > 128 bits. This can reduce the number of
128-bit MULs needed to compute the relevant upper bits of (Dividend * Cx).

9.2.5 Sign Extension to Full 64-Bits

When in 64-bit mode, processors based on Intel NetBurst microarchitecture can sign-extend to 64 bits in
a single micro-op. In 64-bit mode, when the destination is 32 bits, the upper 32 bits must be zeroed.
Zeroing the upper 32 bits requires an extra micro-op and is less optimal than sign-extending to 64 bits.
While sign-extending to 64 bits makes the instruction one byte longer, it reduces the number of micro-ops
that the trace cache has to store, improving performance.
For example, to sign-extend a byte into ESI, use:
movsx rsi, BYTE PTR[rax]


instead of:
movsx esi, BYTE PTR[rax]
If the next instruction uses the 32-bit form of the ESI register, the result will be the same. This optimization
can also be used to break an unintended dependency. For example, if a program writes a 16-bit value to
a register and then writes the register with an 8-bit value, if bits 15:8 of the destination are not needed,
use the sign-extended version of writes when available.
For example:
mov r8w, r9w   ; requires a merge to preserve bits 63:16
mov r8b, r10b  ; requires a merge to preserve bits 63:8
Can be replaced with:
movsx r8, r9w  ; if bits 63:16 do not need to be preserved
movsx r8, r10b ; if bits 63:8 do not need to be preserved
In the above example, the moves to R8W and R8B both require a merge to preserve the rest of the bits
in the register. There is an implicit real dependency on R8 between 'MOV R8W, R9W' and 'MOV R8B,
R10B'. Using MOVSX breaks the real dependency and leaves only the output dependency, which the
processor can eliminate through renaming.
For processors based on Intel Core microarchitecture, zeroing the upper 32 bits is faster than sign-extending
to 64 bits. For processors based on Intel microarchitecture code name Nehalem, either zeroing or
sign-extending the upper bits takes a single micro-op.

9.3 ALTERNATE CODING RULES FOR 64-BIT MODE

9.3.1 Use 64-Bit Registers Instead of Two 32-Bit Registers for 64-Bit Arithmetic Result

Legacy 32-bit mode offers the ability to support extended precision integer arithmetic (such as 64-bit
arithmetic). However, 64-bit mode offers native support for 64-bit arithmetic. When 64-bit integers are
desired, use the 64-bit forms of arithmetic instructions.
In 32-bit legacy mode, getting a 64-bit result from a 32-bit by 32-bit integer multiply requires three
registers; the result is stored in 32-bit chunks in the EDX:EAX pair. When the instruction is available in
64-bit mode, using the 32-bit version of the instruction is not the optimal implementation if a 64-bit
result is desired. Use the extended registers.
For example, the following code sequence loads the 32-bit values sign-extended into the 64-bit registers
and performs a multiply:
movsx rax, DWORD PTR[x]
movsx rcx, DWORD PTR[y]
imul rax, rcx
The 64-bit version above is more efficient than using the following 32-bit version:
mov eax, DWORD PTR[x]
mov ecx, DWORD PTR[y]
imul ecx
In the 32-bit case above, EAX is required to be a source. The result ends up in the EDX:EAX pair instead
of in a single 64-bit register.


Assembly/Compiler Coding Rule 68. (ML impact, M generality) Use the 64-bit versions of
multiply for 32-bit integer multiplies that require a 64-bit result.
To add two 64-bit numbers in 32-bit legacy mode, the ADD instruction followed by the ADC instruction is
used. For example, to add two 64-bit variables (X and Y), the following four instructions could be used:
mov eax, DWORD PTR[X]
mov edx, DWORD PTR[X+4]
add eax, DWORD PTR[Y]
adc edx, DWORD PTR[Y+4]
The result will end up in the two-register EDX:EAX.
In 64-bit mode, the above sequence can be reduced to the following:
mov rax, QWORD PTR[X]
add rax, QWORD PTR[Y]
The result is stored in rax. One register is required instead of two.
Assembly/Compiler Coding Rule 69. (ML impact, M generality) Use the 64-bit versions of add for
64-bit adds.

9.3.2 CVTSI2SS and CVTSI2SD

In processors based on Intel Core microarchitecture and later, CVTSI2SS and CVTSI2SD are improved
significantly over those in Intel NetBurst microarchitecture, in terms of latency and throughput. The
improvements apply equally to the 64-bit and 32-bit versions.

9.3.3 Using Software Prefetch

Intel recommends that software developers follow the recommendations in Chapter 3 and Chapter 7
when considering the choice of organizing data access patterns to take advantage of the hardware
prefetcher (versus using software prefetch).
Assembly/Compiler Coding Rule 70. (L impact, L generality) If software prefetch instructions are
necessary, use the prefetch instructions provided by SSE.
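A minimal sketch of Rule 70 in C, using the SSE prefetch intrinsic _mm_prefetch. The prefetch distance is a tuning placeholder of this sketch, not a recommended value; Chapter 7 discusses how to tune it.

#include <stddef.h>
#include <xmmintrin.h>   /* SSE prefetch intrinsic */

#define PREFETCH_AHEAD 256   /* placeholder distance, in bytes */

float sum_with_prefetch(const float *a, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; i++) {
        if ((i & 15) == 0)   /* one prefetch per 64-byte line of floats */
            _mm_prefetch((const char *)&a[i] + PREFETCH_AHEAD, _MM_HINT_T0);
        s += a[i];
    }
    return s;
}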


CHAPTER 10
SSE4.2 AND SIMD PROGRAMMING FOR TEXT PROCESSING/LEXING/PARSING
String/text processing spans a discipline that often employs techniques different from traditional SIMD
integer vector processing. Many traditional string/text algorithms are character based, where
characters may be represented by encodings (or code points) of fixed or variable byte sizes. Textual data
represents a vast amount of raw data and often carries contextual information. The contextual information
embedded in raw textual data often requires algorithmic processing over a wide range of
attributes, such as character values, character positions, character encoding formats, subsetting of
character sets, strings of explicit or implicit lengths, tokens, and delimiters; contextual objects may be
represented by sequential characters within a pre-defined character subset (e.g. decimal-valued strings);
textual streams may contain embedded state transitions separating objects of different contexts (e.g.
tag-delimited fields).
Traditional integer SIMD vector instructions may, in some simpler situations, succeed in speeding up
simple string processing functions. SSE4.2 includes four new instructions that offer advances to
computational algorithms targeting string/text processing, lexing and parsing of either unstructured or
structured textual data.

10.1 SSE4.2 STRING AND TEXT INSTRUCTIONS

SSE4.2 provides four instructions, PCMPESTRI/PCMPESTRM/PCMPISTRI/PCMPISTRM, that can accelerate
string and text processing by combining the efficiency of SIMD programming techniques with the lexical
primitives that are embedded in these four instructions. Simple uses of these instructions include
string length determination, direct string comparison, string case handling, delimiter/token processing,
locating word boundaries, and locating sub-string matches in large text blocks. Sophisticated application of
SSE4.2 can accelerate XML parsing and schema validation.
A processor’s support for SSE4.2 is indicated by the feature flag value returned in ECX[bit 20] after
executing the CPUID instruction with an EAX input value of 1 (i.e. SSE4.2 is supported if
CPUID.01H:ECX.SSE4_2[bit 20] = 1). Therefore, software must verify that CPUID.01H:ECX.SSE4_2[bit 20]
is set before using these four instructions. (Verifying CPUID.01H:ECX.SSE4_2 = 1 is also required before
using PCMPGTQ or CRC32. Verifying CPUID.01H:ECX.POPCNT[bit 23] = 1 is required before using the
POPCNT instruction.) A sketch of such a check appears below.
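A minimal C sketch of this check, assuming a GCC/Clang environment where <cpuid.h> provides the __get_cpuid helper; other compilers offer equivalents (e.g. __cpuid on MSVC).

#include <cpuid.h>
#include <stdbool.h>

static bool has_sse4_2(void)
{
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return false;                     /* CPUID leaf 1 unavailable */
    return (ecx & (1u << 20)) != 0;       /* CPUID.01H:ECX.SSE4_2[bit 20] */
}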
These string/text processing instructions work by performing up to 256 comparison operations on text
fragments. Each text fragment can be 16 bytes. They can handle fragments of different formats: either
byte or word elements. Each of these four instructions can be configured to perform four types of parallel
comparison operation on two text fragments.
The aggregated intermediate result of a parallel comparison of two text fragments becomes a bit
pattern: 16 bits for processing byte elements or 8 bits for word elements. These instructions provide
additional flexibility, using bit fields in the immediate operand of the instruction syntax, to configure a
unary transformation (polarity) on the first intermediate result.
Lastly, the instruction’s immediate operand offers an output selection control to further configure the
flexibility of the final result produced by the instruction. The rich configurability of these instructions is
summarized in Figure 10-1.


PCMPxSTRy XMM1, XMM2/M128, imm

(Figure: control fields of the immediate operand. Imm[1:0] selects the data format: 00b = unsigned bytes, 01b = unsigned words, 10b = signed bytes, 11b = signed words. Imm[3:2] selects the compare aggregation applied to fragment1 and fragment2, producing IntRes1. Imm[5:4] selects the polarity transformation producing IntRes2. Imm[6] selects the output: an index result in ECX for PCMPxSTRI, or a mask result in XMM0 for PCMPxSTRM.)

Figure 10-1. SSE4.2 String/Text Instruction Immediate Operand Control
The PCMPxSTRI instructions produce the final result as an integer index in ECX; the PCMPxSTRM instructions
produce the final result as a bit mask in the XMM0 register. The PCMPISTRy instructions support processing
string/text fragments using implicit length control via null termination, for handling string/text of
unknown size. The PCMPESTRy instructions support explicit length control via the EDX:EAX register pair to
specify the length of text fragments in the source operands.
The first intermediate result, IntRes1, is an aggregated result of bit patterns from parallel comparison
operations done on pairs of data elements from each text fragment, according to the imm[3:2] bit field
encoding; see Table 10-1.
Table 10-1. SSE4.2 String/Text Instructions Compare Operation on N-elements

| Imm[3:2] | Name          | IntRes1[i] is TRUE if                                                                                       | Potential Usage                                           |
|----------|---------------|-------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------|
| 00B      | Equal Any     | Element i in fragment2 matches any element j in fragment1                                                     | Tokenization, XML parser                                  |
| 01B      | Ranges        | Element i in fragment2 is within any range pair specified in fragment1                                        | Subsetting, case handling, XML parser, schema validation  |
| 10B      | Equal Each    | Element i in fragment2 matches element i in fragment1                                                         | Strcmp()                                                  |
| 11B      | Equal Ordered | Element i and subsequent, consecutive valid elements in fragment2 match fully or partially with fragment1 starting from element 0 | Substring searches, KMP, Strstr()       |

Input data element format selection using imm[1:0] can support signed or unsigned byte/word
elements.
The bit field imm[5:4] allows applying a unary transformation on IntRes1; see Table 10-2.


Table 10-2. SSE4.2 String/Text Instructions Unary Transformation on IntRes1

| Imm[5:4] | Name          | IntRes2[i] =                                                             |
|----------|---------------|--------------------------------------------------------------------------|
| 00B      | No Change     | IntRes1[i]                                                               |
| 01B      | Invert        | NOT IntRes1[i]                                                           |
| 10B      | No Change     | IntRes1[i]                                                               |
| 11B      | Mask Negative | IntRes1[i] if element i of fragment2 is invalid, otherwise NOT IntRes1[i] |

The output selection field, imm[6], is described in Table 10-3.

Table 10-3. SSE4.2 String/Text Instructions Output Selection Imm[6]

| Imm[6] | Instruction | Final Result                                                                                              |
|--------|-------------|-----------------------------------------------------------------------------------------------------------|
| 0B     | PCMPxSTRI   | ECX = offset of least significant bit set in IntRes2 if IntRes2 != 0; otherwise ECX = number of data elements per 16 bytes |
| 0B     | PCMPxSTRM   | XMM0 = ZeroExtend(IntRes2)                                                                                  |
| 1B     | PCMPxSTRI   | ECX = offset of most significant bit set in IntRes2 if IntRes2 != 0; otherwise ECX = number of data elements per 16 bytes  |
| 1B     | PCMPxSTRM   | Data element i of XMM0 = SignExtend(IntRes2[i])                                                             |

The comparison operation on each data element pair is defined in Table 10-4. Table 10-4 defines the type
of comparison operation between valid data elements (last row of Table 10-4) and the boundary conditions
when the fragment in a source operand contains invalid data elements (rows 1 through 3 of
Table 10-4). Arithmetic comparisons are performed only if both data elements are valid elements in
fragment1 and fragment2, as shown in row 4 of Table 10-4.
Table 10-4. SSE4.2 String/Text Instructions Element-Pair Comparison Definition

| Fragment1 Element | Fragment2 Element | Imm[3:2]=00B, Equal Any | Imm[3:2]=01B, Ranges | Imm[3:2]=10B, Equal Each | Imm[3:2]=11B, Equal Ordered |
|-------------------|-------------------|--------------------------|-----------------------|---------------------------|------------------------------|
| Invalid           | Invalid           | Force False              | Force False           | Force True                | Force True                   |
| Invalid           | Valid             | Force False              | Force False           | Force False               | Force True                   |
| Valid             | Invalid           | Force False              | Force False           | Force False               | Force False                  |
| Valid             | Valid             | Compare                  | Compare               | Compare                   | Compare                      |

The string and text processing instructions provide several aids to handle end-of-string situations; see
Table 10-5. Additionally, the PCMPxSTRy instructions are designed not to require 16-byte alignment, to
simplify text processing requirements.
Table 10-5. SSE4.2 String/Text Instructions EFLAGS Behavior

| EFLAGS | Description                                   | Potential Usage                                  |
|--------|-----------------------------------------------|--------------------------------------------------|
| CF     | Reset if IntRes2 = 0; otherwise set           | When CF=0, ECX = # of data elements to scan next |
| ZF     | Reset if entire 16-byte fragment2 is valid    | Likely end-of-string                             |
| SF     | Reset if entire 16-byte fragment1 is valid    |                                                  |
| OF     | IntRes2[0]                                    |                                                  |
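To make the immediate-operand encodings above concrete, the following intrinsic sketch uses the Ranges aggregation (imm[3:2] = 01B) on unsigned bytes (imm[1:0] = 00B) to locate the first decimal digit in a 16-byte fragment. The function name and range table are this sketch's own, not from the manual; the fragment is assumed readable for a full 16 bytes.

#include <nmmintrin.h>   /* SSE4.2 intrinsics */

/* Returns the index of the first byte in ['0','9'], or 16 if none is found.
   The range pair '0','9' is null-terminated, which ends the implicit length. */
int find_digit16(const char *p)
{
    const __m128i range = _mm_setr_epi8('0', '9', 0, 0, 0, 0, 0, 0,
                                        0, 0, 0, 0, 0, 0, 0, 0);
    __m128i frag = _mm_loadu_si128((const __m128i *)p);
    return _mm_cmpistri(range, frag, 0x04);   /* unsigned bytes, ranges, LSB index */
}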


10.1.1 CRC32

The CRC32 instruction computes the 32-bit cyclic redundancy checksum signature for a byte/word/dword or
qword stream of data. It can also be used as a hash function. For example, a dictionary uses hash indices
to de-reference strings; the CRC32 instruction can be easily adapted for use in this situation.
Example 10-1 shows a straightforward hash function that can be used to evaluate the hash index of a
string to populate a hash table. Typically, the hash index is derived from the hash value by taking the
remainder of the hash value modulo the size of the hash table.

Example 10-1. A Hash Function Example
unsigned int hash_str(unsigned char* pStr)
{   unsigned int hVal = (unsigned int)(*pStr++);
    while (*pStr)
    {   hVal = (hVal * CONST_A) + (hVal >> 24) + (unsigned int)(*pStr++);
    }
    return hVal;
}

The CRC32 instruction can be used to derive an alternate hash function. Example 10-2 takes advantage of the
32-bit granular CRC32 instruction to update the signature value of the input data stream. For strings of small
to moderate size, using the hardware-accelerated CRC32 can be twice as fast as Example 10-1.

Example 10-2. Hash Function Using CRC32
static unsigned cn_7e = 0x7efefeff, Cn_81 = 0x81010100;

unsigned int hash_str_32_crc32x(unsigned char* pStr)
{   unsigned *pDW = (unsigned *) &pStr[1];
    unsigned short *pWd = (unsigned short *) &pStr[1];
    unsigned int tmp, hVal = (unsigned int)(*pStr);
    if( !pStr[1]) ;
    else {
        tmp = ((pDW[0] + cn_7e) ^ (pDW[0] ^ -1)) & Cn_81;
        while ( !tmp )   // loop until a byte in *pDW is 0x00
        {   hVal = _mm_crc32_u32 (hVal, *pDW++);
            tmp = ((pDW[0] + cn_7e) ^ (pDW[0] ^ -1)) & Cn_81;
        }
        if( !pDW[0]) ;
        else if( pDW[0] < 0x100) {        // finish last byte that's non-zero
            hVal = _mm_crc32_u8 (hVal, pDW[0]);
        }
        else if( pDW[0] < 0x10000) {      // finish last two bytes that are non-zero
            hVal = _mm_crc32_u16 (hVal, pDW[0]);
        }
        else {                            // finish last three bytes that are non-zero
            hVal = _mm_crc32_u32 (hVal, pDW[0]);
        }
    }
    return hVal;
}
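For buffers whose length is known up front, the same instruction can be applied without the end-of-string detection logic of Example 10-2. The following sketch (the function name and tail handling are this sketch's own) processes 4 bytes per CRC32 step and finishes the tail byte-wise.

#include <nmmintrin.h>   /* SSE4.2 intrinsics */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

uint32_t crc32_buf(const void *buf, size_t len, uint32_t seed)
{
    const uint8_t *p = (const uint8_t *)buf;
    uint32_t crc = seed;
    while (len >= 4) {                     /* 4 bytes per CRC32 step */
        uint32_t dw;
        memcpy(&dw, p, 4);                 /* avoid an unaligned dereference */
        crc = _mm_crc32_u32(crc, dw);
        p += 4; len -= 4;
    }
    while (len--)                          /* byte-granular tail */
        crc = _mm_crc32_u8(crc, *p++);
    return crc;
}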

10.2 USING SSE4.2 STRING AND TEXT INSTRUCTIONS

String libraries provided by high-level languages or as part of system libraries are used in a wide range of
situations across applications and privileged system software. These situations can be accelerated using
a replacement string library that implements PCMPESTRI/PCMPESTRM/PCMPISTRI/PCMPISTRM.
Although a system-provided string library provides standardized string handling functionality and interfaces,
most situations dealing with structured document processing require considerably more sophistication,
optimization, and services not available from system-provided string libraries. For example,
structured document processing software often architects different class objects to provide building-block
functionality to service specific needs of the application. Often an application may choose to disperse
equivalent string library services into separate classes (string, lexer, parser) or integrate memory
management capability into string handling/lexing/parsing objects.
PCMPESTRI/PCMPESTRM/PCMPISTRI/PCMPISTRM instructions are general-purpose primitives that software
can use to build replacement string libraries or to build class hierarchies providing lexing/parsing
services for structured document processing. XML parsing and schema validation are examples of the
latter situation.
Unstructured, raw text/string data consists of characters and has no natural alignment preferences.
Therefore, PCMPESTRI/PCMPESTRM/PCMPISTRI/PCMPISTRM instructions are architected not to require
the 16-byte alignment restrictions of other 128-bit SIMD integer vector processing instructions.
With respect to memory alignment, PCMPESTRI/PCMPESTRM/PCMPISTRI/PCMPISTRM support unaligned
memory loads like other unaligned 128-bit memory access instructions, e.g. MOVDQU.
Unaligned memory accesses may encounter special situations that require additional coding techniques,
depending on whether the code runs in ring 3 application space or in privileged space. Specifically, an
unaligned 16-byte load may cross a page boundary. Section 10.2.1 discusses a technique that application
code can use. Section 10.2.2 discusses the situations that string library functions need to deal with. Section
10.3 gives detailed examples of using PCMPESTRI/PCMPESTRM/PCMPISTRI/PCMPISTRM instructions to
implement equivalent functionality of several string library functions in situations where application code
has control over memory buffer allocation.

10.2.1 Unaligned Memory Access and Buffer Size Management

In application code, the size requirements for memory buffer allocation should consider unaligned SIMD
memory semantics and application usage.
For certain types of application usage, it may be desirable to make distinctions between the valid buffer
range limit and the valid application data size (e.g. a video frame). The former must be greater than or equal
to the latter.


To support algorithms requiring unaligned 128-bit SIMD memory accesses, memory buffer allocation by
a caller function should add some pad space so that a callee function can safely use the
address pointer with unaligned 128-bit SIMD memory operations.
The minimal padding size should be the width of the SIMD register that might be used in conjunction with
unaligned SIMD memory access, as the sketch below illustrates.
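A minimal sketch of this buffer-size policy; the SIMD_PAD value of 16 corresponds to the XMM register width used by the PCMPxSTRy examples in this chapter, and the function name is illustrative.

#include <stdlib.h>

#define SIMD_PAD 16   /* width of one XMM register */

/* Allocate data_size usable bytes plus one register width of pad, so that an
   unaligned 16-byte load beginning at any valid data byte stays in bounds. */
char *alloc_text_buffer(size_t data_size)
{
    return (char *)malloc(data_size + SIMD_PAD);
}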

10.2.2 Unaligned Memory Access and String Library

String library functions may be used by application code or privileged code. String library functions must
be careful not to violate memory access rights. Therefore, a replacement string library that employs SIMD
unaligned accesses must use special techniques to ensure no memory access violation occurs.
Section 10.3.6 provides an example of a replacement string library function implemented with SSE4.2
and demonstrates a technique to use 128-bit unaligned memory access without unintentionally crossing
a page boundary.

10.3 SSE4.2 APPLICATION CODING GUIDELINE AND EXAMPLES

Software implementing SSE4.2 instructions must use the CPUID feature flag mechanism to verify the
processor’s support for SSE4.2. Details can be found in Chapter 12 of Intel® 64 and IA-32 Architectures Software
Developer’s Manual, Volume 1 and in the CPUID entry of Chapter 3 in Intel® 64 and IA-32 Architectures Software
Developer’s Manual, Volume 2A.
In the following sections, we use several examples in string/text processing of progressive complexity to
illustrate the basic techniques of adapting the SIMD approach to implement string/text processing using
PCMPxSTRy instructions in SSE4.2. For simplicity, we will consider string/text in byte data format in
situations where caller functions have allocated sufficient buffer size to support unaligned 128-bit SIMD loads
from memory without encountering side-effects of crossing page boundaries.

10.3.1 Null Character Identification (Strlen equivalent)

The most widely used string function is probably strlen(). One can view the lexing requirement of strlen()
as identifying the null character in a text block of unknown size (the end-of-string condition). A brute-force,
byte-granular implementation fetches data inefficiently by loading one byte at a time.
An optimized implementation using general-purpose instructions can take advantage of dword operations in
a 32-bit environment (and qword operations in a 64-bit environment) to reduce the number of iterations.
A 32-bit assembly implementation of strlen() is shown in Example 10-3. The peak execution throughput of
handling the EOS condition is determined by eight ALU instructions in the main loop.

Example 10-3. Strlen() Using General-Purpose Instructions
int strlen_asm(const char* s1)
{   int len = 0;
    _asm{
        mov ecx, s1
        test ecx, 3             ; test addr aligned to dword
        je short _main_loop1    ; dword aligned loads would be faster
_malign_str1:
        mov al, byte ptr [ecx]  ; read one byte at a time
        add ecx, 1
        test al, al             ; if we find a null, go calculate the length
        je short _byte3a
        test ecx, 3             ; test if addr is now aligned to dword
        jne short _malign_str1  ; if not, repeat
align 16
_main_loop1:                    ; read each 4-byte block and check for a NULL char in the dword
        mov eax, [ecx]          ; read 4 bytes to reduce loop count
        mov edx, 7efefeffh
        add edx, eax
        xor eax, -1
        xor eax, edx
        add ecx, 4              ; increment address pointer by 4
        test eax, 81010100h     ; if no null code in 4-byte stream, do the next 4 bytes
        je short _main_loop1
        ; there is a null char in the dword we just read,
        ; since we already advanced pointer ecx by 4, and the dword is lost
        mov eax, [ecx-4]        ; re-read the dword that contains at least a null char
        test al, al             ; if byte0 is null
        je short _byte0a        ; the least significant byte is null
        test ah, ah             ; if byte1 is null
        je short _byte1a
        test eax, 00ff0000h     ; if byte2 is null
        je short _byte2a
        test eax, 0ff000000h    ; if byte3 is null
        je short _byte3a
        jmp short _main_loop1
_byte3a:
        ; we already found the null, but pointer already advanced by 1
        lea eax, [ecx-1]        ; load effective address corresponding to null code
        mov ecx, s1
        sub eax, ecx            ; difference between null code and start address
        jmp short _resulta
_byte2a:
        lea eax, [ecx-2]
        mov ecx, s1
        sub eax, ecx
        jmp short _resulta
_byte1a:
        lea eax, [ecx-3]
        mov ecx, s1
        sub eax, ecx
        jmp short _resulta
_byte0a:
        lea eax, [ecx-4]
        mov ecx, s1
        sub eax, ecx
_resulta:
        mov len, eax            ; store result
    }
    return len;
}
The equivalent functionality of EOS identification can be implemented using PCMPISTRI. Example 10-4
shows a simplistic SSE4.2 implementation that scans a text block by loading 16-byte text fragments and
locating the null termination character. Example 10-5 shows an optimized SSE4.2 implementation that
demonstrates the importance of using memory disambiguation to improve instruction-level parallelism.


Example 10-4. Sub-optimal PCMPISTRI Implementation of EOS handling
static char ssch2[16] = {0x1, 0xff, 0x00, };  // range values for non-null characters

int strlen_un_optimized(const char* s1)
{   int len = 0;
    _asm{
        mov eax, s1
        movdqu xmm2, ssch2  ; load character pair as range (0x01 to 0xff)
        xor ecx, ecx        ; initial offset to 0
_loopc:
        add eax, ecx        ; update addr pointer to start of text fragment
        pcmpistri xmm2, [eax], 14h  ; unsigned bytes, ranges, invert, lsb index returned to ecx
        ; if there is a null char in the 16-byte fragment at [eax], zf will be set.
        ; if all 16 bytes of the fragment are non-null characters, ecx will return 16.
        jnz short _loopc    ; the fragment has no null code, ecx has 16, continue search
        ; we have a null code in the fragment, ecx has the offset of the null code
        add eax, ecx        ; add ecx to the address of the last fragment
        mov edx, s1         ; retrieve effective address of the input string
        sub eax, edx        ; the string length
        mov len, eax        ; store result
    }
    return len;
}

The code sequence shown in Example 10-4 has a loop consisting of three instructions. From a performance
tuning perspective, the loop iteration has a loop-carry dependency because the address update is done
using the result (ECX value) of the previous loop iteration. This loop-carry dependency deprives the
out-of-order engine of the capability to have multiple iterations of the instruction sequence making forward progress.
The latency of the memory loads, the latency of these instructions, and any bypass delay cannot be amortized
by OOO execution in the presence of the loop-carry dependency.
A simple optimization technique to eliminate the loop-carry dependency is shown in Example 10-5.
Using the memory disambiguation technique to eliminate the loop-carry dependency, the cumulative latency
exposure of the 3-instruction sequence of Example 10-5 is amortized over multiple iterations, and the net
cost of executing each iteration (handling 16 bytes) is less than 3 cycles. In contrast, handling 4 bytes of
string data using 8 ALU instructions in Example 10-3 will also take a little less than 3 cycles per iteration,
whereas each iteration of the code sequence in Example 10-4 will take more than 10 cycles because of the
loop-carry dependency.

Example 10-5. Strlen() Using PCMPISTRI without Loop-Carry Dependency
int strlen_sse4_2(const char* s1)
{   int len = 0;
    _asm{
        mov eax, s1
        movdqu xmm2, ssch2  ; load character pair as range (0x01 to 0xff)
        xor ecx, ecx        ; initial offset to 0
        sub eax, 16         ; address arithmetic to eliminate extra instruction and a branch
_loopc:
        add eax, 16         ; adjust address pointer and disambiguate load address for each iteration
        pcmpistri xmm2, [eax], 14h  ; unsigned bytes, ranges, invert, lsb index returned to ecx
        ; if there is a null char in the [eax] fragment, zf will be set.
        ; if all 16 bytes of the fragment are non-null characters, ecx will return 16.
        jnz short _loopc    ; ecx will be 16 if there is no null byte in [eax], so we disambiguate
_endofstring:
        add eax, ecx        ; add ecx to the address of the last fragment
        mov edx, s1         ; retrieve effective address of the input string
        sub eax, edx        ; the string length
        mov len, eax        ; store result
    }
    return len;
}

SSE4.2 Coding Rule 5. (H impact, H generality) Loop-carry dependencies that depend on the ECX
result of PCMPESTRI/PCMPESTRM/PCMPISTRI/PCMPISTRM for address adjustment must be minimized.
Isolate code paths that expect the ECX result to be 16 (bytes) or 8 (words), and replace these values of ECX
with constants in address adjustment expressions to take advantage of memory disambiguation
hardware.

10.3.2 White-Space-Like Character Identification

Character-granular text processing algorithms have developed techniques to handle specific tasks
and to remedy the efficiency issues of character-granular approaches. One such technique is using look-up
tables for character subset classification. For example, some applications may need to separate
alphanumeric characters from white-space-like characters, and more than one character may be treated as a
white-space character.
Example 10-6 illustrates a simple situation of identifying white-space-like characters for the purpose of
marking the beginning and end of consecutive non-white-space characters.

Example 10-6. WordCnt() Using C and Byte-Scanning Technique
// Counting words involves locating the boundaries of contiguous non-whitespace characters.
// Different software may choose its own mapping of the white-space character set.
// This example employs a simple definition for tutorial purposes:
// the non-whitespace character set will consist of: A-Z, a-z, 0-9, and the apostrophe mark '.
// The example uses a simple technique to map characters into bit patterns of square waves;
// we can simply count the number of falling edges.

static char alphnrange[16] = {0x27, 0x27, 0x30, 0x39, 0x41, 0x5a, 0x61, 0x7a, 0x0};
static char alp_map8[32] = {0x0, 0x0, 0x0, 0x0, 0x80, 0x0, 0xff, 0x3, 0xfe, 0xff, 0xff, 0x7,
    0xfe, 0xff, 0xff, 0x7};  // 32-byte lookup table; 1s map to bit patterns of alphanumerics in alphnrange

int wordcnt_c(const char* s1)
{   int i, j, cnt = 0;
    char cc, cc2;
    char flg[3];  // capture a wavelet to locate a falling edge
    cc2 = cc = s1[0];
    // use the compacted bit pattern to consolidate multiple comparisons into one look-up
    if( alp_map8[cc>>3] & ( 1 << ( cc & 7) ) )
    {   flg[1] = 1; }  // non-white-space char that is part of a word;
        // we're including the apostrophe in this example since counting the
        // following 's' as a separate word would be kind of silly
    else
    {   flg[1] = 0; }  // 0: whitespace; punctuation not considered as part of a word
    i = 1;  // now we're ready to scan through the rest of the block
    // we'll try to pick out each falling edge of the bit pattern to increment the word count.
    // this works with consecutive white spaces, dealing with punctuation marks, and
    // treating hyphens as connecting two separate words.
    while (cc2)
    {   cc2 = s1[i];
        if( alp_map8[cc2>>3] & ( 1 << ( cc2 & 7) ) )
        {   flg[2] = 1; }  // non-white-space
        else
        {   flg[2] = 0; }  // white-space-like
        if( !flg[2] && flg[1] )
        {   cnt++; }       // found the falling edge
        flg[1] = flg[2];
        i++;
    }
    return cnt;
}

In Example 10-6, a 32-byte look-up table is constructed to represent the ASCII code values 0x0-0xff,
partitioned with each bit of 1 corresponding to the specified subset of characters. While this bit-lookup
technique simplifies the comparison operations, data fetching remains byte-granular.
Example 10-7 shows an equivalent implementation of counting words using PCMPISTRM. The loop iteration
is performed at 16-byte granularity instead of byte granularity. Additionally, character set subsetting
is easily expressed using range value pairs, and parallel comparisons between the range values and each
byte in the text fragment are performed by executing PCMPISTRM once.

Example 10-7. WordCnt() Using PCMPISTRM
// an SSE4.2 example of counting words using the definition of the non-whitespace character
// set {A-Z, a-z, 0-9, '}. Each text fragment (up to 16 bytes) is mapped to a
// 16-bit pattern, which may contain one or more falling edges. Scanning bit-by-bit
// would be inefficient and goes counter to leveraging SIMD programming techniques.
// Since each falling edge must have a preceding rising edge, we take a finite
// difference approach to derive a pattern where each rising/falling edge maps to a 2-bit pulse,
// count the number of bits in the 2-bit pulses using popcnt, and divide by two.
int wdcnt_sse4_2(const char* s1)
{   int len = 0;
    _asm{
        mov eax, s1
        movdqu xmm3, alphnrange  ; load range value pairs to detect non-white-space codes
        xor ecx, ecx
        xor esi, esi
        xor edx, edx
        movdqu xmm1, [eax]
        pcmpistrm xmm3, xmm1, 04h ; white-space-like char becomes 0 in xmm0[15:0]
        movdqa xmm4, xmm0
        movdqa xmm1, xmm0
        psrld xmm4, 15            ; save MSB to use in next iteration
        movdqa xmm5, xmm1
        psllw xmm5, 1             ; lsb is effectively mapped to a white space
        pxor xmm5, xmm0           ; the first edge is due to the artifact above
        pextrd edi, xmm5, 0
        jz _lastfragment          ; if xmm1 had a null, zf would be set
        popcnt edi, edi           ; the first fragment will include a rising edge
        add esi, edi
        mov ecx, 16
_loopc:
        add eax, ecx              ; advance address pointer
        movdqu xmm1, [eax]
        pcmpistrm xmm3, xmm1, 04h ; white-space-like char becomes 0 in xmm0[15:0]
        movdqa xmm5, xmm4         ; retrieve the MSB of the mask from the last iteration
        movdqa xmm4, xmm0
        psrld xmm4, 15            ; save MSB of this iteration for use in the next iteration
        movdqa xmm1, xmm0
        psllw xmm1, 1
        por xmm5, xmm1            ; combine MSB of last iter and the rest from current iter
        pxor xmm5, xmm0           ; differentiate binary waveform into pattern of edges
        pextrd edi, xmm5, 0       ; the edge pattern has (1 bit from last, 15 bits from this round)
        jz _lastfragment          ; if xmm1 had a null, zf would be set
        mov ecx, 16               ; xmm1 had no null char, advance 16 bytes
        popcnt edi, edi           ; count both rising and trailing edges
        add esi, edi              ; keep a running count of both edges
        jmp short _loopc
_lastfragment:
        popcnt edi, edi           ; count both rising and trailing edges
        add esi, edi              ; keep a running count of both edges
        shr esi, 1                ; word count corresponds to the trailing edges
        mov len, esi
    }
    return len;
}

10.3.3 Substring Searches

Strstr() is a common function in the standard string library. Typically, a library implements
strstr(sTarg, sRef) with a brute-force, byte-granular technique of iterative comparisons between the
reference string and a subset of the target string. Brute-force, byte-granular
techniques provide reasonable efficiency when the first character of the target substring and the
reference string are different, allowing subsequent string comparisons of target substrings to proceed
forward to the next byte in the target string.
When a string comparison encounters partial matches of several characters (i.e. the sub-string search
found a partial match starting from the beginning of the reference string) and determines that the partial
match led to a false match, the brute-force search process needs to go backward and restart string
comparisons from a location that had participated in previous string comparison operations. This is
referred to as the re-trace inefficiency of the brute-force substring search algorithm. See Figure 10-2.


(Figure: a byte-granular, brute-force search compares the reference string against successive offsets of the target string, with T/F marks indicating each comparison outcome; after a partial match of the first 4 bytes produces a false match, the search must retrace 3 bytes in the target string before comparing again.)

Figure 10-2. Retrace Inefficiency of Byte-Granular, Brute-Force Search
The Knuth-Morris-Pratt algorithm1 (KMP) provides an elegant enhancement to overcome the re-trace
inefficiency of brute-force substring searches. By deriving an overlap table that is used to manage the
retrace distance when a partial match leads to a false match, the KMP algorithm is very useful for
applications that search a large corpus of documents for relevant articles containing keywords.
Example 10-8 illustrates a C-code example of a KMP substring search.

Example 10-8. KMP Substring Search in C
// s1 is the target string of length cnt1
// s2 is the reference string of length cnt2
// j is the offset in target string s1 to start each round of string comparison
// i is the offset in reference string s2 to perform byte-granular comparison

int str_kmp_c(const char* s1, int cnt1, const char* s2, int cnt2 )
{   int i, j;
    i = 0; j = 0;
    while ( i+j < cnt1) {
        if( s2[i] == s1[i+j]) {
            i++;
            if( i == cnt2) break;     // found full match
        }
        else {
            j = j+i - ovrlap_tbl[i];  // update the offset in s1 to start the next round of string compare
            if( i > 0) {
                i = ovrlap_tbl[i];    // update the offset in s2 at which the next string compare should start
            }
        }
    };
    return j;
}

1. Donald E. Knuth, James H. Morris, and Vaughan R. Pratt; SIAM J. Comput. Volume 6, Issue 2, pp. 323-350 (1977)
void kmp_precalc(const char * s2, int cnt2)
{   int i = 2;
    char nch = 0;
    ovrlap_tbl[0] = -1; ovrlap_tbl[1] = 0;
    // pre-calculate the KMP overlap table
    while( i < cnt2) {
        if( s2[i-1] == s2[nch]) {
            ovrlap_tbl[i] = nch + 1;
            i++; nch++;
        }
        else if ( nch > 0) nch = ovrlap_tbl[nch];
        else {
            ovrlap_tbl[i] = 0;
            i++;
        }
    };
    ovrlap_tbl[cnt2] = 0;
}
Example 10-8 also includes the calculation of the KMP overlap table. Typical usage of the KMP algorithm
involves multiple invocations with the same reference string, so the overhead of precalculating the overlap
table is easily amortized. When a false match is determined at offset i of the reference string, the overlap
table will predict where the next round of string comparison should start (updating the offset j), and the
offset in the reference string at which byte-granular character comparison should resume/restart.
While the KMP algorithm provides an efficiency improvement over brute-force, byte-granular substring searches,
its best performance is still limited by the number of byte-granular operations. To demonstrate the versatility
and built-in lexical capability of PCMPISTRI, we show an SSE4.2 implementation of substring search
using a brute-force 16-byte granular approach in Example 10-9, and an implementation combining the KMP overlap table with
PCMPISTRI in Example 10-10.

Example 10-9. Brute-Force Substring Search Using PCMPISTRI Intrinsic
int strsubs_sse4_2i(const char* s1, int cnt1, const char* s2, int cnt2 )
{ int kpm_i=0, idx;
int ln1= 16, ln2=16, rcnt1 = cnt1, rcnt2= cnt2;
__m128i *p1 = (__m128i *) s1;
__m128i *p2 = (__m128i *) s2;
__m128i frag1, frag2;
int cmp, cmp2, cmp_s;
__m128i *pt = NULL;
if( cnt2 > cnt1 || !cnt1) return -1;
frag1 = _mm_loadu_si128(p1);// load up to 16 bytes of fragment
frag2 = _mm_loadu_si128(p2);// load up to 16 bytes of fragment
while(rcnt1 > 0)
{ cmp_s = _mm_cmpestrs(frag2, (rcnt2>ln2)? ln2: rcnt2, frag1, (rcnt1>ln1)? ln1: rcnt1, 0x0c);
cmp = _mm_cmpestri(frag2, (rcnt2>ln2)? ln2: rcnt2, frag1, (rcnt1>ln1)? ln1: rcnt1, 0x0c);
if( !cmp) { // we have a partial match that needs further analysis
if( cmp_s) { // if we're done with s2
if( pt)
{idx = (int) ((char *) pt - (char *) s1) ; }
else
{idx = (int) ((char *) p1 - (char *) s1) ; }
return idx;
}
// we do a round of string compare to verify full match till end of s2
if( pt == NULL) pt = p1;
cmp2 = 16;
rcnt2 = cnt2 - 16 -(int) ((char *)p2-(char *)s2);
while( cmp2 == 16 && rcnt2) { // each 16B frag matches,
rcnt1 = cnt1 - 16 -(int) ((char *)p1-(char *)s1);
rcnt2 = cnt2 - 16 -(int) ((char *)p2-(char *)s2);
if( rcnt1 <=0 || rcnt2 <= 0 ) break;
p1 = (__m128i *)(((char *)p1) + 16);
p2 = (__m128i *)(((char *)p2) + 16);
frag1 = _mm_loadu_si128(p1);// load up to 16 bytes of fragment
frag2 = _mm_loadu_si128(p2);// load up to 16 bytes of fragment
cmp2 = _mm_cmpestri(frag2, (rcnt2>ln2)? ln2: rcnt2, frag1, (rcnt1>ln1)? ln1: rcnt1, 0x18); // lsb, eq each
};
if( !rcnt2 || rcnt2 == cmp2) {
idx = (int) ((char *) pt - (char *) s1) ;
return idx;
}
else if ( rcnt1 <= 0) { // also cmp2 < 16, non match
if( cmp2 == 16 && ((rcnt1 + 16) >= (rcnt2+16) ) )
{idx = (int) ((char *) pt - (char *) s1) ;
return idx;
}
else return -1;
}
else { // in brute force, we advance fragment offset in target string s1 by 1
p1 = (__m128i *)(((char *)pt) + 1); // we're not taking advantage of kmp
rcnt1 = cnt1 -(int) ((char *)p1-(char *)s1);
pt = NULL;
p2 = (__m128i *)((char *)s2) ;
rcnt2 = cnt2 -(int) ((char *)p2-(char *)s2);
frag1 = _mm_loadu_si128(p1);// load next fragment from s1
frag2 = _mm_loadu_si128(p2);// load up to 16 bytes of fragment
}

}
else{
if( cmp == 16) p1 = (__m128i *)(((char *)p1) + 16);
else p1 = (__m128i *)(((char *)p1) + cmp);
rcnt1 = cnt1 -(int) ((char *)p1-(char *)s1);
if( pt && cmp ) pt = NULL;
frag1 = _mm_loadu_si128(p1);// load next fragment from s1
}

}
return idx;
}

In Example 10-9, address adjustment using a constant to minimize loop-carry dependency is practiced in two places:
•  In the inner while loop of string comparison that determines full match or false match (the result cmp2 is not used for address adjustment, to avoid dependency).
•  In the last code block, when the outer loop executed PCMPISTRI to perform 16 ordered compares between a target fragment and the first 16-byte fragment of the reference string, and all 16 ordered compares produced a false result (producing cmp with a value of 16).

Example 10-10 shows an equivalent intrinsic implementation of substring search using SSE4.2 and the KMP overlap table. When the inner loop of string comparison determines a false match, the KMP overlap table is consulted to determine the address offsets for the target string fragment and the reference string fragment that minimize retrace.
It should be noted that a significant portion of retraces with retrace distance less than 15 bytes is avoided even in the brute-force SSE4.2 implementation of Example 10-9. This is due to the ordered-compare primitive of PCMPISTRI. “Ordered compare” performs 16 string fragment compares, and many false matches with less than 15 bytes of partial match can be filtered out in the same iteration that executed PCMPISTRI.
Retrace distances greater than 15 bytes are not filtered out by Example 10-9. By consulting the KMP overlap table, Example 10-10 can eliminate retraces greater than 15 bytes.

Example 10-10. Substring Search Using PCMPISTRI and KMP Overlap Table
int strkmp_sse4_2(const char* s1, int cnt1, const char* s2, int cnt2 )
{ int kpm_i=0, idx;
int ln1= 16, ln2=16, rcnt1 = cnt1, rcnt2= cnt2;
__m128i *p1 = (__m128i *) s1;
__m128i *p2 = (__m128i *) s2;
__m128i frag1, frag2;
int cmp, cmp2, cmp_s;
__m128i *pt = NULL;
if( cnt2 > cnt1 || !cnt1) return -1;
frag1 = _mm_loadu_si128(p1);// load up to 16 bytes of fragment
frag2 = _mm_loadu_si128(p2);// load up to 16 bytes of fragment
while(rcnt1 > 0)
{ cmp_s = _mm_cmpestrs(frag2, (rcnt2>ln2)? ln2: rcnt2, frag1, (rcnt1>ln1)? ln1: rcnt1, 0x0c);
cmp = _mm_cmpestri(frag2, (rcnt2>ln2)? ln2: rcnt2, frag1, (rcnt1>ln1)? ln1: rcnt1, 0x0c);
if( !cmp) { // we have a partial match that needs further analysis
if( cmp_s) { // if we've reached the end with s2
if( pt)
{idx = (int) ((char *) pt - (char *) s1) ; }
else
{idx = (int) ((char *) p1 - (char *) s1) ; }
return idx;
}
// we do a round of string compare to verify full match till end of s2
if( pt == NULL) pt = p1;
cmp2 = 16;
rcnt2 = cnt2 - 16 -(int) ((char *)p2-(char *)s2);
while( cmp2 == 16 && rcnt2) { // each 16B frag matches
rcnt1 = cnt1 - 16 -(int) ((char *)p1-(char *)s1);
rcnt2 = cnt2 - 16 -(int) ((char *)p2-(char *)s2);
if( rcnt1 <=0 || rcnt2 <= 0 ) break;
p1 = (__m128i *)(((char *)p1) + 16);
p2 = (__m128i *)(((char *)p2) + 16);
frag1 = _mm_loadu_si128(p1);// load up to 16 bytes of fragment
frag2 = _mm_loadu_si128(p2);// load up to 16 bytes of fragment
cmp2 = _mm_cmpestri(frag2, (rcnt2>ln2)? ln2: rcnt2, frag1, (rcnt1>ln1)? ln1: rcnt1, 0x18); // lsb, eq each
};
if( !rcnt2 || rcnt2 == cmp2) {
idx = (int) ((char *) pt - (char *) s1) ;
return idx;
}
else if ( rcnt1 <= 0 ) { // also cmp2 < 16, non match
return -1;
}
else { // a partial match led to false match, consult KMP overlap table for addr adjustment
kpm_i = (int) ((char *)p1 - (char *)pt)+ cmp2 ;
p1 = (__m128i *)(((char *)pt) + (kpm_i - ovrlap_tbl[kpm_i])); // use kmp to skip retrace
rcnt1 = cnt1 -(int) ((char *)p1-(char *)s1);
pt = NULL;
p2 = (__m128i *)(((char *)s2) + (ovrlap_tbl[kpm_i]));
rcnt2 = cnt2 -(int) ((char *)p2-(char *)s2);
frag1 = _mm_loadu_si128(p1);// load next fragment from s1
frag2 = _mm_loadu_si128(p2);// load up to 16 bytes of fragment
}
}
else{
if( kpm_i && ovrlap_tbl[kpm_i]) {
p2 = (__m128i *)(((char *)s2) );
frag2 = _mm_loadu_si128(p2);// load up to 16 bytes of fragment
//p1 = (__m128i *)(((char *)p1) );
//rcnt1 = cnt1 -(int) ((char *)p1-(char *)s1);
if( pt && cmp ) pt = NULL;
rcnt2 = cnt2 ;
//frag1 = _mm_loadu_si128(p1);// load next fragment from s1
frag2 = _mm_loadu_si128(p2);// load up to 16 bytes of fragment
kpm_i = 0;
}
else { // equ order comp resulted in sub-frag match or non-match
if( cmp == 16) p1 = (__m128i *)(((char *)p1) + 16);
else p1 = (__m128i *)(((char *)p1) + cmp);
rcnt1 = cnt1 -(int) ((char *)p1-(char *)s1);
if( pt && cmp ) pt = NULL;
frag1 = _mm_loadu_si128(p1);// load next fragment from s1
}
}
}
return idx;
}
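As a usage illustration (a hypothetical driver, not part of the manual's listings), both routines return the byte offset of the first full match of s2 within s1, or -1 when no match exists:

#include <string.h>
#include <stdio.h>
int strsubs_sse4_2i(const char* s1, int cnt1, const char* s2, int cnt2); // Example 10-9
int strkmp_sse4_2(const char* s1, int cnt1, const char* s2, int cnt2);   // Example 10-10
int main(void)
{
    const char *hay = "baboon babble babbage";
    const char *ndl = "babbage";
    printf("%d\n", strsubs_sse4_2i(hay, (int) strlen(hay), ndl, (int) strlen(ndl))); // prints 14
    printf("%d\n", strkmp_sse4_2(hay, (int) strlen(hay), ndl, (int) strlen(ndl)));   // prints 14
    return 0;
}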

The relative speedup of byte-granular KMP, brute-force SSE4.2, and SSE4.2 with KMP overlap table over byte-granular brute-force substring search is illustrated in Figure 10-3, which plots relative speedup against percentage of retrace for a 55-byte reference string. A retrace of 40% in the graph means that a false match is determined after a partial match of the first 22 characters.
So when brute-force, byte-granular code has to retrace, the other three implementations may be able to avoid the need to retrace because:
•  Example 10-8 can use the KMP overlap table to predict the start offset of the next round of string compare operation after a partial-match/false-match, but forward movement after a first-character false match is still byte-granular.
•  Example 10-9 can avoid retraces shorter than 15 bytes but is subject to a retrace of 21 bytes after a partial-match/false-match at byte 22 of the reference string. Forward movement after each order-compare false match is 16-byte-granular.
•  Example 10-10 avoids the retrace of 21 bytes after a partial-match/false-match, but the KMP overlap table lookup incurs some overhead. Forward movement after each order-compare false match is 16-byte-granular.

Figure 10-3. SSE4.2 Speedup of SubString Searches (chart “SSE4.2 Sub-String Match Performance”: relative performance, 0.0 to 7.0, of the Brute, KMP, STTNI, and STTNI+KMP implementations plotted against retrace percentage, 8.7% to 95.6%, of a non-degenerate 55-byte reference string)

10.3.4   String Token Extraction and Case Handling

Token extraction is a common task in text/string handling. It is one of the foundations of implementing lexer/parser objects of higher sophistication. Indexing services also build on tokenization primitives to sort text data from streams.
Tokenization requires the flexibility to use an array of delimiter characters.
A library implementation of strtok_s() may employ a table-lookup technique to consolidate sequential comparisons of the delimiter characters into one comparison (similar to Example 10-6). An SSE4.2 implementation of the equivalent functionality of strtok_s() using intrinsics is shown in Example 10-11.
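The packed-bit lookup at the heart of this table-lookup technique (and of Example 10-11's ws_map8 table) can be sketched as follows. This is a minimal illustration; the helper name is not part of the manual's listings. Each of the 256 possible byte values maps to one bit in a 32-byte table, so a single AND tests membership in the delimiter set:

/* build step: for each delimiter character c, set map[c >> 3] |= 1 << (c & 7); */
static inline int is_delim(const char map[32], unsigned char c)
{
    return map[c >> 3] & (1 << (c & 7)); // nonzero if c is in the delimiter set
}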

Example 10-11. Equivalent Strtok_s() Using PCMPISTRI Intrinsic
char ws_map8[32]; // packed bit lookup table for delimiter characters
char * strtok_sse4_2i(char* s1, char *sdlm, char ** pCtxt)
{
__m128i *p1 = (__m128i *) s1;
__m128i frag1, stmpz, stmp1;
int cmp_z, jj =0;
int start, endtok, s_idx, ldx;
if (sdlm == NULL || pCtxt == NULL) return NULL;
if( p1 == NULL && *pCtxt == NULL) return NULL;
if( s1 == NULL) {
if( *pCtxt[0] == 0 ) { return NULL; }
p1 = (__m128i *) *pCtxt;
s1 = *pCtxt;
}
else p1 = (__m128i *) s1;
memset(&ws_map8[0], 0, 32);
while (sdlm[jj] ) {
ws_map8[ (sdlm[jj] >> 3) ] |= (1 << (sdlm[jj] & 7) ); jj++;
}
frag1 = _mm_loadu_si128(p1);// load up to 16 bytes of fragment
stmpz = _mm_loadu_si128((__m128i *)sdlm);
// if the first char is not a delimiter , proceed to check non-delimiter,
// otherwise need to skip leading delimiter chars
if( ws_map8[s1[0]>>3] & (1 << (s1[0]&7)) ) {
start = s_idx = _mm_cmpistri(stmpz, frag1, 0x10);// unsigned bytes/equal any, invert, lsb
}
else start = s_idx = 0;
// check if we're dealing with short input string less than 16 bytes
cmp_z = _mm_cmpistrz(stmpz, frag1, 0x10);
if( cmp_z) { // last fragment
if( !start) {
endtok = ldx = _mm_cmpistri(stmpz, frag1, 0x00);
if( endtok == 16) { // didn't find delimiter at the end, since it's null-terminated
// find where is the null byte
*pCtxt = s1+ 1+ _mm_cmpistri(frag1, frag1, 0x40);
return s1;
}
else { // found a delimiter that ends this word
s1[start+endtok] = 0;
*pCtxt = s1+start+endtok+1;
}
}

}

else {
if(!s1[start] ) {
*pCtxt = s1 + start +1;
return NULL;
}
p1 = (__m128i *)(((char *)p1) + start);
frag1 = _mm_loadu_si128(p1);// load up to 16 bytes of fragment
endtok = ldx = _mm_cmpistri(stmpz, frag1, 0x00);// unsigned bytes/equal any, lsb
if( endtok == 16) { // looking for delimiter, found none
*pCtxt = (char *)p1 + 1+ _mm_cmpistri(frag1, frag1, 0x40);
return s1+start;
}
else { // found delimiter before null byte
s1[start+endtok] = 0;
*pCtxt = s1+start+endtok+1;
}
}

else
{ while ( !cmp_z && s_idx == 16) {
p1 = (__m128i *)(((char *)p1) + 16);
frag1 = _mm_loadu_si128(p1);// load up to 16 bytes of fragment
s_idx = _mm_cmpistri(stmpz, frag1, 0x10);// unsigned bytes/equal any, invert, lsb
cmp_z = _mm_cmpistrz(stmpz, frag1, 0x10);
}
if(s_idx != 16) start = ((char *) p1 -s1) + s_idx;
else { // corner case if we ran to the end looking for delimiter and never found a non-delimiter
*pCtxt = (char *)p1 +1+ _mm_cmpistri(frag1, frag1, 0x40);
return NULL;
}
if( !s1[start] ) { // in case a null byte follows delimiter chars
*pCtxt = s1 + start+1;
return NULL;
}
// now proceed to find how many non-delimiters are there
p1 = (__m128i *)(((char *)p1) + s_idx);
frag1 = _mm_loadu_si128(p1);// load up to 16 bytes of fragment
endtok = ldx = _mm_cmpistri(stmpz, frag1, 0x00);// unsigned bytes/equal any, lsb
cmp_z = 0;
while ( !cmp_z && ldx == 16) {
p1 = (__m128i *)(((char *)p1) + 16);
frag1 = _mm_loadu_si128(p1);// load up to 16 bytes of fragment
ldx = _mm_cmpistri(stmpz, frag1, 0x00);// unsigned bytes/equal any, lsb
cmp_z = _mm_cmpistrz(stmpz, frag1, 0x00);
if(cmp_z) { endtok += ldx; }
}
if( cmp_z ) { // reached the end of s1
if( ldx < 16) // end of word found by finding a delimiter
endtok += ldx;
else { // end of word found by finding the null
if( s1[start+endtok]) // ensure this frag doesn't start with null byte
endtok += 1+ _mm_cmpistri(frag1, frag1, 0x40);
}
}
*pCtxt = s1+start+endtok+1;
s1[start+endtok] = 0;

}
return (char *) (s1 + start);
}

An SSE4.2 implementation of the equivalent functionality of strupr() using intrinsics is shown in Example 10-12.

Example 10-12. Equivalent Strupr() Using PCMPISTRM Intrinsic
static char uldelta[16]= {0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20,
0x20, 0x20};
static char ranglc[6]= {0x61, 0x7a, 0x00, 0x00, 0x00, 0x00};
char * strup_sse4_2i( char* s1)
{int len = 0, res = 0;
__m128i *p1 = (__m128i *) s1;
__m128i frag1, ranglo, rmsk, stmpz, stmp1;
int cmp_c, cmp_z, cmp_s;
if( !s1[0]) return (char *) s1;
frag1 = _mm_loadu_si128(p1);// load up to 16 bytes of fragment
ranglo = _mm_loadu_si128((__m128i *)ranglc);// load up to 16 bytes of fragment
stmpz = _mm_loadu_si128((__m128i *)uldelta);
cmp_z = _mm_cmpistrz(ranglo, frag1, 0x44);// range compare, produce byte masks
while (!cmp_z)
{
rmsk = _mm_cmpistrm(ranglo, frag1, 0x44); // producing byte mask
stmp1 = _mm_blendv_epi8(stmpz, frag1, rmsk); // bytes of lc preserved, other bytes replaced by const
stmp1 =_mm_sub_epi8(stmp1, stmpz); // bytes of lc becomes uc, other bytes are now zero
stmp1 = _mm_blendv_epi8(frag1, stmp1, rmsk); //bytes of lc replaced by uc, other bytes unchanged
_mm_storeu_si128(p1, stmp1);//
p1 = (__m128i *)(((char *)p1) + 16);
frag1 = _mm_loadu_si128(p1);// load up to 16 bytes of fragment
cmp_z = _mm_cmpistrz(ranglo, frag1, 0x44);
}
if( *(char *)p1 == 0) return (char *) s1;
rmsk = _mm_cmpistrm(ranglo, frag1, 0x44);// byte mask, valid lc bytes are 1, all other 0
stmp1 = _mm_blendv_epi8(stmpz, frag1, rmsk); // bytes of lc continue, other bytes replaced by const
stmp1 =_mm_sub_epi8(stmp1, stmpz); // bytes of lc becomes uc, other bytes are now zero
stmp1 = _mm_blendv_epi8(frag1, stmp1, rmsk); //bytes of lc replaced by uc, other bytes unchanged
rmsk = _mm_cmpistrm(frag1, frag1, 0x44);// byte mask, valid bytes are 1, invalid bytes are zero
_mm_maskmoveu_si128(stmp1, rmsk, (char *) p1);//
return (char *) s1;
}
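The blend/subtract sequence in Example 10-12 implements, 16 bytes at a time, the scalar transform sketched below (an illustration only, not one of the manual's listings):

/* bytes in 'a'..'z' (0x61-0x7a) have 0x20 subtracted; all other bytes pass through */
static inline char to_upper_ascii(char c)
{
    return (c >= 0x61 && c <= 0x7a) ? (char)(c - 0x20) : c;
}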

10.3.5   Unicode Processing and PCMPxSTRy

Unicode representation of string/text data is required for software localization. UTF-16 is a common encoding scheme for localized content. In UTF-16 representation, each character is represented by a code point. There are two classes of code points: 16-bit code points and 32-bit code points, the latter consisting of a pair of 16-bit code points in specified value ranges and also referred to as a surrogate pair.
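A scalar sketch of the classification implied by these value ranges (the same mapping Example 10-13 obtains from its utf16map lookup table, which additionally maps the null code to 3) is:

/* 2 = 1st half of a surrogate pair (0xD800-0xDBFF), 1 = 2nd half (0xDC00-0xDFFF),
   0 = regular 16-bit code point */
static inline int utf16_classify(unsigned short w)
{
    if (w >= 0xD800 && w <= 0xDBFF) return 2; // high (first) surrogate
    if (w >= 0xDC00 && w <= 0xDFFF) return 1; // low (second) surrogate
    return 0;                                 // ordinary 16-bit code point
}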
A common technique in Unicode processing uses a table-lookup method, which has the benefit of reduced branching. As a tutorial example we compare the analogous problem of determining the properly-encoded UTF-16 string length using general-purpose code with table lookup vs. SSE4.2.
Example 10-13 lists the C code sequence to determine the number of properly-encoded UTF-16 code points (either 16-bit or 32-bit code points) in a Unicode text block. The code also verifies whether there are any improperly-encoded surrogate pairs in the text block.

Example 10-13. UTF16 VerStrlen() Using C and Table Lookup Technique
// This example demonstrates validation of surrogate pairs (32-bit code point) and
// tally the number of 16-bit and 32-bit code points in the text block
// Parameters: s1 is pointer to input utf-16 text block.
// pLen: store count of utf-16 code points
// return the number of 16-bit code points encoded in the surrogate range that do not form
// a properly encoded surrogate pair. If 0: s1 is a properly encoded utf-16 block;
// if the return value is > 0 then s1 contains invalid encodings of surrogates
int u16vstrlen_c(const short* s1, unsigned * pLen)
{int i, j, cnt = 0, cnt_invl = 0, spcnt= 0;
unsigned short cc, cc2;
char flg[3];
cc2 = cc = s1[0];
// map each word in s1 into bit patterns of 0, 1, or 2 using a table lookup
// the first half of a surrogate pair must be encoded between D800-DBFF and mapped as 2
// the 2nd half of a surrogate pair must be encoded between DC00-DFFF and mapped as 1
// regular 16-bit encodings are mapped to 0, except null code mapped to 3
flg[1] = utf16map[cc];
flg[0] = flg[1];
if(!flg[1]) cnt ++;
i = 1;
while (cc2 ) // examine each non-null word encoding
{ cc2 = s1[i];
flg[2] = utf16map[cc2];
if( (flg[1] && flg[2] && (flg[1]-flg[2] == 1) ) )
{ spcnt ++; }// found a surrogate pair
else if(flg[1] == 2 && flg[2] != 1)
{ cnt_invl += 1; } // orphaned 1st half
else if( !flg[1] && flg[2] == 1)
{ cnt_invl += 1; } // orphaned 2nd half
else
{ if(!flg[2]) cnt ++;// regular non-null 16-bit code point
else ;
}
flg[0] = flg[1];// save the pair sequence for next iteration
flg[1] = flg[2];
i++;
}
*pLen = cnt + spcnt;
return cnt_invl;
}

The VerStrlen() function for a UTF-16 encoded text block can be implemented using SSE4.2.
Example 10-14 shows the SSE4.2 assembly implementation and Example 10-15 shows the SSE4.2 intrinsic listing of VerStrlen().

Example 10-14. Assembly Listings of UTF16 VerStrlen() Using PCMPISTRI
// complementary range values for detecting either halves of 32-bit UTF-16 code point
static short ssch0[16]= {0x1, 0xd7ff, 0xe000, 0xffff, 0, 0};
// complementary range values for detecting the 1st half of 32-bit UTF-16 code point
static short ssch1[16]= {0x1, 0xd7ff, 0xdc00, 0xffff, 0, 0};
// complementary range values for detecting the 2nd half of 32-bit UTF-16 code point
static short ssch2[16]= {0x1, 0xdbff, 0xe000, 0xffff, 0, 0};
int utf16slen_sse4_2a(const short* s1, unsigned * pLen)
{int len = 0, res = 0;
_asm{
mov eax, s1
movdqu xmm2, ssch0 ; load range value to identify either halves
movdqu xmm3, ssch1 ; load range value to identify 1st half (0xd800 to 0xdbff)
movdqu xmm4, ssch2 ; load range value to identify 2nd half (0xdc00 to 0xdfff)
xor ecx, ecx
xor edx, edx; store # of 32-bit code points (surrogate pairs)
xor ebx, ebx; store # of non-null 16-bit code points
xor edi, edi ; store # of invalid word encodings
_loopc:
shl ecx, 1; pcmpistri with word processing return ecx in word granularity, multiply by 2 to get byte offset
add eax, ecx
movdqu xmm1, [eax] ; load a string fragment of up to 8 words
pcmpistri xmm2, xmm1, 15h; unsigned words, ranges, invert, lsb index returned to ecx
; if there is a utf-16 null wchar in xmm1, zf will be set.
; if all 8 words in the comparison matched range,
; none of bits in the intermediate result will be set after polarity inversions,
; and ECX will return with a value of 8
jz short _lstfrag ; if null code, handle last fragment
; if ecx < 8, ecx point to a word of either 1st or 2nd half of a 32-bit code point
cmp ecx, 8
jne _chksp
add ebx, ecx ; accumulate # of 16-bit non-null code points
mov ecx, 8 ; ecx must be 8 at this point, we want to avoid loop carry dependency
jmp _loopc
_chksp:; this fragment has word encodings in the surrogate value range
add ebx, ecx ; account for the 16-bit code points
shl ecx, 1; pcmpistri with word processing return ecx in word granularity, multiply by 2 to get byte offset
add eax, ecx
movdqu xmm1, [eax] ; ensure the fragment starts with word encoding in either half
pcmpistri xmm3, xmm1, 15h; unsigned words, ranges, invert, lsb index returned to ecx
jz short _lstfrag2 ; if null code, handle the last fragment
cmp ecx, 0 ; properly encoded 32-bit code point must start with 1st half
jg _invalidsp ; some invalid s-p code point exists in the fragment
pcmpistri xmm4, xmm1, 15h; unsigned words, ranges, invert, lsb index returned to ecx
cmp ecx, 1 ; the 2nd half must follow the first half
jne _invalidsp
add edx, 1; accumulate # of valid surrogate pairs
add ecx, 1 ; we want to advance two words
jmp _loopc
_invalidsp:; the first word of this fragment is either the 2nd half or an un-paired 1st half
add edi, 1 ; we have an invalid code point (not a surrogate pair)
mov ecx, 1 ; advance one word and continue scan for 32-bit code points
jmp _loopc
_lstfrag:
add ebx, ecx ; account for the non-null 16-bit code points
_morept:
shl ecx, 1; pcmpistri with word processing return ecx in word granularity, multiply by 2 to get byte offset
add eax, ecx
mov si, [eax] ; need to check for null code
cmp si, 0
je _final
movdqu xmm1, [eax] ; load remaining word elements which start with either 1st/2nd half
pcmpistri xmm3, xmm1, 15h; unsigned words, ranges, invert, lsb index returned to ecx
_lstfrag2:
cmp ecx, 0 ; a valid 32-bit code point must start from 1st half
jne _invalidsp2
pcmpistri xmm4, xmm1, 15h; unsigned words, ranges, invert, lsb index returned to ecx
cmp ecx, 1
jne _invalidsp2
(continue)

10-24

SSE4.2 AND SIMD PROGRAMMING FOR TEXT- PROCESSING/LEXING/PARSING

Example 10-14. Assembly Listings of UTF16 VerStrlen() Using PCMPISTRI (Contd.)
add edx, 1
mov ecx, 2
jmp _morept
_invalidsp2:
add edi, 1
mov ecx, 1
jmp _morept
_final:
add edx, ebx; add # of 16-bit and 32-bit code points
mov ecx, pLen; retrieve address of pointer provided by caller
mov [ecx], edx; store result of string length to memory
mov res, edi
}
return res;
}

Example 10-15. Intrinsic Listings of UTF16 VerStrlen() Using PCMPISTRI
int utf16slen_i(const short* s1, unsigned * pLen)
{int len = 0, res = 0;
__m128i *pF = (__m128i *) s1;
__m128i u32 = _mm_loadu_si128((__m128i *)ssch0);
__m128i u32a = _mm_loadu_si128((__m128i *)ssch1);
__m128i u32b = _mm_loadu_si128((__m128i *)ssch2);
__m128i frag1;
int offset1 = 0, cmp, cmp_1, cmp_2;
int cnt_16 = 0, cnt_sp=0, cnt_invl= 0;
short *ps;
while (1) {
pF = (__m128i *)(((short *)pF) + offset1);
frag1 = _mm_loadu_si128(pF);// load up to 8 words
// does frag1 contain either halves of a 32-bit UTF-16 code point?
cmp = _mm_cmpistri(u32, frag1, 0x15);// unsigned words, ranges, invert, lsb index returned
if (_mm_cmpistrz(u32, frag1, 0x15))// there is a null code in frag1
{ cnt_16 += cmp;
ps = (((short *)pF) + cmp);
while (ps[0])
{ frag1 = _mm_loadu_si128( (__m128i *)ps);
cmp_1 = _mm_cmpistri(u32a, frag1, 0x15);
if(!cmp_1)
{ cmp_2 = _mm_cmpistri(u32b, frag1, 0x15);
if( cmp_2 ==1) { cnt_sp++; offset1 = 2;}
else {cnt_invl++; offset1= 1;}
}
else
{ cmp_2 = _mm_cmpistri(u32b, frag1, 0x15);
if(!cmp_2) {cnt_invl ++; offset1 = 1;}
else {cnt_16 ++; offset1 = 1; }
}
ps = (((short *)ps) + offset1);

};
break;
}
if(cmp != 8) // we have at least some half of 32-bit utf-16 code points
{ cnt_16 += cmp; // regular 16-bit UTF16 code points
pF = (__m128i *)(((short *)pF) + cmp);
frag1 = _mm_loadu_si128(pF);
cmp_1 = _mm_cmpistri(u32a, frag1, 0x15);
if(!cmp_1)
{ cmp_2 = _mm_cmpistri(u32b, frag1, 0x15);
if( cmp_2 ==1) { cnt_sp++; offset1 = 2;}
else {cnt_invl++; offset1= 1;}
}
else
{ cnt_invl ++;
offset1 = 1;
}
}
else {
offset1 = 8; // increment address by 16 bytes to handle next fragment
cnt_16+= 8;
}
}
*pLen = cnt_16 + cnt_sp;
return cnt_invl;
}
10.3.6   Replacement String Library Function Using SSE4.2

Unaligned 128-bit SIMD memory accesses can fetch data across a page boundary, since system software manages memory access rights with page granularity.
Implementing a replacement string library function using SIMD instructions must not cause memory access violations. This requirement can be met by adding a small amount of code to check the memory address of each string fragment. If a memory address is found to be within 16 bytes of crossing over to the next page boundary, the string processing algorithm can fall back to a byte-granular technique.
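The boundary check itself can be sketched in C as follows. This is a minimal illustration of the test used in Example 10-16; the helper name is not part of the manual's listings.

#include <stddef.h>
/* Return nonzero if a 16-byte load from p would cross into the next 4-KByte page. */
static inline int near_page_boundary(const void *p)
{
    /* the low 12 bits are the offset within a 4-KByte page; offsets above
       0xFF0 leave fewer than 16 bytes before the next page */
    return (((size_t) p) & 0x0fff) > 0x0ff0;
}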
Example 10-16 shows an SSE4.2 implementation of strcmp() that can replace the byte-granular implementation supplied by standard tools.

Example 10-16. Replacement String Library Strcmp Using SSE4.2
// return 0 if strings are equal, 1 if greater, -1 if less
int strcmp_sse4_2(const char *src1, const char *src2)
{
int val;
__asm{
mov esi, src1
mov edi, src2
mov edx, -16 ; common index relative to base of either string pointer
xor eax, eax
topofloop:
add edx, 16 ; prevent loop carry dependency
next:
lea ecx, [esi+edx] ; address of fragment that we want to load
and ecx, 0x0fff ; check least significant 12 bits of addr for page boundary
cmp ecx, 0x0ff0
jg too_close_pgb ; branch to byte-granular if within 16 bytes of boundary
lea ecx, [edi+edx] ; do the same check for each fragment of 2nd string
and ecx, 0x0fff
cmp ecx, 0x0ff0
jg too_close_pgb
movdqu xmm2, BYTE PTR[esi+edx]
movdqu xmm1, BYTE PTR[edi+edx]
pcmpistri xmm2, xmm1, 0x18 ; equal each
ja topofloop
jnc ret_tag
add edx, ecx ; ecx points to the byte offset that differs
not_equal:
movzx eax, BYTE PTR[esi+edx]
movzx edx, BYTE PTR[edi+edx]
cmp eax, edx
cmova eax, ONE
cmovb eax, NEG_ONE
jmp ret_tag
too_close_pgb:
add edx, 1 ; do byte granular compare
movzx ecx, BYTE PTR[esi+edx-1]
movzx ebx, BYTE PTR[edi+edx-1]
cmp ecx, ebx
jne inequality
add ebx, ecx
jnz next
jmp ret_tag
inequality:
cmovb eax, NEG_ONE
cmova eax, ONE
ret_tag:
mov [val], eax
}
return(val);
}

In Example 10-16, 8 instructions were added following the label “next” to perform 4-KByte boundary checking of the addresses that will be used to load the two string fragments into registers. If either address is found to be within 16 bytes of crossing over to the next page, the code branches to the byte-granular comparison path following the label “too_close_pgb”.
The return values of Example 10-16 use the convention of returning 0, +1, -1 using CMOV. It is straightforward to modify a few instructions to implement the convention of returning 0, a positive integer, or a negative integer.
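As a usage illustration (a hypothetical driver, not part of the manual's listings), the replacement is called like the standard strcmp(), with the result normalized to -1, 0, or +1:

#include <stdio.h>
int strcmp_sse4_2(const char *src1, const char *src2); // from Example 10-16
int main(void)
{
    printf("%d\n", strcmp_sse4_2("lexer", "lexer")); // prints 0: strings are equal
    printf("%d\n", strcmp_sse4_2("parse", "lexer")); // prints 1: first string is greater
    printf("%d\n", strcmp_sse4_2("lex", "lexer"));   // prints -1: first string is less
    return 0;
}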

10.4   SSE4.2 ENABLED NUMERICAL AND LEXICAL COMPUTATION

SSE4.2 can enable SIMD programming techniques to tackle byte-granular computational problems that were considered unlikely candidates for SIMD instructions. We consider the common library function atol(), in its full 64-bit flavor, converting a sequence of alphanumeric characters within the range representable by the data type __int64.
Several attributes of this string-to-integer problem pose difficult challenges for prior SIMD instruction sets (before the introduction of SSE4.2) to accelerate the numerical computation aspect of string-to-integer conversions:
•  Character subset validation: Each character in the input stream must be validated with respect to the character subset definitions and conform to the data representation rules for white space, signs, and numerical digits. SSE4.2 provides the perfect tools for character subset validation.
•  State-dependent nature of character validation: While SIMD computation instructions can expedite the arithmetic operations of “multiply by 10 and add”, the arithmetic computation requires the input byte stream to consist of numerical digits only. Numerical digits, white space, and the presence/absence of a sign must be validated in mid-stream. The flexibility of the SSE4.2 primitives can handle this state-dependent validation well.
•  Additionally, the exit condition that wraps up the arithmetic computation can occur in mid-stream, due either to invalid characters or to the finite representable range of the data type (~10^19 for int64, no more than 10 non-zero-leading digits for int32). This may lead one to believe that this type of data stream, consisting of short bursts, is not suited for SIMD ISA and to be content with byte-granular solutions.
Because of the character subset validation and its state-dependent nature, byte-granular solutions of the standard library function tend to have a high start-up cost (for example, converting a single numerical digit to integer may take 50 or 60 cycles) and low throughput (each additional numeric digit in the input character stream may take 6-8 cycles per byte).
A high-level pseudo-operation flow of a library replacement for atol() is described in Example 10-17.

Example 10-17. High-level Flow of Character Subset Validation for String Conversion
1. Check early-out exit conditions (e.g., the first byte is not valid).
2. Check if the 1st byte is white space and skip any additional leading white space.
3. Check for the presence of a sign byte.
4. Check the validity of the remaining byte stream, i.e., that the bytes are numeric digits.
5. If the byte stream starts with ‘0’, skip all leading digits that are ‘0’.
6. Determine how many valid non-zero-leading numeric digits there are.
7. Convert up to 16 non-zero-leading digits to an int64 value.
8. Load up to the next 16 bytes safely and check for consecutive numeric digits.
9. Normalize the int64 value converted from the first 16 digits, according to the number of remaining digits.
10. Check for out-of-bound results of the normalized intermediate int64 value.
11. Convert the remaining digits to an int64 value and add to the normalized intermediate result.
12. Check for out-of-bound final results.

Example 10-18 shows the code listing of an equivalent functionality of atol() capable of producing the full int64 output range. Auxiliary functions and data constants are listed in Example 10-19.

Example 10-18. Intrinsic Listings of atol() Replacement Using PCMPISTRI
__int64 sse4i_atol(const char* s1)
{char *p = ( char *) s1;
int NegSgn = 0;
__m128i mask0;
__m128i value0, value1;
__m128i w1, w1_l8, w1_u8, w2, w3 = _mm_setzero_si128();
__int64 xxi;
int index, cflag, sflag, zflag, oob=0;
// check the first character is valid via lookup
if ( (BtMLValDecInt[ *p >> 3] & (1 << ((*p) & 7)) ) == 0) return 0;
// if the first character is white space, skip remaining white spaces
if (BtMLws[*p >>3] & (1 <<((*p) & 7)) )
{ p ++;
value0 = _mm_loadu_si128 ((__m128i *) listws);
skip_more_ws:
mask0 = __m128i_strloadu_page_boundary (p);
/* look for the 1st non-white space character */
index = _mm_cmpistri (value0, mask0, 0x10);
cflag = _mm_cmpistrc (value0, mask0, 0x10);
sflag = _mm_cmpistrs (value0, mask0, 0x10);
if( !sflag && !cflag)
{ p = ( char *) ((size_t) p + 16);
goto skip_more_ws;
}
else
p = ( char *) ((size_t) p + index);
}
if( *p == '-')
{ p++;
NegSgn = 1;
}
else if( *p == '+') p++;
/* load up to 16 byte safely and check how many valid numeric digits we can do SIMD */
value0 = _mm_loadu_si128 ((__m128i *) rangenumint);
mask0 = __m128i_strloadu_page_boundary (p);
index = _mm_cmpistri (value0, mask0, 0x14);
zflag = _mm_cmpistrz (value0, mask0, 0x14);
/* index points to the first digit that is not a valid numeric digit */
if( !index) return 0;
else if (index == 16)
{ if( *p == '0') /* if all 16 bytes are numeric digits */
{ /* skip leading zero */
value1 = _mm_loadu_si128 ((__m128i *) rangenumintzr);
index = _mm_cmpistri (value1, mask0, 0x14);
zflag = _mm_cmpistrz (value1, mask0, 0x14);
while(index == 16 && !zflag )
{ p = ( char *) ((size_t) p + 16);
mask0 = __m128i_strloadu_page_boundary (p);
index = _mm_cmpistri (value1, mask0, 0x14);
zflag = _mm_cmpistrz (value1, mask0, 0x14);
}
/* now the 1st digit is non-zero, load up to 16 bytes and update index */
if( index < 16)
p = ( char *) ((size_t) p + index);
/* load up to 16 bytes of non-zero leading numeric digits */
mask0 = __m128i_strloadu_page_boundary (p);
/* update index to point to non-numeric character or indicate we may have more than 16 bytes */
index = _mm_cmpistri (value0, mask0, 0x14);
}
}
if( index == 0) return 0;
else if( index == 1) return (NegSgn? (long long) -(p[0]-48): (long long) (p[0]-48));
// Input digits in xmm are ordered in reverse order. the LS digit of output is next to eos
// least sig numeric digit aligned to byte 15 , and subtract 0x30 from each ascii code
mask0 = ShfLAlnLSByte( mask0, 16 -index);
w1_u8 = _mm_slli_si128 ( mask0, 1);
w1 = _mm_add_epi8( mask0, _mm_slli_epi16 (w1_u8, 3)); /* mul by 8 and add */
w1 = _mm_add_epi8( w1, _mm_slli_epi16 (w1_u8, 1)); /* 7 LS bits per byte, in bytes 0, 2, 4, 6, 8, 10, 12, 14*/
w1 = _mm_srli_epi16( w1, 8); /* clear out upper bits of each wd*/
w2 = _mm_madd_epi16(w1, _mm_loadu_si128( (__m128i *) &MulplyPairBaseP2[0]) ); /* multiply base^2, add adjacent
word,*/
w1_u8 = _mm_packus_epi32 ( w2, w2); /* pack 4 low word of each dword into 63:0 */
w1 = _mm_madd_epi16(w1_u8, _mm_loadu_si128( (__m128i *) &MulplyPairBaseP4[0]) ); /* multiply base^4, add
adjacent word,*/
w1 = _mm_cvtepu32_epi64( w1); /* converted dw was in 63:0, expand to qw */
w1_l8 = _mm_mul_epu32(w1, _mm_setr_epi32( 100000000, 0, 0, 0) );
w2 = _mm_add_epi64(w1_l8, _mm_srli_si128 (w1, 8) );
if( index < 16)
{ xxi = _mm_extract_epi64(w2, 0);
return (NegSgn? (long long) -xxi: (long long) xxi);
}
/* 64-bit integer allow up to 20 non-zero-leading digits. */
/* accumulate each 16-digit fragment*/
w3 = _mm_add_epi64(w3, w2);
/* handle next batch of up to 16 digits, 64-bit integer only allow 4 more digits */
p = ( char *) ((size_t) p + 16);
if( *p == 0)
{ xxi = _mm_extract_epi64(w2, 0);
return (NegSgn? (long long) -xxi: (long long) xxi);
}
mask0 = __m128i_strloadu_page_boundary (p);
/* index points to first non-numeric digit */
index = _mm_cmpistri (value0, mask0, 0x14);
zflag = _mm_cmpistrz (value0, mask0, 0x14);
if( index == 0) /* the first char is not valid numeric digit */
{ xxi = _mm_extract_epi64(w2, 0);
return (NegSgn? (long long) -xxi: (long long) xxi);
}
if ( index > 3) return (NegSgn? (long long) RINT64VALNEG: (long long) RINT64VALPOS);
/* multiply low qword by base^index */
w1 = _mm_mul_epu32( _mm_shuffle_epi32( w2, 0x50), _mm_setr_epi32( MulplyByBaseExpN
[index - 1] , 0, MulplyByBaseExpN[index-1], 0));
w3 = _mm_add_epi64(w1, _mm_slli_epi64 (_mm_srli_si128(w1, 8), 32 ) );
mask0 = ShfLAlnLSByte( mask0, 16 -index);
// convert upper 8 bytes of xmm: only least sig. 4 digits of output will be added to prev 16 digits
w1_u8 = _mm_cvtepi8_epi16(_mm_srli_si128 ( mask0, 8));
/* merge 2 digit at a time with multiplier into each dword*/
w1_u8 = _mm_madd_epi16(w1_u8, _mm_loadu_si128( (__m128i *) &MulplyQuadBaseExp3To0 [ 0]));
/* bits 63:0 has two dword integer, bits 63:32 is the LS dword of output; bits 127:64 is not needed*/
w1_u8 = _mm_cvtepu32_epi64( _mm_hadd_epi32(w1_u8, w1_u8) );
w3 = _mm_add_epi64(w3, _mm_srli_si128( w1_u8, 8) );
xxi = _mm_extract_epi64(w3, 0);
if( xxi >> 63 )
return (NegSgn? (long long) RINT64VALNEG: (long long) RINT64VALPOS);
else return (NegSgn? (long long) -xxi: (long long) xxi);
}
The general performance characteristics of an SSE4.2-enhanced atol() replacement include a start-up cost that is somewhat lower than that of byte-granular implementations generated from C code.
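As a usage illustration (a hypothetical driver, not part of the manual's listings), the replacement accepts the same inputs as atol(), including leading white space and an optional sign:

#include <stdio.h>
__int64 sse4i_atol(const char* s1); // from Example 10-18
int main(void)
{
    printf("%lld\n", sse4i_atol("  12345"));             // leading white space skipped
    printf("%lld\n", sse4i_atol("-0000098765"));         // sign and leading zeros handled
    printf("%lld\n", sse4i_atol("9223372036854775807")); // near the int64 upper bound
    return 0;
}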

Example 10-19. Auxiliary Routines and Data Constants Used in sse4i_atol() listing
// bit lookup table of valid ascii code for decimal string conversion, white space, sign, numeric digits
static char BtMLValDecInt[32] = {0x0, 0x3e, 0x0, 0x0, 0x1, 0x28, 0xff, 0x03,
0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0};
// bit lookup table, white space only
static char BtMLws[32] = {0x0, 0x3e, 0x0, 0x0, 0x1, 0x0, 0x0, 0x0,
0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0};
// list of white space for sttni use
static char listws[16] =
{0x20, 0x9, 0xa, 0xb, 0xc, 0xd, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0};
// list of numeric digits for sttni use
static char rangenumint[16] =
{0x30, 0x39, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0};
static char rangenumintzr[16] =
{0x30, 0x30, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0};
// we use pmaddwd to merge two adjacent short integer pair, this is the second step of merging each pair of 2-digit
integers
static short MulplyPairBaseP2[8] =
{ 100, 1, 100, 1, 100, 1, 100, 1};
// Multiplier-pair for two adjacent short integer pair, this is the third step of merging each pair of 4-digit integers
static short MulplyPairBaseP4[8] =
{ 10000, 1, 10000, 1, 10000, 1, 10000, 1 };
// multiplier for pmulld for normalization of > 16 digits
static int MulplyByBaseExpN[8] =
{ 10, 100, 1000, 10000, 100000, 1000000, 10000000, 100000000};
static short MulplyQuadBaseExp3To0[8] =
{ 1000, 100, 10, 1, 1000, 100, 10, 1};
__m128i __m128i_shift_right (__m128i value, int offset)
{ switch (offset)
{
case 1: value = _mm_srli_si128 (value, 1); break;
case 2: value = _mm_srli_si128 (value, 2); break;
case 3: value = _mm_srli_si128 (value, 3); break;
case 4: value = _mm_srli_si128 (value, 4); break;
case 5: value = _mm_srli_si128 (value, 5); break;
case 6: value = _mm_srli_si128 (value, 6); break;
case 7: value = _mm_srli_si128 (value, 7); break;
case 8: value = _mm_srli_si128 (value, 8); break;
case 9: value = _mm_srli_si128 (value, 9); break;
case 10: value = _mm_srli_si128 (value, 10); break;
case 11: value = _mm_srli_si128 (value, 11); break;
case 12: value = _mm_srli_si128 (value, 12); break;
case 13: value = _mm_srli_si128 (value, 13); break;
case 14: value = _mm_srli_si128 (value, 14); break;
case 15: value = _mm_srli_si128 (value, 15); break;
}
return value;
}
/* Load string at S near page boundary safely. */
__m128i __m128i_strloadu_page_boundary (const char *s)
{
int offset = ((size_t) s & (16 - 1));
if (offset)
{
__m128i v = _mm_load_si128 ((__m128i *) (s - offset));
__m128i zero = _mm_setzero_si128 ();
int bmsk = _mm_movemask_epi8 (_mm_cmpeq_epi8 (v, zero));
if ( (bmsk >> offset) != 0 ) return __m128i_shift_right (v, offset);
}
return _mm_loadu_si128 ((__m128i *) s);
}
__m128i ShfLAlnLSByte( __m128i value, int offset)
{
/*now remove constant bias, so each byte element are unsigned byte int */
value = _mm_sub_epi8(value, _mm_setr_epi32(0x30303030, 0x30303030, 0x30303030, 0x30303030));
switch (offset)
{
case 1:
value = _mm_slli_si128 (value, 1); break;
case 2:
value = _mm_slli_si128 (value, 2); break;
case 3:
value = _mm_slli_si128 (value, 3); break;
case 4:
value = _mm_slli_si128 (value, 4); break;
case 5:
value = _mm_slli_si128 (value, 5); break;
case 6:
value = _mm_slli_si128 (value, 6); break;
case 7:
value = _mm_slli_si128 (value, 7); break;
case 8:
value = _mm_slli_si128 (value, 8); break;
case 9:
value = _mm_slli_si128 (value, 9); break;
case 10:
value = _mm_slli_si128 (value, 10); break;
case 11:
value = _mm_slli_si128 (value, 11); break;
case 12:
value = _mm_slli_si128 (value, 12); break;
case 13:
value = _mm_slli_si128 (value, 13); break;
case 14:
value = _mm_slli_si128 (value, 14); break;
case 15:
value = _mm_slli_si128 (value, 15); break;
}
return value;
}

With an input byte stream of no more than 16 non-zero-leading digits, the replacement has constant performance. An input string consisting of more than 16 bytes of non-zero-leading digits can be processed in about 100 cycles or less, compared with a byte-granular solution needing around 200 cycles. Even for shorter input strings of 9 non-zero-leading digits, the SSE4.2-enhanced replacement can achieve ~2X the performance of byte-granular solutions.

10.5   NUMERICAL DATA CONVERSION TO ASCII FORMAT

Conversion of binary integer data to ASCII format is used in many situations, from simple C library functions to financial computations. Some C libraries provide exported conversion functions like itoa and ltoa; other libraries implement internal equivalents to support the data formatting needs of standard output functions. The most common binary-integer-to-ASCII conversion is based on radix 10. Example 10-20 shows the basic technique implemented in many libraries for base-10 conversion of a 64-bit integer to ASCII. For simplicity, the example produces lower-case output format.

Example 10-20. Conversion of 64-bit Integer to ASCII
// Convert 64-bit signed binary integer to lower-case ASCII format
static char lc_digits[]= "0123456789abcdefghijklmnopqrstuvwxyz";
int lltoa_cref( __int64 x, char* out)
{const char *digits = &lc_digits[0];
char lbuf[32]; // base 10 conversion of 64-bit signed integer needs only 21 digits
char * p_bkwd = &lbuf[2];
__int64 y ;
unsigned int base = 10, len = 0, r, cnt;
if( x < 0)
{ y = -x;
while (y > 0)
{ r = (int) (y % base); // one digit at a time from least significant digit
y = y /base;
* --p_bkwd = digits[r];
len ++;
}
*out++ = '-';
cnt = len +1;
while( len--) *out++ = *p_bkwd++; // copy each converted digit
} else
{ y = x;
while (y > 0)
{ r = (int) (y % base); // one digit at a time from least significant digit
y = y /base;
* --p_bkwd = digits[r];
len ++;
}
cnt = len;
while( len--) *out++ = *p_bkwd++; // copy each converted digit
}
*out = 0;
return (int) cnt;
}

Example 10-20 employs an iterative sequence that processes one digit at a time using the hardware's native integer-divide instruction. The reliance on integer divide can be replaced by the fixed-point multiply technique discussed in Chapter 9. This is shown in Example 10-21.

Example 10-21. Conversion of 64-bit Integer to ASCII without Integer Division
// Convert 64-bit signed binary integer to lower-case ASCII format and
// replace integer division with fixed-point multiply
;__int64 umul_64x64(__int64* p128, __int64 u, __int64 v)
umul_64x64 PROC
mov rax, rdx ; 2nd parameter
mul r8 ; u * v
mov qword ptr [rcx], rax
mov qword ptr [rcx+8], rdx
ret 0
umul_64x64 ENDP
#define cg_10_pms3 0xcccccccccccccccdull
static char lc_digits[]= "0123456789";
int lltoa_cref( __int64 x, char* out)
{const char *digits = &lc_digits[0];
char lbuf[32]; // base 10 conversion of 64-bit signed integer needs only 21 digits
char * p_bkwd = &lbuf[2];
__int64 y, z128[2];
unsigned __int64 q;
unsigned int base = 10, len = 0, r, cnt;
if( x < 0)
{ y = -x;
while (y > 0)
{ umul_64x64( &z128[0], y, cg_10_pms3);
q = z128[1] >> 3;
q = (y < q * (unsigned __int64) base)? q-1: q;
r = (int) (y - q * (unsigned __int64) base); // one digit at a time from least significant digit
y =q;
* --p_bkwd = digits[r];
len ++;
}
*out++ = '-';
cnt = len +1;
while( len--) *out++ = *p_bkwd++; // copy each converted digit
} else
{

y = x;
while (y > 0)
{ umul_64x64( &z128[0], y, cg_10_pms3);
q = z128[1] >> 3;
q = (y < q * (unsigned __int64) base)? q-1: q;
r = (int) (y - q * (unsigned __int64) base); // one digit at a time from least significant digit
y =q;
* --p_bkwd = digits[r];
len ++;
}
cnt = len;
while( len--) *out++ = *p_bkwd++; // copy each converted digit

}
*out = 0;
return cnt;
}

Example 10-21 provides a significant speed improvement by eliminating the reliance on integer division. However, the numeric format conversion problem is still constrained by the dependency chain that processes one digit at a time.
SIMD techniques can be applied to this class of integer numeric conversion problem by noting that an unsigned 64-bit integer can span a dynamic range of up to 20 digits. Such a wide dynamic range can be expressed as a polynomial expression of the form:
a0 + a1*10^4 + a2*10^8 + a3*10^12 + a4*10^16, where the dynamic range of each ai is within [0, 9999].
Reduction of an unsigned 64-bit integer into up to 5 reduced-range coefficients can be computed using fixed-point multiply in stages. Once the dynamic range of the coefficients is reduced to no more than 4 digits, one can apply SIMD techniques to compute the ASCII conversion in parallel.
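A minimal scalar sketch of this range reduction follows (illustrative only; the production code in Example 10-22 performs the equivalent steps with fixed-point multiplies rather than hardware division):

#include <stdint.h>
/* split x into coefficients a[0..4], each in [0, 9999], such that
   x = a[0] + a[1]*10^4 + a[2]*10^8 + a[3]*10^12 + a[4]*10^16 */
static void reduce_base10k(uint64_t x, uint32_t a[5])
{
    for (int i = 0; i < 5; i++) {
        a[i] = (uint32_t)(x % 10000); // coefficient of 10^(4*i)
        x /= 10000;
    }
}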
The SIMD technique to convert an unsigned 16-bit integer via radix 10 with input dynamic range [0, 9999] is shown in Figure 10-4. This technique can also be generalized to other non-power-of-2 radices less than 16.

Figure 10-4. Compute Four Remainders of Unsigned Short Integer in Parallel (diagram: an XMM register holding U, U/10, U/100, and U/1000 in its dword elements is transformed into the four radix-10 digits r0..r3 of U (range < 10^4) = r0 + r1*10 + r2*100 + r3*1000; the scheme generalizes to input dynamic ranges beyond 10^4, e.g. U32 (range < 10^8) = r0 + r1*10 + r2*100 + r3*1000 + r4*10^4 + r5*10^5 + r6*10^6 + r7*10^7)
To handle greater input dynamic ranges, the input is reduced into multiple unsigned short integers and converted sequentially. The most significant U16 conversion is computed first, followed by the conversion of the next four significant digits.
Example 10-22 shows the fixed-point multiply combined with parallel remainder computation using SSE4 instructions for 64-bit integer conversion of up to 19 digits.

Example 10-22. Conversion of 64-bit Integer to ASCII Using SSE4
#include <smmintrin.h>
#include <stdint.h>
#define QWCG10to8       0xabcc77118461cefdull
#define QWCONST10to8    100000000ull

/* macro to convert input parameter of short integer "hi4" into output variable "x3" which is __m128i;
the input value "hi4" is assumed to be less than 10^4;
the output is 4 single-digit integer between 0-9, located in the low byte of each dword,
most significant digit in lowest DW.
implicit overwrites: locally allocated __m128i variable "x0", "x2"
*/
#define __ParMod10to4SSSE3( x3, hi4 ) \
{ \
x0 = _mm_shuffle_epi32( _mm_cvtsi32_si128( (hi4)), 0); \
x2 = _mm_mulhi_epu16(x0, _mm_loadu_si128( (__m128i *) quoTenThsn_mulplr_d));\
x2 = _mm_srli_epi32( _mm_madd_epi16( x2, _mm_loadu_si128( (__m128i *) quo4digComp_mulplr_d)), 10); \
(x3) = _mm_insert_epi16(_mm_slli_si128(x2, 6), (int) (hi4), 1); \
(x3) = _mm_or_si128(x2, (x3));\
(x3) = _mm_madd_epi16((x3), _mm_loadu_si128( (__m128i *) mten_mulplr_d) ) ;\
}
/* macro to convert input parameter of the 3rd dword element of "t5" ( __m128i type)
into output variable "x3" which is __m128i;
the third dword element of "t5" is assumed to be less than 10^4, the 4th dword must be 0;
the output is 4 single-digit integer between 0-9, located in the low byte of each dword,
MS digit in LS DW.
implicit overwrites: locally allocated __m128i variable "x0", "x2"
*/
#define __ParMod10to4SSSE3v( x3, t5 ) \
{ \
x0 = _mm_shuffle_epi32( t5, 0xaa); \
x2 = _mm_mulhi_epu16(x0, _mm_loadu_si128( (__m128i *) quoTenThsn_mulplr_d));\
x2 = _mm_srli_epi32( _mm_madd_epi16( x2, _mm_loadu_si128( (__m128i *) quo4digComp_mulplr_d)), 10); \
(x3) = _mm_or_si128(_mm_slli_si128(x2, 6), _mm_srli_si128(t5, 6)); \
(x3) = _mm_or_si128(x2, (x3));\
(x3) = _mm_madd_epi16((x3), _mm_loadu_si128( (__m128i *) mten_mulplr_d) ) ;\
}
static __attribute__ ((aligned(16)) ) short quo4digComp_mulplr_d[8] =
{ 1024, 0, 64, 0, 8, 0, 0, 0};
static __attribute__ ((aligned(16)) ) short quoTenThsn_mulplr_d[8] =
{ 0x199a, 0, 0x28f6, 0, 0x20c5, 0, 0x1a37, 0};
static __attribute__ ((aligned(16)) ) short mten_mulplr_d[8] =
{ -10, 1, -10, 1, -10, 1, -10, 1};
static __attribute__ ((aligned(16)) ) unsigned short bcstpklodw[8] =
{0x080c, 0x0004, 0x8080, 0x8080, 0x8080, 0x8080, 0x8080, 0x8080};
static __attribute__ ((aligned(16)) ) unsigned short bcstpkdw1[8] =
{0x8080, 0x8080, 0x080c, 0x0004, 0x8080, 0x8080, 0x8080, 0x8080};
static __attribute__ ((aligned(16)) ) unsigned short bcstpkdw2[8] =
{0x8080, 0x8080, 0x8080, 0x8080, 0x080c, 0x0004, 0x8080, 0x8080};
static __attribute__ ((aligned(16)) ) unsigned short bcstpkdw3[8] =
{0x8080, 0x8080, 0x8080, 0x8080, 0x8080, 0x8080, 0x080c, 0x0004};
static __attribute__ ((aligned(16)) ) int asc0bias[4] =
{0x30, 0x30, 0x30, 0x30};
static __attribute__ ((aligned(16)) ) int asc0reversebias[4] =
{0xd0d0d0d0, 0xd0d0d0d0, 0xd0d0d0d0, 0xd0d0d0d0};
static __attribute__ ((aligned(16)) ) int pr_cg_10to4[4] =
{ 0x68db8db, 0 , 0x68db8db, 0};
static __attribute__ ((aligned(16)) ) int pr_1_m10to4[4] =
{ -10000, 0 , 1, 0};
/*input value "xx" is less than 2^63-1 */
/* In environment that does not support binary integer arithmetic on __int128_t,
this helper can be done as asm routine
*/
__inline __int64_t u64mod10to8( __int64_t * pLo, __int64_t xx)
{__int128_t t, b = (__int128_t)QWCG10to8;
__int64_t q;
t = b * (__int128_t)xx;
q = t>>(64 +26); // shift count associated with QWCG10to8
*pLo = xx - QWCONST10to8 * q;
return q;
}
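The magic-constant division above can be sanity-checked against plain hardware division (a hypothetical test harness, not part of the manual's listings):

/* check u64mod10to8 against plain division for one value */
void check_u64mod10to8(void)
{
    __int64_t lo, q;
    q = u64mod10to8(&lo, 9223372036854775807ll);
    /* expect q == 92233720368 (the quotient by 10^8) and lo == 54775807 (the remainder) */
}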
/* convert integer between 2^63-1 and 0 to ASCII string */
int sse4i_q2a_u63 ( __int64_t xx, char *ps)
{int j, tmp, idx=0, cnt;
__int64_t lo8, hi8, abv16, temp;
__m128i x0, m0, x1, x2, x3, x4, x5, x6, m1;
long long w, u;
if ( xx < 10000 )
{ j = ubs_Lt10k_2s_i2 ( (unsigned ) xx, ps);
ps[j] = 0;
return j;
}
if (xx < 100000000 ) // dynamic range of xx is less than 32-bits
{ m0 = _mm_cvtsi32_si128( xx);
x1 = _mm_shuffle_epi32(m0, 0x44); // broadcast to dw0 and dw2
x3 = _mm_mul_epu32(x1, _mm_loadu_si128( (__m128i *) pr_cg_10to4 ));
x3 = _mm_mullo_epi32(_mm_srli_epi64(x3, 40), _mm_loadu_si128( (__m128i *)pr_1_m10to4));
m0 = _mm_add_epi32( _mm_srli_si128( x1, 8), x3); // quotient in dw2, remainder in dw0
__ParMod10to4SSSE3v( x3, m0); // pack single digit from each dword to dw0
x4 = _mm_shuffle_epi8(x3, _mm_loadu_si128( (__m128i *) bcstpklodw) );
__ParMod10to4SSSE3v( x3, _mm_slli_si128(m0, 8)); // move the remainder to dw2 first
x5 = _mm_shuffle_epi8(x3, _mm_loadu_si128( (__m128i *) bcstpkdw1) );
x4 = _mm_or_si128(x4, x5); // pack digits in bytes 0-7 with leading 0
cnt = 8;
}
else
{ hi8 = u64mod10to8(&lo8, xx);
if ( hi8 < 10000) // decompose lo8 dword into quotient and remainder mod 10^4
{ m0 = _mm_cvtsi32_si128( lo8);
x2 = _mm_shuffle_epi32(m0, 0x44);
x3 = _mm_mul_epu32(x2, _mm_loadu_si128( (__m128i *)pr_cg_10to4));
x3 = _mm_mullo_epi32(_mm_srli_epi64(x3, 40), _mm_loadu_si128( (__m128i *)pr_1_m10to4));
m0 = _mm_add_epi32( _mm_srli_si128( x2, 8), x3); // quotient in dw0
__ParMod10to4SSSE3( x3, hi8); // handle digits 11:8 first
x4 = _mm_shuffle_epi8(x3, _mm_loadu_si128( (__m128i *) bcstpklodw) );
__ParMod10to4SSSE3v( x3, m0); // handle digits 7:4
x5 = _mm_shuffle_epi8(x3, _mm_loadu_si128( (__m128i *) bcstpkdw1) );
x4 = _mm_or_si128(x4, x5);
__ParMod10to4SSSE3v( x3, _mm_slli_si128(m0, 8));
x5 = _mm_shuffle_epi8(x3, _mm_loadu_si128( (__m128i *) bcstpkdw2) );
x4 = _mm_or_si128(x4, x5); // pack single digits in bytes 0-11 with leading 0
cnt = 12;
}
else
{ cnt = 0;
if ( hi8 >= 100000000) // handle input greater than 10^16
{ abv16 = u64mod10to8(&temp, (__int64_t)hi8);
hi8 = temp;
__ParMod10to4SSSE3( x3, abv16);
x6 = _mm_shuffle_epi8(x3, _mm_loadu_si128( (__m128i *) bcstpklodw) );
cnt = 4;
} // start with handling digits 15:12
m0 = _mm_cvtsi32_si128( hi8);
x2 = _mm_shuffle_epi32(m0, 0x44);
x3 = _mm_mul_epu32(x2, _mm_loadu_si128( (__m128i *)pr_cg_10to4));
x3 = _mm_mullo_epi32(_mm_srli_epi64(x3, 40), _mm_loadu_si128( (__m128i *)pr_1_m10to4));
m0 = _mm_add_epi32( _mm_srli_si128( x2, 8), x3);
m1 = _mm_cvtsi32_si128( lo8);
x2 = _mm_shuffle_epi32(m1, 0x44);
x3 = _mm_mul_epu32(x2, _mm_loadu_si128( (__m128i *)pr_cg_10to4));
x3 = _mm_mullo_epi32(_mm_srli_epi64(x3, 40), _mm_loadu_si128( (__m128i *)pr_1_m10to4));
m1 = _mm_add_epi32( _mm_srli_si128( x2, 8), x3);
__ParMod10to4SSSE3v( x3, m0);
x4 = _mm_shuffle_epi8(x3, _mm_loadu_si128( (__m128i *) bcstpklodw) );
__ParMod10to4SSSE3v( x3, _mm_slli_si128(m0, 8));
x5 = _mm_shuffle_epi8(x3, _mm_loadu_si128( (__m128i *) bcstpkdw1) );
x4 = _mm_or_si128(x4, x5);
__ParMod10to4SSSE3v( x3, m1);
x5 = _mm_shuffle_epi8(x3, _mm_loadu_si128( (__m128i *) bcstpkdw2) );
x4 = _mm_or_si128(x4, x5);
__ParMod10to4SSSE3v( x3, _mm_slli_si128(m1, 8));
x5 = _mm_shuffle_epi8(x3, _mm_loadu_si128( (__m128i *) bcstpkdw3) );
x4 = _mm_or_si128(x4, x5);
cnt += 16;
}

}
m0 = _mm_loadu_si128( (__m128i *) asc0reversebias);
if( cnt > 16)
{ tmp = _mm_movemask_epi8( _mm_cmpgt_epi8(x6,_mm_setzero_si128()) );
x6 = _mm_sub_epi8(x6, m0);
} else {
tmp = _mm_movemask_epi8( _mm_cmpgt_epi8(x4,_mm_setzero_si128()) );
}
#ifndef __USE_GCC__
__asm__ ("bsfl %1, %%ecx; movl %%ecx, %0;" :"=r"(idx) :"r"(tmp) : "%ecx");
#else
_BitScanForward(&idx, tmp);
#endif
x4 = _mm_sub_epi8(x4, m0);
cnt -= idx;
w = _mm_cvtsi128_si64(x4);
switch(cnt)
{ case 5: *ps++ = (char) (w >>24); *(unsigned *) ps = (w >>32);
break;
case 6: *(short *)ps = (short) (w >>16); *(unsigned *) (&ps[2]) = (w >>32);
break;
case 7: *ps = (char) (w >>8); *(short *) (&ps[1]) = (short) (w >>16);
*(unsigned *) (&ps[3]) = (w >>32);
break;
case 8: *(long long *)ps = w;
break;
case 9: *ps++ = (char) (w >>24);
*(long long *) (&ps[0]) = _mm_cvtsi128_si64( _mm_srli_si128(x4, 4));
break;
case 10: *(short *)ps = (short) (w >>16);
*(long long *) (&ps[2]) = _mm_cvtsi128_si64( _mm_srli_si128(x4, 4));
break;
case 11: *ps = (char) (w >>8); *(short *) (&ps[1]) = (short) (w >>16);
*(long long *) (&ps[3]) = _mm_cvtsi128_si64( _mm_srli_si128(x4, 4));
break;
case 12: *(unsigned *)ps = w;
*(long long *) (&ps[4]) = _mm_cvtsi128_si64( _mm_srli_si128(x4, 4));
break;
case 13: *ps++ = (char) (w >>24); *(unsigned *) ps = (w >>32);
*(long long *) (&ps[4]) = _mm_cvtsi128_si64( _mm_srli_si128(x4, 8));
break;
case 14: *(short *)ps = (short) (w >>16); *(unsigned *) (&ps[2]) = (w >>32);
*(long long *) (&ps[6]) = _mm_cvtsi128_si64( _mm_srli_si128(x4, 8));
break;
case 15: *ps = (char) (w >>8);
*(short *) (&ps[1]) = (short) (w >>16); *(unsigned *) (&ps[3]) = (w >>32);
*(long long *) (&ps[7]) = _mm_cvtsi128_si64( _mm_srli_si128(x4, 8));
break;
case 16: _mm_storeu_si128( (__m128i *) ps, x4);
break;
case 17: u = _mm_cvtsi128_si64(x6); *ps++ = (char) (u >>24);
_mm_storeu_si128( (__m128i *) &ps[0], x4);
break;
case 18: u = _mm_cvtsi128_si64(x6); *(short *)ps = (short) (u >>16);
_mm_storeu_si128( (__m128i *) &ps[2], x4);
break;
case 19: u = _mm_cvtsi128_si64(x6); *ps = (char) (u >>8);
*(short *) (&ps[1]) = (short) (u >>16);
_mm_storeu_si128( (__m128i *) &ps[3], x4);
break;
case 20: u = _mm_cvtsi128_si64(x6); *(unsigned *)ps = (short) (u);
_mm_storeu_si128( (__m128i *) &ps[4], x4);
break;
}
return cnt;
}
/* convert input value into 4 single digits via parallel fixed-point arithmetic with each dword
element, and pack each digit into low dword element and write to buffer without leading
white space; input value must be < 10000 and > 9
*/
__inline int ubs_Lt10k_2s_i2(int x_Lt10k, char *ps)
{int tmp;
__m128i x0, m0, x2, x3, x4, compv;
// Use a set of scaling constant to compensate for lack for per-element shift count
compv = _mm_loadu_si128( (__m128i *) quo4digComp_mulplr_d);
// broadcast input value to each dword element
x0 = _mm_shuffle_epi32( _mm_cvtsi32_si128( x_Lt10k), 0);
// low to high dword in x0 : u16, u16, u16, u16
m0 = _mm_loadu_si128( (__m128i *) quoTenThsn_mulplr_d); // load 4 congruent consts
x2 = _mm_mulhi_epu16(x0, m0); // parallel fixed-point multiply for base 10,100, 1000, 10000
x2 = _mm_srli_epi32( _mm_madd_epi16( x2, compv), 10);
// dword content in x2: u16/10, u16/100, u16/1000, u16/10000
x3 = _mm_insert_epi16(_mm_slli_si128(x2, 6), (int) x_Lt10k, 1);
//word content in x3: 0, u16, 0, u16/10, 0, u16/100, 0, u16/1000
x4 = _mm_or_si128(x2, x3);
// perform parallel remainder operation with each word pair to derive 4 unbiased single-digit result
x4 = _mm_madd_epi16(x4, _mm_loadu_si128( (__m128i *) mten_mulplr_d) ) ;
x2 = _mm_add_epi32( x4, _mm_loadu_si128( (__m128i *) asc0bias) ) ;
// pack each ascii-biased digits from respective dword to the low dword element
x3 = _mm_shuffle_epi8(x2, _mm_loadu_si128( (__m128i *) bcstpklodw) );


// store ascii result to buffer without leading white space
if (x_Lt10k > 999 )
{ *(int *) ps = _mm_cvtsi128_si32( x3);
return 4;
}
else if (x_Lt10k > 99 )
{ tmp = _mm_cvtsi128_si32( x3);
*ps = (char ) (tmp >>8);
*((short *) (++ps)) = (short ) (tmp >>16);
return 3;
}
else if ( x_Lt10k > 9) // take advantage of reduced dynamic range > 9 to reduce branching
{ *((short *) ps) = (short ) _mm_extract_epi16( x3, 1);
return 2;
}
*ps = '0' + x_Lt10k;
return 1;
}
char lower_digits[] = "0123456789";
int ltoa_sse4 (const long long s1, char * buf)
{long long temp ;
int j = 1, len = 0;
const char *digits = &lower_digits[0];
if( s1 < 0) {
temp = -s1;
len ++;
beg[0] = '-';
if( temp < 10) beg[1] = digits[ (int) temp];
else len += sse4i_q2a_u63( temp, &buf[ 1]); // parallel conversion in 4-digit granular operation
}
else {
if( s1 < 10) beg[ 0 ] = digits[(int)s1];
else len += sse4i_q2a_u63( s1, &buf[ 1] );
}
buf[len] = 0;
return len;
}

When an ltoa()-like utility implementation executes native IDIV instruction to convert one digit at a time,
it can produce output at a speed of about 45-50 cycles per digit. Using fixed-point multiply to replace
IDIV (like Example 10-21) can reduce 10-15 cycles per digit. Using 128-bit SIMD technique to perform
parallel fixed-point arithmetic, the output speed can further improve to 4-5 cycles per digit with recent
Intel microarchitectures like Sandy Bridge and Nehalem.
The range-reduction technique demonstrated in Example 10-22 reduces up-to 19 levels of dependency
chain down to 5 hierarchy and allows parallel SIMD technique to perform 4-wide numeric conversion.
This technique can also be done with only SSSE3, and with similar speed improvement.
Support for conversion to wide character strings can be easily adapted using the code snippet shown in
Example 10-23.

Example 10-23. Conversion of 64-bit Integer to Wide Character String Using SSE4
static __attribute__ ((aligned(16)) ) int asc0bias[4] =
{0x30, 0x30, 0x30, 0x30};
// exponent_x must be < 10000 and > 9
__inline int ubs_Lt10k_2wcs_i2(int x_Lt10k, wchar_t *ps)
{
__m128i x0, m0, x2, x3, x4, compv;
compv = _mm_loadu_si128( (__m128i *) quo4digComp_mulplr_d);
x0 = _mm_shuffle_epi32( _mm_cvtsi32_si128( x_Lt10k), 0); // low to high dw: u16, u16, u16, u16
m0 = _mm_loadu_si128( (__m128i *) quoTenThsn_mulplr_d);
// u16, 0, u16, 0, u16, 0, u16, 0
x2 = _mm_mulhi_epu16(x0, m0);
x2 = _mm_srli_epi32( _mm_madd_epi16( x2, compv), 10); // u16/10, u16/100, u16/1000, u16/10000
x3 = _mm_insert_epi16(_mm_slli_si128(x2, 6), (int) x_Lt10k, 1); // 0, u16, 0, u16/10, 0, u16/100, 0, u16/1000
x4 = _mm_or_si128(x2, x3);
x4 = _mm_madd_epi16(x4, _mm_loadu_si128( (__m128i *) mten_mulplr_d) ) ;
(continue)
10-43

SSE4.2 AND SIMD PROGRAMMING FOR TEXT- PROCESSING/LEXING/PARSING

Example 10-23. Conversion of 64-bit Integer to Wide Character String Using SSE4 (Contd.)

}

x2 = _mm_add_epi32( x4, _mm_loadu_si128( (__m128i *) asc0bias) ) ;
x2 = _mm_shuffle_epi32(x2, 0x1b); // switch sequence
if (x_Lt10k > 999 ) {
_mm_storeu_si128( (__m128i *) ps, x2);
return 4;
}
else if (x_Lt10k > 99 ) {
*ps++ = (wchar_t) _mm_cvtsi128_si32( _mm_srli_si128( x2, 4));
*(long long *) ps = _mm_cvtsi128_si64( _mm_srli_si128( x2, 8));
return 3;
}
else if ( x_Lt10k > 9){ // take advantage of reduced dynamic range > 9 to reduce branching
*(long long *) ps = _mm_cvtsi128_si64( _mm_srli_si128( x2, 8));
return 2;
}
*ps = L'0' + x_Lt10k;
return 1;

long long sse4i_q2wcs_u63 ( __int64_t xx, wchar_t *ps)
{int j, tmp, idx=0, cnt;
__int64_t lo8, hi8, abv16, temp;
__m128i x0, m0, x1, x2, x3, x4, x5, x6, x7, m1;
if ( xx < 10000 ) {
j = ubs_Lt10k_2wcs_i2 ( (unsigned ) xx, ps); ps[j] = 0; return j;
}
if (xx < 100000000 ) { // dynamic range of xx is less than 32-bits
m0 = _mm_cvtsi32_si128( xx);
x1 = _mm_shuffle_epi32(m0, 0x44); // broadcast to dw0 and dw2
x3 = _mm_mul_epu32(x1, _mm_loadu_si128( (__m128i *) pr_cg_10to4 ));
x3 = _mm_mullo_epi32(_mm_srli_epi64(x3, 40), _mm_loadu_si128( (__m128i *)pr_1_m10to4));
m0 = _mm_add_epi32( _mm_srli_si128( x1, 8), x3); // quotient in dw2, remainder in dw0
__ParMod10to4SSSE3v( x3, m0);
//x4 = _mm_shuffle_epi8(x3, _mm_loadu_si128( (__m128i *) bcstpklodw) );
x3 = _mm_shuffle_epi32(x3, 0x1b);
__ParMod10to4SSSE3v( x4, _mm_slli_si128(m0, 8)); // move the remainder to dw2 first
x4 = _mm_shuffle_epi32(x4, 0x1b);
cnt = 8;
} else {
hi8 = u64mod10to8(&lo8, xx);
if( hi8 < 10000) {
m0 = _mm_cvtsi32_si128( lo8);
x2 = _mm_shuffle_epi32(m0, 0x44);
x3 = _mm_mul_epu32(x2, _mm_loadu_si128( (__m128i *)pr_cg_10to4));
x3 = _mm_mullo_epi32(_mm_srli_epi64(x3, 40), _mm_loadu_si128( (__m128i *)pr_1_m10to4));
m0 = _mm_add_epi32( _mm_srli_si128( x2, 8), x3);
__ParMod10to4SSSE3( x3, hi8);
x3 = _mm_shuffle_epi32(x3, 0x1b);
__ParMod10to4SSSE3v( x4, m0);
x4 = _mm_shuffle_epi32(x4, 0x1b);
(continue)

10-44

SSE4.2 AND SIMD PROGRAMMING FOR TEXT- PROCESSING/LEXING/PARSING

Example 10-23. Conversion of 64-bit Integer to Wide Character String Using SSE4 (Contd.)

}

}

__ParMod10to4SSSE3v( x5, _mm_slli_si128(m0, 8));
x5 = _mm_shuffle_epi32(x5, 0x1b);
cnt = 12;
} else {
cnt = 0;
if ( hi8 > 100000000) {
abv16 = u64mod10to8(&temp, (__int64_t)hi8);
hi8 = temp;
__ParMod10to4SSSE3( x7, abv16);
x7 = _mm_shuffle_epi32(x7, 0x1b);
cnt = 4;
}
m0 = _mm_cvtsi32_si128( hi8);
x2 = _mm_shuffle_epi32(m0, 0x44);
x3 = _mm_mul_epu32(x2, _mm_loadu_si128( (__m128i *)pr_cg_10to4));
x3 = _mm_mullo_epi32(_mm_srli_epi64(x3, 40), _mm_loadu_si128( (__m128i *)pr_1_m10to4));
m0 = _mm_add_epi32( _mm_srli_si128( x2, 8), x3);
m1 = _mm_cvtsi32_si128( lo8);
x2 = _mm_shuffle_epi32(m1, 0x44);
x3 = _mm_mul_epu32(x2, _mm_loadu_si128( (__m128i *)pr_cg_10to4));
x3 = _mm_mullo_epi32(_mm_srli_epi64(x3, 40), _mm_loadu_si128( (__m128i *)pr_1_m10to4));
m1 = _mm_add_epi32( _mm_srli_si128( x2, 8), x3);
__ParMod10to4SSSE3v( x3, m0);
x3 = _mm_shuffle_epi32(x3, 0x1b);
__ParMod10to4SSSE3v( x4, _mm_slli_si128(m0, 8));
x4 = _mm_shuffle_epi32(x4, 0x1b);
__ParMod10to4SSSE3v( x5, m1);
x5 = _mm_shuffle_epi32(x5, 0x1b);
__ParMod10to4SSSE3v( x6, _mm_slli_si128(m1, 8));
x6 = _mm_shuffle_epi32(x6, 0x1b);
cnt += 16;

m0 = _mm_loadu_si128( (__m128i *) asc0bias);
if( cnt > 16) {
tmp = _mm_movemask_epi8( _mm_cmpgt_epi32(x7,_mm_setzero_si128()) ) ;
//x7 = _mm_add_epi32(x7, m0);
} else {
tmp = _mm_movemask_epi8( _mm_cmpgt_epi32(x3,_mm_setzero_si128()) );
}
#ifndef __USE_GCC__
__asm__ ("bsfl %1, %%ecx; movl %%ecx, %0;" :"=r"(idx) :"r"(tmp) : "%ecx");
#else
_BitScanForward(&idx, tmp);
#endif
x3 = _mm_add_epi32(x3, m0);
cnt -= (idx >>2);
x4 = _mm_add_epi32(x4, m0);
switch(cnt) {
case5:*ps++ = (wchar_t) _mm_cvtsi128_si32( _mm_srli_si128( x3, 12));
_mm_storeu_si128( (__m128i *) ps, x4);
break;
case6:*(long long *)ps = _mm_cvtsi128_si64( _mm_srli_si128( x3, 8));
_mm_storeu_si128( (__m128i *) &ps[2], x4);
break;
(continue)
10-45

SSE4.2 AND SIMD PROGRAMMING FOR TEXT- PROCESSING/LEXING/PARSING

Example 10-23. Conversion of 64-bit Integer to Wide Character String Using SSE4 (Contd.)
case7:*ps++ = (wchar_t) _mm_cvtsi128_si32( _mm_srli_si128( x3, 4));
*(long long *) ps = _mm_cvtsi128_si64( _mm_srli_si128( x3, 8));
_mm_storeu_si128( (__m128i *) &ps[2], x4);
break;
case 8: _mm_storeu_si128( (__m128i *) &ps[0], x3);
_mm_storeu_si128( (__m128i *) &ps[4], x4);
break;
case9:*ps++ = (wchar_t) _mm_cvtsi128_si32( _mm_srli_si128( x3, 12));
x5 = _mm_add_epi32(x5, m0);
_mm_storeu_si128( (__m128i *) ps, x4);
_mm_storeu_si128( (__m128i *) &ps[4], x5);
break;
case10:*(long long *)ps = _mm_cvtsi128_si64( _mm_srli_si128( x3, 8));
x5 = _mm_add_epi32(x5, m0);
_mm_storeu_si128( (__m128i *) &ps[2], x4);
_mm_storeu_si128( (__m128i *) &ps[6], x5);
break;
case11:*ps++ = (wchar_t) _mm_cvtsi128_si32( _mm_srli_si128( x3, 4));
*(long long *) ps = _mm_cvtsi128_si64( _mm_srli_si128( x3, 8));
x5 = _mm_add_epi32(x5, m0);
_mm_storeu_si128( (__m128i *) &ps[2], x4);
_mm_storeu_si128( (__m128i *) &ps[6], x5);
break;
case 12: _mm_storeu_si128( (__m128i *) &ps[0], x3);
x5 = _mm_add_epi32(x5, m0);
_mm_storeu_si128( (__m128i *) &ps[4], x4);
_mm_storeu_si128( (__m128i *) &ps[8], x5);
break;
case13:*ps++ = (wchar_t) _mm_cvtsi128_si32( _mm_srli_si128( x3, 12));
x5 = _mm_add_epi32(x5, m0);
_mm_storeu_si128( (__m128i *) ps, x4);
x6 = _mm_add_epi32(x6, m0);
_mm_storeu_si128( (__m128i *) &ps[4], x5);
_mm_storeu_si128( (__m128i *) &ps[8], x6);
break;
case14:*(long long *)ps = _mm_cvtsi128_si64( _mm_srli_si128( x3, 8));
x5 = _mm_add_epi32(x5, m0);
_mm_storeu_si128( (__m128i *) &ps[2], x4);
x6 = _mm_add_epi32(x6, m0);
_mm_storeu_si128( (__m128i *) &ps[6], x5);
_mm_storeu_si128( (__m128i *) &ps[10], x6);
break;
case15:*ps++ = (wchar_t) _mm_cvtsi128_si32( _mm_srli_si128( x3, 4));
*(long long *) ps = _mm_cvtsi128_si64( _mm_srli_si128( x3, 8));
x5 = _mm_add_epi32(x5, m0);
_mm_storeu_si128( (__m128i *) &ps[2], x4);
x6 = _mm_add_epi32(x6, m0);
_mm_storeu_si128( (__m128i *) &ps[6], x5);
_mm_storeu_si128( (__m128i *) &ps[10], x6);
break;
(continue)

10-46

SSE4.2 AND SIMD PROGRAMMING FOR TEXT- PROCESSING/LEXING/PARSING

Example 10-23. Conversion of 64-bit Integer to Wide Character String Using SSE4 (Contd.)
case 16: _mm_storeu_si128( (__m128i *) &ps[0], x3);
x5 = _mm_add_epi32(x5, m0);
_mm_storeu_si128( (__m128i *) &ps[4], x4);
x6 = _mm_add_epi32(x6, m0);
_mm_storeu_si128( (__m128i *) &ps[8], x5);
_mm_storeu_si128( (__m128i *) &ps[12], x6);
break;

}

case17:x7 = _mm_add_epi32(x7, m0);
*ps++ = (wchar_t) _mm_cvtsi128_si32( _mm_srli_si128( x7, 12));
x5 = _mm_add_epi32(x5, m0);
_mm_storeu_si128( (__m128i *) ps, x3);
x6 = _mm_add_epi32(x6, m0);
_mm_storeu_si128( (__m128i *) &ps[4], x4);
_mm_storeu_si128( (__m128i *) &ps[8], x5);
_mm_storeu_si128( (__m128i *) &ps[12], x6);
break;
case18:x7 = _mm_add_epi32(x7, m0);
*(long long *)ps = _mm_cvtsi128_si64( _mm_srli_si128( x7, 8));
x5 = _mm_add_epi32(x5, m0);
_mm_storeu_si128( (__m128i *) &ps[2], x3);
x6 = _mm_add_epi32(x6, m0);
_mm_storeu_si128( (__m128i *) &ps[6], x4);
_mm_storeu_si128( (__m128i *) &ps[10], x5);
_mm_storeu_si128( (__m128i *) &ps[14], x6);
break;
case19:x7 = _mm_add_epi32(x7, m0);
*ps++ = (wchar_t) _mm_cvtsi128_si64( _mm_srli_si128( x7, 4));
*(long long *)ps = _mm_cvtsi128_si64( _mm_srli_si128( x7, 8));
x5 = _mm_add_epi32(x5, m0);
_mm_storeu_si128( (__m128i *) &ps[2], x3);
x6 = _mm_add_epi32(x6, m0);
_mm_storeu_si128( (__m128i *) &ps[6], x4);
_mm_storeu_si128( (__m128i *) &ps[10], x5);
_mm_storeu_si128( (__m128i *) &ps[14], x6);
break;
case20:x7 = _mm_add_epi32(x7, m0);
_mm_storeu_si128( (__m128i *) &ps[0], x7);
x5 = _mm_add_epi32(x5, m0);
_mm_storeu_si128( (__m128i *) &ps[4], x3);
x6 = _mm_add_epi32(x6, m0);
_mm_storeu_si128( (__m128i *) &ps[8], x4);
_mm_storeu_si128( (__m128i *) &ps[12], x5);
_mm_storeu_si128( (__m128i *) &ps[16], x6);
break;
}
return cnt;

10-47

SSE4.2 AND SIMD PROGRAMMING FOR TEXT- PROCESSING/LEXING/PARSING

10.5.1

Large Integer Numeric Computation

10.5.1.1

MULX Instruction and Large Integer Numeric Computation

The MULX instruction is similar to the MUL instruction but does not read or write arithmetic flags and is
enhanced with more flexibility in register allocations for the destination operands. These enhancements
allow better out-of-order operation of the hardware and for software to intermix add-carry instruction
without corrupting the carry chain.
For computations calculating large integers (e.g. 2048-bit RSA key), MULX can improve performance
significantly over techniques based on MUL/ADC chain sequences (see http://download.intel.com/embedded/processor/whitepaper/327831.pdf). AVX2 can be used to build efficient techniques, see Section 11.16.2.
Example 10-24 gives an example of how MULX is used to improve the carry chain computation of integer
numeric greater than 64-bit wide.

Example 10-24. MULX and Carry Chain in Large Integer Numeric
mov rax, [rsi+8*1]

mulx rbx, r8, [rsi+8*1]

mul rbp

add r8, r9

; rdx:rax = rax * rbp

mov r8, rdx

adc rbx, 0

add r9, rax

add r8, rbp

adc r8, 0

adc rbx, 0

; rbx:r8 = rdx * [rsi+8*1]

add r9, rbx
adc r8, 0
Using MULX to implement 128-bit integer output can be a useful building block for implementing library
functions ranging from atof/strtod or intermediate mantisa computation or mantissa/exponent normalization in 128-bit binary decimal floating-point operations. Example 10-25 gives examples of buildingblock macros, used in 128-bit binary-decimal floating-point operations, which can take advantage MULX
to calculate intermediate results of multiple-precision integers of widths between 128 to 256 bits. Details
of binary-integer-decimal (BID) floating-point format and library implementation of BID operation can be
found at http://software.intel.com/en-us/articles/intel-decimal-floating-point-math-library.

10-48

SSE4.2 AND SIMD PROGRAMMING FOR TEXT- PROCESSING/LEXING/PARSING

Example 10-25. Building-block Macro Used in Binary Decimal Floating-point Operations
// Portable C macro of 64x64-bit product using 32-bit word granular operations
// Output: BID_UINT128 P128
#define __mul_64x64_to_128MACH(P128, CX64, CY64) \
{

\
BID_UINT64 CXH,CXL,CYH,CYL,PL,PH,PM,PM2;
CXH = (CX64) >> 32;

\

\

CXL = (BID_UINT32)(CX64);

\

CYH = (CY64) >> 32;

\

CYL = (BID_UINT32)(CY64);

\

PM = CXH*CYL;

\

PH = CXH*CYH;

\

PL = CXL*CYL;

\

PM2 = CXL*CYH;

\

PH += (PM>>32);

\

PM = (BID_UINT64)((BID_UINT32)PM)+PM2+(PL>>32);
(P128).w[1] = PH + (PM>>32);
(P128).w[0] = (PM<<32)+(BID_UINT32)PL;

\

\
\

}
// 64x64-bit product using intrinsic producing 128-bit output in 64-bit mode
// Output: BID_UINT128 P128
#define __mul_64x64_to_128MACH_x64(P128, CX64, CY64) \
{

\

(P128).w[0] = mulx_u64(CX64, CY64, &( (P128).w[1]) );

\

}

10-49

SSE4.2 AND SIMD PROGRAMMING FOR TEXT- PROCESSING/LEXING/PARSING

10-50

CHAPTER 11
OPTIMIZATIONS FOR INTEL® AVX, FMA AND AVX2
Intel® Advanced Vector Extension (Intel® AVX), is a major enhancement to Intel Architecture. It extends
the functionality of previous generations of 128-bit SSE vector instructions and increased the vector
register width to support 256-bit operations. The Intel AVX ISA enhancement is focused on float-point
instructions. Some 256-bit integer vectors are supported via floating-point to integer and integer to
floating-point conversions.
Intel microarchitecture code name Sandy Bridge implements the Intel AVX instructions, in most cases,
on 256-bit hardware. Thus, each core has 256-bit floating-point Add and Multiply units. The Divide and
Square-root units are not enhanced to 256-bits. Thus, Intel AVX instructions use the 128-bit hardware in
two steps to complete these 256-bit operations.
Prior generations of Intel® Streaming SIMD Extensions (Intel® SSE) instructions generally are twooperand syntax, where one of the operands serves both as source and as destination. Intel AVX instructions are encoded with a VEX prefix, which includes a bit field to encode vector lengths and support
three-operand syntax. A typical instruction has two sources and one destination. Four operand instructions such as VBLENDVPS and VBLENDVPD exist as well. The added operand enables non destructive
source (NDS) and it eliminates the need for register duplication using MOVAPS operations.
With the exception of MMX instructions, almost all legacy 128-bit SSE instructions have AVX equivalents
that support three operand syntax. 256-bit AVX instructions employ three-operand syntax and some
with 4-operand syntax.
The 256-bit vector register, YMM, extends the 128-bit XMM register to 256 bits. Thus the lower 128-bits
of YMM is aliased to the legacy XMM registers.
While 256-bit AVX instructions writes 256 bits of results to YMM, 128-bit AVX instructions writes 128-bits
of results into the XMM register and zeros the upper bits above bit 128 of the corresponding YMM. 16
vector registers are available in 64-bit mode. Only the lower 8 vector registers are available in non-64bit modes.
Software can continue to use any mixture of legacy SSE code, 128-bit AVX code and 256-bit AVX code.
Section covers guidelines to deliver optimal performance across mixed-vector-length code modules
without experiencing transition delays between legacy SSE and AVX code. There are no transition delays
of mixing 128-bit AVX code and 256-bit AVX code.
The optimal memory alignment of an Intel AVX 256-bit vector, stored in memory, is 32 bytes. Some
data-movement 256-bit Intel AVX instructions enforce 32-byte alignment and will signal #GP fault if
memory operand is not properly aligned. The majority of 256-bit Intel AVX instructions do not require
address alignment. These instructions generally combine load and compute operations, so any nonaligned memory address can be used in these instructions.
For best performance, software should pay attention to align the load and store addresses to 32 bytes
whenever possible.
The major differences between using AVX instructions and legacy SSE instructions are summarized in
Table 11-1 below:

OPTIMIZATIONS FOR INTEL® AVX, FMA AND AVX2

Table 11-1. Features between 256-bit AVX, 128-bit AVX and Legacy SSE Extensions
Features

256-bit AVX

128-bit AVX

Legacy SSE-AESNI

Functionality Scope

Floating-point operation,
Data Movement.

Matches legacy SIMD ISA
(except MMX).

128-bit FP and integer SIMD
ISA.

Register Operand

YMM.

XMM.

XMM.

Operand Syntax

Up to 4; non-destructive
source.

Up to 4; non-destructive
source.

2 operand syntax;
destructive source.

Memory alignment

Load-Op semantics do not
require alignment.

Load-Op semantics do not
require alignment.

Always enforce 16B
alignment.

Aligned Move Instructions

32 byte alignment.

16 byte alignment.

16 byte alignment.

Non-destructive source
operand

Yes.

Yes.

No.

Register State Handling

Updates bits 255:0.

Updates 127:0; Zeroes bits
above 128.

Updates 127:0; Bits above
128 unmodified.

Intrinsic Support

• New 256-bit data types.
• _mm256 prefix for
promoted functionality.
• New intrinsics for new
functionalities.

• Existing data types.
• Inherit same prototype for
exiting functionalities.
• Use “_mm” prefix for new
VEX-128 functionalities.

Baseline datatypes and
prototype definitions.

128-bit Lanes

Applies to most 256-bit
operations.

One 128-bit lane.

One 128-bit lane.

Mixed Code Handling

Use VZEROUPPER to
avoid transition penalty.

No transition penalty.

Transition penalty after
executing 256-bit AVX code.

11.1

INTEL® AVX INTRINSICS CODING

256-bit AVX instructions have new intrinsics. Specifically, 256-bit AVX instruction that are promoted to
256-bit vector length from existing SSE functionality are generally prototyped with a “_mm256” prefix
instead of the “_mm” prefix and using new data types defined for 256-bit operation. New functionality in
256-bit AVX instructions have brand new prototype.
The 128-bit AVX instruction that were promoted from legacy SIMD ISA uses the same prototype as
before. Newer functionality common in 256-bit and 128-bit AVX instructions are prototyped with
“_mm256” and “_mm” prefixes respectively.
Thus porting from legacy SIMD code written in intrinsic can be ported to 256-bit AVX code with a modest
effort.
The following guidelines show how to convert a simple intrinsic from Intel SSE code sequence to Intel
AVX:

•
•
•
•
•

Align statically and dynamically allocated buffers to 32-bytes.
May need to double supplemental buffer size.
Change __mm_ intrinsic name prefix with __mm256_.
Change variable data types names from __m128 to __m256.
Divide by 2 iteration count (or double stride length).

This example below on Cartesian coordinate transformation demonstrates the Intel AVX Instruction
format, 32 byte YMM registers, dynamic and static memory allocation with data alignment of 32bytes,
and the C data type representing 8 floating-point elements in a YMM register.

11-2

OPTIMIZATIONS FOR INTEL® AVX, FMA AND AVX2

Example 11-1. Cartesian Coordinate Transformation with Intrinsics
//Use SSE intrinsic
#include "wmmintrin.h"

// Use Intel AVX intrinsic
#include "immintrin.h"

int main()
{ int len = 3200;
//Dynamic memory allocation with 16byte
//alignment
float* pInVector = (float*) _mm_malloc(len*sizeof(float),
16);
float* pOutVector = (float*) _mm_malloc(len*sizeof(float),
16);
//init data
for(int i=0; i

vmovups xmm0, mem
vinsertf128 ymm0, ymm0, mem+16, 1

Convert 32-byte stores as follows:
vmovups mem, ymm0
-> vmovups mem, xmm0
vextractf128 mem+16, ymm0, 1
The following intrinsics are available to handle unaligned 32-byte memory operating using 16-byte memory accesses:
_mm256_loadu2_m128 ( float const * addr_hi, float const * addr_lo);
_mm256_loadu2_m128d ( double const * addr_hi, double const * addr_lo);
_mm256_loadu2_m128 i( __m128i const * addr_hi, __m128i const * addr_lo);
_mm256_storeu2_m128 ( float * addr_hi, float * addr_lo, __m256 a);
_mm256_storeu2_m128d ( double * addr_hi, double * addr_lo, __m256d a);
_mm256_storeu2_m128 i( __m128i * addr_hi, __m128i * addr_lo, __m256i a);

Example 11-12 shows two implementations for SAXPY with unaligned addresses. Alternative 1 uses 32
byte loads and alternative 2 uses 16 byte loads. These code samples are executed with two source
buffers, src1, src2, at 4 byte offset from 32-Byte alignment, and a destination buffer, DST, that is 32-Byte
aligned. Using two 16-byte memory operations in lieu of 32-byte memory access performs faster.

Example 11-12. SAXPY Implementations for Unaligned Data Addresses
AVX with 32-byte memory operation
AVX using two 16-byte memory operations
mov
rax, src1
mov
rbx, src2
mov
rcx, dst
mov
rdx, len
xor
rdi, rdi
vbroadcastss ymm0, alpha
start_loop:
vmovups ymm1, [rax + rdi]
vmulps ymm1, ymm1, ymm0
vmovups ymm2, [rbx + rdi]
vaddps ymm1, ymm1, ymm2
vmovups [rcx + rdi], ymm1

vmovups ymm1, [rax+rdi+32]
vmulps ymm1, ymm1, ymm0
vmovups ymm2, [rbx+rdi+32]
vaddps ymm1, ymm1, ymm2
vmovups [rcx+rdi+32], ymm1
add
cmp
jl

rdi, 64
rdi, rdx
start_loop

mov
rax, src1
mov
rbx, src2
mov
rcx, dst
mov
rdx, len
xor
rdi, rdi
vbroadcastss ymm0, alpha
start_loop:
vmovups xmm2, [rax+rdi]
vinsertf128 ymm2, ymm2, [rax+rdi+16], 1
vmulps ymm1, ymm0, ymm2
vmovups xmm2, [ rbx + rdi]
vinsertf128 ymm2, ymm2, [rbx+rdi+16], 1
vaddps ymm1, ymm1, ymm2
vaddps ymm1, ymm1, ymm2
vmovaps [rcx+rdi], ymm1
vmovups xmm2, [rax+rdi+32]
vinsertf128 ymm2, ymm2, [rax+rdi+48], 1
vmulps ymm1, ymm0, ymm2
vmovups xmm2, [rbx+rdi+32]
vinsertf128 ymm2, ymm2, [rbx+rdi+48], 1
vaddps ymm1, ymm1, ymm2
vmovups [rcx+rdi+32], ymm1
add rdi, 64
cmp rdi, rdx
jl start_loop

11-21

OPTIMIZATIONS FOR INTEL® AVX, FMA AND AVX2

Assembly/Compiler Coding Rule 74. (M impact, H generality) Align data to 32-byte boundary
when possible. Prefer store alignment over load alignment.

11.6.3

Prefer Aligned Stores Over Aligned Loads

There are cases where it is possible to align only a subset of the processed data buffers. In these cases,
aligning data buffers used for store operations usually yields better performance than aligning data
buffers used for load operations.
Unaligned stores are likely to cause greater performance degradation than unaligned loads, since there
is a very high penalty on stores to a split cache-line that crosses pages. This penalty is estimated at 150
cycles. Loads that cross a page boundary are executed at retirement. In Example 11-12, unaligned store
address can affect SAXPY performance for 3 unaligned addresses to about one quarter of the aligned
case.

11.7

L1D CACHE LINE REPLACEMENTS

When a load misses the L1D Cache, a cache line with the requested data is brought from a higher
memory hierarchy level. In memory intensive code where the L1 DCache is always active, replacing a
cache line in the L1 DCache may delay other loads. In Intel microarchitecture code name Sandy Bridge
and Ivy Bridge, the penalty for 32-Byte loads may be higher than the penalty for 16-Byte loads. Therefore, memory intensive Intel AVX code with 32-Byte loads and with data set larger than the L1 DCache
may be slower than similar code with 16-Byte loads.
When Example 11-12 is run with a data set that resides in the L2 Cache, the 16-byte memory access
implementation is slightly faster than the 32-byte memory operation.
Be aware that the relative merit of 16-Byte memory accesses versus 32-byte memory access is implementation specific across generations of microarchitectures.
With Intel microarchitecture code name Haswell, the L1 DCache can support two 32-byte fetch each
cycle, this cache line replacement concern does not apply.

11.8

4K ALIASING

4-KByte memory aliasing occurs when the code stores to one memory location and shortly after that it
loads from a different memory location with a 4-KByte offset between them. For example, a load to linear
address 0x400020 follows a store to linear address 0x401020.
The load and store have the same value for bits 5 - 11 of their addresses and the accessed byte offsets
should have partial or complete overlap.
4K aliasing may have a five-cycle penalty on the load latency. This penalty may be significant when 4K
aliasing happens repeatedly and the loads are on the critical path. If the load spans two cache lines it
might be delayed until the conflicting store is committed to the cache. Therefore 4K aliasing that happens
on repeated unaligned Intel AVX loads incurs a higher performance penalty.
To detect 4K aliasing, use the LD_BLOCKS_PARTIAL.ADDRESS_ALIAS event that counts the number of
times Intel AVX loads were blocked due to 4K aliasing.
To resolve 4K aliasing, try the following methods in the following order:

•
•
•

Align data to 32 Bytes.
Change offsets between input and output buffers if possible.
Use 16-Byte memory accesses on memory which is not 32-Byte aligned.

11-22

OPTIMIZATIONS FOR INTEL® AVX, FMA AND AVX2

11.9

CONDITIONAL SIMD PACKED LOADS AND STORES

The VMASKMOV instruction conditionally moves packed data elements to/from memory, depending on
the mask bits associated with each data element. The mask bit for each data element is the most significant bit of the corresponding element in the mask register.
When performing a mask load, the returned value is 0 for elements which have a corresponding mask
value of 0. The mask store instruction writes to memory only the elements with a corresponding mask
value of 1, while preserving memory values for elements with a corresponding mask value of 0. Faults
can occur only for memory accesses that are required by the mask. Faults do not occur due to referencing any memory location if the corresponding mask bit value for that memory location is zero. For
example, no faults are detected if the mask bits are all zero.
The following figure shows an example for a mask load and a mask store which does not cause a fault. In
this example, the mask register for the load operation is ymm1 and the mask register for the store operation is ymm2.
When using masked load or store consider the following:

•

The address of a VMASKMOV store is considered as resolved only after the mask is known. Loads that
follow a masked store can be blocked until the mask value is known (unless relieved by the memory
disambiguator).

•

If the mask is not all 1 or all 0, loads that depend on the masked store have to wait until the store
data is written to the cache. If the mask is all 1 the data can be forwarded from the masked store to
the dependent loads. If the mask is all 0 the loads do not depend on the masked store.

•

Masked loads including an illegal address range do not result in an exception if the range is under a
zero mask value. However, the processor may take a multi-hundred-cycle “assist” to determine that
no part of the illegal range have a one mask value. This assist may occur even when the mask is
“zero” and it seems obvious to the programmer that the load should not be executed.

When using VMASKMOV, consider the following:

•
•
•
•
•

Use VMASKMOV only in cases where VMOVUPS cannot be used.
Use VMASKMOV on 32Byte aligned addresses if possible.
If possible use valid address range for masked loads, even if the illegal part is masked with zeros.
Determine the mask as early as possible.
Avoid store-forwarding issues by performing loads prior to a VMASKMOV store if possible.

11-23

OPTIMIZATIONS FOR INTEL® AVX, FMA AND AVX2

•

Be aware of mask values that would cause the VMASKMOV instruction to require assist (if an assist is
required, the latency of VMASKMOV to load data will increase dramatically):
— Load data using VMASKMOV with a mask value selecting 0 elements from an illegal address will
require an assist.
— Load data using VMASKMOV with a mask value selecting 0 elements from a legal address
expressed in some addressing form (e.g. [base+index], disp[base+index] )will require an assist.

With processors based on the Skylake microarchitecture, the performance characteristics of VMASKMOV
instructions have the following notable items:

•
•

Loads that follow a masked store is not longer blocked until the mask value is known.
Store data using VMASKMOV with a mask value permitting 0 elements to be written to an illegal
address will require an assist.

11.9.1

Conditional Loops

VMASKMOV enables vectorization of loops that contain conditional code. There are two main benefits in
using VMASKMOV over the scalar implementation in these cases:

•
•

VMASKMOV code is vectorized.
Branch mispredictions are eliminated.

Below is a conditional loop C code:

Example 11-13. Loop with Conditional Expression
for(int i = 0; i < miBufferWidth; i++)
{
if(A[i]>0)
{
B[i] = (E[i]*C[i]);
}
else
{
B[i] = (E[i]*D[i]);
}
}

Example 11-14. Handling Loop Conditional with VMASKMOV
Scalar
AVX using VMASKMOV
float* pA = A;
float* pB = B;
float* pC = C;
float* pD = D;
float* pE = E;
uint64 len = (uint64) (miBufferWidth)*sizeof(float);
__asm
{
mov rax, pA
mov rbx, pB
mov rcx, pC
mov rdx, pD
mov rsi, pE
mov r8, len

11-24

float* pA = A;
float* pB = B;
float* pC = C;
float* pD = D;
float* pE = E;
uint64 len = (uint64) (miBufferWidth)*sizeof(float);
__asm
{
mov rax, pA
mov rbx, pB
mov rcx, pC
mov rdx, pD
mov rsi, pE
mov r8, len

OPTIMIZATIONS FOR INTEL® AVX, FMA AND AVX2

Example 11-14. Handling Loop Conditional with VMASKMOV (Contd.)
Scalar
AVX using VMASKMOV
//xmm8 all zeros
vxorps xmm8, xmm8, xmm8
xor r9,r9
loop1:
vmovss xmm1, [rax+r9]
vcomiss xmm1, xmm8
jbe a_le
a_gt:
vmovss xmm4, [rcx+r9]
jmp mul
a_le:
vmovss xmm4, [rdx+r9]
mul:
vmulss xmm4, xmm4, [rsi+r9]
vmovss [rbx+r9], xmm4
add r9, 4
cmp r9, r8
jl loop1
}

//ymm8 all zeros
vxorps ymm8, ymm8, ymm8
//ymm9 all ones
vcmpps ymm9, ymm8, ymm8, 0
xor r9,r9
loop1:
vmovups ymm1, [rax+r9]
vcmpps ymm2, ymm8, ymm1, 1
vmaskmovps ymm4, ymm2, [rcx+r9]
vxorps ymm2, ymm2, ymm9
vmaskmovps ymm5, ymm2, [rdx+r9]
vorps ymm4, ymm4, ymm5
vmulps ymm4,ymm4, [rsi+r9]
vmovups [rbx+r9], ymm4
add r9, 32
cmp r9, r8
jl loop1
}

The performance of the left side of Example 11-14 is sensitive to branch mis-predictions and can be an
order of magnitude slower than the VMASKMOV example which has no data-dependent branches.

11.10

MIXING INTEGER AND FLOATING-POINT CODE

Integer SIMD functionalities in Intel AVX instructions are limited to 128-bit. There are some algorithm
that uses mixed integer SIMD and floating-point SIMD instructions. Therefore, porting such legacy 128bit code into 256-bit AVX code requires special attention.
For example, PALINGR (Packed Align Right) is an integer SIMD instruction that is useful arranging data
elements for integer and floating-point code. But VPALINGR instruction does not have a corresponding
256-bit instruction in AVX.
There are two approaches to consider when porting legacy code consisting of mostly floating-point with
some integer operations into 256-bit AVX code:

•

Locate a 256-bit AVX alternative to replace the critical128-bit Integer SIMD instructions if such an
AVX instructions exist. This is more likely to be true with integer SIMD instruction that re-arranges
data elements.

•

Mix 128-bit AVX and 256-bit AVX instructions.

The performance gain from these two approaches may vary. Where possible, use method (1), since this
method utilizes the full 256-bit vector width.
In case the code is mostly integer, convert the code from 128-bit SSE to 128 bit AVX instructions and gain
from the Non destructive Source (NDS) feature.

Example 11-15. Three-Tap Filter in C Code
for(int i = 0; i < len -2; i++)
{
pOut[i] = A[i]*coeff[0]+A[i+1]*coeff[1]+A[i+2]*coeff[2];{B[i] = (E[i]*D[i]);
}
11-25

OPTIMIZATIONS FOR INTEL® AVX, FMA AND AVX2

Example 11-16. Three-Tap Filter with 128-bit Mixed Integer and FP SIMD
xor ebx, ebx
mov rcx, len
mov rdi, inPtr
mov rsi, outPtr
mov r15, coeffs
movss xmm2, [r15] //load coeff 0
shufps xmm2, xmm2, 0 //broadcast coeff 0
movss xmm1, [r15+4] //load coeff 1
shufps xmm1, xmm1, 0 //broadcast coeff 1
movss xmm0, [r15+8] //coeff 2
shufps xmm0, xmm0, 0 //broadcast coeff 2
movaps xmm5, [rdi] //xmm5={A[n+3],A[n+2],A[n+1],A[n]}
loop_start:
movaps xmm6, [rdi+16] //xmm6={A[n+7],A[n+6],A[n+5],A[n+4]}
movaps xmm7, xmm6
movaps xmm8, xmm6
add rdi, 16
//inPtr+=32
add rbx, 4
//loop counter
palignr xmm7, xmm5, 4 //xmm7={A[n+4],A[n+3],A[n+2],A[n+1]}
palignr xmm8, xmm5, 8 //xmm8={A[n+5],A[n+4],A[n+3],A[n+2]}
mulps xmm5, xmm2 //xmm5={C0*A[n+3],C0*A[n+2],C0*A[n+1], C0*A[n]}

mulps xmm7, xmm1 //xmm7={C1*A[n+4],C1*A[n+3],C1*A[n+2],C1*A[n+1]}
mulps xmm8, xmm0 //xmm8={C2*A[n+5],C2*A[n+4] C2*A[n+3],C2*A[n+2]}
addps xmm7 ,xmm5
addps xmm7, xmm8
movaps [rsi], xmm7
movaps xmm5, xmm6
add rsi, 16
//outPtr+=16
cmp rbx, rcx
jl loop_start

Example 11-17. 256-bit AVX Three-Tap Filter Code with VSHUFPS
xor ebx, ebx
mov rcx, len
mov rdi, inPtr
mov rsi, outPtr
mov r15, coeffs
vbroadcastss ymm2, [r15] //load and broadcast coeff 0
vbroadcastss ymm1, [r15+4] //load and broadcast coeff 1
vbroadcastss ymm0, [r15+8] //load and broadcast coeff 2

11-26

OPTIMIZATIONS FOR INTEL® AVX, FMA AND AVX2

Example 11-17. 256-bit AVX Three-Tap Filter Code with VSHUFPS (Contd.)
loop_start:
vmovaps ymm5, [rdi]

// Ymm5={A[n+7],A[n+6],A[n+5],A[n+4];
// A[n+3],A[n+2],A[n+1] , A[n]}
vshufps ymm6,ymm5,[rdi+16],0x4e // ymm6={A[n+9],A[n+8],A[n+7],A[n+6];
// A[n+5],A[n+4],A[n+3],A[n+2]}
vshufps ymm7,ymm5,ymm6,0x99 // ymm7={A[n+8],A[n+7],A[n+6],A[n+5];
// A[n+4],A[n+3],A[n+2],A[n+1]}
vmulps ymm3,ymm5,ymm2// ymm3={C0*A[n+7],C0*A[n+6],C0*A[n+5],C0*A[n+4];
// C0*A[n+3],C0*A[n+2],C0*A[n+1],C0*A[n]}
vmulps ymm9,ymm7,ymm1 // ymm9={C1*A[n+8],C1*A[n+7],C1*A[n+6],C1*A[n+5];
// C1*A[n+4],C1*A[n+3],C1*A[n+2],C1*A[n+1]}
vmulps ymm4,ymm6,ymm0 // ymm4={C2*A[n+9],C2*A[n+8],C2*A[n+7],C2*A[n+6];
// C2*A[n+5],C2*A[n+4],C2*A[n+3],C2*A[n+2]}
vaddps ymm8 ,ymm3,ymm4
vaddps ymm10, ymm8, ymm9
vmovaps [rsi], ymm10
add
rdi, 32 //inPtr+=32
add
rbx, 8 //loop counter
add
rsi, 32 //outPtr+=32
cmp
rbx, rcx
jl
loop_start

Example 11-18. Three-Tap Filter Code with Mixed 256-bit AVX and 128-bit AVX Code
xor ebx, ebx
mov rcx, len
mov rdi, inPtr
mov rsi, outPtr
mov r15, coeffs
vbroadcastss ymm2, [r15] //load and broadcast coeff 0
vbroadcastss ymm1, [r15+4] //load and broadcast coeff 1
vbroadcastss ymm0, [r15+8] //load and broadcast coeff 2
vmovaps xmm3, [rdi] //xmm3={A[n+3],A[n+2],A[n+1],A[n]}
vmovaps xmm4, [rdi+16] //xmm4={A[n+7],A[n+6],A[n+5],A[n+4]}
vmovaps xmm5, [rdi+32] //xmm5={A[n+11], A[n+10],A[n+9],A[n+8]}
loop_start:
vinsertf128 ymm3, ymm3, xmm4, 1 // ymm3={A[n+7],A[n+6],A[n+5],A[n+4];
// A[n+3], A[n+2],A[n+1],A[n]}
vpalignr xmm6, xmm4, xmm3, 4 // xmm6={A[n+4],A[n+3],A[n+2],A[n+1]}
vpalignr xmm7, xmm5, xmm4, 4 // xmm7={A[n+8],A[n+7],A[n+6],A[n+5]}
vinsertf128 ymm6,ymm6,xmm7,1 // ymm6={A[n+8],A[n+7],A[n+6],A[n+5];
// A[n+4],A[n+3],A[n+2],A[n+1]}
vpalignr xmm8,xmm4,xmm3,8
// xmm8={A[n+5],A[n+4],A[n+3],A[n+2]}
vpalignr xmm9, xmm5, xmm4, 8 // xmm9={A[n+9],A[n+8],A[n+7],A[n+6]}
vinsertf128 ymm8, ymm8, xmm9,1 // ymm8={A[n+9],A[n+8],A[n+7],A[n+6];
// A[n+5],A[n+4],A[n+3],A[n+2]}
vmulps ymm3,ymm3,ymm2 // Ymm3={C0*A[n+7],C0*A[n+6],C0*A[n+5], C0*A[n+4];
// C0*A[n+3],C0*A[n+2],C0*A[n+1],C0*A[n]}
vmulps ymm6,ymm6,ymm1 // Ymm9={C1*A[n+8],C1*A[n+7],C1*A[n+6],C1*A[n+5];
// C1*A[n+4],C1*A[n+3],C1*A[n+2],C1*A[n+1]}

11-27

OPTIMIZATIONS FOR INTEL® AVX, FMA AND AVX2

Example 11-18. Three-Tap Filter Code with Mixed 256-bit AVX and 128-bit AVX Code (Contd.)
vmulps ymm8,ymm8,ymm0 // Ymm4={C2*A[n+9],C2*A[n+8],C2*A[n+7],C2*A[n+6];
// C2*A[n+5],C2*A[n+4],C2*A[n+3],C2*A[n+2]}
vaddps ymm3 ,ymm3,ymm6
vaddps ymm3, ymm3, ymm8
vmovaps [rsi], ymm3
vmovaps xmm3, xmm5
add
rdi, 32 //inPtr+=32
add
rbx, 8 //loop counter
add
rsi, 32 //outPtr+=32
cmp
rbx, rcx
jl
loop_start

Example 11-17 uses 256-bit VSHUFPS to replace the PALIGNR in 128-bit mixed SSE code. This speeds up
almost 70% over the 128-bit mixed SSE code of Example 11-16 and slightly ahead of Example 11-18.
For code that includes integer instructions and is written with 256-bit Intel AVX instructions, replace the
integer instruction with floating-point instructions that have similar functionality and performance. If
there is no similar floating-point instruction, consider using a 128-bit Intel AVX instruction to perform the
required integer operation.

11.11

HANDLING PORT 5 PRESSURE

Port 5 in Intel microarchitecture code name Sandy Bridge includes shuffle units and it frequently
becomes a performance bottleneck. Sometimes it is possible to replace shuffle instructions that dispatch
only on port 5, with different instructions and improve performance by reducing port 5 pressure. For
more information, see Table 2-15.

11.11.1 Replace Shuffles with Blends
There are a few cases where shuffles such as VSHUFPS or VPERM2F128 can be replaced by blend instructions. Intel AVX shuffles are executed only on port 5, while blends are also executed on port 0. Therefore,
replacing shuffles with blends could reduce port 5 pressure. The following figure shows how a VSHUFPS
is implemented using VBLENDPS.

11-28

OPTIMIZATIONS FOR INTEL® AVX, FMA AND AVX2

The following example shows two implementations of an 8x8 Matrix transpose. In both cases, the bottleneck is Port 5 pressure. Alternative 1 uses 12 vshufps instructions that are executed only on port 5. Alternative 2 replaces eight of the vshufps instructions with the vblendps instruction which can be executed
on Port 0.

Example 11-19. 8x8 Matrix Transpose - Replace Shuffles with Blends
256-bit AVX using VSHUFPS
AVX replacing VSHUFPS with VBLENDPS
movrcx, inpBuf
movrdx, outBuf
movr10, NumOfLoops
movrbx, rdx
loop1:
vmovaps ymm9, [rcx]
vmovaps ymm10, [rcx+32]
vmovaps ymm11, [rcx+64]
vmovaps ymm12, [rcx+96]
vmovaps ymm13, [rcx+128]
vmovaps ymm14, [rcx+160]
vmovaps ymm15, [rcx+192]
vmovaps ymm2, [rcx+224]
vunpcklps ymm6, ymm9, ymm10
vunpcklps ymm1, ymm11, ymm12
vunpckhps ymm8, ymm9, ymm10
vunpcklps ymm0, ymm13, ymm14
vunpcklps ymm9, ymm15, ymm2
vshufps ymm3, ymm6, ymm1, 0x4E
vshufps ymm10, ymm6, ymm3, 0xE4
vshufps ymm6, ymm0, ymm9, 0x4E
vunpckhps ymm7, ymm11, ymm12
vshufps ymm11, ymm0, ymm6, 0xE4
vshufps ymm12, ymm3, ymm1, 0xE4
vperm2f128 ymm3, ymm10, ymm11, 0x20
vmovaps [rdx], ymm3
vunpckhps ymm5, ymm13, ymm14
vshufps ymm13, ymm6, ymm9, 0xE4
vunpckhps ymm4, ymm15, ymm2
vperm2f128 ymm2, ymm12, ymm13, 0x20
vmovaps 32[rdx], ymm2
vshufps ymm14, ymm8, ymm7, 0x4
vshufps ymm15, ymm14, ymm7, 0xE4
vshufps ymm7, ymm5, ymm4, 0x4E
vshufps ymm8, ymm8, ymm14, 0xE4
vshufps ymm5, ymm5, ymm7, 0xE4
vperm2f128 ymm6, ymm8, ymm5, 0x20
vmovaps 64[rdx], ymm6
vshufps ymm4, ymm7, ymm4, 0xE4
vperm2f128 ymm7, ymm15, ymm4, 0x20
vmovaps 96[rdx], ymm7
vperm2f128 ymm1, ymm10, ymm11, 0x31
vperm2f128 ymm0, ymm12, ymm13, 0x31
vmovaps 128[rdx], ymm1
vperm2f128 ymm5, ymm8, ymm5, 0x31
vperm2f128 ymm4, ymm15, ymm4, 0x31

movrcx, inpBuf
movrdx, outBuf
movr10, NumOfLoops
movrbx, rdx
loop1:
vmovaps ymm9, [rcx]
vmovaps ymm10, [rcx+32]
vmovaps ymm11, [rcx+64]
vmovaps ymm12, [rcx+96]
vmovaps ymm13, [rcx+128]
vmovaps ymm14, [rcx+160]
vmovaps ymm15, [rcx+192]
vmovaps ymm2, [rcx+224]
vunpcklps ymm6, ymm9, ymm10
vunpcklps ymm1, ymm11, ymm12
vunpckhps ymm8, ymm9, ymm10
vunpcklps ymm0, ymm13, ymm14
vunpcklps ymm9, ymm15, ymm2
vshufps ymm3, ymm6, ymm1, 0x4E
vblendps ymm10, ymm6, ymm3, 0xCC
vshufps ymm6, ymm0, ymm9, 0x4E
vunpckhps ymm7, ymm11, ymm12
vblendps ymm11, ymm0, ymm6, 0xCC
vblendps ymm12, ymm3, ymm1, 0xCC
vperm2f128 ymm3, ymm10, ymm11, 0x20
vmovaps [rdx], ymm3
vunpckhps ymm5, ymm13, ymm14
vblendps ymm13, ymm6, ymm9, 0xCC
vunpckhps ymm4, ymm15, ymm2
vperm2f128 ymm2, ymm12, ymm13, 0x20
vmovaps 32[rdx], ymm2
vshufps ymm14, ymm8, ymm7, 0x4E
vblendps ymm15, ymm14, ymm7, 0xCC
vshufps ymm7, ymm5, ymm4, 0x4E
vblendps ymm8, ymm8, ymm14, 0xCC
vblendps ymm5, ymm5, ymm7, 0xCC
vperm2f128 ymm6, ymm8, ymm5, 0x20
vmovaps 64[rdx], ymm6
vblendps ymm4, ymm7, ymm4, 0xCC
vperm2f128 ymm7, ymm15, ymm4, 0x20
vmovaps 96[rdx], ymm7
vperm2f128 ymm1, ymm10, ymm11, 0x31
vperm2f128 ymm0, ymm12, ymm13, 0x31
vmovaps 128[rdx], ymm1
vperm2f128 ymm5, ymm8, ymm5, 0x31
vperm2f128 ymm4, ymm15, ymm4, 0x31
11-29

OPTIMIZATIONS FOR INTEL® AVX, FMA AND AVX2

Example 11-19. 8x8 Matrix Transpose - Replace Shuffles with Blends (Contd.)
256-bit AVX using VSHUFPS
AVX replacing VSHUFPS with VBLENDPS
vmovaps 160[rdx], ymm0
vmovaps 192[rdx], ymm5
vmovaps 224[rdx], ymm4
decr10
jnz loop1

vmovaps 160[rdx], ymm0
vmovaps 192[rdx], ymm5
vmovaps 224[rdx], ymm4
dec r10
jnz loop1

In Example 11-19, replacing VSHUFPS with VBLENDPS relieved port 5 pressure and can gain almost 40%
speedup.
Assembly/Compiler Coding Rule 75. (M impact, M generality) Use Blend instructions in lieu of
shuffle instruction in AVX whenever possible.

11.11.2 Design Algorithm With Fewer Shuffles
In some cases you can reduce port 5 pressure by changing the algorithm to use less shuffles. The figure
below shows that the transpose moved all the elements in rows 0-4 to the low lanes, and all the elements
in rows 4-7 to the high lanes. Therefore, using 256-bit loads in the beginning of the algorithm requires
using VPERM2F128 in order to swap elements between the lanes. The processor executes the
VPERM2F128 instruction only on port 5.
Example 11-19 used eight 256-bit loads and eight VPERM2F128 instructions. You can implement the
same 8x8 Matrix Transpose using VINSERTF128 instead of the 256-bit loads and the eight VPERM2F128.
Using VINSERTF128 from memory is executed in the load ports and on port 0 or 5. The original method
required loads that are performed on the load ports and VPERM2F128 that is only performed on port 5.
Therefore redesigning the algorithm to use VINSERTF128 reduces port 5 pressure and improves performance.

The following figure describes step 1 of the 8x8 matrix transpose with vinsertf128. Step 2 performs the
same operations on different columns.

11-30

OPTIMIZATIONS FOR INTEL® AVX, FMA AND AVX2

Example 11-20. 8x8 Matrix Transpose Using VINSRTPS
mov
rcx, inpBuf
mov
rdx, outBuf
mov
r8, iLineSize
mov
r10, NumOfLoops
loop1:
vmovaps
xmm0, [rcx]
vinsertf128 ymm0, ymm0, [rcx + 128], 1
vmovaps
xmm1, [rcx + 32]
vinsertf128 ymm1, ymm1, [rcx + 160], 1
vunpcklpd
vunpckhpd
vmovaps
vinsertf128
vmovaps
vinsertf128

ymm8, ymm0, ymm1
ymm9, ymm0, ymm1
xmm2, [rcx+64]
ymm2, ymm2, [rcx + 192], 1
xmm3, [rcx+96]
ymm3, ymm3, [rcx + 224], 1

vunpcklpd
vunpckhpd
vshufps
vmovaps
vshufps
vmovaps
vshufps
vmovaps
vshufps
vmovaps

ymm10, ymm2, ymm3
ymm11, ymm2, ymm3
ymm4, ymm8, ymm10, 0x88
[rdx], ymm4
ymm5, ymm8, ymm10, 0xDD
[rdx+32], ymm5
ymm6, ymm9, ymm11, 0x88
[rdx+64], ymm6
ymm7, ymm9, ymm11, 0xDD
[rdx+96], ymm7

11-31

OPTIMIZATIONS FOR INTEL® AVX, FMA AND AVX2

Example 11-20. 8x8 Matrix Transpose Using VINSRTPS (Contd.)
vmovaps
xmm0, [rcx+16]
vinsertf128 ymm0, ymm0, [rcx + 144], 1
vmovaps
xmm1, [rcx + 48]
vinsertf128 ymm1, ymm1, [rcx + 176], 1
vunpcklpd
vunpckhpd

ymm8, ymm0, ymm1
ymm9, ymm0, ymm1

vmovaps
vinsertf128
vmovaps
vinsertf128

xmm2, [rcx+80]
ymm2, ymm2, [rcx + 208], 1
xmm3, [rcx+112]
ymm3, ymm3, [rcx + 240], 1

vunpcklpd
vunpckhpd

ymm10, ymm2, ymm3
ymm11, ymm2, ymm3

vshufps
vmovaps
vshufps
vmovaps
vshufps
vmovaps
vshufps
vmovaps
dec
jnz

ymm4, ymm8, ymm10, 0x88
[rdx+128], ymm4
ymm5, ymm8, ymm10, 0xDD
[rdx+160], ymm5
ymm6, ymm9, ymm11, 0x88
[rdx+192], ymm6
ymm7, ymm9, ymm11, 0xDD
[rdx+224], ymm7
r10
loop1

In Example 11-20, this reduced port 5 pressure further than the combination of VSHUFPS with
VBLENDPS in Example 11-19. It can gain 70% speedup relative to relying on VSHUFPS alone in Example
11-19.

11.11.3 Perform Basic Shuffles on Load Ports
Some shuffles can be executed in the load ports (ports 2, 3) if the source is from memory. The following
example shows how moving some shuffles (vmovsldup/vmovshdup) from Port 5 to the load ports
improves performance significantly.
The following figure describes an Intel AVX implementation of the complex multiply algorithm with
vmovsldup/vmovshdup on the load ports.

11-32

OPTIMIZATIONS FOR INTEL® AVX, FMA AND AVX2

Example 11-21 includes two versions of the complex multiply. Both versions are unrolled twice. Alternative 1 shuffles all the data in registers. Alternative 2 shuffles data while it is loaded from memory.

Example 11-21. Port 5 versus Load Port Shuffles
Shuffles data in registers
mov
mov
mov
mov
xor

loop1:
vmovaps ymm0, [rax +8*rcx]
vmovaps ymm4, [rax +8*rcx +32]
ymm3, [rbx +8*rcx]
vmovsldup ymm2, ymm3
vmulps ymm2, ymm2, ymm0
vshufps ymm0, ymm0, ymm0, 177
vmovshdup ymm1, ymm3
vmulps ymm1, ymm1, ymm0
vmovaps ymm7, [rbx +8*rcx +32]
vmovsldup ymm6, ymm7
vmulps ymm6, ymm6, ymm4
vaddsubps ymm2, ymm2, ymm1
vmovshdup ymm5, ymm7

Shuffling loaded data
mov
mov
mov
mov
xor

rax, inPtr1
rbx, inPtr2
rdx, outPtr
r8, len
rcx, rcx

vmovaps

rax, inPtr1
rbx, inPtr2
rdx, outPtr
r8, len
rcx, rcx

loop1:
vmovaps ymm0, [rax +8*rcx]
vmovaps ymm4, [rax +8*rcx +32]
vmovsldup ymm2, [rbx +8*rcx]
vmulps ymm2, ymm2, ymm0
vshufps ymm0, ymm0, ymm0, 177
vmovshdup ymm1, [rbx +8*rcx]
vmulps ymm1, ymm1, ymm0
vmovsldup ymm6, [rbx +8*rcx +32]
vmulps ymm6, ymm6, ymm4
vaddsubps ymm3, ymm2, ymm1
vmovshdup ymm5, [rbx +8*rcx +32]

11-33

OPTIMIZATIONS FOR INTEL® AVX, FMA AND AVX2

Example 11-21. Port 5 versus Load Port Shuffles (Contd.)
Shuffles data in registers
Shuffling loaded data
vmovaps [rdx+8*rcx], ymm2
vshufps ymm4, ymm4, ymm4, 177
vmulps ymm5, ymm5, ymm4
vaddsubps ymm6, ymm6, ymm5
vmovaps [rdx+8*rcx+32], ymm6

vmovaps [rdx +8*rcx], ymm3
vshufps ymm4, ymm4, ymm4, 177
vmulps ymm5, ymm5, ymm4
vaddsubps ymm7, ymm6, ymm5
vmovaps [rdx +8*rcx +32], ymm7

addrcx, 8
cmprcx, r8
jl loop1

addrcx, 8
cmprcx, r8
jl loop1

11.12

DIVIDE AND SQUARE ROOT OPERATIONS

In Intel microarchitectures prior to Skylake, the SSE divide and square root instructions DIVPS and
SQRTPS have a latency of 14 cycles (or the neighborhood) and they are not pipelined. This means that
the throughput of these instructions is one in every 14 cycles. The 256-bit Intel AVX instructions VDIVPS
and VSQRTPS execute with 128-bit data path and have a latency of 28 cycles and they are not pipelined
as well. Therefore, the performance of the Intel SSE divide and square root instructions is similar to the
Intel AVX 256-bit instructions on Intel microarchitecture code name Sandy Bridge.
With the Skylake microarchitecture, 256-bit and 128-bit version of (V)DIVPS/(V)SQRTPS have the same
latency because the 256-bit version can execute with a 256-bit data path. The latency is improved and is
pipelined to execute with significantly improved throughput. See Appendix C, “IA-32 Instruction Latency
and Throughput”.
In microarchitectures that provide DIVPS/SQRTPS with high latency and low throughput, it is possible to
speed up single-precision divide and square root calculations using the (V)RSQRTPS and (V)RCPPS
instructions. For example, with 128-bit RCPPS/RSQRTPS at 5-cycle latency and 1-cycle throughput or
with 256-bit implementation of these instructions at 7-cycle latency and 2-cycle throughput, a single
Newton-Raphson iteration or Taylor approximation can achieve almost the same precision as the
(V)DIVPS and (V)SQRTPS instructions. See Intel® 64 and IA-32 Architectures Software Developer's
Manual for more information on these instructions.
In some cases, when the divide or square root operations are part of a larger algorithm that hides some
of the latency of these operations, the approximation with Newton-Raphson can slow down execution,
because more micro-ops, coming from the additional instructions, fill the pipe.
With the Skylake microarchitecture, choosing between approximate reciprocal instruction alternative
versus DIVPS/SQRTPS for optimal performance of simple algebraic computations depend on a number of
factors. Table 11-5 shows several algebraic formula the throughput comparison of implementations of
different numeric accuracy tolerances. In each row, 24-bit accurate implementations are IEEE-compliant
and using the respective instructions of 128-bit or 256-bit ISA. The columns of 22-bit and 11-bit accurate
implementations are using approximate reciprocal instructions of the respective instruction set.

Table 11-5. Comparison of Numeric Alternatives of Selected Linear Algebra in Skylake Microarchitecture
Algorithm

Instruction Type

24-bit Accurate

22-bit Accurate

11-bit Accurate

Z = X/Y

SSE

1X

0.9X

1.3X

256-bit AVX

1X

1.5X

2.6X

SSE

1X

0.7X

2X

256-bit AVX

1X

1.4X

3.4X

SSE

1X

1.7X

4.3X

256-bit AVX

1X

3X

7.7X

0.5

Z=X

-0.5

Z=X

11-34

OPTIMIZATIONS FOR INTEL® AVX, FMA AND AVX2

Table 11-5. Comparison of Numeric Alternatives of Selected Linear Algebra in Skylake Microarchitecture
Algorithm
Z = (X *Y + Y*Y

)0.5

Z = (X+2Y+3)/(Z-2Y-3)

Instruction Type

24-bit Accurate

22-bit Accurate

11-bit Accurate

SSE

1X

0.75X

0.85X

256-bit AVX

1X

1.1X

1.6X

SSE

1X

0.85X

1X

256-bit AVX

1X

0.8X

1X

If targeting processors based on the Skylake microarchitecture, Table 11-5 can be summarized as:

•

For 256- bit AVX code, Newton-Raphson approximation can be beneficial on Skylake microarchitecture when the algorithm contains only operations executed on the divide unit. However, when
single precision divide or square root operations are part of a longer computation, the lower latency
of the DIVPS or SQRTPS instructions can lead to better overall performance.

•

For SSE or 128-bit AVX implementation, consider use of approximation for divide and square root
instructions only for algorithms that do not require precision higher than 11-bit or algorithms that
contain multiple operations executed on the divide unit.

Table 11-6 summarizes recommended calculation methods of divisions or square root when using singleprecision instructions, based on the desired accuracy level across recent generations of Intel microarchitectures.

Table 11-6. Single-Precision Divide and Square Root Alternatives
Operation

Accuracy Tolerance

Recommendation

Divide

24 bits (IEEE)

DIVPS

~ 22 bits

Skylake: Consult Table 11-5
Prior uarch: RCPPS + 1 Newton-Raphson Iteration + MULPS

Reciprocal square
root

Square root

~ 11 bits

RCPPS + MULPS

24 bits (IEEE)

SQRTPS + DIVPS

~ 22 bits

RSQRTPS + 1 Newton-Raphson Iteration

~ 11 bits

RSQRTPS

24 bits (IEEE)

SQRTPS

~ 22 bits

Skylake: Consult Table 11-5
Prior uarch: RSQRTPS + 1 Newton-Raphson Iteration + MULPS

~ 11 bits

RSQRTPS + RCPPS

11.12.1 Single-Precision Divide
To compute:
Z[i]=A[i]/B[i]
On a large vector of single-precision numbers, Z[i] can be calculated by a divide operation, or by multiplying 1/B[i] by A[i].
Denoting B[i] by N, it is possible to calculate 1/N using the (V)RCPPS instruction, achieving approximately 11-bit precision.
For better accuracy you can use the one Newton-Raphson iteration:
X_(0 ) ~= 1/N

; Initial estimation, rcp(N)

X_(0 ) = 1/N*(1-E)
E=1-N*X_0
X_1=X_0*(1+E)=1/N*(1-E^2 )

; E ~= 2^(-11)
; E^2 ~= 2^(-22)
11-35

OPTIMIZATIONS FOR INTEL® AVX, FMA AND AVX2

X_1=X_0*(1+1-N*X_0 )= 2 *X_0 - N*X_0^2
X_1 is an approximation of 1/N with approximately 22-bit precision.

Example 11-22. Divide Using DIVPS for 24-bit Accuracy
SSE code using DIVPS
Using VDIVPS
mov rax, pIn1
mov rbx, pIn2
mov rcx, pOut
mov rsi, iLen
xor rdx, rdx

mov rax, pIn1
mov rbx, pIn2
mov rcx, pOut
mov rsi, iLen
xor rdx, rdx

loop1:
movups xmm0, [rax+rdx*1]
movups xmm1, [rbx+rdx*1]
divps xmm0, xmm1
movups [rcx+rdx*1], xmm0
add rdx, 0x10
cmp rdx, rsi
jl loop1

loop1:
vmovups ymm0, [rax+rdx*1]
vmovups ymm1, [rbx+rdx*1]
vdivps ymm0, ymm0, ymm1
vmovups [rcx+rdx*1], ymm0
add rdx, 0x20
cmp rdx, rsi
jl loop1

Example 11-23. Divide Using RCPPS 11-bit Approximation
SSE code using RCPPS
Using VRCPPS
mov rax, pIn1
mov rbx, pIn2
mov rcx, pOut
mov rsi, iLen
xor rdx, rdx

mov rax, pIn1
mov rbx, pIn2
mov rcx, pOut
mov rsi, iLen
xor rdx, rdx

loop1:
movups xmm0,[rax+rdx*1]
movups xmm1,[rbx+rdx*1]
rcpps xmm1,xmm1
mulps xmm0,xmm1
movups [rcx+rdx*1],xmm0
add rdx, 16
cmp rdx, rsi
jl loop1

loop1:
vmovups ymm0, [rax+rdx]
vmovups ymm1, [rbx+rdx]
vrcpps ymm1, ymm1
vmulps ymm0, ymm0, ymm1
vmovups [rcx+rdx], ymm0
add rdx, 32
cmp rdx, rsi
jl loop1

Example 11-24. Divide Using RCPPS and Newton-Raphson Iteration
RCPPS + MULPS ~ 22 bit accuracy
VRCPPS + VMULPS ~ 22 bit accuracy
mov rax, pIn1
mov rbx, pIn2
mov rcx, pOut
mov rsi, iLen
xor rdx, rdx

11-36

mov rax, pIn1
mov rbx, pIn2
mov rcx, pOut
mov rsi, iLen
xor rdx, rdx

OPTIMIZATIONS FOR INTEL® AVX, FMA AND AVX2

Example 11-24. Divide Using RCPPS and Newton-Raphson Iteration (Contd.)
RCPPS + MULPS ~ 22 bit accuracy
VRCPPS + VMULPS ~ 22 bit accuracy
loop1:
movups xmm0, [rax+rdx*1]
movups xmm1, [rbx+rdx*1]
rcpps xmm3, xmm1
movaps xmm2, xmm3
addps xmm3, xmm2
mulps xmm2, xmm2
mulps xmm2, xmm1
subps xmm3, xmm2
mulps xmm0, xmm3
movups xmmword ptr [rcx+rdx*1], xmm0
add rdx, 0x10
cmp rdx, rsi
jl loop1

loop1:
vmovups ymm0, [rax+rdx]
vmovups ymm1, [rbx+rdx]
vrcpps ymm3, ymm1
vaddps ymm2, ymm3, ymm3
vmulps ymm3, ymm3, ymm3
vmulps ymm3, ymm3, ymm1
vsubps ymm2, ymm2, ymm3
vmulps ymm0, ymm0, ymm2
vmovups [rcx+rdx], ymm0
add rdx, 32
cmp rdx, rsi
jl loop1

Table 11-7. Comparison of Single-Precision Divide Alternatives
Accuracy

Method

SSE Performance

AVX Performance

24 bits

(V)DIVPS

Baseline

1X

~ 22 bits

(V)RCPPS + Newton-Raphson

2.7X

4.5X

~ 11 bits

(V)RCPPS

6X

8X

11.12.2 Single-Precision Reciprocal Square Root
To compute Z[i]=1/ (A[i]) ^0.5 on a large vector of single-precision numbers, denoting A[i] by N, it is
possible to calculate 1/N using the (V)RSQRTPS instruction.
For better accuracy you can use one Newton-Raphson iteration:
X_0 ~=1/N ; Initial estimation RCP(N)
E=1-N*X_0^2
X_0= (1/N)^0.5 * ((1-E)^0.5 ) = (1/N)^0.5 * (1-E/2) ; E/2~= 2^(-11)
X_1=X_0*(1+E/2) ~= (1/N)^0.5 * (1-E^2/4)

; E^2/4?2^(-22)

X_1=X_0*(1+1/2-1/2*N*X_0^2 )= 1/2*X_0*(3-N*X_0^2)
X1 is an approximation of (1/N)^0.5 with approximately 22-bit precision.

11-37

OPTIMIZATIONS FOR INTEL® AVX, FMA AND AVX2

Example 11-25. Reciprocal Square Root Using DIVPS+SQRTPS for 24-bit Accuracy
Using SQRTPS, DIVPS
Using VSQRTPS, VDIVPS
mov rax, pIn
mov rbx, pOut
mov rcx, iLen
xor rdx, rdx
loop1:
movups xmm1, [rax+rdx]
sqrtps xmm0, xmm1
divps xmm0, xmm1
movups [rbx+rdx], xmm0
add rdx, 16
cmp rdx, rcx
jl loop1

mov rax, pIn
mov rbx, pOut
mov rcx, iLen
xor rdx, rdx
loop1:
vmovups ymm1, [rax+rdx]
vsqrtps ymm0,ymm1
vdivps ymm0, ymm0, ymm1
vmovups [rbx+rdx], ymm0
add rdx, 32
cmp rdx, rcx
jl loop1

Example 11-26. Reciprocal Square Root Using RCPPS 11-bit Approximation
SSE code using RCPPS
Using VRCPPS
mov rax, pIn
mov rbx, pOut
mov rcx, iLen
xor rdx, rdx
loop1:
rsqrtps xmm0, [rax+rdx]
movups [rbx+rdx], xmm0
add rdx, 16
cmp rdx, rcx
jl loop1

mov rax, pIn
mov rbx, pOut
mov rcx, iLen
xor rdx, rdx
loop1:
vrsqrtps ymm0, [rax+rdx]
vmovups [rbx+rdx], ymm0
add rdx, 32
cmp rdx, rcx
jl loop1

Example 11-27. Reciprocal Square Root Using RCPPS and Newton-Raphson Iteration
RCPPS + MULPS ~ 22 bit accuracy
VRCPPS + VMULPS ~ 22 bit accuracy
__declspec(align(16)) float minus_half[4] = {-0.5, -0.5, 0.5, -0.5};
__declspec(align(16)) float three[4] = {3.0, 3.0, 3.0,
3.0};
__asm
{
mov rax, pIn
mov rbx, pOut
mov rcx, iLen
xor rdx, rdx
movups xmm3, [three]
movups xmm4, [minus_half]

11-38

__declspec(align(32)) float half[8] =
{0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5};
__declspec(align(32)) float three[8] =
{3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0};
__asm
{
mov rax, pIn
mov rbx, pOut
mov rcx, iLen
xor rdx, rdx
vmovups ymm3, [three]
vmovups ymm4, [half]

OPTIMIZATIONS FOR INTEL® AVX, FMA AND AVX2

Example 11-27. Reciprocal Square Root Using RCPPS and Newton-Raphson Iteration (Contd.)
RCPPS + MULPS ~ 22 bit accuracy
VRCPPS + VMULPS ~ 22 bit accuracy
loop1:
movups xmm5, [rax+rdx]
rsqrtps xmm0, xmm5
movaps xmm2, xmm0
mulps xmm0, xmm0
mulps xmm0, xmm5
subps xmm0, xmm3
mulps xmm0, xmm2
mulps xmm0, xmm4
movups [rbx+rdx], xmm0

loop1:
vmovups ymm5, [rax+rdx]
vrsqrtps ymm0, ymm5
vmulps ymm2, ymm0, ymm0
vmulps ymm2, ymm2, ymm5
vsubps ymm2, ymm3, ymm2
vmulps ymm0, ymm0, ymm2
vmulps ymm0, ymm0, ymm4
vmovups [rbx+rdx], ymm0
add rdx, 32
cmp rdx, rcx
jl loop1

add rdx, 16
cmp rdx, rcx
jl loop1
}
}

Table 11-8. Comparison of Single-Precision Reciprocal Square Root Operation
Accuracy

Method

SSE Performance

AVX Performance

24 bits

(V)SQRTPS + (V)DIVPS

Baseline

1X

~ 22 bits

(V)RCPPS + Newton-Raphson

5.2X

9.1X

~ 11 bits

(V)RCPPS

13.5X

17.5X

11.12.3 Single-Precision Square Root
To compute Z[i]= (A[i])^0.5 on a large vector of single-precision numbers, denoting A[i] by N, the
approximation for N^0.5 is N multiplied by (1/N)^0.5 , where the approximation for (1/N)^0.5 is
described in the previous section.
To get approximately 22-bit precision of N^0.5, use the following calculation:
N^0.5 = X_1*N = 1/2*N*X_0*(3-N*X_0^2)

Example 11-28. Square Root Using SQRTPS for 24-bit Accuracy
Using SQRTPS
Using VSQRTPS
mov rax, pIn
mov rbx, pOut
mov rcx, iLen
xor rdx, rdx
loop1:
movups xmm1, [rax+rdx]
sqrtps xmm1, xmm1
movups [rbx+rdx], xmm1
add rdx, 16
cmp rdx, rcx
jl loop1

mov rax, pIn
mov rbx, pOut
mov rcx, iLen
xor rdx, rdx
loop1:
vmovups ymm1, [rax+rdx]
vsqrtps ymm1,ymm1
vmovups [rbx+rdx], ymm1
add rdx, 32
cmp rdx, rcx
jl loop1

11-39

OPTIMIZATIONS FOR INTEL® AVX, FMA AND AVX2

Example 11-29. Square Root Using RCPPS 11-bit Approximation
SSE code using RCPPS
Using VRCPPS
mov rax, pIn
mov rbx, pOut
mov rcx, iLen
xor rdx, rdx
loop1:
movups xmm1, [rax+rdx]
xorps xmm8, xmm8
cmpneqps xmm8, xmm1
rsqrtps xmm1, xmm1
rcpps xmm1, xmm1
andps xmm1, xmm8
movups [rbx+rdx], xmm1
add rdx, 16
cmp rdx, rcx
jl loop1

mov rax, pIn
mov rbx, pOut
mov rcx, iLen
xor rdx, rdx
vxorps ymm8, ymm8, ymm8
loop1:
vmovups ymm1, [rax+rdx]
vcmpneqps ymm9, ymm8, ymm1
vrsqrtps ymm1, ymm1
vrcpps ymm1, ymm1
vandps ymm1, ymm1, ymm9
vmovups [rbx+rdx], ymm1
add rdx, 32
cmp rdx, rcx
jl loop1

Example 11-30. Square Root Using RCPPS and One Taylor Series Expansion
RCPPS + Taylor ~ 22 bit accuracy
VRCPPS + Taylor ~ 22 bit accuracy
__declspec(align(16)) float minus_half[4] = {-0.5, -0.5, 0.5, -0.5};
__declspec(align(16)) float three[4] = {3.0, 3.0, 3.0,
3.0};
__asm
{
mov rax, pIn
mov rbx, pOut
mov rcx, iLen
xor rdx, rdx
movups xmm6, [three]
movups xmm7, [minus_half]

loop1:
movups xmm3, [rax+rdx]
rsqrtps xmm1, xmm3
xorps xmm8, xmm8
cmpneqps xmm8, xmm3
andps xmm1, xmm8
movaps xmm4, xmm1
mulps xmm1, xmm3
movaps xmm5, xmm1
mulps xmm1, xmm4

11-40

__declspec(align(32)) float three[8] =
{3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0};
__declspec(align(32)) float minus_half[8] =
{-0.5, -0.5, -0.5, -0.5, -0.5, -0.5, -0.5, -0.5};
__asm
{
mov rax, pIn
mov rbx, pOut
mov rcx, iLen
xor rdx, rdx
vmovups ymm6, [three]
vmovups ymm7, [minus_half]
vxorps ymm8, ymm8, ymm8
loop1:
vmovups ymm3, [rax+rdx]
vrsqrtps ymm4, ymm3
vcmpneqps ymm9, ymm8, ymm3
vandps ymm4, ymm4, ymm9
vmulps ymm1,ymm4, ymm3
vmulps ymm2, ymm1, ymm4

OPTIMIZATIONS FOR INTEL® AVX, FMA AND AVX2

Example 11-30. Square Root Using RCPPS and One Taylor Series Expansion (Contd.)
RCPPS + Taylor ~ 22 bit accuracy
VRCPPS + Taylor ~ 22 bit accuracy
subps xmm1, xmm6
mulps xmm1, xmm5
mulps xmm1, xmm7
movups [rbx+rdx], xmm1
add rdx, 16
cmp rdx, rcx
jl loop1
}

vsubps ymm2, ymm2, ymm6
vmulps ymm1, ymm1, ymm2
vmulps ymm1, ymm1, ymm7
vmovups [rbx+rdx], ymm1
add rdx, 32
cmp rdx, rcx
jl loop1
}

Table 11-9. Comparison of Single-Precision Square Root Operation
Accuracy

Method

SSE Performance

AVX Performance

24 bits

(V)SQRTPS

Baseline

1X

~ 22 bits

(V)RCPPS + Taylor-Expansion

2.3X

4.3X

~ 11 bits

(V)RCPPS

4.7X

5.9X

11.13

OPTIMIZATION OF ARRAY SUB SUM EXAMPLE

This section shows the transformation of SSE implementation of Array Sub Sum algorithm to Intel AVX
implementation.
The Array Sub Sum algorithm is:
Y[i] = Sum of k from 0 to i ( X[k]) = X[0] + X[1] + .. + X[i]
The following figure describes the SSE implementation.

The figure below describes the Intel AVX implementation of the Array Sub Sums algorithm. The PSLLDQ
is an integer SIMD instruction which does not have a 256-bit equivalent. It is replaced by VSHUFPS.

11-41

OPTIMIZATIONS FOR INTEL® AVX, FMA AND AVX2

Example 11-31. Array Sub Sums Algorithm
SSE code
mov
rax, InBuff
mov
rbx, OutBuff
mov
rdx, len
xor
rcx, rcx
xorps xmm0, xmm0
loop1:
movaps xmm2, [rax+4*rcx]
movaps xmm3, [rax+4*rcx]
movaps xmm4, [rax+4*rcx]
movaps xmm5, [rax+4*rcx]
pslldq xmm3, 4
pslldq xmm4, 8
pslldq xmm5, 12
addps xmm2, xmm3
addps xmm4, xmm5
addps xmm2, xmm4
addps xmm2, xmm0
movaps xmm0, xmm2
shufps xmm0, xmm2, 0xFF
movaps [rbx+4*rcx], xmm2
add
rcx, 4
cmp
rcx, rdx
jl
loop1

AVX code
mov
rax, InBuff
mov
rbx, OutBuff
mov
rdx, len
xor
rcx, rcx
vxorps ymm0, ymm0, ymm0
vxorps ymm1, ymm1, ymm1
loop1:
vmovaps ymm2, [rax+4*rcx]
vshufps ymm4, ymm0, ymm2, 0x40
vshufps ymm3, ymm4, ymm2, 0x99
vshufps ymm5, ymm0, ymm4, 0x80
vaddps ymm6, ymm2, ymm3
vaddps ymm7, ymm4, ymm5
vaddps ymm9, ymm6, ymm7
vaddps ymm1, ymm9, ymm1
vshufps ymm8, ymm9, ymm9, 0xff
vperm2f128 ymm10, ymm8, ymm0, 0x2
vaddps ymm12, ymm1, ymm10
vshufps ymm11, ymm12, ymm12, 0xff
vperm2f128 ymm1, ymm11, ymm11, 0x11
vmovaps [rbx+4*rcx], ymm12
add
rcx, 8
cmp
rcx, rdx
jl
loop1

Example 11-31 shows SSE implementation of array sub summ and AVX implementation. The AVX code is
about 40% faster.

11-42

OPTIMIZATIONS FOR INTEL® AVX, FMA AND AVX2

11.14

HALF-PRECISION FLOATING-POINT CONVERSIONS

In applications that use floating-point and require only the dynamic range and precision offered by the
16-bit floating-point format, storing persistent floating-point data encoded in 16-bits has strong advantages in memory footprint and bandwidth conservation. These situations are encountered in some
graphics and imaging workloads.
The encoding format of half-precision floating-point numbers can be found in Chapter 4, “Data Types” of
Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1.
Instructions to convert between packed, half-precision floating-point numbers and packed single-precision floating-point numbers is described in Chapter 14, “Programming with AVX, FMA and AVX2” of
Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1 and in the reference pages of
Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 2B.
To perform computations on half precision floating-point data, packed 16-bit FP data elements must be
converted to single precision format first, and the single-precision results converted back to half precision format, if necessary. These conversions of 8 data elements using 256-bit instructions are very fast
and handle the special cases of denormal numbers, infinity, zero and NaNs properly.

11.14.1 Packed Single-Precision to Half-Precision Conversion
To convert the data in single precision floating-point format to half precision format, without special hardware support like VCVTPS2PH, a programmer needs to do the following:

•
•
•
•
•

Correct exponent bias to permitted range for each data element.
Shift and round the significand of each data element.
Copy the sign bit to bit 15 of each element.
Take care of numbers outside the half precision range.
Pack each data element to a register of half size.

Example 11-32 compares two implementations of floating-point conversion from single precision to half
precision. The code on the left uses packed integer shift instructions that is limited to 128-bit SIMD
instruction set. The code on right is unrolled twice and uses the VCVTPS2PH instruction.

Example 11-32. Single-Precision to Half-Precision Conversion
AVX-128 code
VCVTPS2PH code
__asm {
mov
rax, pIn
mov
rbx, pOut
mov
rcx, bufferSize
add
rcx, rax
vmovdqu xmm0,SignMask16
vmovdqu xmm1,ExpBiasFixAndRound
vmovdqu xmm4,SignMaskNot32
vmovdqu xmm5,MaxConvertibleFloat
vmovdqu xmm6,MinFloat
loop:
vmovdqu
xmm2, [rax]
vmovdqu
xmm3, [rax+16]
vpaddd
xmm7, xmm2, xmm1
vpaddd
xmm9, xmm3, xmm1
vpand
xmm7, xmm7, xmm4
vpand
xmm9, xmm9, xmm4
add
rax, 32

__asm {
mov
mov
mov
add
loop:
vmovups
vmovups
add
vcvtps2ph
vcvtps2ph
add
cmp
jl

rax, pIn
rbx, pOut
rcx, bufferSize
rcx, rax
ymm0,[rax]
ymm1,[rax+32]
rax, 64
[rbx],ymm0, roundingCtrl
[rbx+16],ymm1,roundingCtrl
rbx, 32
rax, rcx
loop

11-43

OPTIMIZATIONS FOR INTEL® AVX, FMA AND AVX2

Example 11-32. Single-Precision to Half-Precision Conversion (Contd.)
AVX-128 code
VCVTPS2PH code
vminps
vminps
vpcmpgtd
vpcmpgtd
vpand
vpand
vpackssdw
vpsrad
vpsrad
vpand
vpackssdw
vpaddw
vmovdqu
add
cmp
jl

xmm7, xmm7, xmm5
xmm9, xmm9, xmm5
xmm8, xmm7, xmm6
xmm10, xmm9, xmm6
xmm7, xmm8, xmm7
xmm9, xmm10, xmm9
xmm2, xmm3, xmm2
xmm7, xmm7, 13
xmm8, xmm9, 13
xmm2, xmm2, xmm0
xmm3, xmm7, xmm9
xmm3, xmm3, xmm2
[rbx], xmm3
rbx, 16
rax, rcx
loop

The code using VCVTPS2PH is approximately four times faster than the AVX-128 sequence. Although it is
possible to load 8 data elements at once with 256-bit AVX, most of the per-element conversion operations require packed integer instructions which do not have 256-bit extensions yet. Using VCVTPS2PH is
not only faster but also provides handling of special cases that do not encode to normal half-precision
floating-point values.

11.14.2 Packed Half-Precision to Single-Precision Conversion
Example 11-33 compares two implementations using AVX-128 code and with VCVTPH2PS.
Conversion from half precision to single precision floating-point format is easier to implement, yet using
VCVTPH2PS instruction performs about 2.5 times faster than the alternative AVX-128 code.

Example 11-33. Half-Precision to Single-Precision Conversion
AVX-128 code
VCVTPS2PH code
__asm {
mov
rax, pIn
mov
rbx, pOut
mov
rcx, bufferSize
add
rcx, rax
vmovdqu
xmm0,SignMask16
vmovdqu
xmm1,ExpBiasFix16
vmovdqu
xmm2,ExpMaskMarker
loop:
vmovdqu
xmm3, [rax]
add
rax, 16
vpandn
xmm4, xmm0, xmm3
vpand
xmm5, xmm3, xmm0
vpsrlw
xmm4, xmm4, 3
vpaddw
xmm6, xmm4, xmm1
vpcmpgtw
xmm7, xmm6, xmm2

11-44

__asm {
mov
rax, pIn
mov
rbx, pOut
mov
rcx, bufferSize
add
rcx, rax
loop:
vcvtph2ps
ymm0,[rax]
vcvtph2ps
ymm1,[rax+16]
add
rax, 32
vmovups
[rbx], ymm0
vmovups
[rbx+32], ymm1
add
rbx, 64
cmp
rax, rcx
jl
loop

OPTIMIZATIONS FOR INTEL® AVX, FMA AND AVX2

Example 11-33. Half-Precision to Single-Precision Conversion (Contd.)
AVX-128 code
VCVTPS2PH code
vpand
vpand
vpor
vpsllw
vpunpcklwd
vpunpckhwd
vmovdqu
vmovdqu
add
cmp
jl

xmm6, xmm6, xmm7
xmm8, xmm3, xmm7
xmm6, xmm6, xmm5
xmm8, xmm8, 13
xmm3, xmm8, xmm6
xmm4, xmm8, xmm6
[rbx], xmm3
[rbx+16], xmm4
rbx, 32
rax, rcx
loop

11.14.3 Locality Consideration for using Half-Precision FP to Conserve Bandwidth
Example 11-32 and Example 11-33 demonstrate the performance advantage of using FP16C instructions
when software needs to convert between half-precision and single-precision data. Half-precision FP
format is more compact, consumes less bandwidth than single-precision FP format, but sacrifices
dynamic range, precision, and incurs conversion overhead if arithmetic computation is required. Whether
it is profitable for software to use half-precision data will be highly dependent on locality considerations
of the workload.
This section uses an example based on the horizontal median filtering algorithm, “Median3”. The Median3
algorithm calculates the median of every three consecutive elements in a vector:
Y[i] = Median3( X[i], X[i+1], X[i+2])
Where: Y is the output vector, and X is the input vector.
Example 11-34 shows two implementations of the Median3 algorithm; one uses single-precision format
without conversion, the other uses half-precision format and requires conversion. Alternative 1 on the
left works with single precision format using 256-bit load/store operations, each of which loads/stores
eight 32-bit numbers. Alternative 2 uses 128-bit load/store operations to load/store eight 16-bit
numbers in half precision format and VCVTPH2PS/VCVTPS2PH instructions to convert it to/from single
precision floating-point format.

Example 11-34. Performance Comparison of Median3 using Half-Precision vs. Single-Precision
Single-Precision code w/o Conversion
Half-Precision code w/ Conversion
__asm {
xor rbx, rbx
mov rcx, len
mov rdi, inPtr
mov rsi, outPtr
vmovaps ymm0, [rdi]
loop:
add rdi, 32
vmovaps ymm6, [rdi]
vperm2f128 ymm1, ymm0, ymm6, 0x21
vshufps ymm3, ymm0, ymm1, 0x4E
vshufps ymm2, ymm0, ymm3, 0x99
vminps ymm5, ymm0 ,ymm2
vmaxps ymm0, ymm0, ymm2

__asm {
xor rbx, rbx
mov rcx, len
mov rdi, inPtr
mov rsi, outPtr
vcvtph2ps ymm0, [rdi]
loop:
add rdi,16
vcvtph2ps ymm6, [rdi]
vperm2f128 ymm1, ymm0, ymm6, 0x21
vshufps ymm3, ymm0, ymm1, 0x4E
vshufps ymm2, ymm0, ymm3, 0x99
vminps ymm5, ymm0 ,ymm2
vmaxps ymm0, ymm0, ymm2

11-45

OPTIMIZATIONS FOR INTEL® AVX, FMA AND AVX2

Example 11-34. Performance Comparison of Median3 using Half-Precision vs. Single-Precision (Contd.)
Single-Precision code w/o Conversion
Half-Precision code w/ Conversion
vminps ymm4, ymm0, ymm3
vmaxps ymm7, ymm4, ymm5
vmovaps ymm0, ymm6
vmovaps [rsi], ymm7
add rsi, 32
add rbx, 8
cmp rbx, rcx
jl loop

vminps ymm5, ymm0 ,ymm2
vmaxps ymm0, ymm0, ymm2
vminps ymm4, ymm0, ymm3
vmaxps ymm7, ymm4, ymm5
vmovaps ymm0, ymm6
vcvtps2ph [rsi], ymm7, roundingCtrl
add
rsi, 16
add rbx, 8
cmp rbx, rcx
jl loop

When the locality of the working set resides in memory, using half-precision format with processors
based on Intel microarchitecture code name Ivy Bridge is about 30% faster than single-precision format,
despite the conversion overhead. When the locality resides in L3, using half-precision format is still
~15% faster. When the locality resides in L1, using single-precision format is faster because the cache
bandwidth of the L1 data cache is much higher than the rest of the cache/memory hierarchy and the
overhead of the conversion becomes a performance consideration.

11.15

FUSED MULTIPLY-ADD (FMA) INSTRUCTIONS GUIDELINES

FMA instructions perform vectored operations of “a * b + c” on IEEE-754-2008 floating-point values,
where the multiplication operations “a * b” are performed with infinite precision, the final results of the
addition are rounded to produced the desired precision. Details of FMA rounding behavior and special
case handling can be found in section 2.3 of Intel® Architecture Instruction Set Extensions Programming
Reference.
FMA instruction can speed up and improve the accuracy of many FP calculations. Intel microarchitecture
code name Haswell implements FMA instructions with execution units on port 0 and port 1 and 256-bit
data paths. Dot product, matrix multiplication and polynomial evaluations are expected to benefit from
the use of FMA, 256-bit data path and the independent executions on two ports. The peak throughput of
FMA from each processor core are 16 single-precision and 8 double-precision results each cycle.
Algorithms designed to use FMA instruction should take into consideration that non-FMA sequence of
MULPD/PS and ADDPD/PS likely will produce slightly different results compared to using FMA. For numerical computations involving a convergence criteria, the difference in the precision of intermediate results
must be factored into the numeric formalism to avoid surprise in completion time due to rounding issues.
User/Source Coding Rule 33. Factor in precision and rounding characteristics of FMA instructions
when replacing multiply/add operations executing non-FMA instructions. FMA improves performance
when an algorithm is execution-port throughput limited, like DGEMM.
There may be situations where using FMA might not deliver better performance. Consider the vectored
operation of “a * b + c * d” and data are ready at the same time:
In the three-instruction sequence of
VADDPS ( VMULPS (a,b) , VMULPS (c,b) );
VMULPS can be dispatched in the same cycle and execute in parallel, leaving the latency of VADDPS (3
cycle) exposed. With unrolling the exposure of VADDPS latency may be further amortized.
When using the two-instruction sequence of
VFMADD213PS ( c, d, VMULPS (a,b) );
The latency of FMA (5 cycle) is exposed for producing each vector result.
User/Source Coding Rule 34. Factor in result-dependency, latency of FP add vs. FMA instructions
when replacing FP add operations with FMA instructions.

11-46

OPTIMIZATIONS FOR INTEL® AVX, FMA AND AVX2

11.15.1 Optimizing Throughput with FMA and Floating-Point Add/MUL
In the Skylake microarchitecture, there are two pipes of executions supporting FMA, vector FP Multiply,
and FP ADD instructions. All three categories of instructions have a latency of 4 cycles and can dispatch
to either port 0 or port 1 to execute every cycle.
The arrangement of identical latency and number of pipes allows software to increase the performance
of situations where floating-point calculations are limited by the floating-point add operations that follow
FP multiplies. Consider a situation of vector operation An = C1 + C2 * An-1:

Example 11-35. FP Mul/FP Add Versus FMA
FP Mul/FP Add Sequence
mov eax, NumOfIterations
mov rbx, pA
mov rcx, pC1
mov rdx, pC2
vmovups ymm0, Ymmword ptr [rbx] // A
vmovups ymm1, Ymmword ptr [rcx] // C1
vmovups ymm2, Ymmword ptr [rdx] // C2
loop:
vmulps ymm4, ymm0 ,ymm2 // A * C2
vaddps ymm0, ymm1, ymm4
dec eax
jnz loop

FMA Sequence
mov eax, NumOfIterations
mov rbx, pA
mov rcx, pC1
mov rdx, pC2
vmovups ymm0, Ymmword ptr [rbx] // A
vmovups ymm1, Ymmword ptr [rcx] // C1
vmovups ymm2, Ymmword ptr [rdx] // C2
loop:
vfmadd132ps ymm0, ymm1, ymm2 // C1 + A * C2
dec eax
jnz loop
vmovups ymmword ptr[rbx], ymm0 // store An

vmovups ymmword ptr[rbx], ymm0 // store An
Cost per iteration: ~ fp add latency + fp add latency

Cost per iteration: ~ fma latency

The overall throughput of the code sequence on the LHS is limited by the combined latency of the FP MUL
and FP ADD instructions of specific microarchitecture. The overall throughput of the code sequence on
the RHS is limited by the throughput of the FMA instruction of the corresponding microarchitecture.
A common situation where the latency of the FP ADD operation dominates performance is the following
C code:
for ( int 1 = 0; i < arrLenght; i ++) result += arrToSum[i];
Example 11-35 shows two implementations with and without unrolling.

Example 11-36. Unrolling to Hide Dependent FP Add Latency
No Unroll
Unroll 8 times
mov eax, arrLength
mov rbx, arrToSum
vmovups ymm0, Ymmword ptr [rbx]
sub eax, 8
loop:
add rbx, 32
vaddps ymm0, ymm0, ymmword ptr [rbx]
sub eax, 8
jnz loop

mov eax, arrLength
mov rbx, arrToSum
vmovups ymm0, ymmword ptr [rbx]
vmovups ymm1, ymmword ptr 32[rbx]
vmovups ymm2, ymmword ptr 64[rbx]
vmovups ymm3, ymmword ptr 96[rbx]
vmovups ymm4, ymmword ptr 128[rbx]
vmovups ymm5, ymmword ptr 160[rbx]
vmovups ymm6, ymmword ptr 192[rbx]
vmovups ymm7, ymmword ptr 224[rbx]

11-47

OPTIMIZATIONS FOR INTEL® AVX, FMA AND AVX2

Example 11-36. Unrolling to Hide Dependent FP Add Latency (Contd.)
No Unroll
Unroll 8 times
vextractf128 xmm1, ymm0, 1
vaddps xmm0, xmm0, xmm1
vpermilps xmm1, xmm0, 0xe
vaddps xmm0, xmm0, xmm1
vpermilps xmm1, xmm0, 0x1
vaddss xmm0, xmm0, xmm1

sub eax, 64
loop:
add rbx, 256
vaddps ymm0, ymm0, ymmword ptr [rbx]
vaddps ymm1, ymm1, ymmword ptr 32[rbx]
vaddps ymm2, ymm2, ymmword ptr 64[rbx]
vaddps ymm3, ymm3, ymmword ptr 96[rbx]
vaddps ymm4, ymm4, ymmword ptr 128[rbx]
vaddps ymm5, ymm5, ymmword ptr 160[rbx]
vaddps ymm6, ymm6, ymmword ptr 192[rbx]
vaddps ymm7, ymm7, ymmword ptr 224[rbx]
sub eax, 64
jnz loop
vaddps Ymm0, ymm0, ymm1
vaddps Ymm2, ymm2, ymm3
vaddps Ymm4, ymm4, ymm5
vaddps Ymm6, ymm6, ymm7
vaddps Ymm0, ymm0, ymm2
vaddps Ymm4, ymm4, ymm6
vaddps Ymm0, ymm0, ymm4

vmovss result, ymm0

vextractf128 xmm1, ymm0, 1
vaddps xmm0, xmm0, xmm1
vpermilps xmm1, xmm0, 0xe
vaddps xmm0, xmm0, xmm1
vpermilps xmm1, xmm0, 0x1
vaddss xmm0, xmm0, xmm1
vmovss result, ymm0

Without unrolling (LHS of Example 11-35), the cost of summing every 8 array elements is about proportional to the latency of the FP ADD instruction, assuming the working set fit in L1. To use unrolling effectively, the number of unrolled operations should be at least “latency of the critical operation” * “number
of pipes”. The performance gain of optimized unrolling versus no unrolling, for a given microarchitecture,
can approach “number of pipes” * “Latency of FP ADD”.
User/Source Coding Rule 35. Consider using unrolling technique for loops containing back-to-back
dependent FMA, FP Add or Vector MUL operations, The unrolling factor can be chosen by considering
the latency of the critical instruction of the dependency chain and the number of pipes available to
execute that instruction.

11.15.2 Optimizing Throughput with Vector Shifts
In the Skylake microarchitecture, many common vector shift instructions can dispatch into either port 0
or port 1, compared to only one port in prior generations, see Table 2-2 and Table 2-7.
A common situation where the latency of the FP ADD operation dominates performance is the following
C code, where a, b, and c are integer arrays:
for ( int 1 = 0; i < len; i ++) c[i] += 4* a[i] + b[i]/2;
Example 11-35 shows two implementations with and without unrolling.

11-48

OPTIMIZATIONS FOR INTEL® AVX, FMA AND AVX2

Example 11-37. FP Mul/FP Add Versus FMA
FP Mul/FP Add Sequence
mov eax, NumOfIterations
mov rbx, pA
mov rcx, pC1
mov rdx, pC2
vmovups ymm0, Ymmword ptr [rbx] // A
vmovups ymm1, Ymmword ptr [rcx] // C1
vmovups ymm2, Ymmword ptr [rdx] // C2
loop:
vmulps ymm4, ymm0 ,ymm2 // A * C2
vaddps ymm0, ymm1, ymm4
dec eax
jnz loop

FMA Sequence
mov eax, NumOfIterations
mov rbx, pA
mov rcx, pC1
mov rdx, pC2
vmovups ymm0, Ymmword ptr [rbx] // A
vmovups ymm1, Ymmword ptr [rcx] // C1
vmovups ymm2, Ymmword ptr [rdx] // C2
loop:
vfmadd132ps ymm0, ymm1, ymm2 // C1 + A * C2
dec eax
jnz loop
vmovups ymmword ptr[rbx], ymm0 // store An

vmovups ymmword ptr[rbx], ymm0 // store An
Cost per iteration: ~ fp add latency + fp add latency

11.16

Cost per iteration: ~ fma latency

AVX2 OPTIMIZATION GUIDELINES

AVX2 instructions promotes the great majority of 128-bit SIMD integer instructions to operate on 256-bit
YMM registers. AVX2 also adds a rich mix of broadcast/permute/variable-shift instructions to accelerate
numerical computations. The 256-bit AVX2 instructions are supported by the Intel microarchitecture
Haswell which implements 256-bit data path with low latency and high throughput.
Consider an intra-coding 4x4 block image transformation1 shown in Figure 11-3.
A 128-bit SIMD implementation can perform this transformation by the following technique:

•
•

Convert 8-bit pixels into 16-bit word elements and fetch two 4x4 image block as 4 row vectors.

•
•

The two 4x4 word-granular, intermediate result can be re-arranged into column vectors.

The matrix operation 1/128 * (B x R) can be evaluated with row vectors of the image block and
column vectors of the right-hand-side coefficient matrix using a sequence of SIMD instructions of
PMADDWD, PHADDD, packed shift and blend instructions.
The left-hand-side coefficient matrix in row vectors and the column vectors of the intermediate block
can be calculated (using PMADDWD, PHADDD, shift, blend) and written out.

1. C. Yeo, Y. H. Tan, Z. Li and S. Rahardja, “Mode-Dependent Fast Separable KLT for Block-based Intra
Coding,” JCTVC-B024, Geneva, Switzerland, Jul 2010
11-49

OPTIMIZATIONS FOR INTEL® AVX, FMA AND AVX2

11--------------128
128

29
29 55
55 74
74 84
84
74
74 74
74 00 ––74
74
84
–
29
–
74
55
84 – 29 – 74 55
55
55 ––84
84 74
74 ––29
29

X

L

X

64
1 - 84
-------128 64
35

64
35
– 64
– 84

B

64
– 35
– 64
84

64
– 84
64
– 35

R

Figure 11-3. 4x4 Image Block Transformation
The same technique can be implemented using AVX2 instructions in a straightforward manner. The AVX2
sequence is illustrated in Example 11-38 and Example 11-39.

Example 11-38. Macros for Separable KLT Intra-block Transformation Using AVX2
// b0: input row vector from 4 consecutive 4x4 image block of word pixels
// rmc0-3: columnar vector coefficient of the RHS matrix, repeated 4X for 256-bit
// min32km1: saturation constant vector to cap intermediate pixel to less than or equal to 32767
// w0: output row vector of garbled intermediate matrix, elements within each block are garbled
// e.g Low 128-bit of row 0 in descending order: y07, y05, y06, y04, y03, y01, y02, y00
(continue)
#define __MyM_KIP_PxRMC_ROW_4x4Wx4(b0, w0, rmc0_256,
rmc1_256, rmc2_256, rmc3_256, min32km1)\
{__m256i tt0, tt1, tt2, tt3;\
tt0 = _mm256_madd_epi16(b0, (rmc0_256));\
tt0 = _mm256_hadd_epi32(tt0, tt0) ;\
tt1 = _mm256_madd_epi16(b0, rmc1_256);\
tt1 = _mm256_blend_epi16(tt0, _mm256_hadd_epi32(tt1, tt1) , 0xf0);\
tt1 = _mm256_min_epi32(_mm256_srai_epi32( tt1, 7), min32km1);\
tt1 = _mm256_shuffle_epi32(tt1, 0xd8); \
tt2 = _mm256_madd_epi16(b0, rmc2_256);\
tt2 = _mm256_hadd_epi32(tt2, tt2) ;\
tt3 = _mm256_madd_epi16(b0, rmc3_256);\
tt3 = _mm256_blend_epi16(tt2, _mm256_hadd_epi32(tt3, tt3) , 0xf0);\
tt3 = _mm256_min_epi32( _mm256_srai_epi32(tt3, 7), min32km1);\
tt3 = _mm256_shuffle_epi32(tt3, 0xd8);\
w0 = _mm256_blend_epi16(tt1, _mm256_slli_si256( tt3, 2), 0xaa);\
}

11-50

OPTIMIZATIONS FOR INTEL® AVX, FMA AND AVX2

Example 11-38. Macros for Separable KLT Intra-block Transformation Using AVX2 (Contd.)

// t0-t3: 256-bit input vectors of un-garbled intermediate matrix 1/128 * (B x R)
// lmr_256: 256-bit vector of one row of LHS coefficient, repeated 4X
// min32km1: saturation constant vector to cap final pixel to less than or equal to 32767
// w0; Output row vector of final result in un-garbled order
#define __MyM_KIP_LMRxP_ROW_4x4Wx4(w0, t0, t1, t2, t3, lmr_256, min32km1)\
{__m256itb0, tb1, tb2, tb3;\
tb0 = _mm256_madd_epi16( lmr_256, t0);\
tb0 = _mm256_hadd_epi32(tb0, tb0) ;\
tb1 = _mm256_madd_epi16( lmr_256, t1);\
tb1 = _mm256_blend_epi16(tb0, _mm256_hadd_epi32(tb1, tb1), 0xf0 );\
tb1 = _mm256_min_epi32( _mm256_srai_epi32( tb1, 7), min32km1);\
tb1 = _mm256_shuffle_epi32(tb1, 0xd8);\
tb2 = _mm256_madd_epi16( lmr_256, t2);\
tb2 = _mm256_hadd_epi32(tb2, tb2) ;\
tb3 = _mm256_madd_epi16( lmr_256, t3);\
tb3 = _mm256_blend_epi16(tb2, _mm256_hadd_epi32(tb3, tb3) , 0xf0);\
tb3 = _mm256_min_epi32( _mm256_srai_epi32( tb3, 7), min32km1);\
tb3 = _mm256_shuffle_epi32(tb3, 0xd8); \
tb3 = _mm256_slli_si256( tb3, 2);\
tb3 = _mm256_blend_epi16(tb1, tb3, 0xaa);\
w0 = _mm256_shuffle_epi8(tb3, _mm256_setr_epi32( 0x5040100, 0x7060302, 0xd0c0908, 0xf0e0b0a,
0x5040100, 0x7060302, 0xd0c0908, 0xf0e0b0a));\
}

In Example 11-39, matrix multiplication of 1/128 * (B xR) is evaluated first in a 4-wide manner by
fetching from 4 consecutive 4x4 image block of word pixels. The first macro shown in Example 11-38
produces an output vector where each intermediate row result is in an garbled sequence between the
two middle elements of each 4x4 block. In Example 11-39, undoing the garbled elements and transposing the intermediate row vector into column vectors are implemented using blend primitives instead
of shuffle/unpack primitives.
In Intel microarchitecture code name Haswell, shuffle/pack/unpack primitives rely on the shuffle execution unit dispatched to port 5. In some situations of heavy SIMD sequences, port 5 pressure may become
a determining factor in performance.
If 128-bit SIMD code faces port 5 pressure when running on Haswell, porting 128-bit code to use 256-bit
AVX2 can improve performance and alleviate port 5 pressure.

11-51

OPTIMIZATIONS FOR INTEL® AVX, FMA AND AVX2

Example 11-39. Separable KLT Intra-block Transformation Using AVX2
short __declspec(align(16))cst_rmc0[8] = {64, 84, 64, 35, 64, 84, 64, 35};
short __declspec(align(16))cst_rmc1[8] = {64, 35, -64, -84, 64, 35, -64, -84};
short __declspec(align(16))cst_rmc2[8] = {64, -35, -64, 84, 64, -35, -64, 84};
short __declspec(align(16))cst_rmc3[8] = {64, -84, 64, -35, 64, -84, 64, -35};
short __declspec(align(16))cst_lmr0[8] = {29, 55, 74, 84, 29, 55, 74, 84};
short __declspec(align(16))cst_lmr1[8] = {74, 74, 0, -74, 74, 74, 0, -74};
short __declspec(align(16))cst_lmr2[8] = {84, -29, -74, 44, 84, -29, -74, 55};
short __declspec(align(16)) cst_lmr3[8] = {55, -84, 74, -29, 55, -84, 74, -29};
void Klt_256_d(short * Input, short * Output, int iWidth, int iHeight)
{int iX, iY;
__m256i rmc0 = _mm256_broadcastsi128_si256( _mm_loadu_si128((__m128i *) &cst_rmc0[0]));
__m256i rmc1 = _mm256_broadcastsi128_si256( _mm_loadu_si128((__m128i *)&cst_rmc1[0]));
__m256i rmc2 = _mm256_broadcastsi128_si256( _mm_loadu_si128((__m128i *)&cst_rmc2[0]));
__m256i rmc3 = _mm256_broadcastsi128_si256( _mm_loadu_si128((__m128i *)&cst_rmc3[0]));
__m256i lmr0 = _mm256_broadcastsi128_si256( _mm_loadu_si128((__m128i *)&cst_lmr0[0]));
__m256i lmr1 = _mm256_broadcastsi128_si256( _mm_loadu_si128((__m128i *)&cst_lmr1[0]));
__m256i lmr2 = _mm256_broadcastsi128_si256( _mm_loadu_si128((__m128i *)&cst_lmr2[0]));
__m256i lmr3 = _mm256_broadcastsi128_si256( _mm_loadu_si128((__m128i *)&cst_lmr3[0]));
__m256i min32km1 = _mm256_broadcastsi128_si256( _mm_setr_epi32( 0x7fff7fff, 0x7fff7fff, 0x7fff7fff,
0x7fff7fff));
__m256i b0, b1, b2, b3, t0, t1, t2, t3;
__m256i w0, w1, w2, w3;
short* pImage = Input;
short* pOutImage = Output;
int hgt = iHeight, wid= iWidth;
(continue)
// We implement 1/128 * (Mat_L x (1/128 * (Mat_B x Mat_R))) from the inner most parenthesis
for( iY = 0; iY < hgt; iY+=4) {
for( iX = 0; iX < wid; iX+=16) {
//load row 0 of 4 consecutive 4x4 matrix of word pixels
b0 = _mm256_loadu_si256( (__m256i *) (pImage + iY*wid+ iX)) ;
// multiply row 0 with columnar vectors of the RHS matrix coefficients
__MyM_KIP_PxRMC_ROW_4x4Wx4(b0, w0, rmc0, rmc1, rmc2, rmc3, min32km1);
// low 128-bit of garbled row 0, from hi->lo: y07, y05, y06, y04, y03, y01, y02, y00
b1 = _mm256_loadu_si256( (__m256i *) (pImage + (iY+1)*wid+ iX) );
__MyM_KIP_PxRMC_ROW_4x4Wx4(b1, w1, rmc0, rmc1, rmc2, rmc3, min32km1);
// hi->lo y17, y15, y16, y14, y13, y11, y12, y10
b2 = _mm256_loadu_si256( (__m256i *) (pImage + (iY+2)*wid+ iX) );
__MyM_KIP_PxRMC_ROW_4x4Wx4(b2, w2, rmc0, rmc1, rmc2, rmc3, min32km1);
b3 = _mm256_loadu_si256( (__m256i *) (pImage + (iY+3)*wid+ iX) );
__MyM_KIP_PxRMC_ROW_4x4Wx4(b3, w3, rmc0, rmc1, rmc2, rmc3, min32km1);

11-52

OPTIMIZATIONS FOR INTEL® AVX, FMA AND AVX2

Example 11-39. Separable KLT Intra-block Transformation Using AVX2 (Contd.)
// unscramble garbled middle 2 elements of each 4x4 block, then
// transpose into columnar vectors: t0 has 4 consecutive column 0 or 4 4x4 intermediate
t0 = _mm256_blend_epi16( w0, _mm256_slli_epi64(w1, 16), 0x22);
t0 = _mm256_blend_epi16( t0, _mm256_slli_epi64(w2, 32), 0x44);
t0 = _mm256_blend_epi16( t0, _mm256_slli_epi64(w3, 48), 0x88);
t1 = _mm256_blend_epi16( _mm256_srli_epi64(w0, 32), _mm256_srli_epi64(w1, 16), 0x22);
t1 = _mm256_blend_epi16( t1, w2, 0x44);
t1 = _mm256_blend_epi16( t1, _mm256_slli_epi64(w3, 16), 0x88); // column 1
t2 = _mm256_blend_epi16( _mm256_srli_epi64(w0, 16), w1, 0x22);
t2 = _mm256_blend_epi16( t2, _mm256_slli_epi64(w2, 16), 0x44);
t2 = _mm256_blend_epi16( t2, _mm256_slli_epi64(w3, 32), 0x88); // column 2
t3 = _mm256_blend_epi16( _mm256_srli_epi64(w0, 48), _mm256_srli_epi64(w1, 32), 0x22);
t3 = _mm256_blend_epi16( t3, _mm256_srli_epi64(w2, 16), 0x44);
t3 = _mm256_blend_epi16( t3, w3, 0x88);// column 3
// multiply row 0 of the LHS coefficient with 4 columnar vectors of intermediate blocks
// final output row are arranged in normal order
__MyM_KIP_LMRxP_ROW_4x4Wx4(w0, t0, t1, t2, t3, lmr0, min32km1);
_mm256_store_si256( (__m256i *) (pOutImage+iY*wid+ iX), w0) ;
__MyM_KIP_LMRxP_ROW_4x4Wx4(w1, t0, t1, t2, t3, lmr1, min32km1);
_mm256_store_si256( (__m256i *) (pOutImage+(iY+1)*wid+ iX), w1) ;
__MyM_KIP_LMRxP_ROW_4x4Wx4(w2, t0, t1, t2, t3, lmr2, min32km1);
_mm256_store_si256( (__m256i *) (pOutImage+(iY+2)*wid+ iX), w2) ;
(continue)
__MyM_KIP_LMRxP_ROW_4x4Wx4(w3, t0, t1, t2, t3, lmr3, min32km1);
_mm256_store_si256( (__m256i *) (pOutImage+(iY+3)*wid+ iX), w3) ;
}
}
Although 128-bit SIMD implementation is not shown here, it can be easily derived.
When running 128-bit SIMD code of this KLT intra-coding transformation on Intel microarchitecture code
name Sandy Bridge, the port 5 pressure are less because there are two shuffle units, and the effective
throughput for each 4x4 image block transformation is around 50 cycles. Its speed-up relative to optimized scalar implementation is about 2.5X.
When the 128-bit SIMD code runs on Haswell, micro-ops issued to port 5 account for slightly less than
50% of all micro-ops, compared to about one third on prior microarchitecture, resulting in about 25%
performance regression. On the other hand, AVX2 implementation can deliver effective throughput in
less than 35 cycle per 4x4 block.

11-53

OPTIMIZATIONS FOR INTEL® AVX, FMA AND AVX2

11.16.1 Multi-Buffering and AVX2
There are many compute-intensive algorithms (e.g. hashing, encryption, etc.) which operate on a
stream of data buffers. Very often, the data stream may be partitioned and treated as multiple independent buffer streams to leverage SIMD instruction sets.
Detailed treatment of hashing several buffers in parallel can be found at
http://www.scirp.org/journal/PaperInformation.aspx?paperID=23995 and at
http://eprint.iacr.org/2012/476.pdf.
With AVX2 providing a full compliment of 256-bit SIMD instructions with rich functionality at multiple
width granularities for logical and arithmetic operations. Algorithms that had leveraged XMM registers
and prior generations of SSE instruction sets can extend those multi-buffering algorithms to use AVX2 on
YMM and deliver even higher throughput. Optimized 256-bit AVX2 implementation may deliver up to
1.9X throughput when compared to 128-bit versions.
The image block transformation example discussed in Section 11.16 can be construed also as a multibuffering implementation of 4x4 blocks. When the performance baseline is switched from a two-shuffleport microarchitecture (Sandy Bridge) to single-shuffle-port microarchitecture, the 256-bit wide AVX2
provides a speed up of 1.9X relative to 128-bit SIMD implementation.
Greater details on multi-buffering can be found in the white paper at: https://wwwssl.intel.com/content/www/us/en/communications/communications-ia-multi-buffer-paper.html.

11.16.2 Modular Multiplication and AVX2
Modular multiplication of very large integers are often used to implement efficient modular exponentiation operations which are critical in public key cryptography, such as RSA 2048. Library implementation
of modular multiplication is often done with MUL/ADC chain sequences. Typically, a MUL instruction can
produce a 128-bit intermediate integer output, and add-carry chains must be used at 64-bit intermediate
data granularity.
In AVX2, VPMULUDQ/VPADDQ/VPSRLQ/VPSLLQ/VPBROADCASTQ/VPERMQ allow vectorized approach to
implement efficient modular multiplication/exponentiation for key lengths corresponding to RSA1024
and RSA2048. For details of modular exponentiation/multiplication and AVX2 implementation in
OpenSSL, see http://rd.springer.com/chapter/10.1007%2F978-3-642-31662-3_9?LI=true.
The basic heuristic starts with reformulating the large integer input operands in 512/1024 bit exponentiation in redundant representations. For example, a 1024-bit integer can be represented using base 2^29
and 36 “digits”, where each “digit” is less than 2^29. A digit in such redundant representation can be
placed in a dword slot of a vector register. Such redundant representation of large integer simplifies the
requirement to perform carry-add chains across the hardware granularity of the intermediate results of
unsigned integer multiplications.
Each VPMULUDQ in AVX2 using the digits from a redundant representation can produce 4 separate 64bit intermediate result with sufficient headroom (e.g. 5 most significant bits are 0 excluding sign bit).
Then, VPADDQ is sufficient to implement add-carry chain requirement without needing SIMD versions of
equivalent of ADC-like instructions. More details are available in the reference cited in paragraph above,
including the cost factor of conversion to redundant representation and effective speedup accounting for
parallel output bandwidth of VPMULUDQ/VPADDQ chain.

11.16.3 Data Movement Considerations
Intel microarchitecture code name Haswell can support up to two 256-bit load and one 256-bit store
micro-ops dispatched each cycle. Most existing binaries with heavy data-movement operation can
benefit from this enhancement and the higher bandwidths of the L1 data cache and L2 without re-compilation, if the binary is already optimized for prior generation microarchitecture. For example, 256-bit
SAXPY computation were limited by the number of load/store ports available in prior generation microarchitecture. It will benefit immediately on the Intel microarchitecture Haswell.

11-54

OPTIMIZATIONS FOR INTEL® AVX, FMA AND AVX2

In some situation, there may be some intricate interactions between microarchitectural restrictions on
the instruction set that is worth some discussion. We consider two commonly used library functions
memcpy() and memset() and the optimal choice to implement them on the new microarchitecture.
With memcpy() on Intel microarchitecture code name Haswell, using REP MOVSB to implement memcpy
operation for large copy length can take advantage the 256-bit store data path and deliver throughput of
more than 20 bytes per cycle. For copy length that are smaller than a few hundred bytes, REP MOVSB
approach is slower than using 128-bit SIMD technique described in Section 11.16.3.1.

11.16.3.1 SIMD Heuristics to implement Memcpy()
We start with a discussion of the general heuristic to attempt implementing memcpy() with 128-bit SIMD
instructions, which revolves around three numeric factors (destination address alignment, source
address alignment, bytes to copy) relative to the width of register width of the desired instruction set.
The data movement work of memcpy can be separated into the following phases:

•

An initial unaligned copy of 16 bytes, allows looping destination address pointer to become 16-byte
aligned. Thus subsequent store operations can use as many 16-byte aligned stores.

•

The remaining bytes-left-to-copy are decomposed into (a) multiples of unrolled 16-byte copy
operations, plus (b) residual count that may include some copy operations of less than 16 bytes. For
example, to unroll eight time to amortize loop iteration overhead, the residual count must handle
individual cases from 1 to 8x16-1 = 127.

•

Inside an 8X16 unrolled main loop, each 16 byte copy operation may need to deal with source pointer
address is not aligned to 16-byte boundary and store 16 fresh data to 16B-aligned destination
address. When the iterating source pointer is not 16B-aligned, the most efficient technique is a three
instruction sequence of:
— Fetch an 16-byte chunk from an 16-byte-aligned adjusted pointer address and use a portion of
this chunk with complementary portion from previous 16-byte-aligned fetch.
— Use PALIGNR to stitch a portion of the current chunk with the previous chunk.
— Stored stitched 16-byte fresh data to aligned destination address, and repeat this 3 instruction
sequence.
This 3-instruction technique allows the fetch:store instruction ratio for each 16-byte copy operation
to remain at 1:1.

While the above technique (specifically, the main loop dealing with copying thousands of bytes of data)
can achieve throughput of approximately 10 bytes per cycle on Intel microarchitecture Sandy Bridge and
Ivy Bridge with 128-bit data path for store operations, an attempt to extend this technique to use wider
data path will run into the following restrictions:

•

To use 256-bit VPALIGNR with its 2X128-bit lane microarchitecture, stitching of two partial chunks of
the current 256-bit 32-byte-aligned fetch requires another 256-bit fetch from an address 16-byte
offset from the current 32-byte-aligned 256-bit fetch.
— The fetch:store ratio for each 32-byte copy operation becomes 2:1.
— The 32-byte-unaligned fetch (although aligned to 16-byte boundary) will experience a cache-line
split penalty, once every 64-bytes of copy operation.

The net of this attempt to use 256-bit ISA to take advantage of the 256-bit store data-path microarchitecture was offset by the 4-instruction sequence and cacheline split penalty.

11.16.3.2 Memcpy() Implementation Using Enhanced REP MOVSB
It is interesting to compare the alternate approach of using enhanced REP MOVSB to implement
memcpy(). In Intel microarchitecture code name Haswell and Ivy Bridge, REP MOVSB is an optimized,
hardware provided, micro-op flow.
On Intel microarchitecture code name Ivy Bridge, a REP MOVSB implementation of memcpy can achieve
throughput at slightly better than the 128-bit SIMD implementation when copying thousands of bytes.
However, if the size of copy operation is less than a few hundred bytes, the REP MOVSB approach is less
11-55

OPTIMIZATIONS FOR INTEL® AVX, FMA AND AVX2

efficient than the explicit residual copy technique described in phase 2 of Section 11.16.3.1. This is
because handling 1-127 residual copy length (via jump table or switch/case, and is done before the main
loop) plus one or two 8x16B iterations incurs less branching overhead than the hardware provided microop flows. For the grueling implementation details of 128-bit SIMD implementation of memcpy(), one can
look up from the archived sources of open source library such as GLibC.
On Intel microarchitecture code name Haswell, using REP MOVSB to implement memcpy operation for
large copy length can take advantage the 256-bit store data path and deliver throughput of more than 20
bytes per cycle. For copy length that are smaller than a few hundred bytes, REP MOVSB approach is still
slower than treating the copy length as the residual phase of Section 11.16.3.1.

11.16.3.3 Memset() Implementation Considerations
The interface of Memset() has one address pointer as destination, which simplifies the complexity of
managing address alignment scenarios to use 256-bit aligned store instruction. After an initial unaligned
store, and adjusting the destination pointer to be 32-byte aligned, the residual phase follows the same
consideration as described in Section 11.16.3.1, which may employ a large jump table to handle each
residual value scenario with minimal branching, depending on the amount of unrolled 32B-aligned
stores. The main loop is a simple YMM register to 32-byte-aligned store operation, which can deliver
close to 30 bytes per cycle for lengths more than a thousand byte. The limiting factor here is due to each
256-bit VMOVDQA store consists of a store_address and a store_data micro-op flow. Only port 4 is available to dispatch the store_data micro-op each cycle.
Using REP STOSB to implement memset() has the code size advantage versus a SIMD implementation,
like REP MOVSB for memcpy(). On Intel microarchitecture code name Haswell, a memset() routine
implemented using REP STOSB will also benefit the from the 256-bit data path and increased L1 data
cache bandwidth to deliver up to 32 bytes per cycle for large count values.
Comparing the performance of memset() implementations using REP STOSB vs. 256-bit AVX2 requires
one to consider the pattern of invocation of memset(). The invocation pattern can lead to the necessity
of using different performance measurement techniques. There may be side effects affecting the
outcome of each measurement technique.
The most common measurement technique that is often used with a simple routine like memset() is to
execute memset() inside a loop with a large iteration count, and wrap the invocation of RDTSC before
and after the loop.
A slight variation of this measurement technique can apply to measuring memset() invocation patterns
of multiple back-to-back calls to memset() with different count values with no other intervening instruction streams executed between calls to memset().
In both of the above memset() invocation scenarios, branch prediction can play a significant role in
affecting the measured total cycles for executing the loop. Thus, measuring AVX2-implemented
memset() under a large loop to minimize RDTSC overhead can produce a skewed result with the branch
predictor being trained by the large loop iteration count.
In more realistic software stacks, the invocation patterns of memset() will likely have the characteristics
that:

•

There are intervening instruction streams being executed between invocations of memset(), the
state of branch predictor prior to memset() invocation is not pre-trained for the branching sequence
inside a memset() implementation.

•

Memset() count values are likely to be uncorrected.

The proper measurement technique to compare memset() performance for more realistic memset()
invocation scenarios will require a per-invocation technique that wraps two RDTSC around each invocation of memset().
With the per-invocation RDTSC measurement technique, the overhead of RDTSC and be pre-calibrated
and post-validated outside of a measurement loop. The per-invocation technique may also consider
cache warming effect by using a loop to wrap around the per-invocation measurements.
When the relevant skew factors of measurement techniques are taken into effect, the performance of
memset() using REP STOSB, for count values smaller than a few hundred bytes, is generally faster than
11-56

OPTIMIZATIONS FOR INTEL® AVX, FMA AND AVX2

the AVX2 version for the common memset() invocation scenarios. Only in the extreme scenarios of
hundreds of unrolled memset() calls, all using count values less than a few hundred bytes and with no
intervening instruction stream between each pair of memset() can the AVX2 version of memset() take
advantage of the training effect of the branch predictor.

11.16.3.4 Hoisting Memcpy/Memset Ahead of Consuming Code
There may be situations where the data furnished by a call to memcpy/memset and subsequent instructions consuming the data can be re-arranged:
memcpy ( pBuf, pSrc, Cnt); // make a copy of some data with knowledge of Cnt
..... // subsequent instruction sequences are not consuming pBuf immediately
result = compute( pBuf); // memcpy result consumed here
When the count is known to be at least a thousand byte or more, using enhanced REP MOVSB/STOSB can
provide another advantage to amortize the cost of the non-consuming code. The heuristic can be understood using a value of Cnt = 4096 and memset() as example:

•

A 256-bit SIMD implementation of memset() will need to issue/execute retire 128 instances of 32byte store operation with VMOVDQA, before the non-consuming instruction sequences can make
their way to retirement.

•

An instance of enhanced REP STOSB with ECX= 4096 is decoded as a long micro-op flow provided by
hardware, but retires as one instruction. There are many store_data operation that must complete
before the result of memset() can be consumed. Because the completion of store data operation is
de-coupled from program-order retirement, a substantial part of the non-consuming code stream
can process through the issue/execute and retirement, essentially cost-free if the non-consuming
sequence does not compete for store buffer resources.

Software that use enhanced REP MOVSB/STOSB much check its availability by verifying
CPUID.(EAX=07H, ECX=0):EBX.ERMSB (bit 9) reports 1.

11.16.3.5 256-bit Fetch versus Two 128-bit Fetches
On Intel microarchitecture code name Sandy Bridge and Ivy Bridge, using two 16-byte aligned loads are
preferred due to the 128-bit data path limitation in the memory pipeline of the microarchitecture.
To take advantage of Intel microarchitecture code name Haswell’s 256-bit data path microarchitecture,
the use of 256-bit loads must consider the alignment implications. Instruction that fetched 256-bit data
from memory should pay attention to be 32-byte aligned. If a 32-byte unaligned fetch would span across
cache line boundary, it is still preferable to fetch data from two 16-byte aligned address instead.

11.16.3.6 Mixing MULX and AVX2 Instructions
Combining MULX and AVX2 instruction can further improve the performance of some common computation task, e.g. numeric conversion 64-bit integer to ascii format can benefit from the flexibility of MULX
register allocation, wider YMM register, and variable packed shift primitive VPSRLVD for parallel
moduli/remainder calculations.
Example 11-40 shows a macro sequence of AVX2 instruction to calculate one or two finite range
unsigned short integer(s) into respective decimal digits, featuring VPSRLVD in conjunction with Montgomery reduction technique.

Example 11-40. Macros for Parallel Moduli/Remainder Calculation
static short quoTenThsn_mulplr_d[16] =
{ 0x199a, 0, 0x28f6, 0, 0x20c5, 0, 0x1a37, 0, 0x199a, 0, 0x28f6, 0, 0x20c5, 0, 0x1a37, 0};
static short mten_mulplr_d[16] = { -10, 1, -10, 1, -10, 1, -10, 1, -10, 1, -10, 1, -10, 1, -10, 1};

11-57

OPTIMIZATIONS FOR INTEL® AVX, FMA AND AVX2

Example 11-40. Macros for Parallel Moduli/Remainder Calculation (Contd.)
// macro to convert input t5 (a __m256i type) containing quotient (dword 4) and remainder
// (dword 0) into single-digit integer (between 0-9) in output y3 ( a__m256i);
//both dword element "t5" is assume to be less than 10^4, the rest of dword must be 0;
//the output is 8 single-digit integer, located in the low byte of each dword, MS digit in dword 0
#define __ParMod10to4AVX2dw4_0( y3, t5 ) \
{ __m256i x0, x2;

\

x0 = _mm256_shuffle_epi32( t5, 0); \
x2 = _mm256_mulhi_epu16(x0, _mm256_loadu_si256( (__m256i *) quoTenThsn_mulplr_d));\
x2 = _mm256_srlv_epi32( x2, _mm256_setr_epi32(0x0, 0x4, 0x7, 0xa, 0x0, 0x4, 0x7, 0xa) ); \
(y3) = _mm256_or_si256(_mm256_slli_si256(x2, 6), _mm256_slli_si256(t5, 2) ); \
(y3) = _mm256_or_si256(x2, y3);\
(y3) = _mm256_madd_epi16(y3, _mm256_loadu_si256( (__m256i *) mten_mulplr_d) ) ;\
}}
// parallel conversion of dword integer (< 10^4) to 4 single digit integer in __m128i
#define __ParMod10to4AVX2dw( x3, dw32 ) \
{ __m128i x0, x2;

\

x0 = _mm_broadcastd_epi32( _mm_cvtsi32_si128( dw32)); \
x2 = _mm_mulhi_epu16(x0, _mm_loadu_si128( (__m128i *) quoTenThsn_mulplr_d));\
x2 = _mm_srlv_epi32( x2, _mm_setr_epi32(0x0, 0x4, 0x7, 0xa) ); \
(x3) = _mm_or_si128(_mm_slli_si128(x2, 6), _mm_slli_si128(_mm_cvtsi32_si128( dw32), 2) ); \
(x3) = _mm_or_si128(x2, (x3));\
(x3) = _mm_madd_epi16((x3), _mm_loadu_si128( (__m128i *) mten_mulplr_d) ) ;\
}
Example 11-41 shows a helper utility and overall steps to reduce a 64-bit signed integer into 63-bit
unsigned range. reduced-range integer quotient/remainder pairs using MULX.

Example 11-41. Signed 64-bit Integer Conversion Utility
#defineQWCG10to80xabcc77118461cefdull
static short quo4digComp_mulplr_d[8] = { 1024, 0, 64, 0, 8, 0, 0, 0};
static int pr_cg_10to4[8] = { 0x68db8db, 0 , 0, 0, 0x68db8db, 0, 0, 0};
static int pr_1_m10to4[8] = { -10000, 0 , 0, 0 , 1, 0 , 0, 0};
char * i64toa_avx2i( __int64 xx, char * p)
{int cnt;
_mm256_zeroupper();
if( xx < 0) cnt = avx2i_q2a_u63b(-xx, p);
else cnt = avx2i_q2a_u63b(xx, p);
p[cnt] = 0;
return p;
}

11-58

OPTIMIZATIONS FOR INTEL® AVX, FMA AND AVX2

Example 11-41. Signed 64-bit Integer Conversion Utility (Contd.)
// Convert unsigned short (< 10^4) to ascii
__inline int ubsAvx2_Lt10k_2s_i2(int x_Lt10k, char *ps)
{int tmp;
__m128i x0, m0, x2, x3, x4, compv;
if( x_Lt10k < 10) { *ps = '0' + x_Lt10k; return 1; }
x0 = _mm_broadcastd_epi32( _mm_cvtsi32_si128( x_Lt10k));
// calculate quotients of divisors 10, 100, 1000, 10000
m0 = _mm_loadu_si128( (__m128i *) quoTenThsn_mulplr_d);
x2 = _mm_mulhi_epu16(x0, m0);
// u16/10, u16/100, u16/1000, u16/10000
x2 = _mm_srlv_epi32( x2, _mm_setr_epi32(0x0, 0x4, 0x7, 0xa) );
// 0, u16, 0, u16/10, 0, u16/100, 0, u16/1000
x3 = _mm_insert_epi16(_mm_slli_si128(x2, 6), (int) x_Lt10k, 1);
x4 = _mm_or_si128(x2, x3);
// produce 4 single digits in low byte of each dword
x4 = _mm_madd_epi16(x4, _mm_loadu_si128( (__m128i *) mten_mulplr_d) ) ;// add bias for ascii encoding
x2 = _mm_add_epi32( x4, _mm_set1_epi32( 0x30303030 ) );
// pack 4 single digit into a dword, start with most significant digit
x3 = _mm_shuffle_epi8(x2, _mm_setr_epi32(0x0004080c, 0x80808080, 0x80808080, 0x80808080) );
if (x_Lt10k > 999 ) *(int *) ps = _mm_cvtsi128_si32( x3); return 4;
else {
tmp = _mm_cvtsi128_si32( x3);
if (x_Lt10k > 99 ) {
*((short *) (ps)) = (short ) (tmp >>8);
ps[2] = (char ) (tmp >>24);
return 3;
}
(continue)
else if ( x_Lt10k > 9){
*((short *) ps) = (short ) tmp;
return 2;
}
}
}

Example 11-42 shows the steps of numeric conversion of 63-bit dynamic range into ascii format
according to a progressive range reduction technique using vectorized Montgomery reduction scheme.

11-59

OPTIMIZATIONS FOR INTEL® AVX, FMA AND AVX2

Example 11-42. Unsigned 63-bit Integer Conversion Utility
unsigned avx2i_q2a_u63b (unsigned __int64 xx, char *ps)
{ __m128i v0;
__m256i m0, x1, x2, x3, x4, x5 ;
unsigned __int64 xxi, xx2, lo64, hi64;
__int64 w;
int j, cnt, abv16, tmp, idx, u;
// conversion of less than 4 digits
if ( xx < 10000 ) {
j = ubsAvx2_Lt10k_2s_i2 ( (unsigned ) xx, ps); return j;
} else if (xx < 100000000 ) { // dynamic range of xx is less than 9 digits
// conversion of 5-8 digits
x1 = _mm256_broadcastd_epi32( _mm_cvtsi32_si128(xx)); // broadcast to every dword
// calculate quotient and remainder, each with reduced range (< 10^4)
x3 = _mm256_mul_epu32(x1, _mm256_loadu_si256( (__m256i *) pr_cg_10to4 ));
x3 = _mm256_mullo_epi32(_mm256_srli_epi64(x3, 40), _mm256_loadu_si256( (__m256i *)pr_1_m10to4));
// quotient in dw4, remainder in dw0
m0 = _mm256_add_epi32( _mm256_castsi128_si256( _mm_cvtsi32_si128(xx)), x3);
__ParMod10to4AVX2dw4_0( x3, m0); // 8 digit in low byte of each dw
x3 = _mm256_add_epi32( x3, _mm256_set1_epi32( 0x30303030 ) );
x4 = _mm256_shuffle_epi8(x3, _mm256_setr_epi32(0x0004080c, 0x80808080, 0x80808080, 0x80808080,
0x0004080c, 0x80808080, 0x80808080, 0x80808080) );
// pack 8 single-digit integer into first 8 bytes and set rest to zeros
x4 = _mm256_permutevar8x32_epi32( x4, _mm256_setr_epi32(0x4, 0x0, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1) );
tmp = _mm256_movemask_epi8( _mm256_cmpgt_epi8(x4, _mm256_set1_epi32( 0x30303030 )) );
_BitScanForward((unsigned long *) &idx, tmp);
cnt = 8 -idx; // actual number non-zero-leading digits to write to output
} else { // conversion of 9-12 digits
lo64 = _mulx_u64(xx, (unsigned __int64) QWCG10to8, &hi64);
hi64 >>= 26;
xxi = _mulx_u64(hi64, (unsigned __int64)100000000, &xx2);
lo64 = (unsigned __int64)xx - xxi;
(continue)

11-60

OPTIMIZATIONS FOR INTEL® AVX, FMA AND AVX2

Example 11-42. Unsigned 63-bit Integer Conversion Utility (Contd.)
if( hi64 < 10000) { // do digist 12-9 first
__ParMod10to4AVX2dw(v0, hi64);
v0 = _mm_add_epi32( v0, _mm_set1_epi32( 0x30303030 ) );
// continue conversion of low 8 digits of a less-than 12-digit value
x5 = _mm256_setzero_si256( );
x5 = _mm256_castsi128_si256( _mm_cvtsi32_si128(lo64));
x1 = _mm256_broadcastd_epi32( _mm_cvtsi32_si128(lo64)); // broadcast to every dword
x3 = _mm256_mul_epu32(x1, _mm256_loadu_si256( (__m256i *) pr_cg_10to4 ));
x3 = _mm256_mullo_epi32(_mm256_srli_epi64(x3, 40), _mm256_loadu_si256( (__m256i *)pr_1_m10to4));
m0 = _mm256_add_epi32( x5, x3); // quotient in dw4, remainder in dw0
__ParMod10to4AVX2dw4_0( x3, m0);
x3 = _mm256_add_epi32( x3, _mm256_set1_epi32( 0x30303030 ) );
x4 = _mm256_shuffle_epi8(x3, _mm256_setr_epi32(0x0004080c, 0x80808080, 0x80808080, 0x80808080,
0x0004080c, 0x80808080, 0x80808080, 0x80808080) );
x5 = _mm256_castsi128_si256( _mm_shuffle_epi8( v0, _mm_setr_epi32(0x80808080, 0x80808080,
0x0004080c, 0x80808080) ));
x4 = _mm256_permutevar8x32_epi32( _mm256_or_si256(x4, x5), _mm256_setr_epi32(0x2, 0x4, 0x0, 0x1,
0x1, 0x1, 0x1, 0x1) );
tmp = _mm256_movemask_epi8( _mm256_cmpgt_epi8(x4, _mm256_set1_epi32( 0x30303030 )) );
_BitScanForward((unsigned long *) &idx, tmp);
cnt = 12 -idx;
} else { // handle greater than 12 digit input value
cnt = 0;
if ( hi64 > 100000000) { // case of input value has more than 16 digits
xxi = _mulx_u64(hi64, (unsigned __int64) QWCG10to8, &xx2) ;
abv16 = xx2 >>26;
hi64 -= _mulx_u64((unsigned __int64) abv16, (unsigned __int64) 100000000, &xx2);
__ParMod10to4AVX2dw(v0, abv16);
v0 = _mm_add_epi32( v0, _mm_set1_epi32( 0x30303030 ) );
v0 = _mm_shuffle_epi8(v0, _mm_setr_epi32(0x0004080c, 0x80808080, 0x80808080, 0x80808080) );
tmp = _mm_movemask_epi8( _mm_cmpgt_epi8(v0, _mm_set1_epi32( 0x30303030 )) );
_BitScanForward((unsigned long *) &idx, tmp);
cnt = 4 -idx;
}
(continue)

11-61

OPTIMIZATIONS FOR INTEL® AVX, FMA AND AVX2

Example 11-42. Unsigned 63-bit Integer Conversion Utility (Contd.)
// conversion of lower 16 digits
x1 = _mm256_broadcastd_epi32( _mm_cvtsi32_si128(hi64)); // broadcast to every dword
x3 = _mm256_mul_epu32(x1, _mm256_loadu_si256( (__m256i *) pr_cg_10to4 ));
x3 = _mm256_mullo_epi32(_mm256_srli_epi64(x3, 40), _mm256_loadu_si256( (__m256i *)pr_1_m10to4));
m0 = _mm256_add_epi32( _mm256_castsi128_si256( _mm_cvtsi32_si128(hi64)), x3);
__ParMod10to4AVX2dw4_0( x3, m0);
x3 = _mm256_add_epi32( x3, _mm256_set1_epi32( 0x30303030 ) );
x4 = _mm256_shuffle_epi8(x3, _mm256_setr_epi32(0x0004080c, 0x80808080, 0x80808080, 0x80808080,
0x0004080c, 0x80808080, 0x80808080, 0x80808080) );
x1 = _mm256_broadcastd_epi32( _mm_cvtsi32_si128(lo64)); // broadcast to every dword
x3 = _mm256_mul_epu32(x1, _mm256_loadu_si256( (__m256i *) pr_cg_10to4 ));
x3 = _mm256_mullo_epi32(_mm256_srli_epi64(x3, 40), _mm256_loadu_si256( (__m256i *)pr_1_m10to4));
m0 = _mm256_add_epi32( _mm256_castsi128_si256( _mm_cvtsi32_si128(hi64)), x3);
__ParMod10to4AVX2dw4_0( x3, m0);
x3 = _mm256_add_epi32( x3, _mm256_set1_epi32( 0x30303030 ) );
x5 = _mm256_shuffle_epi8(x3, _mm256_setr_epi32(0x80808080, 0x80808080, 0x0004080c, 0x80808080,
0x80808080, 0x80808080, 0x0004080c, 0x80808080) );
x4 = _mm256_permutevar8x32_epi32( _mm256_or_si256(x4, x5), _mm256_setr_epi32(0x4, 0x0, 0x6, 0x2,
0x1, 0x1, 0x1, 0x1) );
cnt += 16;
if (cnt <= 16) {
tmp = _mm256_movemask_epi8( _mm256_cmpgt_epi8(x4, _mm256_set1_epi32( 0x30303030 )) );
_BitScanForward((unsigned long *) &idx, tmp);
cnt -= idx;
}
}
}
    w = _mm_cvtsi128_si64( _mm256_castsi256_si128(x4));
    switch (cnt) {
    case 5:  *ps++ = (char) (w >> 24); *(unsigned *) ps = (w >> 32);
        break;
    case 6:  *(short *) ps = (short) (w >> 16); *(unsigned *) (&ps[2]) = (w >> 32);
        break;
    case 7:  *ps = (char) (w >> 8); *(short *) (&ps[1]) = (short) (w >> 16);
        *(unsigned *) (&ps[3]) = (w >> 32);
        break;
    case 8:  *(long long *) ps = w;
        break;
    case 9:  *ps++ = (char) (w >> 24); *(long long *) (&ps[0]) = _mm_cvtsi128_si64(
        _mm_srli_si128(_mm256_castsi256_si128(x4), 4));
        break;
    case 10: *(short *) ps = (short) (w >> 16);
        *(long long *) (&ps[2]) = _mm_cvtsi128_si64( _mm_srli_si128(_mm256_castsi256_si128(x4), 4));
        break;
    case 11: *ps = (char) (w >> 8); *(short *) (&ps[1]) = (short) (w >> 16);
        *(long long *) (&ps[3]) = _mm_cvtsi128_si64( _mm_srli_si128(_mm256_castsi256_si128(x4), 4));
        break;
    case 12: *(unsigned *) ps = w; *(long long *) (&ps[4]) = _mm_cvtsi128_si64(
        _mm_srli_si128(_mm256_castsi256_si128(x4), 4));
        break;
    case 13: *ps++ = (char) (w >> 24); *(unsigned *) ps = (w >> 32);
        *(long long *) (&ps[4]) = _mm_cvtsi128_si64( _mm_srli_si128(_mm256_castsi256_si128(x4), 8));
        break;
    case 14: *(short *) ps = (short) (w >> 16); *(unsigned *) (&ps[2]) = (w >> 32);
        *(long long *) (&ps[6]) = _mm_cvtsi128_si64( _mm_srli_si128(_mm256_castsi256_si128(x4), 8));
        break;
    case 15: *ps = (char) (w >> 8); *(short *) (&ps[1]) = (short) (w >> 16);
        *(unsigned *) (&ps[3]) = (w >> 32);
        *(long long *) (&ps[7]) = _mm_cvtsi128_si64( _mm_srli_si128(_mm256_castsi256_si128(x4), 8));
        break;
    case 16: _mm_storeu_si128( (__m128i *) ps, _mm256_castsi256_si128(x4));
        break;
    case 17: u = _mm_cvtsi128_si64(v0); *ps++ = (char) (u >> 24);
        _mm_storeu_si128( (__m128i *) &ps[0], _mm256_castsi256_si128(x4));
        break;
    case 18: u = _mm_cvtsi128_si64(v0); *(short *) ps = (short) (u >> 16);
        _mm_storeu_si128( (__m128i *) &ps[2], _mm256_castsi256_si128(x4));
        break;
    case 19: u = _mm_cvtsi128_si64(v0); *ps = (char) (u >> 8); *(short *) (&ps[1]) = (short) (u >> 16);
        _mm_storeu_si128( (__m128i *) &ps[3], _mm256_castsi256_si128(x4));
        break;
    case 20: u = _mm_cvtsi128_si64(v0); *(unsigned *) ps = (unsigned) (u);
        _mm_storeu_si128( (__m128i *) &ps[4], _mm256_castsi256_si128(x4));
        break;
    }
    return cnt;
}
The AVX2 version of numeric conversion, across the dynamic range of 3/9/17 output digits, takes approximately 23/57/54 cycles per input, compared to the standard library implementation's range of 85/260/560
cycles per input.
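Note that the example avoids hardware divides entirely: the pr_cg_10to4 constants with the 40-bit shift, and QWCG10to8 for the 10^8 split, are pre-computed reciprocals. A minimal scalar sketch of this reciprocal-multiplication technique follows; the constant and input range below are derived here for illustration and are not taken from the example.

#include <stdint.h>

/* Reciprocal-multiplication divide by 10^4: 109951163 = ceil(2^40 / 10000),
   and (x * 109951163) >> 40 == x / 10000 for x below roughly 4.9 * 10^8,
   which covers the 8-digit intermediate values handled above. */
static inline uint32_t quo10000(uint32_t x)
{
    return (uint32_t) (((uint64_t) x * 109951163ULL) >> 40);
}

static inline uint32_t rem10000(uint32_t x)
{
    return x - quo10000(x) * 10000;   /* remainder without a divide */
}

The vector code applies the same multiply-and-shift to multiple dword lanes at once with VPMULUDQ and VPSRLQ (_mm256_mul_epu32 and _mm256_srli_epi64).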
The techniques illustrated above can be extended to numeric conversion of other formats, such as the
binary-integer-decimal (BID) encoding of the IEEE-754-2008 Decimal floating-point format. For the BID-128 format,
Example 11-42 can be adapted by adding another range-reduction stage that uses a pre-computed 256-bit
constant to perform Montgomery reduction at modulus 10^16. The technique for constructing the 256-bit
constant is covered in Chapter 10, “SSE4.2 and SIMD Programming For Text Processing/Lexing/Parsing,” of the Intel® 64 and IA-32 Architectures Optimization Reference Manual.

11.16.4 Considerations for Gather Instructions
The VGATHER family of instructions fetches multiple data elements specified by a vector index register
containing relative offsets from a base address. Processors based on the Haswell microarchitecture provide
the first implementation of the VGATHER instructions; there, a single instruction results in multiple
micro-ops being executed. In the Broadwell microarchitecture, the throughput of the VGATHER family of
instructions has improved significantly; see Table C-5.
Depending on data organization and access patterns, it is possible to create equivalent code sequences
without the VGATHER instruction that execute faster and with fewer micro-ops than a single VGATHER
instruction (e.g., see Section 11.5.1). Example 11-43 shows some of the situations where use of VGATHER
on Intel microarchitecture code name Haswell is unlikely to provide a performance benefit.
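For reference, the basic dword form of the gather is exposed in C as the _mm256_i32gather_epi32 intrinsic. A minimal sketch (function and variable names here are illustrative, not from this manual):

#include <immintrin.h>

/* Fetch x[index[0]] .. x[index[7]] with a single gather.
   The scale argument is 4 because the elements are 4-byte dwords. */
__m256i gather8(const int *x, const int *index)
{
    __m256i vindex = _mm256_loadu_si256((const __m256i *) index);
    return _mm256_i32gather_epi32(x, vindex, 4);
}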

Example 11-43. Access Patterns Favoring Non-VGATHER Techniques

Access Patterns         Recommended Instruction Selection
Sequential elements     Regular SIMD loads (MOVAPS/MOVUPS, MOVDQA/MOVDQU).
Fewer than 4 elements   Regular SIMD load + horizontal data-movement to re-arrange slots.
Small strides           Load all nearby elements + shuffle/permute to collect the strided elements:
                            VMOVUPD YMM0, [sequential elements]
                            VPERMQ  YMM1, YMM0, 0x08    // the even elements
                            VPERMQ  YMM2, YMM0, 0x0d    // the odd elements
Transpositions          Regular SIMD loads + shuffle/permute/blend to transpose to columns.
Redundant elements      Load once + shuffle/blend/logical to build data vectors in register. For example,
                        when result[i] = x[index[i]] + x[index[i+1]], the technique below may be
                        preferable to using multiple VGATHERs:
                            ymm0 <- VGATHER ( x[index[k]] );                               // fetch 8 elements
                            ymm1 <- VBLEND( VPERM (ymm0), VBROADCAST ( x[index[k+8]] ) );
                            ymm2 <- VPADD( ymm0, ymm1 );
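As a concrete rendering of the “Small strides” row above, the following hedged intrinsics sketch performs the same even/odd split of double-precision pairs (the function name is illustrative):

#include <immintrin.h>

/* De-interleave two (even, odd) double-precision pairs with one load
   plus two VPERMQ permutes instead of two gathers. */
void split_pairs(const double *src, __m128d *even, __m128d *odd)
{
    __m256d v = _mm256_loadu_pd(src);                                /* e0 o0 e1 o1 */
    *even = _mm256_castpd256_pd128(_mm256_permute4x64_pd(v, 0x08)); /* e0 e1 */
    *odd  = _mm256_castpd256_pd128(_mm256_permute4x64_pd(v, 0x0d)); /* o0 o1 */
}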

In other cases, using the VGATHER instruction can reduce code size and execute faster, with techniques
including, but not limited to, amortizing the latency and throughput of VGATHER, or hoisting the fetch
operations well in advance of the code that consumes the destination register of those fetches. Example
11-44 lists some patterns that can benefit from using VGATHER on Intel microarchitecture code name
Haswell.
General tips for using VGATHER:

•  Gathering more elements with a VGATHER instruction helps amortize the latency and throughput of
   VGATHER, and is more likely to provide a performance benefit over an equivalent non-VGATHER flow.
   For example, the latency of a 256-bit VGATHER is less than twice that of the equivalent 128-bit
   VGATHER, so one 256-bit VGATHER is more likely to show gains than two 128-bit ones. Also, using an
   index size larger than the data element size leaves half of the register slots unutilized without a
   proportional latency reduction; therefore the dword-index form of VGATHER is preferred over the
   qword-index form when dwords or single-precision values are to be fetched.
•  It is advantageous to hoist VGATHER well in advance of the consumer code.
•  VGATHER merges the (unmasked) gathered elements with the previous value of the destination.
   Therefore, in cases where the previous value of the destination does not need to be merged (for
   instance, when no elements are masked off), it can be beneficial to break the dependency of the
   VGATHER instruction on the previous writer of the destination register by zeroing out the register
   with a VXOR instruction, as sketched below.
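The dependency-breaking idiom in the last bullet can be sketched with the AVX2 masked-gather intrinsic as follows; the function name and full mask are illustrative assumptions:

#include <immintrin.h>

/* Full-mask gather where the old destination value is irrelevant:
   starting from a zeroed register (the VXOR idiom) removes the
   dependency on the previous writer of that register. */
__m256i gather_all(const int *x, __m256i vindex)
{
    __m256i mask = _mm256_set1_epi32(-1);    /* no elements masked off */
    __m256i dst  = _mm256_setzero_si256();   /* compiles to VPXOR */
    return _mm256_mask_i32gather_epi32(dst, x, vindex, mask, 4);
}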

Example 11-44. Access Patterns Likely to Favor VGATHER Techniques

Access Patterns                 Instruction Selection
4 or more elements with         Code with conditional element gathers typically either will not vectorize
unknown masks                   without a VGATHER instruction or provides relatively poor performance due
                                to data-dependent mis-predicted branches.
                                C code with data-dependent branches:
                                    if (condition[i] > 0) { result[i] = x[index[i]]; }
                                AVX2 equivalent sequence:
                                    YMM0 <- VPCMPGT (condition, zeros)   // compute vector mask
                                    YMM2 <- VGATHER (x[YMM1], YMM0)      // addr=x[YMM1], mask=YMM0
Vectorized index calculation    Vectorized calculations to generate the index synergize well with the
with 8 elements                 VGATHER instruction functionality.
                                C code snippet:
                                    x[index1[i] + index2[i]]
                                AVX2 equivalent:
                                    YMM0 <- VPADD (index1, index2)       // calc vector index
                                    YMM1 <- VGATHER (x[YMM0], mask)      // addr=x[YMM0]
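A hedged intrinsics rendering of the first row above (conditional gather); the array names are illustrative:

#include <immintrin.h>

/* result[i] = x[index[i]] only where condition[i] > 0; lanes with a
   false condition keep the zero supplied in the source operand. */
__m256i cond_gather(const int *x, const int *index, const int *condition)
{
    __m256i cond = _mm256_loadu_si256((const __m256i *) condition);
    __m256i mask = _mm256_cmpgt_epi32(cond, _mm256_setzero_si256());
    __m256i vidx = _mm256_loadu_si256((const __m256i *) index);
    return _mm256_mask_i32gather_epi32(_mm256_setzero_si256(), x, vidx, mask, 4);
}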

Performance of the VGATHER instruction compared to a multi-instruction gather-equivalent flow can vary
due to (1) differences in the base algorithm, (2) different data organization, and (3) the effectiveness of
the equivalent flow. In performance-critical applications it is advisable to evaluate both options before
choosing one.
The throughput of the GATHER instructions continues to improve from the Broadwell to the Skylake microarchitecture, as shown in Figure 11-4.

Figure 11-4. Throughput Comparison of Gather Instructions

Example 11-45 gives the asm sequence of a software implementation that is equivalent to a full-mask
VPGATHERD instruction. It can be used to evaluate the trade-off between using the hardware gather
instruction and a software gather sequence that inserts one element at a time.

Example 11-45. Software AVX Sequence Equivalent to Full-Mask VPGATHERD
mov eax, [rdi]                          // load index0
vmovd xmm0, [rsi+4*rax]                 // load element0
mov eax, [rdi+4]                        // load index1
vpinsrd xmm0, xmm0, [rsi+4*rax], 0x1    // load element1
mov eax, [rdi+8]                        // load index2
vpinsrd xmm0, xmm0, [rsi+4*rax], 0x2    // load element2
mov eax, [rdi+12]                       // load index3
vpinsrd xmm0, xmm0, [rsi+4*rax], 0x3    // load element3
mov eax, [rdi+16]                       // load index4
vmovd xmm1, [rsi+4*rax]                 // load element4
mov eax, [rdi+20]                       // load index5
vpinsrd xmm1, xmm1, [rsi+4*rax], 0x1    // load element5
mov eax, [rdi+24]                       // load index6
vpinsrd xmm1, xmm1, [rsi+4*rax], 0x2    // load element6
mov eax, [rdi+28]                       // load index7
vpinsrd xmm1, xmm1, [rsi+4*rax], 0x3    // load element7
vinserti128 ymm0, ymm0, xmm1, 1         // result in ymm0
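The same software sequence can also be expressed with C intrinsics, which is convenient when comparing it against the _mm256_i32gather_epi32 form in source code; the function name below is illustrative:

#include <immintrin.h>

/* Gather eight dwords one element at a time, mirroring the asm above. */
__m256i sw_gather8(const int *x, const int *idx)
{
    __m128i lo = _mm_cvtsi32_si128(x[idx[0]]);
    lo = _mm_insert_epi32(lo, x[idx[1]], 1);
    lo = _mm_insert_epi32(lo, x[idx[2]], 2);
    lo = _mm_insert_epi32(lo, x[idx[3]], 3);
    __m128i hi = _mm_cvtsi32_si128(x[idx[4]]);
    hi = _mm_insert_epi32(hi, x[idx[5]], 1);
    hi = _mm_insert_epi32(hi, x[idx[6]], 2);
    hi = _mm_insert_epi32(hi, x[idx[7]], 3);
    return _mm256_inserti128_si256(_mm256_castsi128_si256(lo), hi, 1);
}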

Figure 11-5. Comparison of HW GATHER Versus Software Sequence in Skylake Microarchitecture


Figure 11-5 compares the per-element throughput of the VPGATHERD instruction versus a software
gather sequence on the Skylake microarchitecture, as a function of the cache locality of the data supply.
With the exception of hardware GATHER operating on two data elements per instruction, the gather
instruction outperforms the software sequence on the Skylake microarchitecture.
If the data is supplied from memory, software sequences are likely to perform better than the hardware
GATHER instruction.

11.16.4.1 Strided Loads
This section compares using the hardware GATHER instruction against alternative implementations of an
Array of Structures (AOS) to Structure of Arrays (SOA) transformation. The code separates the real and
imaginary elements of a complex array into two separate arrays.
C code (array names are illustrative):

for (int i = 0; i < len; i++) {
    Real[i] = Complex[i].real;
    Imaginary[i] = Complex[i].imag;
}
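For comparison with a gather-based implementation, a hedged AVX2 intrinsics sketch of the same separation using loads and shuffles; the struct layout and names are illustrative assumptions:

#include <immintrin.h>

typedef struct { float real, imag; } cplx_t;

/* Split 8 interleaved complex values (r0 i0 ... r7 i7) into real and
   imaginary vectors with two loads and shuffles instead of two gathers. */
void split8(const cplx_t *c, float *re, float *im)
{
    __m256 a = _mm256_loadu_ps((const float *) c);        /* r0 i0 r1 i1 r2 i2 r3 i3 */
    __m256 b = _mm256_loadu_ps((const float *) (c + 4));  /* r4 i4 r5 i5 r6 i6 r7 i7 */
    __m256 r = _mm256_shuffle_ps(a, b, 0x88);   /* r0 r1 r4 r5 | r2 r3 r6 r7 */
    __m256 i = _mm256_shuffle_ps(a, b, 0xdd);   /* i0 i1 i4 i5 | i2 i3 i6 i7 */
    __m256i fix = _mm256_setr_epi32(0, 1, 4, 5, 2, 3, 6, 7); /* restore order */
    _mm256_storeu_ps(re, _mm256_permutevar8x32_ps(r, fix));
    _mm256_storeu_ps(im, _mm256_permutevar8x32_ps(i, fix));
}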
