Coprocessor Instruction Set Architecture Reference Manual

Page Count: 725
Intel® Xeon Phi™ Coprocessor Instruction
Set Architecture Reference Manual
September 7, 2012

Reference Number: 327364-001

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY
THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS,
INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY,
RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO
FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT
OR OTHER INTELLECTUAL PROPERTY RIGHT.
A "Mission Critical Application" is any application in which failure of the Intel Product could result, directly or
indirectly, in personal injury or death. SHOULD YOU PURCHASE OR USE INTEL'S PRODUCTS FOR ANY SUCH
MISSION CRITICAL APPLICATION, YOU SHALL INDEMNIFY AND HOLD INTEL AND ITS SUBSIDIARIES, SUBCONTRACTORS AND AFFILIATES, AND THE DIRECTORS, OFFICERS, AND EMPLOYEES OF EACH, HARMLESS
AGAINST ALL CLAIMS COSTS, DAMAGES, AND EXPENSES AND REASONABLE ATTORNEYS' FEES ARISING OUT
OF, DIRECTLY OR INDIRECTLY, ANY CLAIM OF PRODUCT LIABILITY, PERSONAL INJURY, OR DEATH ARISING
IN ANY WAY OUT OF SUCH MISSION CRITICAL APPLICATION, WHETHER OR NOT INTEL OR ITS SUBCONTRACTOR WAS NEGLIGENT IN THE DESIGN, MANUFACTURE, OR WARNING OF THE INTEL PRODUCT OR ANY OF ITS
PARTS.
Intel may make changes to specifications and product descriptions at any time, without notice. Designers must
not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined". Intel
reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities
arising from future changes to them. The information here is subject to change without notice. Do not finalize a
design with this information.
The products described in this document may contain design defects or errors known as errata which may cause
the product to deviate from published specifications. Current characterized errata are available on request.
Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your
product order.
Copies of documents which have an order number and are referenced in this document, or other Intel literature,
may be obtained by calling 1-800-548-4725, or go to: http://www.intel.com/design/literature.htm
Intel, the Intel® logo, Intel® Xeon Phi™, Intel® Pentium®, Intel® Xeon®, Intel® Pentium® 4 Processor, Intel®
Core™ Duo, Intel® Core™ 2 Duo, MMX™, Intel® Streaming SIMD Extensions (Intel® SSE), Intel® Advanced Vector
Extensions (Intel® AVX) are trademarks or registered trademarks of Intel® Corporation or its subsidiaries in the
United States and other countries. *Other names and brands may be claimed as the property of others.
Copyright 2012 Intel® Corporation. All rights reserved.


Contents

1 Introduction . . . . . . . . . . 20

2 Instructions Terminology and State . . . . . . . . . . 21

2.1 Overview of the Intel® Xeon Phi™ Coprocessor Instruction Set Architecture Extensions . . . . . . . . 21
2.1.1 What are vectors? . . . . . . . . . . 21
2.1.2 Vector mask registers . . . . . . . . . . 21
2.1.2.1 Vector mask k0 . . . . . . . . . . 22
2.1.2.2 Example of use . . . . . . . . . . 22
2.1.3 Understanding Intel® Xeon Phi™ Coprocessor Instruction Set Architecture . . . . . . . . . . 23
2.1.3.1 Intel® Xeon Phi™ Coprocessor Instruction Set Architecture Vector Instructions . . . . . . . . . . 24
2.1.3.2 Intel® Xeon Phi™ Coprocessor Instruction Set Architecture Vector Memory Instructions: . . . . . . . . . . 25
2.1.3.3 Intel® Xeon Phi™ Coprocessor Instruction Set Architecture vector mask Instructions . . . . . . . . . . 26
2.1.3.4 Intel® Xeon Phi™ Coprocessor Instruction Set Architecture New Scalar Instructions . . . . . . . . . . 27

2.2 Intel® Xeon Phi™ Coprocessor Instruction Set Architecture Swizzles and Converts . . . . . . . . . . . 27
2.2.1 Load-Op Swizzle/Convert . . . . . . . . . . 29
2.2.2 Load Up-convert . . . . . . . . . . 30
2.2.3 Down-Conversion . . . . . . . . . . 33

2.3 Static Rounding Mode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.4 Intel® Xeon Phi™ coprocessor Execution Environments . . . . . . . . . . . . . . . . . . . . . . . . . 36
3 Intel® Xeon Phi™ Coprocessor Instruction Set Architecture Format . . . . . . . . . . 40

3.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.2 Instruction Formats . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.2.1 MVEX/VEX and the LOCK prefix . . . . . . . . . . 41
3.2.2 MVEX/VEX and the 66H, F2H, and F3H prefixes . . . . . . . . . . 41
3.2.3 MVEX/VEX and the REX prefix . . . . . . . . . . 41

3.3 The MVEX Prefix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.3.1 Vector SIB (VSIB) Memory Addressing . . . . . . . . . . 44
3.4 The VEX Prefix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.5 Intel® Xeon Phi™ Coprocessor Instruction Set Architecture Assembly Syntax . . . . . . . . . . . . . . 46
3.6 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.6.1 Operand Notation . . . . . . . . . . 46
3.6.2 The Displacement Bytes . . . . . . . . . . 47
3.6.3 Memory size and disp8*N calculation . . . . . . . . . . 49

3.7 EH hint . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.8 Functions and Tables Used . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.8.1 MemLoad and MemStore . . . . . . . . . . 52
3.8.2 SwizzUpConvLoad, UpConvLoad and DownConvStore . . . . . . . . . . 52
3.8.3 Other Functions/Identifiers . . . . . . . . . . 52

4 Floating-Point Environment, Memory Addressing, and Processor State . . . . . . . . . . 55

4.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.1.1 Suppress All Exceptions Attribute (SAE) . . . . . . . . . . 55
4.1.2 SIMD Floating-Point Exceptions . . . . . . . . . . 56
4.1.3 SIMD Floating-Point Exception Conditions . . . . . . . . . . 56
4.1.3.1 Invalid Operation Exception (#I) . . . . . . . . . . 57
4.1.3.2 Divide-By-Zero Exception (#Z) . . . . . . . . . . 57
4.1.3.3 Denormal Operand Exception (#D) . . . . . . . . . . 57
4.1.3.4 Numeric Overflow Exception (#O) . . . . . . . . . . 58
4.1.3.5 Numeric Underflow Exception (#U) . . . . . . . . . . 59
4.1.3.6 Inexact Result (Precision) Exception (#P) . . . . . . . . . . 59

4.2 Denormal Flushing Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.2.1 Denormal control in up-conversions and down-conversions . . . . . . . . . . 59
4.2.1.1 Up-conversions . . . . . . . . . . 59
4.2.1.2 Down-conversions . . . . . . . . . . 60

4.3 Extended Addressing Displacements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.4 Swizzle/up-conversion exceptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.5 Accessing uncacheable memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.5.1 Memory read operations . . . . . . . . . . 62
4.5.2 vloadunpackh*/vloadunpackl* . . . . . . . . . . 62
4.5.3 vgatherd* . . . . . . . . . . 62
4.5.4 Memory stores . . . . . . . . . . 62

4.6 Floating-point Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.6.1 Rounding Modes . . . . . . . . . . 63
4.6.1.1 Swizzle-explicit rounding modes . . . . . . . . . . 63
4.6.1.2 Definition and propagation of NaNs . . . . . . . . . . 63
4.6.1.3 Signed Zeros . . . . . . . . . . 64
4.6.2 REX prefix and Intel® Xeon Phi™ Coprocessor Instruction Set Architecture interactions . . . . . . . . . . 66

4.7 Intel® Xeon Phi™ Coprocessor Instruction Set Architecture State Save . . . . . . . . . . . . . . . . . . 66
4.8 Intel® Xeon Phi™ Coprocessor Instruction Set Architecture Processor State After Reset . . . . . . . . 66
5 Instruction Set Reference . . . . . . . . . . 68

5.1 Interpreting Instruction Reference Pages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
5.1.1 Instruction Format . . . . . . . . . . 68
5.1.2 Opcode Notations for MVEX Encoded Instructions . . . . . . . . . . 68
5.1.3 Opcode Notations for VEX Encoded Instructions . . . . . . . . . . 69

6 Instruction Descriptions . . . . . . . . . . 71

6.1 Vector Mask Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
JKNZD - Jump near if mask is not zero . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
JKZD - Jump near if mask is zero . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
KAND - AND Vector Mask . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
KANDN - AND NOT Vector Mask . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
KANDNR - Reverse AND NOT Vector Mask . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
KCONCATH - Pack and Move High Vector Mask . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
KCONCATL - Pack and Move Low Vector Mask . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
KEXTRACT - Extract Vector Mask From Register . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
KMERGE2L1H - Swap and Merge High Element Portion and Low Portion of Vector Masks . . . . . 91
KMERGE2L1L - Move Low Element Portion into High Portion of Vector Mask . . . . . . . . . . . . 93
KMOV - Move Vector Mask . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
KNOT - Not Vector Mask . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
KOR - OR Vector Masks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
KORTEST - OR Vector Mask And Set EFLAGS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
KXNOR - XNOR Vector Masks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
KXOR - XOR Vector Masks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
6.2 Vector Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
VADDNPD - Add and Negate Float64 Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
VADDNPS - Add and Negate Float32 Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
VADDPD - Add Float64 Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
VADDPS - Add Float32 Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
VADDSETSPS - Add Float32 Vectors and Set Mask to Sign . . . . . . . . . . . . . . . . . . . . . . . . 120
VALIGND - Align Doubleword Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
VBLENDMPD - Blend Float64 Vectors using the Instruction Mask . . . . . . . . . . . . . . . . . . . 126
VBLENDMPS - Blend Float32 Vectors using the Instruction Mask . . . . . . . . . . . . . . . . . . . 129
VBROADCASTF32X4 - Broadcast 4xFloat32 Vector . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
VBROADCASTF64X4 - Broadcast 4xFloat64 Vector . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
VBROADCASTI32X4 - Broadcast 4xInt32 Vector . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
VBROADCASTI64X4 - Broadcast 4xInt64 Vector . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
VBROADCASTSD - Broadcast Float64 Vector . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
VBROADCASTSS - Broadcast Float32 Vector . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
VCMPPD - Compare Float64 Vectors and Set Vector Mask . . . . . . . . . . . . . . . . . . . . . . . . 144
VCMPPS - Compare Float32 Vectors and Set Vector Mask . . . . . . . . . . . . . . . . . . . . . . . . 149
VCVTDQ2PD - Convert Int32 Vector to Float64 Vector . . . . . . . . . . . . . . . . . . . . . . . . . . 154
VCVTFXPNTDQ2PS - Convert Fixed Point Int32 Vector to Float32 Vector . . . . . . . . . . . . . . . 157
VCVTFXPNTPD2DQ - Convert Float64 Vector to Fixed Point Int32 Vector . . . . . . . . . . . . . . . 161
VCVTFXPNTPD2UDQ - Convert Float64 Vector to Fixed Point Uint32 Vector . . . . . . . . . . . . . 165
VCVTFXPNTPS2DQ - Convert Float32 Vector to Fixed Point Int32 Vector . . . . . . . . . . . . . . . 169
VCVTFXPNTPS2UDQ - Convert Float32 Vector to Fixed Point Uint32 Vector . . . . . . . . . . . . . 173
VCVTFXPNTUDQ2PS - Convert Fixed Point Uint32 Vector to Float32 Vector . . . . . . . . . . . . . 177
VCVTPD2PS - Convert Float64 Vector to Float32 Vector . . . . . . . . . . . . . . . . . . . . . . . . . 180
VCVTPS2PD - Convert Float32 Vector to Float64 Vector . . . . . . . . . . . . . . . . . . . . . . . . . 184
VCVTUDQ2PD - Convert Uint32 Vector to Float64 Vector . . . . . . . . . . . . . . . . . . . . . . . . 187
VEXP223PS - Base-2 Exponential Calculation of Float32 Vector . . . . . . . . . . . . . . . . . . . . 190
VFIXUPNANPD - Fix Up Special Float64 Vector Numbers With NaN Passthrough . . . . . . . . . . 193
VFIXUPNANPS - Fix Up Special Float32 Vector Numbers With NaN Passthrough . . . . . . . . . . . 197
VFMADD132PD - Multiply Destination By Second Source and Add To First Source Float64 Vectors 201
VFMADD132PS - Multiply Destination By Second Source and Add To First Source Float32 Vectors 205
VFMADD213PD - Multiply First Source By Destination and Add Second Source Float64 Vectors . 208
VFMADD213PS - Multiply First Source By Destination and Add Second Source Float32 Vectors . . 212
VFMADD231PD - Multiply First Source By Second Source and Add To Destination Float64 Vectors 216
VFMADD231PS - Multiply First Source By Second Source and Add To Destination Float32 Vectors 220
VFMADD233PS - Multiply First Source By Specially Swizzled Second Source and Add To Second
Source Float32 Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224
VFMSUB132PD - Multiply Destination By Second Source and Subtract First Source Float64 Vectors 228
VFMSUB132PS - Multiply Destination By Second Source and Subtract First Source Float32 Vectors 232
VFMSUB213PD - Multiply First Source By Destination and Subtract Second Source Float64 Vectors 235
VFMSUB213PS - Multiply First Source By Destination and Subtract Second Source Float32 Vectors 239
VFMSUB231PD - Multiply First Source By Second Source and Subtract Destination Float64 Vectors 242
VFMSUB231PS - Multiply First Source By Second Source and Subtract Destination Float32 Vectors 246
VFNMADD132PD - Multiply Destination By Second Source and Subtract From First Source
Float64 Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 249
VFNMADD132PS - Multiply Destination By Second Source and Subtract From First Source
Float32 Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253
VFNMADD213PD - Multiply First Source By Destination and Subtract From Second Source
Float64 Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 257
VFNMADD213PS - Multiply First Source By Destination and Subtract From Second Source
Float32 Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 261
VFNMADD231PD - Multiply First Source By Second Source and Subtract From Destination
Float64 Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265
VFNMADD231PS - Multiply First Source By Second Source and Subtract From Destination
Float32 Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 269
VFNMSUB132PD - Multiply Destination By Second Source, Negate, and Subtract First Source
Float64 Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273
VFNMSUB132PS - Multiply Destination By Second Source, Negate, and Subtract First Source
Float32 Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 277
VFNMSUB213PD - Multiply First Source By Destination, Negate, and Subtract Second Source
Float64 Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 281
VFNMSUB213PS - Multiply First Source By Destination, Negate, and Subtract Second Source
Float32 Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 285
VFNMSUB231PD - Multiply First Source By Second Source, Negate, and Subtract Destination
Float64 Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 289
VFNMSUB231PS - Multiply First Source By Second Source, Negate, and Subtract Destination
Float32 Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 293
VGATHERDPD - Gather Float64 Vector With Signed Dword Indices . . . . . . . . . . . . . . . . . . 297
VGATHERDPS - Gather Float32 Vector With Signed Dword Indices . . . . . . . . . . . . . . . . . . . 300
VGATHERPF0DPS - Gather Prefetch Float32 Vector With Signed Dword Indices Into L1 . . . . . . 303
VGATHERPF0HINTDPD - Gather Prefetch Float64 Vector Hint With Signed Dword Indices . . . . 306
VGATHERPF0HINTDPS - Gather Prefetch Float32 Vector Hint With Signed Dword Indices . . . . . 308
VGATHERPF1DPS - Gather Prefetch Float32 Vector With Signed Dword Indices Into L2 . . . . . . 310
VGETEXPPD - Extract Float64 Vector of Exponents from Float64 Vector . . . . . . . . . . . . . . . 313
VGETEXPPS - Extract Float32 Vector of Exponents from Float32 Vector . . . . . . . . . . . . . . . . 316
VGETMANTPD - Extract Float64 Vector of Normalized Mantissas from Float64 Vector . . . . . . . 319
VGETMANTPS - Extract Float32 Vector of Normalized Mantissas from Float32 Vector . . . . . . . 324
VGMAXABSPS - Absolute Maximum of Float32 Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . 329
VGMAXPD - Maximum of Float64 Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 333
VGMAXPS - Maximum of Float32 Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 337
VGMINPD - Minimum of Float64 Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 341
VGMINPS - Minimum of Float32 Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 345
VLOADUNPACKHD - Load Unaligned High And Unpack To Doubleword Vector . . . . . . . . . . . 349
VLOADUNPACKHPD - Load Unaligned High And Unpack To Float64 Vector . . . . . . . . . . . . . 352
VLOADUNPACKHPS - Load Unaligned High And Unpack To Float32 Vector . . . . . . . . . . . . . . 355
VLOADUNPACKHQ - Load Unaligned High And Unpack To Int64 Vector . . . . . . . . . . . . . . . 358

VLOADUNPACKLD - Load Unaligned Low And Unpack To Doubleword Vector . . . . . . . . . . . . 361
VLOADUNPACKLPD - Load Unaligned Low And Unpack To Float64 Vector . . . . . . . . . . . . . . 364
VLOADUNPACKLPS - Load Unaligned Low And Unpack To Float32 Vector . . . . . . . . . . . . . . 367
VLOADUNPACKLQ - Load Unaligned Low And Unpack To Int64 Vector . . . . . . . . . . . . . . . . 370
VLOG2PS - Vector Logarithm Base-2 of Float32 Vector . . . . . . . . . . . . . . . . . . . . . . . . . . 373
VMOVAPD - Move Aligned Float64 Vector . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 376
VMOVAPS - Move Aligned Float32 Vector . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 379
VMOVDQA32 - Move Aligned Int32 Vector . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 382
VMOVDQA64 - Move Aligned Int64 Vector . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 385
VMOVNRAPD - Store Aligned Float64 Vector With No-Read Hint . . . . . . . . . . . . . . . . . . . . 388
VMOVNRAPS - Store Aligned Float32 Vector With No-Read Hint . . . . . . . . . . . . . . . . . . . . 390
VMOVNRNGOAPD - Non-globally Ordered Store Aligned Float64 Vector With No-Read Hint . . . . 393
VMOVNRNGOAPS - Non-globally Ordered Store Aligned Float32 Vector With No-Read Hint . . . . 396
VMULPD - Multiply Float64 Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 399
VMULPS - Multiply Float32 Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 402
VPACKSTOREHD - Pack And Store Unaligned High From Int32 Vector . . . . . . . . . . . . . . . . 405
VPACKSTOREHPD - Pack And Store Unaligned High From Float64 Vector . . . . . . . . . . . . . . 408
VPACKSTOREHPS - Pack And Store Unaligned High From Float32 Vector . . . . . . . . . . . . . . . 411
VPACKSTOREHQ - Pack And Store Unaligned High From Int64 Vector . . . . . . . . . . . . . . . . 414

VPACKSTORELD - Pack and Store Unaligned Low From Int32 Vector . . . . . . . . . . . . . . . . . 417
VPACKSTORELPD - Pack and Store Unaligned Low From Float64 Vector . . . . . . . . . . . . . . . 420
VPACKSTORELPS - Pack and Store Unaligned Low From Float32 Vector . . . . . . . . . . . . . . . 423
VPACKSTORELQ - Pack and Store Unaligned Low From Int64 Vector . . . . . . . . . . . . . . . . . 426
VPADCD - Add Int32 Vectors with Carry . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 429
VPADDD - Add Int32 Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 432
VPADDSETCD - Add Int32 Vectors and Set Mask to Carry . . . . . . . . . . . . . . . . . . . . . . . . 435
VPADDSETSD - Add Int32 Vectors and Set Mask to Sign . . . . . . . . . . . . . . . . . . . . . . . . . 438
VPANDD - Bitwise AND Int32 Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 441
VPANDND - Bitwise AND NOT Int32 Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 444
VPANDNQ - Bitwise AND NOT Int64 Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 447
VPANDQ - Bitwise AND Int64 Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 450
VPBLENDMD - Blend Int32 Vectors using the Instruction Mask . . . . . . . . . . . . . . . . . . . . . 453
VPBLENDMQ - Blend Int64 Vectors using the Instruction Mask . . . . . . . . . . . . . . . . . . . . . 456
VPBROADCASTD - Broadcast Int32 Vector . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 459
VPBROADCASTQ - Broadcast Int64 Vector . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 461
VPCMPD - Compare Int32 Vectors and Set Vector Mask . . . . . . . . . . . . . . . . . . . . . . . . . 463
VPCMPEQD - Compare Equal Int32 Vectors and Set Vector Mask . . . . . . . . . . . . . . . . . . . . 467
VPCMPGTD - Compare Greater Than Int32 Vectors and Set Vector Mask . . . . . . . . . . . . . . . 470
VPCMPLTD - Compare Less Than Int32 Vectors and Set Vector Mask . . . . . . . . . . . . . . . . . 473
VPCMPUD - Compare Uint32 Vectors and Set Vector Mask . . . . . . . . . . . . . . . . . . . . . . . . 476
VPERMD - Permutes Int32 Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 480
VPERMF32X4 - Shuffle Vector Dqwords . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 482
VPGATHERDD - Gather Int32 Vector With Signed Dword Indices . . . . . . . . . . . . . . . . . . . . 484
VPGATHERDQ - Gather Int64 Vector With Signed Dword Indices . . . . . . . . . . . . . . . . . . . . 487
VPMADD231D - Multiply First Source By Second Source and Add To Destination Int32 Vectors . . 490
VPMADD233D - Multiply First Source By Specially Swizzled Second Source and Add To Second
Source Int32 Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 493
VPMAXSD - Maximum of Int32 Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 497
VPMAXUD - Maximum of Uint32 Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 500
VPMINSD - Minimum of Int32 Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 503
VPMINUD - Minimum of Uint32 Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 506
VPMULHD - Multiply Int32 Vectors And Store High Result . . . . . . . . . . . . . . . . . . . . . . . . 509
VPMULHUD - Multiply Uint32 Vectors And Store High Result . . . . . . . . . . . . . . . . . . . . . . 512
VPMULLD - Multiply Int32 Vectors And Store Low Result . . . . . . . . . . . . . . . . . . . . . . . . 515
VPORD - Bitwise OR Int32 Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 518
VPORQ - Bitwise OR Int64 Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 521
VPSBBD - Subtract Int32 Vectors with Borrow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 524
VPSBBRD - Reverse Subtract Int32 Vectors with Borrow . . . . . . . . . . . . . . . . . . . . . . . . 527
VPSCATTERDD - Scatter Int32 Vector With Signed Dword Indices . . . . . . . . . . . . . . . . . . . 530
VPSCATTERDQ - Scatter Int64 Vector With Signed Dword Indices . . . . . . . . . . . . . . . . . . . 533
VPSHUFD - Shuffle Vector Doublewords . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 536
VPSLLD - Shift Int32 Vector Immediate Left Logical . . . . . . . . . . . . . . . . . . . . . . . . . . . 538
VPSLLVD - Shift Int32 Vector Left Logical . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 541
VPSRAD - Shift Int32 Vector Immediate Right Arithmetic . . . . . . . . . . . . . . . . . . . . . . . . 544
VPSRAVD - Shift Int32 Vector Right Arithmetic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 547
VPSRLD - Shift Int32 Vector Immediate Right Logical . . . . . . . . . . . . . . . . . . . . . . . . . . 550
VPSRLVD - Shift Int32 Vector Right Logical . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 553
VPSUBD - Subtract Int32 Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 556
VPSUBRD - Reverse Subtract Int32 Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 559
VPSUBRSETBD - Reverse Subtract Int32 Vectors and Set Borrow . . . . . . . . . . . . . . . . . . . . 562
VPSUBSETBD - Subtract Int32 Vectors and Set Borrow . . . . . . . . . . . . . . . . . . . . . . . . . 565
VPTESTMD - Logical AND Int32 Vectors and Set Vector Mask . . . . . . . . . . . . . . . . . . . . . . 568
VPXORD - Bitwise XOR Int32 Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 571
VPXORQ - Bitwise XOR Int64 Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 574
VRCP23PS - Reciprocal of Float32 Vector . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 577
VRNDFXPNTPD - Round Float64 Vector . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 580
VRNDFXPNTPS - Round Float32 Vector . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 584
VRSQRT23PS - Vector Reciprocal Square Root of Float32 Vector . . . . . . . . . . . . . . . . . . . . 588
VSCALEPS - Scale Float32 Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 591
VSCATTERDPD - Scatter Float64 Vector With Signed Dword Indices . . . . . . . . . . . . . . . . . . 595
VSCATTERDPS - Scatter Float32 Vector With Signed Dword Indices . . . . . . . . . . . . . . . . . . 598
VSCATTERPF0DPS - Scatter Prefetch Float32 Vector With Signed Dword Indices Into L1 . . . . . . 601
VSCATTERPF0HINTDPD - Scatter Prefetch Float64 Vector Hint With Signed Dword Indices . . . . 604
VSCATTERPF0HINTDPS - Scatter Prefetch Float32 Vector Hint With Signed Dword Indices . . . . 606
VSCATTERPF1DPS - Scatter Prefetch Float32 Vector With Signed Dword Indices Into L2 . . . . . . 608
VSUBPD - Subtract Float64 Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 611
VSUBPS - Subtract Float32 Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 614
VSUBRPD - Reverse Subtract Float64 Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 617
VSUBRPS - Reverse Subtract Float32 Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 620
A Scalar Instruction Descriptions . . . . . . . . . . 623

CLEVICT0 - Evict L1 line . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 624
CLEVICT1 - Evict L2 line . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 626
DELAY - Stall Thread . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 628
LZCNT - Leading Zero Count . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 630
POPCNT - Return the Count of Number of Bits Set to 1 . . . . . . . . . . . . . . . . . . . . . . . . . . 632
SPFLT - Set performance monitor filtering mask . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 634
TZCNT - Trailing Zero Count . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 637
TZCNTI - Initialized Trailing Zero Count . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 639
VPREFETCH0 - Prefetch memory line using T0 hint . . . . . . . . . . . . . . . . . . . . . . . . . . . 641
VPREFETCH1 - Prefetch memory line using T1 hint . . . . . . . . . . . . . . . . . . . . . . . . . . . 643
VPREFETCH2 - Prefetch memory line using T2 hint . . . . . . . . . . . . . . . . . . . . . . . . . . . 645
VPREFETCHE0 - Prefetch memory line using T0 hint, with intent to write . . . . . . . . . . . . . . 647
VPREFETCHE1 - Prefetch memory line using T1 hint, with intent to write . . . . . . . . . . . . . . 649
VPREFETCHE2 - Prefetch memory line using T2 hint, with intent to write . . . . . . . . . . . . . . 651
VPREFETCHENTA - Prefetch memory line using NTA hint, with intent to write . . . . . . . . . . . 653
VPREFETCHNTA - Prefetch memory line using NTA hint . . . . . . . . . . . . . . . . . . . . . . . . 655

B Intel® Xeon Phi™ coprocessor 64 bit Mode Scalar Instruction Support . . . . . . . . . . . . . . . . . 657

B.1 64 bit Mode General-Purpose and X87 Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . 657
B.2 Intel® Xeon Phi™ coprocessor 64 bit Mode Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . 659
B.3 LDMXCSR - Load MXCSR Register . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 660
B.4 FXRSTOR - Restore x87 FPU and MXCSR State . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 662
B.5 FXSAVE - Save x87 FPU and MXCSR State . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 665
B.6 RDPMC - Read Performance-Monitoring Counters . . . . . . . . . . . . . . . . . . . . . . . . . . . . 667
B.7 STMXCSR - Store MXCSR Register . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 670
B.8 CPUID - CPUID Identification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 671
C Floating-Point Exception Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 683
C.1 Instruction floating-point exception summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 683
C.2 Conversion floating-point exception summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 685
C.3 Denormal behavior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 686
D Instruction Attributes and Categories . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 691

D.1 Conversion Instruction Families . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 692
D.1.1 Df32 Family of Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 692
D.1.2 Df64 Family of Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 692
D.1.3 Di32 Family of Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 692
D.1.4 Di64 Family of Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 692
D.1.5 Sf32 Family of Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 692
D.1.6 Sf64 Family of Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 692
D.1.7 Si32 Family of Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 693
D.1.8 Si64 Family of Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 693
D.1.9 Uf32 Family of Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 693
D.1.10 Uf64 Family of Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 693
D.1.11 Ui32 Family of Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 693
D.1.12 Ui64 Family of Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 693
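E Non-faulting Undefined Opcodes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 694
F General Templates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 696
F.1 Mask Operation Templates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 697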

Mask m0 - Template . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 698
Mask m1 - Template . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 699
Mask m2 - Template . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 700
Mask m3 - Template . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 701
Mask m4 - Template . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 702
Mask m5 - Template . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 703
F.2 Vector Operation Templates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 704
Vector v0 - Template . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 705
Vector v1 - Template . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 707
Vector v10 - Template . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 708
Vector v11 - Template . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 710
Vector v2 - Template . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 711
Vector v3 - Template . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 713
Vector v4 - Template . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 714
Vector v5 - Template . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 716
Vector v6 - Template . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 718
Vector v7 - Template . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 719
Vector v8 - Template . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 720
Vector v9 - Template . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 722
F.3 Scalar Operation Templates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 723
Scalar s0 - Template . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 724
Scalar s1 - Template . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 725

LIST OF TABLES

List of Tables

2.1 EH attribute syntax. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.2 32 bit Register SwizzUpConv swizzle primitives. Notation: dcba denotes the 32 bit elements
that form one 128-bit block in the source (with 'a' least significant and 'd' most significant), so
aaaa means that the least significant element of the 128-bit block in the source is replicated to all
four elements of the same 128-bit block in the destination; the depicted pattern is then repeated
for all four 128-bit blocks in the source and destination. We use 'ponm lkji hgfe dcba' to denote a
full Intel® Xeon Phi™ Coprocessor Instruction Set Architecture source register, where 'a' is the least
significant element and 'p' is the most significant element. However, since each 128-bit block
performs the same permutation for register swizzles, we only show the least significant block
here. Note that in this table as well as in subsequent ones from this chapter S2 S1 S0 are bits 6-4
from MVEX prefix encoding (see Figure 3.3). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.3 64 bit Register SwizzUpConv swizzle primitives. Notation: dcba denotes the 64 bit elements
that form one 256-bit block in the source (with 'a' least significant and 'd' most significant), so
aaaa means that the least significant element of the 256-bit block in the source is replicated to all
four elements of the same 256-bit block in the destination; the depicted pattern is then repeated
for the two 256-bit blocks in the source and destination. We use 'hgfe dcba' to denote a full Intel®
Xeon Phi™ Coprocessor Instruction Set Architecture source register, where 'a' is the least significant
element and 'h' is the most significant element. However, since each 256-bit block performs the
same permutation for register swizzles, we only show the least significant block here. . . . . . . . 30
2.4 32 bit Floating-point Load-op SwizzUpConvf32 swizzle/conversion primitives. We use 'ponm
lkji hgfe dcba' to denote a full Intel® Xeon Phi™ Coprocessor Instruction Set Architecture source
register, with each letter referring to a 32 bit element, where 'a' is the least significant element
and 'p' is the most significant element. So, for example, 'dcba dcba dcba dcba' shows that the
source elements are copied to the destination by replicating the lower 128 bits of the source (the
four least significant elements) to each 128-bit block of the destination. . . . . . . . . . . . . . . . 30
2.5 32 bit Integer Load-op SwizzUpConvi32 (Doubleword) swizzle/conversion primitives. We use
'ponm lkji hgfe dcba' to denote a full Intel® Xeon Phi™ Coprocessor Instruction Set Architecture
source register, with each letter referring to a 32 bit element, where 'a' is the least significant
element and 'p' is the most significant element. So, for example, 'dcba dcba dcba dcba' shows
that the source elements are copied to the destination by replicating the lower 128 bits of the
source (the four least significant elements) to each 128-bit block of the destination. . . . . . . . . 31
2.6 64 bit Floating-point Load-op SwizzUpConvf64 swizzle/conversion primitives. We use 'hgfe
dcba' to denote a full Intel® Xeon Phi™ Coprocessor Instruction Set Architecture source register,
with each letter referring to a 64 bit element, where 'a' is the least significant element and 'h'
is the most significant element. So, for example, 'dcba dcba' shows that the source elements are
copied to the destination by replicating the lower 256 bits of the source (the four least significant
elements) to each 256-bit block of the destination. . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.7 64 bit Integer Load-op SwizzUpConvi64 (Quadword) swizzle/conversion primitives. We use
'hgfe dcba' to denote a full Intel® Xeon Phi™ Coprocessor Instruction Set Architecture source register,
with each letter referring to a 64 bit element, where 'a' is the least significant element and 'h'
is the most significant element. So, for example, 'dcba dcba' shows that the source elements are
copied to the destination by replicating the lower 256 bits of the source (the four least significant
elements) to each 256-bit block of the destination. . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.8 32 bit Load UpConv load/broadcast instructions per datatype. Elements may be 1, 2, or 4 bytes
in memory prior to data conversion, after which they are always 4 bytes. We use 'ponm lkji hgfe
dcba' to denote a full Intel® Xeon Phi™ Coprocessor Instruction Set Architecture source register,
with each letter referring to a 32 bit element, where 'a' is the least significant element and 'p'
is the most significant element. So, for example, 'dcba dcba dcba dcba' shows that the source
elements are copied to the destination by replicating the lower 128 bits of the source (the four
least significant elements) to each 128-bit block of the destination. . . . . . . . . . . . . . . . . . . 32
2.9 32 bit Load UpConv conversion primitives. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.10 64 bit Load UpConv load/broadcast instructions per datatype. Elements are always 8 bytes. We
use 'hgfe dcba' to denote a full Intel® Xeon Phi™ Coprocessor Instruction Set Architecture source
register, with each letter referring to a 64 bit element, where 'a' is the least significant element and
'h' is the most significant element. So, for example, 'dcba dcba' shows that the source elements are
copied to the destination by replicating the lower 256 bits of the source (the four least significant
elements) to each 256-bit block of the destination. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.11 64 bit Load UpConv conversion primitives. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.12 32 bit DownConv conversion primitives. Unless otherwise noted, all conversions from floating-point use MXCSR.RC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.13 64 bit DownConv conversion primitives. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.14 Static Rounding-Mode Swizzle available modes plus SAE. . . . . . . . . . . . . . . . . . . . . . . . 36
2.15 MXCSR bit layout. Note: MXCSR bit 20 is reserved, however it is not reported as Reserved by
MXCSR_MASK. Setting this bit will result in undefined behavior . . . . . . . . . . . . . . . . . . . . 39
3.1 Operand Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.2 Vector Operand Value Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.3 Size of vector or element accessed in memory for up-conversion . . . . . . . . . . . . . . . . . . . . 49
3.3 Size of vector or element accessed in memory for up-conversion . . . . . . . . . . . . . . . . . . . . 50
3.4 Size of vector or element accessed in memory for down-conversion . . . . . . . . . . . . . . . . . . 50
3.5 Prefetch behavior based on the EH (cache-line eviction hint) . . . . . . . . . . . . . . . . . . . . . . 51
3.6 Load/load-op behavior based on the EH bit. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.7 Store behavior based on the EH bit. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.8 SwizzUpConv, UpConv and DownConv function conventions . . . . . . . . . . . . . . . . . . . . . . 53
4.1 Masked Responses of Intel® Xeon Phi™ Coprocessor Instruction Set Architecture to Invalid Arithmetic Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.2 Summary of legal and illegal swizzle/conversion primitives for special instructions. . . . . . . . . 61
4.3 Rules for handling NaNs for unary and binary operations. . . . . . . . . . . . . . . . . . . . . . . . . 64
4.4 Rules for handling NaNs for fused multiply and add/sub operations (ternary). . . . . . . . . . . . 65
4.5 Processor State Following Power-up, Reset, or INIT. . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
6.1 VADDN outcome when adding zeros depending on rounding-mode. See Signed Zeros in Section 4.6.1.3 for other cases with a result of zero. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
6.2 VADDN outcome when adding zeros depending on rounding-mode. See Signed Zeros in Section 4.6.1.3 for other cases with a result of zero. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
6.3 VCMPPD behavior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
6.4 VCMPPS behavior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
6.5 Converting to integer special floating-point values behavior . . . . . . . . . . . . . . . . . . . . . . 161
6.6 Converting to integer special floating-point values behavior . . . . . . . . . . . . . . . . . . . . . . 165
6.7 Converting to integer special floating-point values behavior . . . . . . . . . . . . . . . . . . . . . . 169
6.8 Converting to integer special floating-point values behavior . . . . . . . . . . . . . . . . . . . . . . 173
6.9 Converting float64 to float32 special values behavior . . . . . . . . . . . . . . . . . . . . . . . . . . 180
6.10 vexp2_1ulp() special int values behavior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
6.11 VFNMSUB outcome when adding zeros depending on rounding-mode . . . . . . . . . . . . . . . . 273
6.12 VFNMSUB outcome when adding zeros depending on rounding-mode . . . . . . . . . . . . . . . . 277
6.13 VFNMSUB outcome when adding zeros depending on rounding-mode . . . . . . . . . . . . . . . . 281
6.14 VFNMSUB outcome when adding zeros depending on rounding-mode . . . . . . . . . . . . . . . . 285
6.15 VFMADDN outcome when adding zeros depending on rounding-mode . . . . . . . . . . . . . . . . 289
6.16 VFMADDN outcome when adding zeros depending on rounding-mode . . . . . . . . . . . . . . . . 293
6.17 GetExp() special floating-point values behavior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 313
6.18 GetExp() special floating-point values behavior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 316
6.19 GetMant() special floating-point values behavior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 319
6.20 GetMant() special floating-point values behavior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 324
6.21 Max exception flags priority . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 333
6.22 Max exception flags priority . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 337
6.23 Min exception flags priority . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 341
6.24 Min exception flags priority . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 345
6.25 vlog2_DX() special floating-point values behavior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 373
6.26 recip_1ulp() special floating-point values behavior . . . . . . . . . . . . . . . . . . . . . . . . . . . . 577
6.27 RoundToInt() special floating-point values behavior . . . . . . . . . . . . . . . . . . . . . . . . . . . 580
6.28 RoundToInt() special floating-point values behavior . . . . . . . . . . . . . . . . . . . . . . . . . . . 584
6.29 rsqrt_1ulp() special floating-point values behavior . . . . . . . . . . . . . . . . . . . . . . . . . . . . 588
B.3 Highest CPUID Source Operand for IA-32 Processors . . . . . . . . . . . . . . . . . . . . . . . . . . . 672
B.4 Information Returned by CPUID Instruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 673
B.5 Information Returned by CPUID Instruction (Contd.) . . . . . . . . . . . . . . . . . . . . . . . . . . 674
B.6 Information Returned by CPUID Instruction. 8000000xH leafs. . . . . . . . . . . . . . . . . . . . . 675
B.7 Information Returned by CPUID Instruction. 8000000xH leafs. (Contd.) . . . . . . . . . . . . . . . 676
B.8 Feature Information Returned in the EDX Register (CPUID.EAX[01h].EDX) . . . . . . . . . . . . . . 680
B.9 Feature Information Returned in the EDX Register (CPUID.EAX[01h].EDX) (Contd.) . . . . . . . . 681
B.10 Feature Information Returned in the ECX Register (CPUID.EAX[01h].ECX) . . . . . . . . . . . . . . 682
C.3 Float-to-integer Max/Min Valid Range . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 689
C.4 Float-to-float Max/Min Valid Range . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 690

LIST OF FIGURES

List of Figures

2.1 64 bit Execution Environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.2 Vector and Vector Mask Registers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.1 New Instruction Encoding Format with MVEX Prefix . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.2 New Instruction Encoding Format with VEX Prefix . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.3 MVEX bitfields . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.4 VEX bitfields . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.1 MXCSR Control/Status Register . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56

CHAPTER 1. INTRODUCTION

Chapter 1
Introduction
This document describes new vector instructions for the Intel® Xeon Phi™ coprocessor.
The major features of the new vector instructions described herein are:
A high performance 64 bit execution environment. The Intel® Xeon Phi™ coprocessor provides a 64 bit execution environment (see Figure 2.1) similar to that described in the Intel® 64 and IA-32 Architectures Software
Developer's Manual. Additionally, Intel® Xeon Phi™ Coprocessor Instruction Set Architecture provides basic
support for float64 and int64 logical operations.
32 new vector registers. The Intel® Xeon Phi™ coprocessor's 64 bit environment offers 32 512-bit wide vector
SIMD registers tailored to boost the performance of high performance computing applications. The 512-bit
vector SIMD instruction extensions provide comprehensive, native support to handle 32 bit and 64 bit
floating-point and integer data, including a rich set of conversions for native data types.
Ternary instructions. Most instructions are ternary, with two sources and a different destination. Multiply-and-add instructions are ternary with three sources, one of which is also the destination.
Vector mask support. Intel® Xeon Phi™ Coprocessor Instruction Set Architecture introduces 8 vector mask registers that allow for conditional execution over the 16 (or 8) elements in a vector instruction, and merging of
the results into the destination. Masks allow vectorizing loops that contain conditional statements. Additionally, Intel® Xeon Phi™ Coprocessor Instruction Set Architecture provides support for updating the value
of the vector masks with special vector instructions such as vcmpmps.
Coherent memory model. The Intel® Xeon Phi™ Coprocessor Instruction Set Architecture operates in a memory
address space that follows the standard defined by the Intel® 64 architecture. This feature eases the process
of developing vector code.
Gather/Scatter support. The Intel® Xeon Phi™ Coprocessor Instruction Set Architecture features specific gather/scatter
instructions that allow manipulation of irregular data patterns of memory (by fetching sparse locations of
memory into a dense vector register or vice-versa), thus enabling vectorization of algorithms with complex
data structures.

CHAPTER 2. INSTRUCTIONS TERMINOLOGY AND STATE

Chapter 2
Instructions Terminology and State
The vector streaming SIMD instruction extensions are designed to enhance the performance of Intel® 64 processors for scientific and engineering applications.
This chapter introduces Intel® Xeon Phi™ Coprocessor Instruction Set Architecture terminology and relevant processor state.

2.1

Overview of the Intel® Xeon Phi™ Coprocessor Instruction Set Architecture Extensions

2.1.1

What are vectors?

The vector is the basic working unit of the Intel® Xeon Phi™ Coprocessor Instruction Set Architecture. Most instructions use at least one vector. A vector is defined as a sequence of packed data elements. For Intel® Xeon Phi™ Coprocessor Instruction Set Architecture the size of a vector is 64 bytes. Because the supported data types are float32, int32,
float64 and int64, a vector consists of either 16 doubleword-size elements or 8 quadword-size elements. Only doubleword and quadword elements are supported in Intel® Xeon Phi™ Coprocessor Instruction Set Architecture.
The Intel® Xeon Phi™ Coprocessor Instruction Set Architecture provides 32 vector registers.
Additionally, Intel® Xeon Phi™ Coprocessor Instruction Set Architecture features vector masks. Vector masks allow
any set of elements in the destination to be protected from updates during the execution of any operation. A
subset of this functionality is the ability to control the vector length of the operation being performed (that is,
the span of elements being modified, from the first to the last one); however, it is not necessary that the elements
that are modified be consecutive.
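
As an illustration of the element counts given above, the following Python sketch views one 64-byte vector either as 16 doubleword elements or as 8 quadword elements. It is a software model only, not part of the instruction set; the helper name split_elements is hypothetical, and little-endian byte order is assumed.

```python
import struct

VECTOR_BYTES = 64  # vector size stated in the text


def split_elements(raw, element_bytes):
    """View a 64-byte vector as unsigned elements (hypothetical helper).

    element_bytes must be 4 (doubleword) or 8 (quadword); little-endian
    byte order is an assumption of this sketch.
    """
    if len(raw) != VECTOR_BYTES:
        raise ValueError("a vector is exactly 64 bytes")
    if element_bytes not in (4, 8):
        raise ValueError("only doubleword and quadword elements are supported")
    count = VECTOR_BYTES // element_bytes
    code = "I" if element_bytes == 4 else "Q"
    return list(struct.unpack("<%d%s" % (count, code), raw))


raw = bytes(range(VECTOR_BYTES))   # bytes 0x00 .. 0x3F
dwords = split_elements(raw, 4)    # 16 elements of 32 bits
qwords = split_elements(raw, 8)    # 8 elements of 64 bits
```

The same 64 bytes yield 16 or 8 elements depending only on the element size chosen by the instruction.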

2.1.2

Vector mask registers

Most Intel® Xeon Phi™ Coprocessor Instruction Set Architecture vector instructions use a special extra source,
known as the write-mask, sourced from a set of 8 registers called vector mask registers. These registers contain
one bit for each element that can be held by a regular Intel® Xeon Phi™ Coprocessor Instruction Set Architecture
vector register.
Elements are always either float32, int32, float64 or int64 and the vector size is set to 64 bytes. Therefore, a
vector register holds either 8 or 16 elements; accordingly, the length of a vector mask register is 16 bits. For 64
bit datatype instructions, only the 8 least significant bits of the vector mask register are used.
A vector mask register affects an instruction for which it is the write-mask operand at element granularity (either
32 or 64 bits). That means that every element-sized operation and element-sized destination update by a vector
instruction is predicated on the corresponding bit of the vector mask register used as the write-mask operand.
That has two implications:
• The instruction's operation is not performed for an element if the corresponding write-mask bit is not
set. This implies that no exception or violation can be caused by an operation on a masked-off element.
• A destination element is not updated if the corresponding write-mask bit is not set. Thus, the mask
in effect provides a merging behavior for Intel® Xeon Phi™ Coprocessor Instruction Set Architecture vector
register destinations, thereby potentially converting destinations into implicit sources whenever a write-mask containing any 0-bits is used.
This merging behavior, and the associated performance hazards, can also occur when writing a vector to
memory via a vector store. Vectors are written on a per-element basis, based on the vector mask register used as a write-mask. Therefore, no exception or violation can be caused by a write to a masked-off
element of a destination vector operand.
The sticky bits implemented in the MXCSR to indicate that floating-point exceptions occurred are set based
solely upon operations on non-masked vector elements.
The value of a given mask register can be set up as a direct result of a vector comparison instruction, transferred
from a GP register, or calculated as a direct result of a logical operation between two masks.
Vector mask registers can be used for purposes other than write-masking. For example, they can be used to
set the EFLAGS based on the 0/0xFFFF/other status of the OR of two vector mask registers. A number of
Intel® Xeon Phi™ Coprocessor Instruction Set Architecture instructions are provided to support such uses of the vector mask
register.
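
The ways of producing and inspecting mask values described above can be modeled in a few lines of Python. This is a sketch under stated assumptions, not a description of any particular instruction: the function names compare_mask and or_status are hypothetical, and the three-way status string merely stands in for the 0/0xFFFF/other distinction that the text says is reflected in EFLAGS.

```python
MASK_BITS = 16  # length of a vector mask register, per the text


def compare_mask(a, b, predicate):
    """Build a mask with bit i set when predicate(a[i], b[i]) holds.

    Models "set up as a direct result of a vector comparison"; the
    predicate argument is an illustration device, not an ISA operand.
    """
    mask = 0
    for i, (x, y) in enumerate(zip(a, b)):
        if predicate(x, y):
            mask |= 1 << i
    return mask


def or_status(k_a, k_b):
    """Classify the OR of two masks as 'zero', 'all-ones' or 'other'."""
    combined = (k_a | k_b) & ((1 << MASK_BITS) - 1)
    if combined == 0:
        return "zero"
    if combined == 0xFFFF:
        return "all-ones"
    return "other"


a = list(range(16))
b = [8] * 16
k_lt = compare_mask(a, b, lambda x, y: x < y)    # elements 0..7 pass
k_ge = compare_mask(a, b, lambda x, y: x >= y)   # elements 8..15 pass
k_and = k_lt & k_ge                              # logical operation between masks
```

Here the two comparison masks are complementary, so their AND is empty and their OR covers all 16 elements.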

2.1.2.1

Vector mask k0

The only exception to the vector mask rules described above is mask k0. Mask k0 cannot be selected as the write-mask for a vector operation; the encoding that would be expected to select mask k0 instead selects an implicit
mask of 0xFFFF, thereby effectively disabling masking. Vector mask k0 can still be used as any non-write-mask
operand for any instruction that takes vector mask operands; it just can't ever be selected as a write-mask.
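
The k0 rule can be expressed as a small lookup, sketched here in Python. The name effective_write_mask and the idea of passing the mask selector as an integer from 0 to 7 are assumptions made for illustration; the actual encoding fields are described elsewhere in this manual.

```python
def effective_write_mask(mask_regs, selector):
    """Return the write-mask a vector instruction actually uses.

    mask_regs: list of 8 integers modeling k0..k7 (hypothetical model).
    selector:  0..7, the value that would nominally name a mask register.
    A selector of 0 does not read k0; it yields the implicit mask 0xFFFF.
    """
    if not 0 <= selector <= 7:
        raise ValueError("there are only 8 vector mask registers")
    if selector == 0:
        return 0xFFFF  # masking effectively disabled
    return mask_regs[selector] & 0xFFFF


mask_regs = [0x1234, 0, 0, 0x8F03, 0, 0, 0, 0]
no_masking = effective_write_mask(mask_regs, 0)   # k0 contents are ignored
with_k3 = effective_write_mask(mask_regs, 3)
```

Note that k0 still holds its own value (0x1234 in this model) and remains readable as an ordinary mask operand; only its use as a write-mask is replaced.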

2.1.2.2

Example of use

Here's an example of a masked vector operation.
The initial state of vector registers zmm0, zmm1, and zmm2 is:
                MSB                              LSB
zmm0 = [ 0x00000003 0x00000002 0x00000001 0x00000000 ]   (bytes 15 through  0)
       [ 0x00000007 0x00000006 0x00000005 0x00000004 ]   (bytes 31 through 16)
       [ 0x0000000B 0x0000000A 0x00000009 0x00000008 ]   (bytes 47 through 32)
       [ 0x0000000F 0x0000000E 0x0000000D 0x0000000C ]   (bytes 63 through 48)

zmm1 = [ 0x0000000F 0x0000000F 0x0000000F 0x0000000F ]   (bytes 15 through  0)
       [ 0x0000000F 0x0000000F 0x0000000F 0x0000000F ]   (bytes 31 through 16)
       [ 0x0000000F 0x0000000F 0x0000000F 0x0000000F ]   (bytes 47 through 32)
       [ 0x0000000F 0x0000000F 0x0000000F 0x0000000F ]   (bytes 63 through 48)

zmm2 = [ 0xAAAAAAAA 0xAAAAAAAA 0xAAAAAAAA 0xAAAAAAAA ]   (bytes 15 through  0)
       [ 0xBBBBBBBB 0xBBBBBBBB 0xBBBBBBBB 0xBBBBBBBB ]   (bytes 31 through 16)
       [ 0xCCCCCCCC 0xCCCCCCCC 0xCCCCCCCC 0xCCCCCCCC ]   (bytes 47 through 32)
       [ 0xDDDDDDDD 0xDDDDDDDD 0xDDDDDDDD 0xDDDDDDDD ]   (bytes 63 through 48)

k3 = 0x8F03   (1000 1111 0000 0011)

Given this state, we will execute the following instruction:
vpaddd zmm2 {k3}, zmm0, zmm1
The vpaddd instruction adds vector elements of 32 bit integers. Since elements are not operated upon when the
corresponding bit of the mask is not set, the temporary result would be:
       [ ********** ********** 0x00000010 0x0000000F ]   (bytes 15 through  0)
       [ ********** ********** ********** ********** ]   (bytes 31 through 16)
       [ 0x0000001A 0x00000019 0x00000018 0x00000017 ]   (bytes 47 through 32)
       [ 0x0000001E ********** ********** ********** ]   (bytes 63 through 48)

where "**********" indicates that no operation is performed.
This temporary result is then written into the destination vector register, zmm2, using vector mask register k3
as the write-mask, producing the following final result:
zmm2 = [ 0xAAAAAAAA 0xAAAAAAAA 0x00000010 0x0000000F ]   (bytes 15 through  0)
       [ 0xBBBBBBBB 0xBBBBBBBB 0xBBBBBBBB 0xBBBBBBBB ]   (bytes 31 through 16)
       [ 0x0000001A 0x00000019 0x00000018 0x00000017 ]   (bytes 47 through 32)
       [ 0x0000001E 0xDDDDDDDD 0xDDDDDDDD 0xDDDDDDDD ]   (bytes 63 through 48)

Note that for a 64 bit instruction (say vaddpd), only the 8 LSB of mask k3 (0x03) would be used to identify the
write-mask operation on each one of the 8 elements of the source/destination vectors.
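
The worked example can be checked with a short Python model of write-masked, element-wise execution. This is an illustrative sketch, not an emulator: the function name masked_op is hypothetical, registers are modeled as lists with element 0 as the least significant element, and results are truncated to the element width.

```python
def masked_op(dest, src1, src2, mask, op, element_bits=32):
    """Apply op element-wise under a write-mask (hypothetical model).

    Elements whose mask bit is 0 are neither computed nor updated, so the
    previous destination value is kept (merging behavior). For 64 bit
    elements only the 8 least significant mask bits are consulted.
    """
    count = 512 // element_bits          # 16 or 8 elements per vector
    limit = (1 << element_bits) - 1      # wrap results to element width
    result = list(dest)
    for i in range(count):
        if (mask >> i) & 1:
            result[i] = op(src1[i], src2[i]) & limit
    return result


# Initial state from the example; index 0 is the least significant element.
zmm0 = list(range(16))
zmm1 = [0x0000000F] * 16
zmm2 = [0xAAAAAAAA] * 4 + [0xBBBBBBBB] * 4 + [0xCCCCCCCC] * 4 + [0xDDDDDDDD] * 4
k3 = 0x8F03

# Model of: vpaddd zmm2 {k3}, zmm0, zmm1
result = masked_op(zmm2, zmm0, zmm1, k3, lambda a, b: a + b)

# 64 bit case: the same k3 affects only elements 0 and 1 (low byte 0x03).
result64 = masked_op([0] * 8, [1] * 8, [2] * 8, k3, lambda a, b: a + b,
                     element_bits=64)
```

Reading result from index 15 down to index 0 reproduces the four rows of the final zmm2 value shown above.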

2.1.3

Understanding Intel® Xeon Phi™ Coprocessor Instruction Set Architecture

Intel® Xeon Phi™ Coprocessor Instruction Set Architecture instructions can be classified depending on the nature of their
operands. The majority of the Intel® Xeon Phi™ Coprocessor Instruction Set Architecture instructions operate on vector registers, with a vector mask register serving as a write-mask. However, in most cases these instructions may have
one of the vector source operands stored in either memory or a vector register, and may additionally have one
or more non-vector (scalar) operands, such as an Intel® 64 general purpose register or an immediate value. Additionally, some instructions use vector mask registers as destinations and/or explicit sources. Finally, Intel®
Xeon Phi™ Coprocessor Instruction Set Architecture adds some new scalar instructions.
From the point of view of instruction formats, there are four main types of Intel® Xeon Phi™ Coprocessor Instruction Set Architecture instructions:
• Vector Instructions
• Vector Memory Instructions
• Vector Mask Instructions
• New Scalar Instructions

2.1.3.1

Intel® Xeon Phi™ Coprocessor Instruction Set Architecture Vector Instructions

Vector instructions operate on vectors that are sourced from either registers or memory and that can be modified
prior to the operation via predefined swizzle and convert functions. The destination is usually a vector register,
though some vector instructions may have a vector mask register as either a second destination or the primary
destination.
All these instructions work in an element-wise manner: the first element of the first source vector is operated
on together with the first element of the second source vector, and the result is stored in the first element of the
destination vector, and so on for the remaining 15 (or 7) elements.
As described above, the vector mask register that serves as the write-mask for a vector instruction determines
which element locations are actually operated upon; the mask can disable the operation and update for any
combination of element locations.
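The element-wise, write-masked behavior described above can be modeled in a few lines of Python. This is only an illustrative sketch, not an implementation of any particular instruction: the function name masked_elementwise and the list-based register model (index 0 is the least significant element) are assumptions made for this example.

```python
def masked_elementwise(op, dest, src1, src2, mask):
    """Return the new destination after a write-masked element-wise operation.

    Index 0 is the least significant element; bit i of mask controls element i.
    """
    if not (len(dest) == len(src1) == len(src2)):
        raise ValueError("all vectors must have the same number of elements")
    return [
        op(a, b) if (mask >> i) & 1 else old  # masked-off elements keep their old value
        for i, (old, a, b) in enumerate(zip(dest, src1, src2))
    ]


zmm0 = [0] * 16         # destination contents before the operation
zmm1 = list(range(16))  # first source
zmm2 = [100] * 16       # second source
k1 = 0x0003             # write-mask: only elements 0 and 1 are enabled
result = masked_elementwise(lambda a, b: a + b, zmm0, zmm1, zmm2, k1)
```

With this mask, only the two least significant elements of the destination are updated; the other 14 keep their previous contents.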
Most vector instructions have three different vector operands (typically, two sources and one destination) except those instructions that have a single source and thus use only two operands. Additionally, most vector
instructions feature an extra operand in the form of the vector mask register that serves as the write-mask.
Thus, we can categorize Intel® Xeon Phi™ Coprocessor Instruction Set Architecture vector instructions based on
the number of vector sources they use:
Vector-Converted Vector/Memory. Vector-converted vector/memory instructions, such as vaddps (which
adds two vectors), are ternary operations that take two different sources, a vector register and a converted
vector/memory operand, and a separate destination vector register, as follows:
zmm0 <= OP(zmm1, S(zmm2, m))
where zmm1 is a vector operand that is used as the first source for the instruction, S(zmm2, m) is a converted vector/memory operand that is used as the second source for the instruction, and the result of
performing operation OP on the two source operands is written to vector destination register zmm0.
A converted vector/memory operand is a source vector operand that is obtained by
applying a swizzle/conversion function to either an Intel® Xeon Phi™ Coprocessor Instruction Set Architecture
vector or a memory operand. The details of the swizzle/conversion function are found in section 2.2;
note that its behavior varies depending on whether the operand is a register or a memory location, and,
for memory operands, on whether the instruction performs a floating-point or integer operation. Each
source memory operand must have an address that is aligned to the number of bytes of memory actually
accessed by the operand (that is, before the swizzle/convert is performed); otherwise, a #GP fault will
result.
Converted Vector/Memory. Converted vector/memory instructions, such as vcvtpu2ps (which converts a vector of unsigned integers to a vector of floats), are binary operations that take a single vector source, as
follows:
zmm0 <= OP(S(zmm1, m))
Vector-Vector-Converted Vector/Memory. Vector-vector-converted vector/memory instructions, of which
vfmadd*ps (multiply-add of three vectors) is a good example, are similar to the vector-converted vector/memory family of instructions; here, however, the destination vector register is used as a third source
as well:
zmm0 <= OP(zmm0, zmm1, S(zmm2, m))

2.1.3.2

Intel® Xeon Phi™ Coprocessor Instruction Set Architecture Vector Memory Instructions

Vector Memory Instructions perform vector loads from and vector stores to memory, with extended conversion
support.
As with regular vector instructions, vector memory instructions transfer data from/to memory in an element-wise fashion, with the elements that are actually transferred dictated by the contents of the vector mask that is
selected as the write-mask.
There are two basic groups of Intel® Xeon Phi™ Coprocessor Instruction Set Architecture vector memory instructions, vector loads/broadcasts and vector stores.
Vector Loads/Broadcasts. A vector load/broadcast reads a memory source, performs a predefined load conversion function, and replicates the result (in the case of broadcasts) to form a 64-byte 16-element vector
(or 8-element for 64 bit datatypes). This vector is then conditionally written element-wise to the vector
destination register, with the writes enabled or disabled according to the corresponding bits of the vector
mask register selected as the write-mask.
The size of the memory operand is a function of the type of conversion and the number of replications
to be performed on the memory operand. We call this special memory operand an up-converted memory
operand. Each source memory operand must have an address that is aligned to the number of bytes of
memory actually accessed by the operand (that is, before the swizzle/convert is performed); otherwise, a
#GP fault will result.
A Vector Load operates as follows:
zmm0 <= U(m)
where U(m) is an up-converted memory operand whose contents are replicated and written to destination
register zmm0. The mnemonic dictates the degree of replication and the conversion table.
A special sub-case of these instructions is the Vector Gather. Vector Gathers are a special form of vector
loads where, instead of consecutive chunks of memory, we load a sparse set of memory operands (as
many as the vector elements of the destination). Every one of those memory operands must obey the
alignment rules; otherwise, a #GP fault will result if the related write-mask bit is not disabled (set to 0).
A Vector Gather operates as follows:
zmm0 <= U(mv)
where U(mv) is a set of up-converted memory operands described by a base address, a vector of indices
and an immediate scale to apply to each index. Every one of those operands is conditionally written to
destination vector zmm0 (based on the value of the write-mask).
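The addressing scheme of a gather (base address, vector of indices, immediate scale, write-mask) can be sketched as follows. This is a simplified model under stated assumptions: memory is a hypothetical dictionary mapping a byte address to one already up-converted element, and the function name gather is chosen for this example only.

```python
def gather(dest, memory, base, indices, scale, mask):
    """Model of zmm0 <= U(mv): load one element per enabled index.

    memory maps a byte address to one (already up-converted) element.
    """
    out = list(dest)
    for i, index in enumerate(indices):
        if (mask >> i) & 1:                      # element i is enabled
            out[i] = memory[base + index * scale]
    return out


# 64 consecutive 4-byte elements starting at address 0x1000; element n holds n * 10.
memory = {0x1000 + 4 * n: n * 10 for n in range(64)}
indices = [3, 0, 7, 1] + [0] * 12
old = [-1] * 16
loaded = gather(old, memory, 0x1000, indices, 4, 0x0007)
```

Only the three enabled elements are fetched; element 3 and above are left at their previous value because their write-mask bits are 0.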
Vector Stores. A vector store reads a vector register source, performs a predefined store conversion function,
and writes the result to the destination memory location on a per-element basis, with the writes enabled
or disabled according to the corresponding bits of the vector mask register selected as the write-mask.
The size of the memory destination is a function of the type of conversion associated with the mnemonic.
We call this special memory operand a down-converted memory operand. Each memory destination
operand must have an address that is aligned to the number of bytes of memory accessed by the operand
(pre-conversion, if conversion is performed); otherwise, a #GP fault will result.
A Vector Store operates as follows:
m <= D(zmm0)
where zmm0 is the vector register source whose full contents are down-converted (denoted by D()), and
written to memory.
A special sub-case of these instructions is the Vector Scatter. Vector Scatters are a special form of vector
stores where, instead of writing the source vector into a consecutive chunk of memory, we store each
vector element into a different memory location. Every one of those memory destinations must obey the
alignment rules; otherwise, a #GP fault will result if the related write-mask bit is not disabled (set to 0).
A Vector Scatter operates as follows:
mv <= D(zmm0)
where zmm0 is the vector register source whose full or partial contents are down-converted (denoted
by D()), and written to the set of memory locations mv, specified by a base address, a vector of indices
and an immediate scale which is applied to every index. Every one of those down-converted elements is
conditionally stored in the memory locations based on the value of the write-mask.
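A scatter mirrors the gather addressing scheme in the store direction. The sketch below is an illustration only: memory is again a hypothetical dictionary from byte address to element, down-conversion is omitted, and the example avoids duplicate indices because the behavior for colliding addresses is not described in this section.

```python
def scatter(memory, base, indices, scale, src, mask):
    """Model of mv <= D(zmm0): store each enabled element to its own address."""
    for i, index in enumerate(indices):
        if (mask >> i) & 1:                      # element i is enabled
            memory[base + index * scale] = src[i]


memory = {}
zmm0 = list(range(100, 116))
indices = [5, 2, 9] + [0] * 13
scatter(memory, 0x2000, indices, 4, zmm0, 0x0005)  # elements 0 and 2 enabled
```

After the call, only two locations have been written: the one addressed by index 5 (element 0) and the one addressed by index 9 (element 2).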

2.1.3.3

Intel® Xeon Phi™ Coprocessor Instruction Set Architecture vector mask Instructions

Vector mask instructions allow programmers to set, copy, or operate on the contents of a given vector mask.
There are three types of vector mask instructions:
• Mask read/write instructions: These instructions move data between a general-purpose integer register
and a vector mask register, or between two vector mask registers.
• Flag instructions: This category, consisting of instructions that modify EFLAGS based on vector mask
registers, actually contains only one instruction, kortest.
• Mask logical instructions: These instructions perform standard bitwise logical operations between vector mask registers.
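The following sketch models 16 bit vector masks as Python integers to illustrate the mask logical and flag categories. The helper names are chosen for this example. The flag rule used in kortest (ZF set when the OR of both masks is all zeros, CF set when it is all ones) is an assumption taken from the usual description of that instruction; it is not stated in this section.

```python
MASK_BITS = 16
ALL_ONES = (1 << MASK_BITS) - 1


def kand(k1, k2):
    """Bitwise AND of two vector masks."""
    return k1 & k2 & ALL_ONES


def kor(k1, k2):
    """Bitwise OR of two vector masks."""
    return (k1 | k2) & ALL_ONES


def knot(k):
    """Bitwise NOT of a vector mask, limited to 16 bits."""
    return ~k & ALL_ONES


def kortest(k1, k2):
    """Return the flags derived from k1 | k2 (assumed ZF/CF semantics)."""
    combined = kor(k1, k2)
    return {"ZF": combined == 0, "CF": combined == ALL_ONES}
```

A typical use of such a flag test is to branch on whether any element (or every element) of a vector comparison result is set.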

2.1.3.4

Intel® Xeon Phi™ Coprocessor Instruction Set Architecture New Scalar Instructions

In addition to vector, vector memory, and vector mask instructions, Intel® Xeon Phi™ Coprocessor Instruction Set
Architecture adds a few scalar instructions as well. These instructions are useful for increasing the performance
of some critical algorithms; for example, any code that suffers reduced performance due to cache-miss latency
can benefit from the new prefetch instructions.

2.2

Intel® Xeon Phi™ Coprocessor Instruction Set Architecture Swizzles
and Converts

Data transformation, in the form of certain data conversions or element rearrangements (for loads, both at once)
of one operand, can be performed for free as part of most Intel® Xeon Phi™ Coprocessor Instruction Set Architecture vector instructions.
Three sorts of data transformations are available:
• Data Conversions: Sources from memory can be converted to either 32 bit signed or unsigned integer or
32 bit floating-point before being used. Supported data types in memory are float16, sint8, uint8, sint16,
and uint16 for load-op instructions.
• Broadcast: If the source memory operand contains fewer than the total number of elements, it can be
broadcast (repeated) to form the full number of elements of the effective source operand (16 for 32 bit
instructions, 8 for 64 bit instructions). Broadcast can be combined with load-type conversions only; load-op instructions can do one or the other: either broadcast, or swizzle and/or up-conversion. There are two
broadcast granularities:
– 1-element granularity where the 1 element of the source memory operand is broadcast 16 times
to form a full 16-element effective source operand (for 32 bit instructions), or 8 times to form a full
8-element effective source operand (for 64 bit instructions).
– 4-element granularity where the 4 elements of the source memory operand are broadcast 4 times
to form a full 16-element effective source operand (for 32 bit instructions), or 2 times to form a full
8-element effective source operand (for 64 bit instructions).
Broadcast is very useful for instructions that mix vector and scalar sources, where one of the sources is
common across the different operations.
• Swizzles: Sources from registers can undergo swizzle transformations (that is, they can be permuted),
although only 8 swizzles are available, all of which are limited to permuting within 4-element sets (either
of 32 bits or 64 bits each).
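The two broadcast granularities described in the list above can be illustrated with a short sketch. The function name broadcast and the list model (index 0 is the least significant element) are assumptions made for this example.

```python
def broadcast(elements, total):
    """Repeat a 1-element or 4-element source until it fills `total` elements.

    total is 16 for 32 bit instructions and 8 for 64 bit instructions.
    """
    if len(elements) not in (1, 4):
        raise ValueError("only 1-element and 4-element granularities exist")
    if total % len(elements) != 0:
        raise ValueError("total must be a multiple of the source length")
    return list(elements) * (total // len(elements))


one_to_16 = broadcast([7], 16)            # 1-element granularity, 32 bit
four_to_8 = broadcast([1, 2, 3, 4], 8)    # 4-element granularity, 64 bit
```

In the manual's letter notation, the 4-element case corresponds to 'dcba dcba' for 64 bit instructions and 'dcba dcba dcba dcba' for 32 bit instructions.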
Intel® Xeon Phi™ Coprocessor Instruction Set Architecture also introduces the concept of Rounding Mode Override or Static (per instruction) Rounding Mode, which efficiently supports the feature of determining the
rounding mode for arithmetic operations on a per-instruction basis. Thus one can choose the rounding mode
without having to perform costly MXCSR save-modify-restore operations.
The Intel® Xeon Phi™ coprocessor extends the swizzle functionality for register-register operands in order to
provide rounding mode override capabilities for the Intel® Xeon Phi™ coprocessor floating-point instructions
instead of obeying the MXCSR.RC bits. All four rounding modes are available via swizzle attribute: Round-up, Round-down, Round-toward-zero and Round-to-nearest. The option is not available for instructions with
memory operands. On top of these options, the Intel® Xeon Phi™ coprocessor introduces the SAE (suppress-all-exceptions) attribute feature. An instruction with SAE set will not raise any kind of floating-point exception
flags, independent of the inputs.
In addition to those transformations, all Intel® Xeon Phi™ Coprocessor Instruction Set Architecture memory
operands may have a special attribute, called the EH hint (eviction hint), that indicates to the processor that
the data is non-temporal - that is, it is unlikely to be reused soon enough to benefit from caching in the 1st-level
cache and should be given priority for eviction. This is, however, a hint, and the processor may implement it in
any way it chooses, including ignoring the hint entirely.
Table 2.1 shows the assembly language syntax used to indicate the presence or absence of the EH hint.
B1   Function   Usage        Comment
0               [eax]        (no effect) regular memory operand
1    EH         [eax]{eh}    memory operand with Non-Temporal (Eviction) hint

Table 2.1: EH attribute syntax.
Data transformations can only be performed on one source operand at most; for instructions that take two
or three source operands, the other operands are always used unmodified, exactly as they're stored in their
source registers. In no case do any of the Intel® Xeon Phi™ Coprocessor Instruction Set Architecture instructions allow using
data conversion and swizzling at the same time. Broadcasts, on the other hand, can be combined with data
conversions when performing vector loads.
Not all instructions can use all of the different data transformations. Load-op instructions (such as vector arithmetic instructions), vector loads, and vector stores have different data transformation capabilities. We can categorize these transformation capabilities into three families:

• Load-Op SwizzUpConv: For a register source, swizzle; for a memory operand, either: (a) broadcast, or (b)
convert to 32 bit floats or 32 bit signed or unsigned integers. This is used by vector arithmetic instructions
and other load-op instructions. There are two versions, one for 32 bit floating-point instructions and
another for 32 bit integer instructions; in addition, the available data transformations differ for register
and memory operands.
• Load UpConv: Convert from a memory operand to 32 bit floats or 32 bit signed or unsigned integers; used
by vector loads and broadcast instructions. For 32 bit floats, there are three different conversion tables
based on three different input types. See Section 2.2.2, Load UpConvert.
There is no load conversion support for 64 bit datatypes.
• DownConv: Convert from 32 bit floats or 32 bit signed or unsigned integers to a memory operand; used by
vector stores. For 32 bit floats, there are three different conversion tables based on three different output
types. See Section 2.2.3, Down-Conversion.
There is no store conversion support for 64 bit datatypes.

2.2.1

Load-Op Swizzle/Convert

Vector load-op instructions can swizzle, broadcast, or convert one of the sources; we will refer to this as the
swizzle/convert source, and we will use SwizzUpConv to describe the swizzle/convert function itself. The available SwizzUpConv transformations vary depending on whether the operand is memory or a register, and also
in the case of conversions from memory depending on whether the vector instruction is 32 bit integer, 32 bit
floating-point, 64 bit integer or 64 bit floating-point. Three bits are used to select among the different options, so
eight options are available in each case.
When the swizzle/convert source is a register, SwizzUpConv allows the choice of one of eight swizzle primitives
(one of the eight being the identity swizzle). These swizzle functions work on either 4-byte or 8-byte elements
within 16-byte/32-byte boundaries. For 32 bit instructions, that means certain permutations of each set of four
elements (16 bytes) are supported, replicated across the four sets of four elements. When the swizzle/convert
source is a register, the functionality is the same for both integer and floating-point 32 bit instructions. Table 2.2
shows the available register-source swizzle primitives.
S2 S1 S0   Function: 4 x 32 bits                           Usage
000        no swizzle                                      zmm0 or zmm0 {dcba}
001        swap (inner) pairs                              zmm0 {cdab}
010        swap with two-away                              zmm0 {badc}
011        cross-product swizzle                           zmm0 {dacb}
100        broadcast a element across 4-element packets    zmm0 {aaaa}
101        broadcast b element across 4-element packets    zmm0 {bbbb}
110        broadcast c element across 4-element packets    zmm0 {cccc}
111        broadcast d element across 4-element packets    zmm0 {dddd}

Table 2.2: 32 bit Register SwizzUpConv swizzle primitives. Notation: dcba denotes the 32 bit elements that
form one 128-bit block in the source (with 'a' least significant and 'd' most significant), so aaaa means that the
least significant element of the 128-bit block in the source is replicated to all four elements of the same 128-bit
block in the destination; the depicted pattern is then repeated for all four 128-bit blocks in the source and
destination. We use 'ponm lkji hgfe dcba' to denote a full Intel® Xeon Phi™ Coprocessor Instruction Set Architecture
source register, where 'a' is the least significant element and 'p' is the most significant element. However, since
each 128-bit block performs the same permutation for register swizzles, we only show the least significant block
here. Note that in this table, as well as in subsequent ones from this chapter, S2 S1 S0 are bits 6-4 from the MVEX prefix
encoding (see Figure 3.3).
For 64 bit instructions, that means certain permutations of each set of four elements (32 bytes) are supported,
replicated across the two sets of four elements. When the swizzle/convert source is a register, the functionality
is the same for both integer and floating-point 64 bit instructions. Table 2.3 shows the available register-source
swizzle primitives.
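The register swizzle primitives can be modeled by reading each usage pattern (such as cdab) as a per-block permutation. The sketch below is an illustration under stated assumptions: vectors are Python lists with index 0 as the least significant element, the table SWIZZLES transcribes the S2 S1 S0 encodings from the swizzle tables, and the function name swizzle is chosen for this example. The same code covers 16-element (32 bit) and 8-element (64 bit) vectors, because both permute within 4-element blocks.

```python
# S2 S1 S0 encoding -> pattern, written most significant element first.
SWIZZLES = {
    0b000: "dcba", 0b001: "cdab", 0b010: "badc", 0b011: "dacb",
    0b100: "aaaa", 0b101: "bbbb", 0b110: "cccc", 0b111: "dddd",
}


def swizzle(vec, s2s1s0):
    """Apply one swizzle primitive to every 4-element block of vec."""
    pattern = SWIZZLES[s2s1s0]
    # Reverse the pattern so that picks[0] is the source of destination element 'a'.
    picks = ["abcd".index(letter) for letter in reversed(pattern)]
    out = []
    for block in range(0, len(vec), 4):
        out.extend(vec[block + p] for p in picks)
    return out


swapped = swizzle(list(range(16)), 0b001)   # {cdab}: swap (inner) pairs
```

For example, {cdab} exchanges elements a and b, and elements c and d, inside each block, and repeats that permutation for every block of the register.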
When the source is a memory location, load-op swizzle/convert can perform either no transformation, two different broadcasts, or four data conversions. Vector load-op instructions cannot both broadcast and perform data
conversion at the same time. The conversions available differ depending on whether the associated vector instruction is integer or floating-point, and whether the natural data type is 32 bit or 64 bit. (Note however that
there are no load conversions for 64 bit destination data types.)
Source memory operands may have sizes smaller than 64 bytes, expanding to the full 64 bytes of a vector source
by means of either broadcasting (replication) or data conversion.
Each source memory operand must have an address that is aligned to the number of bytes of memory actually
accessed by the operand (that is, before conversion or broadcast is performed); otherwise, a #GP fault will result.
Thus, for SwizzUpConv, any of 4-byte, 16-byte, 32-byte, or 64-byte alignment may be required.
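The alignment rule above can be made concrete by computing how many bytes each 32 bit floating-point load-op modifier actually reads. The byte counts below are derived for this example from the element sizes (number of elements read multiplied by the in-memory element size); the names BYTES_ACCESSED, GPFault and check_alignment are hypothetical.

```python
class GPFault(Exception):
    """Stands in for the #GP fault raised on a misaligned operand."""


# Bytes read from memory, before conversion or broadcast, per modifier.
BYTES_ACCESSED = {
    "none": 16 * 4,      # 16 float32 elements
    "1to16": 1 * 4,      # one float32 element, broadcast 16 times
    "4to16": 4 * 4,      # four float32 elements, broadcast 4 times
    "float16": 16 * 2,   # 16 float16 elements
    "uint8": 16 * 1,     # 16 uint8 elements
    "uint16": 16 * 2,    # 16 uint16 elements
    "sint16": 16 * 2,    # 16 sint16 elements
}


def check_alignment(address, modifier):
    """Return the required alignment, or raise GPFault if address violates it."""
    size = BYTES_ACCESSED[modifier]
    if address % size != 0:
        raise GPFault(f"address {address:#x} is not {size}-byte aligned")
    return size
```

The distinct values in this table are 4, 16, 32 and 64 bytes, which matches the alignments listed in the text.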

S2 S1 S0   Function: 4 x 64 bits                           Usage
000        no swizzle                                      zmm0 or zmm0 {dcba}
001        swap (inner) pairs                              zmm0 {cdab}
010        swap with two-away                              zmm0 {badc}
011        cross-product swizzle                           zmm0 {dacb}
100        broadcast a element across 4-element packets    zmm0 {aaaa}
101        broadcast b element across 4-element packets    zmm0 {bbbb}
110        broadcast c element across 4-element packets    zmm0 {cccc}
111        broadcast d element across 4-element packets    zmm0 {dddd}

Table 2.3: 64 bit Register SwizzUpConv swizzle primitives. Notation: dcba denotes the 64 bit elements that
form one 256-bit block in the source (with 'a' least significant and 'd' most significant), so aaaa means that the
least significant element of the 256-bit block in the source is replicated to all four elements of the same 256-bit
block in the destination; the depicted pattern is then repeated for the two 256-bit blocks in the source and
destination. We use 'hgfe dcba' to denote a full Intel® Xeon Phi™ Coprocessor Instruction Set Architecture source
register, where 'a' is the least significant element and 'h' is the most significant element. However, since each
256-bit block performs the same permutation for register swizzles, we only show the least significant block here.

S2 S1 S0   Function:                    Usage
000        no conversion                [rax]
001        broadcast 1 element (x16)    [rax] {1to16}
010        broadcast 4 elements (x4)    [rax] {4to16}
011        float16 to float32           [rax] {float16}
100        uint8 to float32             [rax] {uint8}
101        reserved                     N/A
110        uint16 to float32            [rax] {uint16}
111        sint16 to float32            [rax] {sint16}

Table 2.4: 32 bit Floating-point Load-op SwizzUpConvf32 swizzle/conversion primitives. We use 'ponm lkji
hgfe dcba' to denote a full Intel® Xeon Phi™ Coprocessor Instruction Set Architecture source register, with each
letter referring to a 32 bit element, where 'a' is the least significant element and 'p' is the most significant element. So, for example, 'dcba dcba dcba dcba' shows that the source elements are copied to the destination
by replicating the lower 128 bits of the source (the four least significant elements) to each 128-bit block of the
destination.
Table 2.4 shows the available 32 bit floating-point swizzle primitives.
SwizzUpConv conversions to float32s are exact.
Table 2.5 shows the available 32 bit integer swizzle primitives.
Table 2.6 shows the available 64 bit floating-point swizzle primitives.
Finally, Table 2.7 shows the available 64 bit integer swizzle primitives.

S2 S1 S0   Function:                    Usage
000        no conversion                [rax] {16to16} or [rax]
001        broadcast 1 element (x16)    [rax] {1to16}
010        broadcast 4 elements (x4)    [rax] {4to16}
011        reserved                     N/A
100        uint8 to uint32              [rax] {uint8}
101        sint8 to sint32              [rax] {sint8}
110        uint16 to uint32             [rax] {uint16}
111        sint16 to sint32             [rax] {sint16}

Table 2.5: 32 bit Integer Load-op SwizzUpConvi32 (Doubleword) swizzle/conversion primitives. We use
'ponm lkji hgfe dcba' to denote a full Intel® Xeon Phi™ Coprocessor Instruction Set Architecture source register,
with each letter referring to a 32 bit element, where 'a' is the least significant element and 'p' is the most significant element. So, for example, 'dcba dcba dcba dcba' shows that the source elements are copied to the destination by replicating the lower 128 bits of the source (the four least significant elements) to each 128-bit block of
the destination.

S2 S1 S0   Function:                    Usage
000        no conversion                [rax] {8to8} or [rax]
001        broadcast 1 element (x8)     [rax] {1to8}
010        broadcast 4 elements (x2)    [rax] {4to8}
011        reserved                     N/A
100        reserved                     N/A
101        reserved                     N/A
110        reserved                     N/A
111        reserved                     N/A

Table 2.6: 64 bit Floating-point Load-op SwizzUpConvf64 swizzle/conversion primitives. We use 'hgfe dcba'
to denote a full Intel® Xeon Phi™ Coprocessor Instruction Set Architecture source register, with each letter referring
to a 64 bit element, where 'a' is the least significant element and 'h' is the most significant element. So, for
example, 'dcba dcba' shows that the source elements are copied to the destination by replicating the lower 256
bits of the source (the four least significant elements) to each 256-bit block of the destination.

S2 S1 S0   Function:                    Usage
000        no conversion                [rax] {8to8} or [rax]
001        broadcast 1 element (x8)     [rax] {1to8}
010        broadcast 4 elements (x2)    [rax] {4to8}
011        reserved                     N/A
100        reserved                     N/A
101        reserved                     N/A
110        reserved                     N/A
111        reserved                     N/A

Table 2.7: 64 bit Integer Load-op SwizzUpConvi64 (Quadword) swizzle/conversion primitives. We use 'hgfe
dcba' to denote a full Intel® Xeon Phi™ Coprocessor Instruction Set Architecture source register, with each letter
referring to a 64 bit element, where 'a' is the least significant element and 'h' is the most significant element. So,
for example, 'dcba dcba' shows that the source elements are copied to the destination by replicating the lower
256 bits of the source (the four least significant elements) to each 256-bit block of the destination.

2.2.2

Load Up-convert

Vector load/broadcast instructions can perform a wide array of data conversions on the data being read from
memory, and can additionally broadcast (replicate) that data across the elements of the destination vector register depending on the instructions. The type of broadcast depends on the opcode/mnemonic being used. We
will refer to this conversion process as up-conversion, and we will use UpConv to describe the load conversion
function itself.
Based on that, load instructions could be divided into the following categories:
• regular loads: load 16 elements (32 bits) or 8 elements (64 bits), convert them and write into the destination vector
• broadcast 4-elements: load 4 elements, convert them (possible only for 32 bit data types), replicate them
four times (32 bits) or two times (64 bits) and write into the destination vector
• broadcast 1-element: load 1 element, convert it (possible only for 32 bit data types), replicate it 16 times
(32 bits) or 8 times (64 bits) and write into the destination vector
Therefore, unlike load-op swizzle/conversion, Load UpConv can perform both data conversion and broadcast
simultaneously. We will refer to this process as up-conversion, and we will use Load UpConv to describe the load
conversion function itself.
When a broadcast 1-element is selected, the memory data, after data conversion, has a size of 4 bytes, and is
broadcast 16 times across all 16 elements of the destination vector register. In other words, one vector element
is fetched from memory, converted to a 32 bit float or integer, and replicated to all 16 elements of the destination
register. Using the notation where the contents of the source register are denoted {ponm lkji hgfe dcba}, with
each letter referring to a 32 bit element ('a' being the least significant element and 'p' being the most significant
element), the source elements map to the destination register as follows:
{aaaa aaaa aaaa aaaa}
When broadcast 4-element is selected, the memory data, after data conversion, has a size of 16 bytes, and is
broadcast 4 times across the four 128-bit sets of the destination vector register. In other words, four vector
elements are fetched from memory, converted to four 32 bit floats or integers, and replicated to all four 4-element
sets in the destination register. For this broadcast, the source elements map to the destination register as follows:
{dcba dcba dcba dcba}
Table 2.8 shows the different 32 bit Load up-conversion instructions as a function of the broadcast function and
the conversion datatype. Similarly, Table 2.10 shows the different 64 bit Load up-conversion instructions as a
function of the broadcast function and datatype.
Datatype    Load (16-element)   Broadcast 4-element   Broadcast 1-element
INT32 (d)   VMOVDQA32           VBROADCASTI32X4       VPBROADCASTD
FP32 (ps)   VMOVAPS             VBROADCASTF32X4       VBROADCASTSS

Table 2.8: 32 bit Load UpConv load/broadcast instructions per datatype. Elements may be 1, 2, or 4 bytes in
memory prior to data conversion, after which they are always 4 bytes. We use 'ponm lkji hgfe dcba' to denote a
full Intel® Xeon Phi™ Coprocessor Instruction Set Architecture source register, with each letter referring to a 32 bit
element, where 'a' is the least significant element and 'p' is the most significant element. So, for example, 'dcba
dcba dcba dcba' shows that the source elements are copied to the destination by replicating the lower 128 bits
of the source (the four least significant elements) to each 128-bit block of the destination.
As with SwizzUpConv, UpConv may have source memory operands with sizes smaller than 64-bytes, which are
expanded to a full 64-byte vector by means of broadcast and/or data conversion. Each source memory operand
must have an address that is aligned to the number of bytes of memory actually accessed by the operand (that
is, before conversion or broadcast is performed); otherwise, a #GP fault will result. Thus, any of 1-byte, 2-byte,
4-byte, 8-byte, 16-byte, 32-byte, or 64-byte alignment may be required.
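Because Load UpConv can combine conversion and broadcast, the number of bytes read (and therefore the required alignment) is the number of elements loaded multiplied by the in-memory element size. The sketch below models this for the integer (UpConvi32) conversions only. It is an illustration under stated assumptions: memory is a Python bytes object, values are little-endian, the names UPCONV_I32 and load_upconv_i32 are hypothetical, and ValueError stands in for the #GP fault.

```python
# Conversion name -> (bytes per element in memory, signed?)
UPCONV_I32 = {
    "none": (4, True),
    "uint8": (1, False),
    "sint8": (1, True),
    "uint16": (2, False),
    "sint16": (2, True),
}


def load_upconv_i32(memory, address, conv, loaded, total=16):
    """Load `loaded` elements (16, 4 or 1), up-convert them, and broadcast to `total`."""
    size, signed = UPCONV_I32[conv]
    nbytes = loaded * size                  # bytes actually accessed in memory
    if address % nbytes != 0:
        raise ValueError(f"#GP: address must be {nbytes}-byte aligned")
    raw = memory[address:address + nbytes]
    elements = [
        int.from_bytes(raw[i * size:(i + 1) * size], "little", signed=signed)
        for i in range(loaded)
    ]
    return elements * (total // loaded)     # replicate for broadcasts


memory = bytes([1, 2, 3, 0xFF] + [0] * 60)
four = load_upconv_i32(memory, 0, "uint8", 4)   # broadcast 4-element, reads 4 bytes
one = load_upconv_i32(memory, 3, "sint8", 1)    # broadcast 1-element, reads 1 byte
```

A 1-element broadcast of an 8 bit value therefore needs only 1-byte alignment, while a full 16-element load without conversion needs 64-byte alignment.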

UpConvi32 (INT32)
S2 S1 S0   Function:             Usage
000        no conversion         [rax]
001        reserved              N/A
010        reserved              N/A
011        reserved              N/A
100        uint8 to uint32       [rax] {uint8}
101        sint8 to sint32       [rax] {sint8}
110        uint16 to uint32      [rax] {uint16}
111        sint16 to sint32      [rax] {sint16}

UpConvf32 (FP32)
S2 S1 S0   Function:             Usage
000        no conversion         [rax]
001        reserved              N/A
010        reserved              N/A
011        float16 to float32    [rax] {float16}
100        uint8 to float32      [rax] {uint8}
101        sint8 to float32      [rax] {sint8}
110        uint16 to float32     [rax] {uint16}
111        sint16 to float32     [rax] {sint16}

Table 2.9: 32 bit Load UpConv conversion primitives.
Datatype    Load         Broadcast 4-element   Broadcast 1-element
INT64 (q)   VMOVDQA64    VBROADCASTI64X4       VPBROADCASTQ
FP64 (pd)   VMOVAPD      VBROADCASTF64X4       VBROADCASTSD

Table 2.10: 64 bit Load UpConv load/broadcast instructions per datatype. Elements are always 8 bytes. We
use 'hgfe dcba' to denote a full Intel® Xeon Phi™ Coprocessor Instruction Set Architecture source register, with
each letter referring to a 64 bit element, where 'a' is the least significant element and 'h' is the most significant
element. So, for example, 'dcba dcba' shows that the source elements are copied to the destination by replicating
the lower 256 bits of the source (the four least significant elements) to each 256-bit block of the destination.
Table 2.9 shows the available data conversion primitives for 32 bit Load UpConv and for the different datatypes
supported.
Table 2.11 shows the 64 bit counterpart of Load UpConv. As shown, no 64 bit conversions are available but the
pure "no-conversion" option.

UpConvi64 (INT64)
S2 S1 S0   Function:        Usage
000        no conversion    [rax] {8to8} or [rax]
001        reserved         N/A
010        reserved         N/A
011        reserved         N/A
100        reserved         N/A
101        reserved         N/A
110        reserved         N/A
111        reserved         N/A

UpConvf64 (FP64)
S2 S1 S0   Function:        Usage
000        no conversion    [rax] {8to8} or [rax]
001        reserved         N/A
010        reserved         N/A
011        reserved         N/A
100        reserved         N/A
101        reserved         N/A
110        reserved         N/A
111        reserved         N/A

Table 2.11: 64 bit Load UpConv conversion primitives.

2.2.3

Down-Conversion

Vector store instructions can perform a wide variety of data conversions to the data on the way to memory.
We will refer to this process as down-conversion, and we will use DownConv to describe the store conversion
function itself.
DownConv may have destination memory operands with sizes smaller than 64 bytes, as a result of data conversion. Each destination memory operand must have an address that is aligned to the number of bytes of memory
actually accessed by the operand (that is, after data conversion is performed); otherwise, a #GP fault will result.
Thus, any of 1-byte, 2-byte, 4-byte, 8-byte, 16-byte, 32-byte, or 64-byte alignment may be required.
Table 2.12 shows the available data conversion primitives for 32 bit DownConv and for the different supported
datatypes.

DownConvi32 (INT32)
S2 S1 S0   Function:             Usage
000        no conversion         zmm1
001        reserved              N/A
010        reserved              N/A
011        reserved              N/A
100        uint32 to uint8       zmm1 {uint8}
101        sint32 to sint8       zmm1 {sint8}
110        uint32 to uint16      zmm1 {uint16}
111        sint32 to sint16      zmm1 {sint16}

DownConvf32 (FP32)
S2 S1 S0   Function:             Usage
000        no conversion         zmm1
001        reserved              N/A
010        reserved              N/A
011        float32 to float16    zmm1 {float16}
100        float32 to uint8      zmm1 {uint8}
101        float32 to sint8      zmm1 {sint8}
110        float32 to uint16     zmm1 {uint16}
111        float32 to sint16     zmm1 {sint16}

Table 2.12: 32 bit DownConv conversion primitives. Unless otherwise noted, all conversions from floating-point use MXCSR.RC.

Table 2.13 shows the 64 bit counterpart of DownConv. As shown, no 64 bit conversions are available but the
pure "no-conversion" option.

DownConvi64 (INT64)
S2 S1 S0   Function:        Usage
000        no conversion    zmm1
001        reserved         N/A
010        reserved         N/A
011        reserved         N/A
100        reserved         N/A
101        reserved         N/A
110        reserved         N/A
111        reserved         N/A

DownConvf64 (FP64)
S2 S1 S0   Function:        Usage
000        no conversion    zmm1
001        reserved         N/A
010        reserved         N/A
011        reserved         N/A
100        reserved         N/A
101        reserved         N/A
110        reserved         N/A
111        reserved         N/A

Table 2.13: 64 bit DownConv conversion primitives.
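As an illustration of down-conversion on the way to memory, the sketch below models a write-masked store of 16 float32 elements with the {float16} conversion. It is a simplified model under stated assumptions: memory is a Python bytearray, the function name store_float16 is hypothetical, ValueError stands in for the #GP fault, and Python's struct half-precision packing (round to nearest even) stands in for the rounding mode that the hardware would take from MXCSR.RC.

```python
import struct


def store_float16(memory, address, vec, mask):
    """Store the enabled elements of vec to memory as little-endian float16 values."""
    nbytes = 2 * len(vec)                 # bytes accessed after down-conversion
    if address % nbytes != 0:
        raise ValueError(f"#GP: address must be {nbytes}-byte aligned")
    for i, value in enumerate(vec):
        if (mask >> i) & 1:               # masked-off elements leave memory untouched
            struct.pack_into("<e", memory, address + 2 * i, value)


memory = bytearray(64)
store_float16(memory, 32, [1.5] * 16, 0x8001)   # elements 0 and 15 enabled
```

Sixteen float16 elements occupy 32 bytes, so this store requires 32-byte alignment rather than the 64-byte alignment of an unconverted store.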

2.3

Static Rounding Mode

As described before, the Intel® Xeon Phi™ coprocessor introduces a new instruction attribute on top of the normal register swizzles called Static (per instruction) Rounding Mode or Rounding Mode override. This attribute
allows statically applying a specific arithmetic rounding mode, ignoring the value of the RC bits in MXCSR.
Static Rounding Mode can be enabled in the encoding of the instruction by setting the EH bit to 1 in a register-register vector instruction. Table 2.14 shows the available rounding modes and their encoding. On top of the
rounding mode, the Intel® Xeon Phi™ coprocessor also allows setting the SAE ("suppress-all-exceptions") attribute, which disables reporting of any floating-point exception flag in MXCSR. This option is available even if the
instruction does not perform any kind of rounding.
Note that some instructions already allow specifying the rounding mode statically via immediate bits. In such
cases, the immediate bits take precedence over the swizzle-specified rounding mode (in the same way that they
take precedence over the MXCSR.RC setting).

S2 S1 S0   Rounding Mode Override             Usage
000        Round To Nearest (even)            , {rn}
001        Round Down (-INF)                  , {rd}
010        Round Up (+INF)                    , {ru}
011        Round Toward Zero                  , {rz}
100        Round To Nearest (even) with SAE   , {rn-sae}
101        Round Down (-INF) with SAE         , {rd-sae}
110        Round Up (+INF) with SAE           , {ru-sae}
111        Round Toward Zero with SAE         , {rz-sae}
1xx        SAE                                , {sae}

Table 2.14: Static Rounding-Mode Swizzle available modes plus SAE.
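
The encodings in Table 2.14 can be modelled with a short lookup. The following Python sketch is illustrative only: the function name decode_static_rounding and its performs_rounding parameter are hypothetical, and it assumes that the two low bits select the rounding mode while S2 selects SAE, as the table rows suggest. The empty string returned for a non-rounding instruction without SAE is likewise an assumption, since the table does not cover that case.

```python
# Rounding modes of Table 2.14, indexed by the two low bits (S1 S0).
ROUNDING_MODES = ("rn", "rd", "ru", "rz")


def decode_static_rounding(sss: int, performs_rounding: bool = True) -> str:
    """Return the assembly modifier for a 3-bit SSS value (EH assumed to be 1)."""
    if not 0 <= sss <= 0b111:
        raise ValueError("SSS is a 3-bit field")
    sae = bool(sss & 0b100)  # S2 selects suppress-all-exceptions
    if not performs_rounding:
        # The '1xx' row: only the SAE attribute is meaningful.
        return "{sae}" if sae else ""
    mode = ROUNDING_MODES[sss & 0b011]
    return "{" + mode + ("-sae" if sae else "") + "}"
```

For example, decode_static_rounding(0b110) yields the {ru-sae} modifier, matching the 110 row of the table.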

2.4

Intel® Xeon Phi™ coprocessor Execution Environments

The Intel® Xeon Phi™ coprocessor's support for 32 bit and 64 bit execution environments is similar to that
described in the Intel64® Intel® Architecture Software Developer's Manual. The 64 bit execution environment of the
Intel® Xeon Phi™ coprocessor is shown in Figure 2.1. The layout of the 512-bit vector registers and vector mask registers is shown in Figure 2.2. This section describes new features associated with the 512-bit vector registers
and the 16 bit vector mask registers.
Intel® Xeon Phi™ Coprocessor Instruction Set Architecture defines two new sets of registers that hold the new
vector state. The Intel® Xeon Phi™ Coprocessor Instruction Set Architecture extension uses the vector registers,
the vector mask registers and/or the x86 64 general purpose registers.
Intel® Xeon Phi™ Coprocessor Instruction Set Architecture Vector Registers. The 32 registers each store
16 doubleword/single precision floating-point entries (or 8 quadword/double precision floating-point
entries), and serve as source and destination operands for vector packed floating-point and integer operations. Additionally, they may also contain memory pointer offsets used to gather and scatter data from/to
memory. These registers are referenced as zmm0 through zmm31.
Vector Mask Registers. These registers specify which vector elements are operated on and written for Intel®
Xeon Phi™ Coprocessor Instruction Set Architecture vector instructions. If the Nth bit of a vector mask register is set, then the Nth element of the destination vector is overwritten with the result of the operation;
otherwise, the element remains unchanged. A vector mask register can be set using vector compare instructions, instructions to move contents from a GP register, or a special subset of vector mask arithmetic
instructions.
The Intel® Xeon Phi™ Coprocessor Instruction Set Architecture vector instructions are able to report exceptions via MXCSR flags but never cause traps, as all SIMD floating-point exceptions are always masked
(unlike Intel® SSE/Intel® AVX instructions in other processors, which may trap if floating-point exceptions
are unmasked, depending on the value of the OM/UM/IM/PM/DM/ZM bits). The reason is that the Intel®
Xeon Phi™ coprocessor forces the new DUE bit (Disable Unmasked Exceptions) in the MXCSR (bit 21) to be
set to 1.
On the Intel® Xeon Phi™ coprocessor, both single precision and double precision floating-point instructions
use MXCSR.DAZ and MXCSR.FZ to decide whether to treat input denormals as zeros or to flush tiny results
to zero (the latter are in most cases, but not always, denormal results which are flushed to zero when
MXCSR.FZ is set to 1; see the IEEE Standard 754-2008, section 7.5, for a definition of tiny floating-point
results).
Table 2.15 shows the bit layout of the MXCSR control register.

Figure 2.1: 64 bit Execution Environment

Figure 2.2: Vector and Vector Mask Registers

MXCSR bit 20 is reserved; however, it is not reported as Reserved by MXCSR_MASK. Setting this bit will
result in undefined behavior.
General-purpose registers. The sixteen general-purpose registers are available in the Intel® Xeon Phi™ coprocessor's 64 bit mode execution environment. These registers are identical to those available in the 64 bit
execution environment described in the Intel64® Intel® Architecture Software Developer's Manual.
EFLAGS register. R/EFLAGS are updated by instructions according to the Intel64® Intel® Architecture Software Developer's Manual. They are also updated by the Intel® Xeon Phi™ coprocessor's KORTEST
instruction.
FCW and FSW registers. Used by x87 instruction set extensions to set rounding modes, exception masks and
flags in the case of the FCW, and to keep track of exceptions in the case of the FSW.
x87 stack. An eight-element stack used to perform floating-point operations on 32/64/80-bit floating-point
data using the x87 instruction set.
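
The write-mask rule given for the vector mask registers (an element of the destination is overwritten only when the corresponding mask bit is set) can be sketched as follows. This is a minimal Python model, not a description of any particular instruction: masked_add is a hypothetical helper, vectors are plain lists of 16 doubleword values, and the wrap-around to 32 bits is an assumption made for the illustration.

```python
def masked_add(dest, src1, src2, k):
    """Add src1 and src2 element-wise under write-mask k.

    Element n of the result is the 32-bit sum when bit n of k is set;
    otherwise the old destination element is kept unchanged.
    """
    return [
        (a + b) & 0xFFFFFFFF if (k >> n) & 1 else d
        for n, (d, a, b) in enumerate(zip(dest, src1, src2))
    ]
```

With k = 0b0101, only elements 0 and 2 of the destination receive new values; the remaining fourteen elements keep their previous contents.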

Bit fields   Field      Description
31-22        Reserved   Reserved bits
21           DUE        Disable Unmasked Exceptions (always set to 1)
20-16        Reserved   Reserved bits
15           FZ         Flush To Zero
14-13        RC         Rounding Control
12-7         Reserved   Reserved bits (IM/DM/ZM/OM/UM/PM in other proliferations)
6            DAZ        Denormals Are Zeros
5            PE         Precision Flag
4            UE         Underflow Flag
3            OE         Overflow Flag
2            ZE         Divide-by-Zero Flag
1            DE         Denormal Operation Flag
0            IE         Invalid Operation Flag

Table 2.15: MXCSR bit layout. Note: MXCSR bit 20 is reserved; however, it is not reported as Reserved by
MXCSR_MASK. Setting this bit will result in undefined behavior.
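
As an illustration of the layout in Table 2.15, the following Python sketch splits a 32-bit MXCSR value into its named fields. The names decode_mxcsr, MXCSR_BITS and RC_NAMES are hypothetical; the mapping of the two RC bits to rounding modes follows the conventional MXCSR.RC encoding (00 nearest, 01 down, 10 up, 11 toward zero), which this section does not restate.

```python
# Single-bit fields of MXCSR and their bit positions (Table 2.15).
MXCSR_BITS = {
    "IE": 0, "DE": 1, "ZE": 2, "OE": 3, "UE": 4, "PE": 5,
    "DAZ": 6, "FZ": 15, "DUE": 21,
}

# Conventional MXCSR.RC encoding (assumed, see lead-in).
RC_NAMES = ("nearest-even", "down", "up", "toward-zero")


def decode_mxcsr(value: int) -> dict:
    """Return a dict with every named MXCSR field extracted from value."""
    fields = {name: (value >> bit) & 1 for name, bit in MXCSR_BITS.items()}
    fields["RC"] = RC_NAMES[(value >> 13) & 0b11]  # bits 14-13
    return fields
```

On this coprocessor the DUE entry of the result is expected to read 1, since the hardware forces that bit to be set.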

CHAPTER 3. INTEL® XEON PHI™ COPROCESSOR INSTRUCTION SET ARCHITECTURE FORMAT

Chapter 3
Intel® Xeon Phi™ Coprocessor Instruction
Set Architecture Format
This chapter describes the instruction encoding format and assembly instruction syntax of new instructions
supported by the Intel® Xeon Phi™ coprocessor.

3.1

Overview

The Intel® Xeon Phi™ coprocessor introduces 512-bit vector instructions operating on 512-bit vector registers
(zmm0-zmm31), and offers vector mask registers (k0-k7) to support a rich set of conditional operations on data
elements within the zmm registers. Vector instructions operating on zmm registers are encoded using a multi-byte prefix encoding scheme, with 62H being the first byte of the multi-byte prefix. This multi-byte prefix is referred
to as MVEX in this document.
Instructions operating on the vector mask registers are encoded using another multi-byte prefix, with C4H or
C5H being the first byte of the multi-byte prefix. This multi-byte prefix is similar to the VEX prefix that is defined in
the "Intel® Architecture Instruction Set Architecture Programming Reference". We will refer to the C4H/C5H
based VEX-like prefix as "VEX" in this document. Additionally, the Intel® Xeon Phi™ coprocessor provides a
handful of new instructions that operate on general-purpose registers but are encoded using VEX. In some cases,
new scalar instructions supported by the Intel® Xeon Phi™ coprocessor can be encoded with either MVEX or
VEX.

Figure 3.1: New Instruction Encoding Format with MVEX Prefix

Figure 3.2: New Instruction Encoding Format with VEX Prefix

3.2

Instruction Formats

Instructions encoded by MVEX have the format shown in Figure 3.1.
Instructions encoded by VEX have the format shown in Figure 3.2.

3.2.1

MVEX/VEX and the LOCK prex

Any MVEX-encoded or VEX-encoded instruction with a LOCK pre ix preceding the multi-byte pre ix will generate
an invalid opcode exception (#UD).

3.2.2

MVEX/VEX and the 66H, F2H, and F3H prexes

Any MVEX-encoded or VEX-encoded instruction with a 66H, F2H, or F3H pre ix preceding the multi-byte pre ix
will generate an invalid opcode exception (#UD).

3.2.3

MVEX/VEX and the REX prex

Any MVEX-encoded or VEX-encoded instruction with a REX pre ix preceding the multi-byte pre ix will generate
an invalid opcode exception (#UD).

3.3

The MVEX Prex

The MVEX pre ix consists of four bytes that must lead with byte 62H. An MVEX-encoded instruction supports
up to three operands in its syntax and is operating on vectors in vector registers or memory using a vector mask
register to control the conditional processing of individual data elements in a vector. Swizzling, conversion and
other operations on data elements within a vector can be encoded with bit ields in the MVEX pre ix, as shown
in Figure 3.3. The functionality of these bit ields is summarized below:
• 64 bit mode register speci ier encoding (R, X, B, R', W, V') for memory and vector register operands (encoded in 1's complement form).
– A vector register as source or destination operand is encoded by combining the R'R bits with the reg
ield, or the XB bits with the r/m ield of the modR/M byte.
– The base of a memory operand is a general purpose register encoded by combining the B bit with the
r/m ield. The index of a memory operand is a general purpose register encoded by combining the X
bit with the SIB.index ield.
– The vector index operand in the gather/scatter instruction family is a vector register, encoded by
combining the VX bits with the SIB.index ield. MVEX.vvvv is not used in the gather/scatter instruction family.
• Non-destructive source register speci ier (applicable to the three operand syntax): This is the irst source
operand in the three-operand instruction syntax. It is represented by the notation, MVEX.vvvv. It can
encode any of the lower 16 zmm vector registers, or using the low 3 bits to encode a vector mask register
as a source operand. It can be combined with V to encode any of the 32 zmm vector registers
• Vector mask register and masking control: The MVEX.aaa ield encodes a vector mask register that is
used in controlling the conditional processing operation on the data elements of a 512-bit vector instruction. The MVEX.aaa ield does not encode a source or a destination operand. When the encoded value of
MVEX.aaa is 000b, this corresponds to "no vector mask register will act as conditional mask for the vector
instruction".
• Non-temporal/eviction hint. The MVEX.E ield can encode a hint to the processor on a memory referencing
instruction that the data is non-temporal and can be prioritized for eviction. When an instruction encoding
does not reference any memory operand, this bit may also be used to control the function of the MVEX.SSS
ield.
• Compaction of legacy pre ixes (66H, F2H, F3H): This is encoded in the MVEX.pp ield.
• Compaction of two-byte and three-byte opcode: This is encoded in the MVEX.mmmm ield.
• Register swizzle/memory conversion operations (broadcast/up-convert/down-convert)/static-rounding
override: This is encoded in the MVEX.SSS ield.
– Swizzle operation is supported only for register-register syntax of 512-bit vector instruction, and requires MVEX.E = 0, the encoding of MVEX.SSS determines the exact swizzle operation - see Section 2.2
– Static rounding override only applies to register-register syntax of vector loating-point instructions,
and requires MVEX.E = 1.
The MVEX pre ix is required to be the last pre ix and immediately precedes the opcode bytes.
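
The register specifier rule above (prefix bits stored in 1's complement form and combined with a 3-bit ModR/M field) can be illustrated with a small Python sketch. It is a sketch under stated assumptions: decode_vector_reg is a hypothetical helper, it takes the R' and R bits as already extracted from the prefix (their positions are given by Figure 3.3), and it assumes R' is the more significant of the two bits.

```python
def decode_vector_reg(r_prime_stored: int, r_stored: int, modrm_reg: int) -> int:
    """Combine the stored (inverted) R' and R bits with ModR/M.reg.

    Returns the zmm register number in the range 0..31.
    """
    r_prime = r_prime_stored ^ 1  # undo the 1's complement storage
    r = r_stored ^ 1
    return (r_prime << 4) | (r << 3) | (modrm_reg & 0b111)
```

Under these assumptions, stored bits R' = 1 and R = 1 with reg = 000b select zmm0, while stored bits R' = 0 and R = 0 with reg = 111b select zmm31.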

Figure 3.3: MVEX bit fields

3.3.1

Vector SIB (VSIB) Memory Addressing

In the gather/scatter instruction family, an SIB byte that follows the ModR/M byte can support VSIB memory
addressing to an array of linear addresses. VSIB memory addressing is supported only with the MVEX prefix.
In VSIB memory addressing, the SIB byte consists of:
• The scale field (bits 7:6), which specifies the scale factor.
• The index field (bits 5:3), which is prepended with the 2-bit logical value of the MVEX.V'X bits to specify
the vector register number of the vector index operand; each element in the vector register specifies an
index.
• The base field (bits 2:0), which is prepended with the logical value of the MVEX.B field to specify the register number
of the base register.
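
The SIB field split described above, together with the resulting array of addresses, can be sketched in Python as follows. The helper names decode_vsib and vsib_addresses are hypothetical. The sketch takes the logical (already un-inverted) values of the MVEX.V'X and MVEX.B bits as inputs, uses the conventional SIB scale factor of 2 raised to the scale field, and leaves out any displacement.

```python
def decode_vsib(sib: int, vx_logical: int, b_logical: int):
    """Split a SIB byte into (scale factor, index zmm number, base GPR number)."""
    scale = 1 << ((sib >> 6) & 0b11)                   # bits 7:6 -> 1, 2, 4 or 8
    index_reg = (vx_logical << 3) | ((sib >> 3) & 0b111)  # V'X : bits 5:3
    base_reg = (b_logical << 3) | (sib & 0b111)           # B   : bits 2:0
    return scale, index_reg, base_reg


def vsib_addresses(base: int, vindex, scale: int):
    """One address per element of the vector index: base + index * scale."""
    return [base + idx * scale for idx in vindex]
```

For instance, a SIB byte of 8BH with V'X = 10b and B = 1 decodes to a scale factor of 4, vector index register zmm17 and base register number 11.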

3.4

The VEX Prex

The VEX pre ix is encoded in either the two-byte form (the irst byte must be C5H) or in the three-byte form (the
irst byte must be C4H). Beyond the irst byte, the VEX pre ix consists of a number of bit ields providing speci ic
capability; they are shown in Figure 3.4.
The functionality of the bit ields is summarized below:
• 64 bit mode register speci ier encoding (R, X, B, W): The R/X/B bit ield is combined with the lower three
bits or register operand encoding in the modR/M byte to access the upper half of the 16 registers available
in 64 bit mode. The VEX.R, VEX.X, VEX.B ields replace the functionality of REX.R, REX.X, REX.B bit ields.
The W bit either replaces the functionality of REX.W or serves as an opcode extension bit. The usage of the
VEX.WRXB bits is explained in detail in section 2.2.1.2 of the Intel® 64 and IA-32 Architectures Software
developer's manual, Volume 2A. This bit is stored in 1's complement form (bit inverted format).
• Non-destructive source register speci ier (applicable to three operand syntax): this is the irst source
operand in the instruction syntax. It is represented by the notation, VEX.vvvv. It can encode any generalpurpose register, or using only 3 bits it can encode vector mask registers. This ield is encoded using 1's
complement form (bit inverted form), i.e. RAX/K0 is encoded as 1111B, and R15 is encoded as 0000B.
• Compaction of legacy pre ixes (66H, F2H, F3H): This is encoded in the VEX.pp ield.
• Compaction of two-byte and three-byte opcode: This is encoded in the VEX.mmmmm ield.
The VEX pre ix is required to be the last pre ix and immediately precedes the opcode bytes. It must follow any
other pre ixes. If the VEX pre ix is present a REX pre ix is not supported.
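
The bit-inverted storage of VEX.vvvv can be checked with a two-line Python sketch. The names encode_vvvv and decode_vvvv are hypothetical helpers introduced only for this illustration.

```python
def encode_vvvv(reg: int) -> int:
    """Register number (0..15) to the 4-bit field stored in 1's complement form."""
    return ~reg & 0xF


def decode_vvvv(field: int) -> int:
    """Stored 4-bit field back to the register number."""
    return ~field & 0xF
```

Register 0 (RAX or K0) maps to 1111B and register 15 (R15) maps to 0000B, as stated in the text.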

Figure 3.4: VEX bit fields

3.5

Intel® Xeon Phi™ Coprocessor Instruction Set Architecture Assembly
Syntax

Intel® Xeon Phi™ Coprocessor Instruction Set Architecture supports up to three operands. The rich encoding fields
for swizzle/broadcast/convert/rounding, masking control, and non-temporal hint are expressed as modifier
expressions to the respective operands in the assembly syntax. A few common forms for the Intel® Xeon Phi™
coprocessor assembly instruction syntax are expressed in the general form:
mnemonic vreg{masking modifier}, source1, transform_modifier(vreg/mem)
mnemonic vreg{masking modifier}, source1, transform_modifier(vreg/mem), imm
mnemonic mem{masking modifier}, transform_modifier(vreg)
The specific forms to express assembly syntax operands, modifiers, and transformations are listed in Table 3.1.

3.6

Notation

The notation used to describe the operation of each instruction is given as a sequence of control and assignment
statements in C-like syntax. This document only contains the notation specifically needed for vector instructions.
Standard Intel® 64 notation may be found in the IA-32 Intel® Architecture Software Developer's Manual: Volume 2.
When instructions are represented symbolically, the following notations are used:
label: mnemonic argument1 {write-mask}, argument2, argument3, argument4, ...
where:
• A mnemonic is a reserved name for a class of instruction opcodes which have the same function.
• The operands argument1, argument2, argument3, argument4, and so on are optional. There may be from
one to three register operands, depending on the opcode. The leftmost operand is always the destination; for certain instructions, such as vfmadd231ps, it may be a source as well. When the second leftmost
operand is a vector mask register, it may in certain cases be a destination as well, as for example with the
vpsubrsetbd instruction. All other register operands are sources. There may also be additional arguments
in the form of immediate operands; for example, the vcvtfxpntdq2ps instruction has a 3-bit immediate
field that specifies the exponent adjustment to be performed, if any. The write-mask operand specifies the
vector mask register used to control the selective updating of elements in the destination register or
registers.

3.6.1

Operand Notation

In this manual we will consider vector registers from several perspectives. One perspective is as an array of
64 bytes. Another is as an array of 16 doubleword elements. Another is as an array of 8 quadword elements. Yet
another is as an array of 512 bits. In the mnemonic operation description pseudo-code, registers will be addressed using bit ranges, such as:

i = n*32
zmm1[i+31:i]
This example refers to the 32 bits of the n-th doubleword element of vector register zmm1.
We will use a similar bit-oriented notation to describe access to vector mask registers. In the case of vector mask
registers, we will usually specify a single bit, rather than a range of bits, because vector mask registers are used
for predication, carry, borrow, and comparison results, and a single bit per element is enough for any of those
purposes.
Using this notation, it is for example possible to test the value of the 12th bit in k1 as follows:
if ( k1[11] == 1 ) { ... code here ... }
Tables 3.1 and 3.2 summarize the notation used for instruction operands and their values.
In Intel® Xeon Phi™ Coprocessor Instruction Set Architecture, the contents of vector registers are variously interpreted as floating-point values (either 32 or 64 bits), integer values, or simply doubleword values of no particular
data type, depending on the instruction semantics.
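
The bit-range notation used in this section can be mimicked in Python by treating a register as a single large integer. This is only an illustrative model; the helper names bits, dword_element and mask_bit are hypothetical.

```python
def bits(value: int, msb: int, lsb: int) -> int:
    """Model of value[msb:lsb]: the inclusive bit range as an unsigned integer."""
    width = msb - lsb + 1
    return (value >> lsb) & ((1 << width) - 1)


def dword_element(zmm: int, n: int) -> int:
    """The n-th doubleword element, i.e. zmm[i+31:i] with i = n*32."""
    i = n * 32
    return bits(zmm, i + 31, i)


def mask_bit(k: int, i: int) -> int:
    """Model of k[i]: a single bit of a vector mask register."""
    return (k >> i) & 1
```

Testing the 12th bit of k1 then reads mask_bit(k1, 11) == 1, mirroring the pseudo-code shown above.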

3.6.2

The Displacement Bytes

The Intel® Xeon Phi™ coprocessor introduces a new displacement representation that allows for a more
compact encoding in unrolled code: a compressed 8-bit displacement, or disp8*N. This compressed displacement is based on the assumption that the effective displacement is a multiple of the granularity of the memory
access, and hence the redundant low-order bits of the address offset do not need to be encoded.
Intel® Xeon Phi™ Coprocessor Instruction Set Architecture instructions using the MVEX prefix (i.e. using encoding 62H) have the
following displacement options:

• No displacement
• 32 bit displacement: this displacement works exactly the same as the legacy 32 bit displacement and
works at byte granularity
• Compressed 8 bit displacement (disp8*N): this displacement format substitutes the legacy 8-bit displacement in Intel® Xeon Phi™ Coprocessor Instruction Set Architecture using map 62H. This displacement assumes
the same granularity as the memory operand size (which is dependent on the instruction and the memory
conversion function being used). Redundant low-order bits are ignored and hence 8-bit displacements
are reinterpreted so that they are multiplied by the memory operand's total size in order to generate the
final displacement to be used in calculating the effective address.

Note that the displacements in the MVEX vector instruction prefix are encoded in exactly the same way as regular
displacements (so there are no changes in the ModRM/SIB encoding rules), with the only exception that disp8 is
overloaded to disp8*N. In other words, there are no changes in the encoding rules or encoding lengths, but only
in the interpretation of the displacement value by hardware (which needs to scale the displacement by the size
of the memory operand to obtain a byte-wise address offset).
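
The scaling described above can be illustrated with a short Python sketch. The helper names effective_displacement and fits_disp8 are hypothetical, and the sketch assumes the 8-bit value is sign-extended before scaling, as for the legacy 8-bit displacement.

```python
def effective_displacement(disp8_byte: int, n: int) -> int:
    """Byte offset produced by an encoded disp8 byte with scaling factor N."""
    signed = disp8_byte - 256 if disp8_byte >= 128 else disp8_byte
    return signed * n


def fits_disp8(displacement: int, n: int) -> bool:
    """True if a byte offset can be encoded as disp8*N for the given N."""
    return displacement % n == 0 and -128 <= displacement // n <= 127
```

With N = 64 (a full vector access without conversion), an encoded byte of 01H addresses the next 64-byte block, and offsets up to 127*64 bytes remain reachable with a single displacement byte; an offset that is not a multiple of N would need the 32 bit displacement instead.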

Notation            Meaning
zmm1                A vector register operand in the argument1 field of the instruction. The 64 byte
                    vector registers are: zmm0 through zmm31
zmm2                A vector register operand in the argument2 field of the instruction. The 64 byte
                    vector registers are: zmm0 through zmm31
zmm3                A vector register operand in the argument3 field of the instruction. The 64 byte
                    vector registers are: zmm0 through zmm31
Sf32(zmm/m)         A vector floating-point 32 bit swizzle/conversion. Refer to Table 2.2 for register
                    sources and Table 2.4 for memory conversions.
Sf64(zmm/m)         A vector floating-point 64 bit swizzle/conversion. Refer to Table 2.3 for register
                    sources and Table 2.6 for memory conversions.
Si32(zmm/m)         A vector integer 32 bit swizzle/conversion. Refer to Table 2.2 for register sources
                    and Table 2.5 for memory conversions.
Si64(zmm/m)         A vector integer 64 bit swizzle/conversion. Refer to Table 2.3 for register sources
                    and Table 2.7 for memory conversions.
Uf32(m)             A floating-point 32 bit load Up-conversion. Refer to Table 2.9 for the memory
                    conversions available for all the different datatypes.
Ui32(m)             An integer 32 bit load Up-conversion. Refer to Table 2.9 for the memory
                    conversions available for all the different datatypes.
Uf64(m)             A floating-point 64 bit load Up-conversion. Refer to Table 2.11 for the memory
                    conversions available for all the different datatypes.
Ui64(m)             An integer 64 bit load Up-conversion. Refer to Table 2.11 for the memory
                    conversions available for all the different datatypes.
Df32(zmm)           A floating-point 32 bit store Down-conversion. Refer to Table 2.12 for the memory
                    conversions available for all the different datatypes.
Di32(zmm)           An integer 32 bit store Down-conversion. Refer to Table 2.12 for the memory
                    conversions available for all the different datatypes.
Df64(zmm)           A floating-point 64 bit store Down-conversion. Refer to Table 2.13 for the memory
                    conversions available for all the different datatypes.
Di64(zmm)           An integer 64 bit store Down-conversion. Refer to Table 2.13 for the memory
                    conversions available for all the different datatypes.
m                   A memory operand.
mt                  A memory operand that may have an EH hint attribute.
mvt                 A vector memory operand that may have an EH hint attribute. This memory
                    operand is encoded using ModRM and VSIB bytes. It can be seen as a set of
                    pointers where each pointer is equal to BASE + VINDEX[i] × SCALE
effective_address   Used to denote the full effective address when dealing with a memory operand.
imm8                An immediate byte value.
SRC[a-b]            A bit-field from an operand ranging from LSB b to MSB a.

Table 3.1: Operand Notation

Notation       Meaning
zmm1[i+31:i]   The value of the element located between bit i and bit i + 31 of the argument1
               vector operand.
zmm2[i+31:i]   The value of the element located between bit i and bit i + 31 of the argument2
               vector operand.
k1[i]          Specifies the i-th bit in the vector mask register k1.

Table 3.2: Vector Operand Value Notation

3.6.3

Memory size and disp8*N calculation

Table 3.3 and Table 3.4 show the size of the vector (or element) being accessed in memory, which is equal to the
scaling factor for compressed displacement (disp8*N). Note that some instructions work at element granularity
instead of full vector granularity at memory level, and hence should use the "element level" column in Table 3.3
and Table 3.4 (namely VLOADUNPACK, VPACKSTORE, VGATHER, and VSCATTER instructions).
Table 3.3: Size of vector or element accessed in memory for upconversion

                                        Memory accessed / Disp8*N
Function   Usage                        No broadcast   4to16 broadcast   1to16 broadcast or element level
U/Sf32
000        [rax] {16to16} or [rax]      64             16                4
001        [rax] {1to16}                4              NA                NA
010        [rax] {4to16}                16             NA                NA
011        [rax] {float16}              32             8                 2
100        [rax] {uint8}                16             4                 1
101        [rax] {sint8}                16             4                 1
110        [rax] {uint16}               32             8                 2
111        [rax] {sint16}               32             8                 2

Function   Usage                        No broadcast   4to16 broadcast   1to16 broadcast or element level
U/Si32
000        [rax] {16to16} or [rax]      64             16                4
001        [rax] {1to16}                4              NA                NA
010        [rax] {4to16}                16             NA                NA
011        N/A                          NA             NA                NA
100        [rax] {uint8}                16             4                 1
101        [rax] {sint8}                16             4                 1
110        [rax] {uint16}               32             8                 2
111        [rax] {sint16}               32             8                 2

Function   Usage                        No broadcast   4to8 broadcast    1to8 broadcast or element level
U/Sf64
000        [rax] {8to8} or [rax]        64             32                8
001        [rax] {1to8}                 8              NA                NA
010        [rax] {4to8}                 32             NA                NA
011        N/A                          NA             NA                NA
100        N/A                          NA             NA                NA
101        N/A                          NA             NA                NA
110        N/A                          NA             NA                NA
111        N/A                          NA             NA                NA

Function   Usage                        No broadcast   4to8 broadcast    1to8 broadcast or element level
U/Si64
000        [rax] {8to8} or [rax]        64             32                8
001        [rax] {1to8}                 8              NA                NA
010        [rax] {4to8}                 32             NA                NA
011        N/A                          NA             NA                NA
100        N/A                          NA             NA                NA
101        N/A                          NA             NA                NA
110        N/A                          NA             NA                NA
111        N/A                          NA             NA                NA

Table 3.4: Size of vector or element accessed in memory for downconversion

                              Memory accessed / Disp8*N
Function   Usage              Regular store   Element level
Df32
000        zmm1               64              4
001        N/A                NA              NA
010        N/A                NA              NA
011        zmm1 {float16}     32              2
100        zmm1 {uint8}       16              1
101        zmm1 {sint8}       16              1
110        zmm1 {uint16}      32              2
111        zmm1 {sint16}      32              2

Function   Usage              Regular store   Element level
Df64
000        zmm1               64              8
001        N/A                NA              NA
010        N/A                NA              NA
011        N/A                NA              NA
100        N/A                NA              NA
101        N/A                NA              NA
110        N/A                NA              NA
111        N/A                NA              NA

Function   Usage              Regular store   Element level
Di64
000        zmm1               64              8
001        N/A                NA              NA
010        N/A                NA              NA
011        N/A                NA              NA
100        N/A                NA              NA
101        N/A                NA              NA
110        N/A                NA              NA
111        N/A                NA              NA
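
As an illustration of how the Df32 rows of Table 3.4 feed the disp8*N calculation, the following Python sketch returns the scaling factor N for a given conversion encoding. The names DF32_STORE_SIZE and df32_disp8_scale are hypothetical; the values are copied from the table, and encodings listed as N/A are rejected.

```python
# SSS encoding -> (regular store size, element-level size), from the Df32 rows.
DF32_STORE_SIZE = {
    0b000: (64, 4),   # no conversion
    0b011: (32, 2),   # float16
    0b100: (16, 1),   # uint8
    0b101: (16, 1),   # sint8
    0b110: (32, 2),   # uint16
    0b111: (32, 2),   # sint16
}


def df32_disp8_scale(sss: int, element_level: bool = False) -> int:
    """Return N for disp8*N on a Df32 store with the given SSS encoding."""
    if sss not in DF32_STORE_SIZE:
        raise ValueError("encoding is N/A for Df32 in Table 3.4")
    regular, element = DF32_STORE_SIZE[sss]
    return element if element_level else regular
```

A regular store with the uint8 conversion therefore scales disp8 by 16, while an element-granularity instruction such as VPACKSTORE with the same conversion scales it by 1.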

3.7

EH hint

All vector instructions that access memory provide the option of specifying a cache-line eviction hint, EH.
EH is a performance hint, and may operate in different ways or even be completely ignored in different hardware
implementations. The Intel® Xeon Phi™ coprocessor is designed to provide support for cache-efficient access
to memory locations that have either low temporal locality of access or bursts of a few very closely bunched
accesses.
There are two distinct modes of EH hint operation, one for prefetching and one for loads, stores, and load-op
instructions.
The interaction of the EH hint with prefetching is summarized in Table 3.5.
EH value     Hit behavior    Miss behavior
EH not set   Make data MRU   Fetch data and make it MRU
EH set       Make data MRU   Fetch data into way #N, where N is the
                             thread number, and make it MRU

Table 3.5: Prefetch behavior based on the EH (cache-line eviction hint)
The above table describes the effect of the EH bit on gather/scatter prefetches into the targeted cache (e.g. L1
for vgatherpf0dps, L2 for vgatherpf1dps). If vgatherpf0dps misses both L1 and L2, the resulting prefetch into L1
is a non-temporal prefetch into way #N of L1, but the prefetch into L2 is a normal prefetch, not a non-temporal
prefetch. If you want the data to be non-temporally fetched into L2, you must use vgatherpf1dps with the EH bit
set.
The operation of the EH hint with prefetching is designed to limit the cache impact of streaming data.
Note that regular prefetch instructions (like vprefetch0) do not have an embedded EH hint. Instead, the non-temporal hint is given by the opcode/mnemonic (see VPREFETCHNTA/0/1/2 descriptions for details). The
same rules described in Table 3.5 still apply.
Table 3.6 summarizes the interaction of the EH hint with load and load-op instructions.
EH value     L1 hit behavior   L1 miss behavior
EH not set   Make data MRU     Fetch data and make it MRU
EH set       Make data LRU     Fetch data and make it MRU

Table 3.6: Load/load-op behavior based on the EH bit.
The EH bit, when used with load and load-op instructions, affects only the L1 cache behavior. Any resulting L2
misses are handled normally, regardless of the setting of the EH bit.
Table 3.7 summarizes the interaction of the EH hint with store instructions. Note that stores that write a full
cache-line (no mask, no down-conversion) evict the line from L1 (invalidation) while updating the contents
directly into the L2 cache. In any other case, a store with an EH hint works as a load with an EH hint.

EH value     Store type              L1 hit behavior            L1 miss behavior
EH not set                           Make data MRU              Fetch data and make it MRU
EH set       No mask, no downconv.   Invalidate L1 - Update L2  Fetch data and make it MRU
EH set       Mask or downconv.       Make data LRU              Fetch data and make it MRU

Table 3.7: Store behavior based on the EH bit.
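
The L1 hit behavior listed in Tables 3.6 and 3.7 can be condensed into one decision function. The Python sketch below is only a restatement of those tables; the function name l1_hit_behavior and its parameters are hypothetical, with full_line standing for a store that uses neither a mask nor a down-conversion.

```python
def l1_hit_behavior(eh: bool, is_store: bool = False, full_line: bool = False) -> str:
    """L1 hit behavior of a load, load-op or store for a given EH setting."""
    if not eh:
        return "make data MRU"
    if is_store and full_line:
        # Full cache-line store: the line leaves L1 and L2 is updated directly.
        return "invalidate L1, update L2"
    # Loads, load-ops, and masked or down-converting stores.
    return "make data LRU"
```

On an L1 miss, every row of both tables reads the same: the data is fetched and made MRU.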

3.8

Functions and Tables Used

Some mnemonic definitions use auxiliary tables and functions to ease the process of describing the operations of
the instruction. The following section describes those tables and functions that do not have an obvious meaning.

3.8.1

MemLoad and MemStore

This document uses two functions, MemLoad and MemStore, to describe in pseudo-code memory transfers that
involve no conversions or broadcasts:
• MemLoad: Given an address pointer, this function returns the associated data from memory. Size is defined by the explicit destination size in the pseudo-code (see for example LDMXCSR in Appendix B).
• MemStore: Given an address pointer, this function stores the associated data to memory. Size is defined
by the explicit source data size in the pseudo-code.

3.8.2

SwizzUpConvLoad, UpConvLoad and DownConvStore

In this document, the detailed discussions of memory-accessing instructions that support datatype conversion
and/or broadcast (as defined by the UpConv, SwizzUpConv, and DownConv tables in section 2.2) use the functions shown in Table 3.8 in their Operation sections (the instruction pseudo-code). These functions are used
to describe any swizzle, broadcast, and/or conversion that can be performed by the instruction, as well as the
actual load in the case of SwizzUpConv and UpConv. Note that zmm/m means that the source may be either a
vector operand or a memory operand, depending on the ModR/M encoding.
The Operation section may use UpConvSizeOf, which returns the final size (in bytes) of an up-converted memory
element given a specified up-conversion mode. A specific subset of a memory stream may be used as a parameter
for UpConv; the size of the subset is inferred from the size of the destination together with the up-conversion mode.
Additionally, the Operation section may also use DownConvStoreSizeOf, which returns the final size (in bytes) of
a down-converted vector element given a specified down-conversion mode. A specific subset of a vector register
may be used as a parameter for DownConvStore; for example, DownConvStore(zmm2[31:0]) specifies that the
low 32 bits of zmm2 form the parameter for DownConv.

3.8.3

Other Functions/Identiers

The following identi iers are used in the algorithmic descriptions:

Swizzle/conversion used   Function used in operation description
Sf32(zmm/m)               SwizzUpConvLoadf32(zmm/m)
Sf64(zmm/m)               SwizzUpConvLoadf64(zmm/m)
Si32(zmm/m)               SwizzUpConvLoadi32(zmm/m)
Si64(zmm/m)               SwizzUpConvLoadi64(zmm/m)
Uf32(m)                   UpConvLoadf32(m)
Ui32(m)                   UpConvLoadi32(m)
Uf64(m)                   UpConvLoadf64(m)
Ui64(m)                   UpConvLoadi64(m)
Df32(zmm)                 DownConvStoref32(zmm) or DownConvStoref32(zmm[xx:yy])
Di32(zmm)                 DownConvStorei32(zmm) or DownConvStorei32(zmm[xx:yy])
Df64(zmm)                 DownConvStoref64(zmm) or DownConvStoref64(zmm[xx:yy])
Di64(zmm)                 DownConvStorei64(zmm) or DownConvStorei64(zmm[xx:yy])

Table 3.8: SwizzUpConv, UpConv and DownConv function conventions
• Carry - The carry bit from an addition.
• FpMaxAbs - The greater of the absolute values of two floating-point numbers. See the description of the
VGMAXABSPS instruction for further details.
• FpMax - The greater of two floating-point numbers. See the description of the VGMAXPS instruction for
further details.
• FpMin - The lesser of two floating-point numbers. See the description of the VGMINPS instruction for
further details.
• Abs - The absolute value of a number.
• IMax - The greater of two signed integer numbers.
• UMax - The greater of two unsigned integer numbers.
• IMin - The lesser of two signed integer numbers.
• UMin - The lesser of two unsigned integer numbers.
• CvtInt32ToFloat32 - Convert a signed 32 bit integer number to a 32 bit floating-point number.
• CvtInt32ToFloat64 - Convert a signed 32 bit integer number to a 64 bit floating-point number.
• CvtFloat32ToInt32 - Convert a 32 bit floating-point number to a 32 bit signed integer number using the
specified rounding mode.
• CvtFloat64ToInt32 - Convert a 64 bit floating-point number to a 32 bit signed integer number using the
specified rounding mode.
• CvtFloat32ToUint32 - Convert a 32 bit floating-point number to a 32 bit unsigned integer number using
the specified rounding mode.
• CvtFloat64ToUint32 - Convert a 64 bit floating-point number to a 32 bit unsigned integer number using
the specified rounding mode.
• CvtFloat32ToFloat64 - Convert a 32 bit floating-point number to a 64 bit floating-point number.
• CvtFloat64ToFloat32 - Convert a 64 bit floating-point number to a 32 bit floating-point number using
the specified rounding mode.
• CvtUint32ToFloat32 - Convert an unsigned 32 bit integer number to a 32 bit floating-point number.
• CvtUint32ToFloat64 - Convert an unsigned 32 bit integer number to a 64 bit floating-point number.
• GetExp - Obtains the (un-biased) exponent of a given floating-point number, returned in the form of a 32
bit floating-point number. See the description of the VGETEXPPS instruction for further details.
• RoundToInt - Rounds a floating-point number to the nearest integer, using the specified rounding mode.
The result is a floating-point representation of the rounded integer value.
• Borrow - The borrow bit from a subtraction.
• ZeroExtend - Returns a value zero-extended to the operand-size attribute of the instruction.
• FlushL1CacheLine - Flushes the cache line containing the specified memory address from L1.
• InvalidateCacheLine - Invalidates the cache line containing the specified memory address from the whole
memory cache hierarchy.
• FetchL1CacheLine - Prefetches the cache line containing the specified memory address into L1. See the
description of the VPREFETCH1 instruction for further details.
• FetchL2CacheLine - Prefetches the cache line containing the specified memory address into L2. See the
description of the VPREFETCH2 instruction for further details.

Chapter 4
Floating-Point Environment, Memory Addressing, and Processor State
This chapter describes the Intel® Xeon Phi™ coprocessor vector floating-point instruction exception behavior
and interactions related to system programming.

4.1

Overview

The Intel® Xeon Phi™ coprocessor 512-bit vector instructions that operate on floating-point data may signal
exceptions related to arithmetic processing. When SIMD floating-point exceptions occur, the Intel® Xeon Phi™
coprocessor supports exception reporting using exception flags in the MXCSR register, but traps (unmasked exceptions) are not supported.
Exceptions caused by memory accesses apply to vector floating-point, vector integer, and scalar instructions.
The MXCSR register (see Figure 4.1) in the Intel® Xeon Phi™ coprocessor provides:
• Exception flags to indicate SIMD floating-point exceptions signaled by floating-point instructions operating on zmm registers. The flags are: IE, DE, ZE, OE, UE, PE.
• Rounding behavior and control: DAZ, FZ and RC.
• Exception Suppression: DUE (always 1)
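
As a rough illustration of how software might interpret a saved MXCSR image, the sketch below decodes the fields listed above. The bit positions are assumptions: IE at bit 0 follows this chapter, the remaining flag, DAZ, RC and FZ positions follow the standard IA-32 MXCSR layout, and placing DUE at bit 21 is inferred from the reset value 0020_0000H given in Table 4.5 rather than stated in this section. The name decode_mxcsr is hypothetical.

```python
# Assumed bit positions (standard IA-32 MXCSR layout; DUE inferred from the reset value).
FLAG_BITS = {"IE": 0, "DE": 1, "ZE": 2, "OE": 3, "UE": 4, "PE": 5}
DAZ_BIT, FZ_BIT, DUE_BIT = 6, 15, 21
RC_SHIFT, RC_MASK = 13, 0b11
RC_NAMES = ("nearest-even", "down", "up", "toward-zero")


def decode_mxcsr(value: int) -> dict:
    """Split a 32-bit MXCSR image into exception flags and control fields."""
    return {
        "flags": [name for name, bit in FLAG_BITS.items() if (value >> bit) & 1],
        "DAZ": bool((value >> DAZ_BIT) & 1),
        "FZ": bool((value >> FZ_BIT) & 1),
        "RC": RC_NAMES[(value >> RC_SHIFT) & RC_MASK],
        "DUE": bool((value >> DUE_BIT) & 1),
    }


# Reset value listed in Table 4.5: no flags set, round to nearest, DUE = 1.
state = decode_mxcsr(0x0020_0000)
```

Returning the flags as a list of names keeps the sketch easy to print and compare; a real tool would more likely keep the raw bit mask.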

4.1.1

Suppress All Exceptions Attribute (SAE)

Intel® Xeon Phi™ Coprocessor Instruction Set Architecture instructions that process floating-point data support a specific feature to disable floating-point exception signaling, called SAE ("suppress all exceptions"). The SAE mode is enabled via a specific bit in the register swizzle field of the MVEX prefix (by setting the EH bit to 1). When SAE is
enabled in the instruction encoding, that instruction does not report any SIMD floating-point exception in the
MXCSR register. This feature is only available to the register-register format of the instructions and in combination with static rounding-mode.

Figure 4.1: MXCSR Control/Status Register

4.1.2

SIMD Floating-Point Exceptions

SIMD floating-point exceptions are those exceptions that can be generated by Intel® Xeon Phi™ Coprocessor Instruction Set Architecture instructions that operate on floating-point data in zmm operands. Six classes of SIMD floating-point
exception flags can be signaled:
• Invalid operation (#I)
• Divide-by-zero (#Z)
• Numeric overflow (#O)
• Numeric underflow (#U)
• Inexact result (Precision) (#P)
• Denormal operand (#D)

4.1.3

SIMD Floating-Point Exception Conditions

The following sections describe the conditions that cause SIMD floating-point exceptions to be signaled, and the
masked response of the processor when these conditions are detected.
When more than one exception is encountered, the following precedence rules are applied¹:
1. Invalid-operation exception caused by sNaN operand
2. Any other invalid exception condition different from sNaN input operand
3. Denormal operand exception
4. A divide-by-zero exception
5. Overflow/underflow exception
6. Inexact result

¹ Note that the Intel® Xeon Phi™ coprocessor has no support for unmasked exceptions, so in this case the exception precedence rules
have no effect. All concurrently-encountered exceptions will be reported simultaneously.

All Intel® Xeon Phi™ Coprocessor Instruction Set Architecture floating-point exceptions are precise and are reported as soon as the instruction completes execution. The status flags from the MXCSR register set by each
instruction will be the logical OR of the flags set by each of the up to 16 (or 8) individual operations. The status
flags are sticky and can be cleared only via a LDMXCSR instruction.
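
The accumulation rule above can be sketched as follows. Each lane contributes a small bit mask of the exceptions it signaled, the instruction's contribution is the OR over all lanes, and that value is ORed into the existing (sticky) flags. The function name and the flag mask values are illustrative assumptions, not part of the architecture definition.

```python
from functools import reduce
from operator import or_

# Illustrative flag masks (assumed to match the low six MXCSR bits).
IE, DE, ZE, OE, UE, PE = (1 << i for i in range(6))


def update_sticky_flags(mxcsr_flags: int, lane_flags: list) -> int:
    """OR the flags raised by every lane into the current sticky flags."""
    return mxcsr_flags | reduce(or_, lane_flags, 0)


# 16 float32 lanes: lane 3 was inexact; lane 9 overflowed and is assumed
# here to report inexact as well.
lanes = [0] * 16
lanes[3] = PE
lanes[9] = OE | PE

# DE was already set by an earlier instruction and stays set (sticky).
flags = update_sticky_flags(DE, lanes)
```

Because the update only ever ORs bits in, running further instructions that raise nothing leaves the flags unchanged, which is the sticky behavior described above.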

4.1.3.1

Invalid Operation Exception (#I)

The floating-point invalid-operation exception (#I) occurs in response to an invalid arithmetic operand. The flag
(IE) and mask (IM) bits for the invalid operation exception are bits 0 and 7, respectively, in the MXCSR register.
Intel® Xeon Phi™ Coprocessor Instruction Set Architecture forces all floating-point exceptions, including invalid-operation exceptions, to be masked. Thus, for the #I exception the value returned in the destination register
is a QNaN, QNaN Indefinite, Integer Indefinite, or one of the source operands. When a value is returned to the
destination operand, it overwrites the destination register specified by the instruction. Table 4.1 lists the invalid-arithmetic operations that the processor detects for instructions and the masked responses to these operations.
Normally, when one or more of the source operands are QNaNs (and neither is an SNaN or in an unsupported
format), an invalid-operation exception is not generated. For VCMPPS and VCMPPD when the predicate is one
of lt, le, nlt, or nle, a QNaN source operand does generate an invalid-operation exception.
Note that invalid-operation exceptions (like all other floating-point exceptions) are always masked in the Intel®
Xeon Phi™ coprocessor.

4.1.3.2

Divide-By-Zero Exception (#Z)

The processor reports a divide-by-zero exception when a VRCP23PS instruction has a 0 operand.
Note that divide-by-zero exceptions (like all other floating-point exceptions) are always masked in the Intel®
Xeon Phi™ coprocessor.

4.1.3.3

Denormal Operand Exception (#D)

The processor reports a denormal operand exception when an arithmetic instruction attempts to operate on a
denormal operand and the DAZ bit in the MXCSR (the "Denormals Are Zero" bit) is set to 0 (so that denormal
operands are not treated as zeros).
Note that denormal exceptions (like all other floating-point exceptions) are always masked in the Intel® Xeon
Phi™ coprocessor.

Condition: VADDNPD, VADDNPS, VADDPD, VADDPS, VADDSETSPS, VMULPD, VMULPS, VRCP23PS, VRSQRT23PS, VLOG2PS, VSCALEPS, VSUBPD, VSUBPS, VSUBRPD or a VSUBRPS instruction with an SNaN operand
Masked Response: Return the SNaN converted to a QNaN. For more detailed information refer to Table 4.3.

Condition: VCMPPD or VCMPPS with QNaN or SNaN operand
Masked Response: Return 0 (except for the predicates not-equal, unordered, not-less-than, or not-less-than-or-equal, which return a 1).

Condition: VCVTPD2PS or VCVTPS2PD instruction with an SNaN operand
Masked Response: Return the SNaN converted to a QNaN.

Condition: VCVTFXPNTPD2DQ, VCVTFXPNTPD2UDQ, VCVTFXPNTPS2DQ, or VCVTFXPNTPS2UDQ instruction with a NaN operand
Masked Response: Return a 0.

Condition: VGATHERD, VMOVAPS, VLOADUNPACKHPS, VLOADUNPACKLPS, or VBROADCASTSS instruction with SNaN operand and a selected UpConv32 that converts from floating-point to another floating-point data type
Masked Response: Return the SNaN converted to a QNaN.

Condition: VPACKSTOREHPS, VPACKSTORELPS, VSCATTERDPS, or VMOVAPS instruction with SNaN operand and a selected DownConv32 that converts from float to another float datatype
Masked Response: Return the SNaN converted to a QNaN.

Condition: VFMADD132PD, VFMADD132PS, VFMADD213PD, VFMADD213PS, VFMADD231PD, VFMADD233PS, VFNMSUB132PD, VFNMSUB132PS, VFNMSUB213PD, VFNMSUB213PS, VFNMSUB231PD, VFNMSUB231PS, VFMSUB132PD, VFMSUB132PS, VFMSUB213PD, VFMSUB213PS, VFMSUB231PD, VFMSUB231PS, VFNMADD132PD, VFNMADD132PS, VFNMADD213PD, VFNMADD213PS, VFNMADD231PD, or VFNMADD231PS instruction with an SNaN operand
Masked Response: Follow rules described in Table 4.4.

Condition: VGMAXPD, VGMAXPS, VGMINPD or VGMINPS instruction with SNaN operand
Masked Response: Returns non NaN operand. If both operands are NaN, return first source NaN.

Condition: VGMAXABSPS instruction with SNaN operand
Masked Response: Returns non NaN operand. If both operands are NaN, return first source NaN with its sign bit cleared.

Condition: Multiplication of infinity by zero
Masked Response: Return the QNaN floating-point Indefinite.

Condition: VGETEXPPS, VRCP23PS, VRSQRT23PS or VRNDFXPNTPS instruction with SNaN operand
Masked Response: Return the SNaN converted to a QNaN.

Condition: VRSQRT23PS instruction with NaN or negative value
Masked Response: Return the QNaN floating-point Indefinite.

Condition: Addition of opposite signed infinities or subtraction of like-signed infinities
Masked Response: Return the QNaN floating-point Indefinite.

Table 4.1: Masked Responses of Intel® Xeon Phi™ Coprocessor Instruction Set Architecture to Invalid Arithmetic
Operations

4.1.3.4

Numeric Overow Exception (#O)

The processor reports a numeric overflow exception whenever the rounded result of an arithmetic instruction
exceeds the largest allowable finite value that fits in the destination operand.
Note that overflow exceptions (like all other floating-point exceptions) are always masked in the Intel® Xeon
Phi™ coprocessor.

4.1.3.5

Numeric Underow Exception (#U)

The processor signals an underflow exception whenever (a) the rounded result of an arithmetic instruction,
calculated assuming unbounded exponent, is less than the smallest possible normalized finite value that will fit
in the destination operand (the result is tiny), and (b) the final rounded result, calculated with bounded exponent
determined by the destination format, is inexact.
Note that underflow exceptions (like all other floating-point exceptions) are always masked in the Intel® Xeon
Phi™ coprocessor.
The flush-to-zero control bit provides an additional option for handling numeric underflow exceptions in the
Intel® Xeon Phi™ coprocessor. If set (FZ = 1), tiny results (these are usually, but not always, denormal values) are
replaced by zeros of the same sign. If not set (FZ = 0) then tiny results will be rounded to 0, a denormalized value,
or the smallest normalized floating-point number in the destination format, with the sign of the exact result.

4.1.3.6

Inexact Result (Precision) Exception (#P)

The inexact-result exception (also called the precision exception) occurs if the result of an operation is not exactly
representable in the destination format. For example, the fraction 1/3 cannot be precisely represented in binary
form. This exception occurs frequently and indicates that some (normally acceptable) accuracy has been lost.
The exception is supported for applications that need to perform exact arithmetic only. In flush-to-zero mode,
the inexact result exception is signaled for any tiny result. (By definition, tiny results are not zero, and are flushed
to zero when MXCSR.FZ = 1 for all instructions that support this mode.)
Note that inexact exceptions (like all other floating-point exceptions) are always masked in the Intel® Xeon Phi™
coprocessor.

4.2

Denormal Flushing Control

4.2.1

Denormal control in up-conversions and down-conversions

Instruction up-conversions and down-conversions follow specific denormal flushing rules, i.e. for treating input
denormals as zeros and for flushing tiny results to zero:

4.2.1.1

Up-conversions

• Up-conversions from float16 to float32 ignore the MXCSR.DAZ setting and thus never treat input denormals
as zeros. Denormal exceptions are never signaled (the MXCSR.DE flag is never set by these operations).
• Up-conversions from any small floating-point number (namely, float16) to float32 can never generate a
float32 output denormal.
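
The second rule can be checked numerically. The sketch below uses Python's struct module (format code "e" is IEEE half precision) to decode every positive float16 denormal bit pattern and confirms that each value is at least as large as the smallest normal float32 value, 2^-126. It models only the value mapping, not the coprocessor's conversion hardware, and the helper name is hypothetical.

```python
import struct

FLOAT32_MIN_NORMAL = 2.0 ** -126


def float16_bits_to_value(bits: int) -> float:
    """Interpret a 16-bit pattern as an IEEE half-precision number."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]


# Positive float16 denormals: exponent field 0, fraction field 1..0x3FF.
denormals = [float16_bits_to_value(frac) for frac in range(0x001, 0x400)]

smallest = min(denormals)  # the smallest float16 denormal
all_normal_in_float32 = all(v >= FLOAT32_MIN_NORMAL for v in denormals)
```

Since the smallest float16 denormal is far above the float32 normal threshold, every float16 denormal lands in the float32 normal range, and negative values behave the same by symmetry.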

4.2.1.2

Down-conversions

• Down-conversions from float32 to float16 follow the MXCSR.DAZ setting to decide whether to treat input
denormals as zeros or not. For input denormals, the MXCSR.DE flag is set only if MXCSR.DAZ is not set,
otherwise it is left unchanged.
• Down-conversions from float32 to any integer format follow the MXCSR.DAZ setting to decide whether to
treat input denormals as zeros or not (this may matter only in directed rounding modes). The MXCSR.DE
status flag is never set.
• Down-conversions from float32 to any small floating-point number ignore MXCSR.FZ and always preserve
output denormals.

4.3

Extended Addressing Displacements

Address displacements used by memory operands to the Intel® Xeon Phi™ Coprocessor Instruction Set Architecture vector instructions, as well as MVEX-encoded versions of VPREFETCH and CLEVICT, operate differently than
do normal x86 displacements. Intel® Xeon Phi™ Coprocessor Instruction Set Architecture 8-bit displacements (i.e.
when MOD.mod=01) are reinterpreted so that they are multiplied by the memory operand's total size in order
to generate the final displacement to be used in calculating the effective address (32 bit displacements, which
vector instructions may also use, operate normally, in the same way as for normal x86 instructions). Note that
extended 8-bit displacements are still signed integer numbers and need to be sign extended.
A given vector instruction's 8-bit displacement is always multiplied by the total number of bytes of memory
the instruction accesses, which can mean multiplication by 64, 32, 16, 8, 4, 2 or 1, depending on any broadcast
and/or data conversion in effect. Thus when reading a 64-byte (no conversion, no broadcast) source operand,
for example via
vmovaps zmm0, [rsi]
the encoded 8-bit displacement is first multiplied by 64 (shifted left by 6) before being used in the effective
address calculation. For
vbroadcastss zmm0, [rsi]{uint16}    // {1to16} broadcast of {uint16} data
however, the encoded displacement would be multiplied by 2. Note that for MVEX versions of VPREFETCH and
CLEVICT, we always use disp8*64; for VEX versions we use the standard x86 disp8 displacement.
The use of disp8*N makes it possible to avoid using 32 bit displacements with vector instructions most of the
time, thereby reducing code size and shrinking the required size of the paired-instruction decode window by
3 bytes. Disp8*N overcomes the limitations of a plain disp8, which is simply too small to access enough vector
operands to be useful (only 4 64-byte operands). Moreover, although disp8*N can only generate displacements that are
multiples of N, that is not a significant limitation, since Intel® Xeon Phi™ Coprocessor Instruction Set Architecture
memory operands must already be aligned to the total number of bytes of memory the instruction accesses in
order to avoid raising a #GP fault, and that alignment is exactly what disp8*N results in, given aligned base+index
addressing.
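
The disp8*N computation described above can be sketched in a few lines. This is only an illustration of the arithmetic: the function names are hypothetical, and the caller is assumed to already know N (the total number of bytes the memory operand accesses) from the conversion and broadcast in effect.

```python
def sign_extend8(byte: int) -> int:
    """Interpret an encoded 8-bit displacement as a signed integer."""
    byte &= 0xFF
    return byte - 0x100 if byte & 0x80 else byte


def effective_disp(disp8: int, access_bytes: int) -> int:
    """Scale a signed disp8 by N, the total bytes the memory operand accesses."""
    if access_bytes not in (1, 2, 4, 8, 16, 32, 64):
        raise ValueError("N must be one of 1, 2, 4, 8, 16, 32, 64")
    return sign_extend8(disp8) * access_bytes


full_vector = effective_disp(0x01, 64)   # 64-byte operand, no conversion/broadcast
broadcast_u16 = effective_disp(0x01, 2)  # {1to16} broadcast of {uint16} data
negative = effective_disp(0xFF, 64)      # encoded -1, scaled by 64
```

With N = 64 the reachable range grows from the plain disp8 range of -128..127 bytes to -8192..8128 bytes in steps of 64.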

4.4

Swizzle/up-conversion exceptions

A set of Intel® Xeon Phi™ Coprocessor Instruction Set Architecture instructions does not accept all regular forms
of memory up-conversion/register swizzling and raises a #UD fault for illegal combinations. The instructions
are:
• VALIGND
• VCVTDQ2PD
• VCVTPS2PD
• VCVTUDQ2PD
• VEXP223PS
• VFMADD233PS
• VLOG2PS
• VPERMD
• VPERMF32X4
• VPMADD233D
• VPSHUFD
• VRCP23PS
• VRSQRT23PS
Table 4.2 summarizes which up-conversion/swizzling primitives are allowed for every one of those instructions:

Mnemonic       None   {1to16}   {4to16}   Register swizzles   Memory conversions
VALIGND        yes    no        no        no                  no
VCVTDQ2PD      yes    yes       yes       yes                 no
VCVTPS2PD      yes    yes       yes       yes                 no
VCVTUDQ2PD     yes    yes       yes       yes                 no
VEXP223PS      yes    no        no        no                  no
VFMADD233PS    yes    no        yes       no                  no
VLOG2PS        yes    no        no        no                  no
VPERMD         yes    no        no        no                  no
VPERMF32X4     yes    no        no        no                  no
VPMADD233D     yes    no        yes       no                  no
VPSHUFD        yes    no        no        no                  no
VRCP23PS       yes    no        no        no                  no
VRSQRT23PS     yes    no        no        no                  no

Table 4.2: Summary of legal and illegal swizzle/conversion primitives for special instructions.
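
An assembler or encoder check could represent Table 4.2 as a lookup structure. The sketch below is one possible encoding; the primitive names, the LEGAL dictionary and the raises_ud helper are all hypothetical, and only the thirteen instructions listed in the table are covered.

```python
# Columns of Table 4.2, in order (names are illustrative).
PRIMITIVES = ("none", "1to16", "4to16", "register_swizzle", "memory_conversion")


def row(*allowed: str) -> frozenset:
    """Every listed instruction accepts the 'none' form plus the given primitives."""
    return frozenset(("none",) + allowed)


FULL = row("1to16", "4to16", "register_swizzle")

LEGAL = {
    "VALIGND": row(),
    "VCVTDQ2PD": FULL,
    "VCVTPS2PD": FULL,
    "VCVTUDQ2PD": FULL,
    "VEXP223PS": row(),
    "VFMADD233PS": row("4to16"),
    "VLOG2PS": row(),
    "VPERMD": row(),
    "VPERMF32X4": row(),
    "VPMADD233D": row("4to16"),
    "VPSHUFD": row(),
    "VRCP23PS": row(),
    "VRSQRT23PS": row(),
}


def raises_ud(mnemonic: str, primitive: str) -> bool:
    """True if Table 4.2 marks the combination as illegal (a #UD fault)."""
    if primitive not in PRIMITIVES:
        raise ValueError(f"unknown primitive: {primitive}")
    return primitive not in LEGAL[mnemonic.upper()]
```

Instructions outside the table raise KeyError in this sketch, since the table says nothing about them.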

4.5

Accessing uncacheable memory

When accessing non-cacheable memory, it is important to define the amount of data that is really accessed when
using Intel® Xeon Phi™ Coprocessor Instruction Set Architecture instructions (mainly when these instructions are used to access memory-mapped I/O regions). Depending on the
memory region accessed, an access may cause a mapped device to behave differently.
Intel® Xeon Phi™ Coprocessor Instruction Set Architecture instructions that access uncacheable memory can be
categorized in four different groups:
• regular memory read operations
• vloadunpackh*/vloadunpackl*
• vgatherd*
• memory store operations

4.5.1

Memory read operations

Any Intel® Xeon Phi™ Coprocessor Instruction Set Architecture instruction that reads from memory, apart from vloadunpackh*/vloadunpackl* and vgatherd*, accesses as many consecutive bytes as dictated by the combination of memory SwizzUpConv modifiers.

4.5.2

vloadunpackh*/vloadunpackl*

vloadunpackh*/vloadunpackl* instructions are exceptions to the general rule. Those two instructions will
always access 64 bytes of memory. The memory region accessed is between effective_address & (~0x3F) and
(effective_address & (~0x3F)) + 63 in both cases.
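
The 64-byte region rule can be written out directly. The sketch below is a minimal illustration; accessed_region is a hypothetical helper name, and addresses are treated as plain non-negative integers.

```python
LINE_BYTES = 64


def accessed_region(effective_address: int) -> tuple:
    """Return (first_byte, last_byte) of the 64-byte region containing the address."""
    base = effective_address & ~(LINE_BYTES - 1)  # same as & ~0x3F
    return base, base + LINE_BYTES - 1


first, last = accessed_region(0x1234)
```

Any address inside the same aligned 64-byte block yields the same region, which is why the access size does not depend on the alignment of the effective address itself.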

4.5.3

vgatherd*

vgatherd* instructions are able to gather up to 16 32 bit elements. The number of elements accessed is determined by the number of bits set in the vector mask provided as source. A vgatherd* instruction will access up to 16
different 64-byte memory regions when gathering the elements. Note that, depending on the implementation,
only one 64-byte memory access is performed for a variable number of vector elements located in that region.
Each accessed region will be between element_effective_address & (~0x3F) and (element_effective_address &
(~0x3F)) + 63.

4.5.4

Memory stores

All Intel® Xeon Phi™ Coprocessor Instruction Set Architecture instructions that perform memory store operations update those
memory positions determined by the vector mask operand. The vector mask specifies which elements will actually be stored in memory. DownConv* determines the number of bytes per element that will be modified in memory.

4.6

Floating-point Notes

4.6.1

Rounding Modes

VRNDFXPNTPS and conversion instructions with float32 sources, such as VCVTFXPNTPS2DQ, support four selectable rounding modes: round to nearest (even), round toward negative infinity (round down), round toward
positive infinity (round up), and round toward zero. These are the standard IEEE rounding modes; see IA-32
Intel® Architecture Software Developer's Manual: Volume 1, Section 4.8.4, for details.
The Intel® Xeon Phi™ coprocessor introduces general support for all four rounding-modes mandated for binary
floating-point arithmetic by the IEEE Standard 754-2008.

4.6.1.1

Swizzle-explicit rounding modes

The Intel® Xeon Phi™ coprocessor introduces the option of specifying the rounding-mode per instruction via a
specific register swizzle mode (by setting the EH bit to 1). This specific rounding-mode takes precedence over
whatever MXCSR.RC specifies.
For those instructions (like VRNDFXPNTPS) where an explicit rounding-mode is specified via immediate, this
immediate takes precedence over a swizzle-explicit rounding-mode embedded into the encoding of the instruction.
The priority of the rounding-modes of an instruction hence becomes (from highest to lowest):
1. Rounding mode specified in the instruction immediate (if any)
2. Rounding mode specified in the instruction swizzle attribute
3. Rounding mode specified in RC bits of the MXCSR
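
The priority order above amounts to a simple fallback chain. In the sketch below the mode names are plain strings chosen for illustration, None stands for "not specified by this instruction", and select_rounding_mode is a hypothetical helper rather than anything defined by the architecture.

```python
from typing import Optional


def select_rounding_mode(
    mxcsr_rc: str,
    swizzle_rc: Optional[str] = None,
    imm_rc: Optional[str] = None,
) -> str:
    """Pick the rounding mode in effect: immediate, then swizzle, then MXCSR.RC."""
    if imm_rc is not None:
        return imm_rc
    if swizzle_rc is not None:
        return swizzle_rc
    return mxcsr_rc


default_mode = select_rounding_mode("nearest-even")
swizzle_mode = select_rounding_mode("nearest-even", swizzle_rc="down")
imm_mode = select_rounding_mode("nearest-even", swizzle_rc="down", imm_rc="toward-zero")
```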

4.6.1.2

Denition and propagation of NaNs

The IA-32 architecture defines two classes of NaNs: quiet NaNs (QNaNs) and signaling NaNs (SNaNs). Quiet
NaNs have 1 as their first fraction bit, SNaNs have 0 as their first fraction bit. An SNaN is quieted by setting its
first fraction bit to 1. The class of a NaN (quiet or signaling) is preserved when converting between different
precisions.
The processor never generates an SNaN as a result of a floating-point operation with no SNaN operands, so
SNaNs must be present in the input data or have to be inserted by the software.
QNaNs are allowed to propagate through most arithmetic operations without signaling an exception. Note also
that Intel® Xeon Phi™ Coprocessor Instruction Set Architecture instructions do not trap for arithmetic exceptions,
as floating-point exceptions are always masked.
If any operation has one or more NaN operands then the result, in most cases, is a QNaN that is one of the input
NaNs, quieted if it is an SNaN. This is chosen as the first NaN encountered when scanning the operands from left
to right, as presented in the instruction descriptions from Chapter 6.
If any floating-point operation with operands that are not NaNs leads to an indefinite result (e.g. 0/0, 0 × ∞, or
∞ − ∞), the result will be QNaN Indefinite: 0xFFC00000 for 32 bit operations and 0xFFF8000000000000 for
64 bit operations.
When operating on NaNs, if the instruction does not define any other behavior, Table 4.3 describes the NaN
behavior for unary and binary instructions. Table 4.4 shows the NaN behavior for ternary fused multiply and
add/sub operations. This table can be derived by considering the operation as a concatenation of two binary operations. The first binary operation, the multiply, produces the product. The second operation uses the product
as the first operand for the addition.

Source operands                    Result
SNaN                               SNaN source operand, converted into a QNaN
QNaN                               QNaN source operand
SNaN and QNaN                      First operand (if this operand is an SNaN, it is converted to a QNaN)
Two SNaNs                          First operand converted to a QNaN
Two QNaNs                          First operand
SNaN and a floating-point value    SNaN source operand, converted into a QNaN
QNaN and a floating-point value    QNaN source operand

Table 4.3: Rules for handling NaNs for unary and binary operations.
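
The rules in Table 4.3 reduce to "take the first NaN operand and quiet it". The sketch below applies that to raw float32 bit patterns; the helper names are hypothetical, and the quiet bit is the first (most significant) fraction bit as described in this section.

```python
QUIET_BIT = 0x0040_0000        # first fraction bit of a float32
QNAN_INDEFINITE = 0xFFC0_0000  # QNaN Indefinite for 32 bit operations


def is_nan(bits: int) -> bool:
    """True if the float32 pattern has an all-ones exponent and a non-zero fraction."""
    return (bits & 0x7F80_0000) == 0x7F80_0000 and (bits & 0x007F_FFFF) != 0


def quiet(bits: int) -> int:
    """Convert an SNaN to a QNaN (a QNaN is returned unchanged)."""
    return bits | QUIET_BIT


def propagate_nan(src1: int, src2: int):
    """Return the NaN result per Table 4.3, or None if neither operand is a NaN."""
    for operand in (src1, src2):
        if is_nan(operand):
            return quiet(operand)
    return None


snan, qnan, one = 0x7F80_0001, 0x7FC0_0002, 0x3F80_0000
mixed = propagate_nan(snan, qnan)  # first operand wins and is quieted
```

Scanning the operands in order is what makes the "SNaN and QNaN" and "Two SNaNs" rows come out as "first operand"; the payload bits other than the quiet bit are preserved.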

4.6.1.3

Signed Zeros

Zero can be represented as a +0 or a −0 depending on the sign bit. Both encodings are equal in value. The sign
of a zero result depends on the operation being performed and the rounding mode being used.
Intel® Xeon Phi™ Coprocessor Instruction Set Architecture introduces the fused "multiply and add" and "multiply
and sub" operations. These consist of a multiplication (whose sign is possibly negated) followed by an addition
or subtraction, all calculated with just one rounding error.
The sign of the multiplication result is the exclusive-or of the signs of the multiplier and multiplicand, regardless
of the rounding mode (a positive number has a sign bit of 0, and a negative one, a sign bit of 1).
The sign of the addition (or subtraction) result is in general that of the exact result. However, when this result
is exactly zero, special rules apply: when the sum of two operands with opposite signs (or the difference of two
operands with like signs) is exactly zero, the sign of that sum (or difference) is +0 in all rounding modes, except
round down; in that case, the sign of an exact zero sum (or difference) is −0. This is true even if the operands
are zeros, or denormals treated as zeros because MXCSR.DAZ is set to 1. Note that x + x = x − (−x) retains the
same sign as x even when x is zero; in particular, (+0) + (+0) = +0, and (−0) + (−0) = −0, in all rounding
modes.
When (a × b) ± c is exactly zero, the sign of fused multiply-add/subtract shall be determined by the rules above
for a sum of operands. When the exact result of ±(a × b) ± c is non-zero yet the final result is zero because of
rounding, the zero result takes the sign of the exact result.
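
The sign rules for products and for exactly-zero sums can be summarized in code. The sketch below works on sign bits only (0 for positive, 1 for negative); both function names are hypothetical, and for a subtraction a − b the caller is assumed to pass the flipped sign of b.

```python
def product_sign(sign_a: int, sign_b: int) -> int:
    """Sign bit of a product: exclusive-or of the operand signs."""
    return sign_a ^ sign_b


def exact_zero_sum_sign(sign_a: int, sign_b: int, round_down: bool = False) -> int:
    """Sign bit of a + b when the exact sum is zero."""
    if sign_a == sign_b:
        # Only possible when both operands are zeros: x + x keeps the sign of x.
        return sign_a
    # Opposite signs cancelling exactly: +0, except -0 when rounding down.
    return 1 if round_down else 0


plus_plus = exact_zero_sum_sign(0, 0)          # (+0) + (+0)
minus_minus = exact_zero_sum_sign(1, 1)        # (-0) + (-0)
cancel = exact_zero_sum_sign(0, 1)             # x + (-x), not rounding down
cancel_down = exact_zero_sum_sign(0, 1, True)  # x + (-x), round down
```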
Src1    Src2    Src3    132 forms   213 forms   231 forms   vfmadd233ps (a)
NaN1    NaN2    NaN3    qNaN1       qNaN2       qNaN2       qNaN2
NaN1    NaN2    value   qNaN1       qNaN2       qNaN2       qNaN2
NaN1    value   NaN3    qNaN1       qNaN1       qNaN3       qNaN3
value   NaN2    NaN3    qNaN3       qNaN2       qNaN2       qNaN2
NaN1    value   value   qNaN1       qNaN1       qNaN1       qNaN1
value   NaN2    value   qNaN2       qNaN2       qNaN2       qNaN2
value   value   NaN3    qNaN3       qNaN3       qNaN3       qNaN3

132 forms: vfmadd132ps, vfnmsub132ps, vfmsub132ps, vfnmadd132ps, vfmadd132pd, vfnmsub132pd, vfmsub132pd, vfnmadd132pd
213 forms: vfmadd213ps, vfnmsub213ps, vfmsub213ps, vfnmadd213ps, vfmadd213pd, vfnmsub213pd, vfmsub213pd, vfnmadd213pd
231 forms: vfmadd231ps, vfmsub231ps, vfnmadd231ps, vfmadd231pd, vfnmsub231pd, vfmsub231pd, vfnmadd231pd

(a) The interpretation of the sources is slightly different for this instruction. Here the Src1 column and NaN1 are associated with Src3[31:0]. Similarly the Src3 column and NaN3 are
associated with Src3[63:32].

Table 4.4: Rules for handling NaNs for fused multiply and add/sub operations (ternary).

The result for "fused multiply and add" follows by applying the following algorithm:
• (x_d, y_d, z_d) = DAZ applied to (Src1, Src2, Src3) (denormal operands, if any, are treated as zeros of the same
sign as the operand; other operands are not changed)
• Result_d = x_d × y_d + z_d computed exactly then rounded to the destination precision.
• Result = FTZ applied to Result_d (tiny results are replaced by zeros of the same sign; other results are not
changed).
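
The three steps of the fused multiply-add algorithm can be sketched for float32 as follows. This is an illustration under several assumptions: inputs are finite values exactly representable in float32, rounding is round to nearest (even) only, a result is treated as tiny when its rounded value is denormal (a simplification of the definition in section 4.1.3.5), and the sign rules for an exactly-zero result are not modeled (an exact zero always comes out as +0). All function names are hypothetical.

```python
import math
from fractions import Fraction

MIN_NORMAL = 2.0 ** -126                       # smallest normal float32 magnitude
MAX_FINITE = Fraction((2 ** 24 - 1) * 2 ** 104)  # largest finite float32


def daz(x: float) -> float:
    """Treat a float32 denormal input as a zero of the same sign."""
    if x != 0.0 and abs(x) < MIN_NORMAL:
        return math.copysign(0.0, x)
    return x


def round_to_float32(q: Fraction) -> float:
    """Round an exact rational to the nearest float32 value (ties to even)."""
    if q == 0:
        return 0.0
    a = abs(q)
    e = a.numerator.bit_length() - a.denominator.bit_length()
    if Fraction(2) ** e > a:
        e -= 1                                    # now 2**e <= a < 2**(e + 1)
    quantum = Fraction(2) ** (max(e, -126) - 23)  # spacing of float32 values near a
    rounded = round(a / quantum) * quantum        # round() on a Fraction is ties-to-even
    value = float(rounded) if rounded <= MAX_FINITE else math.inf
    return -value if q < 0 else value


def fma_float32(src1: float, src2: float, src3: float,
                daz_on: bool = False, fz_on: bool = False) -> float:
    """DAZ on inputs, exact multiply-add with a single rounding, then FTZ."""
    x, y, z = (daz(s) if daz_on else s for s in (src1, src2, src3))
    exact = Fraction(x) * Fraction(y) + Fraction(z)
    result = round_to_float32(exact)
    if fz_on and result != 0.0 and abs(result) < MIN_NORMAL:
        result = math.copysign(0.0, result)
    return result


x = 1.0 + 2.0 ** -12
single_rounding = fma_float32(x, x, -1.0)  # keeps the 2**-24 term of the exact product
```

Using exact rational arithmetic for the middle step is what makes the single rounding visible: rounding the product x × x to float32 before adding −1.0 would lose the 2^-24 term.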

4.6.2

REX prex and Intel® Xeon Phi™ Coprocessor Instruction Set Architecture interactions

The REX prefix is illegal in combination with Intel® Xeon Phi™ Coprocessor Instruction Set Architecture vector
instructions, or with mask and scalar instructions allocated using VEX and MVEX prefixes.
Following the Intel® 64 behavior, if the REX prefix is followed with any legacy prefix and not located just before
the opcode escape, it will be ignored.

4.7

Intel® Xeon Phi™ Coprocessor Instruction Set Architecture State Save

The Intel® Xeon Phi™ coprocessor does not include any explicit instruction to perform context save and restore
of the Intel® Xeon Phi™ coprocessor state. To perform a context save and restore, software may use:
• Vector loads and stores for vector registers
• A combination of kmov plus scalar loads and stores for mask registers
• LDMXCSR/STMXCSR for the MXCSR state register
Note also that vector instructions raise a device-not-available (#NM) exception when CR0.TS is set. This allows
selective lazy save and restore of state to be performed.

4.8

Intel® Xeon Phi™ Coprocessor Instruction Set Architecture Processor State After Reset

Table 4.5 shows the state of the flags and other registers following power-up for the Intel® Xeon Phi™ coprocessor.

Register                                      Intel® Xeon Phi™ coprocessor
EFLAGS                                        00000002H
EIP                                           0000FFF0H
CR0                                           60000010H²
CR2, CR3, CR4                                 00000000H
CS                                            Selector = F000H; Base = FFFF0000H
                                              Limit = FFFFH
                                              AR = Present, R/W, Accessed
SS, DS, ES, FS, GS                            Selector = 0000H; Base = 00000000H
                                              Limit = FFFFH
                                              AR = Present, R/W, Accessed
EDX                                           000005xxH
EAX                                           0⁴
EBX, ECX, ESI, EDI, EBP, ESP                  00000000H
ST0 through ST7                               Pwr up or Reset: +0.0
                                              FINIT/FNINIT: Unchanged
x87 FPU Control Word                          Pwr up or Reset: 0040H
                                              FINIT/FNINIT: 037FH
x87 FPU Status Word                           Pwr up or Reset: 0000H
                                              FINIT/FNINIT: 0000H
x87 FPU Tag Word                              Pwr up or Reset: 5555H
                                              FINIT/FNINIT: FFFFH
x87 FPU Data Operand and CS Seg. Selectors    Pwr up or Reset: 0000H
                                              FINIT/FNINIT: 0000H
x87 FPU Data Operand and Inst. Pointers       Pwr up or Reset: 00000000H
                                              FINIT/FNINIT: 00000000H
MM0 through MM7                               NA
XMM0 through XMM7                             NA
k0 through k7                                 0000H
zmm0 through zmm31                            0 (64 bytes)
MXCSR                                         0020_0000H
GDTR, IDTR                                    Base = 00000000H, Limit = FFFFH
                                              AR = Present, R/W
LDTR, Task Register                           Selector = 0000H, Base = 00000000H
                                              Limit = FFFFH
                                              AR = Present, R/W
DR0, DR1, DR2, DR3                            00000000H
DR6                                           FFFF0FF0H
DR7                                           00000400H
Time-Stamp Counter                            Power up or Reset: 0H
                                              INIT: Unchanged
Perf. Counters and Event Select               Power up or Reset: 0H
                                              INIT: Unchanged
All Other MSRs                                Power up or Reset: Undefined
                                              INIT: Unchanged
Data and Code Cache, TLBs                     Invalid
MTRRs, Machine-Check                          Not Implemented
APIC                                          Pwr up or Reset: Enabled
                                              INIT: Unchanged

Table 4.5: Processor State Following Power-up, Reset, or INIT.

Chapter 5
Instruction Set Reference
Intel® Xeon Phi™ Coprocessor Instruction Set Architecture instructions that are described in this document follow the general
documentation convention established in this chapter.

5.1

Interpreting Instruction Reference Pages

This section describes the format of information contained in the instruction reference pages in this chapter. It
explains notational conventions and abbreviations used in these sections.

5.1.1

Instruction Format

The following is an example of the format used for each instruction description in this chapter.

Opcode                           Instruction                               Description
MVEX.NDS.512.66.0F38.W1 50 /r    vaddnpd zmm1 {k1}, zmm2, Sf64(zmm3/mt)    Add float64 vector zmm2 and float64 vector
                                                                           Sf64(zmm3/mt), negate the sum, and store the
                                                                           result in zmm1, under write-mask.
VEX.0F.W0 41 /r                  kand k1, k2                               Perform a bitwise AND between k1 and k2, store
                                                                           result in k1.

5.1.2

Opcode Notations for MVEX Encoded Instructions

In the Instruction Summary Table, the Opcode column presents the details of each instruction byte encoding
using notations described in this section. For MVEX encoded instructions, the notations are expressed in the
following form (including the modR/M byte if applicable, and the immediate byte if applicable):


MVEX.[NDS,NDD].[512].[66,F2,F3].0F/0F3A/0F38.[W0,W1] opcode [/r] [/ib]

• MVEX: indicates that the MVEX prefix is required. The MVEX prefix consists of 4 bytes with the
leading byte 62H.
The encoding of various sub-fields of the MVEX prefix is described using the following notations:
– NDS,NDD: specifies that the MVEX.vvvv field is valid for the encoding of a register operand:
* MVEX.NDS: MVEX.vvvv encodes the first source register in an instruction syntax where the content of source registers will be preserved. To encode a vector register in the range zmm16-zmm31, the MVEX.vvvv field is prepended with MVEX.V'.
* MVEX.NDD: MVEX.vvvv encodes the destination register that cannot be encoded by the ModR/M:reg
field. To encode a vector register in the range zmm16-zmm31, the MVEX.vvvv field is prepended
with MVEX.V'.
* If neither NDS nor NDD is present, MVEX.vvvv must be 1111b (i.e. MVEX.vvvv does not encode an
operand).
– 66,F2,F3: The presence or absence of these values maps to the MVEX.pp field encodings. If absent,
this corresponds to MVEX.pp=00B. If present, the corresponding MVEX.pp value affects the "opcode"
byte in the same way as a SIMD prefix (66H, F2H or F3H) affects the ensuing opcode byte. Thus a
non-zero encoding of MVEX.pp may be considered as an implied 66H/F2H/F3H prefix.
– 0F,0F3A,0F38: The presence of these values maps to a valid encoding of the MVEX.mmmm field. Only
three encoded values of MVEX.mmmm are defined as valid, corresponding to the escape byte sequences 0FH, 0F3AH and 0F38H.
– W0: MVEX.W=0
– W1: MVEX.W=1
– The presence of W0/W1 in the opcode column applies to two situations: (a) it is treated as an extended opcode bit, (b) the instruction semantics support an operand size promotion to 64 bit of a
general-purpose register operand or a 32 bit memory operand.
• opcode: Instruction opcode.
• /r: Indicates that the ModR/M byte of the instruction contains a register operand and an r/m operand.
• /vsib: Indicates that the memory addressing uses the vector SIB byte.
• ib: A 1-byte immediate operand to the instruction that follows the opcode, ModR/M bytes or scale/indexing
bytes.
In general, the encoding of the MVEX.R, MVEX.X, MVEX.B, and MVEX.V' fields is not shown explicitly in the
opcode column. The encoding scheme of the MVEX.R, MVEX.X, MVEX.B, and MVEX.V' fields must follow the rules
defined in Chapter 3.

5.1.3 Opcode Notations for VEX Encoded Instructions

In the Instruction Summary Table, the Opcode column presents the details of each instruction byte encoding
using notations described in this section. For VEX encoded instructions, the notations are expressed in the following form (including the modR/M byte if applicable, the immediate byte if applicable):
VEX.[NDS,NDD].[66,F2,F3].0F/0F3A/0F38.[W0,W1] opcode [/r] [/ib]


• VEX: indicates that the VEX prefix is required. The VEX prefix can be encoded using the
three-byte form (the first byte is C4H), or using the two-byte form (the first byte is C5H). The two-byte
form of VEX only applies to those instructions that do not require the following fields to be encoded:
VEX.mmmmm, VEX.W, VEX.X, VEX.B. Refer to Chapter 3 for more details on the VEX prefix.
The encoding of various sub-fields of the VEX prefix is described using the following notations:
– NDS,NDD: specifies that the VEX.vvvv field is valid for the encoding of a register operand:
* VEX.NDS: VEX.vvvv encodes the first source register in an instruction syntax where the content
of source registers will be preserved.
* VEX.NDD: VEX.vvvv encodes the destination register that cannot be encoded by the ModR/M:reg
field.
* If neither NDS nor NDD is present, VEX.vvvv must be 1111b (i.e. VEX.vvvv does not encode an
operand). The VEX.vvvv field can be encoded using either the 2-byte or 3-byte form of the VEX
prefix.
– 66,F2,F3: The presence or absence of these values maps to the VEX.pp field encodings. If absent, this
corresponds to VEX.pp=00B. If present, the corresponding VEX.pp value affects the "opcode" byte in
the same way as a SIMD prefix (66H, F2H or F3H) affects the ensuing opcode byte. Thus a non-zero
encoding of VEX.pp may be considered as an implied 66H/F2H/F3H prefix. The VEX.pp field may be
encoded using either the 2-byte or 3-byte form of the VEX prefix.
– 0F,0F3A,0F38: The presence of these values maps to a valid encoding of the VEX.mmmmm field. Only
three encoded values of VEX.mmmmm are defined as valid, corresponding to the escape byte sequences 0FH, 0F3AH and 0F38H. The effect of a valid VEX.mmmmm encoding on the ensuing opcode
byte is the same as that of the corresponding escape byte sequence on the ensuing opcode byte for non-VEX
encoded instructions. Thus a valid encoding of VEX.mmmmm may be considered as an implied escape
byte sequence of either 0FH, 0F3AH or 0F38H. The VEX.mmmmm field must be encoded using the
3-byte form of the VEX prefix.
– 0F,0F3A,0F38 and 2-byte/3-byte VEX: The presence of 0F3A and 0F38 in the opcode column implies
that the opcode can only be encoded by the three-byte form of VEX. The presence of 0F in the opcode
column does not preclude the opcode from being encoded by the two-byte form of VEX if the semantics of the
opcode do not require any sub-field of VEX not present in the two-byte form of the VEX prefix.
– W0: VEX.W=0
– W1: VEX.W=1
– The presence of W0/W1 in the opcode column applies to two situations: (a) it is treated as an extended opcode bit, (b) the instruction semantics support an operand size promotion to 64 bit of a
general-purpose register operand or a 32 bit memory operand. The presence of W1 in the opcode
column implies the opcode must be encoded using the 3-byte form of the VEX prefix. The presence
of W0 in the opcode column does not preclude the opcode from being encoded using the C5H form of the
VEX prefix, if the semantics of the opcode do not require other VEX sub-fields not present in the
two-byte form of the VEX prefix.
• opcode: Instruction opcode.
• /r: Indicates that the ModR/M byte of the instruction contains a register operand and an r/m operand.
• ib: A 1-byte immediate operand to the instruction that follows the opcode, ModR/M bytes or scale/indexing
bytes.
• In general, the encoding of the VEX.R, VEX.X, and VEX.B fields is not shown explicitly in the opcode column. The encoding scheme of the VEX.R, VEX.X, and VEX.B fields must follow the rules defined in Chapter
3.
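
As an illustration of the notation above, the following C sketch maps the SIMD-prefix and escape-sequence tokens of the opcode column to VEX.pp and VEX.mmmmm values, and checks whether the two-byte (C5H) form is permitted. The numeric field values (pp: none=00B, 66=01B, F3=10B, F2=11B; mmmmm: 0F=00001B, 0F38=00010B, 0F3A=00011B) are the standard VEX encodings and are not listed in this section; all type and function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names, for illustration only. */
typedef enum { PFX_NONE, PFX_66, PFX_F3, PFX_F2 } simd_prefix;
typedef enum { MAP_0F, MAP_0F38, MAP_0F3A } opcode_map;

/* VEX.pp: implied SIMD prefix (none=00B, 66=01B, F3=10B, F2=11B). */
static uint8_t vex_pp(simd_prefix p) {
    switch (p) {
    case PFX_66: return 1;
    case PFX_F3: return 2;
    case PFX_F2: return 3;
    default:     return 0;
    }
}

/* VEX.mmmmm: implied escape sequence (0F=1, 0F38=2, 0F3A=3). */
static uint8_t vex_mmmmm(opcode_map m) {
    switch (m) {
    case MAP_0F38: return 2;
    case MAP_0F3A: return 3;
    default:       return 1;
    }
}

/* The two-byte form has no mmmmm, W, X, or B fields, so it is usable
   only for the 0F map, with W0, when X and B need not be encoded. */
static bool can_use_two_byte_vex(opcode_map m, bool w1,
                                 bool needs_x, bool needs_b) {
    return m == MAP_0F && !w1 && !needs_x && !needs_b;
}
```

For example, an opcode written as VEX.128.0F.W0 is a candidate for the C5H form, while any opcode using 0F38, 0F3A, or W1 is not.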


Chapter 6
Instruction Descriptions
This chapter defines all of the Intel® Xeon Phi™ Coprocessor Instruction Set Architecture vector instructions. Note:
Some instruction descriptions refer to SSS or S2 S1 S0, which are bits 6-4 of the MVEX prefix encoding.
See Table 2.14 for more details.

6.1 Vector Mask Instructions

JKNZD - Jump near if mask is not zero

Opcode                    Instruction       Description
VEX.NDS.128.0F.W0 85 id   jknzd k1, rel32   Jump near if mask is not zero.
VEX.NDS.128.W0 75 ib      jknzd k1, rel8    Jump near if mask is not zero.

Description
Checks the value of the source mask and, if not all mask bits are 0, performs a jump to
the target instruction specified by the destination operand. If the condition is not satisfied, the jump is not performed and execution continues with the instruction following
the JKNZD instruction.
The target instruction is specified with a relative offset (a signed offset relative to the
current value of the instruction pointer in the EIP register). A relative offset (rel8, rel16,
or rel32) is generally specified as a label in assembly code, but at the machine code level,
it is encoded as a signed, 8-bit or 32 bit immediate value, which is added to the instruction
pointer. Instruction coding is most efficient for offsets of -128 to +127. If the operand-size
attribute is 16, the upper two bytes of the EIP register are cleared, resulting in a maximum
instruction pointer size of 16 bits.
The instruction does not support far jumps (jumps to other code segments). When the
target for the conditional jump is in a different segment, use the opposite condition from
the condition being tested for the JKNZD instruction, and then access the target with an
unconditional far jump (JMP instruction) to the other segment. For example, the following
conditional far jump is illegal:
JKNZD FARLABEL;
To accomplish this far jump, use the following two instructions:
JKZD BEYOND;
JMP FARLABEL;
BEYOND:
This conditional jump is converted to a code fetch of one or two cache lines, regardless of
jump address or cacheability.
In 64 bit mode, operand size (OSIZE) is fixed at 64 bits. JMP Short is RIP = RIP + 8-bit
offset sign extended to 64 bits. JMP Near is RIP = RIP + 32 bit offset sign extended to 64
bits.


Operation
if (k1[15:0]!=0)
{
    tempEIP = EIP + SignExtend(DEST);
    if(OSIZE == 16)
    {
        tempEIP = tempEIP & 0000FFFFH;
    }
    if (*tempEIP is not within code segment limit*)
    {
        #GP(0);
    }
    else
    {
        EIP = tempEIP
    }
}
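
The branch decision above can be modelled in a few lines of C. This is a minimal sketch of the 64 bit mode case only, assuming (as for other relative jumps) that the offset is applied to the address of the instruction following the jump; the segment-limit and canonical-address checks are omitted, and the function names are hypothetical.

```c
#include <stdint.h>

/* next_rip: address of the instruction after the jump.
   rel: the rel8 or rel32 immediate, already sign-extended to 32 bits. */
static uint64_t jknzd_target(uint16_t k1, uint64_t next_rip, int32_t rel) {
    if (k1 != 0) {
        /* Taken: add the offset sign-extended to 64 bits. */
        return next_rip + (uint64_t)(int64_t)rel;
    }
    return next_rip; /* Not taken: fall through. */
}

/* JKZD uses the opposite condition. */
static uint64_t jkzd_target(uint16_t k1, uint64_t next_rip, int32_t rel) {
    if (k1 == 0) {
        return next_rip + (uint64_t)(int64_t)rel;
    }
    return next_rip;
}
```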

Flags Affected
None.

Intel® C/C++ Compiler Intrinsic Equivalent
None.

Exceptions
Real-Address Mode and Virtual-8086
#UD       Instruction not available in these modes

Protected and Compatibility Mode
#UD       Instruction not available in these modes

64 bit Mode
#GP(0)    If the memory address is in a non-canonical form.
#NM       If CR0.TS[bit 3]=1.
          If preceded by any REX, F0, F2, F3, or 66 prefixes.

JKZD - Jump near if mask is zero

Opcode                    Instruction      Description
VEX.NDS.128.0F.W0 84 id   jkzd k1, rel32   Jump near if mask is zero.
VEX.NDS.128.W0 74 ib      jkzd k1, rel8    Jump near if mask is zero.

Description
Checks the value of the source mask and, if all mask bits are 0, performs a jump to the
target instruction specified by the destination operand. If the condition is not satisfied,
the jump is not performed and execution continues with the instruction following the JKZD instruction.
The target instruction is specified with a relative offset (a signed offset relative to the
current value of the instruction pointer in the EIP register). A relative offset (rel8, rel16,
or rel32) is generally specified as a label in assembly code, but at the machine code level,
it is encoded as a signed, 8-bit or 32 bit immediate value, which is added to the instruction
pointer. Instruction coding is most efficient for offsets of -128 to +127. If the operand-size
attribute is 16, the upper two bytes of the EIP register are cleared, resulting in a maximum
instruction pointer size of 16 bits.
The instruction does not support far jumps (jumps to other code segments). When the
target for the conditional jump is in a different segment, use the opposite condition from
the condition being tested for the JKZD instruction, and then access the target with an
unconditional far jump (JMP instruction) to the other segment. For example, the following
conditional far jump is illegal:
JKZD FARLABEL;
To accomplish this far jump, use the following two instructions:
JKNZD BEYOND;
JMP FARLABEL;
BEYOND:
This conditional jump is converted to a code fetch of one or two cache lines, regardless of
jump address or cacheability.
In 64 bit mode, operand size (OSIZE) is fixed at 64 bits. JMP Short is RIP = RIP + 8-bit
offset sign extended to 64 bits. JMP Near is RIP = RIP + 32 bit offset sign extended to 64
bits.


Operation
if (k1[15:0]==0)
{
    tempEIP = EIP + SignExtend(DEST);
    if(OSIZE == 16)
    {
        tempEIP = tempEIP & 0000FFFFH;
    }
    if (*tempEIP is not within code segment limit*)
    {
        #GP(0);
    }
    else
    {
        EIP = tempEIP
    }
}

Flags Affected
None.

Intel® C/C++ Compiler Intrinsic Equivalent
None.

Exceptions
Real-Address Mode and Virtual-8086
#UD       Instruction not available in these modes

Protected and Compatibility Mode
#UD       Instruction not available in these modes

64 bit Mode
#GP(0)    If the memory address is in a non-canonical form.
#NM       If CR0.TS[bit 3]=1.
          If preceded by any REX, F0, F2, F3, or 66 prefixes.

KAND - AND Vector Mask

Opcode                Instruction   Description
VEX.128.0F.W0 41 /r   kand k1, k2   Perform a bitwise AND between vector masks k1 and k2 and store the result in vector mask k1.

Description
Performs a bitwise AND between vector mask k2 and vector mask k1, and writes
the result into vector mask k1.

Operation
for (n = 0; n < 16; n++) {
    k1[n] = k1[n] & k2[n]
}

Flags Affected
None.

Intel® C/C++ Compiler Intrinsic Equivalent
__mmask16 _mm512_kand (__mmask16, __mmask16);

Exceptions

Real-Address Mode and Virtual-8086
#UD       Instruction not available in these modes

Protected and Compatibility Mode
#UD       Instruction not available in these modes

64 bit Mode
#NM       If CR0.TS[bit 3]=1.
          If preceded by any REX, F0, F2, F3, or 66 prefixes.

KANDN - AND NOT Vector Mask

Opcode                Instruction    Description
VEX.128.0F.W0 42 /r   kandn k1, k2   Perform a bitwise AND between NOT (vector mask k1) and vector mask k2 and store the result in vector mask k1.

Description
Performs a bitwise AND between vector mask k2, and the NOT (bitwise logical negation)
of vector mask k1, and writes the result into vector mask k1.

Operation
for (n = 0; n < 16; n++) {
    k1[n] = (~(k1[n])) & k2[n]
}

Flags Affected
None.

Intel® C/C++ Compiler Intrinsic Equivalent
__mmask16 _mm512_kandn (__mmask16, __mmask16);

Exceptions

Real-Address Mode and Virtual-8086
#UD       Instruction not available in these modes

Protected and Compatibility Mode
#UD       Instruction not available in these modes

64 bit Mode
#NM       If CR0.TS[bit 3]=1.
          If preceded by any REX, F0, F2, F3, or 66 prefixes.

KANDNR - Reverse AND NOT Vector Mask

Opcode                Instruction     Description
VEX.128.0F.W0 43 /r   kandnr k1, k2   Perform a bitwise AND between NOT (vector mask k2) and vector mask k1 and store the result in vector mask k1.

Description
Performs a bitwise AND between the NOT (bitwise logical negation) of vector mask k2,
and the vector mask k1, and writes the result into vector mask k1.

Operation
for (n = 0; n < 16; n++) {
    k1[n] = ~(k2[n]) & k1[n]
}
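
KAND, KANDN, and KANDNR differ only in which operand, if any, is complemented. The following C sketch models the three operations on 16-bit values so the difference is easy to compare; the function names and the `mask16` type are hypothetical stand-ins for the mask registers, not the compiler intrinsics.

```c
#include <stdint.h>

typedef uint16_t mask16; /* hypothetical model of a 16-bit vector mask */

/* Each function returns the new value of k1. */
static mask16 kand_model(mask16 k1, mask16 k2) {
    return (mask16)(k1 & k2);
}

static mask16 kandn_model(mask16 k1, mask16 k2) {
    return (mask16)(~k1 & k2);  /* NOT is applied to the destination k1 */
}

static mask16 kandnr_model(mask16 k1, mask16 k2) {
    return (mask16)(~k2 & k1);  /* NOT is applied to the source k2 */
}
```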

Flags Affected
None.

Intel® C/C++ Compiler Intrinsic Equivalent
__mmask16 _mm512_kandnr (__mmask16, __mmask16);

Exceptions

Real-Address Mode and Virtual-8086
#UD       Instruction not available in these modes

Protected and Compatibility Mode
#UD       Instruction not available in these modes

64 bit Mode
#NM       If CR0.TS[bit 3]=1.
          If preceded by any REX, F0, F2, F3, or 66 prefixes.

KCONCATH - Pack and Move High Vector Mask

Opcode                    Instruction            Description
VEX.NDS.128.0F.W0 95 /r   kconcath r64, k1, k2   Concatenate vector masks k1 and k2 into the high part of register r64.

Description
Packs vector masks k1 and k2 and moves the result to the high 32 bits of destination register r64. The rest of the destination register is zeroed.

Operation
TMP[15:0] = k2[15:0]
TMP[31:16] = k1[15:0]
r64[31:0] = 0
r64[63:32] = TMP

Flags Affected
None.

Intel® C/C++ Compiler Intrinsic Equivalent
__int64 _mm512_kconcathi_64(__mmask16, __mmask16);

Exceptions
Real-Address Mode and Virtual-8086
#UD       Instruction not available in these modes

Protected and Compatibility Mode
#UD       Instruction not available in these modes

64 bit Mode
#NM       If CR0.TS[bit 3]=1.
          If preceded by any REX, F0, F2, F3, or 66 prefixes.
          If destination is a memory operand.

KCONCATL - Pack and Move Low Vector Mask

Opcode                    Instruction            Description
VEX.NDS.128.0F.W0 97 /r   kconcatl r64, k1, k2   Concatenate vector masks k1 and k2 into the low part of register r64.

Description
Packs vector masks k1 and k2 and moves the result to the low 32 bits of destination register r64. The rest of the destination register is zeroed.

Operation
TMP[15:0] = k2[15:0]
TMP[31:16] = k1[15:0]
r64[31:0] = TMP
r64[63:32] = 0
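
The packing performed by KCONCATL, and by KCONCATH for comparison, can be sketched in C as follows. Both build the same 32-bit value with k1 in the upper 16 bits and k2 in the lower 16 bits; they differ only in which half of the 64-bit destination receives it. The function names are hypothetical.

```c
#include <stdint.h>

/* Returns the value written to r64 by KCONCATL. */
static uint64_t kconcatl_model(uint16_t k1, uint16_t k2) {
    uint32_t tmp = ((uint32_t)k1 << 16) | k2;
    return (uint64_t)tmp;         /* bits 63:32 are zero */
}

/* Returns the value written to r64 by KCONCATH. */
static uint64_t kconcath_model(uint16_t k1, uint16_t k2) {
    uint32_t tmp = ((uint32_t)k1 << 16) | k2;
    return (uint64_t)tmp << 32;   /* bits 31:0 are zero */
}
```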

Flags Affected
None.

Intel® C/C++ Compiler Intrinsic Equivalent
__int64 _mm512_kconcatlo_64(__mmask16, __mmask16);

Exceptions
Real-Address Mode and Virtual-8086
#UD       Instruction not available in these modes

Protected and Compatibility Mode
#UD       Instruction not available in these modes

64 bit Mode
#NM       If CR0.TS[bit 3]=1.
          If preceded by any REX, F0, F2, F3, or 66 prefixes.
          If destination is a memory operand.

KEXTRACT - Extract Vector Mask From Register

Opcode                        Instruction              Description
VEX.128.66.0F3A.W0 3E /r ib   kextract k1, r64, imm8   Extract field from general purpose register r64 into vector mask k1 using imm8.

Description
Extract the 16-bit field selected by imm8[1:0] from general purpose register r64 and write
the result into destination mask register k1.

Operation
index = imm8[1:0] * 16
k1[15:0] = r64[index+15:index]
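
A minimal C model of the field selection, assuming only the two low bits of imm8 take part as the Operation states; the function name is hypothetical.

```c
#include <stdint.h>

/* Returns the 16-bit field of r64 selected by imm8[1:0]. */
static uint16_t kextract_model(uint64_t r64, uint8_t imm8) {
    unsigned index = (imm8 & 3u) * 16u;  /* 0, 16, 32, or 48 */
    return (uint16_t)(r64 >> index);
}
```

With this model, selector 0 returns the field that KCONCATL places k2 in, and selector 3 returns the field that KCONCATH places k1 in.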

Flags Affected
None.

Intel® C/C++ Compiler Intrinsic Equivalent
__mmask16 _mm512_kextract_64(__int64, const int);

Exceptions

Real-Address Mode and Virtual-8086
#UD       Instruction not available in these modes

Protected and Compatibility Mode
#UD       Instruction not available in these modes

64 bit Mode
#NM       If CR0.TS[bit 3]=1.
          If preceded by any REX, F0, F2, F3, or 66 prefixes.
          If source is a memory operand.

KMERGE2L1H - Swap and Merge High Element Portion and Low Portion of Vector Masks

Opcode                Instruction         Description
VEX.128.0F.W0 48 /r   kmerge2l1h k1, k2   Concatenate the low half of vector mask k2 and the high half of vector mask k1 and store the result in the vector mask k1.

Description
Move high element from vector mask register k1 into low element of vector mask register
k1, and insert low element of k2 into the high portion of vector mask register k1.

Operation
tmp = k1[15:8]
k1[15:8] = k2[7:0]
k1[7:0] = tmp

Flags Affected
None.

Intel® C/C++ Compiler Intrinsic Equivalent
__mmask16 _mm512_kmerge2l1h (__mmask16, __mmask16 k2);

Exceptions
Real-Address Mode and Virtual-8086
#UD       Instruction not available in these modes

Protected and Compatibility Mode
#UD       Instruction not available in these modes

64 bit Mode
#NM       If CR0.TS[bit 3]=1.
          If preceded by any REX, F0, F2, F3, or 66 prefixes.

KMERGE2L1L - Move Low Element Portion into High Portion of Vector Mask

Opcode                Instruction         Description
VEX.128.0F.W0 49 /r   kmerge2l1l k1, k2   Move low half of vector mask k2 into the high half of vector mask k1.

Description
Insert low element from vector mask register k2 into high element of vector mask register
k1. Low element of k1 remains unchanged.

Operation
k1[15:8] = k2[7:0]
*k1[7:0] remains unchanged*
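
The two merge instructions can be compared with a short C sketch. Both place the low byte of k2 in the high byte of the result; KMERGE2L1H additionally moves the old high byte of k1 down, while KMERGE2L1L leaves the low byte of k1 in place. The function names are hypothetical.

```c
#include <stdint.h>

/* KMERGE2L1H: result[15:8] = k2[7:0], result[7:0] = k1[15:8]. */
static uint16_t kmerge2l1h_model(uint16_t k1, uint16_t k2) {
    return (uint16_t)(((k2 & 0xFFu) << 8) | (k1 >> 8));
}

/* KMERGE2L1L: result[15:8] = k2[7:0], result[7:0] = k1[7:0]. */
static uint16_t kmerge2l1l_model(uint16_t k1, uint16_t k2) {
    return (uint16_t)(((k2 & 0xFFu) << 8) | (k1 & 0xFFu));
}
```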

Flags Affected
None.

Intel® C/C++ Compiler Intrinsic Equivalent
__mmask16 _mm512_kmerge2l1l (__mmask16, __mmask16 k2);

Exceptions
Real-Address Mode and Virtual-8086
#UD       Instruction not available in these modes

Protected and Compatibility Mode
#UD       Instruction not available in these modes

64 bit Mode
#NM       If CR0.TS[bit 3]=1.
          If preceded by any REX, F0, F2, F3, or 66 prefixes.

KMOV - Move Vector Mask

Opcode                Instruction    Description
VEX.128.0F.W0 90 /r   kmov k1, k2    Move vector mask k2 and store the result in k1.
VEX.128.0F.W0 93 /r   kmov r32, k2   Move vector mask k2 to general purpose register r32.
VEX.128.0F.W0 92 /r   kmov k1, r32   Move general purpose register r32 to vector mask k1.

Description
Either the vector mask register k2 or the general purpose register r32 is read, and its
contents written into destination general purpose register r32 or vector mask register k1;
however, general purpose register to general purpose register copies are not supported.
When the destination is a general purpose register, the 16 bit value that is copied is zero-extended to the maximum operand size in the current mode.

Operation
if(DEST is a general purpose register) {
    DEST[63:16] = 0
    DEST[15:0] = k2[15:0]
} else if(DEST is vector mask and SRC is a general purpose register) {
    k1[15:0] = SRC[15:0]
} else {
    k1[15:0] = k2[15:0]
}

Flags Affected
None.

Intel® C/C++ Compiler Intrinsic Equivalent
__mmask16 _mm512_kmov (__mmask16);
__mmask16 _mm512_int2mask (int);
int _mm512_mask2int (__mmask16);

Exceptions
Real-Address Mode and Virtual-8086
#UD       Instruction not available in these modes

Protected and Compatibility Mode
#UD       Instruction not available in these modes

64 bit Mode
#NM       If CR0.TS[bit 3]=1.
          If preceded by any REX, F0, F2, F3, or 66 prefixes.
          If source/destination is a memory operand.

KNOT - Not Vector Mask

Opcode                Instruction   Description
VEX.128.0F.W0 44 /r   knot k1, k2   Perform a bitwise NOT on vector mask k2 and store the result in k1.

Description
Performs the bitwise NOT of the vector mask k2, and writes the result into vector mask
k1.

Operation
for (n = 0; n < 16; n++) {
    k1[n] = ~ k2[n]
}

Flags Affected
None.

Intel® C/C++ Compiler Intrinsic Equivalent
__mmask16 _mm512_knot(__mmask16);

Exceptions

Real-Address Mode and Virtual-8086
#UD       Instruction not available in these modes

Protected and Compatibility Mode
#UD       Instruction not available in these modes

64 bit Mode
#NM       If CR0.TS[bit 3]=1.
          If preceded by any REX, F0, F2, F3, or 66 prefixes.

KOR - OR Vector Masks

Opcode                Instruction   Description
VEX.128.0F.W0 45 /r   kor k1, k2    Perform a bitwise OR between vector masks k1 and k2 and store the result in vector mask k1.

Description
Performs a bitwise OR between the vector mask k2, and the vector mask k1, and writes
the result into vector mask k1.

Operation
for (n = 0; n < 16; n++) {
    k1[n] = k1[n] | k2[n]
}

Flags Affected
None.

Intel® C/C++ Compiler Intrinsic Equivalent
__mmask16 _mm512_kor(__mmask16, __mmask16);

Exceptions

Real-Address Mode and Virtual-8086
#UD       Instruction not available in these modes

Protected and Compatibility Mode
#UD       Instruction not available in these modes

64 bit Mode
#NM       If CR0.TS[bit 3]=1.
          If preceded by any REX, F0, F2, F3, or 66 prefixes.

KORTEST - OR Vector Mask And Set EFLAGS

Opcode                Instruction      Description
VEX.128.0F.W0 98 /r   kortest k1, k2   Perform a bitwise OR between vector masks k1 and k2 and update ZF and CF EFLAGS accordingly.

Description
Performs a bitwise OR between the vector mask register k2 and the vector mask register
k1, and sets CF and ZF based on the operation result.
The ZF flag is set if both sources are 0x0. The CF flag is set if, after the OR operation is done, the operation result is all 1's.

Operation
CF = 1
ZF = 1
for (n = 0; n < 16; n++) {
    tmp = (k1[n] | k2[n])
    ZF &= (tmp == 0x0)
    CF &= (tmp == 0x1)
}
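
The per-bit loop above is equivalent to two comparisons on the 16-bit OR result, as the following C sketch shows. Only ZF and CF are modelled; the struct and function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool zf; /* set when (k1 | k2) is all zeros */
    bool cf; /* set when (k1 | k2) is all ones  */
} kortest_flags;

static kortest_flags kortest_model(uint16_t k1, uint16_t k2) {
    uint16_t tmp = (uint16_t)(k1 | k2);
    kortest_flags f;
    f.zf = (tmp == 0x0000);
    f.cf = (tmp == 0xFFFF);
    return f;
}
```

A result that is neither all zeros nor all ones leaves both flags clear, which is how a partially set mask can be distinguished from the two extreme cases.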

Flags Affected
• The ZF flag is set if the result of OR-ing both sources is all 0s.
• The CF flag is set if the result of OR-ing both sources is all 1s.
• The OF, SF, AF, and PF flags are set to 0.

Intel® C/C++ Compiler Intrinsic Equivalent
int _mm512_kortestz (__mmask16, __mmask16);
int _mm512_kortestc (__mmask16, __mmask16);

Exceptions
Real-Address Mode and Virtual-8086
#UD       Instruction not available in these modes

Protected and Compatibility Mode
#UD       Instruction not available in these modes

64 bit Mode
#NM       If CR0.TS[bit 3]=1.
          If preceded by any REX, F0, F2, F3, or 66 prefixes.

KXNOR - XNOR Vector Masks

Opcode                Instruction    Description
VEX.128.0F.W0 46 /r   kxnor k1, k2   Perform a bitwise XNOR between vector masks k1 and k2 and store the result in vector mask k1.

Description
Performs a bitwise XNOR between the vector mask k1 and the vector mask k2, and the
result is written into vector mask k1.
The primary purpose of this instruction is to provide a way to set a vector mask register
to 0xFFFF in a single clock; this is accomplished by selecting the source and destination to
be the same mask register. In this case the result will be 0xFFFF regardless of the original
contents of the register.

Operation
for (n = 0; n < 16; n++) {
    k1[n] = ~(k1[n] ^ k2[n])
}
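
The all-ones idiom described above follows directly from the definition: any value XOR-ed with itself is zero, and the complement of zero is 0xFFFF. A minimal C model, with a hypothetical function name:

```c
#include <stdint.h>

/* Returns the new value of k1 after KXNOR k1, k2. */
static uint16_t kxnor_model(uint16_t k1, uint16_t k2) {
    return (uint16_t)~(k1 ^ k2);
}
```

Calling the model with the same value for both operands yields 0xFFFF whatever that value is.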

Flags Affected
None.

Intel® C/C++ Compiler Intrinsic Equivalent
__mmask16 _mm512_kxnor (__mmask16, __mmask16);

Exceptions
Real-Address Mode and Virtual-8086
#UD       Instruction not available in these modes

Protected and Compatibility Mode
#UD       Instruction not available in these modes

64 bit Mode
#NM       If CR0.TS[bit 3]=1.
          If preceded by any REX, F0, F2, F3, or 66 prefixes.

KXOR - XOR Vector Masks

Opcode                Instruction   Description
VEX.128.0F.W0 47 /r   kxor k1, k2   Perform a bitwise XOR between vector masks k1 and k2 and store the result in vector mask k1.

Description
Performs a bitwise XOR between the vector mask k2, and the vector mask k1, and writes
the result into vector mask k1.

Operation
for (n = 0; n < 16; n++) {
    k1[n] = k1[n] ^ k2[n]
}

Flags Affected
None.

Intel® C/C++ Compiler Intrinsic Equivalent
__mmask16 _mm512_kxor (__mmask16, __mmask16);

Exceptions

Real-Address Mode and Virtual-8086
#UD       Instruction not available in these modes

Protected and Compatibility Mode
#UD       Instruction not available in these modes

64 bit Mode
#NM       If CR0.TS[bit 3]=1.
          If preceded by any REX, F0, F2, F3, or 66 prefixes.

6.2 Vector Instructions

VADDNPD - Add and Negate Float64 Vectors

Opcode                          Instruction                              Description
MVEX.NDS.512.66.0F38.W1 50 /r   vaddnpd zmm1 {k1}, zmm2, Sf64(zmm3/mt)   Add float64 vector zmm2 and float64 vector Sf64(zmm3/mt), negate the sum, and store the result in zmm1, under write-mask.

Description
Performs an element-by-element addition between float64 vector zmm2 and the float64
vector result of the swizzle/broadcast/conversion process on memory or float64 vector
zmm3, then negates the result. The final result is written into float64 vector zmm1.
Note that all the operations must be performed before rounding.

x     y     RN/RU/RZ            RD
+0    +0    (-0) + (-0) = -0    (-0) + (-0) = -0
+0    -0    (-0) + (+0) = +0    (-0) + (+0) = -0
-0    +0    (+0) + (-0) = +0    (+0) + (-0) = -0
-0    -0    (+0) + (+0) = +0    (+0) + (+0) = +0

Table 6.1: VADDN outcome when adding zeros depending on rounding-mode. See Signed Zeros in Section 4.6.1.3
for other cases with a result of zero.
This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.

Operation
if(source is a register operand and MVEX.EH bit is 1) {
    if(SSS[2]==1) Supress_Exception_Flags() // SAE
    // SSS are bits 6-4 from the MVEX prefix encoding. For more details, see Table 2.14
    RoundingMode = SSS[1:0]
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    RoundingMode = MXCSR.RC
    tmpSrc3[511:0] = SwizzUpConvLoadf64(zmm3/mt)
}
for (n = 0; n < 8; n++) {
    if(k1[n] != 0) {
        i = 64*n
        // float64 operation
        zmm1[i+63:i] = (-zmm2[i+63:i]) + (-tmpSrc3[i+63:i])
    }
}
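
The write-masked loop above can be sketched as a scalar C model. This is an illustration only: it assumes the host's default rounding mode, does not model swizzle, up-conversion, rounding override, or SAE, and uses a hypothetical function name. It follows the pseudocode in negating both operands before adding them.

```c
#include <stdint.h>

/* dst, a, b model zmm1, zmm2 and the already-converted third source.
   k1 holds one write-mask bit per float64 element. */
static void vaddnpd_model(double dst[8], uint8_t k1,
                          const double a[8], const double b[8]) {
    for (int n = 0; n < 8; n++) {
        if ((k1 >> n) & 1) {
            /* Negate both operands, then add, as in the Operation. */
            dst[n] = (-a[n]) + (-b[n]);
        }
        /* Elements whose mask bit is clear keep their previous value. */
    }
}
```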

SIMD Floating-Point Exceptions
Overflow, Underflow, Invalid, Precision, Denormal.

Denormal Handling
Treat Input Denormals As Zeros : (MXCSR.DAZ)? YES : NO
Flush Tiny Results To Zero : (MXCSR.FZ)? YES : NO

Intel® C/C++ Compiler Intrinsic Equivalent
__m512d _mm512_addn_pd(__m512d, __m512d);
__m512d _mm512_mask_addn_pd(__m512d, __mmask8, __m512d, __m512d);

Memory Up-conversion: Sf64
S2S1S0   Function:                   Usage                   disp8*N
000      no conversion               [rax] {8to8} or [rax]   64
001      broadcast 1 element (x8)    [rax] {1to8}            8
010      broadcast 4 elements (x2)   [rax] {4to8}            32
011      reserved                    N/A                     N/A
100      reserved                    N/A                     N/A
101      reserved                    N/A                     N/A
110      reserved                    N/A                     N/A
111      reserved                    N/A                     N/A
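
The three defined Sf64 memory modes can be modelled in C as below. The sketch is illustrative, assumes an 8-byte `double`, and uses a hypothetical function name; its return value is the number of bytes read, which corresponds to the disp8*N column of the table.

```c
#include <stddef.h>

/* sss: S2S1S0 value. Fills out[] with the 8 float64 elements seen by the
   instruction. Returns bytes read, or -1 for a reserved encoding. */
static int upconv_load_f64(double out[8], const double *mem, unsigned sss) {
    size_t count; /* number of distinct elements read from memory */
    switch (sss) {
    case 0: count = 8; break;  /* {8to8}: no conversion          */
    case 1: count = 1; break;  /* {1to8}: broadcast 1 element    */
    case 2: count = 4; break;  /* {4to8}: broadcast 4 elements   */
    default: return -1;        /* 011 to 111: reserved           */
    }
    for (size_t i = 0; i < 8; i++) {
        out[i] = mem[i % count];
    }
    return (int)(count * sizeof(double));
}
```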


Register Swizzle: Sf64
MVEX.EH=0
S2S1S0   Function: 4 x 64 bits    Usage
000      no swizzle               zmm0 or zmm0 {dcba}
001      swap (inner) pairs       zmm0 {cdab}
010      swap with two-away       zmm0 {badc}
011      cross-product swizzle    zmm0 {dacb}
100      broadcast a element      zmm0 {aaaa}
101      broadcast b element      zmm0 {bbbb}
110      broadcast c element      zmm0 {cccc}
111      broadcast d element      zmm0 {dddd}

MVEX.EH=1
S2S1S0   Rounding Mode Override             Usage
000      Round To Nearest (even)            , {rn}
001      Round Down (-INF)                  , {rd}
010      Round Up (+INF)                    , {ru}
011      Round Toward Zero                  , {rz}
100      Round To Nearest (even) with SAE   , {rn-sae}
101      Round Down (-INF) with SAE         , {rd-sae}
110      Round Up (+INF) with SAE           , {ru-sae}
111      Round Toward Zero with SAE         , {rz-sae}
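
The MVEX.EH=0 swizzle patterns can be read as a permutation applied to each group of four elements. The C sketch below models one such group. It assumes that the usage string lists elements from most significant to least significant, so that `a` is element 0 and `{dcba}` is the identity; that reading, and the function name, are assumptions made for illustration.

```c
/* in[] and out[] hold one 4-element group, with a = index 0 and d = index 3.
   sss is the S2S1S0 value from the MVEX.EH=0 table. */
static void swizzle_group(double out[4], const double in[4], unsigned sss) {
    /* idx[sss][i] is the source element for destination element i. */
    static const int idx[8][4] = {
        {0, 1, 2, 3},  /* 000 {dcba}: no swizzle            */
        {1, 0, 3, 2},  /* 001 {cdab}: swap (inner) pairs    */
        {2, 3, 0, 1},  /* 010 {badc}: swap with two-away    */
        {1, 2, 0, 3},  /* 011 {dacb}: cross-product swizzle */
        {0, 0, 0, 0},  /* 100 {aaaa}: broadcast a element   */
        {1, 1, 1, 1},  /* 101 {bbbb}: broadcast b element   */
        {2, 2, 2, 2},  /* 110 {cccc}: broadcast c element   */
        {3, 3, 3, 3}   /* 111 {dddd}: broadcast d element   */
    };
    for (int i = 0; i < 4; i++) {
        out[i] = in[idx[sss & 7u][i]];
    }
}
```

For a full float64 register the same permutation would be applied to each of its two 4 x 64 bit groups.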

Exceptions
Real-Address Mode and Virtual-8086
#UD               Instruction not available in these modes

Protected and Compatibility Mode
#UD               Instruction not available in these modes

64 bit Mode
#SS(0)            If a memory address referencing the SS segment is in a non-canonical form.
#GP(0)            If the memory address is in a non-canonical form.
                  If a memory operand linear address is not aligned to the data size granularity dictated by SwizzUpConv mode.
#PF(fault-code)   For a page fault.
#NM               If CR0.TS[bit 3]=1.
                  If preceded by any REX, F0, F2, F3, or 66 prefixes.

VADDNPS - Add and Negate Float32 Vectors

Opcode                          Instruction                              Description
MVEX.NDS.512.66.0F38.W0 50 /r   vaddnps zmm1 {k1}, zmm2, Sf32(zmm3/mt)   Add float32 vector zmm2 and float32 vector Sf32(zmm3/mt), negate the sum, and store the result in zmm1, under write-mask.

Description
Performs an element-by-element addition between float32 vector zmm2 and the float32
vector result of the swizzle/broadcast/conversion process on memory or float32 vector
zmm3, then negates the result. The final result is written into float32 vector zmm1.
Note that all the operations must be performed before rounding.

x     y     RN/RU/RZ            RD
+0    +0    (-0) + (-0) = -0    (-0) + (-0) = -0
+0    -0    (-0) + (+0) = +0    (-0) + (+0) = -0
-0    +0    (+0) + (-0) = +0    (+0) + (-0) = -0
-0    -0    (+0) + (+0) = +0    (+0) + (+0) = +0

Table 6.2: VADDN outcome when adding zeros depending on rounding-mode. See Signed Zeros in Section 4.6.1.3
for other cases with a result of zero.
This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.

Operation
if(source is a register operand and MVEX.EH bit is 1) {
    if(SSS[2]==1) Supress_Exception_Flags() // SAE
    // SSS are bits 6-4 from the MVEX prefix encoding. For more details, see Table 2.14
    RoundingMode = SSS[1:0]
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    RoundingMode = MXCSR.RC
    tmpSrc3[511:0] = SwizzUpConvLoadf32(zmm3/mt)
}
for (n = 0; n < 16; n++) {
    if(k1[n] != 0) {
        i = 32*n
        // float32 operation
        zmm1[i+31:i] = (-zmm2[i+31:i]) + (-tmpSrc3[i+31:i])
    }
}

SIMD Floating-Point Exceptions
Overflow, Underflow, Invalid, Precision, Denormal.

Denormal Handling
Treat Input Denormals As Zeros : (MXCSR.DAZ)? YES : NO
Flush Tiny Results To Zero : (MXCSR.FZ)? YES : NO

Intel® C/C++ Compiler Intrinsic Equivalent
__m512 _mm512_addn_ps (__m512, __m512);
__m512 _mm512_mask_addn_ps (__m512, __mmask16, __m512, __m512);

Memory Up-conversion: Sf32
S2 S1 S0  Function:                  Usage                    disp8*N
000       no conversion              [rax] {16to16} or [rax]  64
001       broadcast 1 element (x16)  [rax] {1to16}            4
010       broadcast 4 elements (x4)  [rax] {4to16}            16
011       float16 to float32         [rax] {float16}          32
100       uint8 to float32           [rax] {uint8}            16
110       uint16 to float32          [rax] {uint16}           32
111       sint16 to float32          [rax] {sint16}           32
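
The disp8*N column is consistent with N being the number of bytes the memory operand reads: elements fetched times stored element size. The sketch below encodes that reading. It is an assumption drawn from the table values, not a statement of the encoder's rules, and the names SF32_MEM, bytes_read and effective_displacement are hypothetical.

```python
# S2S1S0 -> (usage tag, elements read from memory, bytes per stored element)
SF32_MEM = {
    0b000: ("16to16", 16, 4),
    0b001: ("1to16", 1, 4),
    0b010: ("4to16", 4, 4),
    0b011: ("float16", 16, 2),
    0b100: ("uint8", 16, 1),
    0b110: ("uint16", 16, 2),
    0b111: ("sint16", 16, 2),
}

def bytes_read(sss):
    """Memory footprint of the operand for a given S2S1S0 value."""
    _, count, size = SF32_MEM[sss]
    return count * size

def effective_displacement(sss, disp8):
    """Scale a signed 8-bit displacement by N (assumed meaning of disp8*N)."""
    if not -128 <= disp8 <= 127:
        raise ValueError("disp8 must fit in a signed byte")
    return disp8 * bytes_read(sss)
```

For example, under this reading a disp8 of 3 with {1to16} addresses byte offset 12, and a disp8 of -1 with no conversion addresses byte offset -64.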


Register Swizzle: Sf32
MVEX.EH=0
S2 S1 S0  Function: 4 x 32 bits  Usage
000       no swizzle             zmm0 or zmm0 {dcba}
001       swap (inner) pairs     zmm0 {cdab}
010       swap with two-away     zmm0 {badc}
011       cross-product swizzle  zmm0 {dacb}
100       broadcast a element    zmm0 {aaaa}
101       broadcast b element    zmm0 {bbbb}
110       broadcast c element    zmm0 {cccc}
111       broadcast d element    zmm0 {dddd}
MVEX.EH=1
S2 S1 S0  Rounding Mode Override            Usage
000       Round To Nearest (even)           , {rn}
001       Round Down (-INF)                 , {rd}
010       Round Up (+INF)                   , {ru}
011       Round Toward Zero                 , {rz}
100       Round To Nearest (even) with SAE  , {rn-sae}
101       Round Down (-INF) with SAE        , {rd-sae}
110       Round Up (+INF) with SAE          , {ru-sae}
111       Round Toward Zero with SAE        , {rz-sae}
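
The MVEX.EH=0 swizzle patterns can be modelled on a plain list. The sketch assumes the pattern letters are written most-significant first, with "a" naming element 0 of each 4-element group; that reading is an assumption here, as are the names SWIZZLES and swizzle.

```python
# S2S1S0 -> swizzle pattern, as listed in the Usage column
SWIZZLES = {0b000: "dcba", 0b001: "cdab", 0b010: "badc", 0b011: "dacb",
            0b100: "aaaa", 0b101: "bbbb", 0b110: "cccc", 0b111: "dddd"}

def swizzle(vec, sss):
    """Apply a register swizzle independently to each group of 4 elements."""
    pattern = SWIZZLES[sss]
    out = []
    for base in range(0, len(vec), 4):
        group = vec[base:base + 4]
        for pos in range(4):
            letter = pattern[3 - pos]  # pattern text runs from position 3 down to 0
            out.append(group[ord(letter) - ord("a")])
    return out
```

With vec = [0, 1, 2, 3, 4, 5, 6, 7], {cdab} gives [1, 0, 3, 2, 5, 4, 7, 6] and {badc} gives [2, 3, 0, 1, 6, 7, 4, 5], matching the "swap (inner) pairs" and "swap with two-away" descriptions.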

Exceptions
Real-Address Mode and Virtual-8086
#UD              Instruction not available in these modes

Protected and Compatibility Mode
#UD              Instruction not available in these modes

64 bit Mode
#SS(0)           If a memory address referencing the SS segment is in a non-canonical form.
#GP(0)           If the memory address is in a non-canonical form.
                 If a memory operand linear address is not aligned to the data size
                 granularity dictated by SwizzUpConv mode.
#PF(fault-code)  For a page fault.
#NM              If CR0.TS[bit 3]=1.
                 If preceded by any REX, F0, F2, F3, or 66 prefixes.

VADDPD - Add Float64 Vectors

Opcode       MVEX.NDS.512.66.0F.W1 58 /r
Instruction  vaddpd zmm1 {k1}, zmm2, Sf64(zmm3/mt)
Description  Add float64 vector zmm2 and float64 vector Sf64(zmm3/mt) and store the
             result in zmm1, under write-mask.

Description
Performs an element-by-element addition between float64 vector zmm2 and the float64
vector result of the swizzle/broadcast/conversion process on memory or float64 vector
zmm3. The result is written into float64 vector zmm1.
This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.

Operation
if(source is a register operand and MVEX.EH bit is 1) {
    if(SSS[2]==1) Supress_Exception_Flags() // SAE
    // SSS are bits 6-4 from the MVEX prefix encoding. For more details, see Table 2.14
    RoundingMode = SSS[1:0]
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    RoundingMode = MXCSR.RC
    tmpSrc3[511:0] = SwizzUpConvLoadf64(zmm3/mt)
}
for (n = 0; n < 8; n++) {
    if(k1[n] != 0) {
        i = 64*n
        // float64 operation
        zmm1[i+63:i] = zmm2[i+63:i] + tmpSrc3[i+63:i]
    }
}

SIMD Floating-Point Exceptions
Overflow, Underflow, Invalid, Precision, Denormal.

Denormal Handling
Treat Input Denormals As Zeros : (MXCSR.DAZ)? YES : NO
Flush Tiny Results To Zero : (MXCSR.FZ)? YES : NO

Intel® C/C++ Compiler Intrinsic Equivalent
__m512d _mm512_add_pd (__m512d, __m512d);
__m512d _mm512_mask_add_pd (__m512d, __mmask8, __m512d, __m512d);

Memory Up-conversion: Sf64
S2 S1 S0  Function:                  Usage                  disp8*N
000       no conversion              [rax] {8to8} or [rax]  64
001       broadcast 1 element (x8)   [rax] {1to8}           8
010       broadcast 4 elements (x2)  [rax] {4to8}           32
011       reserved                   N/A                    N/A
100       reserved                   N/A                    N/A
101       reserved                   N/A                    N/A
110       reserved                   N/A                    N/A
111       reserved                   N/A                    N/A

Register Swizzle: Sf64
MVEX.EH=0
S2 S1 S0  Function: 4 x 64 bits  Usage
000       no swizzle             zmm0 or zmm0 {dcba}
001       swap (inner) pairs     zmm0 {cdab}
010       swap with two-away     zmm0 {badc}
011       cross-product swizzle  zmm0 {dacb}
100       broadcast a element    zmm0 {aaaa}
101       broadcast b element    zmm0 {bbbb}
110       broadcast c element    zmm0 {cccc}
111       broadcast d element    zmm0 {dddd}
MVEX.EH=1
S2 S1 S0  Rounding Mode Override            Usage
000       Round To Nearest (even)           , {rn}
001       Round Down (-INF)                 , {rd}
010       Round Up (+INF)                   , {ru}
011       Round Toward Zero                 , {rz}
100       Round To Nearest (even) with SAE  , {rn-sae}
101       Round Down (-INF) with SAE        , {rd-sae}
110       Round Up (+INF) with SAE          , {ru-sae}
111       Round Toward Zero with SAE        , {rz-sae}

Exceptions
Real-Address Mode and Virtual-8086
#UD              Instruction not available in these modes

Protected and Compatibility Mode
#UD              Instruction not available in these modes

64 bit Mode
#SS(0)           If a memory address referencing the SS segment is in a non-canonical form.
#GP(0)           If the memory address is in a non-canonical form.
                 If a memory operand linear address is not aligned to the data size
                 granularity dictated by SwizzUpConv mode.
#PF(fault-code)  For a page fault.
#NM              If CR0.TS[bit 3]=1.
                 If preceded by any REX, F0, F2, F3, or 66 prefixes.

VADDPS - Add Float32 Vectors

Opcode       MVEX.NDS.512.0F.W0 58 /r
Instruction  vaddps zmm1 {k1}, zmm2, Sf32(zmm3/mt)
Description  Add float32 vector zmm2 and float32 vector Sf32(zmm3/mt) and store the
             result in zmm1, under write-mask.

Description
Performs an element-by-element addition between float32 vector zmm2 and the float32
vector result of the swizzle/broadcast/conversion process on memory or float32 vector
zmm3. The result is written into float32 vector zmm1.
This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.

Operation
if(source is a register operand and MVEX.EH bit is 1) {
    if(SSS[2]==1) Supress_Exception_Flags() // SAE
    // SSS are bits 6-4 from the MVEX prefix encoding. For more details, see Table 2.14
    RoundingMode = SSS[1:0]
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    RoundingMode = MXCSR.RC
    tmpSrc3[511:0] = SwizzUpConvLoadf32(zmm3/mt)
}
for (n = 0; n < 16; n++) {
    if(k1[n] != 0) {
        i = 32*n
        // float32 operation
        zmm1[i+31:i] = zmm2[i+31:i] + tmpSrc3[i+31:i]
    }
}
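
The write-masked loop above can be sketched as a scalar model. This is an illustration under stated assumptions: it handles only round-to-nearest, ignores swizzles, exception flags, and overflow, and the names to_f32 and vaddps are hypothetical helpers rather than part of any Intel API.

```python
import struct

def to_f32(x):
    """Round a Python float to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def vaddps(dst, k1, src2, src3):
    """Write-masked 16-lane add; lanes with a clear mask bit keep dst."""
    out = list(dst)
    for n in range(16):
        if (k1 >> n) & 1:
            out[n] = to_f32(to_f32(src2[n]) + to_f32(src3[n]))
    return out
```

Calling vaddps with k1 = 0x000F updates only lanes 0 to 3 and leaves lanes 4 to 15 of the destination untouched, which is the behaviour the write-mask paragraph describes.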

SIMD Floating-Point Exceptions
Overflow, Underflow, Invalid, Precision, Denormal.

Denormal Handling
Treat Input Denormals As Zeros : (MXCSR.DAZ)? YES : NO
Flush Tiny Results To Zero : (MXCSR.FZ)? YES : NO

Intel® C/C++ Compiler Intrinsic Equivalent
__m512 _mm512_add_ps (__m512, __m512);
__m512 _mm512_mask_add_ps (__m512, __mmask16, __m512, __m512);

Memory Up-conversion: Sf32
S2 S1 S0  Function:                  Usage                    disp8*N
000       no conversion              [rax] {16to16} or [rax]  64
001       broadcast 1 element (x16)  [rax] {1to16}            4
010       broadcast 4 elements (x4)  [rax] {4to16}            16
011       float16 to float32         [rax] {float16}          32
100       uint8 to float32           [rax] {uint8}            16
110       uint16 to float32          [rax] {uint16}           32
111       sint16 to float32          [rax] {sint16}           32

Register Swizzle: Sf32
MVEX.EH=0
S2 S1 S0  Function: 4 x 32 bits  Usage
000       no swizzle             zmm0 or zmm0 {dcba}
001       swap (inner) pairs     zmm0 {cdab}
010       swap with two-away     zmm0 {badc}
011       cross-product swizzle  zmm0 {dacb}
100       broadcast a element    zmm0 {aaaa}
101       broadcast b element    zmm0 {bbbb}
110       broadcast c element    zmm0 {cccc}
111       broadcast d element    zmm0 {dddd}
MVEX.EH=1
S2 S1 S0  Rounding Mode Override            Usage
000       Round To Nearest (even)           , {rn}
001       Round Down (-INF)                 , {rd}
010       Round Up (+INF)                   , {ru}
011       Round Toward Zero                 , {rz}
100       Round To Nearest (even) with SAE  , {rn-sae}
101       Round Down (-INF) with SAE        , {rd-sae}
110       Round Up (+INF) with SAE          , {ru-sae}
111       Round Toward Zero with SAE        , {rz-sae}
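
The MVEX.EH=1 encoding splits cleanly into two fields, as the Operation pseudocode indicates: SSS[2] selects SAE and SSS[1:0] selects the rounding mode. A minimal decoding sketch follows; the function names are hypothetical.

```python
# SSS[1:0] -> rounding-mode mnemonic, in table order
ROUNDING = {0b00: "rn", 0b01: "rd", 0b10: "ru", 0b11: "rz"}

def decode_rounding_override(sss):
    """Split a 3-bit SSS value into (rounding mode, suppress-all-exceptions)."""
    if not 0 <= sss <= 7:
        raise ValueError("SSS is a 3-bit field")
    sae = bool(sss & 0b100)          # SSS[2]
    mode = ROUNDING[sss & 0b011]     # SSS[1:0]
    return mode, sae

def usage(sss):
    """Build the assembler modifier shown in the Usage column."""
    mode, sae = decode_rounding_override(sss)
    return "{" + mode + ("-sae" if sae else "") + "}"
```

Iterating usage over 0 through 7 reproduces the Usage column of the table in order.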

Exceptions
Real-Address Mode and Virtual-8086
#UD              Instruction not available in these modes

Protected and Compatibility Mode
#UD              Instruction not available in these modes

64 bit Mode
#SS(0)           If a memory address referencing the SS segment is in a non-canonical form.
#GP(0)           If the memory address is in a non-canonical form.
                 If a memory operand linear address is not aligned to the data size
                 granularity dictated by SwizzUpConv mode.
#PF(fault-code)  For a page fault.
#NM              If CR0.TS[bit 3]=1.
                 If preceded by any REX, F0, F2, F3, or 66 prefixes.

VADDSETSPS - Add Float32 Vectors and Set Mask to Sign

Opcode       MVEX.NDS.512.66.0F38.W0 CC /r
Instruction  vaddsetsps zmm1 {k1}, zmm2, Sf32(zmm3/mt)
Description  Add float32 vector zmm2 and float32 vector Sf32(zmm3/mt) and store the sum
             in zmm1 and the sign from the sum in k1, under write-mask.

Description
Performs an element-by-element addition between float32 vector zmm2 and the float32
vector result of the swizzle/broadcast/conversion process on memory or float32 vector
zmm3. The result is written into float32 vector zmm1.
In addition, the sign of the result for the n-th element is written into the n-th bit of vector
mask k1.
It is the sign bit of the final result that gets copied to the destination, as opposed to the
result of comparison with zero.
This instruction is write-masked, so only those elements with the corresponding bit set in
vector mask register k1 are computed and stored into zmm1 and k1. Elements in zmm1
and k1 with the corresponding bit clear in k1 register retain their previous value.
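
The distinction between "sign bit" and "comparison with zero" shows up for a result of -0. The following sketch models the mask update with Python doubles standing in for float32 lanes; vaddsets is a hypothetical name and rounding-mode effects are ignored.

```python
import math

def vaddsets(dst, k1, src2, src3):
    """Masked add that also copies each result's sign bit into the mask."""
    out = list(dst)
    mask = k1
    for n in range(16):
        if (k1 >> n) & 1:
            out[n] = src2[n] + src3[n]
            if math.copysign(1.0, out[n]) < 0:   # sign bit set, includes -0.0
                mask |= 1 << n
            else:
                mask &= ~(1 << n)
    return out, mask
```

With lane 0 computing (-0.0) + (-0.0), the result is -0.0 and mask bit 0 stays set, even though a "less than zero" comparison on that result would be false.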

Operation
if(source is a register operand and MVEX.EH bit is 1) {
    if(SSS[2]==1) Supress_Exception_Flags() // SAE
    // SSS are bits 6-4 from the MVEX prefix encoding. For more details, see Table 2.14
    RoundingMode = SSS[1:0]
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    RoundingMode = MXCSR.RC
    tmpSrc3[511:0] = SwizzUpConvLoadf32(zmm3/mt)
}
for (n = 0; n < 16; n++) {
    if(k1[n] != 0) {
        i = 32*n
        // float32 operation
        zmm1[i+31:i] = zmm2[i+31:i] + tmpSrc3[i+31:i]
        k1[n] = zmm1[i+31]
    }
}

SIMD Floating-Point Exceptions
Overflow, Underflow, Invalid, Precision, Denormal.

Denormal Handling
Treat Input Denormals As Zeros : (MXCSR.DAZ)? YES : NO
Flush Tiny Results To Zero : (MXCSR.FZ)? YES : NO

Intel® C/C++ Compiler Intrinsic Equivalent
__m512 _mm512_addsets_ps (__m512, __m512, __mmask16*);
__m512 _mm512_mask_addsets_ps (__m512, __mmask16, __m512, __m512, __mmask16*);

Memory Up-conversion: Sf32
S2 S1 S0  Function:                  Usage                    disp8*N
000       no conversion              [rax] {16to16} or [rax]  64
001       broadcast 1 element (x16)  [rax] {1to16}            4
010       broadcast 4 elements (x4)  [rax] {4to16}            16
011       float16 to float32         [rax] {float16}          32
100       uint8 to float32           [rax] {uint8}            16
110       uint16 to float32          [rax] {uint16}           32
111       sint16 to float32          [rax] {sint16}           32

Register Swizzle: Sf32
MVEX.EH=0
S2 S1 S0  Function: 4 x 32 bits  Usage
000       no swizzle             zmm0 or zmm0 {dcba}
001       swap (inner) pairs     zmm0 {cdab}
010       swap with two-away     zmm0 {badc}
011       cross-product swizzle  zmm0 {dacb}
100       broadcast a element    zmm0 {aaaa}
101       broadcast b element    zmm0 {bbbb}
110       broadcast c element    zmm0 {cccc}
111       broadcast d element    zmm0 {dddd}
MVEX.EH=1
S2 S1 S0  Rounding Mode Override            Usage
000       Round To Nearest (even)           , {rn}
001       Round Down (-INF)                 , {rd}
010       Round Up (+INF)                   , {ru}
011       Round Toward Zero                 , {rz}
100       Round To Nearest (even) with SAE  , {rn-sae}
101       Round Down (-INF) with SAE        , {rd-sae}
110       Round Up (+INF) with SAE          , {ru-sae}
111       Round Toward Zero with SAE        , {rz-sae}

Exceptions
Real-Address Mode and Virtual-8086
#UD              Instruction not available in these modes

Protected and Compatibility Mode
#UD              Instruction not available in these modes

64 bit Mode
#SS(0)           If a memory address referencing the SS segment is in a non-canonical form.
#GP(0)           If the memory address is in a non-canonical form.
                 If a memory operand linear address is not aligned to the data size
                 granularity dictated by SwizzUpConv mode.
#PF(fault-code)  For a page fault.
#NM              If CR0.TS[bit 3]=1.
                 If preceded by any REX, F0, F2, F3, or 66 prefixes.
                 If no write mask is provided or selected write-mask is k0.

VALIGND - Align Doubleword Vectors

Opcode       MVEX.NDS.512.66.0F3A.W0 03 /r ib
Instruction  valignd zmm1 {k1}, zmm2, zmm3/mt, offset
Description  Shift right and merge vectors zmm2 and zmm3/mt with doubleword granularity
             using offset as number of elements to shift, and store the final result
             in zmm1, under write-mask.

Description
Concatenates and shifts right doubleword elements from vector zmm2 and memory/vector
zmm3. The result is written into vector zmm1.
No swizzle, broadcast, or conversion is performed by this instruction.
This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.

Operation
src[511:0] = zmm3/mt
// Concatenate sources
tmp[511:0] = src[511:0]
tmp[1023:512] = zmm2[511:0]
// Shift right doubleword elements
SHIFT = imm8[3:0]
tmp[1023:0] = tmp[1023:0] >> (32*SHIFT)
// Apply write-mask
for (n = 0; n < 16; n++) {
    if (k1[n] != 0) {
        i = 32*n
        zmm1[i+31:i] = tmp[i+31:i]
    }
}
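
The concatenate-and-shift step can be sketched with Python lists, one list entry per doubleword. The function name valignd and its argument order are illustrative only; the logic follows the pseudocode above, with the second source in the low half of the 1024-bit temporary.

```python
def valignd(dst, k1, src2, src3, imm8):
    """Model of the pseudocode: shift {src2:src3} right by imm8[3:0] doublewords."""
    shift = imm8 & 0xF                  # SHIFT = imm8[3:0]
    tmp = list(src3) + list(src2)       # src3 is the low half, src2 the high half
    tmp = tmp[shift:] + [0] * shift     # logical right shift by whole elements
    out = list(dst)
    for n in range(16):
        if (k1 >> n) & 1:               # write-mask: clear bits keep dst
            out[n] = tmp[n]
    return out
```

For src3 = 0..15, src2 = 100..115 and an offset of 3, the unmasked result is elements 3..15 of src3 followed by 100, 101, 102.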

Flags Affected
None.

Intel® C/C++ Compiler Intrinsic Equivalent
__m512i _mm512_alignr_epi32 (__m512i, __m512i, const int);
__m512i _mm512_mask_alignr_epi32 (__m512i, __mmask16, __m512i, __m512i, const int);

Exceptions
Real-Address Mode and Virtual-8086
#UD              Instruction not available in these modes

Protected and Compatibility Mode
#UD              Instruction not available in these modes

64 bit Mode
#SS(0)           If a memory address referencing the SS segment is in a non-canonical form.
#GP(0)           If the memory address is in a non-canonical form.
#PF(fault-code)  For a page fault.
#NM              If CR0.TS[bit 3]=1.
                 If preceded by any REX, F0, F2, F3, or 66 prefixes.
                 This instruction does not support any SwizzUpConv different from the
                 default value (no broadcast, no conversion). If SwizzUpConv function is
                 set to any value different than "no action", then an Invalid Opcode
                 fault is raised. This includes register swizzles.

VBLENDMPD - Blend Float64 Vectors using the Instruction Mask

Opcode       MVEX.NDS.512.66.0F38.W1 65 /r
Instruction  vblendmpd zmm1 {k1}, zmm2, Sf64(zmm3/mt)
Description  Blend float64 vector zmm2 and float64 vector Sf64(zmm3/mt) and store the
             result in zmm1, under write-mask.

Description
Performs an element-by-element blending between float64 vector zmm2 and the float64
vector result of the swizzle/broadcast/conversion process on memory or float64 vector
zmm3, using the instruction mask as selector. The result is written into float64 vector
zmm1.
The mask is not used as a write-mask for this instruction. Instead, the mask is used as an
element selector: every element of the destination is conditionally selected between first
source or second source using the value of the related mask bit (0 for first source, 1 for
second source).
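
The selector role of the mask can be shown in a few lines. This is a sketch with a hypothetical function name; the previous contents of the destination play no part, since every lane is overwritten from one source or the other.

```python
def vblendm(k1, src2, src3):
    """Per-lane select: mask bit 0 picks the first source, 1 picks the second."""
    return [src3[n] if (k1 >> n) & 1 else src2[n] for n in range(len(src2))]
```

With k1 = 0b10100101 on eight lanes, lanes 0, 2, 5 and 7 come from the second source and the remaining lanes from the first.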

Operation
if(source is a register operand and MVEX.EH bit is 1) {
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    tmpSrc3[511:0] = SwizzUpConvLoadf64(zmm3/mt)
}
for (n = 0; n < 8; n++) {
    i = 64*n
    if(k1[n]==1 or *no write-mask*) {
        zmm1[i+63:i] = tmpSrc3[i+63:i]
    } else {
        zmm1[i+63:i] = zmm2[i+63:i]
    }
}

SIMD Floating-Point Exceptions
None.

Denormal Handling
Treat Input Denormals As Zeros : NO
Flush Tiny Results To Zero : NO

Memory Up-conversion: Sf64
S2 S1 S0  Function:                  Usage                  disp8*N
000       no conversion              [rax] {8to8} or [rax]  64
001       broadcast 1 element (x8)   [rax] {1to8}           8
010       broadcast 4 elements (x2)  [rax] {4to8}           32
011       reserved                   N/A                    N/A
100       reserved                   N/A                    N/A
101       reserved                   N/A                    N/A
110       reserved                   N/A                    N/A
111       reserved                   N/A                    N/A

Register Swizzle: Sf64
MVEX.EH=0
S2 S1 S0  Function: 4 x 64 bits  Usage
000       no swizzle             zmm0 or zmm0 {dcba}
001       swap (inner) pairs     zmm0 {cdab}
010       swap with two-away     zmm0 {badc}
011       cross-product swizzle  zmm0 {dacb}
100       broadcast a element    zmm0 {aaaa}
101       broadcast b element    zmm0 {bbbb}
110       broadcast c element    zmm0 {cccc}
111       broadcast d element    zmm0 {dddd}

Intel® C/C++ Compiler Intrinsic Equivalent
__m512d _mm512_mask_blend_pd (__mmask8, __m512d, __m512d);

Exceptions
Real-Address Mode and Virtual-8086
#UD              Instruction not available in these modes

Protected and Compatibility Mode
#UD              Instruction not available in these modes

64 bit Mode
#SS(0)           If a memory address referencing the SS segment is in a non-canonical form.
#GP(0)           If the memory address is in a non-canonical form.
                 If a memory operand linear address is not aligned to the data size
                 granularity dictated by SwizzUpConv mode.
#PF(fault-code)  For a page fault.
#NM              If CR0.TS[bit 3]=1.
                 If preceded by any REX, F0, F2, F3, or 66 prefixes.

VBLENDMPS - Blend Float32 Vectors using the Instruction Mask

Opcode       MVEX.NDS.512.66.0F38.W0 65 /r
Instruction  vblendmps zmm1 {k1}, zmm2, Sf32(zmm3/mt)
Description  Blend float32 vector zmm2 and float32 vector Sf32(zmm3/mt) and store the
             result in zmm1, under write-mask.

Description
Performs an element-by-element blending between float32 vector zmm2 and the float32
vector result of the swizzle/broadcast/conversion process on memory or float32 vector
zmm3, using the instruction mask as selector. The result is written into float32 vector
zmm1.
The mask is not used as a write-mask for this instruction. Instead, the mask is used as an
element selector: every element of the destination is conditionally selected between first
source or second source using the value of the related mask bit (0 for first source, 1 for
second source).

Operation
if(source is a register operand and MVEX.EH bit is 1) {
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    tmpSrc3[511:0] = SwizzUpConvLoadf32(zmm3/mt)
}
for (n = 0; n < 16; n++) {
    i = 32*n
    if(k1[n]==1 or *no write-mask*) {
        zmm1[i+31:i] = tmpSrc3[i+31:i]
    } else {
        zmm1[i+31:i] = zmm2[i+31:i]
    }
}

SIMD Floating-Point Exceptions
Invalid.

Denormal Handling
Treat Input Denormals As Zeros : NO
Flush Tiny Results To Zero : NO

Memory Up-conversion: Sf32
S2 S1 S0  Function:                  Usage                    disp8*N
000       no conversion              [rax] {16to16} or [rax]  64
001       broadcast 1 element (x16)  [rax] {1to16}            4
010       broadcast 4 elements (x4)  [rax] {4to16}            16
011       float16 to float32         [rax] {float16}          32
100       uint8 to float32           [rax] {uint8}            16
110       uint16 to float32          [rax] {uint16}           32
111       sint16 to float32          [rax] {sint16}           32

Register Swizzle: Sf32
MVEX.EH=0
S2 S1 S0  Function: 4 x 32 bits  Usage
000       no swizzle             zmm0 or zmm0 {dcba}
001       swap (inner) pairs     zmm0 {cdab}
010       swap with two-away     zmm0 {badc}
011       cross-product swizzle  zmm0 {dacb}
100       broadcast a element    zmm0 {aaaa}
101       broadcast b element    zmm0 {bbbb}
110       broadcast c element    zmm0 {cccc}
111       broadcast d element    zmm0 {dddd}

Intel® C/C++ Compiler Intrinsic Equivalent
__m512 _mm512_mask_blend_ps (__mmask16, __m512, __m512);

Exceptions
Real-Address Mode and Virtual-8086
#UD              Instruction not available in these modes

Protected and Compatibility Mode
#UD              Instruction not available in these modes

64 bit Mode
#SS(0)           If a memory address referencing the SS segment is in a non-canonical form.
#GP(0)           If the memory address is in a non-canonical form.
                 If a memory operand linear address is not aligned to the data size
                 granularity dictated by SwizzUpConv mode.
#PF(fault-code)  For a page fault.
#NM              If CR0.TS[bit 3]=1.
                 If preceded by any REX, F0, F2, F3, or 66 prefixes.

VBROADCASTF32X4 - Broadcast 4xFloat32 Vector

Opcode       MVEX.512.66.0F38.W0 1A /r
Instruction  vbroadcastf32x4 zmm1 {k1}, Uf32(mt)
Description  Broadcast 4xfloat32 vector Uf32(mt) into vector zmm1, under write-mask.

Description
The 4, 8 or 16 bytes (depending on the conversion and broadcast in effect) at memory
address mt are broadcast and/or converted to a float32 vector. The result is written into
float32 vector zmm1.
This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.

Operation
// {4to16}
tmpSrc2[127:0] = UpConvLoadf32(mt)
for (n = 0; n < 16; n++) {
    if (k1[n] != 0) {
        i = 32*n
        j = i & 0x7F
        zmm1[i+31:i] = tmpSrc2[j+31:j]
    }
}
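
The index arithmetic j = i & 0x7F wraps each destination bit offset into the 128-bit source, so the four loaded elements repeat across the register. A minimal sketch on lists, with one entry per float32 element and a hypothetical function name:

```python
def vbroadcast_4to16(dst, k1, mem4):
    """Replicate 4 loaded elements across 16 lanes under a write-mask."""
    out = list(dst)
    for n in range(16):
        if (k1 >> n) & 1:
            i = 32 * n
            j = i & 0x7F            # bit offset wrapped into the 128-bit source
            out[n] = mem4[j // 32]  # element index within the 4-element source
    return out
```

With a full mask the result is the four source elements repeated four times; with k1 = 0x00F0 only lanes 4 to 7 are written.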

Flags Affected
Invalid.

Memory Up-conversion: Uf32
S2 S1 S0  Function:           Usage            disp8*N
000       no conversion       [rax]            16
001       reserved            N/A              N/A
010       reserved            N/A              N/A
011       float16 to float32  [rax] {float16}  8
100       uint8 to float32    [rax] {uint8}    4
101       sint8 to float32    [rax] {sint8}    4
110       uint16 to float32   [rax] {uint16}   8
111       sint16 to float32   [rax] {sint16}   8

Intel® C/C++ Compiler Intrinsic Equivalent
__m512 _mm512_extload_ps (void const*, _MM_UPCONV_PS_ENUM, _MM_BROADCAST32_ENUM, int);
__m512 _mm512_mask_extload_ps (__m512, __mmask16, void const*, _MM_UPCONV_PS_ENUM, _MM_BROADCAST32_ENUM, int);

Exceptions
Real-Address Mode and Virtual-8086
#UD              Instruction not available in these modes

Protected and Compatibility Mode
#UD              Instruction not available in these modes

64 bit Mode
#SS(0)           If a memory address referencing the SS segment is in a non-canonical form.
#GP(0)           If the memory address is in a non-canonical form.
                 If a memory operand linear address is not aligned to the data size
                 granularity dictated by SwizzUpConv mode.
#PF(fault-code)  For a page fault.
#NM              If CR0.TS[bit 3]=1.
                 If preceded by any REX, F0, F2, F3, or 66 prefixes.

VBROADCASTF64X4 - Broadcast 4xFloat64 Vector

Opcode       MVEX.512.66.0F38.W1 1B /r
Instruction  vbroadcastf64x4 zmm1 {k1}, Uf64(mt)
Description  Broadcast 4xfloat64 vector Uf64(mt) into vector zmm1, under write-mask.

Description
The 32 bytes at memory address mt are broadcast to a float64 vector. The result is written
into float64 vector zmm1.
This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.

Operation
// {4to8}
tmpSrc2[255:0] = UpConvLoadf64(mt)
for (n = 0; n < 8; n++) {
    if (k1[n] != 0) {
        i = 64*n
        j = i & 0xFF
        zmm1[i+63:i] = tmpSrc2[j+63:j]
    }
}

Flags Affected
None.

Memory Up-conversion: Uf64
S2 S1 S0  Function:      Usage  disp8*N
000       no conversion  [rax]  32
001       reserved       N/A    N/A
010       reserved       N/A    N/A
011       reserved       N/A    N/A
100       reserved       N/A    N/A
101       reserved       N/A    N/A
110       reserved       N/A    N/A
111       reserved       N/A    N/A

Intel® C/C++ Compiler Intrinsic Equivalent
__m512d _mm512_extload_pd (void const*, _MM_UPCONV_PD_ENUM, _MM_BROADCAST64_ENUM, int);
__m512d _mm512_mask_extload_pd (__m512d, __mmask8, void const*, _MM_UPCONV_PD_ENUM, _MM_BROADCAST64_ENUM, int);

Exceptions
Real-Address Mode and Virtual-8086
#UD              Instruction not available in these modes

Protected and Compatibility Mode
#UD              Instruction not available in these modes

64 bit Mode
#SS(0)           If a memory address referencing the SS segment is in a non-canonical form.
#GP(0)           If the memory address is in a non-canonical form.
                 If a memory operand linear address is not aligned to the data size
                 granularity dictated by SwizzUpConv mode.
#PF(fault-code)  For a page fault.
#NM              If CR0.TS[bit 3]=1.
                 If preceded by any REX, F0, F2, F3, or 66 prefixes.

VBROADCASTI32X4 - Broadcast 4xInt32 Vector

Opcode       MVEX.512.66.0F38.W0 5A /r
Instruction  vbroadcasti32x4 zmm1 {k1}, Ui32(mt)
Description  Broadcast 4xint32 vector Ui32(mt) into vector zmm1, under write-mask.

Description
The 4, 8 or 16 bytes (depending on the conversion and broadcast in effect) at memory
address mt are broadcast and/or converted to an int32 vector. The result is written into
int32 vector zmm1.
This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.

Operation
// {4to16}
tmpSrc2[127:0] = UpConvLoadi32(mt)
for (n = 0; n < 16; n++) {
    if (k1[n] != 0) {
        i = 32*n
        j = i & 0x7F
        zmm1[i+31:i] = tmpSrc2[j+31:j]
    }
}

Flags Affected
None.

Memory Up-conversion: Ui32
S2 S1 S0  Function:          Usage           disp8*N
000       no conversion      [rax]           16
001       reserved           N/A             N/A
010       reserved           N/A             N/A
011       reserved           N/A             N/A
100       uint8 to uint32    [rax] {uint8}   4
101       sint8 to sint32    [rax] {sint8}   4
110       uint16 to uint32   [rax] {uint16}  8
111       sint16 to sint32   [rax] {sint16}  8

Intel® C/C++ Compiler Intrinsic Equivalent
__m512i _mm512_extload_epi32 (void const*, _MM_UPCONV_EPI32_ENUM, _MM_BROADCAST32_ENUM, int);
__m512i _mm512_mask_extload_epi32 (__m512i, __mmask16, void const*, _MM_UPCONV_EPI32_ENUM, _MM_BROADCAST32_ENUM, int);

Exceptions
Real-Address Mode and Virtual-8086
#UD              Instruction not available in these modes

Protected and Compatibility Mode
#UD              Instruction not available in these modes

64 bit Mode
#SS(0)           If a memory address referencing the SS segment is in a non-canonical form.
#GP(0)           If the memory address is in a non-canonical form.
                 If a memory operand linear address is not aligned to the data size
                 granularity dictated by SwizzUpConv mode.
#PF(fault-code)  For a page fault.
#NM              If CR0.TS[bit 3]=1.
                 If preceded by any REX, F0, F2, F3, or 66 prefixes.

VBROADCASTI64X4 - Broadcast 4xInt64 Vector

Opcode       MVEX.512.66.0F38.W1 5B /r
Instruction  vbroadcasti64x4 zmm1 {k1}, Ui64(mt)
Description  Broadcast 4xint64 vector Ui64(mt) into vector zmm1, under write-mask.

Description
The 32 bytes at memory address mt are broadcast to an int64 vector. The result is written
into int64 vector zmm1.
This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.

Operation
// {4to8}
tmpSrc2[255:0] = UpConvLoadi64(mt)
for (n = 0; n < 8; n++) {
    if (k1[n] != 0) {
        i = 64*n
        j = i & 0xFF
        zmm1[i+63:i] = tmpSrc2[j+63:j]
    }
}

Flags Affected
None.

Memory Up-conversion: Ui64
S2 S1 S0  Function:      Usage  disp8*N
000       no conversion  [rax]  32
001       reserved       N/A    N/A
010       reserved       N/A    N/A
011       reserved       N/A    N/A
100       reserved       N/A    N/A
101       reserved       N/A    N/A
110       reserved       N/A    N/A
111       reserved       N/A    N/A

Intel® C/C++ Compiler Intrinsic Equivalent
__m512i _mm512_extload_epi64 (void const*, _MM_UPCONV_EPI64_ENUM, _MM_BROADCAST64_ENUM, int);
__m512i _mm512_mask_extload_epi64 (__m512i, __mmask8, void const*, _MM_UPCONV_EPI64_ENUM, _MM_BROADCAST64_ENUM, int);

Exceptions
Real-Address Mode and Virtual-8086
#UD              Instruction not available in these modes

Protected and Compatibility Mode
#UD              Instruction not available in these modes

64 bit Mode
#SS(0)           If a memory address referencing the SS segment is in a non-canonical form.
#GP(0)           If the memory address is in a non-canonical form.
                 If a memory operand linear address is not aligned to the data size
                 granularity dictated by SwizzUpConv mode.
#PF(fault-code)  For a page fault.
#NM              If CR0.TS[bit 3]=1.
                 If preceded by any REX, F0, F2, F3, or 66 prefixes.

VBROADCASTSD - Broadcast Float64 Vector

Opcode       MVEX.512.66.0F38.W1 19 /r
Instruction  vbroadcastsd zmm1 {k1}, Uf64(mt)
Description  Broadcast float64 vector Uf64(mt) into vector zmm1, under write-mask.

Description
The 8 bytes at memory address mt are broadcast to a float64 vector. The result is written
into float64 vector zmm1.
This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.

Operation
// {1to8}
tmpSrc2[63:0] = UpConvLoadf64(mt)
for (n = 0; n < 8; n++) {
    if (k1[n] != 0) {
        i = 64*n
        zmm1[i+63:i] = tmpSrc2[63:0]
    }
}

Flags Affected
None.

Memory Up-conversion: Uf64
S2 S1 S0  Function:      Usage  disp8*N
000       no conversion  [rax]  8
001       reserved       N/A    N/A
010       reserved       N/A    N/A
011       reserved       N/A    N/A
100       reserved       N/A    N/A
101       reserved       N/A    N/A
110       reserved       N/A    N/A
111       reserved       N/A    N/A

Intel® C/C++ Compiler Intrinsic Equivalent
__m512d _mm512_extload_pd (void const*, _MM_UPCONV_PD_ENUM, _MM_BROADCAST64_ENUM, int);
__m512d _mm512_mask_extload_pd (__m512d, __mmask8, void const*, _MM_UPCONV_PD_ENUM, _MM_BROADCAST64_ENUM, int);

Exceptions
Real-Address Mode and Virtual-8086
#UD              Instruction not available in these modes

Protected and Compatibility Mode
#UD              Instruction not available in these modes

64 bit Mode
#SS(0)           If a memory address referencing the SS segment is in a non-canonical form.
#GP(0)           If the memory address is in a non-canonical form.
                 If a memory operand linear address is not aligned to the data size
                 granularity dictated by SwizzUpConv mode.
#PF(fault-code)  For a page fault.
#NM              If CR0.TS[bit 3]=1.
                 If preceded by any REX, F0, F2, F3, or 66 prefixes.

VBROADCASTSS - Broadcast Float32 Vector

Opcode       MVEX.512.66.0F38.W0 18 /r
Instruction  vbroadcastss zmm1 {k1}, Uf32(mt)
Description  Broadcast float32 vector Uf32(mt) into vector zmm1, under write-mask.

Description
The 1, 2, or 4 bytes (depending on the conversion and broadcast in effect) at memory
address mt are broadcast and/or converted to a float32 vector. The result is written into
float32 vector zmm1.
This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.

Operation
// {1to16}
tmpSrc2[31:0] = UpConvLoadf32(mt)
for (n = 0; n < 16; n++) {
    if (k1[n] != 0) {
        i = 32*n
        zmm1[i+31:i] = tmpSrc2[31:0]
    }
}
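
A single-element load with up-conversion followed by a masked broadcast can be sketched as below. It assumes little-endian memory and that the integer conversions preserve the numeric value; both are assumptions of this sketch, and the function names and the string tags passed as conv are hypothetical.

```python
import struct

def upconv_load_f32(mem, conv="none"):
    """Load one element from bytes and convert it to a float value."""
    if conv == "none":
        return struct.unpack_from("<f", mem)[0]   # 4 bytes
    if conv == "float16":
        return struct.unpack_from("<e", mem)[0]   # 2 bytes
    if conv == "uint8":
        return float(struct.unpack_from("<B", mem)[0])
    if conv == "sint8":
        return float(struct.unpack_from("<b", mem)[0])
    if conv == "uint16":
        return float(struct.unpack_from("<H", mem)[0])
    if conv == "sint16":
        return float(struct.unpack_from("<h", mem)[0])
    raise ValueError("unknown conversion: " + conv)

def vbroadcastss(dst, k1, mem, conv="none"):
    """Broadcast the converted element to every lane whose mask bit is set."""
    value = upconv_load_f32(mem, conv)
    return [value if (k1 >> n) & 1 else dst[n] for n in range(16)]
```

The bytes consumed per conversion (4, 2, 1, 1, 2, 2) line up with the 1, 2, or 4 bytes mentioned in the description.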

Flags Affected
Invalid.

Memory Up-conversion: Uf32
S2 S1 S0    Function:             Usage              disp8*N
000         no conversion         [rax]              4
001         reserved              N/A                N/A
010         reserved              N/A                N/A
011         float16 to float32    [rax] {float16}    2
100         uint8 to float32      [rax] {uint8}      1
101         sint8 to float32      [rax] {sint8}      1
110         uint16 to float32     [rax] {uint16}     2
111         sint16 to float32     [rax] {sint16}     2

Intel® C/C++ Compiler Intrinsic Equivalent
__m512 _mm512_extload_ps (void const*, _MM_UPCONV_PS_ENUM, _MM_BROADCAST32_ENUM, int);
__m512 _mm512_mask_extload_ps (__m512, __mmask16, void const*, _MM_UPCONV_PS_ENUM, _MM_BROADCAST32_ENUM, int);

Exceptions
Real-Address Mode and Virtual-8086
#UD                Instruction not available in these modes

Protected and Compatibility Mode
#UD                Instruction not available in these modes

64 bit Mode
#SS(0)             If a memory address referencing the SS segment is in a non-canonical form.
#GP(0)             If the memory address is in a non-canonical form.
                   If a memory operand linear address is not aligned to the data size granularity dictated by SwizzUpConv mode.
#PF(fault-code)    For a page fault.
#NM                If CR0.TS[bit 3]=1.
                   If preceded by any REX, F0, F2, F3, or 66 prefixes.

VCMPPD - Compare Float64 Vectors and Set Vector Mask

Opcode                            Instruction                                   Description
MVEX.NDS.512.66.0F.W1 C2 /r ib    vcmppd k2 {k1}, zmm1, Sf64(zmm2/mt), imm8     Compare between float64 vector zmm1 and float64 vector Sf64(zmm2/mt) and store the result in k2, under write-mask.

Description
Performs an element-by-element comparison between float64 vector zmm1 and the
float64 vector result of the swizzle/broadcast/conversion from memory or float64 vector
zmm2. The result is written into vector mask k2.
Note: If DAZ=1, denormals are treated as zeros in the comparison (original source registers untouched). +0 equals −0. Comparison with NaN returns false.
Infinities of like sign are considered equal. Infinity values of either sign are considered
ordered values.
Table 6.3 summarizes VCMPPD behavior, in particular showing how various NaN results
can be produced.
Predicate    Imm8 enc    Description    Emulation                 If NaN    QNaN operand signals invalid
{eq}         000         A = B          -                         False     No
{lt}         001         A < B          -                         False     Yes
{le}         010         A <= B         -                         False     Yes
{gt}         -           A > B          Swap operands, use LT     False     Yes
{ge}         -           A >= B         Swap operands, use LE     False     Yes
{unord}      011         Unordered      -                         True      No
{neq}        100         NOT(A = B)     -                         True      No
{nlt}        101         NOT(A < B)     -                         True      Yes
{nle}        110         NOT(A <= B)    -                         True      Yes
{ngt}        -           NOT(A > B)     Swap operands, use NLT    True      Yes
{nge}        -           NOT(A >= B)    Swap operands, use NLE    True      Yes
{ord}        111         Ordered        -                         False     No

Table 6.3: VCMPPD behavior
The write-mask does not perform the normal write-masking function for this instruction.
While it does enable/disable comparisons, it does not block updating of the destination;
instead, if a write-mask bit is 0, the corresponding destination bit is set to 0. Nonetheless, the operation is similar enough that it makes sense to use the usual write-mask
notation. This mode of operation is desirable because the result will be used directly as a
write-mask, rather than the normal case where the result is used with a separate write-mask that keeps the masked elements inactive.

Immediate Format

         Comparison Type           I2    I1    I0
eq       Equal                     0     0     0
lt       Less than                 0     0     1
le       Less than or Equal        0     1     0
unord    Unordered                 0     1     1
neq      Not Equal                 1     0     0
nlt      Not Less than             1     0     1
nle      Not Less than or Equal    1     1     0
ord      Ordered                   1     1     1

Operation
switch (IMM8[2:0]) {
    case 0: OP ← EQ; break;
    case 1: OP ← LT; break;
    case 2: OP ← LE; break;
    case 3: OP ← UNORD; break;
    case 4: OP ← NEQ; break;
    case 5: OP ← NLT; break;
    case 6: OP ← NLE; break;
    case 7: OP ← ORD; break;
}
if(source is a register operand and MVEX.EH bit is 1) {
    if(SSS[2]==1) Supress_Exception_Flags() // SAE
    tmpSrc2[511:0] = zmm2[511:0]
} else {
    tmpSrc2[511:0] = SwizzUpConvLoadf64(zmm2/mt)
}
for (n = 0; n < 8; n++) {
    k2[n] = 0
    if(k1[n] != 0) {
        i = 64*n
        // float64 operation
        k2[n] = (zmm1[i+63:i] OP tmpSrc2[i+63:i]) ? 1 : 0
    }
}
k2[15:8] = 0
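
As a rough illustration of the zeroing write-mask behavior, the operation can be modelled in scalar C. The names vcmppd_model, cmp_f64 and the CMP_* constants are hypothetical, chosen only for this sketch; DAZ handling, swizzles and exception flags are not modelled, and C's own comparison operators are assumed to follow IEEE 754 for NaN operands.

```c
#include <math.h>
#include <stdint.h>

/* Predicate encodings, imm8[2:0], as listed under Immediate Format. */
enum { CMP_EQ, CMP_LT, CMP_LE, CMP_UNORD, CMP_NEQ, CMP_NLT, CMP_NLE, CMP_ORD };

static int cmp_f64(double a, double b, int op)
{
    int unordered = isnan(a) || isnan(b);
    switch (op & 7) {
    case CMP_EQ:    return a == b;
    case CMP_LT:    return a < b;
    case CMP_LE:    return a <= b;
    case CMP_UNORD: return unordered;
    case CMP_NEQ:   return !(a == b);     /* true when either operand is NaN */
    case CMP_NLT:   return !(a < b);      /* true when either operand is NaN */
    case CMP_NLE:   return !(a <= b);     /* true when either operand is NaN */
    default:        return !unordered;    /* CMP_ORD */
    }
}

/* Returns the 16-bit mask k2. Elements whose k1 bit is clear yield 0,
   they do not keep a previous value; bits 15:8 are always 0. */
uint16_t vcmppd_model(uint8_t k1, const double zmm1[8], const double src2[8], int imm8)
{
    uint16_t k2 = 0;
    for (int n = 0; n < 8; n++) {
        if (((k1 >> n) & 1u) && cmp_f64(zmm1[n], src2[n], imm8)) {
            k2 |= (uint16_t)(1u << n);
        }
    }
    return k2;
}
```

Because k2 starts at zero and only enabled elements can set a bit, the result can be used directly as the write-mask of a following instruction.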


Instruction Pseudo-ops
Compilers and assemblers may implement the following pseudo-ops in addition to the
standard instruction op:
Pseudo-Op                                    Implementation
vcmpeqpd k2 {k1}, zmm1, Sd(zmm2/mt)          vcmppd k2 {k1}, zmm1, Sd(zmm2/mt), {eq}
vcmpltpd k2 {k1}, zmm1, Sd(zmm2/mt)          vcmppd k2 {k1}, zmm1, Sd(zmm2/mt), {lt}
vcmplepd k2 {k1}, zmm1, Sd(zmm2/mt)          vcmppd k2 {k1}, zmm1, Sd(zmm2/mt), {le}
vcmpunordpd k2 {k1}, zmm1, Sd(zmm2/mt)       vcmppd k2 {k1}, zmm1, Sd(zmm2/mt), {unord}
vcmpneqpd k2 {k1}, zmm1, Sd(zmm2/mt)         vcmppd k2 {k1}, zmm1, Sd(zmm2/mt), {neq}
vcmpnltpd k2 {k1}, zmm1, Sd(zmm2/mt)         vcmppd k2 {k1}, zmm1, Sd(zmm2/mt), {nlt}
vcmpnlepd k2 {k1}, zmm1, Sd(zmm2/mt)         vcmppd k2 {k1}, zmm1, Sd(zmm2/mt), {nle}
vcmpordpd k2 {k1}, zmm1, Sd(zmm2/mt)         vcmppd k2 {k1}, zmm1, Sd(zmm2/mt), {ord}

SIMD Floating-Point Exceptions
Invalid, Denormal.

Denormal Handling
Treat Input Denormals As Zeros : (MXCSR.DAZ)? YES : NO
Flush Tiny Results To Zero : Not Applicable

Memory Up-conversion: Sf64
S2 S1 S0    Function:                    Usage                    disp8*N
000         no conversion                [rax] {8to8} or [rax]    64
001         broadcast 1 element (x8)     [rax] {1to8}             8
010         broadcast 4 elements (x2)    [rax] {4to8}             32
011         reserved                     N/A                      N/A
100         reserved                     N/A                      N/A
101         reserved                     N/A                      N/A
110         reserved                     N/A                      N/A
111         reserved                     N/A                      N/A

Register Swizzle: Sf64
MVEX.EH=0
S2 S1 S0    Function: 4 x 64 bits    Usage
000         no swizzle               zmm0 or zmm0 {dcba}
001         swap (inner) pairs       zmm0 {cdab}
010         swap with two-away       zmm0 {badc}
011         cross-product swizzle    zmm0 {dacb}
100         broadcast a element      zmm0 {aaaa}
101         broadcast b element      zmm0 {bbbb}
110         broadcast c element      zmm0 {cccc}
111         broadcast d element      zmm0 {dddd}
MVEX.EH=1
S2 S1 S0    Rounding Mode Override           Usage
1xx         SAE (Suppress-All-Exceptions)    , {sae}
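
The four-letter usage strings in the register swizzle table can be read as a per-lane permutation. The sketch below assumes the usual reading of this notation: within each group of four elements, a is the lowest element and d the highest, and the string is written from the highest element down to the lowest. The function name swizzle_lanes is hypothetical and the model works on plain arrays rather than registers.

```c
#include <stddef.h>

/* Apply a 4-letter swizzle pattern such as "cdab" to every group of four
   elements in src, writing the permuted elements to dst (n is a multiple of 4). */
void swizzle_lanes(const char *pattern, const double *src, double *dst, size_t n)
{
    for (size_t base = 0; base + 4 <= n; base += 4) {
        for (int e = 0; e < 4; e++) {
            /* pattern[0] names the source of element 3, pattern[3] of element 0 */
            char letter = pattern[3 - e];
            dst[base + e] = src[base + (size_t)(letter - 'a')];
        }
    }
}
```

Under this reading, "dcba" is the identity, "cdab" swaps neighbouring pairs, and "aaaa" replicates the lowest element of each group.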

Intel® C/C++ Compiler Intrinsic Equivalent
__mmask8 _mm512_cmpeq_pd_mask (__m512d, __m512d);
__mmask8 _mm512_mask_cmpeq_pd_mask(__mmask8, __m512d, __m512d);
__mmask8 _mm512_cmplt_pd_mask(__m512d, __m512d);
__mmask8 _mm512_mask_cmplt_pd_mask(__mmask8, __m512d, __m512d);
__mmask8 _mm512_cmple_pd_mask(__m512d, __m512d);
__mmask8 _mm512_mask_cmple_pd_mask(__mmask8, __m512d, __m512d);
__mmask8 _mm512_cmpunord_pd_mask(__m512d, __m512d);
__mmask8 _mm512_mask_cmpunord_pd_mask(__mmask8, __m512d, __m512d);
__mmask8 _mm512_cmpneq_pd_mask(__m512d, __m512d);
__mmask8 _mm512_mask_cmpneq_pd_mask(__mmask8, __m512d, __m512d);
__mmask8 _mm512_cmpnlt_pd_mask(__m512d, __m512d);
__mmask8 _mm512_mask_cmpnlt_pd_mask(__mmask8, __m512d, __m512d);
__mmask8 _mm512_cmpnle_pd_mask(__m512d, __m512d);
__mmask8 _mm512_mask_cmpnle_pd_mask(__mmask8, __m512d, __m512d);
__mmask8 _mm512_cmpord_pd_mask(__m512d, __m512d);
__mmask8 _mm512_mask_cmpord_pd_mask(__mmask8, __m512d, __m512d);

Exceptions
Real-Address Mode and Virtual-8086
#UD                Instruction not available in these modes

Protected and Compatibility Mode
#UD                Instruction not available in these modes

64 bit Mode
#SS(0)             If a memory address referencing the SS segment is in a non-canonical form.
#GP(0)             If the memory address is in a non-canonical form.
                   If a memory operand linear address is not aligned to the data size granularity dictated by SwizzUpConv mode.
#PF(fault-code)    For a page fault.
#NM                If CR0.TS[bit 3]=1.
                   If preceded by any REX, F0, F2, F3, or 66 prefixes.

VCMPPS - Compare Float32 Vectors and Set Vector Mask

Opcode                         Instruction                                   Description
MVEX.NDS.512.0F.W0 C2 /r ib    vcmpps k2 {k1}, zmm1, Sf32(zmm2/mt), imm8     Compare between float32 vector zmm1 and float32 vector Sf32(zmm2/mt) and store the result in k2, under write-mask.

Description
Performs an element-by-element comparison between float32 vector zmm1 and the
float32 vector result of the swizzle/broadcast/conversion from memory or float32 vector
zmm2. The result is written into vector mask k2.
Note: If DAZ=1, denormals are treated as zeros in the comparison (original source registers untouched). +0 equals −0. Comparison with NaN returns false.
Infinities of like sign are considered equal. Infinity values of either sign are considered
ordered values.
Table 6.4 summarizes VCMPPS behavior, in particular showing how various NaN results
can be produced.
Predicate    Imm8 enc    Description    Emulation                 If NaN    QNaN operand signals invalid
{eq}         000         A = B          -                         False     No
{lt}         001         A < B          -                         False     Yes
{le}         010         A <= B         -                         False     Yes
{gt}         -           A > B          Swap operands, use LT     False     Yes
{ge}         -           A >= B         Swap operands, use LE     False     Yes
{unord}      011         Unordered      -                         True      No
{neq}        100         NOT(A = B)     -                         True      No
{nlt}        101         NOT(A < B)     -                         True      Yes
{nle}        110         NOT(A <= B)    -                         True      Yes
{ngt}        -           NOT(A > B)     Swap operands, use NLT    True      Yes
{nge}        -           NOT(A >= B)    Swap operands, use NLE    True      Yes
{ord}        111         Ordered        -                         False     No

Table 6.4: VCMPPS behavior
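
The Emulation column can be checked with a small scalar sketch. The helper names below are hypothetical and stand in for the four encodable predicates involved; the point is only that the predicates without an immediate encoding are obtained by swapping the operands, and that the NaN results of the table are preserved by the swap.

```c
#include <math.h>   /* provides NAN for callers that want to try unordered inputs */

/* Scalar stand-ins for the encodable predicates {lt}, {le}, {nlt}, {nle}. */
static int lt_f32(float a, float b)  { return a < b; }      /* false if either is NaN */
static int le_f32(float a, float b)  { return a <= b; }     /* false if either is NaN */
static int nlt_f32(float a, float b) { return !(a < b); }   /* true if either is NaN  */
static int nle_f32(float a, float b) { return !(a <= b); }  /* true if either is NaN  */

/* Predicates without an encoding: swap the operands and reuse the ones above. */
int gt_f32(float a, float b)  { return lt_f32(b, a); }      /* A > B       */
int ge_f32(float a, float b)  { return le_f32(b, a); }      /* A >= B      */
int ngt_f32(float a, float b) { return nlt_f32(b, a); }     /* NOT(A > B)  */
int nge_f32(float a, float b) { return nle_f32(b, a); }     /* NOT(A >= B) */
```

For example, gt_f32(NAN, 1.0f) is 0 while ngt_f32(NAN, 1.0f) is 1, matching the "If NaN" column for {gt} and {ngt}.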
The write-mask does not perform the normal write-masking function for this instruction.
While it does enable/disable comparisons, it does not block updating of the destination;
instead, if a write-mask bit is 0, the corresponding destination bit is set to 0. Nonetheless, the operation is similar enough that it makes sense to use the usual write-mask
notation. This mode of operation is desirable because the result will be used directly as a
write-mask, rather than the normal case where the result is used with a separate write-mask that keeps the masked elements inactive.

Immediate Format

         Comparison Type           I2    I1    I0
eq       Equal                     0     0     0
lt       Less than                 0     0     1
le       Less than or Equal        0     1     0
unord    Unordered                 0     1     1
neq      Not Equal                 1     0     0
nlt      Not Less than             1     0     1
nle      Not Less than or Equal    1     1     0
ord      Ordered                   1     1     1

Operation

switch (IMM8[2:0]) {
    case 0: OP ← EQ; break;
    case 1: OP ← LT; break;
    case 2: OP ← LE; break;
    case 3: OP ← UNORD; break;
    case 4: OP ← NEQ; break;
    case 5: OP ← NLT; break;
    case 6: OP ← NLE; break;
    case 7: OP ← ORD; break;
}
if(source is a register operand and MVEX.EH bit is 1) {
    if(SSS[2]==1) Supress_Exception_Flags() // SAE
    tmpSrc2[511:0] = zmm2[511:0]
} else {
    tmpSrc2[511:0] = SwizzUpConvLoadf32(zmm2/mt)
}
for (n = 0; n < 16; n++) {
    k2[n] = 0
    if(k1[n] != 0) {
        i = 32*n
        // float32 operation
        k2[n] = (zmm1[i+31:i] OP tmpSrc2[i+31:i]) ? 1 : 0
    }
}


Instruction Pseudo-ops
Compilers and assemblers may implement the following pseudo-ops in addition to the
standard instruction op:
Pseudo-Op                                    Implementation
vcmpeqps k2 {k1}, zmm1, Sf(zmm2/mt)          vcmpps k2 {k1}, zmm1, Sf(zmm2/mt), {eq}
vcmpltps k2 {k1}, zmm1, Sf(zmm2/mt)          vcmpps k2 {k1}, zmm1, Sf(zmm2/mt), {lt}
vcmpleps k2 {k1}, zmm1, Sf(zmm2/mt)          vcmpps k2 {k1}, zmm1, Sf(zmm2/mt), {le}
vcmpunordps k2 {k1}, zmm1, Sf(zmm2/mt)       vcmpps k2 {k1}, zmm1, Sf(zmm2/mt), {unord}
vcmpneqps k2 {k1}, zmm1, Sf(zmm2/mt)         vcmpps k2 {k1}, zmm1, Sf(zmm2/mt), {neq}
vcmpnltps k2 {k1}, zmm1, Sf(zmm2/mt)         vcmpps k2 {k1}, zmm1, Sf(zmm2/mt), {nlt}
vcmpnleps k2 {k1}, zmm1, Sf(zmm2/mt)         vcmpps k2 {k1}, zmm1, Sf(zmm2/mt), {nle}
vcmpordps k2 {k1}, zmm1, Sf(zmm2/mt)         vcmpps k2 {k1}, zmm1, Sf(zmm2/mt), {ord}

SIMD Floating-Point Exceptions
Invalid, Denormal.

Denormal Handling
Treat Input Denormals As Zeros : (MXCSR.DAZ)? YES : NO
Flush Tiny Results To Zero : Not Applicable

Memory Up-conversion: Sf32
S2 S1 S0    Function:                    Usage                      disp8*N
000         no conversion                [rax] {16to16} or [rax]    64
001         broadcast 1 element (x16)    [rax] {1to16}              4
010         broadcast 4 elements (x4)    [rax] {4to16}              16
011         float16 to float32           [rax] {float16}            32
100         uint8 to float32             [rax] {uint8}              16
110         uint16 to float32            [rax] {uint16}             32
111         sint16 to float32            [rax] {sint16}             32

Register Swizzle: Sf32
MVEX.EH=0
S2 S1 S0    Function: 4 x 32 bits    Usage
000         no swizzle               zmm0 or zmm0 {dcba}
001         swap (inner) pairs       zmm0 {cdab}
010         swap with two-away       zmm0 {badc}
011         cross-product swizzle    zmm0 {dacb}
100         broadcast a element      zmm0 {aaaa}
101         broadcast b element      zmm0 {bbbb}
110         broadcast c element      zmm0 {cccc}
111         broadcast d element      zmm0 {dddd}
MVEX.EH=1
S2 S1 S0    Rounding Mode Override           Usage
1xx         SAE (Suppress-All-Exceptions)    , {sae}

Intel® C/C++ Compiler Intrinsic Equivalent
__mmask16 _mm512_cmpeq_ps_mask (__m512, __m512);
__mmask16 _mm512_mask_cmpeq_ps_mask (__mmask16, __m512, __m512);
__mmask16 _mm512_cmplt_ps_mask (__m512, __m512);
__mmask16 _mm512_mask_cmplt_ps_mask (__mmask16, __m512, __m512);
__mmask16 _mm512_cmple_ps_mask (__m512, __m512);
__mmask16 _mm512_mask_cmple_ps_mask (__mmask16, __m512, __m512);
__mmask16 _mm512_cmpunord_ps_mask (__m512, __m512);
__mmask16 _mm512_mask_cmpunord_ps_mask (__mmask16, __m512, __m512);
__mmask16 _mm512_cmpneq_ps_mask (__m512, __m512);
__mmask16 _mm512_mask_cmpneq_ps_mask (__mmask16, __m512, __m512);
__mmask16 _mm512_cmpnlt_ps_mask (__m512, __m512);
__mmask16 _mm512_mask_cmpnlt_ps_mask (__mmask16, __m512, __m512);
__mmask16 _mm512_cmpnle_ps_mask (__m512, __m512);
__mmask16 _mm512_mask_cmpnle_ps_mask (__mmask16, __m512, __m512);
__mmask16 _mm512_cmpord_ps_mask (__m512, __m512);
__mmask16 _mm512_mask_cmpord_ps_mask (__mmask16, __m512, __m512);

Exceptions
Real-Address Mode and Virtual-8086
#UD                Instruction not available in these modes

Protected and Compatibility Mode
#UD                Instruction not available in these modes

64 bit Mode
#SS(0)             If a memory address referencing the SS segment is in a non-canonical form.
#GP(0)             If the memory address is in a non-canonical form.
                   If a memory operand linear address is not aligned to the data size granularity dictated by SwizzUpConv mode.
#PF(fault-code)    For a page fault.
#NM                If CR0.TS[bit 3]=1.
                   If preceded by any REX, F0, F2, F3, or 66 prefixes.

VCVTDQ2PD - Convert Int32 Vector to Float64 Vector

Opcode                     Instruction                            Description
MVEX.512.F3.0F.W0 E6 /r    vcvtdq2pd zmm1 {k1}, Si32(zmm2/mt)     Convert int32 vector Si32(zmm2/mt) to float64, and store the result in zmm1, under write-mask.

Description
Performs an element-by-element conversion from the int32 vector result of the swizzle/broadcast/conversion from memory or int32 vector zmm2 to a float64 vector. The
result is written into float64 vector zmm1. The int32 source is read from either the lower
half of the source operand (int32 vector zmm2), the full memory source (8 elements, i.e. 256 bits), or the broadcast memory source.
This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.

Operation
if(source is a register operand and MVEX.EH bit is 1) {
    tmpSrc2[255:0] = zmm2[255:0]
} else {
    tmpSrc2[255:0] = SwizzUpConvLoadi32(zmm2/mt)
}
for (n = 0; n < 8; n++) {
    if(k1[n] != 0) {
        i = 64*n
        j = 32*n
        zmm1[i+63:i] =
            CvtInt32ToFloat64(tmpSrc2[j+31:j])
    }
}
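
A scalar C sketch of the widening conversion, under the assumption that the source has already been loaded or swizzled into eight int32 values. The function name cvtdq2pd_model is hypothetical. Every int32 value is exactly representable as a float64, so the conversion in this model never rounds.

```c
#include <stdint.h>

/* Hypothetical scalar model: the low 8 int32 elements widen to 8 float64
   elements, under an 8-bit write-mask. */
void cvtdq2pd_model(double zmm1[8], uint8_t k1, const int32_t src_lo[8])
{
    for (int n = 0; n < 8; n++) {
        if ((k1 >> n) & 1u) {
            zmm1[n] = (double)src_lo[n];   /* exact conversion, no rounding */
        }                                  /* masked-off elements are preserved */
    }
}
```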

SIMD Floating-Point Exceptions
None.


Denormal Handling
Treat Input Denormals As Zeros : Not Applicable
Flush Tiny Results To Zero : Not Applicable

Memory Up-conversion: Si32
S2 S1 S0    Function:                    Usage                    disp8*N
000         no conversion                [rax] {8to8} or [rax]    32
001         broadcast 1 element (x8)     [rax] {1to8}             4
010         broadcast 4 elements (x2)    [rax] {4to8}             16
011         reserved                     N/A                      N/A
100         reserved                     N/A                      N/A
101         reserved                     N/A                      N/A
110         reserved                     N/A                      N/A
111         reserved                     N/A                      N/A

Register Swizzle: Si32
MVEX.EH=0
S2 S1 S0    Function: 4 x 32 bits    Usage
000         no swizzle               zmm0 or zmm0 {dcba}
001         swap (inner) pairs       zmm0 {cdab}
010         swap with two-away       zmm0 {badc}
011         cross-product swizzle    zmm0 {dacb}
100         broadcast a element      zmm0 {aaaa}
101         broadcast b element      zmm0 {bbbb}
110         broadcast c element      zmm0 {cccc}
111         broadcast d element      zmm0 {dddd}

Intel® C/C++ Compiler Intrinsic Equivalent
__m512d _mm512_cvtepi32lo_pd (__m512i);
__m512d _mm512_mask_cvtepi32lo_pd (__m512d, __mmask8, __m512i);

Exceptions
Real-Address Mode and Virtual-8086
#UD                Instruction not available in these modes

Protected and Compatibility Mode
#UD                Instruction not available in these modes

64 bit Mode
#SS(0)             If a memory address referencing the SS segment is in a non-canonical form.
#GP(0)             If the memory address is in a non-canonical form.
                   If a memory operand linear address is not aligned to 4, 16 or 32 bytes (depending on the swizzle broadcast).
#PF(fault-code)    For a page fault.
#NM                If CR0.TS[bit 3]=1.
                   If preceded by any REX, F0, F2, F3, or 66 prefixes.
                   This instruction does not support any SwizzUpConv involving data conversion.
                   If the SwizzUpConvMem function from memory is set to any value other than
                   "no action", {1to8} or {4to8}, then an Invalid Opcode fault is raised. Note
                   that this rule only applies to memory conversions (register swizzles are allowed).

VCVTFXPNTDQ2PS - Convert Fixed Point Int32 Vector to Float32 Vector

Opcode                       Instruction                                       Description
MVEX.512.0F3A.W0 CB /r ib    vcvtfxpntdq2ps zmm1 {k1}, Si32(zmm2/mt), imm8     Convert int32 vector Si32(zmm2/mt) to float32, and store the result in zmm1, using imm8, under write-mask.

Description
Performs an element-by-element conversion from the int32 vector result of the swizzle/broadcast/conversion from memory or int32 vector zmm2 to a float32 vector, then
performs an optional adjustment to the exponent.
This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.

Immediate Format
Exponent Adjustment    value                                  I7    I6    I5    I4
0                      2^0 (32.0 - no exponent adjustment)    0     0     0     0
4                      2^4 (28.4)                             0     0     0     1
5                      2^5 (27.5)                             0     0     1     0
8                      2^8 (24.8)                             0     0     1     1
16                     2^16 (16.16)                           0     1     0     0
24                     2^24 (8.24)                            0     1     0     1
31                     2^31 (1.31)                            0     1     1     0
32                     2^32 (0.32)                            0     1     1     1
reserved               *must UD*                              1     x     x     x

Operation
expadj = IMM8[6:4]
if(source is a register operand and MVEX.EH bit is 1) {
    if(SSS[2]==1) Supress_Exception_Flags() // SAE
    // SSS are bits 6-4 from the MVEX prefix encoding. For more details, see Table 2.14
    RoundingMode = SSS[1:0]
    tmpSrc2[511:0] = zmm2[511:0]
} else {
    RoundingMode = MXCSR.RC
    tmpSrc2[511:0] = SwizzUpConvLoadi32(zmm2/mt)
}
for (n = 0; n < 16; n++) {
    if(k1[n] != 0) {
        i = 32*n
        zmm1[i+31:i] =
            CvtInt32ToFloat32(tmpSrc2[i+31:i], RoundingMode) / EXPADJ_TABLE[expadj]
    }
}
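
For a single element, the exponent adjustment amounts to dividing the converted value by a power of two, which turns a fixed-point integer into its real value. The sketch below is a scalar illustration with hypothetical names (fxp32_to_f32, EXPADJ_SCALE); it assumes the C implementation's default round-to-nearest int-to-float conversion and does not model the SSS rounding override or the reserved I7 = 1 encodings.

```c
#include <stdint.h>

/* Scale factors 2^0, 2^4, 2^5, 2^8, 2^16, 2^24, 2^31, 2^32, indexed by imm8[6:4]. */
static const float EXPADJ_SCALE[8] = {
    1.0f, 16.0f, 32.0f, 256.0f, 65536.0f, 16777216.0f, 2147483648.0f, 4294967296.0f
};

/* Convert one fixed-point int32 to float32; division by a power of two is
   exact unless the result underflows. */
float fxp32_to_f32(int32_t fixed, unsigned imm8)
{
    unsigned expadj = (imm8 >> 4) & 7u;
    return (float)fixed / EXPADJ_SCALE[expadj];
}
```

For instance, the 16.16 value 0x00018000 with imm8 = 0x40 (exponent adjustment 16) yields 1.5f.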

SIMD Floating-Point Exceptions
Precision.

Denormal Handling
Treat Input Denormals As Zeros : Not Applicable
Flush Tiny Results To Zero : Not Applicable

Memory Up-conversion: Si32
S2 S1 S0    Function:                    Usage                      disp8*N
000         no conversion                [rax] {16to16} or [rax]    64
001         broadcast 1 element (x16)    [rax] {1to16}              4
010         broadcast 4 elements (x4)    [rax] {4to16}              16
011         reserved                     N/A                        N/A
100         uint8 to uint32              [rax] {uint8}              16
101         sint8 to sint32              [rax] {sint8}              16
110         uint16 to uint32             [rax] {uint16}             32
111         sint16 to sint32             [rax] {sint16}             32

Register Swizzle: Si32
MVEX.EH=0
S2 S1 S0    Function: 4 x 32 bits    Usage
000         no swizzle               zmm0 or zmm0 {dcba}
001         swap (inner) pairs       zmm0 {cdab}
010         swap with two-away       zmm0 {badc}
011         cross-product swizzle    zmm0 {dacb}
100         broadcast a element      zmm0 {aaaa}
101         broadcast b element      zmm0 {bbbb}
110         broadcast c element      zmm0 {cccc}
111         broadcast d element      zmm0 {dddd}
MVEX.EH=1
S2 S1 S0    Rounding Mode Override           Usage
1xx         SAE (Suppress-All-Exceptions)    , {sae}

Intel® C/C++ Compiler Intrinsic Equivalent
__m512 _mm512_cvtfxpnt_round_adjustepi32_ps(__m512i, int, _MM_EXP_ADJ_ENUM);
__m512 _mm512_mask_cvtfxpnt_round_adjustepi32_ps(__m512, __mmask16, __m512i, int, _MM_EXP_ADJ_ENUM);

Exceptions
Real-Address Mode and Virtual-8086
#UD                Instruction not available in these modes

Protected and Compatibility Mode
#UD                Instruction not available in these modes

64 bit Mode
#SS(0)             If a memory address referencing the SS segment is in a non-canonical form.
#GP(0)             If the memory address is in a non-canonical form.
                   If a memory operand linear address is not aligned to the data size granularity dictated by SwizzUpConv mode.
#PF(fault-code)    For a page fault.
#NM                If CR0.TS[bit 3]=1.
                   If preceded by any REX, F0, F2, F3, or 66 prefixes.

VCVTFXPNTPD2DQ - Convert Float64 Vector to Fixed Point Int32 Vector

Opcode                          Instruction                                       Description
MVEX.512.F2.0F3A.W1 E6 /r ib    vcvtfxpntpd2dq zmm1 {k1}, Sf64(zmm2/mt), imm8     Convert float64 vector Sf64(zmm2/mt) to int32, and store the result in zmm1, using imm8, under write-mask.

Description
Performs an element-by-element conversion and rounding from the float64 vector result
of the swizzle/broadcast/conversion from memory or float64 vector zmm2 to an int32 vector. The int32 result is written into the lower half of the destination register zmm1; the
other half of the destination is set to zero.
Out-of-range values are converted to the nearest representable value, and NaNs convert to 0, because this makes the calculation of Exp2 more efficient (avoiding problems
with converting very large values to integers, where undetected incorrect values could
otherwise result from overflow). Table 6.5 describes the result for floating-point special numbers.

Input    Result
NaN      0
+∞       INT_MAX
+0       0
-0       0
−∞       INT_MIN

Table 6.5: Converting to integer special floating-point values behavior
This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.

Immediate Format

      Rounding Mode                                  I1    I0
rn    Round to Nearest (even)                        0     0
rd    Round Down (Round toward Negative Infinity)    0     1
ru    Round Up (Round toward Positive Infinity)      1     0
rz    Round toward Zero                              1     1

Operation
RoundingMode = IMM8[1:0]
if(source is a register operand and MVEX.EH bit is 1) {
    if(SSS[2]==1) Supress_Exception_Flags() // SAE
    tmpSrc2[511:0] = zmm2[511:0]
} else {
    tmpSrc2[511:0] = SwizzUpConvLoadf64(zmm2/mt)
}
for (n = 0; n < 8; n++) {
    if(k1[n] != 0) {
        i = 64*n
        j = 32*n
        zmm1[j+31:j] =
            CvtFloat64ToInt32(tmpSrc2[i+63:i], RoundingMode)
    }
}
zmm1[511:256] = 0
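
The per-element conversion, including the special-value results of Table 6.5 and the four rounding modes of the immediate, can be sketched in scalar C. The function f64_to_i32_sat and the RND_* constants are hypothetical names for this illustration; the rounding is done by hand on the truncated value so that the sketch does not depend on the current floating-point environment. Exception flags and DAZ are not modelled.

```c
#include <math.h>
#include <stdint.h>

enum { RND_RN = 0, RND_RD = 1, RND_RU = 2, RND_RZ = 3 };   /* imm8[1:0] */

int32_t f64_to_i32_sat(double x, int rounding_mode)
{
    if (isnan(x)) return 0;                        /* NaN converts to 0 */
    if (x >= 2147483648.0)  return INT32_MAX;      /* too large, includes +infinity */
    if (x <= -2147483649.0) return INT32_MIN;      /* too small, includes -infinity */

    int64_t t = (int64_t)x;                        /* truncates toward zero */
    double frac = x - (double)t;                   /* exact here, lies in (-1, 1) */
    switch (rounding_mode & 3) {
    case RND_RN:                                   /* ties go to the even integer */
        if (frac > 0.5 || (frac == 0.5 && (t & 1)))        t++;
        else if (frac < -0.5 || (frac == -0.5 && (t & 1))) t--;
        break;
    case RND_RD: if (frac < 0.0) t--; break;       /* toward negative infinity */
    case RND_RU: if (frac > 0.0) t++; break;       /* toward positive infinity */
    default: break;                                /* RND_RZ keeps the truncation */
    }
    if (t > INT32_MAX) return INT32_MAX;           /* rounding may step out of range */
    if (t < INT32_MIN) return INT32_MIN;
    return (int32_t)t;
}
```

Saturating instead of wrapping means a caller never sees a plausible-looking but wrong integer for an out-of-range input, which is the motivation given in the description.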

SIMD Floating-Point Exceptions
Invalid, Precision.

Denormal Handling
Treat Input Denormals As Zeros : (MXCSR.DAZ)? YES : NO
Flush Tiny Results To Zero : Not Applicable

Memory Up-conversion: Sf64
S2 S1 S0    Function:                    Usage                    disp8*N
000         no conversion                [rax] {8to8} or [rax]    64
001         broadcast 1 element (x8)     [rax] {1to8}             8
010         broadcast 4 elements (x2)    [rax] {4to8}             32
011         reserved                     N/A                      N/A
100         reserved                     N/A                      N/A
101         reserved                     N/A                      N/A
110         reserved                     N/A                      N/A
111         reserved                     N/A                      N/A

Register Swizzle: Sf64
MVEX.EH=0
S2 S1 S0    Function: 4 x 64 bits    Usage
000         no swizzle               zmm0 or zmm0 {dcba}
001         swap (inner) pairs       zmm0 {cdab}
010         swap with two-away       zmm0 {badc}
011         cross-product swizzle    zmm0 {dacb}
100         broadcast a element      zmm0 {aaaa}
101         broadcast b element      zmm0 {bbbb}
110         broadcast c element      zmm0 {cccc}
111         broadcast d element      zmm0 {dddd}
MVEX.EH=1
S2 S1 S0    Rounding Mode Override           Usage
1xx         SAE (Suppress-All-Exceptions)    , {sae}

Intel® C/C++ Compiler Intrinsic Equivalent
__m512i _mm512_cvtfxpnt_roundpd_epi32lo(__m512d, int);
__m512i _mm512_mask_cvtfxpnt_roundpd_epi32lo(__m512i, __mmask8, __m512d, int);

Exceptions
Real-Address Mode and Virtual-8086
#UD                Instruction not available in these modes

Protected and Compatibility Mode
#UD                Instruction not available in these modes

64 bit Mode
#SS(0)             If a memory address referencing the SS segment is in a non-canonical form.
#GP(0)             If the memory address is in a non-canonical form.
                   If a memory operand linear address is not aligned to the data size granularity dictated by SwizzUpConv mode.
#PF(fault-code)    For a page fault.
#NM                If CR0.TS[bit 3]=1.
                   If preceded by any REX, F0, F2, F3, or 66 prefixes.

VCVTFXPNTPD2UDQ - Convert Float64 Vector to Fixed Point Uint32 Vector

Opcode                          Instruction                                        Description
MVEX.512.F2.0F3A.W1 CA /r ib    vcvtfxpntpd2udq zmm1 {k1}, Sf64(zmm2/mt), imm8     Convert float64 vector Sf64(zmm2/mt) to uint32, and store the result in zmm1, using imm8, under write-mask.

Description
Performs an element-by-element conversion and rounding from the float64 vector result
of the swizzle/broadcast/conversion from memory or float64 vector zmm2 to a uint32
vector. The uint32 result is written into the lower half of the destination register zmm1;
the other half of the destination is set to zero.
Out-of-range values are converted to the nearest representable value, and NaNs convert to 0, because this makes the calculation of Exp2 more efficient (avoiding problems
with converting very large values to integers, where undetected incorrect values could
otherwise result from overflow). Table 6.6 describes the result for floating-point special numbers.

Input    Result
NaN      0
+∞       INT_MAX
+0       0
-0       0
−∞       INT_MIN

Table 6.6: Converting to integer special floating-point values behavior
This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.

Immediate Format

      Rounding Mode                                  I1    I0
rn    Round to Nearest (even)                        0     0
rd    Round Down (Round toward Negative Infinity)    0     1
ru    Round Up (Round toward Positive Infinity)      1     0
rz    Round toward Zero                              1     1

Operation
RoundingMode = IMM8[1:0]
if(source is a register operand and MVEX.EH bit is 1) {
    if(SSS[2]==1) Supress_Exception_Flags() // SAE
    tmpSrc2[511:0] = zmm2[511:0]
} else {
    tmpSrc2[511:0] = SwizzUpConvLoadf64(zmm2/mt)
}
for (n = 0; n < 8; n++) {
    if(k1[n] != 0) {
        i = 64*n
        j = 32*n
        zmm1[j+31:j] =
            CvtFloat64ToUint32(tmpSrc2[i+63:i], RoundingMode)
    }
}
zmm1[511:256] = 0

SIMD Floating-Point Exceptions
Invalid, Precision.

Denormal Handling
Treat Input Denormals As Zeros : (MXCSR.DAZ)? YES : NO
Flush Tiny Results To Zero : Not Applicable

Memory Up-conversion: Sf64
S2 S1 S0    Function:                    Usage                    disp8*N
000         no conversion                [rax] {8to8} or [rax]    64
001         broadcast 1 element (x8)     [rax] {1to8}             8
010         broadcast 4 elements (x2)    [rax] {4to8}             32
011         reserved                     N/A                      N/A
100         reserved                     N/A                      N/A
101         reserved                     N/A                      N/A
110         reserved                     N/A                      N/A
111         reserved                     N/A                      N/A

Register Swizzle: Sf64
MVEX.EH=0
S2 S1 S0    Function: 4 x 64 bits    Usage
000         no swizzle               zmm0 or zmm0 {dcba}
001         swap (inner) pairs       zmm0 {cdab}
010         swap with two-away       zmm0 {badc}
011         cross-product swizzle    zmm0 {dacb}
100         broadcast a element      zmm0 {aaaa}
101         broadcast b element      zmm0 {bbbb}
110         broadcast c element      zmm0 {cccc}
111         broadcast d element      zmm0 {dddd}
MVEX.EH=1
S2 S1 S0    Rounding Mode Override           Usage
1xx         SAE (Suppress-All-Exceptions)    , {sae}

Intel® C/C++ Compiler Intrinsic Equivalent
__m512i _mm512_cvtfxpnt_roundpd_epu32lo(__m512d, int);
__m512i _mm512_mask_cvtfxpnt_roundpd_epu32lo(__m512i, __mmask8, __m512d, int);

Exceptions
Real-Address Mode and Virtual-8086
#UD                Instruction not available in these modes

Protected and Compatibility Mode
#UD                Instruction not available in these modes

64 bit Mode
#SS(0)             If a memory address referencing the SS segment is in a non-canonical form.
#GP(0)             If the memory address is in a non-canonical form.
                   If a memory operand linear address is not aligned to the data size granularity dictated by SwizzUpConv mode.
#PF(fault-code)    For a page fault.
#NM                If CR0.TS[bit 3]=1.
                   If preceded by any REX, F0, F2, F3, or 66 prefixes.

VCVTFXPNTPS2DQ - Convert Float32 Vector to Fixed Point Int32 Vector

Opcode                          Instruction                                       Description
MVEX.512.66.0F3A.W0 CB /r ib    vcvtfxpntps2dq zmm1 {k1}, Sf32(zmm2/mt), imm8     Convert float32 vector Sf32(zmm2/mt) to int32, and store the result in zmm1, using imm8, under write-mask.

Description
Performs an element-by-element conversion and rounding from the float32 vector result
of the swizzle/broadcast/conversion from memory or float32 vector zmm2 to an int32 vector, with an optional exponent adjustment before the conversion.
Out-of-range values are converted to the nearest representable value, and NaNs convert to 0, because this makes the calculation of Exp2 more efficient (avoiding problems
with converting very large values to integers, where undetected incorrect values could
otherwise result from overflow). Table 6.7 describes the result for floating-point special numbers.

Input    Result
NaN      0
+∞       INT_MAX
+0       0
-0       0
−∞       INT_MIN

Table 6.7: Converting to integer special floating-point values behavior
This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.

Immediate Format

      Rounding Mode                                  I1    I0
rn    Round to Nearest (even)                        0     0
rd    Round Down (Round toward Negative Infinity)    0     1
ru    Round Up (Round toward Positive Infinity)      1     0
rz    Round toward Zero                              1     1

Exponent Adjustment    value                                  I7    I6    I5    I4
0                      2^0 (32.0 - no exponent adjustment)    0     0     0     0
4                      2^4 (28.4)                             0     0     0     1
5                      2^5 (27.5)                             0     0     1     0
8                      2^8 (24.8)                             0     0     1     1
16                     2^16 (16.16)                           0     1     0     0
24                     2^24 (8.24)                            0     1     0     1
31                     2^31 (1.31)                            0     1     1     0
32                     2^32 (0.32)                            0     1     1     1
reserved               *must UD*                              1     x     x     x

Operation
RoundingMode = IMM8[1:0]
expadj = IMM8[6:4]
if(source is a register operand and MVEX.EH bit is 1) {
    if(SSS[2]==1) Supress_Exception_Flags() // SAE
    tmpSrc2[511:0] = zmm2[511:0]
} else {
    tmpSrc2[511:0] = SwizzUpConvLoadf32(zmm2/mt)
}
for (n = 0; n < 16; n++) {
    if(k1[n] != 0) {
        i = 32*n
        zmm1[i+31:i] =
            CvtFloat32ToInt32(tmpSrc2[i+31:i] * EXPADJ_TABLE[expadj], RoundingMode)
    }
}
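
Here the exponent adjustment is applied before the conversion: the float32 value is multiplied by a power of two and then converted, which produces a fixed-point integer. The scalar sketch below uses hypothetical names (f32_to_fixed_rz, EXPADJ_SCALE) and, to stay short, models only the rz rounding mode (imm8[1:0] = 11) together with the saturation and NaN rules described above.

```c
#include <math.h>
#include <stdint.h>

/* Scale factors 2^0, 2^4, 2^5, 2^8, 2^16, 2^24, 2^31, 2^32, indexed by imm8[6:4]. */
static const float EXPADJ_SCALE[8] = {
    1.0f, 16.0f, 32.0f, 256.0f, 65536.0f, 16777216.0f, 2147483648.0f, 4294967296.0f
};

/* Convert one float32 to a fixed-point int32, rounding toward zero only.
   The other three rounding modes are deliberately not handled in this sketch. */
int32_t f32_to_fixed_rz(float x, unsigned imm8)
{
    float scaled = x * EXPADJ_SCALE[(imm8 >> 4) & 7u];
    if (isnan(scaled)) return 0;                      /* NaN converts to 0 */
    if (scaled >= 2147483648.0f)  return INT32_MAX;   /* saturate high */
    if (scaled <= -2147483648.0f) return INT32_MIN;   /* saturate low */
    return (int32_t)scaled;                           /* C conversion truncates */
}
```

With imm8 = 0x43 (exponent adjustment 16, rz), 1.5f becomes the 16.16 value 0x00018000, while 40000.0f does not fit in 16.16 and saturates to INT32_MAX.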

SIMD Floating-Point Exceptions
Invalid, Precision.

Denormal Handling
Treat Input Denormals As Zeros : (MXCSR.DAZ)? YES : NO
Flush Tiny Results To Zero : Not Applicable

Memory Up-conversion: Sf32
S2 S1 S0    Function:                    Usage                      disp8*N
000         no conversion                [rax] {16to16} or [rax]    64
001         broadcast 1 element (x16)    [rax] {1to16}              4
010         broadcast 4 elements (x4)    [rax] {4to16}              16
011         float16 to float32           [rax] {float16}            32
100         uint8 to float32             [rax] {uint8}              16
110         uint16 to float32            [rax] {uint16}             32
111         sint16 to float32            [rax] {sint16}             32

Register Swizzle: Sf32
MVEX.EH=0
S2 S1 S0    Function: 4 x 32 bits    Usage
000         no swizzle               zmm0 or zmm0 {dcba}
001         swap (inner) pairs       zmm0 {cdab}
010         swap with two-away       zmm0 {badc}
011         cross-product swizzle    zmm0 {dacb}
100         broadcast a element      zmm0 {aaaa}
101         broadcast b element      zmm0 {bbbb}
110         broadcast c element      zmm0 {cccc}
111         broadcast d element      zmm0 {dddd}
MVEX.EH=1
S2 S1 S0    Rounding Mode Override           Usage
1xx         SAE (Suppress-All-Exceptions)    , {sae}

Intel® C/C++ Compiler Intrinsic Equivalent
__m512i _mm512_cvtfxpnt_round_adjustps_epi32(__m512, int, _MM_EXP_ADJ_ENUM);
__m512i _mm512_mask_cvtfxpnt_round_adjustps_epi32(__m512i, __mmask16, __m512, int, _MM_EXP_ADJ_ENUM);

Exceptions
Real-Address Mode and Virtual-8086
#UD                Instruction not available in these modes

Protected and Compatibility Mode
#UD                Instruction not available in these modes

64 bit Mode
#SS(0)             If a memory address referencing the SS segment is in a non-canonical form.
#GP(0)             If the memory address is in a non-canonical form.
                   If a memory operand linear address is not aligned to the data size granularity dictated by SwizzUpConv mode.
#PF(fault-code)    For a page fault.
#NM                If CR0.TS[bit 3]=1.
                   If preceded by any REX, F0, F2, F3, or 66 prefixes.

VCVTFXPNTPS2UDQ - Convert Float32 Vector to Fixed Point Uint32 Vector

Opcode
MVEX.512.66.0F3A.W0 CA /r ib

Instruction
vcvtfxpntps2udq zmm1 {k1}, Sf32(zmm2/mt), imm8

Description
Convert float32 vector Sf32(zmm2/mt) to uint32, and store the result in zmm1, using imm8,
under write-mask.

Description
Performs an element-by-element conversion and rounding from the float32 vector result
of the swizzle/broadcast/conversion from memory or float32 vector zmm2 to a uint32
vector, with an optional exponent adjustment before the conversion.
Out-of-range values are converted to the nearest representable value, and NaNs convert
to 0. This makes the calculation of Exp2 more efficient: it avoids problems with converting
very large values to integers, where overflow could otherwise produce undetected incorrect
values. Table 6.8 describes the result for special floating-point inputs.

Input    Result
NaN      0
+∞       INT_MAX
+0       0
-0       0
−∞       INT_MIN

Table 6.8: Converting to integer special floating-point values behavior
This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.

Immediate Format

      Rounding Mode                                  I1  I0
rn    Round to Nearest (even)                        0   0
rd    Round Down (Round toward Negative Infinity)    0   1
ru    Round Up (Round toward Positive Infinity)      1   0
rz    Round toward Zero                              1   1

Exponent Adjustment   value                                  I7  I6  I5  I4
0                     2^0 (32.0 - no exponent adjustment)    0   0   0   0
4                     2^4 (28.4)                             0   0   0   1
5                     2^5 (27.5)                             0   0   1   0
8                     2^8 (24.8)                             0   0   1   1
16                    2^16 (16.16)                           0   1   0   0
24                    2^24 (8.24)                            0   1   0   1
31                    2^31 (1.31)                            0   1   1   0
32                    2^32 (0.32)                            0   1   1   1
reserved              *must UD*                              1   x   x   x
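
The immediate layout above can be summarized as: bits I1:I0 select the rounding mode, bits I6:I4 select one of eight exponent adjustments, and I7 must be 0. A small sketch in C that builds imm8 from those two fields follows. The helper name fxpnt_imm8 and its error convention are hypothetical.

```c
#include <stdint.h>

/* Exponent adjustments selectable through I6:I4, in encoding order. */
static const unsigned EXPADJ_SHIFTS[8] = {0, 4, 5, 8, 16, 24, 31, 32};

/* Build imm8 for a rounding mode (0 = rn, 1 = rd, 2 = ru, 3 = rz) and an
   exponent adjustment given as a power of two. Returns -1 if either
   argument has no encoding; I7 is always left at 0. */
static int fxpnt_imm8(unsigned rounding, unsigned exp_shift)
{
    if (rounding > 3u)
        return -1;
    for (unsigned idx = 0; idx < 8; idx++)
        if (EXPADJ_SHIFTS[idx] == exp_shift)
            return (int)((idx << 4) | rounding);
    return -1;
}
```

For example, round-to-nearest with the 8.24 format selects index 5 in I6:I4 and 00 in I1:I0, giving imm8 = 0x50.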

Operation
RoundingMode = IMM8[1:0]
expadj = IMM8[6:4]
if (source is a register operand and MVEX.EH bit is 1) {
    if (SSS[2]==1) Supress_Exception_Flags() // SAE
    tmpSrc2[511:0] = zmm2[511:0]
} else {
    tmpSrc2[511:0] = SwizzUpConvLoadf32(zmm2/mt)
}
for (n = 0; n < 16; n++) {
    if (k1[n] != 0) {
        i = 32*n
        zmm1[i+31:i] =
            CvtFloat32ToUint32(tmpSrc2[i+31:i] * EXPADJ_TABLE[expadj], RoundingMode)
    }
}
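
A scalar sketch of the per-element conversion may help when reasoning about saturation. It is not the hardware definition. It assumes round-to-nearest-even only, assumes exp_shift is one of the adjustments in the immediate table (at most 32), and assumes that out-of-range inputs saturate to the uint32 limits 0 and UINT32_MAX, which is one reading of "nearest representable value"; Table 6.8 lists INT_MAX and INT_MIN instead. The function name is hypothetical.

```c
#include <stdint.h>

/* Model of one element: scale by 2^exp_shift, round to nearest (even),
   saturate to the uint32 range, and map NaN to 0. */
static uint32_t cvt_f32_to_u32_fxp_rn(float x, unsigned exp_shift)
{
    if (x != x)                                   /* NaN converts to 0 */
        return 0;
    double scaled = (double)x * (double)(1ULL << exp_shift);
    if (scaled <= 0.0)                            /* negatives saturate to 0 (assumed) */
        return 0;
    if (scaled >= 4294967295.0)                   /* saturate to UINT32_MAX (assumed) */
        return UINT32_MAX;
    uint64_t whole = (uint64_t)scaled;            /* truncation equals floor here */
    double frac = scaled - (double)whole;
    if (frac > 0.5 || (frac == 0.5 && (whole & 1u)))
        whole++;                                  /* ties go to the even integer */
    return (uint32_t)whole;
}
```

The multiplication is exact because the scale factor is a power of two and the product of a float32 value with at most 2^32 fits in a double without rounding.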

SIMD Floating-Point Exceptions
Invalid, Precision.

Denormal Handling
Treat Input Denormals As Zeros :
(MXCSR.DAZ)? YES : NO
Flush Tiny Results To Zero :
Not Applicable

Memory Up-conversion: Sf32
S2 S1 S0   Function:                    Usage                      disp8*N
000        no conversion                [rax] {16to16} or [rax]    64
001        broadcast 1 element (x16)    [rax] {1to16}              4
010        broadcast 4 elements (x4)    [rax] {4to16}              16
011        float16 to float32           [rax] {float16}            32
100        uint8 to float32             [rax] {uint8}              16
110        uint16 to float32            [rax] {uint16}             32
111        sint16 to float32            [rax] {sint16}             32

Register Swizzle: Sf32
MVEX.EH=0
S2 S1 S0   Function: 4 x 32 bits    Usage
000        no swizzle               zmm0 or zmm0 {dcba}
001        swap (inner) pairs       zmm0 {cdab}
010        swap with two-away       zmm0 {badc}
011        cross-product swizzle    zmm0 {dacb}
100        broadcast a element      zmm0 {aaaa}
101        broadcast b element      zmm0 {bbbb}
110        broadcast c element      zmm0 {cccc}
111        broadcast d element      zmm0 {dddd}
MVEX.EH=1
S2 S1 S0   Rounding Mode Override           Usage
1xx        SAE (Suppress-All-Exceptions)    , {sae}

Intel® C/C++ Compiler Intrinsic Equivalent
__m512i _mm512_cvtfxpnt_round_adjustps_epu32(__m512, int, _MM_EXP_ADJ_ENUM);
__m512i _mm512_mask_cvtfxpnt_round_adjustps_epu32(__m512i, __mmask16, __m512, int, _MM_EXP_ADJ_ENUM);

Exceptions
Real-Address Mode and Virtual-8086
#UD              Instruction not available in these modes

Protected and Compatibility Mode
#UD              Instruction not available in these modes

64 bit Mode
#SS(0)           If a memory address referencing the SS segment is in a non-canonical form.
#GP(0)           If the memory address is in a non-canonical form.
                 If a memory operand linear address is not aligned to the data size
                 granularity dictated by SwizzUpConv mode.
#PF(fault-code)  For a page fault.
#NM              If CR0.TS[bit 3]=1.
                 If preceded by any REX, F0, F2, F3, or 66 prefixes.

VCVTFXPNTUDQ2PS - Convert Fixed Point Uint32 Vector to Float32 Vector

Opcode
MVEX.512.0F3A.W0 CA /r ib

Instruction
vcvtfxpntudq2ps zmm1 {k1}, Si32(zmm2/mt), imm8

Description
Convert uint32 vector Si32(zmm2/mt) to float32, and store the result in zmm1, using imm8,
under write-mask.

Description
Performs an element-by-element conversion from the uint32 vector result of the
swizzle/broadcast/conversion from memory or uint32 vector zmm2 to a float32 vector, then
performs an optional adjustment to the exponent.
This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.

Immediate Format

Exponent Adjustment   value                                  I7  I6  I5  I4
0                     2^0 (32.0 - no exponent adjustment)    0   0   0   0
4                     2^4 (28.4)                             0   0   0   1
5                     2^5 (27.5)                             0   0   1   0
8                     2^8 (24.8)                             0   0   1   1
16                    2^16 (16.16)                           0   1   0   0
24                    2^24 (8.24)                            0   1   0   1
31                    2^31 (1.31)                            0   1   1   0
32                    2^32 (0.32)                            0   1   1   1
reserved              *must UD*                              1   x   x   x

Operation
expadj = IMM8[6:4]
if (source is a register operand and MVEX.EH bit is 1) {
    if (SSS[2]==1) Supress_Exception_Flags() // SAE
    // SSS are bits 6-4 from the MVEX prefix encoding. For more details, see Table 2.14
    RoundingMode = SSS[1:0]
    tmpSrc2[511:0] = zmm2[511:0]
} else {
    RoundingMode = MXCSR.RC
    tmpSrc2[511:0] = SwizzUpConvLoadi32(zmm2/mt)
}
for (n = 0; n < 16; n++) {
    if (k1[n] != 0) {
        i = 32*n
        zmm1[i+31:i] =
            CvtUint32ToFloat32(tmpSrc2[i+31:i], RoundingMode) / EXPADJ_TABLE[expadj]
    }
}
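
As a scalar illustration of the loop body, the sketch below converts one uint32 element and then divides by the selected power of two. It assumes the C implementation's default round-to-nearest conversion stands in for RoundingMode and that exp_shift is one of the table values (at most 32). The function name is hypothetical.

```c
#include <stdint.h>

/* Model of one element: convert to float32 first (this step may round),
   then divide by 2^exp_shift (exact, since the divisor is a power of two
   and the quotient cannot underflow for these shifts). */
static float cvt_u32_fxp_to_f32(uint32_t u, unsigned exp_shift)
{
    float f = (float)u;
    return f / (float)(1ULL << exp_shift);
}
```

With exp_shift = 24 the input is read as an 8.24 fixed-point number, so 0x00C00000 converts to 0.75f.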

SIMD Floating-Point Exceptions
Precision.

Denormal Handling
Treat Input Denormals As Zeros :
Not Applicable
Flush Tiny Results To Zero :
Not Applicable

Memory Up-conversion: Si32
S2 S1 S0   Function:                    Usage                      disp8*N
000        no conversion                [rax] {16to16} or [rax]    64
001        broadcast 1 element (x16)    [rax] {1to16}              4
010        broadcast 4 elements (x4)    [rax] {4to16}              16
011        reserved                     N/A                        N/A
100        uint8 to uint32              [rax] {uint8}              16
101        sint8 to sint32              [rax] {sint8}              16
110        uint16 to uint32             [rax] {uint16}             32
111        sint16 to sint32             [rax] {sint16}             32

Register Swizzle: Si32
MVEX.EH=0
S2 S1 S0   Function: 4 x 32 bits    Usage
000        no swizzle               zmm0 or zmm0 {dcba}
001        swap (inner) pairs       zmm0 {cdab}
010        swap with two-away       zmm0 {badc}
011        cross-product swizzle    zmm0 {dacb}
100        broadcast a element      zmm0 {aaaa}
101        broadcast b element      zmm0 {bbbb}
110        broadcast c element      zmm0 {cccc}
111        broadcast d element      zmm0 {dddd}
MVEX.EH=1
S2 S1 S0   Rounding Mode Override           Usage
1xx        SAE (Suppress-All-Exceptions)    , {sae}

Intel® C/C++ Compiler Intrinsic Equivalent

Exceptions
Real-Address Mode and Virtual-8086
#UD              Instruction not available in these modes

Protected and Compatibility Mode
#UD              Instruction not available in these modes

64 bit Mode
#SS(0)           If a memory address referencing the SS segment is in a non-canonical form.
#GP(0)           If the memory address is in a non-canonical form.
                 If a memory operand linear address is not aligned to the data size
                 granularity dictated by SwizzUpConv mode.
#PF(fault-code)  For a page fault.
#NM              If CR0.TS[bit 3]=1.
                 If preceded by any REX, F0, F2, F3, or 66 prefixes.

VCVTPD2PS - Convert Float64 Vector to Float32 Vector

Opcode
MVEX.512.66.0F.W1 5A /r

Instruction
vcvtpd2ps zmm1 {k1}, Sf64(zmm2/mt)

Description
Convert float64 vector Sf64(zmm2/mt) to float32, and store the result in zmm1, under
write-mask.

Description
Performs an element-by-element conversion and rounding from the float64 vector result
of the swizzle/broadcast/conversion from memory or float64 vector zmm2 to a float32
vector. The eight float32 results are written into the lower half of the destination
register zmm1; the upper half of the destination is set to zero.

Input    Result
NaN      Quietized NaN. Copy leading bits of float64 significand
+∞       +∞
+0       +0
-0       −0
−∞       −∞

Table 6.9: Converting float64 to float32 special values behavior
This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.

Operation
if (source is a register operand and MVEX.EH bit is 1) {
    if (SSS[2]==1) Supress_Exception_Flags() // SAE
    // SSS are bits 6-4 from the MVEX prefix encoding. For more details, see Table 2.14
    RoundingMode = SSS[1:0]
    tmpSrc2[511:0] = zmm2[511:0]
} else {
    RoundingMode = MXCSR.RC
    tmpSrc2[511:0] = SwizzUpConvLoadf64(zmm2/mt)
}
for (n = 0; n < 8; n++) {
    if (k1[n] != 0) {
        i = 64*n
        j = 32*n
        zmm1[j+31:j] =
            CvtFloat64ToFloat32(tmpSrc2[i+63:i], RoundingMode)
    }
}
zmm1[511:256] = 0
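
The element placement in the pseudocode above (eight narrowed results in the low half, zeros in the high half, write-mask applied per element) can be sketched in scalar C. The sketch assumes the default round-to-nearest narrowing of the C implementation in place of RoundingMode. The function name and the array representation of zmm registers are hypothetical.

```c
#include <stdint.h>

/* dst models zmm1 as 16 float32 elements; src models the 8 float64
   source elements; k1 holds the 8 write-mask bits. */
static void cvtpd2ps_model(float dst[16], const double src[8], uint8_t k1)
{
    for (int n = 0; n < 8; n++)
        if (k1 & (1u << n))
            dst[n] = (float)src[n];   /* masked-off elements keep old values */
    for (int n = 8; n < 16; n++)
        dst[n] = 0.0f;                /* zmm1[511:256] = 0, regardless of mask */
}
```

Note that, as written in the pseudocode, the upper half is cleared unconditionally, while the lower half honors the write-mask.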

SIMD Floating-Point Exceptions
Overflow, Underflow, Invalid, Precision, Denormal.

Denormal Handling
Treat Input Denormals As Zeros :
(MXCSR.DAZ)? YES : NO
Flush Tiny Results To Zero :
(MXCSR.FZ)? YES : NO

Memory Up-conversion: Sf64
S2 S1 S0   Function:                    Usage                    disp8*N
000        no conversion                [rax] {8to8} or [rax]    64
001        broadcast 1 element (x8)     [rax] {1to8}             8
010        broadcast 4 elements (x2)    [rax] {4to8}             32
011        reserved                     N/A                      N/A
100        reserved                     N/A                      N/A
101        reserved                     N/A                      N/A
110        reserved                     N/A                      N/A
111        reserved                     N/A                      N/A

Register Swizzle: Sf64
MVEX.EH=0
S2 S1 S0   Function: 4 x 64 bits    Usage
000        no swizzle               zmm0 or zmm0 {dcba}
001        swap (inner) pairs       zmm0 {cdab}
010        swap with two-away       zmm0 {badc}
011        cross-product swizzle    zmm0 {dacb}
100        broadcast a element      zmm0 {aaaa}
101        broadcast b element      zmm0 {bbbb}
110        broadcast c element      zmm0 {cccc}
111        broadcast d element      zmm0 {dddd}
MVEX.EH=1
S2 S1 S0   Rounding Mode Override              Usage
000        Round To Nearest (even)             , {rn}
001        Round Down (-INF)                   , {rd}
010        Round Up (+INF)                     , {ru}
011        Round Toward Zero                   , {rz}
100        Round To Nearest (even) with SAE    , {rn-sae}
101        Round Down (-INF) with SAE          , {rd-sae}
110        Round Up (+INF) with SAE            , {ru-sae}
111        Round Toward Zero with SAE          , {rz-sae}

Intel® C/C++ Compiler Intrinsic Equivalent
__m512 _mm512_cvtpd_pslo(__m512d);
__m512 _mm512_mask_cvtpd_pslo(__m512d, __mmask8, __m512d);

Exceptions
Real-Address Mode and Virtual-8086
#UD              Instruction not available in these modes

Protected and Compatibility Mode
#UD              Instruction not available in these modes

64 bit Mode
#SS(0)           If a memory address referencing the SS segment is in a non-canonical form.
#GP(0)           If the memory address is in a non-canonical form.
                 If a memory operand linear address is not aligned to the data size
                 granularity dictated by SwizzUpConv mode.
#PF(fault-code)  For a page fault.
#NM              If CR0.TS[bit 3]=1.
                 If preceded by any REX, F0, F2, F3, or 66 prefixes.

VCVTPS2PD - Convert Float32 Vector to Float64 Vector

Opcode
MVEX.512.0F.W0 5A /r

Instruction
vcvtps2pd zmm1 {k1}, Sf32(zmm2/mt)

Description
Convert float32 vector Sf32(zmm2/mt) to float64, and store the result in zmm1, under
write-mask.

Description
Performs an element-by-element conversion and rounding from the float32 vector result
of the swizzle/broadcast/conversion from memory or float32 vector zmm2 to a float64
vector. The result is written into float64 vector zmm1. The float32 source is read from
either the lower half of the source operand (float32 vector zmm2), the full memory source
(8 elements, i.e. 256 bits) or the broadcast memory source.
This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.

Operation
if (source is a register operand and MVEX.EH bit is 1) {
    if (SSS[2]==1) Supress_Exception_Flags() // SAE
    tmpSrc2[255:0] = zmm2[255:0]
} else {
    tmpSrc2[255:0] = SwizzUpConvLoadf32(zmm2/mt)
}
for (n = 0; n < 8; n++) {
    if (k1[n] != 0) {
        i = 64*n
        j = 32*n
        zmm1[i+63:i] =
            CvtFloat32ToFloat64(tmpSrc2[j+31:j])
    }
}

SIMD Floating-Point Exceptions
Invalid, Denormal.

Denormal Handling
Treat Input Denormals As Zeros :
(MXCSR.DAZ)? YES : NO
Flush Tiny Results To Zero :
Not Applicable

Memory Up-conversion: Sf32
S2 S1 S0   Function:                    Usage                    disp8*N
000        no conversion                [rax] {8to8} or [rax]    32
001        broadcast 1 element (x8)     [rax] {1to8}             4
010        broadcast 4 elements (x2)    [rax] {4to8}             16
011        reserved                     N/A                      N/A
100        reserved                     N/A                      N/A
101        reserved                     N/A                      N/A
110        reserved                     N/A                      N/A
111        reserved                     N/A                      N/A

Register Swizzle: Sf32
MVEX.EH=0
S2 S1 S0   Function: 4 x 32 bits    Usage
000        no swizzle               zmm0 or zmm0 {dcba}
001        swap (inner) pairs       zmm0 {cdab}
010        swap with two-away       zmm0 {badc}
011        cross-product swizzle    zmm0 {dacb}
100        broadcast a element      zmm0 {aaaa}
101        broadcast b element      zmm0 {bbbb}
110        broadcast c element      zmm0 {cccc}
111        broadcast d element      zmm0 {dddd}
MVEX.EH=1
S2 S1 S0   Rounding Mode Override           Usage
1xx        SAE (Suppress-All-Exceptions)    , {sae}

Intel® C/C++ Compiler Intrinsic Equivalent
__m512d _mm512_cvtpslo_pd(__m512);
__m512d _mm512_mask_cvtpslo_pd(__m512d, __mmask8, __m512);

Exceptions
Real-Address Mode and Virtual-8086
#UD              Instruction not available in these modes

Protected and Compatibility Mode
#UD              Instruction not available in these modes

64 bit Mode
#SS(0)           If a memory address referencing the SS segment is in a non-canonical form.
#GP(0)           If the memory address is in a non-canonical form.
                 If a memory operand linear address is not aligned to 4, 16 or 32-byte
                 (depending on the swizzle broadcast).
#PF(fault-code)  For a page fault.
#NM              If CR0.TS[bit 3]=1.
                 If preceded by any REX, F0, F2, F3, or 66 prefixes.
                 This instruction does not support any SwizzUpConv involving data
                 conversion. If SwizzUpConvMem function from memory is set to any value
                 different than "no action", {1to8} or {4to8}, then an Invalid Opcode
                 fault is raised. Note that this rule only applies to memory conversions
                 (register swizzles are allowed).

VCVTUDQ2PD - Convert Uint32 Vector to Float64 Vector

Opcode
MVEX.512.F3.0F.W0 7A /r

Instruction
vcvtudq2pd zmm1 {k1}, Si32(zmm2/mt)

Description
Convert uint32 vector Si32(zmm2/mt) to float64, and store the result in zmm1, under
write-mask.

Description
Performs an element-by-element conversion from the uint32 vector result of the
swizzle/broadcast/conversion from memory or uint32 vector zmm2 to a float64 vector. The
result is written into float64 vector zmm1. The uint32 source is read from either the lower
half of the source operand (uint32 vector zmm2), the full memory source (8 elements, i.e.
256 bits) or the broadcast memory source.
This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.

Operation
if (source is a register operand and MVEX.EH bit is 1) {
    tmpSrc2[255:0] = zmm2[255:0]
} else {
    tmpSrc2[255:0] = SwizzUpConvLoadi32(zmm2/mt)
}
for (n = 0; n < 8; n++) {
    if (k1[n] != 0) {
        i = 64*n
        j = 32*n
        zmm1[i+63:i] =
            CvtUint32ToFloat64(tmpSrc2[j+31:j])
    }
}

SIMD Floating-Point Exceptions
None.

Denormal Handling
Treat Input Denormals As Zeros :
Not Applicable
Flush Tiny Results To Zero :
Not Applicable

Memory Up-conversion: Si32
S2 S1 S0   Function:                    Usage                    disp8*N
000        no conversion                [rax] {8to8} or [rax]    32
001        broadcast 1 element (x8)     [rax] {1to8}             4
010        broadcast 4 elements (x2)    [rax] {4to8}             16
011        reserved                     N/A                      N/A
100        reserved                     N/A                      N/A
101        reserved                     N/A                      N/A
110        reserved                     N/A                      N/A
111        reserved                     N/A                      N/A

Register Swizzle: Si32
MVEX.EH=0
S2 S1 S0   Function: 4 x 32 bits    Usage
000        no swizzle               zmm0 or zmm0 {dcba}
001        swap (inner) pairs       zmm0 {cdab}
010        swap with two-away       zmm0 {badc}
011        cross-product swizzle    zmm0 {dacb}
100        broadcast a element      zmm0 {aaaa}
101        broadcast b element      zmm0 {bbbb}
110        broadcast c element      zmm0 {cccc}
111        broadcast d element      zmm0 {dddd}

Intel® C/C++ Compiler Intrinsic Equivalent
__m512d _mm512_cvtepu32lo_pd(__m512i);
__m512d _mm512_mask_cvtepu32lo_pd(__m512d, __mmask8, __m512i);

Exceptions
Real-Address Mode and Virtual-8086
#UD              Instruction not available in these modes

Protected and Compatibility Mode
#UD              Instruction not available in these modes

64 bit Mode
#SS(0)           If a memory address referencing the SS segment is in a non-canonical form.
#GP(0)           If the memory address is in a non-canonical form.
                 If a memory operand linear address is not aligned to 4, 16 or 32-byte
                 (depending on the swizzle broadcast).
#PF(fault-code)  For a page fault.
#NM              If CR0.TS[bit 3]=1.
                 If preceded by any REX, F0, F2, F3, or 66 prefixes.
                 This instruction does not support any SwizzUpConv involving data
                 conversion. If SwizzUpConvMem function from memory is set to any value
                 different than "no action", {1to8} or {4to8}, then an Invalid Opcode
                 fault is raised. Note that this rule only applies to memory conversions
                 (register swizzles are allowed).

VEXP223PS - Base-2 Exponential Calculation of Float32 Vector

Opcode
MVEX.512.66.0F38.W0 C8 /r

Instruction
vexp223ps zmm1 {k1}, zmm2/mt

Description
Calculate the approx. exp2 from int32 vector zmm2/mt and store the result in zmm1, under
write-mask.

Description
Computes the element-by-element base-2 exponential of the int32 vector in memory or
int32 vector zmm2, with a relative error of 0.99 ULP. Input int32 values are treated as
fixed-point numbers with a fraction offset of 24 bits (i.e. the 8 MSBs correspond to the
sign and integer part; the 24 LSBs correspond to the fractional part). The result is
written into float32 vector zmm1.
exp2 of a floating-point input value is computed as a two-instruction sequence:
1. vcvtfxpntps2dq (with exponent adjustment, so that the destination format is 32 bits,
with 8 bits for the integer part and 24 bits for the fractional part)
2. vexp223ps
All overflows are captured by the combination of the saturating behavior of the
vcvtfxpntps2dq instruction and the detection of MAX_INT/MIN_INT by the vexp223ps
instruction. Tiny input numbers are quietly flushed to the fixed-point value 0 by the
vcvtfxpntps2dq instruction, which produces an overall output exp2(0) = 1.0f.
The overall behavior of the two-instruction sequence is the following:
• −∞ returns +0.0f
• ±0.0f returns 1.0f (exact result)
• +∞ returns +∞ (#Overflow)
• NaN returns 1.0f (#Invalid)
• n, where n is an integral value, returns 2^n (exact result)

Input      Result    Comments
MIN_INT    +0.0f
MAX_INT    +∞        Raise #O flag

Table 6.10: vexp2_1ulp() special int values behavior
This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.
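
The first step of the two-instruction sequence can be sketched in scalar C to show how saturation and tiny-value flushing produce the MAX_INT, MIN_INT and 0 inputs that vexp223ps then handles. This is an illustration under assumptions, not the hardware definition: it models round-to-nearest-even only, and the function name f32_to_fx8_24 is hypothetical.

```c
#include <stdint.h>

/* Convert a float32 value to the signed 8.24 fixed-point format expected
   by vexp223ps: scale by 2^24, round to nearest (even), saturate to the
   int32 range, and map NaN to 0. */
static int32_t f32_to_fx8_24(float x)
{
    if (x != x)                                    /* NaN converts to 0 */
        return 0;
    double scaled = (double)x * 16777216.0;        /* exact: 2^24 scale */
    if (scaled >= 2147483647.0)
        return INT32_MAX;                          /* overflow saturates */
    if (scaled <= -2147483648.0)
        return INT32_MIN;
    double mag = scaled < 0.0 ? -scaled : scaled;
    int64_t whole = (int64_t)mag;
    double frac = mag - (double)whole;
    if (frac > 0.5 || (frac == 0.5 && (whole & 1)))
        whole++;                                   /* ties go to even */
    return (int32_t)(scaled < 0.0 ? -whole : whole);
}
```

In this model any input of magnitude 128 or more saturates, which is what lets the second instruction return +∞ or +0.0f, and a tiny input such as 1e-20f becomes the fixed-point value 0, for which exp2 is 1.0f.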

Operation
tmpSrc2[511:0] = zmm2/mt
if (source is a register operand and MVEX.EH bit is 1) {
    if (SSS[2]==1) Supress_Exception_Flags() // SAE
}
for (n = 0; n < 16; n++) {
    if (k1[n] != 0) {
        i = 32*n
        zmm1[i+31:i] = exp2_1ulp(tmpSrc2[i+31:i])
    }
}

SIMD Floating-Point Exceptions
Overflow.

Denormal Handling
Treat Input Denormals As Zeros :
Not Applicable
Flush Tiny Results To Zero :
YES

Register Swizzle
MVEX.EH=0
S2 S1 S0   Function: 4 x 32 bits    Usage
000        no swizzle               zmm0 or zmm0 {dcba}
001        reserved                 N/A
010        reserved                 N/A
011        reserved                 N/A
100        reserved                 N/A
101        reserved                 N/A
110        reserved                 N/A
111        reserved                 N/A
MVEX.EH=1
S2 S1 S0   Rounding Mode Override           Usage
1xx        SAE (Suppress-All-Exceptions)    , {sae}

Intel® C/C++ Compiler Intrinsic Equivalent
__m512 _mm512_exp223_ps(__m512i);
__m512 _mm512_mask_exp223_ps(__m512, __mmask16, __m512i);

Exceptions
Real-Address Mode and Virtual-8086
#UD              Instruction not available in these modes

Protected and Compatibility Mode
#UD              Instruction not available in these modes

64 bit Mode
#SS(0)           If a memory address referencing the SS segment is in a non-canonical form.
#GP(0)           If the memory address is in a non-canonical form.
#PF(fault-code)  For a page fault.
#NM              If CR0.TS[bit 3]=1.
                 If preceded by any REX, F0, F2, F3, or 66 prefixes.
                 This instruction does not support any SwizzUpConv different from the
                 default value (no broadcast, no conversion). If SwizzUpConv function is
                 set to any value different than "no action", then an Invalid Opcode
                 fault is raised. This includes register swizzles.

VFIXUPNANPD - Fix Up Special Float64 Vector Numbers With NaN Passthrough

Opcode
MVEX.NDS.512.66.0F38.W1 55 /r

Instruction
vfixupnanpd zmm1 {k1}, zmm2, Si64(zmm3/mt)

Description
Fix up, with NaN passthrough, special numbers in float64 vector zmm1, float64 vector zmm2
and int64 vector Si64(zmm3/mt) and store the result in zmm1, under write-mask.

Description
Performs an element-by-element fix-up of various real and special number types in
the float64 vector zmm2 using the 21-bit table values from the result of the
swizzle/broadcast/conversion process on memory or int64 vector zmm3. The result is
merged into float64 vector zmm1. Unlike in vfixuppd, source NaN values are passed
through as quietized values. Note that, also unlike in vfixup, this quietization translates
into a #IE exception flag being reported for input SNaNs.
This instruction is specifically intended for use in fixing up the results of arithmetic
calculations involving one source, although it is generally useful for fixing up the results
of multiple-instruction sequences to reflect special-number inputs. For example, consider
rcp(0). Input 0 to rcp, and you should get inf. However, evaluating rcp via 2x − ax^2
(Newton-Raphson), where x = approx(1/0) = ∞, incorrectly yields NaN. To deal with
this, vfixupps can be used after the N-R reciprocal sequence to set the result to ∞ when
the input is 0.
Denormal inputs must be treated as zeros of the same sign if DAZ is enabled.
Note that NO_CHANGE_TOKEN leaves the destination (output) unchanged. This means
that if the destination is a denormal, its value is not flushed to 0.
This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.

Operation
enum TOKEN_TYPE
{
    NO_CHANGE_TOKEN  = 0,
    NEG_INF_TOKEN    = 1,
    NEG_ZERO_TOKEN   = 2,
    POS_ZERO_TOKEN   = 3,
    POS_INF_TOKEN    = 4,
    NAN_TOKEN        = 5,
    MAX_DOUBLE_TOKEN = 6,
    MIN_DOUBLE_TOKEN = 7,
}
if (source is a register operand and MVEX.EH bit is 1) {
    if (SSS[2]==1) Supress_Exception_Flags() // SAE
    tmpzmm3[511:0] = zmm3[511:0]
} else {
    tmpzmm3[511:0] = SwizzUpConvLoadi64(zmm3/mt)
}
for (n = 0; n < 8; n++) {
    if (k1[n] != 0) {
        i = 64*n
        tsrc[63:0] = zmm2[i+63:i]
        if (IsNaN(tsrc[63:0]))
        {
            zmm1[i+63:i] = QNaN(zmm2[i+63:i])
        }
        else
        {
            // tmp is an int value
            if      (tsrc[63:0] == -inf) tmp = 0
            else if (tsrc[63:0] < 0)     tmp = 1
            else if (tsrc[63:0] == -0)   tmp = 2
            else if (tsrc[63:0] == +0)   tmp = 3
            else if (tsrc[63:0] == inf)  tmp = 5
            else /* tsrc[63:0] > 0 */    tmp = 4
            table[20:0] = tmpzmm3[i+63:i]     // table is viewed as one 21-bit
                                              // little-endian value.
            token = table[(tmp*3)+2: tmp*3]   // token is an int value
                                              // the 7th entry is unused

            // float64 result
            if      (token == NEG_INF_TOKEN)    zmm1[i+63:i] = -inf
            else if (token == NEG_ZERO_TOKEN)   zmm1[i+63:i] = -0
            else if (token == POS_ZERO_TOKEN)   zmm1[i+63:i] = +0
            else if (token == POS_INF_TOKEN)    zmm1[i+63:i] = +inf
            else if (token == NAN_TOKEN)        zmm1[i+63:i] = QNaN_indefinite
            else if (token == MAX_DOUBLE_TOKEN) zmm1[i+63:i] = NMAX
            else if (token == MIN_DOUBLE_TOKEN) zmm1[i+63:i] = -NMAX
            else if (token == NO_CHANGE_TOKEN)  { /* zmm1[i+63:i] remains unchanged */ }
        }
    }
}

SIMD Floating-Point Exceptions
Invalid.

Denormal Handling
Treat Input Denormals As Zeros :
(MXCSR.DAZ)? YES : NO
Flush Tiny Results To Zero :
NO

Memory Up-conversion: Si64
S2 S1 S0   Function:                    Usage                    disp8*N
000        no conversion                [rax] {8to8} or [rax]    64
001        broadcast 1 element (x8)     [rax] {1to8}             8
010        broadcast 4 elements (x2)    [rax] {4to8}             32
011        reserved                     N/A                      N/A
100        reserved                     N/A                      N/A
101        reserved                     N/A                      N/A
110        reserved                     N/A                      N/A
111        reserved                     N/A                      N/A

Register Swizzle: Si64
MVEX.EH=0
S2 S1 S0   Function: 4 x 64 bits    Usage
000        no swizzle               zmm0 or zmm0 {dcba}
001        swap (inner) pairs       zmm0 {cdab}
010        swap with two-away       zmm0 {badc}
011        cross-product swizzle    zmm0 {dacb}
100        broadcast a element      zmm0 {aaaa}
101        broadcast b element      zmm0 {bbbb}
110        broadcast c element      zmm0 {cccc}
111        broadcast d element      zmm0 {dddd}
MVEX.EH=1
S2 S1 S0   Rounding Mode Override           Usage
1xx        SAE (Suppress-All-Exceptions)    , {sae}

Intel® C/C++ Compiler Intrinsic Equivalent
__m512d _mm512_fixupnan_pd(__m512d, __m512d, __m512i);
__m512d _mm512_mask_fixupnan_pd(__m512d, __mmask8, __m512d, __m512i);

Exceptions
Real-Address Mode and Virtual-8086
#UD              Instruction not available in these modes

Protected and Compatibility Mode
#UD              Instruction not available in these modes

64 bit Mode
#SS(0)           If a memory address referencing the SS segment is in a non-canonical form.
#GP(0)           If the memory address is in a non-canonical form.
                 If a memory operand linear address is not aligned to the data size
                 granularity dictated by SwizzUpConv mode.
#PF(fault-code)  For a page fault.
#NM              If CR0.TS[bit 3]=1.
                 If preceded by any REX, F0, F2, F3, or 66 prefixes.

VFIXUPNANPS - Fix Up Special Float32 Vector Numbers With NaN Passthrough

Opcode
MVEX.NDS.512.66.0F38.W0 55 /r

Instruction
vfixupnanps zmm1 {k1}, zmm2, Si32(zmm3/mt)

Description
Fix up, with NaN passthrough, special numbers in float32 vector zmm1, float32 vector zmm2
and int32 vector Si32(zmm3/mt) and store the result in zmm1, under write-mask.

Description
Performs an element-by-element fix-up of various real and special number types in
the float32 vector zmm2 using the 21-bit table values from the result of the
swizzle/broadcast/conversion process on memory or int32 vector zmm3. The result is
merged into float32 vector zmm1. Unlike in vfixupps, source NaN values are passed
through as quietized values. Note that, also unlike in vfixup, this quietization translates
into a #IE exception flag being reported for input SNaNs.
This instruction is specifically intended for use in fixing up the results of arithmetic
calculations involving one source, although it is generally useful for fixing up the results
of multiple-instruction sequences to reflect special-number inputs. For example, consider
rcp(0). Input 0 to rcp, and you should get inf. However, evaluating rcp via 2x − ax^2
(Newton-Raphson), where x = approx(1/0) = ∞, incorrectly yields NaN. To deal with
this, vfixupps can be used after the N-R reciprocal sequence to set the result to ∞ when
the input is 0.
Denormal inputs must be treated as zeros of the same sign if DAZ is enabled.
Note that NO_CHANGE_TOKEN leaves the destination (output) unchanged. This means
that if the destination is a denormal, its value is not flushed to 0.
This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.

Operation
enum TOKEN_TYPE
{
    NO_CHANGE_TOKEN = 0,
    NEG_INF_TOKEN   = 1,
    NEG_ZERO_TOKEN  = 2,
    POS_ZERO_TOKEN  = 3,
    POS_INF_TOKEN   = 4,
    NAN_TOKEN       = 5,
    MAX_FLOAT_TOKEN = 6,
    MIN_FLOAT_TOKEN = 7,
}
if (source is a register operand and MVEX.EH bit is 1) {
    if (SSS[2]==1) Supress_Exception_Flags() // SAE
    tmpzmm3[511:0] = zmm3[511:0]
} else {
    tmpzmm3[511:0] = SwizzUpConvLoadi32(zmm3/mt)
}
for (n = 0; n < 16; n++) {
    if (k1[n] != 0) {
        i = 32*n
        tsrc[31:0] = zmm2[i+31:i]
        if (IsNaN(tsrc[31:0]))
        {
            zmm1[i+31:i] = QNaN(zmm2[i+31:i])
        }
        else
        {
            // tmp is an int value
            if      (tsrc[31:0] == -inf) tmp = 0
            else if (tsrc[31:0] < 0)     tmp = 1
            else if (tsrc[31:0] == -0)   tmp = 2
            else if (tsrc[31:0] == +0)   tmp = 3
            else if (tsrc[31:0] == inf)  tmp = 5
            else /* tsrc[31:0] > 0 */    tmp = 4
            table[20:0] = tmpzmm3[i+31:i]     // table is viewed as one 21-bit
                                              // little-endian value.
            token = table[(tmp*3)+2: tmp*3]   // token is an int value
                                              // the 7th entry is unused

            // float32 result
            if      (token == NEG_INF_TOKEN)   zmm1[i+31:i] = -inf
            else if (token == NEG_ZERO_TOKEN)  zmm1[i+31:i] = -0
            else if (token == POS_ZERO_TOKEN)  zmm1[i+31:i] = +0
            else if (token == POS_INF_TOKEN)   zmm1[i+31:i] = +inf
            else if (token == NAN_TOKEN)       zmm1[i+31:i] = QNaN_indefinite
            else if (token == MAX_FLOAT_TOKEN) zmm1[i+31:i] = NMAX
            else if (token == MIN_FLOAT_TOKEN) zmm1[i+31:i] = -NMAX
            else if (token == NO_CHANGE_TOKEN) { /* zmm1[i+31:i] remains unchanged */ }
        }
    }
}
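
The class-to-token lookup in the pseudocode can be sketched in scalar C. The sketch covers only non-NaN float32 inputs, ignores DAZ, and uses hypothetical helper names (f32_from_bits, fixup_class_f32, fixup_token); it is meant to show how the 21-bit table is indexed, not to reproduce the full instruction.

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret a 32-bit pattern as a float32 value. */
static float f32_from_bits(uint32_t bits)
{
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Class index (the "tmp" value above) for a non-NaN input:
   0 = -inf, 1 = negative, 2 = -0, 3 = +0, 4 = positive, 5 = +inf. */
static unsigned fixup_class_f32(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);
    if (bits == 0xFF800000u) return 0;
    if (bits == 0x80000000u) return 2;
    if (bits == 0x00000000u) return 3;
    if (bits == 0x7F800000u) return 5;
    return (bits >> 31) ? 1u : 4u;
}

/* Extract the 3-bit token for a class from the 21-bit table. */
static unsigned fixup_token(uint32_t table, unsigned cls)
{
    return (table >> (3u * cls)) & 7u;
}
```

For the rcp(0) example in the description, a table with POS_INF_TOKEN (4) in the +0 slot, that is 4 << 9 = 0x800, maps an input of +0 to +inf and leaves every other class at NO_CHANGE_TOKEN.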

SIMD Floating-Point Exceptions
Invalid.

Denormal Handling
Treat Input Denormals As Zeros :
(MXCSR.DAZ)? YES : NO
Flush Tiny Results To Zero :
NO

Memory Up-conversion: Si32
S2 S1 S0   Function:                    Usage                      disp8*N
000        no conversion                [rax] {16to16} or [rax]    64
001        broadcast 1 element (x16)    [rax] {1to16}              4
010        broadcast 4 elements (x4)    [rax] {4to16}              16
011        reserved                     N/A                        N/A
100        uint8 to uint32              [rax] {uint8}              16
101        sint8 to sint32              [rax] {sint8}              16
110        uint16 to uint32             [rax] {uint16}             32
111        sint16 to sint32             [rax] {sint16}             32

Register Swizzle: Si32
MVEX.EH=0
S2 S1 S0   Function: 4 x 32 bits    Usage
000        no swizzle               zmm0 or zmm0 {dcba}
001        swap (inner) pairs       zmm0 {cdab}
010        swap with two-away       zmm0 {badc}
011        cross-product swizzle    zmm0 {dacb}
100        broadcast a element      zmm0 {aaaa}
101        broadcast b element      zmm0 {bbbb}
110        broadcast c element      zmm0 {cccc}
111        broadcast d element      zmm0 {dddd}
MVEX.EH=1
S2 S1 S0   Rounding Mode Override           Usage
1xx        SAE (Suppress-All-Exceptions)    , {sae}

Intel® C/C++ Compiler Intrinsic Equivalent
__m512 _mm512_fixupnan_ps(__m512, __m512, __m512i);
__m512 _mm512_mask_fixupnan_ps(__m512, __mmask16, __m512, __m512i);

Exceptions
Real-Address Mode and Virtual-8086
#UD              Instruction not available in these modes

Protected and Compatibility Mode
#UD              Instruction not available in these modes

64 bit Mode
#SS(0)           If a memory address referencing the SS segment is in a non-canonical form.
#GP(0)           If the memory address is in a non-canonical form.
                 If a memory operand linear address is not aligned to the data size
                 granularity dictated by SwizzUpConv mode.
#PF(fault-code)  For a page fault.
#NM              If CR0.TS[bit 3]=1.
                 If preceded by any REX, F0, F2, F3, or 66 prefixes.

VFMADD132PD - Multiply Destination By Second Source and Add To First Source Float64 Vectors

Opcode
MVEX.NDS.512.66.0F38.W1 98 /r

Instruction
vfmadd132pd zmm1 {k1}, zmm2, Sf64(zmm3/mt)

Description
Multiply float64 vector zmm1 and float64 vector Sf64(zmm3/mt), add the result to float64 vector zmm2, and store the final result in zmm1, under write-mask.

Description
Performs an element-by-element multiplication between float64 vector zmm1 and the
float64 vector result of the swizzle/broadcast/conversion process on memory or vector
float64 zmm3, then adds the result to float64 vector zmm2. The final sum is written into
float64 vector zmm1.
Intermediate values are calculated to infinite precision, and are not truncated or rounded.
This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.

Operation
if(source is a register operand and MVEX.EH bit is 1) {
    if(SSS[2]==1) Supress_Exception_Flags() // SAE
    // SSS are bits 6-4 from the MVEX prefix encoding. For more details, see Table 2.14
    RoundingMode = SSS[1:0]
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    RoundingMode = MXCSR.RC
    tmpSrc3[511:0] = SwizzUpConvLoadf64(zmm3/mt)
}
for (n = 0; n < 8; n++) {
    if(k1[n] != 0) {
        i = 64*n
        // float64 operation
        zmm1[i+63:i] = zmm1[i+63:i] * tmpSrc3[i+63:i] + zmm2[i+63:i]
    }
}

SIMD Floating-Point Exceptions
Overflow, Underflow, Invalid, Precision, Denormal.

Denormal Handling
Treat Input Denormals As Zeros :
(MXCSR.DAZ)? YES : NO
Flush Tiny Results To Zero :
(MXCSR.FZ)? YES : NO

Memory Up-conversion: Sf64
S2 S1 S0   Function                      Usage                     disp8*N
000        no conversion                 [rax] {8to8} or [rax]     64
001        broadcast 1 element (x8)      [rax] {1to8}              8
010        broadcast 4 elements (x2)     [rax] {4to8}              32
011        reserved                      N/A                       N/A
100        reserved                      N/A                       N/A
101        reserved                      N/A                       N/A
110        reserved                      N/A                       N/A
111        reserved                      N/A                       N/A
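
The three supported Sf64 encodings can be sketched as a host-side load model. The names below (`upconv_f64`, `load_upconv_f64`, `scaled_disp8`) are hypothetical; the element counts come from the Function column and the scale factors from the disp8*N column of the table above.

```c
#include <assert.h>
#include <stddef.h>

enum upconv_f64 { UPCONV_8TO8 = 0, UPCONV_1TO8 = 1, UPCONV_4TO8 = 2 };

/* Number of float64 elements actually read from memory. */
static size_t upconv_f64_elems(enum upconv_f64 mode)
{
    switch (mode) {
    case UPCONV_8TO8: return 8;
    case UPCONV_1TO8: return 1;
    case UPCONV_4TO8: return 4;
    }
    return 0;  /* reserved encodings are not modelled */
}

/* Scale factor N applied to an 8-bit displacement (disp8*N column). */
static long disp8_scale(enum upconv_f64 mode)
{
    switch (mode) {
    case UPCONV_8TO8: return 64;
    case UPCONV_1TO8: return 8;
    case UPCONV_4TO8: return 32;
    }
    return 0;
}

/* Fill all eight lanes by repeating the elements that were read. */
static void load_upconv_f64(double out[8], const double *mem, enum upconv_f64 mode)
{
    size_t count = upconv_f64_elems(mode);
    assert(count != 0);
    for (size_t n = 0; n < 8; n++)
        out[n] = mem[n % count];
}

/* Effective byte offset encoded by a compressed 8-bit displacement. */
static long scaled_disp8(signed char disp8, enum upconv_f64 mode)
{
    return (long)disp8 * disp8_scale(mode);
}
```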

Register Swizzle: Sf64
MVEX.EH=0
S2 S1 S0   Function: 4 x 64 bits      Usage
000        no swizzle                 zmm0 or zmm0 {dcba}
001        swap (inner) pairs         zmm0 {cdab}
010        swap with two-away         zmm0 {badc}
011        cross-product swizzle      zmm0 {dacb}
100        broadcast a element        zmm0 {aaaa}
101        broadcast b element        zmm0 {bbbb}
110        broadcast c element        zmm0 {cccc}
111        broadcast d element        zmm0 {dddd}
MVEX.EH=1
S2 S1 S0   Rounding Mode Override                Usage
000        Round To Nearest (even)               , {rn}
001        Round Down (-INF)                     , {rd}
010        Round Up (+INF)                       , {ru}
011        Round Toward Zero                     , {rz}
100        Round To Nearest (even) with SAE      , {rn-sae}
101        Round Down (-INF) with SAE            , {rd-sae}
110        Round Up (+INF) with SAE              , {ru-sae}
111        Round Toward Zero with SAE            , {rz-sae}
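
The MVEX.EH=1 half of the table splits the three SSS bits into a rounding mode (bits 1:0) and an SAE flag (bit 2). A minimal decoding sketch follows; `decode_sss` and `struct round_override` are hypothetical names, and the standard C `<fenv.h>` constants are used only as convenient labels for the four modes.

```c
#include <fenv.h>
#include <stdbool.h>

struct round_override {
    int  fe_mode;  /* one of the <fenv.h> rounding constants */
    bool sae;      /* true when exception flags are suppressed */
};

/* Decode SSS[2:0] for a register operand with MVEX.EH=1. */
static struct round_override decode_sss(unsigned sss)
{
    static const int modes[4] = {
        FE_TONEAREST,   /* 00: {rn} */
        FE_DOWNWARD,    /* 01: {rd} */
        FE_UPWARD,      /* 10: {ru} */
        FE_TOWARDZERO   /* 11: {rz} */
    };
    struct round_override r;
    r.fe_mode = modes[sss & 0x3u];
    r.sae = (sss & 0x4u) != 0;
    return r;
}
```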

Intel® C/C++ Compiler Intrinsic Equivalent
__m512d _mm512_fmadd_pd (__m512d, __m512d, __m512d);
__m512d _mm512_mask_fmadd_pd (__m512d, __mmask8, __m512d, __m512d);
__m512d _mm512_mask3_fmadd_pd (__m512d, __m512d, __m512d, __mmask8);

Exceptions
Real-Address Mode and Virtual-8086
#UD               Instruction not available in these modes

Protected and Compatibility Mode
#UD               Instruction not available in these modes

64 bit Mode
#SS(0)            If a memory address referencing the SS segment is in a non-canonical form.
#GP(0)            If the memory address is in a non-canonical form.
                  If a memory operand linear address is not aligned to the data size
                  granularity dictated by SwizzUpConv mode.
#PF(fault-code)   For a page fault.
#NM               If CR0.TS[bit 3]=1.
#UD               If preceded by any REX, F0, F2, F3, or 66 prefixes.

VFMADD132PS - Multiply Destination By Second Source and Add To First Source Float32 Vectors

Opcode
MVEX.NDS.512.66.0F38.W0 98 /r

Instruction
vfmadd132ps zmm1 {k1}, zmm2, Sf32(zmm3/mt)

Description
Multiply float32 vector zmm1 and float32 vector Sf32(zmm3/mt), add the result to float32 vector zmm2, and store the final result in zmm1, under write-mask.

Description
Performs an element-by-element multiplication between float32 vector zmm1 and the
float32 vector result of the swizzle/broadcast/conversion process on memory or vector
float32 zmm3, then adds the result to float32 vector zmm2. The final sum is written into
float32 vector zmm1.
Intermediate values are calculated to infinite precision, and are not truncated or rounded.
This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.

Operation
if(source is a register operand and MVEX.EH bit is 1) {
    if(SSS[2]==1) Supress_Exception_Flags() // SAE
    // SSS are bits 6-4 from the MVEX prefix encoding. For more details, see Table 2.14
    RoundingMode = SSS[1:0]
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    RoundingMode = MXCSR.RC
    tmpSrc3[511:0] = SwizzUpConvLoadf32(zmm3/mt)
}
for (n = 0; n < 16; n++) {
    if(k1[n] != 0) {
        i = 32*n
        // float32 operation
        zmm1[i+31:i] = zmm1[i+31:i] * tmpSrc3[i+31:i] + zmm2[i+31:i]
    }
}

SIMD Floating-Point Exceptions
Overflow, Underflow, Invalid, Precision, Denormal.

Denormal Handling
Treat Input Denormals As Zeros :
(MXCSR.DAZ)? YES : NO
Flush Tiny Results To Zero :
(MXCSR.FZ)? YES : NO

Memory Up-conversion: Sf32
S2 S1 S0   Function                      Usage                       disp8*N
000        no conversion                 [rax] {16to16} or [rax]     64
001        broadcast 1 element (x16)     [rax] {1to16}               4
010        broadcast 4 elements (x4)     [rax] {4to16}               16
011        float16 to float32            [rax] {float16}             32
100        uint8 to float32              [rax] {uint8}               16
110        uint16 to float32             [rax] {uint16}              32
111        sint16 to float32             [rax] {sint16}              32
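
The integer rows of the table describe element-wise conversions of a narrower memory vector into sixteen float32 lanes. A rough host-side sketch follows; the function names are hypothetical, and the float16 row is left out because it needs a half-precision decoder.

```c
#include <stdint.h>

#define LANES_F32 16  /* a 512-bit register holds sixteen float32 elements */

/* {uint8}: reads 16 bytes, matching disp8*N = 16. */
static void upconv_uint8_to_f32(float out[LANES_F32], const uint8_t mem[LANES_F32])
{
    for (int n = 0; n < LANES_F32; n++)
        out[n] = (float)mem[n];
}

/* {uint16}: reads 32 bytes, matching disp8*N = 32. */
static void upconv_uint16_to_f32(float out[LANES_F32], const uint16_t mem[LANES_F32])
{
    for (int n = 0; n < LANES_F32; n++)
        out[n] = (float)mem[n];
}

/* {sint16}: reads 32 bytes, matching disp8*N = 32. */
static void upconv_sint16_to_f32(float out[LANES_F32], const int16_t mem[LANES_F32])
{
    for (int n = 0; n < LANES_F32; n++)
        out[n] = (float)mem[n];
}
```

Every 8-bit and 16-bit integer is exactly representable in float32, so these conversions involve no rounding.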

Register Swizzle: Sf32
MVEX.EH=0
S2 S1 S0   Function: 4 x 32 bits      Usage
000        no swizzle                 zmm0 or zmm0 {dcba}
001        swap (inner) pairs         zmm0 {cdab}
010        swap with two-away         zmm0 {badc}
011        cross-product swizzle      zmm0 {dacb}
100        broadcast a element        zmm0 {aaaa}
101        broadcast b element        zmm0 {bbbb}
110        broadcast c element        zmm0 {cccc}
111        broadcast d element        zmm0 {dddd}
MVEX.EH=1
S2 S1 S0   Rounding Mode Override                Usage
000        Round To Nearest (even)               , {rn}
001        Round Down (-INF)                     , {rd}
010        Round Up (+INF)                       , {ru}
011        Round Toward Zero                     , {rz}
100        Round To Nearest (even) with SAE      , {rn-sae}
101        Round Down (-INF) with SAE            , {rd-sae}
110        Round Up (+INF) with SAE              , {ru-sae}
111        Round Toward Zero with SAE            , {rz-sae}
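
The MVEX.EH=0 swizzles permute elements inside each 4-element group independently. The sketch below models them for float32 lanes, assuming that the letters a to d name the elements of a group from lowest to highest and that each pattern is written highest element first (so {dcba} is the identity). `swizzle_f32` and the index table are hypothetical helpers.

```c
/* swizzle_src[sss][p] is the source element (0 = a ... 3 = d) that lands
 * in position p of each 4-element group. */
static const int swizzle_src[8][4] = {
    {0, 1, 2, 3},  /* 000 {dcba}: no swizzle        */
    {1, 0, 3, 2},  /* 001 {cdab}: swap inner pairs  */
    {2, 3, 0, 1},  /* 010 {badc}: swap two-away     */
    {1, 2, 0, 3},  /* 011 {dacb}: cross-product     */
    {0, 0, 0, 0},  /* 100 {aaaa}                    */
    {1, 1, 1, 1},  /* 101 {bbbb}                    */
    {2, 2, 2, 2},  /* 110 {cccc}                    */
    {3, 3, 3, 3}   /* 111 {dddd}                    */
};

/* Apply the selected swizzle to all four groups of a 16-lane vector. */
static void swizzle_f32(float out[16], const float in[16], unsigned sss)
{
    for (int n = 0; n < 16; n++) {
        int group_base = n & ~3;  /* first lane of this 4-element group */
        out[n] = in[group_base + swizzle_src[sss & 7u][n & 3]];
    }
}
```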
Intel® C/C++ Compiler Intrinsic Equivalent
__m512 _mm512_fmadd_ps (__m512, __m512, __m512);
__m512 _mm512_mask_fmadd_ps (__m512, __mmask16, __m512, __m512);
__m512 _mm512_mask3_fmadd_ps (__m512, __m512, __m512, __mmask16);

Exceptions
Real-Address Mode and Virtual-8086
#UD               Instruction not available in these modes

Protected and Compatibility Mode
#UD               Instruction not available in these modes

64 bit Mode
#SS(0)            If a memory address referencing the SS segment is in a non-canonical form.
#GP(0)            If the memory address is in a non-canonical form.
                  If a memory operand linear address is not aligned to the data size
                  granularity dictated by SwizzUpConv mode.
#PF(fault-code)   For a page fault.
#NM               If CR0.TS[bit 3]=1.
#UD               If preceded by any REX, F0, F2, F3, or 66 prefixes.

VFMADD213PD - Multiply First Source By Destination and Add Second Source Float64 Vectors

Opcode
MVEX.NDS.512.66.0F38.W1 A8 /r

Instruction
vfmadd213pd zmm1 {k1}, zmm2, Sf64(zmm3/mt)

Description
Multiply float64 vector zmm2 and float64 vector zmm1, add the result to float64 vector Sf64(zmm3/mt), and store the final result in zmm1, under write-mask.

Description
Performs an element-by-element multiplication between float64 vector zmm2 and float64
vector zmm1 and then adds the result to the float64 vector result of the swizzle/broadcast/conversion
process on memory or vector float64 zmm3. The final sum is written into float64 vector
zmm1.
Intermediate values are calculated to infinite precision, and are not truncated or rounded.
This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.

Operation
if(source is a register operand and MVEX.EH bit is 1) {
    if(SSS[2]==1) Supress_Exception_Flags() // SAE
    // SSS are bits 6-4 from the MVEX prefix encoding. For more details, see Table 2.14
    RoundingMode = SSS[1:0]
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    RoundingMode = MXCSR.RC
    tmpSrc3[511:0] = SwizzUpConvLoadf64(zmm3/mt)
}
for (n = 0; n < 8; n++) {
    if(k1[n] != 0) {
        i = 64*n
        // float64 operation
        zmm1[i+63:i] = zmm2[i+63:i] * zmm1[i+63:i] + tmpSrc3[i+63:i]
    }
}

SIMD Floating-Point Exceptions
Overflow, Underflow, Invalid, Precision, Denormal.

Denormal Handling
Treat Input Denormals As Zeros :
(MXCSR.DAZ)? YES : NO
Flush Tiny Results To Zero :
(MXCSR.FZ)? YES : NO

Memory Up-conversion: Sf64
S2 S1 S0   Function                      Usage                     disp8*N
000        no conversion                 [rax] {8to8} or [rax]     64
001        broadcast 1 element (x8)      [rax] {1to8}              8
010        broadcast 4 elements (x2)     [rax] {4to8}              32
011        reserved                      N/A                       N/A
100        reserved                      N/A                       N/A
101        reserved                      N/A                       N/A
110        reserved                      N/A                       N/A
111        reserved                      N/A                       N/A

Register Swizzle: Sf64
MVEX.EH=0
S2 S1 S0   Function: 4 x 64 bits      Usage
000        no swizzle                 zmm0 or zmm0 {dcba}
001        swap (inner) pairs         zmm0 {cdab}
010        swap with two-away         zmm0 {badc}
011        cross-product swizzle      zmm0 {dacb}
100        broadcast a element        zmm0 {aaaa}
101        broadcast b element        zmm0 {bbbb}
110        broadcast c element        zmm0 {cccc}
111        broadcast d element        zmm0 {dddd}
MVEX.EH=1
S2 S1 S0   Rounding Mode Override                Usage
000        Round To Nearest (even)               , {rn}
001        Round Down (-INF)                     , {rd}
010        Round Up (+INF)                       , {ru}
011        Round Toward Zero                     , {rz}
100        Round To Nearest (even) with SAE      , {rn-sae}
101        Round Down (-INF) with SAE            , {rd-sae}
110        Round Up (+INF) with SAE              , {ru-sae}
111        Round Toward Zero with SAE            , {rz-sae}

Intel® C/C++ Compiler Intrinsic Equivalent
__m512d _mm512_fmadd_pd (__m512d, __m512d, __m512d);
__m512d _mm512_mask_fmadd_pd (__m512d, __mmask8, __m512d, __m512d);
__m512d _mm512_mask3_fmadd_pd (__m512d, __m512d, __m512d, __mmask8);

Exceptions
Real-Address Mode and Virtual-8086
#UD               Instruction not available in these modes

Protected and Compatibility Mode
#UD               Instruction not available in these modes

64 bit Mode
#SS(0)            If a memory address referencing the SS segment is in a non-canonical form.
#GP(0)            If the memory address is in a non-canonical form.
                  If a memory operand linear address is not aligned to the data size
                  granularity dictated by SwizzUpConv mode.
#PF(fault-code)   For a page fault.
#NM               If CR0.TS[bit 3]=1.
#UD               If preceded by any REX, F0, F2, F3, or 66 prefixes.

VFMADD213PS - Multiply First Source By Destination and Add Second Source Float32 Vectors

Opcode
MVEX.NDS.512.66.0F38.W0 A8 /r

Instruction
vfmadd213ps zmm1 {k1}, zmm2, Sf32(zmm3/mt)

Description
Multiply float32 vector zmm2 and float32 vector zmm1, add the result to float32 vector Sf32(zmm3/mt), and store the final result in zmm1, under write-mask.

Description
Performs an element-by-element multiplication between float32 vector zmm2 and float32
vector zmm1 and then adds the result to the float32 vector result of the swizzle/broadcast/conversion
process on memory or vector float32 zmm3. The final sum is written into float32 vector
zmm1.
Intermediate values are calculated to infinite precision, and are not truncated or rounded.
This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.

Operation
if(source is a register operand and MVEX.EH bit is 1) {
    if(SSS[2]==1) Supress_Exception_Flags() // SAE
    // SSS are bits 6-4 from the MVEX prefix encoding. For more details, see Table 2.14
    RoundingMode = SSS[1:0]
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    RoundingMode = MXCSR.RC
    tmpSrc3[511:0] = SwizzUpConvLoadf32(zmm3/mt)
}
for (n = 0; n < 16; n++) {
    if(k1[n] != 0) {
        i = 32*n
        // float32 operation
        zmm1[i+31:i] = zmm2[i+31:i] * zmm1[i+31:i] + tmpSrc3[i+31:i]
    }
}

SIMD Floating-Point Exceptions
Overflow, Underflow, Invalid, Precision, Denormal.

Denormal Handling
Treat Input Denormals As Zeros :
(MXCSR.DAZ)? YES : NO
Flush Tiny Results To Zero :
(MXCSR.FZ)? YES : NO

Memory Up-conversion: Sf32
S2 S1 S0   Function                      Usage                       disp8*N
000        no conversion                 [rax] {16to16} or [rax]     64
001        broadcast 1 element (x16)     [rax] {1to16}               4
010        broadcast 4 elements (x4)     [rax] {4to16}               16
011        float16 to float32            [rax] {float16}             32
100        uint8 to float32              [rax] {uint8}               16
110        uint16 to float32             [rax] {uint16}              32
111        sint16 to float32             [rax] {sint16}              32

Register Swizzle: Sf32
MVEX.EH=0
S2 S1 S0   Function: 4 x 32 bits      Usage
000        no swizzle                 zmm0 or zmm0 {dcba}
001        swap (inner) pairs         zmm0 {cdab}
010        swap with two-away         zmm0 {badc}
011        cross-product swizzle      zmm0 {dacb}
100        broadcast a element        zmm0 {aaaa}
101        broadcast b element        zmm0 {bbbb}
110        broadcast c element        zmm0 {cccc}
111        broadcast d element        zmm0 {dddd}
MVEX.EH=1
S2 S1 S0   Rounding Mode Override                Usage
000        Round To Nearest (even)               , {rn}
001        Round Down (-INF)                     , {rd}
010        Round Up (+INF)                       , {ru}
011        Round Toward Zero                     , {rz}
100        Round To Nearest (even) with SAE      , {rn-sae}
101        Round Down (-INF) with SAE            , {rd-sae}
110        Round Up (+INF) with SAE              , {ru-sae}
111        Round Toward Zero with SAE            , {rz-sae}

Intel® C/C++ Compiler Intrinsic Equivalent
__m512 _mm512_fmadd_ps (__m512, __m512, __m512);
__m512 _mm512_mask_fmadd_ps (__m512, __mmask16, __m512, __m512);
__m512 _mm512_mask3_fmadd_ps (__m512, __m512, __m512, __mmask16);

Exceptions
Real-Address Mode and Virtual-8086
#UD               Instruction not available in these modes

Protected and Compatibility Mode
#UD               Instruction not available in these modes

64 bit Mode
#SS(0)            If a memory address referencing the SS segment is in a non-canonical form.
#GP(0)            If the memory address is in a non-canonical form.
                  If a memory operand linear address is not aligned to the data size
                  granularity dictated by SwizzUpConv mode.
#PF(fault-code)   For a page fault.
#NM               If CR0.TS[bit 3]=1.
#UD               If preceded by any REX, F0, F2, F3, or 66 prefixes.

VFMADD231PD - Multiply First Source By Second Source and Add To Destination Float64 Vectors

Opcode
MVEX.NDS.512.66.0F38.W1 B8 /r

Instruction
vfmadd231pd zmm1 {k1}, zmm2, Sf64(zmm3/mt)

Description
Multiply float64 vector zmm2 and float64 vector Sf64(zmm3/mt), add the result to float64 vector zmm1, and store the final result in zmm1, under write-mask.

Description
Performs an element-by-element multiplication between float64 vector zmm2 and the
float64 vector result of the swizzle/broadcast/conversion process on memory or vector
float64 zmm3, then adds the result to float64 vector zmm1. The final sum is written into
float64 vector zmm1.
Intermediate values are calculated to infinite precision, and are not truncated or rounded.
This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.

Operation
if(source is a register operand and MVEX.EH bit is 1) {
    if(SSS[2]==1) Supress_Exception_Flags() // SAE
    // SSS are bits 6-4 from the MVEX prefix encoding. For more details, see Table 2.14
    RoundingMode = SSS[1:0]
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    RoundingMode = MXCSR.RC
    tmpSrc3[511:0] = SwizzUpConvLoadf64(zmm3/mt)
}
for (n = 0; n < 8; n++) {
    if(k1[n] != 0) {
        i = 64*n
        // float64 operation
        zmm1[i+63:i] = zmm2[i+63:i] * tmpSrc3[i+63:i] + zmm1[i+63:i]
    }
}

SIMD Floating-Point Exceptions
Overflow, Underflow, Invalid, Precision, Denormal.

Denormal Handling
Treat Input Denormals As Zeros :
(MXCSR.DAZ)? YES : NO
Flush Tiny Results To Zero :
(MXCSR.FZ)? YES : NO

Memory Up-conversion: Sf64
S2 S1 S0   Function                      Usage                     disp8*N
000        no conversion                 [rax] {8to8} or [rax]     64
001        broadcast 1 element (x8)      [rax] {1to8}              8
010        broadcast 4 elements (x2)     [rax] {4to8}              32
011        reserved                      N/A                       N/A
100        reserved                      N/A                       N/A
101        reserved                      N/A                       N/A
110        reserved                      N/A                       N/A
111        reserved                      N/A                       N/A

Register Swizzle: Sf64
MVEX.EH=0
S2 S1 S0   Function: 4 x 64 bits      Usage
000        no swizzle                 zmm0 or zmm0 {dcba}
001        swap (inner) pairs         zmm0 {cdab}
010        swap with two-away         zmm0 {badc}
011        cross-product swizzle      zmm0 {dacb}
100        broadcast a element        zmm0 {aaaa}
101        broadcast b element        zmm0 {bbbb}
110        broadcast c element        zmm0 {cccc}
111        broadcast d element        zmm0 {dddd}
MVEX.EH=1
S2 S1 S0   Rounding Mode Override                Usage
000        Round To Nearest (even)               , {rn}
001        Round Down (-INF)                     , {rd}
010        Round Up (+INF)                       , {ru}
011        Round Toward Zero                     , {rz}
100        Round To Nearest (even) with SAE      , {rn-sae}
101        Round Down (-INF) with SAE            , {rd-sae}
110        Round Up (+INF) with SAE              , {ru-sae}
111        Round Toward Zero with SAE            , {rz-sae}

Intel® C/C++ Compiler Intrinsic Equivalent
__m512d _mm512_fmadd_pd (__m512d, __m512d, __m512d);
__m512d _mm512_mask_fmadd_pd (__m512d, __mmask8, __m512d, __m512d);
__m512d _mm512_mask3_fmadd_pd (__m512d, __m512d, __m512d, __mmask8);

Exceptions
Real-Address Mode and Virtual-8086
#UD               Instruction not available in these modes

Protected and Compatibility Mode
#UD               Instruction not available in these modes

64 bit Mode
#SS(0)            If a memory address referencing the SS segment is in a non-canonical form.
#GP(0)            If the memory address is in a non-canonical form.
                  If a memory operand linear address is not aligned to the data size
                  granularity dictated by SwizzUpConv mode.
#PF(fault-code)   For a page fault.
#NM               If CR0.TS[bit 3]=1.
#UD               If preceded by any REX, F0, F2, F3, or 66 prefixes.

VFMADD231PS - Multiply First Source By Second Source and Add To Destination Float32 Vectors

Opcode
MVEX.NDS.512.66.0F38.W0 B8 /r

Instruction
vfmadd231ps zmm1 {k1}, zmm2, Sf32(zmm3/mt)

Description
Multiply float32 vector zmm2 and float32 vector Sf32(zmm3/mt), add the result to float32 vector zmm1, and store the final result in zmm1, under write-mask.

Description
Performs an element-by-element multiplication between float32 vector zmm2 and the
float32 vector result of the swizzle/broadcast/conversion process on memory or vector
float32 zmm3, then adds the result to float32 vector zmm1. The final sum is written into
float32 vector zmm1.
Intermediate values are calculated to infinite precision, and are not truncated or rounded.
This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.

Operation
if(source is a register operand and MVEX.EH bit is 1) {
    if(SSS[2]==1) Supress_Exception_Flags() // SAE
    // SSS are bits 6-4 from the MVEX prefix encoding. For more details, see Table 2.14
    RoundingMode = SSS[1:0]
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    RoundingMode = MXCSR.RC
    tmpSrc3[511:0] = SwizzUpConvLoadf32(zmm3/mt)
}
for (n = 0; n < 16; n++) {
    if(k1[n] != 0) {
        i = 32*n
        // float32 operation
        zmm1[i+31:i] = zmm2[i+31:i] * tmpSrc3[i+31:i] + zmm1[i+31:i]
    }
}

SIMD Floating-Point Exceptions
Overflow, Underflow, Invalid, Precision, Denormal.

Denormal Handling
Treat Input Denormals As Zeros :
(MXCSR.DAZ)? YES : NO
Flush Tiny Results To Zero :
(MXCSR.FZ)? YES : NO

Memory Up-conversion: Sf32
S2 S1 S0   Function                      Usage                       disp8*N
000        no conversion                 [rax] {16to16} or [rax]     64
001        broadcast 1 element (x16)     [rax] {1to16}               4
010        broadcast 4 elements (x4)     [rax] {4to16}               16
011        float16 to float32            [rax] {float16}             32
100        uint8 to float32              [rax] {uint8}               16
110        uint16 to float32             [rax] {uint16}              32
111        sint16 to float32             [rax] {sint16}              32

Register Swizzle: Sf32
MVEX.EH=0
S2 S1 S0   Function: 4 x 32 bits      Usage
000        no swizzle                 zmm0 or zmm0 {dcba}
001        swap (inner) pairs         zmm0 {cdab}
010        swap with two-away         zmm0 {badc}
011        cross-product swizzle      zmm0 {dacb}
100        broadcast a element        zmm0 {aaaa}
101        broadcast b element        zmm0 {bbbb}
110        broadcast c element        zmm0 {cccc}
111        broadcast d element        zmm0 {dddd}
MVEX.EH=1
S2 S1 S0   Rounding Mode Override                Usage
000        Round To Nearest (even)               , {rn}
001        Round Down (-INF)                     , {rd}
010        Round Up (+INF)                       , {ru}
011        Round Toward Zero                     , {rz}
100        Round To Nearest (even) with SAE      , {rn-sae}
101        Round Down (-INF) with SAE            , {rd-sae}
110        Round Up (+INF) with SAE              , {ru-sae}
111        Round Toward Zero with SAE            , {rz-sae}

Intel® C/C++ Compiler Intrinsic Equivalent
__m512 _mm512_fmadd_ps (__m512, __m512, __m512);
__m512 _mm512_mask_fmadd_ps (__m512, __mmask16, __m512, __m512);
__m512 _mm512_mask3_fmadd_ps (__m512, __m512, __m512, __mmask16);

Exceptions
Real-Address Mode and Virtual-8086
#UD               Instruction not available in these modes

Protected and Compatibility Mode
#UD               Instruction not available in these modes

64 bit Mode
#SS(0)            If a memory address referencing the SS segment is in a non-canonical form.
#GP(0)            If the memory address is in a non-canonical form.
                  If a memory operand linear address is not aligned to the data size
                  granularity dictated by SwizzUpConv mode.
#PF(fault-code)   For a page fault.
#NM               If CR0.TS[bit 3]=1.
#UD               If preceded by any REX, F0, F2, F3, or 66 prefixes.

VFMADD233PS - Multiply First Source By Specially Swizzled Second Source and Add To Second Source Float32 Vectors

Opcode
MVEX.NDS.512.66.0F38.W0 A4 /r

Instruction
vfmadd233ps zmm1 {k1}, zmm2, Sf32(zmm3/mt)

Description
Multiply float32 vector zmm2 by certain elements of float32 vector Sf32(zmm3/mt), add the result to certain elements of Sf32(zmm3/mt), and store the final result in zmm1, under write-mask.

Description
This instruction is built around the concept of 4-element sets, of which there are four:
elements 0-3, 4-7, 8-11, and 12-15. If we refer to the float32 vector result of the broadcast
(no conversion is supported) process on memory or the float32 vector zmm3 (no swizzle
is supported) as t3, then:
Each element 0-3 of float32 vector zmm2 is multiplied by element 1 of t3, the result is
added to element 0 of t3, and the final sum is written into the corresponding element 0-3
of float32 vector zmm1.
Each element 4-7 of float32 vector zmm2 is multiplied by element 5 of t3, the result is
added to element 4 of t3, and the final sum is written into the corresponding element 4-7
of float32 vector zmm1.
Each element 8-11 of float32 vector zmm2 is multiplied by element 9 of t3, the result is
added to element 8 of t3, and the final sum is written into the corresponding element 8-11
of float32 vector zmm1.
Each element 12-15 of float32 vector zmm2 is multiplied by element 13 of t3, the result
is added to element 12 of t3, and the final sum is written into the corresponding element
12-15 of float32 vector zmm1.
Intermediate values are calculated to infinite precision, and are not truncated or rounded.
This instruction makes it possible to perform scale and bias in a single instruction without
needing to have either scale or bias already loaded in a register. This saves one vector load
for each interpolant, representing around ten percent of shader instructions.
For structure-of-arrays (SOA) operation, this instruction is intended to be used with the
{4to16} broadcast on src2, allowing all 16 scales and biases to be identical. For
array-of-structures (AOS) vec4 operations, no broadcast is used, allowing four different
scales and biases, one for each vec4.
No conversion or swizzling is supported for this instruction. However, all broadcasts
except {1to16} are supported (i.e. 16to16 and 4to16).
This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.

Operation

if(source is a register operand and MVEX.EH bit is 1) {
    if(SSS[2]==1) Supress_Exception_Flags() // SAE
    RoundingMode = SSS[1:0]
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    RoundingMode = MXCSR.RC
    tmpSrc3[511:0] = SwizzUpConvLoadf32(zmm3/mt)
}
for (n = 0; n < 16; n++) {
    if (k1[n] != 0) {
        i = 32*n
        base = ( n & ~0x03 ) * 32
        scale[31:0] = tmpSrc3[base+63:base+32]
        bias[31:0] = tmpSrc3[base+31:base]
        // float32 operation
        zmm1[i+31:i] = zmm2[i+31:i] * scale[31:0] + bias[31:0]
    }
}

SIMD Floating-Point Exceptions
Overflow, Underflow, Invalid, Precision, Denormal.

Denormal Handling
Treat Input Denormals As Zeros :
(MXCSR.DAZ)? YES : NO
Flush Tiny Results To Zero :
(MXCSR.FZ)? YES : NO

Memory Up-conversion: Sf32
S2 S1 S0   Function                      Usage                       disp8*N
000        no conversion                 [rax] {16to16} or [rax]     64
001        reserved                      N/A                         N/A
010        broadcast 4 elements (x4)     [rax] {4to16}               16
011        reserved                      N/A                         N/A
100        reserved                      N/A                         N/A
101        reserved                      N/A                         N/A
110        reserved                      N/A                         N/A
111        reserved                      N/A                         N/A

Register Swizzle: Sf32
MVEX.EH=0
S2 S1 S0   Function: 4 x 32 bits      Usage
000        no swizzle                 zmm0 or zmm0 {dcba}
001        reserved                   N/A
010        reserved                   N/A
011        reserved                   N/A
100        reserved                   N/A
101        reserved                   N/A
110        reserved                   N/A
111        reserved                   N/A
MVEX.EH=1
S2 S1 S0   Rounding Mode Override                Usage
000        Round To Nearest (even)               , {rn}
001        Round Down (-INF)                     , {rd}
010        Round Up (+INF)                       , {ru}
011        Round Toward Zero                     , {rz}
100        Round To Nearest (even) with SAE      , {rn-sae}
101        Round Down (-INF) with SAE            , {rd-sae}
110        Round Up (+INF) with SAE              , {ru-sae}
111        Round Toward Zero with SAE            , {rz-sae}

Intel® C/C++ Compiler Intrinsic Equivalent
__m512 _mm512_fmadd233_ps (__m512, __m512);
__m512 _mm512_mask_fmadd233_ps (__m512, __mmask16, __m512, __m512);

Exceptions
Real-Address Mode and Virtual-8086
#UD               Instruction not available in these modes

Protected and Compatibility Mode
#UD               Instruction not available in these modes

64 bit Mode
#SS(0)            If a memory address referencing the SS segment is in a non-canonical form.
#GP(0)            If the memory address is in a non-canonical form.
                  If a memory operand linear address is not aligned to 16 or 64-byte
                  (depending on the swizzle broadcast).
#PF(fault-code)   For a page fault.
#NM               If CR0.TS[bit 3]=1.
#UD               If preceded by any REX, F0, F2, F3, or 66 prefixes.
                  This instruction does not support any SwizzUpConv involving data
                  conversion, register swizzling or {1to16} broadcast. If SwizzUpConv
                  function is set to any value different than "no action" or {4to16},
                  then an Invalid Opcode fault is raised.
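
The memory-operand rules for this instruction can be summarized as a small decision function. This is a rough sketch only: `vfmadd233ps_check_mem` and `enum mem_check` are hypothetical, the mapping of 64-byte alignment to {16to16} and 16-byte alignment to {4to16} is an assumption drawn from the disp8*N column, and canonical-address and page-fault checks are left out.

```c
#include <stdint.h>

enum mem_check { MEM_OK, MEM_FAULT_UD, MEM_FAULT_GP };

/* Classify a memory operand by its SSS encoding and linear address. */
static enum mem_check vfmadd233ps_check_mem(uintptr_t linear_addr, unsigned sss)
{
    unsigned alignment;
    switch (sss & 0x7u) {
    case 0x0: alignment = 64; break;  /* {16to16}: full 64-byte vector  */
    case 0x2: alignment = 16; break;  /* {4to16}: four float32 elements */
    default:  return MEM_FAULT_UD;    /* conversion or {1to16}: unsupported */
    }
    return (linear_addr % alignment == 0) ? MEM_OK : MEM_FAULT_GP;
}
```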

VFMSUB132PD - Multiply Destination By Second Source and Subtract First Source Float64 Vectors

Opcode
MVEX.NDS.512.66.0F38.W1 9A /r

Instruction
vfmsub132pd zmm1 {k1}, zmm2, Sf64(zmm3/mt)

Description
Multiply float64 vector zmm1 and float64 vector Sf64(zmm3/mt), subtract float64 vector zmm2 from the result, and store the final result in zmm1, under write-mask.

Description
Performs an element-by-element multiplication of float64 vector zmm1 and the float64
vector result of the swizzle/broadcast/conversion process on memory or vector float64
zmm3, then subtracts float64 vector zmm2 from the result. The final result is written into
float64 vector zmm1.
Intermediate values are calculated to infinite precision, and are not truncated or rounded.
All operations are performed before the final rounding.
This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.

Operation
if(source is a register operand and MVEX.EH bit is 1) {
    if(SSS[2]==1) Supress_Exception_Flags() // SAE
    // SSS are bits 6-4 from the MVEX prefix encoding. For more details, see Table 2.14
    RoundingMode = SSS[1:0]
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    RoundingMode = MXCSR.RC
    tmpSrc3[511:0] = SwizzUpConvLoadf64(zmm3/mt)
}
for (n = 0; n < 8; n++) {
    if(k1[n] != 0) {
        i = 64*n
        // float64 operation
        zmm1[i+63:i] = zmm1[i+63:i] * tmpSrc3[i+63:i] - zmm2[i+63:i]
    }
}

SIMD Floating-Point Exceptions
Overflow, Underflow, Invalid, Precision, Denormal.

Denormal Handling
Treat Input Denormals As Zeros :
(MXCSR.DAZ)? YES : NO
Flush Tiny Results To Zero :
(MXCSR.FZ)? YES : NO

Memory Up-conversion: Sf64
S2 S1 S0   Function                      Usage                     disp8*N
000        no conversion                 [rax] {8to8} or [rax]     64
001        broadcast 1 element (x8)      [rax] {1to8}              8
010        broadcast 4 elements (x2)     [rax] {4to8}              32
011        reserved                      N/A                       N/A
100        reserved                      N/A                       N/A
101        reserved                      N/A                       N/A
110        reserved                      N/A                       N/A
111        reserved                      N/A                       N/A

Register Swizzle: Sf64
MVEX.EH=0
S2 S1 S0   Function: 4 x 64 bits      Usage
000        no swizzle                 zmm0 or zmm0 {dcba}
001        swap (inner) pairs         zmm0 {cdab}
010        swap with two-away         zmm0 {badc}
011        cross-product swizzle      zmm0 {dacb}
100        broadcast a element        zmm0 {aaaa}
101        broadcast b element        zmm0 {bbbb}
110        broadcast c element        zmm0 {cccc}
111        broadcast d element        zmm0 {dddd}
MVEX.EH=1
S2 S1 S0   Rounding Mode Override                Usage
000        Round To Nearest (even)               , {rn}
001        Round Down (-INF)                     , {rd}
010        Round Up (+INF)                       , {ru}
011        Round Toward Zero                     , {rz}
100        Round To Nearest (even) with SAE      , {rn-sae}
101        Round Down (-INF) with SAE            , {rd-sae}
110        Round Up (+INF) with SAE              , {ru-sae}
111        Round Toward Zero with SAE            , {rz-sae}

Intel® C/C++ Compiler Intrinsic Equivalent
__m512d _mm512_fmsub_pd (__m512d, __m512d, __m512d);
__m512d _mm512_mask_fmsub_pd (__m512d, __mmask8, __m512d, __m512d);
__m512d _mm512_mask3_fmsub_pd (__m512d, __m512d, __m512d, __mmask8);

Exceptions
Real-Address Mode and Virtual-8086
#UD               Instruction not available in these modes

Protected and Compatibility Mode
#UD               Instruction not available in these modes

64 bit Mode
#SS(0)            If a memory address referencing the SS segment is in a non-canonical form.
#GP(0)            If the memory address is in a non-canonical form.
                  If a memory operand linear address is not aligned to the data size
                  granularity dictated by SwizzUpConv mode.
#PF(fault-code)   For a page fault.
#NM               If CR0.TS[bit 3]=1.
#UD               If preceded by any REX, F0, F2, F3, or 66 prefixes.

VFMSUB132PS - Multiply Destination By Second Source and Subtract First Source Float32 Vectors

Opcode
MVEX.NDS.512.66.0F38.W0 9A /r

Instruction
vfmsub132ps zmm1 {k1}, zmm2, Sf32(zmm3/mt)

Description
Multiply float32 vector zmm1 and float32 vector Sf32(zmm3/mt), subtract float32 vector zmm2 from the result, and store the final result in zmm1, under write-mask.

Description
Performs an element-by-element multiplication of float32 vector zmm1 and the float32
vector result of the swizzle/broadcast/conversion process on memory or vector float32
zmm3, then subtracts float32 vector zmm2 from the result. The final result is written into
float32 vector zmm1.
Intermediate values are calculated to infinite precision, and are not truncated or rounded.
All operations are performed before the final rounding.
This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.

Operation
if(source is a register operand and MVEX.EH bit is 1) {
    if(SSS[2]==1) Supress_Exception_Flags() // SAE
    // SSS are bits 6-4 from the MVEX prefix encoding. For more details, see Table 2.14
    RoundingMode = SSS[1:0]
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    RoundingMode = MXCSR.RC
    tmpSrc3[511:0] = SwizzUpConvLoadf32(zmm3/mt)
}
for (n = 0; n < 16; n++) {
    if(k1[n] != 0) {
        i = 32*n
        // float32 operation
        zmm1[i+31:i] = zmm1[i+31:i] * tmpSrc3[i+31:i] - zmm2[i+31:i]
    }
}
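
The Operation pseudo-code for VFMSUB132PS can be mirrored by a scalar C model. The sketch below is illustrative only: the names model_vfmsub132ps and check_vfmsub132ps and the array-based register representation are assumptions, not part of the architecture definition. It computes each element in double precision (the product of two float32 values is exact in double) and rounds on conversion back to float, which approximates the single final rounding but may differ in rare double-rounding cases. Swizzles, rounding overrides, and exception flags are not modeled.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar model of: vfmsub132ps zmm1 {k1}, zmm2, src3
   Each register is an array of 16 float32 elements; src3 is assumed to be
   already swizzled or up-converted. */
static void model_vfmsub132ps(float zmm1[16], const float zmm2[16],
                              const float src3[16], uint16_t k1)
{
    for (int n = 0; n < 16; n++) {
        if ((k1 >> n) & 1) {
            /* Exact product in double, rounded once when stored as float. */
            double wide = (double)zmm1[n] * (double)src3[n] - (double)zmm2[n];
            zmm1[n] = (float)wide;
        }
        /* Elements whose mask bit is clear keep their previous value. */
    }
}

/* Small self-check showing the write-mask behavior. */
static void check_vfmsub132ps(void)
{
    float zmm1[16], zmm2[16], src3[16];
    for (int n = 0; n < 16; n++) {
        zmm1[n] = 2.0f;
        zmm2[n] = 3.0f;
        src3[n] = 5.0f;
    }
    model_vfmsub132ps(zmm1, zmm2, src3, 0x0005); /* elements 0 and 2 only */
    assert(zmm1[0] == 7.0f); /* 2*5 - 3 */
    assert(zmm1[1] == 2.0f); /* masked off, unchanged */
    assert(zmm1[2] == 7.0f);
}
```

Calling check_vfmsub132ps() exercises the model with a mask that enables only two of the sixteen elements.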

SIMD Floating-Point Exceptions
Overflow, Underflow, Invalid, Precision, Denormal.

Denormal Handling
Treat Input Denormals As Zeros : (MXCSR.DAZ)? YES : NO
Flush Tiny Results To Zero : (MXCSR.FZ)? YES : NO

Memory Up-conversion: Sf32
S2 S1 S0   Function:                    Usage                      disp8*N
000        no conversion                [rax] {16to16} or [rax]    64
001        broadcast 1 element (x16)    [rax] {1to16}              4
010        broadcast 4 elements (x4)    [rax] {4to16}              16
011        float16 to float32           [rax] {float16}            32
100        uint8 to float32             [rax] {uint8}              16
110        uint16 to float32            [rax] {uint16}             32
111        sint16 to float32            [rax] {sint16}             32
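
The three non-converting rows of the Sf32 memory table ({16to16}, {1to16}, {4to16}) can be sketched as plain C loops. This is a minimal model, not a description of the hardware: the function names are hypothetical, and the {4to16} case assumes the four loaded elements are repeated in order across the sixteen destination positions.

```c
#include <assert.h>
#include <string.h>

/* {16to16}: load 16 consecutive float32 elements (64 bytes). */
static void load_16to16(float dst[16], const float *mem)
{
    memcpy(dst, mem, 16 * sizeof(float));
}

/* {1to16}: replicate one float32 element (4 bytes) into all 16 positions. */
static void load_1to16(float dst[16], const float *mem)
{
    for (int n = 0; n < 16; n++)
        dst[n] = mem[0];
}

/* {4to16}: replicate a block of four float32 elements (16 bytes) four times. */
static void load_4to16(float dst[16], const float *mem)
{
    for (int n = 0; n < 16; n++)
        dst[n] = mem[n % 4];
}

/* Small self-check for the two broadcast forms. */
static void check_broadcast(void)
{
    const float mem[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    float dst[16];
    load_1to16(dst, mem);
    assert(dst[0] == 1.0f && dst[15] == 1.0f);
    load_4to16(dst, mem);
    assert(dst[5] == 2.0f && dst[15] == 4.0f);
}
```

The number of bytes each helper reads matches the disp8*N column of the table (64, 4, and 16 respectively).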

Register Swizzle: Sf32
MVEX.EH=0
S2 S1 S0   Function: 4 x 32 bits    Usage
000        no swizzle               zmm0 or zmm0 {dcba}
001        swap (inner) pairs       zmm0 {cdab}
010        swap with two-away       zmm0 {badc}
011        cross-product swizzle    zmm0 {dacb}
100        broadcast a element      zmm0 {aaaa}
101        broadcast b element      zmm0 {bbbb}
110        broadcast c element      zmm0 {cccc}
111        broadcast d element      zmm0 {dddd}

MVEX.EH=1
S2 S1 S0   Rounding Mode Override              Usage
000        Round To Nearest (even)             , {rn}
001        Round Down (-INF)                   , {rd}
010        Round Up (+INF)                     , {ru}
011        Round Toward Zero                   , {rz}
100        Round To Nearest (even) with SAE    , {rn-sae}
101        Round Down (-INF) with SAE          , {rd-sae}
110        Round Up (+INF) with SAE            , {ru-sae}
111        Round Toward Zero with SAE          , {rz-sae}
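
The MVEX.EH=0 register swizzles can be modeled as a fixed permutation applied inside each group of four 32-bit elements. The sketch below is an assumption-laden illustration: the names swizzle_index and model_swizzle_f32 are hypothetical, and it assumes the pattern letters are written most-significant element first, so that in {dcba} the letter a names element 0 of each four-element group.

```c
#include <assert.h>

/* Source index for each result element of a 4-element group, indexed by the
   S2 S1 S0 value. Assumption: letters read most-significant first, a = element 0. */
static const int swizzle_index[8][4] = {
    {0, 1, 2, 3}, /* 000 {dcba} no swizzle */
    {1, 0, 3, 2}, /* 001 {cdab} swap (inner) pairs */
    {2, 3, 0, 1}, /* 010 {badc} swap with two-away */
    {1, 2, 0, 3}, /* 011 {dacb} cross-product swizzle */
    {0, 0, 0, 0}, /* 100 {aaaa} broadcast a element */
    {1, 1, 1, 1}, /* 101 {bbbb} broadcast b element */
    {2, 2, 2, 2}, /* 110 {cccc} broadcast c element */
    {3, 3, 3, 3}  /* 111 {dddd} broadcast d element */
};

/* Apply the selected swizzle to a 16-element float32 vector, one 4-element
   group at a time. */
static void model_swizzle_f32(float dst[16], const float src[16], unsigned sss)
{
    assert(sss < 8);
    for (int group = 0; group < 4; group++)
        for (int e = 0; e < 4; e++)
            dst[4 * group + e] = src[4 * group + swizzle_index[sss][e]];
}
```

For example, with src[i] = i and sss = 1 ({cdab}), the first group becomes 1, 0, 3, 2 under the stated letter-order assumption.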

Intel® C/C++ Compiler Intrinsic Equivalent
__m512 _mm512_fmsub_ps (__m512, __m512, __m512);
__m512 _mm512_mask_fmsub_ps (__m512, __mmask16, __m512, __m512);
__m512 _mm512_mask3_fmsub_ps (__m512, __m512, __m512, __mmask16);

Exceptions
Real-Address Mode and Virtual-8086
#UD               Instruction not available in these modes

Protected and Compatibility Mode
#UD               Instruction not available in these modes

64 bit Mode
#SS(0)            If a memory address referencing the SS segment is
                  in a non-canonical form.
#GP(0)            If the memory address is in a non-canonical form.
                  If a memory operand linear address is not aligned
                  to the data size granularity dictated by SwizzUpConv
                  mode.
#PF(fault-code)   For a page fault.
#NM               If CR0.TS[bit 3]=1.
                  If preceded by any REX, F0, F2, F3, or 66 prefixes.

VFMSUB213PD - Multiply First Source By Destination and Subtract Second Source Float64 Vectors

Opcode
MVEX.NDS.512.66.0F38.W1 AA /r

Instruction
vfmsub213pd zmm1 {k1}, zmm2, Sf64(zmm3/mt)

Description
Multiply float64 vector zmm2 and float64 vector zmm1, subtract float64 vector
Sf64(zmm3/mt) from the result, and store the final result in zmm1, under write-mask.

Description
Performs an element-by-element multiplication of float64 vector zmm2 and float64
vector zmm1, then subtracts the float64 vector result of the swizzle/broadcast/conversion
process on memory or vector float64 zmm3 from the result. The final result is written
into float64 vector zmm1.
Intermediate values are calculated to infinite precision, and are not truncated or rounded.
All operations must be performed before final rounding.
This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.

Operation
if(source is a register operand and MVEX.EH bit is 1) {
    if(SSS[2]==1) Supress_Exception_Flags() // SAE
    // SSS are bits 6-4 from the MVEX prefix encoding. For more details, see Table 2.14
    RoundingMode = SSS[1:0]
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    RoundingMode = MXCSR.RC
    tmpSrc3[511:0] = SwizzUpConvLoadf64(zmm3/mt)
}
for (n = 0; n < 8; n++) {
    if(k1[n] != 0) {
        i = 64*n
        // float64 operation
        zmm1[i+63:i] = zmm2[i+63:i] * zmm1[i+63:i] - tmpSrc3[i+63:i]
    }
}

SIMD Floating-Point Exceptions
Overflow, Underflow, Invalid, Precision, Denormal.

Denormal Handling
Treat Input Denormals As Zeros : (MXCSR.DAZ)? YES : NO
Flush Tiny Results To Zero : (MXCSR.FZ)? YES : NO

Memory Up-conversion: Sf64
S2 S1 S0   Function:                    Usage                    disp8*N
000        no conversion                [rax] {8to8} or [rax]    64
001        broadcast 1 element (x8)     [rax] {1to8}             8
010        broadcast 4 elements (x2)    [rax] {4to8}             32
011        reserved                     N/A                      N/A
100        reserved                     N/A                      N/A
101        reserved                     N/A                      N/A
110        reserved                     N/A                      N/A
111        reserved                     N/A                      N/A

Register Swizzle: Sf64
MVEX.EH=0
S2 S1 S0   Function: 4 x 64 bits    Usage
000        no swizzle               zmm0 or zmm0 {dcba}
001        swap (inner) pairs       zmm0 {cdab}
010        swap with two-away       zmm0 {badc}
011        cross-product swizzle    zmm0 {dacb}
100        broadcast a element      zmm0 {aaaa}
101        broadcast b element      zmm0 {bbbb}
110        broadcast c element      zmm0 {cccc}
111        broadcast d element      zmm0 {dddd}

MVEX.EH=1
S2 S1 S0   Rounding Mode Override              Usage
000        Round To Nearest (even)             , {rn}
001        Round Down (-INF)                   , {rd}
010        Round Up (+INF)                     , {ru}
011        Round Toward Zero                   , {rz}
100        Round To Nearest (even) with SAE    , {rn-sae}
101        Round Down (-INF) with SAE          , {rd-sae}
110        Round Up (+INF) with SAE            , {ru-sae}
111        Round Toward Zero with SAE          , {rz-sae}
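
The MVEX.EH=1 table splits the three S bits into a rounding mode (SSS[1:0]) and an exception-suppression flag (SSS[2]), matching the first branch of the Operation pseudo-code. A minimal C decoder might look like the following; the type and function names are hypothetical and exist only for this illustration.

```c
#include <assert.h>

typedef struct {
    unsigned rc;       /* SSS[1:0]: 0 = nearest (even), 1 = down, 2 = up, 3 = toward zero */
    int sae;           /* SSS[2]: 1 = exception flags suppressed (SAE) */
    const char *usage; /* assembler suffix from the Usage column */
} rounding_override;

/* Decode a 3-bit S2 S1 S0 value under MVEX.EH=1 with a register source. */
static rounding_override decode_rounding_override(unsigned sss)
{
    static const char *const usage[8] = {
        "{rn}", "{rd}", "{ru}", "{rz}",
        "{rn-sae}", "{rd-sae}", "{ru-sae}", "{rz-sae}"
    };
    assert(sss < 8);
    rounding_override r;
    r.rc = sss & 3u;
    r.sae = (int)((sss >> 2) & 1u);
    r.usage = usage[sss];
    return r;
}
```

For instance, decode_rounding_override(5) yields rc = 1 (round down) with sae = 1, the {rd-sae} row of the table.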

Intel® C/C++ Compiler Intrinsic Equivalent
__m512d _mm512_fmsub_pd (__m512d, __m512d, __m512d);
__m512d _mm512_mask_fmsub_pd (__m512d, __mmask8, __m512d, __m512d);
__m512d _mm512_mask3_fmsub_pd (__m512d, __m512d, __m512d, __mmask8);

Exceptions
Real-Address Mode and Virtual-8086
#UD               Instruction not available in these modes

Protected and Compatibility Mode
#UD               Instruction not available in these modes

64 bit Mode
#SS(0)            If a memory address referencing the SS segment is
                  in a non-canonical form.
#GP(0)            If the memory address is in a non-canonical form.
                  If a memory operand linear address is not aligned
                  to the data size granularity dictated by SwizzUpConv
                  mode.
#PF(fault-code)   For a page fault.
#NM               If CR0.TS[bit 3]=1.
                  If preceded by any REX, F0, F2, F3, or 66 prefixes.

VFMSUB213PS - Multiply First Source By Destination and Subtract Second Source Float32 Vectors

Opcode
MVEX.NDS.512.66.0F38.W0 AA /r

Instruction
vfmsub213ps zmm1 {k1}, zmm2, Sf32(zmm3/mt)

Description
Multiply float32 vector zmm2 and float32 vector zmm1, subtract float32 vector
Sf32(zmm3/mt) from the result, and store the final result in zmm1, under write-mask.

Description
Performs an element-by-element multiplication of float32 vector zmm2 and float32
vector zmm1, then subtracts the float32 vector result of the swizzle/broadcast/conversion
process on memory or vector float32 zmm3 from the result. The final result is written
into float32 vector zmm1.
Intermediate values are calculated to infinite precision, and are not truncated or rounded.
All operations must be performed before final rounding.
This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.
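
The statement that intermediate values are not rounded before the final result has a visible numeric effect. The following C sketch contrasts a fused-style evaluation with a two-step evaluation for float32 inputs. The function names are hypothetical, and the fused model relies on an assumption: it uses double-precision intermediates, which hold the product of two float32 values exactly, rather than the hardware's internal datapath.

```c
#include <assert.h>

/* Fused-style model: exact product in double, rounded once on return. */
static float fmsub_fused_model(float a, float b, float c)
{
    return (float)((double)a * (double)b - (double)c);
}

/* Unfused: the product is rounded to float32 before the subtraction. */
static float fmsub_unfused(float a, float b, float c)
{
    volatile float product = a * b; /* volatile forces the intermediate rounding */
    return product - c;
}

/* With a = 1 + 2^-12, a*a = 1 + 2^-11 + 2^-24 exactly. Rounded to float32 the
   2^-24 term is lost (tie, round to even), so the unfused result is 0 while
   the fused-style result keeps 2^-24. */
static void check_single_rounding(void)
{
    const float a = 1.0f + 0x1p-12f;
    const float c = 1.0f + 0x1p-11f;
    assert(fmsub_fused_model(a, a, c) == 0x1p-24f);
    assert(fmsub_unfused(a, a, c) == 0.0f);
}
```

The inputs are chosen so that the double-precision subtraction is itself exact, which keeps the model free of double rounding in this particular case.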

Operation
if(source is a register operand and MVEX.EH bit is 1) {
    if(SSS[2]==1) Supress_Exception_Flags() // SAE
    // SSS are bits 6-4 from the MVEX prefix encoding. For more details, see Table 2.14
    RoundingMode = SSS[1:0]
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    RoundingMode = MXCSR.RC
    tmpSrc3[511:0] = SwizzUpConvLoadf32(zmm3/mt)
}
for (n = 0; n < 16; n++) {
    if(k1[n] != 0) {
        i = 32*n
        // float32 operation
        zmm1[i+31:i] = zmm2[i+31:i] * zmm1[i+31:i] - tmpSrc3[i+31:i]
    }
}

SIMD Floating-Point Exceptions
Overflow, Underflow, Invalid, Precision, Denormal.

Denormal Handling
Treat Input Denormals As Zeros : (MXCSR.DAZ)? YES : NO
Flush Tiny Results To Zero : (MXCSR.FZ)? YES : NO

Memory Up-conversion: Sf32
S2 S1 S0   Function:                    Usage                      disp8*N
000        no conversion                [rax] {16to16} or [rax]    64
001        broadcast 1 element (x16)    [rax] {1to16}              4
010        broadcast 4 elements (x4)    [rax] {4to16}              16
011        float16 to float32           [rax] {float16}            32
100        uint8 to float32             [rax] {uint8}              16
110        uint16 to float32            [rax] {uint16}             32
111        sint16 to float32            [rax] {sint16}             32

Register Swizzle: Sf32
MVEX.EH=0
S2 S1 S0   Function: 4 x 32 bits    Usage
000        no swizzle               zmm0 or zmm0 {dcba}
001        swap (inner) pairs       zmm0 {cdab}
010        swap with two-away       zmm0 {badc}
011        cross-product swizzle    zmm0 {dacb}
100        broadcast a element      zmm0 {aaaa}
101        broadcast b element      zmm0 {bbbb}
110        broadcast c element      zmm0 {cccc}
111        broadcast d element      zmm0 {dddd}

MVEX.EH=1
S2 S1 S0   Rounding Mode Override              Usage
000        Round To Nearest (even)             , {rn}
001        Round Down (-INF)                   , {rd}
010        Round Up (+INF)                     , {ru}
011        Round Toward Zero                   , {rz}
100        Round To Nearest (even) with SAE    , {rn-sae}
101        Round Down (-INF) with SAE          , {rd-sae}
110        Round Up (+INF) with SAE            , {ru-sae}
111        Round Toward Zero with SAE          , {rz-sae}

Intel® C/C++ Compiler Intrinsic Equivalent
__m512 _mm512_fmsub_ps (__m512, __m512, __m512);
__m512 _mm512_mask_fmsub_ps (__m512, __mmask16, __m512, __m512);
__m512 _mm512_mask3_fmsub_ps (__m512, __m512, __m512, __mmask16);

Exceptions
Real-Address Mode and Virtual-8086
#UD               Instruction not available in these modes

Protected and Compatibility Mode
#UD               Instruction not available in these modes

64 bit Mode
#SS(0)            If a memory address referencing the SS segment is
                  in a non-canonical form.
#GP(0)            If the memory address is in a non-canonical form.
                  If a memory operand linear address is not aligned
                  to the data size granularity dictated by SwizzUpConv
                  mode.
#PF(fault-code)   For a page fault.
#NM               If CR0.TS[bit 3]=1.
                  If preceded by any REX, F0, F2, F3, or 66 prefixes.

VFMSUB231PD - Multiply First Source By Second Source and Subtract Destination Float64 Vectors

Opcode
MVEX.NDS.512.66.0F38.W1 BA /r

Instruction
vfmsub231pd zmm1 {k1}, zmm2, Sf64(zmm3/mt)

Description
Multiply float64 vector zmm2 and float64 vector Sf64(zmm3/mt), subtract float64 vector
zmm1 from the result, and store the final result in zmm1, under write-mask.

Description
Performs an element-by-element multiplication of float64 vector zmm2 and the float64
vector result of the swizzle/broadcast/conversion process on memory or vector float64
zmm3, then subtracts float64 vector zmm1 from the result. The final result is written into
float64 vector zmm1.
Intermediate values are calculated to infinite precision, and are not truncated or rounded.
All operations must be performed before final rounding.
This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.

Operation
if(source is a register operand and MVEX.EH bit is 1) {
    if(SSS[2]==1) Supress_Exception_Flags() // SAE
    // SSS are bits 6-4 from the MVEX prefix encoding. For more details, see Table 2.14
    RoundingMode = SSS[1:0]
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    RoundingMode = MXCSR.RC
    tmpSrc3[511:0] = SwizzUpConvLoadf64(zmm3/mt)
}
for (n = 0; n < 8; n++) {
    if(k1[n] != 0) {
        i = 64*n
        // float64 operation
        zmm1[i+63:i] = zmm2[i+63:i] * tmpSrc3[i+63:i] - zmm1[i+63:i]
    }
}

SIMD Floating-Point Exceptions
Overflow, Underflow, Invalid, Precision, Denormal.

Denormal Handling
Treat Input Denormals As Zeros : (MXCSR.DAZ)? YES : NO
Flush Tiny Results To Zero : (MXCSR.FZ)? YES : NO

Memory Up-conversion: Sf64
S2 S1 S0   Function:                    Usage                    disp8*N
000        no conversion                [rax] {8to8} or [rax]    64
001        broadcast 1 element (x8)     [rax] {1to8}             8
010        broadcast 4 elements (x2)    [rax] {4to8}             32
011        reserved                     N/A                      N/A
100        reserved                     N/A                      N/A
101        reserved                     N/A                      N/A
110        reserved                     N/A                      N/A
111        reserved                     N/A                      N/A
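
The disp8*N column gives the scale factor applied to an 8-bit displacement for each Sf64 memory form. The C sketch below shows the arithmetic under one assumption that is not spelled out in this table: that the 8-bit displacement is treated as a signed value and multiplied by N to obtain the byte offset. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* disp8*N scale for the Sf64 memory forms, indexed by S2 S1 S0.
   Returns 0 for the reserved encodings (011 through 111). */
static int disp8_scale_sf64(unsigned sss)
{
    static const int scale[8] = {64, 8, 32, 0, 0, 0, 0, 0};
    assert(sss < 8);
    return scale[sss];
}

/* Byte offset encoded by a signed 8-bit displacement scaled by N
   (assumed interpretation of the disp8*N column). */
static int32_t effective_displacement(int8_t disp8, int n)
{
    return (int32_t)disp8 * n;
}
```

Under this interpretation, a disp8 of 2 with no conversion ({8to8}) addresses byte offset 128, while a disp8 of -1 with {1to8} addresses byte offset -8.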

Register Swizzle: Sf64
MVEX.EH=0
S2 S1 S0   Function: 4 x 64 bits    Usage
000        no swizzle               zmm0 or zmm0 {dcba}
001        swap (inner) pairs       zmm0 {cdab}
010        swap with two-away       zmm0 {badc}
011        cross-product swizzle    zmm0 {dacb}
100        broadcast a element      zmm0 {aaaa}
101        broadcast b element      zmm0 {bbbb}
110        broadcast c element      zmm0 {cccc}
111        broadcast d element      zmm0 {dddd}

MVEX.EH=1
S2 S1 S0   Rounding Mode Override              Usage
000        Round To Nearest (even)             , {rn}
001        Round Down (-INF)                   , {rd}
010        Round Up (+INF)                     , {ru}
011        Round Toward Zero                   , {rz}
100        Round To Nearest (even) with SAE    , {rn-sae}
101        Round Down (-INF) with SAE          , {rd-sae}
110        Round Up (+INF) with SAE            , {ru-sae}
111        Round Toward Zero with SAE          , {rz-sae}

Intel® C/C++ Compiler Intrinsic Equivalent
__m512d _mm512_fmsub_pd (__m512d, __m512d, __m512d);
__m512d _mm512_mask_fmsub_pd (__m512d, __mmask8, __m512d, __m512d);
__m512d _mm512_mask3_fmsub_pd (__m512d, __m512d, __m512d, __mmask8);

Exceptions
Real-Address Mode and Virtual-8086
#UD               Instruction not available in these modes

Protected and Compatibility Mode
#UD               Instruction not available in these modes

64 bit Mode
#SS(0)            If a memory address referencing the SS segment is
                  in a non-canonical form.
#GP(0)            If the memory address is in a non-canonical form.
                  If a memory operand linear address is not aligned
                  to the data size granularity dictated by SwizzUpConv
                  mode.
#PF(fault-code)   For a page fault.
#NM               If CR0.TS[bit 3]=1.
                  If preceded by any REX, F0, F2, F3, or 66 prefixes.

VFMSUB231PS - Multiply First Source By Second Source and Subtract Destination Float32 Vectors

Opcode
MVEX.NDS.512.66.0F38.W0 BA /r

Instruction
vfmsub231ps zmm1 {k1}, zmm2, Sf32(zmm3/mt)

Description
Multiply float32 vector zmm2 and float32 vector Sf32(zmm3/mt), subtract float32 vector
zmm1 from the result, and store the final result in zmm1, under write-mask.

Description
Performs an element-by-element multiplication of float32 vector zmm2 and the float32
vector result of the swizzle/broadcast/conversion process on memory or vector float32
zmm3, then subtracts float32 vector zmm1 from the result. The final result is written into
float32 vector zmm1.
Intermediate values are calculated to infinite precision, and are not truncated or rounded.
All operations must be performed before final rounding.
This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.

Operation
if(source is a register operand and MVEX.EH bit is 1) {
    if(SSS[2]==1) Supress_Exception_Flags() // SAE
    // SSS are bits 6-4 from the MVEX prefix encoding. For more details, see Table 2.14
    RoundingMode = SSS[1:0]
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    RoundingMode = MXCSR.RC
    tmpSrc3[511:0] = SwizzUpConvLoadf32(zmm3/mt)
}
for (n = 0; n < 16; n++) {
    if(k1[n] != 0) {
        i = 32*n
        // float32 operation
        zmm1[i+31:i] = zmm2[i+31:i] * tmpSrc3[i+31:i] - zmm1[i+31:i]
    }
}
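
The 132, 213, and 231 forms of VFMSUB differ only in which operands are multiplied and which is subtracted. The per-element C sketch below places the three side by side, following the Operation sections of the respective instructions. The function names are hypothetical, and plain double arithmetic is used, so the single final rounding is not modeled.

```c
#include <assert.h>

/* dst = zmm1, src2 = zmm2, src3 = swizzled or up-converted third operand. */

/* 132 form: destination times third operand, minus second operand. */
static double fmsub132(double dst, double src2, double src3)
{
    return dst * src3 - src2;
}

/* 213 form: second operand times destination, minus third operand. */
static double fmsub213(double dst, double src2, double src3)
{
    return src2 * dst - src3;
}

/* 231 form: second operand times third operand, minus destination. */
static double fmsub231(double dst, double src2, double src3)
{
    return src2 * src3 - dst;
}

/* With dst = 2, src2 = 3, src3 = 5 the three forms give different results. */
static void check_operand_order(void)
{
    assert(fmsub132(2.0, 3.0, 5.0) == 7.0);  /* 2*5 - 3 */
    assert(fmsub213(2.0, 3.0, 5.0) == 1.0);  /* 3*2 - 5 */
    assert(fmsub231(2.0, 3.0, 5.0) == 13.0); /* 3*5 - 2 */
}
```

The digits in the mnemonic can be read as operand positions: the first two digits name the factors and the third names the subtracted operand.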

SIMD Floating-Point Exceptions
Overflow, Underflow, Invalid, Precision, Denormal.

Denormal Handling
Treat Input Denormals As Zeros : (MXCSR.DAZ)? YES : NO
Flush Tiny Results To Zero : (MXCSR.FZ)? YES : NO

Memory Up-conversion: Sf32
S2 S1 S0   Function:                    Usage                      disp8*N
000        no conversion                [rax] {16to16} or [rax]    64
001        broadcast 1 element (x16)    [rax] {1to16}              4
010        broadcast 4 elements (x4)    [rax] {4to16}              16
011        float16 to float32           [rax] {float16}            32
100        uint8 to float32             [rax] {uint8}              16
110        uint16 to float32            [rax] {uint16}             32
111        sint16 to float32            [rax] {sint16}             32

Register Swizzle: Sf32
MVEX.EH=0
S2 S1 S0   Function: 4 x 32 bits    Usage
000        no swizzle               zmm0 or zmm0 {dcba}
001        swap (inner) pairs       zmm0 {cdab}
010        swap with two-away       zmm0 {badc}
011        cross-product swizzle    zmm0 {dacb}
100        broadcast a element      zmm0 {aaaa}
101        broadcast b element      zmm0 {bbbb}
110        broadcast c element      zmm0 {cccc}
111        broadcast d element      zmm0 {dddd}

MVEX.EH=1
S2 S1 S0   Rounding Mode Override              Usage
000        Round To Nearest (even)             , {rn}
001        Round Down (-INF)                   , {rd}
010        Round Up (+INF)                     , {ru}
011        Round Toward Zero                   , {rz}
100        Round To Nearest (even) with SAE    , {rn-sae}
101        Round Down (-INF) with SAE          , {rd-sae}
110        Round Up (+INF) with SAE            , {ru-sae}
111        Round Toward Zero with SAE          , {rz-sae}

Intel® C/C++ Compiler Intrinsic Equivalent
__m512 _mm512_fmsub_ps (__m512, __m512, __m512);
__m512 _mm512_mask_fmsub_ps (__m512, __mmask16, __m512, __m512);
__m512 _mm512_mask3_fmsub_ps (__m512, __m512, __m512, __mmask16);

Exceptions
Real-Address Mode and Virtual-8086
#UD               Instruction not available in these modes

Protected and Compatibility Mode
#UD               Instruction not available in these modes

64 bit Mode
#SS(0)            If a memory address referencing the SS segment is
                  in a non-canonical form.
#GP(0)            If the memory address is in a non-canonical form.
                  If a memory operand linear address is not aligned
                  to the data size granularity dictated by SwizzUpConv
                  mode.
#PF(fault-code)   For a page fault.
#NM               If CR0.TS[bit 3]=1.
                  If preceded by any REX, F0, F2, F3, or 66 prefixes.

VFNMADD132PD - Multiply Destination By Second Source and Subtract From First Source Float64 Vectors

Opcode
MVEX.NDS.512.66.0F38.W1 9C /r

Instruction
vfnmadd132pd zmm1 {k1}, zmm2, Sf64(zmm3/mt)

Description
Multiply float64 vector zmm1 and float64 vector Sf64(zmm3/mt), negate, and add the
result to float64 vector zmm2, and store the final result in zmm1, under write-mask.

Description
Performs an element-by-element multiplication of float64 vector zmm1 and the float64
vector result of the swizzle/broadcast/conversion process on memory or vector float64
zmm3, then subtracts the result from float64 vector zmm2. The final result is written into
float64 vector zmm1.
Intermediate values are calculated to infinite precision, and are not truncated or rounded.
All operations must be performed before final rounding.
This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.

Operation
if(source is a register operand and MVEX.EH bit is 1) {
    if(SSS[2]==1) Supress_Exception_Flags() // SAE
    // SSS are bits 6-4 from the MVEX prefix encoding. For more details, see Table 2.14
    RoundingMode = SSS[1:0]
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    RoundingMode = MXCSR.RC
    tmpSrc3[511:0] = SwizzUpConvLoadf64(zmm3/mt)
}
for (n = 0; n < 8; n++) {
    if(k1[n] != 0) {
        i = 64*n
        // float64 operation
        zmm1[i+63:i] = -(zmm1[i+63:i] * tmpSrc3[i+63:i]) + zmm2[i+63:i]
    }
}
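
The negate-and-add data flow of VFNMADD132PD can be sketched as a masked loop over eight float64 elements. The model below is illustrative: the names model_vfnmadd132pd and check_vfnmadd132pd are hypothetical, src3 is assumed to be already swizzled or up-converted, and ordinary double arithmetic is used, so the product and the sum are rounded separately rather than once at the end as the instruction specifies.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar model of: vfnmadd132pd zmm1 {k1}, zmm2, src3
   Each register is an array of 8 float64 elements. */
static void model_vfnmadd132pd(double zmm1[8], const double zmm2[8],
                               const double src3[8], uint8_t k1)
{
    for (int n = 0; n < 8; n++) {
        if ((k1 >> n) & 1) {
            /* Two roundings here; the instruction rounds only once. */
            zmm1[n] = -(zmm1[n] * src3[n]) + zmm2[n];
        }
        /* Elements whose mask bit is clear keep their previous value. */
    }
}

/* Small self-check: only the lowest and highest elements are enabled. */
static void check_vfnmadd132pd(void)
{
    double zmm1[8], zmm2[8], src3[8];
    for (int n = 0; n < 8; n++) {
        zmm1[n] = 2.0;
        zmm2[n] = 3.0;
        src3[n] = 5.0;
    }
    model_vfnmadd132pd(zmm1, zmm2, src3, 0x81);
    assert(zmm1[0] == -7.0); /* -(2*5) + 3 */
    assert(zmm1[7] == -7.0);
    assert(zmm1[1] == 2.0);  /* masked off, unchanged */
}
```

For small integer inputs such as these the separate roundings are exact, so the model and the instruction's defined result agree.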

SIMD Floating-Point Exceptions
Overflow, Underflow, Invalid, Precision, Denormal.

Denormal Handling
Treat Input Denormals As Zeros : (MXCSR.DAZ)? YES : NO
Flush Tiny Results To Zero : (MXCSR.FZ)? YES : NO

Memory Up-conversion: Sf64
S2 S1 S0   Function:                    Usage                    disp8*N
000        no conversion                [rax] {8to8} or [rax]    64
001        broadcast 1 element (x8)     [rax] {1to8}             8
010        broadcast 4 elements (x2)    [rax] {4to8}             32
011        reserved                     N/A                      N/A
100        reserved                     N/A                      N/A
101        reserved                     N/A                      N/A
110        reserved                     N/A                      N/A
111        reserved                     N/A                      N/A

Register Swizzle: Sf64
MVEX.EH=0
S2 S1 S0   Function: 4 x 64 bits    Usage
000        no swizzle               zmm0 or zmm0 {dcba}
001        swap (inner) pairs       zmm0 {cdab}
010        swap with two-away       zmm0 {badc}
011        cross-product swizzle    zmm0 {dacb}
100        broadcast a element      zmm0 {aaaa}
101        broadcast b element      zmm0 {bbbb}
110        broadcast c element      zmm0 {cccc}
111        broadcast d element      zmm0 {dddd}

MVEX.EH=1
S2 S1 S0   Rounding Mode Override              Usage
000        Round To Nearest (even)             , {rn}
001        Round Down (-INF)                   , {rd}
010        Round Up (+INF)                     , {ru}
011        Round Toward Zero                   , {rz}
100        Round To Nearest (even) with SAE    , {rn-sae}
101        Round Down (-INF) with SAE          , {rd-sae}
110        Round Up (+INF) with SAE            , {ru-sae}
111        Round Toward Zero with SAE          , {rz-sae}

Intel® C/C++ Compiler Intrinsic Equivalent
__m512d _mm512_fnmadd_pd (__m512d, __m512d, __m512d);
__m512d _mm512_mask_fnmadd_pd (__m512d, __mmask8, __m512d, __m512d);
__m512d _mm512_mask3_fnmadd_pd (__m512d, __m512d, __m512d, __mmask8);

Exceptions
Real-Address Mode and Virtual-8086
#UD               Instruction not available in these modes

Protected and Compatibility Mode
#UD               Instruction not available in these modes

64 bit Mode
#SS(0)            If a memory address referencing the SS segment is
                  in a non-canonical form.
#GP(0)            If the memory address is in a non-canonical form.
                  If a memory operand linear address is not aligned
                  to the data size granularity dictated by SwizzUpConv
                  mode.
#PF(fault-code)   For a page fault.
#NM               If CR0.TS[bit 3]=1.
                  If preceded by any REX, F0, F2, F3, or 66 prefixes.

VFNMADD132PS - Multiply Destination By Second Source and Subtract From First Source Float32 Vectors

Opcode
MVEX.NDS.512.66.0F38.W0 9C /r

Instruction
vfnmadd132ps zmm1 {k1}, zmm2, Sf32(zmm3/mt)

Description
Multiply float32 vector zmm1 and float32 vector Sf32(zmm3/mt), negate, and add the
result to float32 vector zmm2, and store the final result in zmm1, under write-mask.

Description
Performs an element-by-element multiplication of float32 vector zmm1 and the float32
vector result of the swizzle/broadcast/conversion process on memory or vector float32
zmm3, then subtracts the result from float32 vector zmm2. The final result is written into
float32 vector zmm1.
Intermediate values are calculated to infinite precision, and are not truncated or rounded.
All operations must be performed before final rounding.
This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.

Operation
if(source is a register operand and MVEX.EH bit is 1) {
    if(SSS[2]==1) Supress_Exception_Flags() // SAE
    // SSS are bits 6-4 from the MVEX prefix encoding. For more details, see Table 2.14
    RoundingMode = SSS[1:0]
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    RoundingMode = MXCSR.RC
    tmpSrc3[511:0] = SwizzUpConvLoadf32(zmm3/mt)
}
for (n = 0; n < 16; n++) {
    if(k1[n] != 0) {
        i = 32*n
        // float32 operation
        zmm1[i+31:i] = -(zmm1[i+31:i] * tmpSrc3[i+31:i]) + zmm2[i+31:i]
    }
}

SIMD Floating-Point Exceptions
Overflow, Underflow, Invalid, Precision, Denormal.

Denormal Handling
Treat Input Denormals As Zeros : (MXCSR.DAZ)? YES : NO
Flush Tiny Results To Zero : (MXCSR.FZ)? YES : NO

Memory Up-conversion: Sf32
S2 S1 S0   Function:                    Usage                      disp8*N
000        no conversion                [rax] {16to16} or [rax]    64
001        broadcast 1 element (x16)    [rax] {1to16}              4
010        broadcast 4 elements (x4)    [rax] {4to16}              16
011        float16 to float32           [rax] {float16}            32
100        uint8 to float32             [rax] {uint8}              16
110        uint16 to float32            [rax] {uint16}             32
111        sint16 to float32            [rax] {sint16}             32
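
The integer rows of the Sf32 memory table ({uint8}, {uint16}, {sint16}) read sixteen narrow elements and widen each to float32. The C sketch below models this under the assumption that the conversion is a plain value-preserving integer-to-float conversion; the function names are hypothetical, and the {float16} row is not covered.

```c
#include <assert.h>
#include <stdint.h>

/* {uint8}: 16 bytes in memory, each converted to float32. */
static void upconv_uint8_to_f32(float dst[16], const uint8_t mem[16])
{
    for (int n = 0; n < 16; n++)
        dst[n] = (float)mem[n];
}

/* {uint16}: 32 bytes in memory, each 16-bit unsigned value converted to float32. */
static void upconv_uint16_to_f32(float dst[16], const uint16_t mem[16])
{
    for (int n = 0; n < 16; n++)
        dst[n] = (float)mem[n];
}

/* {sint16}: 32 bytes in memory, each 16-bit signed value converted to float32. */
static void upconv_sint16_to_f32(float dst[16], const int16_t mem[16])
{
    for (int n = 0; n < 16; n++)
        dst[n] = (float)mem[n];
}

/* Small self-check for the unsigned 8-bit and signed 16-bit forms. */
static void check_upconv(void)
{
    uint8_t bytes[16] = {0};
    int16_t words[16] = {0};
    float dst[16];
    bytes[3] = 255;
    words[5] = -3;
    upconv_uint8_to_f32(dst, bytes);
    assert(dst[3] == 255.0f && dst[0] == 0.0f);
    upconv_sint16_to_f32(dst, words);
    assert(dst[5] == -3.0f);
}
```

Every 8-bit and 16-bit integer is exactly representable in float32, so these conversions involve no rounding.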

Register Swizzle: Sf32
MVEX.EH=0
S2 S1 S0   Function: 4 x 32 bits    Usage
000        no swizzle               zmm0 or zmm0 {dcba}
001        swap (inner) pairs       zmm0 {cdab}
010        swap with two-away       zmm0 {badc}
011        cross-product swizzle    zmm0 {dacb}
100        broadcast a element      zmm0 {aaaa}
101        broadcast b element      zmm0 {bbbb}
110        broadcast c element      zmm0 {cccc}
111        broadcast d element      zmm0 {dddd}

MVEX.EH=1
S2 S1 S0   Rounding Mode Override              Usage
000        Round To Nearest (even)             , {rn}
001        Round Down (-INF)                   , {rd}
010        Round Up (+INF)                     , {ru}
011        Round Toward Zero                   , {rz}
100        Round To Nearest (even) with SAE    , {rn-sae}
101        Round Down (-INF) with SAE          , {rd-sae}
110        Round Up (+INF) with SAE            , {ru-sae}
111        Round Toward Zero with SAE          , {rz-sae}

Intel® C/C++ Compiler Intrinsic Equivalent
__m512 _mm512_fnmadd_ps (__m512, __m512, __m512);
__m512 _mm512_mask_fnmadd_ps (__m512, __mmask16, __m512, __m512);
__m512 _mm512_mask3_fnmadd_ps (__m512, __m512, __m512, __mmask16);

Exceptions
Real-Address Mode and Virtual-8086
#UD               Instruction not available in these modes

Protected and Compatibility Mode
#UD               Instruction not available in these modes

64 bit Mode
#SS(0)            If a memory address referencing the SS segment is
                  in a non-canonical form.
#GP(0)            If the memory address is in a non-canonical form.
                  If a memory operand linear address is not aligned
                  to the data size granularity dictated by SwizzUpConv
                  mode.
#PF(fault-code)   For a page fault.
#NM               If CR0.TS[bit 3]=1.
                  If preceded by any REX, F0, F2, F3, or 66 prefixes.

VFNMADD213PD - Multiply First Source By Destination and Subtract From Second Source Float64 Vectors

Opcode
MVEX.NDS.512.66.0F38.W1 AC /r

Instruction
vfnmadd213pd zmm1 {k1}, zmm2, Sf64(zmm3/mt)

Description
Multiply float64 vector zmm2 and float64 vector zmm1, negate, and add the result to
float64 vector Sf64(zmm3/mt), and store the final result in zmm1, under write-mask.

Description
Performs an element-by-element multiplication of float64 vector zmm2 and float64
vector zmm1, then subtracts the result from the float64 vector result of the
swizzle/broadcast/conversion process on memory or vector float64 zmm3. The final
result is written into float64 vector zmm1.
Intermediate values are calculated to infinite precision, and are not truncated or rounded.
All operations must be performed before final rounding.
This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.

Operation
if(source is a register operand and MVEX.EH bit is 1) {
    if(SSS[2]==1) Supress_Exception_Flags() // SAE
    // SSS are bits 6-4 from the MVEX prefix encoding. For more details, see Table 2.14
    RoundingMode = SSS[1:0]
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    RoundingMode = MXCSR.RC
    tmpSrc3[511:0] = SwizzUpConvLoadf64(zmm3/mt)
}
for (n = 0; n < 8; n++) {
    if(k1[n] != 0) {
        i = 64*n
        // float64 operation
        zmm1[i+63:i] = -(zmm2[i+63:i] * zmm1[i+63:i]) + tmpSrc3[i+63:i]
    }
}

SIMD Floating-Point Exceptions
Overflow, Underflow, Invalid, Precision, Denormal.

Denormal Handling
Treat Input Denormals As Zeros : (MXCSR.DAZ)? YES : NO
Flush Tiny Results To Zero : (MXCSR.FZ)? YES : NO

Memory Up-conversion: Sf64
S2 S1 S0   Function:                    Usage                    disp8*N
000        no conversion                [rax] {8to8} or [rax]    64
001        broadcast 1 element (x8)     [rax] {1to8}             8
010        broadcast 4 elements (x2)    [rax] {4to8}             32
011        reserved                     N/A                      N/A
100        reserved                     N/A                      N/A
101        reserved                     N/A                      N/A
110        reserved                     N/A                      N/A
111        reserved                     N/A                      N/A

Register Swizzle: Sf64
MVEX.EH=0
S2 S1 S0   Function: 4 x 64 bits    Usage
000        no swizzle               zmm0 or zmm0 {dcba}
001        swap (inner) pairs       zmm0 {cdab}
010        swap with two-away       zmm0 {badc}
011        cross-product swizzle    zmm0 {dacb}
100        broadcast a element      zmm0 {aaaa}
101        broadcast b element      zmm0 {bbbb}
110        broadcast c element      zmm0 {cccc}
111        broadcast d element      zmm0 {dddd}

MVEX.EH=1
S2 S1 S0   Rounding Mode Override              Usage
000        Round To Nearest (even)             , {rn}
001        Round Down (-INF)                   , {rd}
010        Round Up (+INF)                     , {ru}
011        Round Toward Zero                   , {rz}
100        Round To Nearest (even) with SAE    , {rn-sae}
101        Round Down (-INF) with SAE          , {rd-sae}
110        Round Up (+INF) with SAE            , {ru-sae}
111        Round Toward Zero with SAE          , {rz-sae}

Intel® C/C++ Compiler Intrinsic Equivalent
__m512d _mm512_fnmadd_pd (__m512d, __m512d, __m512d);
__m512d _mm512_mask_fnmadd_pd (__m512d, __mmask8, __m512d, __m512d);
__m512d _mm512_mask3_fnmadd_pd (__m512d, __m512d, __m512d, __mmask8);

Exceptions
Real-Address Mode and Virtual-8086
#UD               Instruction not available in these modes

Protected and Compatibility Mode
#UD               Instruction not available in these modes

64 bit Mode
#SS(0)            If a memory address referencing the SS segment is
                  in a non-canonical form.
#GP(0)            If the memory address is in a non-canonical form.
                  If a memory operand linear address is not aligned
                  to the data size granularity dictated by SwizzUpConv
                  mode.
#PF(fault-code)   For a page fault.
#NM               If CR0.TS[bit 3]=1.
                  If preceded by any REX, F0, F2, F3, or 66 prefixes.

VFNMADD213PS - Multiply First Source By Destination and Subtract From Second Source Float32 Vectors

Opcode
MVEX.NDS.512.66.0F38.W0 AC /r

Instruction
vfnmadd213ps zmm1 {k1}, zmm2, Sf32(zmm3/mt)

Description
Multiply float32 vector zmm2 and float32 vector zmm1, negate, and add the result to
float32 vector Sf32(zmm3/mt), and store the final result in zmm1, under write-mask.

Description
Performs an element-by-element multiplication of float32 vector zmm2 and float32
vector zmm1, then subtracts the result from the float32 vector result of the
swizzle/broadcast/conversion process on memory or vector float32 zmm3. The final
result is written into float32 vector zmm1.
Intermediate values are calculated to infinite precision, and are not truncated or rounded.
All operations must be performed before final rounding.
This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.

Operation
if(source is a register operand and MVEX.EH bit is 1) {
    if(SSS[2]==1) Supress_Exception_Flags() // SAE
    // SSS are bits 6-4 from the MVEX prefix encoding. For more details, see Table 2.14
    RoundingMode = SSS[1:0]
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    RoundingMode = MXCSR.RC
    tmpSrc3[511:0] = SwizzUpConvLoadf32(zmm3/mt)
}
for (n = 0; n < 16; n++) {
    if(k1[n] != 0) {
        i = 32*n
        // float32 operation
        zmm1[i+31:i] = -(zmm2[i+31:i] * zmm1[i+31:i]) + tmpSrc3[i+31:i]
    }
}

SIMD Floating-Point Exceptions
Over low, Under low, Invalid, Precision, Denormal.

Denormal Handling
Treat Input Denormals As Zeros :
(MXCSR.DAZ)? YES : NO
Flush Tiny Results To Zero :
(MXCSR.FZ)? YES : NO

Memory Up-conversion: Sf32
S2 S1 S0   Function:                    Usage                      disp8*N
000        no conversion                [rax] {16to16} or [rax]    64
001        broadcast 1 element (x16)    [rax] {1to16}              4
010        broadcast 4 elements (x4)    [rax] {4to16}              16
011        float16 to float32           [rax] {float16}            32
100        uint8 to float32             [rax] {uint8}              16
110        uint16 to float32            [rax] {uint16}             32
111        sint16 to float32            [rax] {sint16}             32
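
The disp8*N column gives the factor by which an 8-bit displacement is scaled for each memory form. A minimal C sketch of that scaling follows; the helper name sf32_disp8_offset and the lookup array are hypothetical, and the N values are simply copied from the table (0 marks the 101 encoding, which the table does not list).

```c
#include <assert.h>
#include <stdint.h>

/* N per Sf32 memory up-conversion encoding, indexed by S2 S1 S0.
 * Copied from the table above; 0 = encoding not listed there. */
static const int sf32_disp8_n[8] = { 64, 4, 16, 32, 16, 0, 32, 32 };

/* Hypothetical helper: byte offset represented by a signed 8-bit
 * displacement under the selected up-conversion. */
static int32_t sf32_disp8_offset(unsigned sss, int8_t disp8)
{
    assert(sss < 8 && sf32_disp8_n[sss] != 0);
    return (int32_t)disp8 * sf32_disp8_n[sss];
}
```

For example, a disp8 of 2 with no conversion (000) addresses byte offset 128, while the same disp8 with {1to16} (001) addresses byte offset 8.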

Register Swizzle: Sf32
MVEX.EH=0
S2 S1 S0   Function: 4 x 32 bits       Usage
000        no swizzle                  zmm0 or zmm0 {dcba}
001        swap (inner) pairs          zmm0 {cdab}
010        swap with two-away          zmm0 {badc}
011        cross-product swizzle       zmm0 {dacb}
100        broadcast a element         zmm0 {aaaa}
101        broadcast b element         zmm0 {bbbb}
110        broadcast c element         zmm0 {cccc}
111        broadcast d element         zmm0 {dddd}

MVEX.EH=1
S2 S1 S0   Rounding Mode Override               Usage
000        Round To Nearest (even)              , {rn}
001        Round Down (-INF)                    , {rd}
010        Round Up (+INF)                      , {ru}
011        Round Toward Zero                    , {rz}
100        Round To Nearest (even) with SAE     , {rn-sae}
101        Round Down (-INF) with SAE           , {rd-sae}
110        Round Up (+INF) with SAE             , {ru-sae}
111        Round Toward Zero with SAE           , {rz-sae}
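
The MVEX.EH=0 swizzles can be pictured as a fixed permutation applied independently to each group of four elements. The C sketch below is one possible model, under the assumption that in a pattern such as {dcba} the letter a names element 0 of the group and the rightmost letter names the source of element 0. The table sf32_swizzle_src and the function sf32_swizzle are hypothetical names introduced for illustration.

```c
#include <assert.h>

/* Source index within each 4-element group, indexed by S2 S1 S0 and by
 * destination position 0..3 (assumption: 'a' is position 0). */
static const int sf32_swizzle_src[8][4] = {
    {0, 1, 2, 3},  /* 000 {dcba} no swizzle            */
    {1, 0, 3, 2},  /* 001 {cdab} swap (inner) pairs    */
    {2, 3, 0, 1},  /* 010 {badc} swap with two-away    */
    {1, 2, 0, 3},  /* 011 {dacb} cross-product swizzle */
    {0, 0, 0, 0},  /* 100 {aaaa} broadcast a element   */
    {1, 1, 1, 1},  /* 101 {bbbb} broadcast b element   */
    {2, 2, 2, 2},  /* 110 {cccc} broadcast c element   */
    {3, 3, 3, 3},  /* 111 {dddd} broadcast d element   */
};

/* Applies the selected swizzle to all four groups of a 16-element vector. */
static void sf32_swizzle(float out[16], const float in[16], unsigned sss)
{
    assert(sss < 8);
    for (int n = 0; n < 16; n++) {
        int group_base = n & ~3;  /* first element of this 4-element group */
        out[n] = in[group_base + sf32_swizzle_src[sss][n & 3]];
    }
}
```

Under this model, {cdab} applied to the group (a, b, c, d) = (0, 1, 2, 3) yields (1, 0, 3, 2) in positions 0 through 3.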

Intel® C/C++ Compiler Intrinsic Equivalent
__m512 _mm512_fnmadd_ps (__m512, __m512, __m512);
__m512 _mm512_mask_fnmadd_ps (__m512, __mmask16, __m512, __m512);
__m512 _mm512_mask3_fnmadd_ps (__m512, __m512, __m512, __mmask16);

Exceptions
Real-Address Mode and Virtual-8086
#UD                Instruction not available in these modes

Protected and Compatibility Mode
#UD                Instruction not available in these modes

64 bit Mode
#SS(0)             If a memory address referencing the SS segment is in a non-canonical form.
#GP(0)             If the memory address is in a non-canonical form.
                   If a memory operand linear address is not aligned to the data size
                   granularity dictated by SwizzUpConv mode.
#PF(fault-code)    For a page fault.
#NM                If CR0.TS[bit 3]=1.
                   If preceded by any REX, F0, F2, F3, or 66 prefixes.

VFNMADD231PD - Multiply First Source By Second Source and Subtract
From Destination Float64 Vectors

Opcode
MVEX.NDS.512.66.0F38.W1 BC /r

Instruction
vfnmadd231pd zmm1 {k1}, zmm2, Sf64(zmm3/mt)

Description
Multiply float64 vector zmm2 and float64 vector Sf64(zmm3/mt), negate, and add the
result to float64 vector zmm1, and store the final result in zmm1, under write-mask.

Description
Performs an element-by-element multiplication of float64 vector zmm2 and the float64
vector result of the swizzle/broadcast/conversion process on memory or vector float64
zmm3, then subtracts the result from float64 vector zmm1. The final result is written
into float64 vector zmm1.
Intermediate values are calculated to infinite precision, and are not truncated or rounded.
All operations must be performed before the final rounding.
This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.

Operation
if(source is a register operand and MVEX.EH bit is 1) {
    if(SSS[2]==1) Supress_Exception_Flags() // SAE
    // SSS are bits 6-4 from the MVEX prefix encoding. For more details, see Table 2.14
    RoundingMode = SSS[1:0]
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    RoundingMode = MXCSR.RC
    tmpSrc3[511:0] = SwizzUpConvLoadf64(zmm3/mt)
}
for (n = 0; n < 8; n++) {
    if(k1[n] != 0) {
        i = 64*n
        // float64 operation
        zmm1[i+63:i] = -(zmm2[i+63:i] * tmpSrc3[i+63:i]) + zmm1[i+63:i]
    }
}
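
The first branch of the Operation above splits the three SSS bits into a rounding mode (SSS[1:0]) and a suppress-all-exceptions flag (SSS[2]) when the source is a register and MVEX.EH is 1. A small C sketch of that decode is shown here; the enum, struct and function names are hypothetical, and the numeric rounding-mode values follow the Rounding Mode Override table for this instruction.

```c
#include <assert.h>
#include <stdbool.h>

/* Rounding modes numbered as in SSS[1:0]. */
enum rounding {
    ROUND_NEAREST_EVEN = 0,
    ROUND_DOWN         = 1,  /* toward -INF */
    ROUND_UP           = 2,  /* toward +INF */
    ROUND_TOWARD_ZERO  = 3
};

struct rc_override {
    enum rounding mode;
    bool sae;  /* true when exception flags are suppressed */
};

/* Hypothetical decode of SSS for a register source with MVEX.EH = 1. */
static struct rc_override decode_sss_override(unsigned sss)
{
    struct rc_override r;
    assert(sss < 8);
    r.mode = (enum rounding)(sss & 3u);  /* SSS[1:0] */
    r.sae  = ((sss >> 2) & 1u) != 0;     /* SSS[2]   */
    return r;
}
```

For instance, SSS = 101 decodes to Round Down (-INF) with SAE, which corresponds to the {rd-sae} usage.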

SIMD Floating-Point Exceptions
Overflow, Underflow, Invalid, Precision, Denormal.

Denormal Handling
Treat Input Denormals As Zeros: (MXCSR.DAZ)? YES : NO
Flush Tiny Results To Zero: (MXCSR.FZ)? YES : NO

Memory Up-conversion: Sf64
S2 S1 S0   Function:                    Usage                    disp8*N
000        no conversion                [rax] {8to8} or [rax]    64
001        broadcast 1 element (x8)     [rax] {1to8}             8
010        broadcast 4 elements (x2)    [rax] {4to8}             32
011        reserved                     N/A                      N/A
100        reserved                     N/A                      N/A
101        reserved                     N/A                      N/A
110        reserved                     N/A                      N/A
111        reserved                     N/A                      N/A

Register Swizzle: Sf64
MVEX.EH=0
S2 S1 S0   Function: 4 x 64 bits       Usage
000        no swizzle                  zmm0 or zmm0 {dcba}
001        swap (inner) pairs          zmm0 {cdab}
010        swap with two-away          zmm0 {badc}
011        cross-product swizzle       zmm0 {dacb}
100        broadcast a element         zmm0 {aaaa}
101        broadcast b element         zmm0 {bbbb}
110        broadcast c element         zmm0 {cccc}
111        broadcast d element         zmm0 {dddd}

MVEX.EH=1
S2 S1 S0   Rounding Mode Override               Usage
000        Round To Nearest (even)              , {rn}
001        Round Down (-INF)                    , {rd}
010        Round Up (+INF)                      , {ru}
011        Round Toward Zero                    , {rz}
100        Round To Nearest (even) with SAE     , {rn-sae}
101        Round Down (-INF) with SAE           , {rd-sae}
110        Round Up (+INF) with SAE             , {ru-sae}
111        Round Toward Zero with SAE           , {rz-sae}

Intel® C/C++ Compiler Intrinsic Equivalent
__m512d _mm512_fnmadd_pd (__m512d, __m512d, __m512d);
__m512d _mm512_mask_fnmadd_pd (__m512d, __mmask8, __m512d, __m512d);
__m512d _mm512_mask3_fnmadd_pd (__m512d, __m512d, __m512d, __mmask8);

Exceptions
Real-Address Mode and Virtual-8086
#UD                Instruction not available in these modes

Protected and Compatibility Mode
#UD                Instruction not available in these modes

64 bit Mode
#SS(0)             If a memory address referencing the SS segment is in a non-canonical form.
#GP(0)             If the memory address is in a non-canonical form.
                   If a memory operand linear address is not aligned to the data size
                   granularity dictated by SwizzUpConv mode.
#PF(fault-code)    For a page fault.
#NM                If CR0.TS[bit 3]=1.
                   If preceded by any REX, F0, F2, F3, or 66 prefixes.

VFNMADD231PS - Multiply First Source By Second Source and Subtract
From Destination Float32 Vectors

Opcode
MVEX.NDS.512.66.0F38.W0 BC /r

Instruction
vfnmadd231ps zmm1 {k1}, zmm2, Sf32(zmm3/mt)

Description
Multiply float32 vector zmm2 and float32 vector Sf32(zmm3/mt), negate, and add the
result to float32 vector zmm1, and store the final result in zmm1, under write-mask.

Description
Performs an element-by-element multiplication of float32 vector zmm2 and the float32
vector result of the swizzle/broadcast/conversion process on memory or vector float32
zmm3, then subtracts the result from float32 vector zmm1. The final result is written
into float32 vector zmm1.
Intermediate values are calculated to infinite precision, and are not truncated or rounded.
All operations must be performed before the final rounding.
This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.

Operation
if(source is a register operand and MVEX.EH bit is 1) {
    if(SSS[2]==1) Supress_Exception_Flags() // SAE
    // SSS are bits 6-4 from the MVEX prefix encoding. For more details, see Table 2.14
    RoundingMode = SSS[1:0]
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    RoundingMode = MXCSR.RC
    tmpSrc3[511:0] = SwizzUpConvLoadf32(zmm3/mt)
}
for (n = 0; n < 16; n++) {
    if(k1[n] != 0) {
        i = 32*n
        // float32 operation
        zmm1[i+31:i] = -(zmm2[i+31:i] * tmpSrc3[i+31:i]) + zmm1[i+31:i]
    }
}

SIMD Floating-Point Exceptions
Overflow, Underflow, Invalid, Precision, Denormal.

Denormal Handling
Treat Input Denormals As Zeros: (MXCSR.DAZ)? YES : NO
Flush Tiny Results To Zero: (MXCSR.FZ)? YES : NO

Memory Up-conversion: Sf32
S2 S1 S0   Function:                    Usage                      disp8*N
000        no conversion                [rax] {16to16} or [rax]    64
001        broadcast 1 element (x16)    [rax] {1to16}              4
010        broadcast 4 elements (x4)    [rax] {4to16}              16
011        float16 to float32           [rax] {float16}            32
100        uint8 to float32             [rax] {uint8}              16
110        uint16 to float32            [rax] {uint16}             32
111        sint16 to float32            [rax] {sint16}             32
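
To make three of the memory forms in the table concrete, the C sketch below models {1to16}, {4to16} and {uint8} as plain loads into a 16-element float array. The function names are hypothetical, and the element order of the {4to16} form (the four loaded values repeated in the same order in each group) is an assumption. The number of bytes each function reads (4, 16 and 16) corresponds to the disp8*N column.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* 001 {1to16}: read one float32 and replicate it 16 times. */
static void load_1to16(float out[16], const void *mem)
{
    float v;
    assert(mem != NULL);
    memcpy(&v, mem, sizeof v);
    for (int n = 0; n < 16; n++) out[n] = v;
}

/* 010 {4to16}: read four float32 values and repeat the group four times
 * (assumed order: value k lands in positions k, k+4, k+8, k+12). */
static void load_4to16(float out[16], const void *mem)
{
    float v[4];
    assert(mem != NULL);
    memcpy(v, mem, sizeof v);
    for (int n = 0; n < 16; n++) out[n] = v[n & 3];
}

/* 100 {uint8}: read 16 bytes and convert each one to float32. */
static void load_uint8(float out[16], const void *mem)
{
    const uint8_t *p = mem;
    assert(mem != NULL);
    for (int n = 0; n < 16; n++) out[n] = (float)p[n];
}
```

The result of any of these loads plays the role of tmpSrc3 in the Operation pseudocode for this instruction.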

Register Swizzle: Sf32
MVEX.EH=0
S2 S1 S0   Function: 4 x 32 bits       Usage
000        no swizzle                  zmm0 or zmm0 {dcba}
001        swap (inner) pairs          zmm0 {cdab}
010        swap with two-away          zmm0 {badc}
011        cross-product swizzle       zmm0 {dacb}
100        broadcast a element         zmm0 {aaaa}
101        broadcast b element         zmm0 {bbbb}
110        broadcast c element         zmm0 {cccc}
111        broadcast d element         zmm0 {dddd}

MVEX.EH=1
S2 S1 S0   Rounding Mode Override               Usage
000        Round To Nearest (even)              , {rn}
001        Round Down (-INF)                    , {rd}
010        Round Up (+INF)                      , {ru}
011        Round Toward Zero                    , {rz}
100        Round To Nearest (even) with SAE     , {rn-sae}
101        Round Down (-INF) with SAE           , {rd-sae}
110        Round Up (+INF) with SAE             , {ru-sae}
111        Round Toward Zero with SAE           , {rz-sae}

Intel® C/C++ Compiler Intrinsic Equivalent
__m512 _mm512_fnmadd_ps (__m512, __m512, __m512);
__m512 _mm512_mask_fnmadd_ps (__m512, __mmask16, __m512, __m512);
__m512 _mm512_mask3_fnmadd_ps (__m512, __m512, __m512, __mmask16);

Exceptions
Real-Address Mode and Virtual-8086
#UD                Instruction not available in these modes

Protected and Compatibility Mode
#UD                Instruction not available in these modes

64 bit Mode
#SS(0)             If a memory address referencing the SS segment is in a non-canonical form.
#GP(0)             If the memory address is in a non-canonical form.
                   If a memory operand linear address is not aligned to the data size
                   granularity dictated by SwizzUpConv mode.
#PF(fault-code)    For a page fault.
#NM                If CR0.TS[bit 3]=1.
                   If preceded by any REX, F0, F2, F3, or 66 prefixes.
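
The #GP(0) alignment condition above can be read as a simple modulo test on the linear address. The C sketch below shows such a test; it assumes, without confirmation from this section, that the granularity for a given up-conversion equals the number of bytes that form reads (the value in the disp8*N column). The function name sf32_operand_aligned is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Returns true when linear_addr is a multiple of granularity.
 * Assumption: granularity is the byte count read by the selected
 * SwizzUpConv mode and is a power of two. */
static bool sf32_operand_aligned(uintptr_t linear_addr, unsigned granularity)
{
    assert(granularity != 0 && (granularity & (granularity - 1)) == 0);
    return (linear_addr & (uintptr_t)(granularity - 1)) == 0;
}
```

Under that assumption, address 0x1004 would be acceptable for a {1to16} load (granularity 4) but not for a full 64-byte load.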


VFNMSUB132PD - Multiply Destination By Second Source, Negate, and
Subtract First Source Float64 Vectors

Opcode
MVEX.NDS.512.66.0F38.W1 9E /r

Instruction
vfnmsub132pd zmm1 {k1}, zmm2, Sf64(zmm3/mt)

Description
Multiply float64 vector zmm1 and float64 vector Sf64(zmm3/mt), negate, and subtract
float64 vector zmm2 from the result, and store the final result in zmm1, under write-mask.

Description
Performs an element-by-element multiplication between float64 vector zmm1 and the
float64 vector result of the swizzle/broadcast/conversion process on memory or vector
float64 zmm3, negates, and subtracts float64 vector zmm2. The final result is written into
float64 vector zmm1.
Intermediate values are calculated to infinite precision, and are not truncated or rounded.
All operations must be performed before the final rounding.
x*y   z     RN/RU/RZ             RD
+0    +0    (-0) + (-0) = -0     (-0) + (-0) = -0
+0    -0    (-0) + (+0) = +0     (-0) + (+0) = -0
-0    +0    (+0) + (-0) = +0     (+0) + (-0) = -0
-0    -0    (+0) + (+0) = +0     (+0) + (+0) = +0

Table 6.11: VFNMSUB outcome when adding zeros depending on rounding-mode
This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.

Operation
if(source is a register operand and MVEX.EH bit is 1) {
    if(SSS[2]==1) Supress_Exception_Flags() // SAE
    // SSS are bits 6-4 from the MVEX prefix encoding. For more details, see Table 2.14
    RoundingMode = SSS[1:0]
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    RoundingMode = MXCSR.RC
    tmpSrc3[511:0] = SwizzUpConvLoadf64(zmm3/mt)
}
for (n = 0; n < 8; n++) {
    if(k1[n] != 0) {
        i = 64*n
        // float64 operation
        zmm1[i+63:i] = (-(zmm1[i+63:i] * tmpSrc3[i+63:i]) - zmm2[i+63:i])
    }
}

SIMD Floating-Point Exceptions
Overflow, Underflow, Invalid, Precision, Denormal.

Denormal Handling
Treat Input Denormals As Zeros: (MXCSR.DAZ)? YES : NO
Flush Tiny Results To Zero: (MXCSR.FZ)? YES : NO

Memory Up-conversion: Sf64
S2 S1 S0   Function:                    Usage                    disp8*N
000        no conversion                [rax] {8to8} or [rax]    64
001        broadcast 1 element (x8)     [rax] {1to8}             8
010        broadcast 4 elements (x2)    [rax] {4to8}             32
011        reserved                     N/A                      N/A
100        reserved                     N/A                      N/A
101        reserved                     N/A                      N/A
110        reserved                     N/A                      N/A
111        reserved                     N/A                      N/A

Register Swizzle: Sf64
MVEX.EH=0
S2 S1 S0   Function: 4 x 64 bits       Usage
000        no swizzle                  zmm0 or zmm0 {dcba}
001        swap (inner) pairs          zmm0 {cdab}
010        swap with two-away          zmm0 {badc}
011        cross-product swizzle       zmm0 {dacb}
100        broadcast a element         zmm0 {aaaa}
101        broadcast b element         zmm0 {bbbb}
110        broadcast c element         zmm0 {cccc}
111        broadcast d element         zmm0 {dddd}

MVEX.EH=1
S2 S1 S0   Rounding Mode Override               Usage
000        Round To Nearest (even)              , {rn}
001        Round Down (-INF)                    , {rd}
010        Round Up (+INF)                      , {ru}
011        Round Toward Zero                    , {rz}
100        Round To Nearest (even) with SAE     , {rn-sae}
101        Round Down (-INF) with SAE           , {rd-sae}
110        Round Up (+INF) with SAE             , {ru-sae}
111        Round Toward Zero with SAE           , {rz-sae}

Intel® C/C++ Compiler Intrinsic Equivalent
__m512d _mm512_fnmsub_pd (__m512d, __m512d, __m512d);
__m512d _mm512_mask_fnmsub_pd (__m512d, __mmask8, __m512d, __m512d);
__m512d _mm512_mask3_fnmsub_pd (__m512d, __m512d, __m512d, __mmask8);

Exceptions
Real-Address Mode and Virtual-8086
#UD                Instruction not available in these modes

Protected and Compatibility Mode
#UD                Instruction not available in these modes

64 bit Mode
#SS(0)             If a memory address referencing the SS segment is in a non-canonical form.
#GP(0)             If the memory address is in a non-canonical form.
                   If a memory operand linear address is not aligned to the data size
                   granularity dictated by SwizzUpConv mode.
#PF(fault-code)    For a page fault.
#NM                If CR0.TS[bit 3]=1.
                   If preceded by any REX, F0, F2, F3, or 66 prefixes.

VFNMSUB132PS - Multiply Destination By Second Source, Negate, and
Subtract First Source Float32 Vectors

Opcode
MVEX.NDS.512.66.0F38.W0 9E /r

Instruction
vfnmsub132ps zmm1 {k1}, zmm2, Sf32(zmm3/mt)

Description
Multiply float32 vector zmm1 and float32 vector Sf32(zmm3/mt), negate, and subtract
float32 vector zmm2 from the result, and store the final result in zmm1, under write-mask.

Description
Performs an element-by-element multiplication between float32 vector zmm1 and the
float32 vector result of the swizzle/broadcast/conversion process on memory or vector
float32 zmm3, negates, and subtracts float32 vector zmm2. The final result is written into
float32 vector zmm1.
Intermediate values are calculated to infinite precision, and are not truncated or rounded.
All operations must be performed before the final rounding.
x*y   z     RN/RU/RZ             RD
+0    +0    (-0) + (-0) = -0     (-0) + (-0) = -0
+0    -0    (-0) + (+0) = +0     (-0) + (+0) = -0
-0    +0    (+0) + (-0) = +0     (+0) + (-0) = -0
-0    -0    (+0) + (+0) = +0     (+0) + (+0) = +0

Table 6.12: VFNMSUB outcome when adding zeros depending on rounding-mode
This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.

Operation
if(source is a register operand and MVEX.EH bit is 1) {
    if(SSS[2]==1) Supress_Exception_Flags() // SAE
    // SSS are bits 6-4 from the MVEX prefix encoding. For more details, see Table 2.14
    RoundingMode = SSS[1:0]
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    RoundingMode = MXCSR.RC
    tmpSrc3[511:0] = SwizzUpConvLoadf32(zmm3/mt)
}
for (n = 0; n < 16; n++) {
    if(k1[n] != 0) {
        i = 32*n
        // float32 operation
        zmm1[i+31:i] = (-(zmm1[i+31:i] * tmpSrc3[i+31:i]) - zmm2[i+31:i])
    }
}

SIMD Floating-Point Exceptions
Overflow, Underflow, Invalid, Precision, Denormal.

Denormal Handling
Treat Input Denormals As Zeros: (MXCSR.DAZ)? YES : NO
Flush Tiny Results To Zero: (MXCSR.FZ)? YES : NO

Memory Up-conversion: Sf32
S2 S1 S0   Function:                    Usage                      disp8*N
000        no conversion                [rax] {16to16} or [rax]    64
001        broadcast 1 element (x16)    [rax] {1to16}              4
010        broadcast 4 elements (x4)    [rax] {4to16}              16
011        float16 to float32           [rax] {float16}            32
100        uint8 to float32             [rax] {uint8}              16
110        uint16 to float32            [rax] {uint16}             32
111        sint16 to float32            [rax] {sint16}             32

Register Swizzle: Sf32
MVEX.EH=0
S2 S1 S0   Function: 4 x 32 bits       Usage
000        no swizzle                  zmm0 or zmm0 {dcba}
001        swap (inner) pairs          zmm0 {cdab}
010        swap with two-away          zmm0 {badc}
011        cross-product swizzle       zmm0 {dacb}
100        broadcast a element         zmm0 {aaaa}
101        broadcast b element         zmm0 {bbbb}
110        broadcast c element         zmm0 {cccc}
111        broadcast d element         zmm0 {dddd}

MVEX.EH=1
S2 S1 S0   Rounding Mode Override               Usage
000        Round To Nearest (even)              , {rn}
001        Round Down (-INF)                    , {rd}
010        Round Up (+INF)                      , {ru}
011        Round Toward Zero                    , {rz}
100        Round To Nearest (even) with SAE     , {rn-sae}
101        Round Down (-INF) with SAE           , {rd-sae}
110        Round Up (+INF) with SAE             , {ru-sae}
111        Round Toward Zero with SAE           , {rz-sae}

Intel® C/C++ Compiler Intrinsic Equivalent
__m512 _mm512_fnmsub_ps (__m512, __m512, __m512);
__m512 _mm512_mask_fnmsub_ps (__m512, __mmask16, __m512, __m512);
__m512 _mm512_mask3_fnmsub_ps (__m512, __m512, __m512, __mmask16);

Exceptions
Real-Address Mode and Virtual-8086
#UD                Instruction not available in these modes

Protected and Compatibility Mode
#UD                Instruction not available in these modes

64 bit Mode
#SS(0)             If a memory address referencing the SS segment is in a non-canonical form.
#GP(0)             If the memory address is in a non-canonical form.
                   If a memory operand linear address is not aligned to the data size
                   granularity dictated by SwizzUpConv mode.
#PF(fault-code)    For a page fault.
#NM                If CR0.TS[bit 3]=1.
                   If preceded by any REX, F0, F2, F3, or 66 prefixes.

VFNMSUB213PD - Multiply First Source By Destination, Negate, and Subtract Second Source Float64 Vectors

Opcode
MVEX.NDS.512.66.0F38.W1 AE /r

Instruction
vfnmsub213pd zmm1 {k1}, zmm2, Sf64(zmm3/mt)

Description
Multiply float64 vector zmm2 and float64 vector zmm1, negate, and subtract float64
vector Sf64(zmm3/mt) from the result, and store the final result in zmm1, under write-mask.

Description
Performs an element-by-element multiplication between float64 vector zmm2 and float64
vector zmm1, negates, and subtracts the float64 vector result of the
swizzle/broadcast/conversion process on memory or vector float64 zmm3. The final
result is written into float64 vector zmm1.
Intermediate values are calculated to infinite precision, and are not truncated or rounded.
All operations must be performed before the final rounding.
x*y   z     RN/RU/RZ             RD
+0    +0    (-0) + (-0) = -0     (-0) + (-0) = -0
+0    -0    (-0) + (+0) = +0     (-0) + (+0) = -0
-0    +0    (+0) + (-0) = +0     (+0) + (-0) = -0
-0    -0    (+0) + (+0) = +0     (+0) + (+0) = +0

Table 6.13: VFNMSUB outcome when adding zeros depending on rounding-mode
This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.

Operation
if(source is a register operand and MVEX.EH bit is 1) {
    if(SSS[2]==1) Supress_Exception_Flags() // SAE
    // SSS are bits 6-4 from the MVEX prefix encoding. For more details, see Table 2.14
    RoundingMode = SSS[1:0]
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    RoundingMode = MXCSR.RC
    tmpSrc3[511:0] = SwizzUpConvLoadf64(zmm3/mt)
}
for (n = 0; n < 8; n++) {
    if(k1[n] != 0) {
        i = 64*n
        // float64 operation
        zmm1[i+63:i] = (-(zmm2[i+63:i] * zmm1[i+63:i]) - tmpSrc3[i+63:i])
    }
}

SIMD Floating-Point Exceptions
Overflow, Underflow, Invalid, Precision, Denormal.

Denormal Handling
Treat Input Denormals As Zeros: (MXCSR.DAZ)? YES : NO
Flush Tiny Results To Zero: (MXCSR.FZ)? YES : NO

Memory Up-conversion: Sf64
S2 S1 S0   Function:                    Usage                    disp8*N
000        no conversion                [rax] {8to8} or [rax]    64
001        broadcast 1 element (x8)     [rax] {1to8}             8
010        broadcast 4 elements (x2)    [rax] {4to8}             32
011        reserved                     N/A                      N/A
100        reserved                     N/A                      N/A
101        reserved                     N/A                      N/A
110        reserved                     N/A                      N/A
111        reserved                     N/A                      N/A

Register Swizzle: Sf64
MVEX.EH=0
S2 S1 S0   Function: 4 x 64 bits       Usage
000        no swizzle                  zmm0 or zmm0 {dcba}
001        swap (inner) pairs          zmm0 {cdab}
010        swap with two-away          zmm0 {badc}
011        cross-product swizzle       zmm0 {dacb}
100        broadcast a element         zmm0 {aaaa}
101        broadcast b element         zmm0 {bbbb}
110        broadcast c element         zmm0 {cccc}
111        broadcast d element         zmm0 {dddd}

MVEX.EH=1
S2 S1 S0   Rounding Mode Override               Usage
000        Round To Nearest (even)              , {rn}
001        Round Down (-INF)                    , {rd}
010        Round Up (+INF)                      , {ru}
011        Round Toward Zero                    , {rz}
100        Round To Nearest (even) with SAE     , {rn-sae}
101        Round Down (-INF) with SAE           , {rd-sae}
110        Round Up (+INF) with SAE             , {ru-sae}
111        Round Toward Zero with SAE           , {rz-sae}

Intel® C/C++ Compiler Intrinsic Equivalent
__m512d _mm512_fnmsub_pd (__m512d, __m512d, __m512d);
__m512d _mm512_mask_fnmsub_pd (__m512d, __mmask8, __m512d, __m512d);
__m512d _mm512_mask3_fnmsub_pd (__m512d, __m512d, __m512d, __mmask8);

Exceptions
Real-Address Mode and Virtual-8086
#UD                Instruction not available in these modes

Protected and Compatibility Mode
#UD                Instruction not available in these modes

64 bit Mode
#SS(0)             If a memory address referencing the SS segment is in a non-canonical form.
#GP(0)             If the memory address is in a non-canonical form.
                   If a memory operand linear address is not aligned to the data size
                   granularity dictated by SwizzUpConv mode.
#PF(fault-code)    For a page fault.
#NM                If CR0.TS[bit 3]=1.
                   If preceded by any REX, F0, F2, F3, or 66 prefixes.

VFNMSUB213PS - Multiply First Source By Destination, Negate, and Subtract Second Source Float32 Vectors

Opcode
MVEX.NDS.512.66.0F38.W0 AE /r

Instruction
vfnmsub213ps zmm1 {k1}, zmm2, Sf32(zmm3/mt)

Description
Multiply float32 vector zmm2 and float32 vector zmm1, negate, and subtract float32
vector Sf32(zmm3/mt) from the result, and store the final result in zmm1, under write-mask.

Description
Performs an element-by-element multiplication between float32 vector zmm2 and float32
vector zmm1, negates, and subtracts the float32 vector result of the
swizzle/broadcast/conversion process on memory or vector float32 zmm3. The final
result is written into float32 vector zmm1.
Intermediate values are calculated to infinite precision, and are not truncated or rounded.
All operations must be performed before the final rounding.
x*y   z     RN/RU/RZ             RD
+0    +0    (-0) + (-0) = -0     (-0) + (-0) = -0
+0    -0    (-0) + (+0) = +0     (-0) + (+0) = -0
-0    +0    (+0) + (-0) = +0     (+0) + (-0) = -0
-0    -0    (+0) + (+0) = +0     (+0) + (+0) = +0

Table 6.14: VFNMSUB outcome when adding zeros depending on rounding-mode
This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.

Operation
if(source is a register operand and MVEX.EH bit is 1) {
    if(SSS[2]==1) Supress_Exception_Flags() // SAE
    // SSS are bits 6-4 from the MVEX prefix encoding. For more details, see Table 2.14
    RoundingMode = SSS[1:0]
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    RoundingMode = MXCSR.RC
    tmpSrc3[511:0] = SwizzUpConvLoadf32(zmm3/mt)
}
for (n = 0; n < 16; n++) {
    if(k1[n] != 0) {
        i = 32*n
        // float32 operation
        zmm1[i+31:i] = (-(zmm2[i+31:i] * zmm1[i+31:i]) - tmpSrc3[i+31:i])
    }
}

SIMD Floating-Point Exceptions
Overflow, Underflow, Invalid, Precision, Denormal.

Denormal Handling
Treat Input Denormals As Zeros: (MXCSR.DAZ)? YES : NO
Flush Tiny Results To Zero: (MXCSR.FZ)? YES : NO

Memory Up-conversion: Sf32
S2 S1 S0   Function:                    Usage                      disp8*N
000        no conversion                [rax] {16to16} or [rax]    64
001        broadcast 1 element (x16)    [rax] {1to16}              4
010        broadcast 4 elements (x4)    [rax] {4to16}              16
011        float16 to float32           [rax] {float16}            32
100        uint8 to float32             [rax] {uint8}              16
110        uint16 to float32            [rax] {uint16}             32
111        sint16 to float32            [rax] {sint16}             32

Register Swizzle: Sf32
MVEX.EH=0
S2 S1 S0   Function: 4 x 32 bits       Usage
000        no swizzle                  zmm0 or zmm0 {dcba}
001        swap (inner) pairs          zmm0 {cdab}
010        swap with two-away          zmm0 {badc}
011        cross-product swizzle       zmm0 {dacb}
100        broadcast a element         zmm0 {aaaa}
101        broadcast b element         zmm0 {bbbb}
110        broadcast c element         zmm0 {cccc}
111        broadcast d element         zmm0 {dddd}

MVEX.EH=1
S2 S1 S0   Rounding Mode Override               Usage
000        Round To Nearest (even)              , {rn}
001        Round Down (-INF)                    , {rd}
010        Round Up (+INF)                      , {ru}
011        Round Toward Zero                    , {rz}
100        Round To Nearest (even) with SAE     , {rn-sae}
101        Round Down (-INF) with SAE           , {rd-sae}
110        Round Up (+INF) with SAE             , {ru-sae}
111        Round Toward Zero with SAE           , {rz-sae}

Intel® C/C++ Compiler Intrinsic Equivalent
__m512 _mm512_fnmsub_ps (__m512, __m512, __m512);
__m512 _mm512_mask_fnmsub_ps (__m512, __mmask16, __m512, __m512);
__m512 _mm512_mask3_fnmsub_ps (__m512, __m512, __m512, __mmask16);

Exceptions
Real-Address Mode and Virtual-8086
#UD                Instruction not available in these modes

Protected and Compatibility Mode
#UD                Instruction not available in these modes

64 bit Mode
#SS(0)             If a memory address referencing the SS segment is in a non-canonical form.
#GP(0)             If the memory address is in a non-canonical form.
                   If a memory operand linear address is not aligned to the data size
                   granularity dictated by SwizzUpConv mode.
#PF(fault-code)    For a page fault.
#NM                If CR0.TS[bit 3]=1.
                   If preceded by any REX, F0, F2, F3, or 66 prefixes.

VFNMSUB231PD - Multiply First Source By Second Source, Negate, and
Subtract Destination Float64 Vectors

Opcode
MVEX.NDS.512.66.0F38.W1 BE /r

Instruction
vfnmsub231pd zmm1 {k1}, zmm2, Sf64(zmm3/mt)

Description
Multiply float64 vector zmm2 and float64 vector Sf64(zmm3/mt), negate, and subtract
float64 vector zmm1 from the result, and store the final result in zmm1, under write-mask.

Description
Performs an element-by-element multiplication between float64 vector zmm2 and the
float64 vector result of the swizzle/broadcast/conversion process on memory or vector
float64 zmm3, negates, and subtracts float64 vector zmm1. The final result is written into
float64 vector zmm1.
Intermediate values are calculated to infinite precision, and are not truncated or rounded.
All operations must be performed before the final rounding.
x*y   z     RN/RU/RZ             RD
+0    +0    (-0) + (-0) = -0     (-0) + (-0) = -0
+0    -0    (-0) + (+0) = +0     (-0) + (+0) = -0
-0    +0    (+0) + (-0) = +0     (+0) + (-0) = -0
-0    -0    (+0) + (+0) = +0     (+0) + (+0) = +0

Table 6.15: VFNMSUB outcome when adding zeros depending on rounding-mode
This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.

Operation
if(source is a register operand and MVEX.EH bit is 1) {
    if(SSS[2]==1) Supress_Exception_Flags() // SAE
    // SSS are bits 6-4 from the MVEX prefix encoding. For more details, see Table 2.14
    RoundingMode = SSS[1:0]
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    RoundingMode = MXCSR.RC
    tmpSrc3[511:0] = SwizzUpConvLoadf64(zmm3/mt)
}
for (n = 0; n < 8; n++) {
    if(k1[n] != 0) {
        i = 64*n
        // float64 operation
        zmm1[i+63:i] = (-(zmm2[i+63:i] * tmpSrc3[i+63:i]) - zmm1[i+63:i])
    }
}

SIMD Floating-Point Exceptions
Overflow, Underflow, Invalid, Precision, Denormal.

Denormal Handling
Treat Input Denormals As Zeros: (MXCSR.DAZ)? YES : NO
Flush Tiny Results To Zero: (MXCSR.FZ)? YES : NO

Memory Up-conversion: Sf64
S2 S1 S0   Function:                    Usage                    disp8*N
000        no conversion                [rax] {8to8} or [rax]    64
001        broadcast 1 element (x8)     [rax] {1to8}             8
010        broadcast 4 elements (x2)    [rax] {4to8}             32
011        reserved                     N/A                      N/A
100        reserved                     N/A                      N/A
101        reserved                     N/A                      N/A
110        reserved                     N/A                      N/A
111        reserved                     N/A                      N/A

Register Swizzle: Sf64
MVEX.EH=0
S2 S1 S0   Function: 4 x 64 bits       Usage
000        no swizzle                  zmm0 or zmm0 {dcba}
001        swap (inner) pairs          zmm0 {cdab}
010        swap with two-away          zmm0 {badc}
011        cross-product swizzle       zmm0 {dacb}
100        broadcast a element         zmm0 {aaaa}
101        broadcast b element         zmm0 {bbbb}
110        broadcast c element         zmm0 {cccc}
111        broadcast d element         zmm0 {dddd}

MVEX.EH=1
S2 S1 S0   Rounding Mode Override               Usage
000        Round To Nearest (even)              , {rn}
001        Round Down (-INF)                    , {rd}
010        Round Up (+INF)                      , {ru}
011        Round Toward Zero                    , {rz}
100        Round To Nearest (even) with SAE     , {rn-sae}
101        Round Down (-INF) with SAE           , {rd-sae}
110        Round Up (+INF) with SAE             , {ru-sae}
111        Round Toward Zero with SAE           , {rz-sae}

Intel® C/C++ Compiler Intrinsic Equivalent
__m512d _mm512_fnmsub_pd (__m512d, __m512d, __m512d);
__m512d _mm512_mask_fnmsub_pd (__m512d, __mmask8, __m512d, __m512d);
__m512d _mm512_mask3_fnmsub_pd (__m512d, __m512d, __m512d, __mmask8);

Exceptions
Real-Address Mode and Virtual-8086
#UD                Instruction not available in these modes

Protected and Compatibility Mode
#UD                Instruction not available in these modes

64 bit Mode
#SS(0)             If a memory address referencing the SS segment is in a non-canonical form.
#GP(0)             If the memory address is in a non-canonical form.
                   If a memory operand linear address is not aligned to the data size
                   granularity dictated by SwizzUpConv mode.
#PF(fault-code)    For a page fault.
#NM                If CR0.TS[bit 3]=1.
                   If preceded by any REX, F0, F2, F3, or 66 prefixes.

VFNMSUB231PS - Multiply First Source By Second Source, Negate, and
Subtract Destination Float32 Vectors

Opcode
MVEX.NDS.512.66.0F38.W0 BE /r

Instruction
vfnmsub231ps zmm1 {k1}, zmm2, Sf32(zmm3/mt)

Description
Multiply float32 vector zmm2 and float32 vector Sf32(zmm3/mt), negate, and subtract
float32 vector zmm1 from the result, and store the final result in zmm1, under write-mask.

Description
Performs an element-by-element multiplication between float32 vector zmm2 and the
float32 vector result of the swizzle/broadcast/conversion process on memory or vector
float32 zmm3, negates, and subtracts float32 vector zmm1. The final result is written into
float32 vector zmm1.
Intermediate values are calculated to infinite precision, and are not truncated or rounded.
All operations must be performed before the final rounding.
x*y   z     RN/RU/RZ             RD
+0    +0    (-0) + (-0) = -0     (-0) + (-0) = -0
+0    -0    (-0) + (+0) = +0     (-0) + (+0) = -0
-0    +0    (+0) + (-0) = +0     (+0) + (-0) = -0
-0    -0    (+0) + (+0) = +0     (+0) + (+0) = +0

Table 6.16: VFNMSUB outcome when adding zeros depending on rounding-mode
This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.

Operation
if(source is a register operand and MVEX.EH bit is 1) {
    if(SSS[2]==1) Supress_Exception_Flags() // SAE
    // SSS are bits 6-4 from the MVEX prefix encoding. For more details, see Table 2.14
    RoundingMode = SSS[1:0]
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    RoundingMode = MXCSR.RC
    tmpSrc3[511:0] = SwizzUpConvLoadf32(zmm3/mt)
}
for (n = 0; n < 16; n++) {
    if(k1[n] != 0) {
        i = 32*n
        // float32 operation
        zmm1[i+31:i] = (-(zmm2[i+31:i] * tmpSrc3[i+31:i]) - zmm1[i+31:i])
    }
}

SIMD Floating-Point Exceptions
Overflow, Underflow, Invalid, Precision, Denormal.

Denormal Handling
Treat Input Denormals As Zeros: (MXCSR.DAZ)? YES : NO
Flush Tiny Results To Zero: (MXCSR.FZ)? YES : NO

Memory Up-conversion: Sf32
S2 S1 S0   Function:                    Usage                      disp8*N
000        no conversion                [rax] {16to16} or [rax]    64
001        broadcast 1 element (x16)    [rax] {1to16}              4
010        broadcast 4 elements (x4)    [rax] {4to16}              16
011        float16 to float32           [rax] {float16}            32
100        uint8 to float32             [rax] {uint8}              16
110        uint16 to float32            [rax] {uint16}             32
111        sint16 to float32            [rax] {sint16}             32

Register Swizzle: Sf32

MVEX.EH=0
S2 S1 S0   Function: 4 x 32 bits      Usage
000        no swizzle                 zmm0 or zmm0 {dcba}
001        swap (inner) pairs         zmm0 {cdab}
010        swap with two-away         zmm0 {badc}
011        cross-product swizzle      zmm0 {dacb}
100        broadcast a element        zmm0 {aaaa}
101        broadcast b element        zmm0 {bbbb}
110        broadcast c element        zmm0 {cccc}
111        broadcast d element        zmm0 {dddd}

MVEX.EH=1
S2 S1 S0   Rounding Mode Override               Usage
000        Round To Nearest (even)              , {rn}
001        Round Down (-INF)                    , {rd}
010        Round Up (+INF)                      , {ru}
011        Round Toward Zero                    , {rz}
100        Round To Nearest (even) with SAE     , {rn-sae}
101        Round Down (-INF) with SAE           , {rd-sae}
110        Round Up (+INF) with SAE             , {ru-sae}
111        Round Toward Zero with SAE           , {rz-sae}
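
The swizzle names in the Usage column ({dcba}, {cdab}, and so on) can be read as a permutation applied independently to each group of four 32-bit elements. The sketch below is a reading aid, not the hardware definition. It assumes the four letters are written most significant element first, so the last letter names the source of element 0 of each group and 'a' is the least significant element. The function name is hypothetical.

```c
#include <assert.h>
#include <string.h>

/* Applies a 4-letter swizzle pattern such as "cdab" to every group of four
   float32 elements. Assumption: pattern[3] names the source for element 0
   of the group, pattern[0] the source for element 3. dst must not alias src. */
static void swizzle_f32(float dst[16], const float src[16], const char *pattern)
{
    assert(strlen(pattern) == 4);
    for (int group = 0; group < 4; group++) {
        for (int pos = 0; pos < 4; pos++) {
            int from = pattern[3 - pos] - 'a';   /* 'a'..'d' -> 0..3 */
            assert(from >= 0 && from < 4);
            dst[4 * group + pos] = src[4 * group + from];
        }
    }
}
```

Under this reading, "dcba" is the identity, "cdab" swaps neighbouring elements, and "aaaa" replicates element 0 of each group across that group.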

Intel® C/C++ Compiler Intrinsic Equivalent
__m512 _mm512_fnmsub_ps (__m512, __m512, __m512);
__m512 _mm512_mask_fnmsub_ps (__m512, __mmask16, __m512, __m512);
__m512 _mm512_mask3_fnmsub_ps (__m512, __m512, __m512, __mmask16);

Exceptions
Real-Address Mode and Virtual-8086
#UD               Instruction not available in these modes

Protected and Compatibility Mode
#UD               Instruction not available in these modes

64 bit Mode
#SS(0)            If a memory address referencing the SS segment is
                  in a non-canonical form.
#GP(0)            If the memory address is in a non-canonical form.
                  If a memory operand linear address is not aligned
                  to the data size granularity dictated by SwizzUpConv
                  mode.
#PF(fault-code)   For a page fault.
#NM               If CR0.TS[bit 3]=1.
                  If preceded by any REX, F0, F2, F3, or 66 prefixes.

VGATHERDPD - Gather Float64 Vector With Signed Dword Indices

Opcode:       MVEX.512.66.0F38.W1 92 /r /vsib
Instruction:  vgatherdpd zmm1 {k1}, Uf64(mvt)
Description:  Gather float64 vector Uf64(mvt) into float64 vector zmm1 using
              doubleword indices and k1 as completion mask.

Description
A set of 8 memory locations pointed to by base address BASE_ADDR and doubleword
index vector VINDEX with scale SCALE are converted to a float64 vector. The result
is written into float64 vector zmm1.
Note the special mask behavior: only a subset of the active elements of write mask k1
is actually operated on (as denoted by function SELECT_SUBSET). There are only
two guarantees about the function: (a) the destination mask is a subset of the source mask
(identity is included), and (b) on a given invocation of the instruction, at least one element
(the least significant enabled mask bit) will be selected from the source mask.
Programmers should always re-execute a gather/scatter instruction (via a loop) until the
full sequence has completed (i.e. all elements of the gather/scatter sequence have been
loaded/stored and hence the write-mask bits are all zero).
Note that each element access will always access 64 bytes of memory. The memory region
accessed by each element will always lie between element_linear_address & (~0x3F) and
(element_linear_address & (~0x3F)) + 63.
This instruction has special disp8*N and alignment rules. N is considered to be the size
of a single vector element before up-conversion.
Note also that the corresponding bits in write mask k1 are reset as each destination
element is updated, according to the subset of write mask k1. This allows the instruction
to be conditionally re-triggered until all the elements from a given write mask have been
successfully loaded.
The instruction will #GP fault if the destination vector zmm1 is the same as index vector
VINDEX.

Operation
// instruction works over a subset of the write mask
ktemp = SELECT_SUBSET(k1)
// Use mvt as vector memory operand (VSIB)
for (n = 0; n < 8; n++) {
    if (ktemp[n] != 0) {
        i = 64*n
        j = 32*n
        // mvt[n] = BASE_ADDR + SignExtend(VINDEX[j+31:j] * SCALE)
        pointer[63:0] = mvt[n]
        zmm1[i+63:i] = UpConvLoadf64(pointer)
        k1[n] = 0
    }
}
k1[15:8] = 0
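
The re-execution requirement described above can be sketched as a small software simulation. This is a model for reasoning about the completion mask, not device code. select_subset_model() is a hypothetical stand-in that picks only the least significant enabled bit, which is the minimum the text guarantees; real hardware may complete more elements per invocation. The scale is assumed to be the size of a double.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for SELECT_SUBSET: returns only the least
   significant enabled mask bit (the guaranteed minimum). */
static uint8_t select_subset_model(uint8_t k1)
{
    return (uint8_t)(k1 & (0u - k1));
}

/* One simulated gather invocation: loads the selected elements and clears
   the corresponding bits of the completion mask *k1. */
static void gather_step_model(double zmm1[8], uint8_t *k1, const double *base,
                              const int32_t vindex[8])
{
    uint8_t ktemp = select_subset_model(*k1);
    for (int n = 0; n < 8; n++) {
        if ((ktemp >> n) & 1u) {
            zmm1[n] = base[vindex[n]];          /* scale = sizeof(double) */
            *k1 = (uint8_t)(*k1 & ~(1u << n));  /* mark element as done */
        }
    }
}

/* Repeats the step until the mask is all zero; returns the number of
   invocations that were needed. */
static int gather_all_model(double zmm1[8], uint8_t k1, const double *base,
                            const int32_t vindex[8])
{
    assert(zmm1 && base && vindex);
    int invocations = 0;
    while (k1 != 0) {
        gather_step_model(zmm1, &k1, base, vindex);
        invocations++;
    }
    return invocations;
}
```

Because the modeled subset is a single element, a full mask of eight elements takes eight invocations here. The loop shape is the same whatever subset the hardware actually selects.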

SIMD Floating-Point Exceptions
None.

Memory Up-conversion: Uf64
S2 S1 S0   Function:        Usage    disp8*N
000        no conversion    [rax]    8
001        reserved         N/A      N/A
010        reserved         N/A      N/A
011        reserved         N/A      N/A
100        reserved         N/A      N/A
101        reserved         N/A      N/A
110        reserved         N/A      N/A
111        reserved         N/A      N/A

Intel® C/C++ Compiler Intrinsic Equivalent
__m512d _mm512_i32logather_pd (__m512i, void const*, int);
__m512d _mm512_mask_i32logather_pd (__m512d, __mmask8, __m512i, void const*, int);

Exceptions
Real-Address Mode and Virtual-8086
#UD               Instruction not available in these modes

Protected and Compatibility Mode
#UD               Instruction not available in these modes

64 bit Mode
#SS(0)            If a memory address referencing the SS segment is
                  in a non-canonical form, and corresponding write-mask bit is not zero.
#GP(0)            If a memory address is in a non-canonical form,
                  and corresponding write-mask bit is not zero.
                  If a memory operand linear address is not aligned
                  to element-wise data granularity dictated by the UpConv
                  and corresponding write-mask bit is not zero.
                  If the destination vector is the same as the index vector [see
                  .
#PF(fault-code)   If a memory operand linear address produces a page fault
                  and corresponding write-mask bit is not zero.
#NM               If CR0.TS[bit 3]=1.
                  If preceded by any REX, F0, F2, F3, or 66 prefixes.
                  If using a 16 bit effective address.
                  If ModRM.rm is different than 100b.
                  If no write mask is provided or selected write-mask is k0.

VGATHERDPS - Gather Float32 Vector With Signed Dword Indices

Opcode:       MVEX.512.66.0F38.W0 92 /r /vsib
Instruction:  vgatherdps zmm1 {k1}, Uf32(mvt)
Description:  Gather float32 vector Uf32(mvt) into float32 vector zmm1 using
              doubleword indices and k1 as completion mask.

Description
A set of 16 memory locations pointed to by base address BASE_ADDR and doubleword
index vector VINDEX with scale SCALE are converted to a float32 vector. The result
is written into float32 vector zmm1.
Note the special mask behavior: only a subset of the active elements of write mask k1
is actually operated on (as denoted by function SELECT_SUBSET). There are only
two guarantees about the function: (a) the destination mask is a subset of the source mask
(identity is included), and (b) on a given invocation of the instruction, at least one element
(the least significant enabled mask bit) will be selected from the source mask.
Programmers should always re-execute a gather/scatter instruction (via a loop) until the
full sequence has completed (i.e. all elements of the gather/scatter sequence have been
loaded/stored and hence the write-mask bits are all zero).
Note that each element access will always access 64 bytes of memory. The memory region
accessed by each element will always lie between element_linear_address & (~0x3F) and
(element_linear_address & (~0x3F)) + 63.
This instruction has special disp8*N and alignment rules. N is considered to be the size
of a single vector element before up-conversion.
Note also that the corresponding bits in write mask k1 are reset as each destination
element is updated, according to the subset of write mask k1. This allows the instruction
to be conditionally re-triggered until all the elements from a given write mask have been
successfully loaded.
The instruction will #GP fault if the destination vector zmm1 is the same as index vector
VINDEX.

Operation
// instruction works over a subset of the write mask
ktemp = SELECT_SUBSET(k1)
// Use mvt as vector memory operand (VSIB)
for (n = 0; n < 16; n++) {
    if (ktemp[n] != 0) {
        i = 32*n
        // mvt[n] = BASE_ADDR + SignExtend(VINDEX[i+31:i] * SCALE)
        pointer[63:0] = mvt[n]
        zmm1[i+31:i] = UpConvLoadf32(pointer)
        k1[n] = 0
    }
}
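
The address comment in the loop and the 64-byte access rule from the description can be worked through with plain integer arithmetic. The sketch below is illustrative and its function names are hypothetical. It assumes the 32-bit index is sign-extended to 64 bits before it is scaled, which is one reading of the SignExtend() comment, and that SCALE is one of the usual SIB scale factors 1, 2, 4 or 8.

```c
#include <assert.h>
#include <stdint.h>

/* Element address: BASE_ADDR + SignExtend(VINDEX[n] * SCALE).
   Assumption: the index is widened to 64 bits before the multiply. */
static uint64_t element_address(uint64_t base_addr, int32_t index, uint32_t scale)
{
    assert(scale == 1 || scale == 2 || scale == 4 || scale == 8);
    int64_t offset = (int64_t)index * (int64_t)scale;
    return base_addr + (uint64_t)offset;   /* wraps like pointer arithmetic */
}

/* First byte of the 64-byte region touched by an element access. */
static uint64_t region_first(uint64_t addr)
{
    return addr & ~UINT64_C(0x3F);
}

/* Last byte of the 64-byte region touched by an element access. */
static uint64_t region_last(uint64_t addr)
{
    return (addr & ~UINT64_C(0x3F)) + 63;
}
```

For example, with a base of 0x1000 and scale 4, index 3 gives address 0x100C, whose region is 0x1000 through 0x103F. Index -1 gives 0xFFC, whose region starts at 0xFC0.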

SIMD Floating-Point Exceptions
Invalid.

Memory Up-conversion: Uf32
S2 S1 S0   Function:             Usage              disp8*N
000        no conversion         [rax]              4
001        reserved              N/A                N/A
010        reserved              N/A                N/A
011        float16 to float32    [rax] {float16}    2
100        uint8 to float32      [rax] {uint8}      1
101        sint8 to float32      [rax] {sint8}      1
110        uint16 to float32     [rax] {uint16}     2
111        sint16 to float32     [rax] {sint16}     2

Intel® C/C++ Compiler Intrinsic Equivalent
__m512 _mm512_i32gather_ps (__m512i, void const*, int);
__m512 _mm512_mask_i32gather_ps (__m512, __mmask16, __m512i, void const*, int);
__m512 _mm512_i32extgather_ps (__m512i, void const*, _MM_UPCONV_PS_ENUM, int, int);
__m512 _mm512_mask_i32extgather_ps (__m512, __mmask16, __m512i, void const*,
                                    _MM_UPCONV_PS_ENUM, int, int);

Exceptions
Real-Address Mode and Virtual-8086
#UD               Instruction not available in these modes

Protected and Compatibility Mode
#UD               Instruction not available in these modes

64 bit Mode
#SS(0)            If a memory address referencing the SS segment is
                  in a non-canonical form, and corresponding write-mask bit is not zero.
#GP(0)            If a memory address is in a non-canonical form,
                  and corresponding write-mask bit is not zero.
                  If a memory operand linear address is not aligned
                  to element-wise data granularity dictated by the UpConv
                  and corresponding write-mask bit is not zero.
                  If the destination vector is the same as the index vector [see
                  .
#PF(fault-code)   If a memory operand linear address produces a page fault
                  and corresponding write-mask bit is not zero.
#NM               If CR0.TS[bit 3]=1.
                  If preceded by any REX, F0, F2, F3, or 66 prefixes.
                  If using a 16 bit effective address.
                  If ModRM.rm is different than 100b.
                  If no write mask is provided or selected write-mask is k0.

VGATHERPF0DPS - Gather Prefetch Float32 Vector With Signed Dword
Indices Into L1

Opcode:       MVEX.512.66.0F38.W0 C6 /1 /vsib
Instruction:  vgatherpf0dps Uf32(mvt) {k1}
Description:  Gather Prefetch float32 vector Uf32(mvt), using doubleword
              indices with T0 hint, under write-mask.

Description
A set of 16 float32 memory locations pointed to by base address BASE_ADDR and
doubleword index vector VINDEX with scale SCALE are prefetched from memory to the L1
level of cache. If any memory access causes any type of memory exception, the memory access
will be considered as completed (destination mask updated) and the exception ignored.
Note the special mask behavior: only a subset of the active elements of write mask k1
is actually operated on (as denoted by function SELECT_SUBSET). There are only
two guarantees about the function: (a) the destination mask is a subset of the source mask
(identity is included), and (b) on a given invocation of the instruction, at least one element
(the least significant enabled mask bit) will be selected from the source mask.
Programmers should always re-execute a gather/scatter instruction (via a loop) until the
full sequence has completed (i.e. all elements of the prefetch sequence have been
prefetched and hence the write-mask bits are all zero).
Note that each element access will always access 64 bytes of memory. The memory region
accessed by each element will always lie between element_linear_address & (~0x3F) and
(element_linear_address & (~0x3F)) + 63.
This instruction has special disp8*N and alignment rules. N is considered to be the size
of a single vector element before up-conversion.
Note also that the corresponding bits in write mask k1 are reset as each element is
processed, according to the subset of write mask k1. This allows the instruction to be
conditionally re-triggered until all the elements from a given write mask have been
successfully prefetched.
Note that both gather and scatter prefetches set the access bit (A) in the related TLB page
entry. Scatter prefetches (which prefetch data with RFO) do not set the dirty bit (D).

Operation
// instruction works over a subset of the write mask
ktemp = SELECT_SUBSET(k1)
exclusive = 0
evicthintpre = MVEX.EH
// Use mvt as vector memory operand (VSIB)
for (n = 0; n < 16; n++) {
    if (ktemp[n] != 0) {
        i = 32*n
        // mvt[n] = BASE_ADDR + SignExtend(VINDEX[i+31:i] * SCALE)
        pointer[63:0] = mvt[n]
        FetchL1cacheLine(pointer, exclusive, evicthintpre)
        k1[n] = 0
    }
}

SIMD Floating-Point Exceptions
None.

Memory Up-conversion: Uf32
S2 S1 S0   Function:             Usage              disp8*N
000        no conversion         [rax]              4
001        reserved              N/A                N/A
010        reserved              N/A                N/A
011        float16 to float32    [rax] {float16}    2
100        uint8 to float32      [rax] {uint8}      1
101        sint8 to float32      [rax] {sint8}      1
110        uint16 to float32     [rax] {uint16}     2
111        sint16 to float32     [rax] {sint16}     2

Intel® C/C++ Compiler Intrinsic Equivalent
void _mm512_prefetch_i32gather_ps (__m512i, void const*, int, int);
void _mm512_mask_prefetch_i32gather_ps (__m512i, __mmask16, void const*, int, int);
void _mm512_prefetch_i32extgather_ps (__m512i, void const*, _MM_UPCONV_PS_ENUM, int, int);
void _mm512_mask_prefetch_i32extgather_ps (__m512i, __mmask16, void const*,
                                           _MM_UPCONV_PS_ENUM, int, int);

Exceptions
Real-Address Mode and Virtual-8086
#UD               Instruction not available in these modes

Protected and Compatibility Mode
#UD               Instruction not available in these modes

64 bit Mode
#NM               If CR0.TS[bit 3]=1.
                  If preceded by any REX, F0, F2, F3, or 66 prefixes.
                  If using a 16 bit effective address.
                  If ModRM.rm is different than 100b.
                  If no write mask is provided or selected write-mask is k0.

VGATHERPF0HINTDPD - Gather Prefetch Float64 Vector Hint With Signed
Dword Indices

Opcode:       MVEX.512.66.0F38.W1 C6 /0 /vsib
Instruction:  vgatherpf0hintdpd Uf64(mvt) {k1}
Description:  Gather Prefetch float64 vector Uf64(mvt), using doubleword
              indices with T0 hint, under write-mask.

Description
The instruction specifies a set of 8 float64 memory locations pointed to by base address
BASE_ADDR and doubleword index vector VINDEX with scale SCALE as a performance
hint that a real gather instruction with the same set of sources will be invoked. A
programmer may execute this instruction before a real gather instruction to improve its
performance.
This instruction is a hint and may be speculative; it may be dropped or specify invalid
addresses without causing problems or memory-related faults. This instruction does not
modify any kind of architectural state (including the write-mask).
This instruction has special disp8*N and alignment rules. N is considered to be the size
of a single vector element before up-conversion.

Operation
// Use mvt as vector memory operand (VSIB)
for (n = 0; n < 8; n++) {
    if (k1[n] != 0) {
        i = 64*n
        j = 32*n
        // mvt[n] = BASE_ADDR + SignExtend(VINDEX[j+31:j] * SCALE)
        pointer[63:0] = mvt[n]
        HintPointer(pointer)
    }
}

SIMD Floating-Point Exceptions
None.


Memory Up-conversion: Uf64
S2 S1 S0   Function:        Usage    disp8*N
000        no conversion    [rax]    8
001        reserved         N/A      N/A
010        reserved         N/A      N/A
011        reserved         N/A      N/A
100        reserved         N/A      N/A
101        reserved         N/A      N/A
110        reserved         N/A      N/A
111        reserved         N/A      N/A

Intel® C/C++ Compiler Intrinsic Equivalent
None

Exceptions
Real-Address Mode and Virtual-8086
#UD               Instruction not available in these modes

Protected and Compatibility Mode
#UD               Instruction not available in these modes

64 bit Mode
#NM               If CR0.TS[bit 3]=1.
#UD               If processor model does not implement the specific instruction.
                  If preceded by any REX, F0, F2, F3, or 66 prefixes.
                  If using a 16 bit effective address.
                  If ModRM.rm is different than 100b.
                  If no write mask is provided or selected write-mask is k0.

VGATHERPF0HINTDPS - Gather Prefetch Float32 Vector Hint With Signed
Dword Indices

Opcode:       MVEX.512.66.0F38.W0 C6 /0 /vsib
Instruction:  vgatherpf0hintdps Uf32(mvt) {k1}
Description:  Gather Prefetch float32 vector Uf32(mvt), using doubleword
              indices with T0 hint, under write-mask.

Description
The instruction specifies a set of 16 float32 memory locations pointed to by base address
BASE_ADDR and doubleword index vector VINDEX with scale SCALE as a performance
hint that a real gather instruction with the same set of sources will be invoked. A
programmer may execute this instruction before a real gather instruction to improve its
performance.
This instruction is a hint and may be speculative; it may be dropped or specify invalid
addresses without causing problems or memory-related faults. This instruction does not
modify any kind of architectural state (including the write-mask).
This instruction has special disp8*N and alignment rules. N is considered to be the size
of a single vector element before up-conversion.

Operation
// Use mvt as vector memory operand (VSIB)
for (n = 0; n < 16; n++) {
    if (k1[n] != 0) {
        i = 32*n
        // mvt[n] = BASE_ADDR + SignExtend(VINDEX[i+31:i] * SCALE)
        pointer[63:0] = mvt[n]
        HintPointer(pointer)
    }
}

SIMD Floating-Point Exceptions
None.


Memory Up-conversion: Uf32
S2 S1 S0   Function:             Usage              disp8*N
000        no conversion         [rax]              4
001        reserved              N/A                N/A
010        reserved              N/A                N/A
011        float16 to float32    [rax] {float16}    2
100        uint8 to float32      [rax] {uint8}      1
101        sint8 to float32      [rax] {sint8}      1
110        uint16 to float32     [rax] {uint16}     2
111        sint16 to float32     [rax] {sint16}     2

Intel® C/C++ Compiler Intrinsic Equivalent
None

Exceptions
Real-Address Mode and Virtual-8086
#UD               Instruction not available in these modes

Protected and Compatibility Mode
#UD               Instruction not available in these modes

64 bit Mode
#NM               If CR0.TS[bit 3]=1.
#UD               If processor model does not implement the specific instruction.
                  If preceded by any REX, F0, F2, F3, or 66 prefixes.
                  If using a 16 bit effective address.
                  If ModRM.rm is different than 100b.
                  If no write mask is provided or selected write-mask is k0.

VGATHERPF1DPS - Gather Prefetch Float32 Vector With Signed Dword
Indices Into L2

Opcode:       MVEX.512.66.0F38.W0 C6 /2 /vsib
Instruction:  vgatherpf1dps Uf32(mvt) {k1}
Description:  Gather Prefetch float32 vector Uf32(mvt), using doubleword
              indices with T1 hint, under write-mask.

Description
A set of 16 float32 memory locations pointed to by base address BASE_ADDR and
doubleword index vector VINDEX with scale SCALE are prefetched from memory to the L2
level of cache. If any memory access causes any type of memory exception, the memory access
will be considered as completed (destination mask updated) and the exception ignored.
Note the special mask behavior: only a subset of the active elements of write mask k1
is actually operated on (as denoted by function SELECT_SUBSET). There are only
two guarantees about the function: (a) the destination mask is a subset of the source mask
(identity is included), and (b) on a given invocation of the instruction, at least one element
(the least significant enabled mask bit) will be selected from the source mask.
Programmers should always re-execute a gather/scatter instruction (via a loop) until the
full sequence has completed (i.e. all elements of the prefetch sequence have been
prefetched and hence the write-mask bits are all zero).
Note that each element access will always access 64 bytes of memory. The memory region
accessed by each element will always lie between element_linear_address & (~0x3F) and
(element_linear_address & (~0x3F)) + 63.
This instruction has special disp8*N and alignment rules. N is considered to be the size
of a single vector element before up-conversion.
Note also that the corresponding bits in write mask k1 are reset as each element is
processed, according to the subset of write mask k1. This allows the instruction to be
conditionally re-triggered until all the elements from a given write mask have been
successfully prefetched.
Note that both gather and scatter prefetches set the access bit (A) in the related TLB page
entry. Scatter prefetches (which prefetch data with RFO) do not set the dirty bit (D).

Operation
// instruction works over a subset of the write mask
ktemp = SELECT_SUBSET(k1)
exclusive = 0
evicthintpre = MVEX.EH
// Use mvt as vector memory operand (VSIB)
for (n = 0; n < 16; n++) {
    if (ktemp[n] != 0) {
        i = 32*n
        // mvt[n] = BASE_ADDR + SignExtend(VINDEX[i+31:i] * SCALE)
        pointer[63:0] = mvt[n]
        FetchL2cacheLine(pointer, exclusive, evicthintpre)
        k1[n] = 0
    }
}

SIMD Floating-Point Exceptions
None.

Memory Up-conversion: Uf32
S2 S1 S0   Function:             Usage              disp8*N
000        no conversion         [rax]              4
001        reserved              N/A                N/A
010        reserved              N/A                N/A
011        float16 to float32    [rax] {float16}    2
100        uint8 to float32      [rax] {uint8}      1
101        sint8 to float32      [rax] {sint8}      1
110        uint16 to float32     [rax] {uint16}     2
111        sint16 to float32     [rax] {sint16}     2

Intel® C/C++ Compiler Intrinsic Equivalent
void _mm512_prefetch_i32gather_ps (__m512i, void const*, int, int);
void _mm512_mask_prefetch_i32gather_ps (__m512i, __mmask16, void const*, int, int);
void _mm512_prefetch_i32extgather_ps (__m512i, void const*, _MM_UPCONV_PS_ENUM, int, int);
void _mm512_mask_prefetch_i32extgather_ps (__m512i, __mmask16, void const*,
                                           _MM_UPCONV_PS_ENUM, int, int);

Exceptions
Real-Address Mode and Virtual-8086
#UD               Instruction not available in these modes

Protected and Compatibility Mode
#UD               Instruction not available in these modes

64 bit Mode
#NM               If CR0.TS[bit 3]=1.
                  If preceded by any REX, F0, F2, F3, or 66 prefixes.
                  If using a 16 bit effective address.
                  If ModRM.rm is different than 100b.
                  If no write mask is provided or selected write-mask is k0.

VGETEXPPD - Extract Float64 Vector of Exponents from Float64 Vector

Opcode:       MVEX.512.66.0F38.W1 42 /r
Instruction:  vgetexppd zmm1 {k1}, Sf64(zmm2/mt)
Description:  Extract float64 vector of exponents from vector Sf64(zmm2/mt)
              and store the result in zmm1, under write-mask.

Description
Performs an element-by-element exponent extraction from the Float64 vector result of
the swizzle/broadcast/conversion process on memory or Float64 vector zmm2. The result
is written into Float64 vector zmm1.
GetExp() returns the (un-biased) exponent n in floating-point format. That is, when X =
1/16, GetExp() returns the value −4, represented as C0800000 in IEEE single precision
(for the single-precision version of the instruction). If the source is denormal, VGETEXP
will normalize it prior to exponent extraction (unless DAZ=1).
GetExp() function follows Table 6.17 when dealing with floating-point special numbers.
Input   Result
NaN     quietized input NaN
+∞      +∞
+0      −∞
-0      −∞
−∞      +∞

Table 6.17: GetExp() special floating-point values behavior
This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.

Operation
if(source is a register operand and MVEX.EH bit is 1) {
    if(SSS[2]==1) Supress_Exception_Flags() // SAE
    tmpSrc2[511:0] = zmm2[511:0]
} else {
    tmpSrc2[511:0] = SwizzUpConvLoadf64(zmm2/mt)
}
for (n = 0; n < 8; n++) {
    if(k1[n] != 0) {
        i = 64*n
        zmm1[i+63:i] = GetExp(tmpSrc2[i+63:i])
    }
}

SIMD Floating-Point Exceptions
Invalid, Denormal.

Denormal Handling
Treat Input Denormals As Zeros : (MXCSR.DAZ)? YES : NO
Flush Tiny Results To Zero :     Not Applicable

Memory Up-conversion: Sf64
S2 S1 S0   Function:                    Usage                    disp8*N
000        no conversion                [rax] {8to8} or [rax]    64
001        broadcast 1 element (x8)     [rax] {1to8}             8
010        broadcast 4 elements (x2)    [rax] {4to8}             32
011        reserved                     N/A                      N/A
100        reserved                     N/A                      N/A
101        reserved                     N/A                      N/A
110        reserved                     N/A                      N/A
111        reserved                     N/A                      N/A

Register Swizzle: Sf64

MVEX.EH=0
S2 S1 S0   Function: 4 x 64 bits      Usage
000        no swizzle                 zmm0 or zmm0 {dcba}
001        swap (inner) pairs         zmm0 {cdab}
010        swap with two-away         zmm0 {badc}
011        cross-product swizzle      zmm0 {dacb}
100        broadcast a element        zmm0 {aaaa}
101        broadcast b element        zmm0 {bbbb}
110        broadcast c element        zmm0 {cccc}
111        broadcast d element        zmm0 {dddd}

MVEX.EH=1
S2 S1 S0   Rounding Mode Override            Usage
1xx        SAE (Suppress-All-Exceptions)     , {sae}

Intel® C/C++ Compiler Intrinsic Equivalent
__m512d _mm512_getexp_pd (__m512d);
__m512d _mm512_mask_getexp_pd (__m512d, __mmask8, __m512d);

Exceptions
Real-Address Mode and Virtual-8086
#UD               Instruction not available in these modes

Protected and Compatibility Mode
#UD               Instruction not available in these modes

64 bit Mode
#SS(0)            If a memory address referencing the SS segment is
                  in a non-canonical form.
#GP(0)            If the memory address is in a non-canonical form.
                  If a memory operand linear address is not aligned
                  to the data size granularity dictated by SwizzUpConv
                  mode.
#PF(fault-code)   For a page fault.
#NM               If CR0.TS[bit 3]=1.
                  If preceded by any REX, F0, F2, F3, or 66 prefixes.

VGETEXPPS - Extract Float32 Vector of Exponents from Float32 Vector

Opcode:       MVEX.512.66.0F38.W0 42 /r
Instruction:  vgetexpps zmm1 {k1}, Sf32(zmm2/mt)
Description:  Extract float32 vector of exponents from vector Sf32(zmm2/mt)
              and store the result in zmm1, under write-mask.

Description
Performs an element-by-element exponent extraction from the Float32 vector result of
the swizzle/broadcast/conversion process on memory or Float32 vector zmm2. The result
is written into Float32 vector zmm1.
GetExp() returns the (un-biased) exponent n in floating-point format. That is, when X =
1/16, GetExp() returns the value −4, represented as C0800000 in IEEE single precision
(for the single-precision version of the instruction). If the source is denormal, VGETEXP
will normalize it prior to exponent extraction (unless DAZ=1).
GetExp() function follows Table 6.18 when dealing with floating-point special numbers.
Input   Result
NaN     quietized input NaN
+∞      +∞
+0      −∞
-0      −∞
−∞      +∞

Table 6.18: GetExp() special floating-point values behavior
This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.

Operation
if(source is a register operand and MVEX.EH bit is 1) {
    if(SSS[2]==1) Supress_Exception_Flags() // SAE
    tmpSrc2[511:0] = zmm2[511:0]
} else {
    tmpSrc2[511:0] = SwizzUpConvLoadf32(zmm2/mt)
}
for (n = 0; n < 16; n++) {
    if(k1[n] != 0) {
        i = 32*n
        zmm1[i+31:i] = GetExp(tmpSrc2[i+31:i])
    }
}

SIMD Floating-Point Exceptions
Invalid, Denormal.

Denormal Handling
Treat Input Denormals As Zeros : (MXCSR.DAZ)? YES : NO
Flush Tiny Results To Zero :     Not Applicable

Memory Up-conversion: Sf32
S2 S1 S0   Function:                    Usage                      disp8*N
000        no conversion                [rax] {16to16} or [rax]    64
001        broadcast 1 element (x16)    [rax] {1to16}              4
010        broadcast 4 elements (x4)    [rax] {4to16}              16
011        float16 to float32           [rax] {float16}            32
100        uint8 to float32             [rax] {uint8}              16
110        uint16 to float32            [rax] {uint16}             32
111        sint16 to float32            [rax] {sint16}             32

Register Swizzle: Sf32

MVEX.EH=0
S2 S1 S0   Function: 4 x 32 bits      Usage
000        no swizzle                 zmm0 or zmm0 {dcba}
001        swap (inner) pairs         zmm0 {cdab}
010        swap with two-away         zmm0 {badc}
011        cross-product swizzle      zmm0 {dacb}
100        broadcast a element        zmm0 {aaaa}
101        broadcast b element        zmm0 {bbbb}
110        broadcast c element        zmm0 {cccc}
111        broadcast d element        zmm0 {dddd}

MVEX.EH=1
S2 S1 S0   Rounding Mode Override            Usage
1xx        SAE (Suppress-All-Exceptions)     , {sae}

Intel® C/C++ Compiler Intrinsic Equivalent
__m512 _mm512_getexp_ps (__m512);
__m512 _mm512_mask_getexp_ps (__m512, __mmask16, __m512);

Exceptions
Real-Address Mode and Virtual-8086
#UD               Instruction not available in these modes

Protected and Compatibility Mode
#UD               Instruction not available in these modes

64 bit Mode
#SS(0)            If a memory address referencing the SS segment is
                  in a non-canonical form.
#GP(0)            If the memory address is in a non-canonical form.
                  If a memory operand linear address is not aligned
                  to the data size granularity dictated by SwizzUpConv
                  mode.
#PF(fault-code)   For a page fault.
#NM               If CR0.TS[bit 3]=1.
                  If preceded by any REX, F0, F2, F3, or 66 prefixes.

VGETMANTPD - Extract Float64 Vector of Normalized Mantissas from
Float64 Vector

Opcode:       MVEX.512.66.0F3A.W1 26 /r ib
Instruction:  vgetmantpd zmm1 {k1}, Sf64(zmm2/mt), imm8
Description:  Get Normalized Mantissa from float64 vector Sf64(zmm2/mt) and
              store the result in zmm1, using imm8 for sign control and
              mantissa interval normalization, under write-mask.

Description
Performs an element-by-element conversion of the Float64 vector result of the
swizzle/broadcast/conversion process on memory or Float64 vector zmm2 to Float64 values
with the mantissa normalized to the interval specified by interv and sign dictated by the
sign control parameter sc. The result is written into Float64 vector zmm1. Denormal
values are explicitly normalized.
The formula for the operation is:
$$ \mathrm{GetMant}(x) = \pm 2^{k}\,|x.\mathrm{significand}| $$
where:
$$ 1 \le |x.\mathrm{significand}| < 2 $$
Exponent k is dependent on the interval range defined by interv and whether the exponent
of the source is even or odd. The sign of the final result is determined by sc and the
source sign.
GetMant() function follows Table 6.19 when dealing with floating-point special numbers.
Input   Result                  Exceptions/comments
NaN     QNaN(SRC)               Raises #I if sNaN
+∞      +∞                      ignore interv
+0      +0.0                    ignore interv
-0      (SC[0])? +0.0 : −0.0    ignore interv, set NaN/raise #I if SC[1]=1
−∞      (SC[0])? +∞ : −∞        ignore interv, set NaN/raise #I if SC[1]=1
<0                              set NaN/raise #I if SC[1]=1

Table 6.19: GetMant() special floating-point values behavior
This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.


Immediate Format
Normalization Interval               I1   I0
[1,2)                                0    0
[1/2,2)                              0    1
[1/2,1)                              1    0
[3/4,3/2)                            1    1

Sign Control                         I3   I2
sign = sign(SRC)                     0    0
sign = 0                             0    1
DEST = NaN (#I) if sign(SRC) = 1     1    x

Operation
GetNormalizedMantissa(SRC, SignCtrl, Interv)
{
    // Extracting the SRC sign, exponent and mantissa fields
    SIGN  = (SignCtrl[0])? 0 : SRC[63];
    EXP   = SRC[62:52];
    FRACT = (DAZ && (EXP == 0))? 0 : SRC[51:0];

    // Check for NaN operand
    if(IsNaN(SRC)) {
        if(IsSNaN(SRC)) *set I flag*
        return QNaN(SRC)
    }

    // If SignCtrl[1] is set to 1, return NaN and set
    // exception flag if the operand is negative.
    // Note that -0.0 is included
    if( SignCtrl[1] && (SRC[63] == 1) ) {
        *set I flag*
        return QNaN_Indefinite
    }

    // Check for +/-INF and +/-0
    if( ( EXP == 0x7FF && FRACT == 0 )
     || ( EXP == 0 && FRACT == 0 ) ) {
        DEST[63:0] = (SIGN << 63) | (EXP[10:0] << 52) | FRACT[51:0];
        return DEST
    }

    // Normalize denormal operands
    //   note that denormal operands are treated as zero if
    //   DAZ is set to 1
    if((EXP == 0) && (FRACT != 0)) {
        // JBIT is the hidden integral bit
        JBIT = 0;                   // Zero in case of denormal operands
        EXP = 03FFh;                // Set exponent to BIAS
        While(JBIT == 0) {
            JBIT = FRACT[51];       // Obtain fraction MSB
            FRACT = FRACT << 1;     // Normalize mantissa
            EXP--;                  // and adjust exponent
        }
        *set D flag*
    }

    // Apply normalization intervals
    UNBIASED_EXP = EXP - 03FFh;         // get exponent in unbiased form
    IS_ODD_EXP   = UNBIASED_EXP[0];     // is the unbiased exponent odd?

    if( (Interv == 10b)
     || ( (Interv == 01b) && IS_ODD_EXP)
     || ( (Interv == 11b) && (FRACT[51]==1)) ) {
        EXP = 03FEh;                // Set exponent to -1 (unbiased)
    }
    else {
        EXP = 03FFh;                // Set exponent to 0 (unbiased)
    }

    // form the final destination
    DEST[63:0] = (SIGN << 63) | (EXP[10:0] << 52) | FRACT[51:0];
    return DEST
}

sc = IMM8[3:2]
interv = IMM8[1:0]
if(source is a register operand and MVEX.EH bit is 1) {
    if(SSS[2]==1) Supress_Exception_Flags() // SAE
    tmpSrc2[511:0] = zmm2[511:0]
} else {
    tmpSrc2[511:0] = SwizzUpConvLoad_f64(zmm2/mt)
}
for (n = 0; n < 8; n++) {
    if(k1[n] != 0) {
        i = 64*n
        // float64 operation
        zmm1[i+63:i] = GetNormalizedMantissa(tmpSrc2[i+63:i], sc, interv)
    }
}
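
The per-element behavior of GetNormalizedMantissa can be tried out on a host machine with a scalar C model. The following is a minimal sketch, not the hardware definition: the names getmant_f64 and bits_to_f64 are hypothetical, DAZ is assumed to be 0, exception flags are not modeled, and the bit pattern used for QNaN_Indefinite (0xFFF8000000000000) is an assumption based on the usual x86 real-indefinite encoding.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Reinterpret a 64-bit pattern as a double. */
static double bits_to_f64(uint64_t bits)
{
    double d;
    memcpy(&d, &bits, sizeof d);
    return d;
}

/* Hypothetical scalar model of GetNormalizedMantissa for one float64
 * element. Assumes DAZ = 0; exception flags are not modeled. */
static double getmant_f64(double src, unsigned sc, unsigned interv)
{
    const uint64_t FRACT_MASK = 0xFFFFFFFFFFFFFull;    /* bits 51:0 */
    uint64_t bits;

    assert(sc < 4 && interv < 4);
    memcpy(&bits, &src, sizeof bits);

    uint64_t sign  = (sc & 1u) ? 0 : (bits >> 63);
    uint64_t exp   = (bits >> 52) & 0x7FF;
    uint64_t fract = bits & FRACT_MASK;

    if (exp == 0x7FF && fract != 0)                    /* NaN: quieted source */
        return bits_to_f64(bits | (1ull << 51));
    if ((sc & 2u) && (bits >> 63))                     /* negative with sc[1] set */
        return bits_to_f64(0xFFF8000000000000ull);     /* assumed QNaN indefinite */
    if ((exp == 0x7FF || exp == 0) && fract == 0)      /* +/-INF and +/-0 */
        return bits_to_f64((sign << 63) | (exp << 52));

    if (exp == 0) {                                    /* denormal: normalize */
        uint64_t jbit = 0;
        exp = 0x3FF;
        while (jbit == 0) {
            jbit  = (fract >> 51) & 1;
            fract = (fract << 1) & FRACT_MASK;
            exp--;
        }
    }

    int odd = (int)((exp - 0x3FF) & 1);                /* parity of unbiased exponent */
    if (interv == 2 || (interv == 1 && odd) ||
        (interv == 3 && ((fract >> 51) & 1)))
        exp = 0x3FE;                                   /* unbiased exponent -1 */
    else
        exp = 0x3FF;                                   /* unbiased exponent 0 */

    return bits_to_f64((sign << 63) | (exp << 52) | fract);
}
```

With this model, an input of 12.0 (1.5 * 2^3) gives 1.5 for interval [1,2) and 0.75 for interval [1/2,1), which matches the Immediate Format table.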


SIMD Floating-Point Exceptions
Invalid, Denormal.

Denormal Handling
Treat Input Denormals As Zeros : (MXCSR.DAZ)? YES : NO
Flush Tiny Results To Zero : Not Applicable

Memory Up-conversion: Sf64
S2 S1 S0    Function:                    Usage                    disp8*N
000         no conversion                [rax] {8to8} or [rax]    64
001         broadcast 1 element (x8)     [rax] {1to8}             8
010         broadcast 4 elements (x2)    [rax] {4to8}             32
011         reserved                     N/A                      N/A
100         reserved                     N/A                      N/A
101         reserved                     N/A                      N/A
110         reserved                     N/A                      N/A
111         reserved                     N/A                      N/A

Register Swizzle: Sf64
MVEX.EH=0
S2 S1 S0    Function: 4 x 64 bits      Usage
000         no swizzle                 zmm0 or zmm0 {dcba}
001         swap (inner) pairs         zmm0 {cdab}
010         swap with two-away         zmm0 {badc}
011         cross-product swizzle      zmm0 {dacb}
100         broadcast a element        zmm0 {aaaa}
101         broadcast b element        zmm0 {bbbb}
110         broadcast c element        zmm0 {cccc}
111         broadcast d element        zmm0 {dddd}
MVEX.EH=1
S2 S1 S0    Rounding Mode Override            Usage
1xx         SAE (Suppress-All-Exceptions)     , {sae}


Intel® C/C++ Compiler Intrinsic Equivalent
__m512d _mm512_getmant_pd (__m512d, _MM_MANTISSA_NORM_ENUM, _MM_MANTISSA_SIGN_ENUM);
__m512d _mm512_mask_getmant_pd (__m512d, __mmask8, __m512d, _MM_MANTISSA_NORM_ENUM, _MM_MANTISSA_SIGN_ENUM);

Exceptions
Real-Address Mode and Virtual-8086
#UD                Instruction not available in these modes

Protected and Compatibility Mode
#UD                Instruction not available in these modes

64 bit Mode
#SS(0)             If a memory address referencing the SS segment is in a non-canonical form.
#GP(0)             If the memory address is in a non-canonical form.
                   If a memory operand linear address is not aligned to the data size granularity dictated by SwizzUpConv mode.
#PF(fault-code)    For a page fault.
#NM                If CR0.TS[bit 3]=1.
#UD                If preceded by any REX, F0, F2, F3, or 66 prefixes.

VGETMANTPS - Extract Float32 Vector of Normalized Mantissas from
Float32 Vector

Opcode         MVEX.512.66.0F3A.W0 26 /r ib
Instruction    vgetmantps zmm1 {k1}, Sf32(zmm2/mt), imm8
Description    Get Normalized Mantissa from float32 vector Sf32(zmm2/mt) and store the result in zmm1, using imm8 for sign control and mantissa interval normalization, under write-mask.

Description
Performs an element-by-element conversion of the Float32 vector result of the swizzle/broadcast/conversion process on memory or Float32 vector zmm2 to Float32 values with the mantissa normalized to the interval specified by interv and sign dictated by the sign control parameter sc. The result is written into Float32 vector zmm1. Denormal values are explicitly normalized.
The formula for the operation is:
GetMant(x) = ±2^k |x.significand|
where:
1 <= |x.significand| < 2
Exponent k depends on the interval range defined by interv and on whether the exponent of the source is even or odd. The sign of the final result is determined by sc and the source sign.
The GetMant() function follows Table 6.20 when dealing with floating-point special numbers.
Input    Result                   Exceptions/comments
NaN      QNaN(SRC)                Raises #I if sNaN
+∞       +∞                       ignore interv
+0       +0.0                     ignore interv
-0       (SC[0])? +0.0 : −0.0     ignore interv, set NaN/raise #I if SC[1]=1
−∞       (SC[0])? +∞ : −∞         ignore interv, set NaN/raise #I if SC[1]=1
<0                                set NaN/raise #I if SC[1]=1

Table 6.20: GetMant() special floating-point values behavior
This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.


Immediate Format
Normalization Interval              I1    I0
[1,2)                               0     0
[1/2,2)                             0     1
[1/2,1)                             1     0
[3/4,3/2)                           1     1

Sign Control                        I3    I2
sign = sign(SRC)                    0     0
sign = 0                            0     1
DEST = NaN (#I) if sign(SRC) = 1    1     x
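
The Operation section of this instruction splits the immediate as sc = IMM8[3:2] and interv = IMM8[1:0]. A small C helper can make that packing explicit. This is an illustrative sketch only: the function and constant names below are hypothetical and are not the _MM_MANTISSA_NORM_ENUM or _MM_MANTISSA_SIGN_ENUM values of the compiler intrinsics.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names for the interval encodings (I1:I0). */
enum { INTERV_1_2 = 0, INTERV_HALF_2 = 1, INTERV_HALF_1 = 2, INTERV_3Q_3H = 3 };
/* Hypothetical names for the sign control encodings (I3:I2). */
enum { SIGN_SRC = 0, SIGN_ZERO = 1, SIGN_NAN_IF_NEG = 2 };

/* Pack sign control and interval into the low four bits of imm8. */
static uint8_t getmant_imm8(unsigned interv, unsigned sc)
{
    assert(interv < 4 && sc < 4);
    return (uint8_t)((sc << 2) | interv);
}

/* Recover the two fields the way the Operation pseudocode does. */
static unsigned imm8_interv(uint8_t imm8) { return imm8 & 3u; }
static unsigned imm8_sc(uint8_t imm8)     { return (imm8 >> 2) & 3u; }
```

For example, interval [1/2,1) with sign forced to zero packs to imm8 = 0x06 under these definitions.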

Operation
GetNormalizedMantissa(SRC, SignCtrl, Interv)
{
    // Extracting the SRC sign, exponent and mantissa fields
    SIGN  = (SignCtrl[0])? 0 : SRC[31];
    EXP   = SRC[30:23];
    FRACT = (DAZ && (EXP == 0))? 0 : SRC[22:0];

    // Check for NaN operand
    if(IsNaN(SRC)) {
        if(IsSNaN(SRC)) *set I flag*
        return QNaN(SRC)
    }

    // If SignCtrl[1] is set to 1, return NaN and set
    // exception flag if the operand is negative.
    // Note that -0.0 is included
    if( SignCtrl[1] && (SRC[31] == 1) ) {
        *set I flag*
        return QNaN_Indefinite
    }

    // Check for +/-INF and +/-0
    if( ( EXP == 0xFF && FRACT == 0 )
     || ( EXP == 0 && FRACT == 0 ) ) {
        DEST[31:0] = (SIGN << 31) | (EXP[7:0] << 23) | FRACT[22:0];
        return DEST
    }

    // Apply normalization intervals
    UNBIASED_EXP = EXP - 07Fh;          // get exponent in unbiased form
    IS_ODD_EXP   = UNBIASED_EXP[0];     // is the unbiased exponent odd?

    if( (Interv == 10b)
     || ( (Interv == 01b) && IS_ODD_EXP)
     || ( (Interv == 11b) && (FRACT[22]==1)) ) {
        EXP = 07Eh;                     // Set exponent to -1 (unbiased)
    }
    else {
        EXP = 07Fh;                     // Set exponent to 0 (unbiased)
    }

    // form the final destination
    DEST[31:0] = (SIGN << 31) | (EXP[7:0] << 23) | FRACT[22:0];
    return DEST
}

sc = IMM8[3:2]
interv = IMM8[1:0]
if(source is a register operand and MVEX.EH bit is 1) {
    if(SSS[2]==1) Supress_Exception_Flags() // SAE
    tmpSrc2[511:0] = zmm2[511:0]
} else {
    tmpSrc2[511:0] = SwizzUpConvLoad_f32(zmm2/mt)
}
for (n = 0; n < 16; n++) {
    if(k1[n] != 0) {
        i = 32*n
        // float32 operation
        zmm1[i+31:i] = GetNormalizedMantissa(tmpSrc2[i+31:i], sc, interv)
    }
}

SIMD Floating-Point Exceptions
Invalid, Denormal.

Denormal Handling
Treat Input Denormals As Zeros : (MXCSR.DAZ)? YES : NO
Flush Tiny Results To Zero : Not Applicable


Memory Up-conversion: Sf32
S2 S1 S0    Function:                    Usage                      disp8*N
000         no conversion                [rax] {16to16} or [rax]    64
001         broadcast 1 element (x16)    [rax] {1to16}              4
010         broadcast 4 elements (x4)    [rax] {4to16}              16
011         float16 to float32           [rax] {float16}            32
100         uint8 to float32             [rax] {uint8}              16
110         uint16 to float32            [rax] {uint16}             32
111         sint16 to float32            [rax] {sint16}             32
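
The disp8*N column gives the factor by which an 8-bit displacement is scaled for each up-conversion mode. As a rough illustration, the sketch below looks up N for an Sf32 memory operand and scales a signed disp8 by it. The function names are hypothetical, and returning 0 for encoding 101 (which the table does not list) is a convention of this sketch only.

```c
#include <stdint.h>

/* N per SSS encoding for Sf32 memory operands, copied from the table.
 * 0 marks encoding 101, which the table does not list. */
static unsigned sf32_disp8_n(unsigned sss)
{
    static const unsigned n[8] = { 64, 4, 16, 32, 16, 0, 32, 32 };
    return (sss < 8) ? n[sss] : 0;
}

/* Effective byte displacement for a sign-extended disp8 scaled by N. */
static int64_t scaled_disp8(int8_t disp8, unsigned n)
{
    return (int64_t)disp8 * (int64_t)n;
}
```

With no conversion (N = 64), a disp8 of 2 addresses byte offset 128; with {1to16} (N = 4), a disp8 of -1 addresses byte offset -4.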

Register Swizzle: Sf32
MVEX.EH=0
S2 S1 S0    Function: 4 x 32 bits      Usage
000         no swizzle                 zmm0 or zmm0 {dcba}
001         swap (inner) pairs         zmm0 {cdab}
010         swap with two-away         zmm0 {badc}
011         cross-product swizzle      zmm0 {dacb}
100         broadcast a element        zmm0 {aaaa}
101         broadcast b element        zmm0 {bbbb}
110         broadcast c element        zmm0 {cccc}
111         broadcast d element        zmm0 {dddd}
MVEX.EH=1
S2 S1 S0    Rounding Mode Override            Usage
1xx         SAE (Suppress-All-Exceptions)     , {sae}

Intel® C/C++ Compiler Intrinsic Equivalent
__m512 _mm512_getmant_ps (__m512, _MM_MANTISSA_NORM_ENUM, _MM_MANTISSA_SIGN_ENUM);
__m512 _mm512_mask_getmant_ps (__m512, __mmask16, __m512, _MM_MANTISSA_NORM_ENUM, _MM_MANTISSA_SIGN_ENUM);

Exceptions
Real-Address Mode and Virtual-8086
#UD                Instruction not available in these modes

Protected and Compatibility Mode
#UD                Instruction not available in these modes

64 bit Mode
#SS(0)             If a memory address referencing the SS segment is in a non-canonical form.
#GP(0)             If the memory address is in a non-canonical form.
                   If a memory operand linear address is not aligned to the data size granularity dictated by SwizzUpConv mode.
#PF(fault-code)    For a page fault.
#NM                If CR0.TS[bit 3]=1.
#UD                If preceded by any REX, F0, F2, F3, or 66 prefixes.

VGMAXABSPS - Absolute Maximum of Float32 Vectors

Opcode         MVEX.NDS.512.66.0F38.W0 51 /r
Instruction    vgmaxabsps zmm1 {k1}, zmm2, Sf32(zmm3/mt)
Description    Determine the maximum of the absolute values of float32 vector zmm2 and float32 vector Sf32(zmm3/mt) and store the result in zmm1, under write-mask.

Description
Determines the maximum of the absolute values of each pair of corresponding elements in float32 vector zmm2 and the float32 vector result of the swizzle/broadcast/conversion process on memory or float32 vector zmm3. The result is written into float32 vector zmm1.
Abs() returns the absolute value of one float32 argument. FpMax() returns the bigger of the two float32 arguments, following IEEE in general. NaN has special handling: If one source operand is NaN, then the other source operand is returned (choice made per-component). If both are NaN, then the unchanged NaN from the first source (here zmm2) is returned. Please note that if the first source is an SNaN it won't be quietized; it will be returned without any modification. This differs from the new IEEE 754-08 rules, which state that in case of an input SNaN, its quietized version should be returned instead of the other value.
Another new IEEE 754-08 rule is that max(-0,+0) == max(+0,-0) == +0, which honors the sign, in contrast to the comparison rules for signed zero (stated above). D3D10.0 recommends the IEEE 754-08 behavior here, but it will not be enforced; it is permissible for the result of comparing zeros to be dependent on the order of parameters, using a comparison that ignores the signs.
This instruction treats input denormals as zeros according to the DAZ control bit, but it does not flush tiny results to zero.
This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.

Operation
FpMaxAbs(A,B)
{
    if ((A == NaN) && (B == NaN))
        return Abs(A);
    else if (A == NaN)
        return Abs(B);
    else if (B == NaN)
        return Abs(A);
    else if ((Abs(A) == +inf) || (Abs(B) == +inf))
        return +inf;
    else if (Abs(A) >= Abs(B))
        return Abs(A);
    else
        return Abs(B);
}
if(source is a register operand and MVEX.EH bit is 1) {
    if(SSS[2]==1) Supress_Exception_Flags() // SAE
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    tmpSrc3[511:0] = SwizzUpConvLoad_f32(zmm3/mt)
}
for (n = 0; n < 16; n++) {
    if(k1[n] != 0) {
        i = 32*n
        // float32 operation
        zmm1[i+31:i] = FpMaxAbs(zmm2[i+31:i] , tmpSrc3[i+31:i])
    }
}
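
A scalar C rendering of FpMaxAbs can help when checking expected results on a host. The sketch below follows the pseudocode branch by branch; the name fp_max_abs is hypothetical, and DAZ handling and exception flags are left out.

```c
#include <math.h>

/* Hypothetical scalar model of FpMaxAbs for one float32 element pair.
 * DAZ and exception flags are not modeled. */
static float fp_max_abs(float a, float b)
{
    if (isnan(a) && isnan(b)) return fabsf(a);   /* both NaN: Abs of first source */
    if (isnan(a))             return fabsf(b);   /* one NaN: return the other operand */
    if (isnan(b))             return fabsf(a);
    if (isinf(a) || isinf(b)) return INFINITY;   /* any infinity wins */
    return (fabsf(a) >= fabsf(b)) ? fabsf(a) : fabsf(b);
}
```

For instance, fp_max_abs(-3.0f, 2.0f) yields 3.0f, and a NaN paired with -2.0f yields 2.0f.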

SIMD Floating-Point Exceptions
Invalid, Denormal.

Denormal Handling
Treat Input Denormals As Zeros : (MXCSR.DAZ)? YES : NO
Flush Tiny Results To Zero : NO

Memory Up-conversion: Sf32
S2 S1 S0    Function:                    Usage                      disp8*N
000         no conversion                [rax] {16to16} or [rax]    64
001         broadcast 1 element (x16)    [rax] {1to16}              4
010         broadcast 4 elements (x4)    [rax] {4to16}              16
011         float16 to float32           [rax] {float16}            32
100         uint8 to float32             [rax] {uint8}              16
110         uint16 to float32            [rax] {uint16}             32
111         sint16 to float32            [rax] {sint16}             32

Register Swizzle: Sf32
MVEX.EH=0
S2 S1 S0    Function: 4 x 32 bits      Usage
000         no swizzle                 zmm0 or zmm0 {dcba}
001         swap (inner) pairs         zmm0 {cdab}
010         swap with two-away         zmm0 {badc}
011         cross-product swizzle      zmm0 {dacb}
100         broadcast a element        zmm0 {aaaa}
101         broadcast b element        zmm0 {bbbb}
110         broadcast c element        zmm0 {cccc}
111         broadcast d element        zmm0 {dddd}
MVEX.EH=1
S2 S1 S0    Rounding Mode Override            Usage
1xx         SAE (Suppress-All-Exceptions)     , {sae}

Intel® C/C++ Compiler Intrinsic Equivalent
__m512 _mm512_gmaxabs_ps (__m512, __m512);
__m512 _mm512_mask_gmaxabs_ps (__m512, __mmask16, __m512, __m512);

Exceptions
Real-Address Mode and Virtual-8086
#UD                Instruction not available in these modes

Protected and Compatibility Mode
#UD                Instruction not available in these modes

64 bit Mode
#SS(0)             If a memory address referencing the SS segment is in a non-canonical form.
#GP(0)             If the memory address is in a non-canonical form.
                   If a memory operand linear address is not aligned to the data size granularity dictated by SwizzUpConv mode.
#PF(fault-code)    For a page fault.
#NM                If CR0.TS[bit 3]=1.
#UD                If processor model does not implement the specific instruction.
                   If preceded by any REX, F0, F2, F3, or 66 prefixes.

VGMAXPD - Maximum of Float64 Vectors

Opcode         MVEX.NDS.512.66.0F38.W1 53 /r
Instruction    vgmaxpd zmm1 {k1}, zmm2, Sf64(zmm3/mt)
Description    Determine the maximum of float64 vector zmm2 and float64 vector Sf64(zmm3/mt) and store the result in zmm1, under write-mask.

Description
Determines the maximum value of each pair of corresponding elements in float64 vector zmm2 and the float64 vector result of the swizzle/broadcast/conversion process on memory or float64 vector zmm3. The result is written into float64 vector zmm1.
FpMax() returns the bigger of the two float64 arguments, following IEEE in general. NaN has special handling: If one source operand is NaN, then the other source operand is returned (choice made per-component). If both are NaN, then the unchanged NaN from the first source (here zmm2) is returned. Please note that if the first source is an SNaN it won't be quietized; it will be returned without any modification. This differs from the new IEEE 754-08 rules, which state that in case of an input SNaN, its quietized version should be returned instead of the other value.
Another new IEEE 754-08 rule is that max(-0,+0) == max(+0,-0) == +0, which honors the sign, in contrast to the comparison rules for signed zero (stated above). D3D10.0 recommends the IEEE 754-08 behavior here, but it will not be enforced; it is permissible for the result of comparing zeros to be dependent on the order of parameters, using a comparison that ignores the signs.
This instruction treats input denormals as zeros according to the DAZ control bit, but it does not flush tiny results to zero.
The following table describes exception flags priority:
Input 1     Input 2     Flags    Comments
SNAN        denormal    #I       #I priority over #D
denormal    SNAN        #I       #I priority over #D
QNAN        denormal    none     QNaN rule priority over #D
denormal    QNAN        none     QNaN rule priority over #D
normal      denormal    #D       only if DAZ=0
denormal    normal      #D       only if DAZ=0
denormal    denormal    #D       only if DAZ=0

Table 6.21: Max exception flags priority
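
The priority rules in the table can be condensed into a short decision procedure. The sketch below is one possible reading, working on raw float64 bit patterns so that signaling NaNs are not disturbed by host arithmetic. All names are hypothetical, and the treatment of input pairs that the table does not list (for example SNaN paired with QNaN) is an assumption.

```c
#include <stdbool.h>
#include <stdint.h>

enum { FLAG_NONE = 0, FLAG_INVALID = 1, FLAG_DENORMAL = 2 };

static bool f64_is_nan(uint64_t b)
{
    return ((b >> 52) & 0x7FF) == 0x7FF && (b & 0xFFFFFFFFFFFFFull) != 0;
}

static bool f64_is_snan(uint64_t b)
{
    return f64_is_nan(b) && ((b >> 51) & 1) == 0;   /* quiet bit clear */
}

static bool f64_is_denormal(uint64_t b)
{
    return ((b >> 52) & 0x7FF) == 0 && (b & 0xFFFFFFFFFFFFFull) != 0;
}

/* Flag raised for one element pair, following the table's priorities. */
static int minmax_flags_f64(uint64_t a, uint64_t b, bool daz)
{
    if (f64_is_snan(a) || f64_is_snan(b))
        return FLAG_INVALID;                 /* #I has priority over #D */
    if (f64_is_nan(a) || f64_is_nan(b))
        return FLAG_NONE;                    /* QNaN rule has priority over #D */
    if (!daz && (f64_is_denormal(a) || f64_is_denormal(b)))
        return FLAG_DENORMAL;                /* #D only if DAZ=0 */
    return FLAG_NONE;
}
```

Under this reading, an SNaN paired with a denormal reports only #I, and a normal paired with a denormal reports #D unless DAZ is set.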

This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.

Operation
FpMax(A,B)
{
    if ((A == -0.0) && (B == +0.0))    return B;
    if ((A == +0.0) && (B == -0.0))    return A;
    if ((A == NaN) && (B == NaN))      return A;
    if (A == NaN)                      return B;
    if (B == NaN)                      return A;
    if (A == -inf)                     return B;
    if (B == -inf)                     return A;
    if (A == +inf)                     return A;
    if (B == +inf)                     return B;
    if (A >= B)                        return A;
    return B;
}
if(source is a register operand and MVEX.EH bit is 1) {
    if(SSS[2]==1) Supress_Exception_Flags() // SAE
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    tmpSrc3[511:0] = SwizzUpConvLoad_f64(zmm3/mt)
}
for (n = 0; n < 8; n++) {
    if(k1[n] != 0) {
        i = 64*n
        // float64 operation
        zmm1[i+63:i] = FpMax(zmm2[i+63:i] , tmpSrc3[i+63:i])
    }
}
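
The FpMax rules for signed zeros and NaNs can be expressed compactly in C. The following is a hedged sketch with a hypothetical name (fp_max); the explicit infinity branches of the pseudocode are folded into the ordinary comparison, which gives the same results for those inputs. DAZ and exception flags are not modeled.

```c
#include <math.h>

/* Hypothetical scalar model of FpMax for one float64 element pair. */
static double fp_max(double a, double b)
{
    if (a == 0.0 && b == 0.0)            /* zeros of either sign compare equal */
        return signbit(a) ? b : a;       /* mixed signs: +0 is returned */
    if (isnan(a) && isnan(b)) return a;  /* both NaN: first source */
    if (isnan(a))             return b;  /* one NaN: the other operand */
    if (isnan(b))             return a;
    return (a >= b) ? a : b;             /* covers +/-inf as well */
}
```

With this model, fp_max(-0.0, +0.0) and fp_max(+0.0, -0.0) both return +0.0, in line with the first two branches of the pseudocode.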

SIMD Floating-Point Exceptions
Invalid, Denormal.

Denormal Handling
Treat Input Denormals As Zeros : (MXCSR.DAZ)? YES : NO
Flush Tiny Results To Zero : NO


Memory Up-conversion: Sf64
S2 S1 S0    Function:                    Usage                    disp8*N
000         no conversion                [rax] {8to8} or [rax]    64
001         broadcast 1 element (x8)     [rax] {1to8}             8
010         broadcast 4 elements (x2)    [rax] {4to8}             32
011         reserved                     N/A                      N/A
100         reserved                     N/A                      N/A
101         reserved                     N/A                      N/A
110         reserved                     N/A                      N/A
111         reserved                     N/A                      N/A
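
The three defined Sf64 memory modes differ only in how many float64 elements are read and how they are replicated across the eight destination elements. A minimal C sketch of that behavior follows; the function name upconv_load_f64 and its return convention are hypothetical, and alignment checks and faults are not modeled.

```c
#include <stddef.h>

/* Hypothetical model of the Sf64 memory up-conversion.
 * Returns 0 on success, -1 for encodings the table marks as reserved. */
static int upconv_load_f64(double dst[8], const double *mem, unsigned sss)
{
    size_t count;                      /* float64 elements read from memory */

    switch (sss) {
    case 0: count = 8; break;          /* {8to8}: no conversion, 64 bytes */
    case 1: count = 1; break;          /* {1to8}: 8 bytes, replicated x8 */
    case 2: count = 4; break;          /* {4to8}: 32 bytes, replicated x2 */
    default: return -1;                /* reserved */
    }
    for (size_t i = 0; i < 8; i++)
        dst[i] = mem[i % count];
    return 0;
}
```

The byte counts in the comments match the disp8*N column of the table (64, 8 and 32).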

Register Swizzle: Sf64
MVEX.EH=0
S2 S1 S0    Function: 4 x 64 bits      Usage
000         no swizzle                 zmm0 or zmm0 {dcba}
001         swap (inner) pairs         zmm0 {cdab}
010         swap with two-away         zmm0 {badc}
011         cross-product swizzle      zmm0 {dacb}
100         broadcast a element        zmm0 {aaaa}
101         broadcast b element        zmm0 {bbbb}
110         broadcast c element        zmm0 {cccc}
111         broadcast d element        zmm0 {dddd}
MVEX.EH=1
S2 S1 S0    Rounding Mode Override            Usage
1xx         SAE (Suppress-All-Exceptions)     , {sae}

Intel® C/C++ Compiler Intrinsic Equivalent
__m512d _mm512_gmax_pd (__m512d, __m512d);
__m512d _mm512_mask_gmax_pd (__m512d, __mmask8, __m512d, __m512d);

Exceptions
Real-Address Mode and Virtual-8086
#UD                Instruction not available in these modes

Protected and Compatibility Mode
#UD                Instruction not available in these modes

64 bit Mode
#SS(0)             If a memory address referencing the SS segment is in a non-canonical form.
#GP(0)             If the memory address is in a non-canonical form.
                   If a memory operand linear address is not aligned to the data size granularity dictated by SwizzUpConv mode.
#PF(fault-code)    For a page fault.
#NM                If CR0.TS[bit 3]=1.
#UD                If processor model does not implement the specific instruction.
                   If preceded by any REX, F0, F2, F3, or 66 prefixes.

VGMAXPS - Maximum of Float32 Vectors

Opcode         MVEX.NDS.512.66.0F38.W0 53 /r
Instruction    vgmaxps zmm1 {k1}, zmm2, Sf32(zmm3/mt)
Description    Determine the maximum of float32 vector zmm2 and float32 vector Sf32(zmm3/mt) and store the result in zmm1, under write-mask.

Description
Determines the maximum value of each pair of corresponding elements in float32 vector zmm2 and the float32 vector result of the swizzle/broadcast/conversion process on memory or float32 vector zmm3. The result is written into float32 vector zmm1.
FpMax() returns the bigger of the two float32 arguments, following IEEE in general. NaN has special handling: If one source operand is NaN, then the other source operand is returned (choice made per-component). If both are NaN, then the unchanged NaN from the first source (here zmm2) is returned. Please note that if the first source is an SNaN it won't be quietized; it will be returned without any modification. This differs from the new IEEE 754-08 rules, which state that in case of an input SNaN, its quietized version should be returned instead of the other value.
Another new IEEE 754-08 rule is that max(-0,+0) == max(+0,-0) == +0, which honors the sign, in contrast to the comparison rules for signed zero (stated above). D3D10.0 recommends the IEEE 754-08 behavior here, but it will not be enforced; it is permissible for the result of comparing zeros to be dependent on the order of parameters, using a comparison that ignores the signs.
This instruction treats input denormals as zeros according to the DAZ control bit, but it does not flush tiny results to zero.
The following table describes exception flags priority:
Input 1     Input 2     Flags    Comments
SNAN        denormal    #I       #I priority over #D
denormal    SNAN        #I       #I priority over #D
QNAN        denormal    none     QNaN rule priority over #D
denormal    QNAN        none     QNaN rule priority over #D
normal      denormal    #D       only if DAZ=0
denormal    normal      #D       only if DAZ=0
denormal    denormal    #D       only if DAZ=0

Table 6.22: Max exception flags priority

This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.

Operation
FpMax(A,B)
{
    if ((A == -0.0) && (B == +0.0))    return B;
    if ((A == +0.0) && (B == -0.0))    return A;
    if ((A == NaN) && (B == NaN))      return A;
    if (A == NaN)                      return B;
    if (B == NaN)                      return A;
    if (A == -inf)                     return B;
    if (B == -inf)                     return A;
    if (A == +inf)                     return A;
    if (B == +inf)                     return B;
    if (A >= B)                        return A;
    return B;
}
if(source is a register operand and MVEX.EH bit is 1) {
    if(SSS[2]==1) Supress_Exception_Flags() // SAE
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    tmpSrc3[511:0] = SwizzUpConvLoad_f32(zmm3/mt)
}
for (n = 0; n < 16; n++) {
    if(k1[n] != 0) {
        i = 32*n
        // float32 operation
        zmm1[i+31:i] = FpMax(zmm2[i+31:i] , tmpSrc3[i+31:i])
    }
}

SIMD Floating-Point Exceptions
Invalid, Denormal.

Denormal Handling
Treat Input Denormals As Zeros : (MXCSR.DAZ)? YES : NO
Flush Tiny Results To Zero : NO


Memory Up-conversion: Sf32
S2 S1 S0    Function:                    Usage                      disp8*N
000         no conversion                [rax] {16to16} or [rax]    64
001         broadcast 1 element (x16)    [rax] {1to16}              4
010         broadcast 4 elements (x4)    [rax] {4to16}              16
011         float16 to float32           [rax] {float16}            32
100         uint8 to float32             [rax] {uint8}              16
110         uint16 to float32            [rax] {uint16}             32
111         sint16 to float32            [rax] {sint16}             32

Register Swizzle: Sf32
MVEX.EH=0
S2 S1 S0    Function: 4 x 32 bits      Usage
000         no swizzle                 zmm0 or zmm0 {dcba}
001         swap (inner) pairs         zmm0 {cdab}
010         swap with two-away         zmm0 {badc}
011         cross-product swizzle      zmm0 {dacb}
100         broadcast a element        zmm0 {aaaa}
101         broadcast b element        zmm0 {bbbb}
110         broadcast c element        zmm0 {cccc}
111         broadcast d element        zmm0 {dddd}
MVEX.EH=1
S2 S1 S0    Rounding Mode Override            Usage
1xx         SAE (Suppress-All-Exceptions)     , {sae}
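
The register swizzle patterns can be modeled as a per-lane index permutation. The sketch below assumes that the letters in a pattern such as {cdab} are read from the highest element of each 4-element lane down to the lowest, with a denoting the lowest element; that reading, and the name swizzle_f32, are assumptions of this example rather than statements taken from this section.

```c
/* Hypothetical model of the Sf32 register swizzle (MVEX.EH=0).
 * pattern[sss][p] is the source position (a=0 .. d=3) that feeds
 * destination position p inside every 4-element lane.
 * dst and src must not overlap. Returns -1 for an invalid sss. */
static int swizzle_f32(float dst[16], const float src[16], unsigned sss)
{
    static const unsigned char pattern[8][4] = {
        {0, 1, 2, 3},   /* {dcba}: no swizzle */
        {1, 0, 3, 2},   /* {cdab}: swap (inner) pairs */
        {2, 3, 0, 1},   /* {badc}: swap with two-away */
        {1, 2, 0, 3},   /* {dacb}: cross-product swizzle */
        {0, 0, 0, 0},   /* {aaaa}: broadcast a element */
        {1, 1, 1, 1},   /* {bbbb}: broadcast b element */
        {2, 2, 2, 2},   /* {cccc}: broadcast c element */
        {3, 3, 3, 3},   /* {dddd}: broadcast d element */
    };

    if (sss > 7)
        return -1;
    for (int lane = 0; lane < 16; lane += 4)
        for (int p = 0; p < 4; p++)
            dst[lane + p] = src[lane + pattern[sss][p]];
    return 0;
}
```

The same table applies to the 4 x 64 bits case if the lane loop covers 8 elements instead of 16.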

Intel® C/C++ Compiler Intrinsic Equivalent
__m512 _mm512_gmax_ps (__m512, __m512);
__m512 _mm512_mask_gmax_ps (__m512, __mmask16, __m512, __m512);

Exceptions

Real-Address Mode and Virtual-8086
#UD                Instruction not available in these modes

Protected and Compatibility Mode
#UD                Instruction not available in these modes

64 bit Mode
#SS(0)             If a memory address referencing the SS segment is in a non-canonical form.
#GP(0)             If the memory address is in a non-canonical form.
                   If a memory operand linear address is not aligned to the data size granularity dictated by SwizzUpConv mode.
#PF(fault-code)    For a page fault.
#NM                If CR0.TS[bit 3]=1.
#UD                If processor model does not implement the specific instruction.
                   If preceded by any REX, F0, F2, F3, or 66 prefixes.

VGMINPD - Minimum of Float64 Vectors

Opcode         MVEX.NDS.512.66.0F38.W1 52 /r
Instruction    vgminpd zmm1 {k1}, zmm2, Sf64(zmm3/mt)
Description    Determine the minimum of float64 vector zmm2 and float64 vector Sf64(zmm3/mt) and store the result in zmm1, under write-mask.

Description
Determines the minimum value of each pair of corresponding elements in float64 vector zmm2 and the float64 vector result of the swizzle/broadcast/conversion process on memory or float64 vector zmm3. The result is written into float64 vector zmm1.
FpMin() returns the smaller of the two float64 arguments, following IEEE in general. NaN has special handling: If one source operand is NaN, then the other source operand is returned (choice made per-component). If both are NaN, then the unchanged NaN from the first source (here zmm2) is returned. Please note that if the first source is an SNaN it won't be quietized; it will be returned without any modification. This differs from the new IEEE 754-08 rules, which state that in case of an input SNaN, its quietized version should be returned instead of the other value.
Another new IEEE 754-08 rule is that min(-0,+0) == min(+0,-0) == -0, which honors the sign, in contrast to the comparison rules for signed zero (stated above). D3D10.0 recommends the IEEE 754-08 behavior here, but it will not be enforced; it is permissible for the result of comparing zeros to be dependent on the order of parameters, using a comparison that ignores the signs.
This instruction treats input denormals as zeros according to the DAZ control bit, but it does not flush tiny results to zero.
The following table describes exception flags priority:
Input 1     Input 2     Flags    Comments
SNAN        denormal    #I       #I priority over #D
denormal    SNAN        #I       #I priority over #D
QNAN        denormal    none     QNaN rule priority over #D
denormal    QNAN        none     QNaN rule priority over #D
normal      denormal    #D       only if DAZ=0
denormal    normal      #D       only if DAZ=0
denormal    denormal    #D       only if DAZ=0

Table 6.23: Min exception flags priority

This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.

Operation
FpMin(A,B)
{
    if ((A == -0.0) && (B == +0.0))    return A;
    if ((A == +0.0) && (B == -0.0))    return B;
    if ((A == NaN) && (B == NaN))      return A;
    if (A == NaN)                      return B;
    if (B == NaN)                      return A;
    if (A == -inf)                     return A;
    if (B == -inf)                     return B;
    if (A == +inf)                     return B;
    if (B == +inf)                     return A;
    if (A < B)                         return A;
    return B;
}
if(source is a register operand and MVEX.EH bit is 1) {
    if(SSS[2]==1) Supress_Exception_Flags() // SAE
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    tmpSrc3[511:0] = SwizzUpConvLoad_f64(zmm3/mt)
}
for (n = 0; n < 8; n++) {
    if(k1[n] != 0) {
        i = 64*n
        // float64 operation
        zmm1[i+63:i] = FpMin(zmm2[i+63:i] , tmpSrc3[i+63:i])
    }
}

SIMD Floating-Point Exceptions
Invalid, Denormal.

Denormal Handling
Treat Input Denormals As Zeros : (MXCSR.DAZ)? YES : NO
Flush Tiny Results To Zero : NO


Memory Up-conversion: Sf64
S2 S1 S0    Function:                    Usage                    disp8*N
000         no conversion                [rax] {8to8} or [rax]    64
001         broadcast 1 element (x8)     [rax] {1to8}             8
010         broadcast 4 elements (x2)    [rax] {4to8}             32
011         reserved                     N/A                      N/A
100         reserved                     N/A                      N/A
101         reserved                     N/A                      N/A
110         reserved                     N/A                      N/A
111         reserved                     N/A                      N/A

Register Swizzle: Sf64
MVEX.EH=0
S2 S1 S0    Function: 4 x 64 bits      Usage
000         no swizzle                 zmm0 or zmm0 {dcba}
001         swap (inner) pairs         zmm0 {cdab}
010         swap with two-away         zmm0 {badc}
011         cross-product swizzle      zmm0 {dacb}
100         broadcast a element        zmm0 {aaaa}
101         broadcast b element        zmm0 {bbbb}
110         broadcast c element        zmm0 {cccc}
111         broadcast d element        zmm0 {dddd}
MVEX.EH=1
S2 S1 S0    Rounding Mode Override            Usage
1xx         SAE (Suppress-All-Exceptions)     , {sae}

Intel® C/C++ Compiler Intrinsic Equivalent
__m512d _mm512_gmin_pd (__m512d, __m512d);
__m512d _mm512_mask_gmin_pd (__m512d, __mmask8, __m512d, __m512d);

Exceptions
Real-Address Mode and Virtual-8086
#UD                Instruction not available in these modes

Protected and Compatibility Mode
#UD                Instruction not available in these modes

64 bit Mode
#SS(0)             If a memory address referencing the SS segment is in a non-canonical form.
#GP(0)             If the memory address is in a non-canonical form.
                   If a memory operand linear address is not aligned to the data size granularity dictated by SwizzUpConv mode.
#PF(fault-code)    For a page fault.
#NM                If CR0.TS[bit 3]=1.
#UD                If processor model does not implement the specific instruction.
                   If preceded by any REX, F0, F2, F3, or 66 prefixes.

VGMINPS - Minimum of Float32 Vectors

Opcode         MVEX.NDS.512.66.0F38.W0 52 /r
Instruction    vgminps zmm1 {k1}, zmm2, Sf32(zmm3/mt)
Description    Determine the minimum of float32 vector zmm2 and float32 vector Sf32(zmm3/mt) and store the result in zmm1, under write-mask.

Description
Determines the minimum value of each pair of corresponding elements in float32 vector zmm2 and the float32 vector result of the swizzle/broadcast/conversion process on memory or float32 vector zmm3. The result is written into float32 vector zmm1.
FpMin() returns the smaller of the two float32 arguments, following IEEE in general. NaN has special handling: If one source operand is NaN, then the other source operand is returned (choice made per-component). If both are NaN, then the unchanged NaN from the first source (here zmm2) is returned. Please note that if the first source is an SNaN it won't be quietized; it will be returned without any modification. This differs from the new IEEE 754-08 rules, which state that in case of an input SNaN, its quietized version should be returned instead of the other value.
Another new IEEE 754-08 rule is that min(-0,+0) == min(+0,-0) == -0, which honors the sign, in contrast to the comparison rules for signed zero (stated above). D3D10.0 recommends the IEEE 754-08 behavior here, but it will not be enforced; it is permissible for the result of comparing zeros to be dependent on the order of parameters, using a comparison that ignores the signs.
This instruction treats input denormals as zeros according to the DAZ control bit, but it does not flush tiny results to zero.
The following table describes exception flags priority:
Input 1     Input 2     Flags    Comments
SNAN        denormal    #I       #I priority over #D
denormal    SNAN        #I       #I priority over #D
QNAN        denormal    none     QNaN rule priority over #D
denormal    QNAN        none     QNaN rule priority over #D
normal      denormal    #D       only if DAZ=0
denormal    normal      #D       only if DAZ=0
denormal    denormal    #D       only if DAZ=0

Table 6.24: Min exception flags priority

This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.

Operation
FpMin(A,B)
{
    if ((A == -0.0) && (B == +0.0))    return A;
    if ((A == +0.0) && (B == -0.0))    return B;
    if ((A == NaN) && (B == NaN))      return A;
    if (A == NaN)                      return B;
    if (B == NaN)                      return A;
    if (A == -inf)                     return A;
    if (B == -inf)                     return B;
    if (A == +inf)                     return B;
    if (B == +inf)                     return A;
    if (A < B)                         return A;
    return B;
}
if(source is a register operand and MVEX.EH bit is 1) {
    if(SSS[2]==1) Supress_Exception_Flags() // SAE
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    tmpSrc3[511:0] = SwizzUpConvLoad_f32(zmm3/mt)
}
for (n = 0; n < 16; n++) {
    if(k1[n] != 0) {
        i = 32*n
        // float32 operation
        zmm1[i+31:i] = FpMin(zmm2[i+31:i] , tmpSrc3[i+31:i])
    }
}
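
The write-masked loop above can be emulated on plain arrays to see which destination elements change. The sketch below is illustrative only: vgminps_model and fp_min are hypothetical names, the source operand is assumed to be already swizzled or up-converted, and DAZ and exception flags are not modeled.

```c
#include <math.h>
#include <stdint.h>

/* Hypothetical scalar model of FpMin for one float32 element pair. */
static float fp_min(float a, float b)
{
    if (a == 0.0f && b == 0.0f)          /* zeros of either sign compare equal */
        return signbit(a) ? a : b;       /* mixed signs: -0 is returned */
    if (isnan(a) && isnan(b)) return a;  /* both NaN: first source */
    if (isnan(a))             return b;  /* one NaN: the other operand */
    if (isnan(b))             return a;
    return (a < b) ? a : b;              /* covers +/-inf as well */
}

/* Array model of the write-masked element loop: only elements whose
 * bit is set in k1 are computed and stored; the rest of zmm1 is kept. */
static void vgminps_model(float zmm1[16], uint16_t k1,
                          const float zmm2[16], const float src3[16])
{
    for (int n = 0; n < 16; n++)
        if ((k1 >> n) & 1u)
            zmm1[n] = fp_min(zmm2[n], src3[n]);
}
```

Calling vgminps_model with k1 = 0x0F0F updates elements 0 to 3 and 8 to 11 and leaves the other eight elements of zmm1 with their previous values.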

SIMD Floating-Point Exceptions
Invalid, Denormal.

Denormal Handling
Treat Input Denormals As Zeros : (MXCSR.DAZ)? YES : NO
Flush Tiny Results To Zero : NO


Memory Up-conversion: Sf32
S2 S1 S0    Function:                    Usage                      disp8*N
000         no conversion                [rax] {16to16} or [rax]    64
001         broadcast 1 element (x16)    [rax] {1to16}              4
010         broadcast 4 elements (x4)    [rax] {4to16}              16
011         float16 to float32           [rax] {float16}            32
100         uint8 to float32             [rax] {uint8}              16
110         uint16 to float32            [rax] {uint16}             32
111         sint16 to float32            [rax] {sint16}             32

Register Swizzle: Sf32
MVEX.EH=0
S2 S1 S0    Function: 4 x 32 bits      Usage
000         no swizzle                 zmm0 or zmm0 {dcba}
001         swap (inner) pairs         zmm0 {cdab}
010         swap with two-away         zmm0 {badc}
011         cross-product swizzle      zmm0 {dacb}
100         broadcast a element        zmm0 {aaaa}
101         broadcast b element        zmm0 {bbbb}
110         broadcast c element        zmm0 {cccc}
111         broadcast d element        zmm0 {dddd}
MVEX.EH=1
S2 S1 S0    Rounding Mode Override            Usage
1xx         SAE (Suppress-All-Exceptions)     , {sae}

Intel® C/C++ Compiler Intrinsic Equivalent
__m512 _mm512_gmin_ps (__m512, __m512);
__m512 _mm512_mask_gmin_ps (__m512, __mmask16, __m512, __m512);

Exceptions

Real-Address Mode and Virtual-8086
#UD              Instruction not available in these modes

Protected and Compatibility Mode
#UD              Instruction not available in these modes

64 bit Mode
#SS(0)           If a memory address referencing the SS segment is
                 in a non-canonical form.
#GP(0)           If the memory address is in a non-canonical form.
                 If a memory operand linear address is not aligned
                 to the data size granularity dictated by SwizzUpConv
                 mode.
#PF(fault-code)  For a page fault.
#NM              If CR0.TS[bit 3]=1.
#UD              If processor model does not implement the specific instruction.
                 If preceded by any REX, F0, F2, F3, or 66 prefixes.

VLOADUNPACKHD - Load Unaligned High And Unpack To Doubleword Vector

Opcode: MVEX.512.0F38.W0 D4 /r
Instruction: vloadunpackhd zmm1 {k1}, Ui32(mt)
Description: Load high 64-byte-aligned portion of unaligned doubleword stream
Ui32(mt - 64), unpack mask-enabled elements that fall in that portion, and store
those elements in doubleword vector zmm1, under write-mask.

Description
The high-64-byte portion of the byte/word/doubleword stream starting at the element-aligned
address (mt - 64) is loaded, converted and expanded into the write-mask-enabled
elements of doubleword vector zmm1. The number of set bits in the write-mask determines
the length of the converted doubleword stream, as each converted doubleword is
mapped to exactly one of the doubleword elements in zmm1, skipping over write-masked
elements of zmm1.
This instruction only transfers those converted doublewords (if any) in the stream that
occur at or after the first 64-byte-aligned address following (mt - 64) (that is, in the high
cache line of the memory stream for the current implementation). Elements in zmm1
that don't map to those stream doublewords are left unchanged. The vloadunpackld
instruction is used to load the part of the stream before the first 64-byte-aligned address
preceding mt.
In conjunction with vloadunpackld, this instruction is useful for re-expanding data that
was packed into a queue. Also in conjunction with vloadunpackld, it allows unaligned
vector loads (that is, vector loads that are only element-wise, not vector-wise, aligned);
use a mask of 0xFFFF or no write-mask for this purpose. The typical instruction sequence
to perform an unaligned vector load would be:
// assume the memory location is pointed to by register rax
vloadunpackld zmm0 {k1}, [rax]
vloadunpackhd zmm0 {k1}, [rax+64]
This instruction does not have broadcast support.
This instruction has special disp8*N and alignment rules. N is considered to be the size
of a single vector element before up-conversion.
Note that this instruction will always access 64 bytes of memory. The memory region
accessed will always lie between the boundaries linear_address & (~0x3F) and
(linear_address & (~0x3F)) + 63.
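
The bounds of the 64-byte region described above follow directly from masking off the low six address bits. A minimal C sketch (the helper names line_first and line_last are hypothetical):

```c
#include <assert.h>
#include <stdint.h>

/* First byte of the 64-byte line that contains addr. */
static uint64_t line_first(uint64_t addr)
{
    return addr & ~UINT64_C(0x3F);
}

/* Last byte of the same 64-byte line. */
static uint64_t line_last(uint64_t addr)
{
    return line_first(addr) + 63;
}
```

For example, an operand at linear address 0x1234 touches bytes 0x1200 through 0x123F, regardless of how many elements are actually transferred.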
Note that the address reported by a page fault is the beginning of the 64-byte cache line
containing the memory operand. The instruction will not produce any #GP or
#SS fault due to address canonicity nor #PF fault if the address is aligned to a 64-byte
boundary. Additionally, A/D bits in the page table will not be updated.
This instruction is write-masked, so only those elements with the corresponding bit set in
vector mask register k1 are modified in zmm1. Elements in zmm1 with the corresponding
bit clear in vector mask register k1 retain their previous values. However, see above for
unusual aspects of the write-mask's operation with this instruction.

Operation
loadOffset = 0
upSize = UpConvLoadSizeOf_i32(SSS[2:0])
foundNext64BytesBoundary = false
pointer = mt - 64
for (n = 0; n < 16; n++) {
    if(k1[n] != 0) {
        if (foundNext64BytesBoundary == false) {
            if (((pointer + (loadOffset+1)*upSize) % 64) == 0) {
                foundNext64BytesBoundary = true
            }
        } else {
            i = 32*n
            zmm1[i+31:i] = UpConvLoad_i32(pointer + upSize*loadOffset)
        }
        loadOffset++
    }
}
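
The loop above can be emulated in plain C to see which elements the high half actually writes. The sketch below is an illustration only: it assumes the no-conversion case (upSize = 4), treats the index into the mem array as the linear address so that alignment is simply index % 64, and uses the hypothetical name loadunpackhi_epi32.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Emulates the vloadunpackhd pseudocode for 4-byte source elements.
 * mem is indexed by simulated linear address; mt is the operand address
 * (normally the stream start plus 64). */
static void loadunpackhi_epi32(uint32_t zmm[16], uint16_t k1,
                               const uint8_t *mem, size_t mt)
{
    const size_t up_size = 4;
    size_t pointer, load_offset = 0;
    int found_boundary = 0;

    assert(mt >= 64 && mt % up_size == 0);  /* element-wise alignment */
    pointer = mt - 64;
    for (int n = 0; n < 16; n++) {
        if (k1 & (1u << n)) {
            if (!found_boundary) {
                /* This element ends exactly on a 64-byte boundary, so the
                 * next enabled element is the first one in the high line. */
                if ((pointer + (load_offset + 1) * up_size) % 64 == 0)
                    found_boundary = 1;
            } else {
                memcpy(&zmm[n], mem + pointer + up_size * load_offset, up_size);
            }
            load_offset++;
        }
    }
}
```

With a stream starting at address 8 and a full mask, only elements 14 and 15 are written; elements 0 to 13 lie in the low cache line and are left for the low-half instruction.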

Flags Affected
None.

Memory Up-conversion: Ui32
S2 S1 S0   Function:           Usage                      disp8*N
000        no conversion       [rax] {16to16} or [rax]    4
001        reserved            N/A                        N/A
010        reserved            N/A                        N/A
011        reserved            N/A                        N/A
100        uint8 to uint32     [rax] {uint8}              1
101        sint8 to sint32     [rax] {sint8}              1
110        uint16 to uint32    [rax] {uint16}             2
111        sint16 to sint32    [rax] {sint16}             2
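
The disp8*N column above gives the size in bytes of one source element, which is also the factor applied to an 8-bit displacement. A small C sketch of that lookup (the function names are hypothetical, and returning 0 for reserved encodings is a convention of this sketch, not of the hardware):

```c
#include <assert.h>

/* Bytes per source element for the Ui32 up-conversion selected by SSS[2:0].
 * Returns 0 for the reserved encodings 001, 010 and 011. */
static unsigned upconv_load_size_i32(unsigned sss)
{
    assert(sss < 8);
    switch (sss) {
    case 0:         return 4;   /* no conversion            */
    case 4: case 5: return 1;   /* uint8 / sint8 sources    */
    case 6: case 7: return 2;   /* uint16 / sint16 sources  */
    default:        return 0;   /* reserved                 */
    }
}

/* Effective byte displacement for a compressed 8-bit displacement. */
static long disp8_scaled(signed char disp8, unsigned sss)
{
    return (long)disp8 * (long)upconv_load_size_i32(sss);
}
```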

Intel® C/C++ Compiler Intrinsic Equivalent
__m512i _mm512_extloadunpackhi_epi32 (__m512i, void const*, _MM_UPCONV_EPI32_ENUM, int);
__m512i _mm512_mask_extloadunpackhi_epi32 (__m512i, __mmask16, void const*,
                                           _MM_UPCONV_EPI32_ENUM, int);
__m512i _mm512_loadunpackhi_epi32 (__m512i, void const*);
__m512i _mm512_mask_loadunpackhi_epi32 (__m512i, __mmask16, void const*);

Exceptions
Real-Address Mode and Virtual-8086
#UD              Instruction not available in these modes

Protected and Compatibility Mode
#UD              Instruction not available in these modes

64 bit Mode
#SS(0)           If a memory address referencing the SS segment is
                 in a non-canonical form.
#GP(0)           If the memory address is in a non-canonical form.
                 If a memory operand linear address is not aligned
                 to the element-wise data granularity dictated by the UpConv.
#PF(fault-code)  For a page fault.
#NM              If CR0.TS[bit 3]=1.
#UD              If processor model does not implement the specific instruction.
                 If preceded by any REX, F0, F2, F3, or 66 prefixes.
                 If the second operand is not a memory location.

VLOADUNPACKHPD - Load Unaligned High And Unpack To Float64 Vector

Opcode: MVEX.512.0F38.W1 D5 /r
Instruction: vloadunpackhpd zmm1 {k1}, Uf64(mt)
Description: Load high 64-byte-aligned portion of unaligned float64 stream
Uf64(mt - 64), unpack mask-enabled elements that fall in that portion, and store
those elements in float64 vector zmm1, under write-mask.

Description
The high-64-byte portion of the quadword stream starting at the element-aligned address
(mt - 64) is loaded, converted and expanded into the write-mask-enabled elements of
quadword vector zmm1. The number of set bits in the write-mask determines the length
of the converted quadword stream, as each converted quadword is mapped to exactly one
of the quadword elements in zmm1, skipping over write-masked elements of zmm1.
This instruction only transfers those converted quadwords (if any) in the stream that
occur at or after the first 64-byte-aligned address following (mt - 64) (that is, in the high
cache line of the memory stream for the current implementation). Elements in zmm1
that don't map to those stream quadwords are left unchanged. The vloadunpacklpd
instruction is used to load the part of the stream before the first 64-byte-aligned address
preceding mt.
In conjunction with vloadunpacklpd, this instruction is useful for re-expanding data that
was packed into a queue. Also in conjunction with vloadunpacklpd, it allows unaligned
vector loads (that is, vector loads that are only element-wise, not vector-wise, aligned);
use a mask of 0xFF or no write-mask for this purpose. The typical instruction sequence
to perform an unaligned vector load would be:
// assume the memory location is pointed to by register rax
vloadunpacklpd zmm0 {k1}, [rax]
vloadunpackhpd zmm0 {k1}, [rax+64]
This instruction does not have broadcast support.
This instruction has special disp8*N and alignment rules. N is considered to be the size
of a single vector element before up-conversion.
Note that this instruction will always access 64 bytes of memory. The memory region
accessed will always lie between the boundaries linear_address & (~0x3F) and
(linear_address & (~0x3F)) + 63.
Note that the address reported by a page fault is the beginning of the 64-byte cache line
containing the memory operand. The instruction will not produce any #GP or
#SS fault due to address canonicity nor #PF fault if the address is aligned to a 64-byte
boundary. Additionally, A/D bits in the page table will not be updated.
This instruction is write-masked, so only those elements with the corresponding bit set in
vector mask register k1 are modified in zmm1. Elements in zmm1 with the corresponding
bit clear in vector mask register k1 retain their previous values. However, see above for
unusual aspects of the write-mask's operation with this instruction.

Operation

loadOffset = 0
upSize = UpConvLoadSizeOf_f64(SSS[2:0])
foundNext64BytesBoundary = false
pointer = mt - 64
for (n = 0; n < 8; n++) {
    if(k1[n] != 0) {
        if (foundNext64BytesBoundary == false) {
            if (((pointer + (loadOffset+1)*upSize) % 64) == 0) {
                foundNext64BytesBoundary = true
            }
        } else {
            i = 64*n
            zmm1[i+63:i] = UpConvLoad_f64(pointer + upSize*loadOffset)
        }
        loadOffset++
    }
}
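
To illustrate the queue re-expansion use case, the following C sketch emulates the low and high float64 loops back to back with a sparse write-mask. It is a sketch under stated assumptions: the mem index stands in for the linear address, no up-conversion is applied, and the names unpack_lo_pd, unpack_hi_pd and expand_queue_pd are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Low half: enabled elements up to the end of the first cache line. */
static void unpack_lo_pd(double zmm[8], uint8_t k1,
                         const uint8_t *mem, size_t mt)
{
    size_t off = 0;
    for (int n = 0; n < 8; n++) {
        if (k1 & (1u << n)) {
            memcpy(&zmm[n], mem + mt + 8 * off, 8);
            off++;
            if ((mt + 8 * off) % 64 == 0)
                break;
        }
    }
}

/* High half: enabled elements at or after the next 64-byte boundary. */
static void unpack_hi_pd(double zmm[8], uint8_t k1,
                         const uint8_t *mem, size_t mt)
{
    size_t pointer, off = 0;
    int found = 0;
    assert(mt >= 64);
    pointer = mt - 64;
    for (int n = 0; n < 8; n++) {
        if (k1 & (1u << n)) {
            if (!found) {
                if ((pointer + (off + 1) * 8) % 64 == 0)
                    found = 1;
            } else {
                memcpy(&zmm[n], mem + pointer + 8 * off, 8);
            }
            off++;
        }
    }
}

/* Pairing of the two halves, mirroring the [rax] / [rax+64] sequence. */
static void expand_queue_pd(double zmm[8], uint8_t k1,
                            const uint8_t *mem, size_t addr)
{
    unpack_lo_pd(zmm, k1, mem, addr);
    unpack_hi_pd(zmm, k1, mem, addr + 64);
}
```

With mask 0xA5 and a stream starting at address 48, the four consecutive quadwords in memory land in elements 0, 2, 5 and 7; the other elements keep their previous contents.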

SIMD Floating-Point Exceptions
None.

Memory Up-conversion: Uf64
S2 S1 S0   Function:        Usage                    disp8*N
000        no conversion    [rax] {8to8} or [rax]    8
001        reserved         N/A                      N/A
010        reserved         N/A                      N/A
011        reserved         N/A                      N/A
100        reserved         N/A                      N/A
101        reserved         N/A                      N/A
110        reserved         N/A                      N/A
111        reserved         N/A                      N/A

Intel® C/C++ Compiler Intrinsic Equivalent
__m512d _mm512_extloadunpackhi_pd (__m512d, void const*, _MM_UPCONV_PD_ENUM, int);
__m512d _mm512_mask_extloadunpackhi_pd (__m512d, __mmask8, void const*,
                                        _MM_UPCONV_PD_ENUM, int);
__m512d _mm512_loadunpackhi_pd (__m512d, void const*);
__m512d _mm512_mask_loadunpackhi_pd (__m512d, __mmask8, void const*);

Exceptions
Real-Address Mode and Virtual-8086
#UD              Instruction not available in these modes

Protected and Compatibility Mode
#UD              Instruction not available in these modes

64 bit Mode
#SS(0)           If a memory address referencing the SS segment is
                 in a non-canonical form.
#GP(0)           If the memory address is in a non-canonical form.
                 If a memory operand linear address is not aligned
                 to the element-wise data granularity dictated by the UpConv.
#PF(fault-code)  For a page fault.
#NM              If CR0.TS[bit 3]=1.
#UD              If processor model does not implement the specific instruction.
                 If preceded by any REX, F0, F2, F3, or 66 prefixes.
                 If the second operand is not a memory location.

VLOADUNPACKHPS - Load Unaligned High And Unpack To Float32 Vector

Opcode: MVEX.512.0F38.W0 D5 /r
Instruction: vloadunpackhps zmm1 {k1}, Uf32(mt)
Description: Load high 64-byte-aligned portion of unaligned float32 stream
Uf32(mt - 64), unpack mask-enabled elements that fall in that portion, and store
those elements in float32 vector zmm1, under write-mask.

Description
The high-64-byte portion of the byte/word/doubleword stream starting at the element-aligned
address (mt - 64) is loaded, converted and expanded into the write-mask-enabled
elements of doubleword vector zmm1. The number of set bits in the write-mask determines
the length of the converted doubleword stream, as each converted doubleword is
mapped to exactly one of the doubleword elements in zmm1, skipping over write-masked
elements of zmm1.
This instruction only transfers those converted doublewords (if any) in the stream that
occur at or after the first 64-byte-aligned address following (mt - 64) (that is, in the high
cache line of the memory stream for the current implementation). Elements in zmm1
that don't map to those stream doublewords are left unchanged. The vloadunpacklps
instruction is used to load the part of the stream before the first 64-byte-aligned address
preceding mt.
In conjunction with vloadunpacklps, this instruction is useful for re-expanding data that
was packed into a queue. Also in conjunction with vloadunpacklps, it allows unaligned
vector loads (that is, vector loads that are only element-wise, not vector-wise, aligned);
use a mask of 0xFFFF or no write-mask for this purpose. The typical instruction sequence
to perform an unaligned vector load would be:
// assume the memory location is pointed to by register rax
vloadunpacklps zmm0 {k1}, [rax]
vloadunpackhps zmm0 {k1}, [rax+64]
This instruction does not have broadcast support.
This instruction has special disp8*N and alignment rules. N is considered to be the size
of a single vector element before up-conversion.
Note that this instruction will always access 64 bytes of memory. The memory region
accessed will always lie between the boundaries linear_address & (~0x3F) and
(linear_address & (~0x3F)) + 63.
Note that the address reported by a page fault is the beginning of the 64-byte cache line
containing the memory operand. The instruction will not produce any #GP or
#SS fault due to address canonicity nor #PF fault if the address is aligned to a 64-byte
boundary. Additionally, A/D bits in the page table will not be updated.
This instruction is write-masked, so only those elements with the corresponding bit set in
vector mask register k1 are modified in zmm1. Elements in zmm1 with the corresponding
bit clear in vector mask register k1 retain their previous values. However, see above for
unusual aspects of the write-mask's operation with this instruction.

Operation
loadOffset = 0
upSize = UpConvLoadSizeOf_f32(SSS[2:0])
foundNext64BytesBoundary = false
pointer = mt - 64
for (n = 0; n < 16; n++) {
    if(k1[n] != 0) {
        if (foundNext64BytesBoundary == false) {
            if (((pointer + (loadOffset+1)*upSize) % 64) == 0) {
                foundNext64BytesBoundary = true
            }
        } else {
            i = 32*n
            zmm1[i+31:i] = UpConvLoad_f32(pointer + upSize*loadOffset)
        }
        loadOffset++
    }
}

SIMD Floating-Point Exceptions
Invalid.

Memory Up-conversion: Uf32
S2 S1 S0   Function:             Usage                      disp8*N
000        no conversion         [rax] {16to16} or [rax]    4
001        reserved              N/A                        N/A
010        reserved              N/A                        N/A
011        float16 to float32    [rax] {float16}            2
100        uint8 to float32      [rax] {uint8}              1
101        sint8 to float32      [rax] {sint8}              1
110        uint16 to float32     [rax] {uint16}             2
111        sint16 to float32     [rax] {sint16}             2
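
The conversions listed in the Uf32 table can be approximated element by element in C. The sketch below is illustrative only: the function names are hypothetical, a little-endian host is assumed (as on x86), and NaN payloads, signaling NaNs and exception flags are not modeled.

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>
#include <string.h>

/* IEEE 754 binary16 to float: 1 sign, 5 exponent, 10 mantissa bits. */
static float half_to_float(uint16_t h)
{
    int sign = (h >> 15) & 1;
    int exp  = (h >> 10) & 0x1F;
    int man  = h & 0x3FF;
    float v;

    if (exp == 0)
        v = ldexpf((float)man, -24);               /* zero or subnormal */
    else if (exp == 31)
        v = man ? NAN : INFINITY;                  /* NaN or infinity   */
    else
        v = ldexpf((float)(man | 0x400), exp - 25);
    return sign ? -v : v;
}

/* Loads one source element and converts it to float32 according to SSS.
 * Returns 1 on success and 0 for a reserved encoding. */
static int upconv_load_f32(const void *src, unsigned sss, float *out)
{
    uint16_t u16; int16_t s16; uint8_t u8; int8_t s8;

    switch (sss) {
    case 0: memcpy(out, src, 4);                                 return 1;
    case 3: memcpy(&u16, src, 2); *out = half_to_float(u16);     return 1;
    case 4: memcpy(&u8,  src, 1); *out = (float)u8;              return 1;
    case 5: memcpy(&s8,  src, 1); *out = (float)s8;              return 1;
    case 6: memcpy(&u16, src, 2); *out = (float)u16;             return 1;
    case 7: memcpy(&s16, src, 2); *out = (float)s16;             return 1;
    default:                                                     return 0;
    }
}
```

All integer source types here fit exactly in a float32 mantissa, so the integer cases are lossless.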

Intel® C/C++ Compiler Intrinsic Equivalent
__m512 _mm512_extloadunpackhi_ps (__m512, void const*, _MM_UPCONV_PS_ENUM, int);
__m512 _mm512_mask_extloadunpackhi_ps (__m512, __mmask16, void const*,
                                       _MM_UPCONV_PS_ENUM, int);
__m512 _mm512_loadunpackhi_ps (__m512, void const*);
__m512 _mm512_mask_loadunpackhi_ps (__m512, __mmask16, void const*);

Exceptions
Real-Address Mode and Virtual-8086
#UD              Instruction not available in these modes

Protected and Compatibility Mode
#UD              Instruction not available in these modes

64 bit Mode
#SS(0)           If a memory address referencing the SS segment is
                 in a non-canonical form.
#GP(0)           If the memory address is in a non-canonical form.
                 If a memory operand linear address is not aligned
                 to the element-wise data granularity dictated by the UpConv.
#PF(fault-code)  For a page fault.
#NM              If CR0.TS[bit 3]=1.
#UD              If processor model does not implement the specific instruction.
                 If preceded by any REX, F0, F2, F3, or 66 prefixes.
                 If the second operand is not a memory location.

VLOADUNPACKHQ - Load Unaligned High And Unpack To Int64 Vector

Opcode: MVEX.512.0F38.W1 D4 /r
Instruction: vloadunpackhq zmm1 {k1}, Ui64(mt)
Description: Load high 64-byte-aligned portion of unaligned int64 stream
Ui64(mt - 64), unpack mask-enabled elements that fall in that portion, and store
those elements in int64 vector zmm1, under write-mask.

Description
The high-64-byte portion of the quadword stream starting at the element-aligned address
(mt - 64) is loaded, converted and expanded into the write-mask-enabled elements of
quadword vector zmm1. The number of set bits in the write-mask determines the length
of the converted quadword stream, as each converted quadword is mapped to exactly one
of the quadword elements in zmm1, skipping over write-masked elements of zmm1.
This instruction only transfers those converted quadwords (if any) in the stream that
occur at or after the first 64-byte-aligned address following (mt - 64) (that is, in the high
cache line of the memory stream for the current implementation). Elements in zmm1 that
don't map to those stream quadwords are left unchanged. The vloadunpacklq instruction
is used to load the part of the stream before the first 64-byte-aligned address preceding
mt.
In conjunction with vloadunpacklq, this instruction is useful for re-expanding data that
was packed into a queue. Also in conjunction with vloadunpacklq, it allows unaligned
vector loads (that is, vector loads that are only element-wise, not vector-wise, aligned);
use a mask of 0xFF or no write-mask for this purpose. The typical instruction sequence
to perform an unaligned vector load would be:
// assume the memory location is pointed to by register rax
vloadunpacklq zmm0 {k1}, [rax]
vloadunpackhq zmm0 {k1}, [rax+64]
This instruction does not have broadcast support.
This instruction has special disp8*N and alignment rules. N is considered to be the size
of a single vector element before up-conversion.
Note that this instruction will always access 64 bytes of memory. The memory region
accessed will always lie between the boundaries linear_address & (~0x3F) and
(linear_address & (~0x3F)) + 63.
Note that the address reported by a page fault is the beginning of the 64-byte cache line
containing the memory operand. The instruction will not produce any #GP or
#SS fault due to address canonicity nor #PF fault if the address is aligned to a 64-byte
boundary. Additionally, A/D bits in the page table will not be updated.
This instruction is write-masked, so only those elements with the corresponding bit set in
vector mask register k1 are modified in zmm1. Elements in zmm1 with the corresponding
bit clear in vector mask register k1 retain their previous values. However, see above for
unusual aspects of the write-mask's operation with this instruction.

Operation
loadOffset = 0
upSize = UpConvLoadSizeOf_i64(SSS[2:0])
foundNext64BytesBoundary = false
pointer = mt - 64
for (n = 0; n < 8; n++) {
    if(k1[n] != 0) {
        if (foundNext64BytesBoundary == false) {
            if (((pointer + (loadOffset+1)*upSize) % 64) == 0) {
                foundNext64BytesBoundary = true
            }
        } else {
            i = 64*n
            zmm1[i+63:i] = UpConvLoad_i64(pointer + upSize*loadOffset)
        }
        loadOffset++
    }
}

Flags Affected
None.

Memory Up-conversion: Ui64
S2 S1 S0   Function:        Usage                    disp8*N
000        no conversion    [rax] {8to8} or [rax]    8
001        reserved         N/A                      N/A
010        reserved         N/A                      N/A
011        reserved         N/A                      N/A
100        reserved         N/A                      N/A
101        reserved         N/A                      N/A
110        reserved         N/A                      N/A
111        reserved         N/A                      N/A

Intel® C/C++ Compiler Intrinsic Equivalent
__m512i _mm512_extloadunpackhi_epi64 (__m512i, void const*, _MM_UPCONV_EPI64_ENUM, int);
__m512i _mm512_mask_extloadunpackhi_epi64 (__m512i, __mmask8, void const*,
                                           _MM_UPCONV_EPI64_ENUM, int);
__m512i _mm512_loadunpackhi_epi64 (__m512i, void const*);
__m512i _mm512_mask_loadunpackhi_epi64 (__m512i, __mmask8, void const*);

Exceptions
Real-Address Mode and Virtual-8086
#UD              Instruction not available in these modes

Protected and Compatibility Mode
#UD              Instruction not available in these modes

64 bit Mode
#SS(0)           If a memory address referencing the SS segment is
                 in a non-canonical form.
#GP(0)           If the memory address is in a non-canonical form.
                 If a memory operand linear address is not aligned
                 to the element-wise data granularity dictated by the UpConv.
#PF(fault-code)  For a page fault.
#NM              If CR0.TS[bit 3]=1.
#UD              If processor model does not implement the specific instruction.
                 If preceded by any REX, F0, F2, F3, or 66 prefixes.
                 If the second operand is not a memory location.

VLOADUNPACKLD - Load Unaligned Low And Unpack To Doubleword Vector

Opcode: MVEX.512.0F38.W0 D0 /r
Instruction: vloadunpackld zmm1 {k1}, Ui32(mt)
Description: Load low 64-byte-aligned portion of unaligned doubleword stream
Ui32(mt), unpack mask-enabled elements that fall in that portion, and store
those elements in doubleword vector zmm1, under write-mask.

Description
The low-64-byte portion of the byte/word/doubleword stream starting at the element-aligned
address mt is loaded, converted and expanded into the write-mask-enabled elements
of doubleword vector zmm1. The number of set bits in the write-mask determines
the length of the converted doubleword stream, as each converted doubleword is
mapped to exactly one of the doubleword elements in zmm1, skipping over write-masked
elements of zmm1.
This instruction only transfers those converted doublewords (if any) in the stream that
occur before the first 64-byte-aligned address following mt (that is, in the low cache line of
the memory stream in the current implementation). Elements in zmm1 that don't map to
those converted stream doublewords are left unchanged. The vloadunpackhd instruction
is used to load the part of the stream at or after the first 64-byte-aligned address preceding
mt.
In conjunction with vloadunpackhd, this instruction is useful for re-expanding data that
was packed into a queue. Also in conjunction with vloadunpackhd, it allows unaligned
vector loads (that is, vector loads that are only element-wise, not vector-wise, aligned);
use a mask of 0xFFFF or no write-mask for this purpose. The typical instruction sequence
to perform an unaligned vector load would be:
// assume the memory location is pointed to by register rax
vloadunpackld zmm0 {k1}, [rax]
vloadunpackhd zmm0 {k1}, [rax+64]
This instruction does not have broadcast support.
This instruction has special disp8*N and alignment rules. N is considered to be the size
of a single vector element before up-conversion.
Note that this instruction will always access 64 bytes of memory. The memory region
accessed will always lie between the boundaries linear_address & (~0x3F) and
(linear_address & (~0x3F)) + 63.
Note that the address reported by a page fault is the beginning of the 64-byte cache line
containing the memory operand.
This instruction is write-masked, so only those elements with the corresponding bit set in
vector mask register k1 are modified in zmm1. Elements in zmm1 with the corresponding
bit clear in vector mask register k1 retain their previous values. However, see above for
unusual aspects of the write-mask's operation with this instruction.

Operation
loadOffset = 0
upSize = UpConvLoadSizeOf_i32(SSS[2:0])
for (n = 0; n < 16; n++) {
    i = 32*n
    if (k1[n] != 0) {
        zmm1[i+31:i] = UpConvLoad_i32(mt + upSize*loadOffset)
        loadOffset++
        if (((mt + upSize*loadOffset) % 64) == 0) {
            break
        }
    }
}
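
The low-half loop can likewise be emulated in C, here including a zero-extending up-conversion so that the role of upSize is visible. This is a sketch under stated assumptions: only the unsigned source types (1, 2 or 4 bytes) are handled, the mem index stands in for the linear address, a little-endian host is assumed, and the name loadunpacklo_epi32 is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Emulates the vloadunpackld pseudocode for unsigned source elements.
 * up_size is the size of one source element before up-conversion. */
static void loadunpacklo_epi32(uint32_t zmm[16], uint16_t k1,
                               const uint8_t *mem, size_t mt, size_t up_size)
{
    size_t load_offset = 0;

    assert(up_size == 1 || up_size == 2 || up_size == 4);
    assert(mt % up_size == 0);              /* element-wise alignment */
    for (int n = 0; n < 16; n++) {
        if (k1 & (1u << n)) {
            uint32_t v = 0;
            /* Zero-extending copy of one source element. */
            memcpy(&v, mem + mt + up_size * load_offset, up_size);
            zmm[n] = v;
            load_offset++;
            /* Stop once the next element would start in the next line. */
            if ((mt + up_size * load_offset) % 64 == 0)
                break;
        }
    }
}
```

With a uint8 stream starting at address 60 and a full mask, only elements 0 to 3 are written, because the fifth byte already belongs to the next cache line.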

Flags Affected
None.

Memory Up-conversion: Ui32
S2 S1 S0   Function:           Usage                      disp8*N
000        no conversion       [rax] {16to16} or [rax]    4
001        reserved            N/A                        N/A
010        reserved            N/A                        N/A
011        reserved            N/A                        N/A
100        uint8 to uint32     [rax] {uint8}              1
101        sint8 to sint32     [rax] {sint8}              1
110        uint16 to uint32    [rax] {uint16}             2
111        sint16 to sint32    [rax] {sint16}             2

Intel® C/C++ Compiler Intrinsic Equivalent
__m512i _mm512_extloadunpacklo_epi32 (__m512i, void const*, _MM_UPCONV_EPI32_ENUM, int);
__m512i _mm512_mask_extloadunpacklo_epi32 (__m512i, __mmask16, void const*,
                                           _MM_UPCONV_EPI32_ENUM, int);
__m512i _mm512_loadunpacklo_epi32 (__m512i, void const*);
__m512i _mm512_mask_loadunpacklo_epi32 (__m512i, __mmask16, void const*);

Exceptions
Real-Address Mode and Virtual-8086
#UD              Instruction not available in these modes

Protected and Compatibility Mode
#UD              Instruction not available in these modes

64 bit Mode
#SS(0)           If a memory address referencing the SS segment is
                 in a non-canonical form.
#GP(0)           If the memory address is in a non-canonical form.
                 If a memory operand linear address is not aligned
                 to the element-wise data granularity dictated by the UpConv.
#PF(fault-code)  For a page fault.
#NM              If CR0.TS[bit 3]=1.
#UD              If processor model does not implement the specific instruction.
                 If preceded by any REX, F0, F2, F3, or 66 prefixes.
                 If the second operand is not a memory location.

VLOADUNPACKLPD - Load Unaligned Low And Unpack To Float64 Vector

Opcode: MVEX.512.0F38.W1 D1 /r
Instruction: vloadunpacklpd zmm1 {k1}, Uf64(mt)
Description: Load low 64-byte-aligned portion of unaligned float64 stream
Uf64(mt), unpack mask-enabled elements that fall in that portion, and store
those elements in float64 vector zmm1, under write-mask.

Description
The low-64-byte portion of the quadword stream starting at the element-aligned address
mt is loaded, converted and expanded into the write-mask-enabled elements of quadword
vector zmm1. The number of set bits in the write-mask determines the length of the
converted quadword stream, as each converted quadword is mapped to exactly one of the
quadword elements in zmm1, skipping over write-masked elements of zmm1.
This instruction only transfers those converted quadwords (if any) in the stream that
occur before the first 64-byte-aligned address following mt (that is, in the low cache line of
the memory stream in the current implementation). Elements in zmm1 that don't map to
those converted stream quadwords are left unchanged. The vloadunpackhpd instruction is
used to load the part of the stream at or after the first 64-byte-aligned address preceding
mt.
In conjunction with vloadunpackhpd, this instruction is useful for re-expanding data that
was packed into a queue. Also in conjunction with vloadunpackhpd, it allows unaligned
vector loads (that is, vector loads that are only element-wise, not vector-wise, aligned);
use a mask of 0xFF or no write-mask for this purpose. The typical instruction sequence
to perform an unaligned vector load would be:
// assume the memory location is pointed to by register rax
vloadunpacklpd zmm0 {k1}, [rax]
vloadunpackhpd zmm0 {k1}, [rax+64]
This instruction does not have broadcast support.
This instruction has special disp8*N and alignment rules. N is considered to be the size
of a single vector element before up-conversion.
Note that this instruction will always access 64 bytes of memory. The memory region
accessed will always lie between the boundaries linear_address & (~0x3F) and
(linear_address & (~0x3F)) + 63.
Note that the address reported by a page fault is the beginning of the 64-byte cache line
containing the memory operand.
This instruction is write-masked, so only those elements with the corresponding bit set in
vector mask register k1 are modified in zmm1. Elements in zmm1 with the corresponding
bit clear in vector mask register k1 retain their previous values. However, see above for
unusual aspects of the write-mask's operation with this instruction.

Operation
loadOffset = 0
upSize = UpConvLoadSizeOf_f64(SSS[2:0])
for (n = 0; n < 8; n++) {
    i = 64*n
    if (k1[n] != 0) {
        zmm1[i+63:i] = UpConvLoad_f64(mt + upSize*loadOffset)
        loadOffset++
        if (((mt + upSize*loadOffset) % 64) == 0) {
            break
        }
    }
}

SIMD Floating-Point Exceptions
None.

Memory Up-conversion: Uf64
S2 S1 S0   Function:        Usage                    disp8*N
000        no conversion    [rax] {8to8} or [rax]    8
001        reserved         N/A                      N/A
010        reserved         N/A                      N/A
011        reserved         N/A                      N/A
100        reserved         N/A                      N/A
101        reserved         N/A                      N/A
110        reserved         N/A                      N/A
111        reserved         N/A                      N/A

Intel® C/C++ Compiler Intrinsic Equivalent
__m512d _mm512_extloadunpacklo_pd (__m512d, void const*, _MM_UPCONV_PD_ENUM, int);
__m512d _mm512_mask_extloadunpacklo_pd (__m512d, __mmask8, void const*,
                                        _MM_UPCONV_PD_ENUM, int);
__m512d _mm512_loadunpacklo_pd (__m512d, void const*);
__m512d _mm512_mask_loadunpacklo_pd (__m512d, __mmask8, void const*);

Exceptions
Real-Address Mode and Virtual-8086
#UD              Instruction not available in these modes

Protected and Compatibility Mode
#UD              Instruction not available in these modes

64 bit Mode
#SS(0)           If a memory address referencing the SS segment is
                 in a non-canonical form.
#GP(0)           If the memory address is in a non-canonical form.
                 If a memory operand linear address is not aligned
                 to the element-wise data granularity dictated by the UpConv.
#PF(fault-code)  For a page fault.
#NM              If CR0.TS[bit 3]=1.
#UD              If processor model does not implement the specific instruction.
                 If preceded by any REX, F0, F2, F3, or 66 prefixes.
                 If the second operand is not a memory location.

VLOADUNPACKLPS - Load Unaligned Low And Unpack To Float32 Vector

Opcode: MVEX.512.0F38.W0 D1 /r
Instruction: vloadunpacklps zmm1 {k1}, Uf32(mt)
Description: Load low 64-byte-aligned portion of unaligned float32 stream
Uf32(mt), unpack mask-enabled elements that fall in that portion, and store
those elements in float32 vector zmm1, under write-mask.

Description
The low-64-byte portion of the byte/word/doubleword stream starting at the element-aligned
address mt is loaded, converted and expanded into the write-mask-enabled elements
of doubleword vector zmm1. The number of set bits in the write-mask determines
the length of the converted doubleword stream, as each converted doubleword is
mapped to exactly one of the doubleword elements in zmm1, skipping over write-masked
elements of zmm1.
This instruction only transfers those converted doublewords (if any) in the stream that
occur before the first 64-byte-aligned address following mt (that is, in the low cache line of
the memory stream in the current implementation). Elements in zmm1 that don't map to
those converted stream doublewords are left unchanged. The vloadunpackhps instruction
is used to load the part of the stream at or after the first 64-byte-aligned address preceding
mt.
In conjunction with vloadunpackhps, this instruction is useful for re-expanding data that
was packed into a queue. Also in conjunction with vloadunpackhps, it allows unaligned
vector loads (that is, vector loads that are only element-wise, not vector-wise, aligned);
use a mask of 0xFFFF or no write-mask for this purpose. The typical instruction sequence
to perform an unaligned vector load would be:
// assume the memory location is pointed to by register rax
vloadunpacklps zmm0 {k1}, [rax]
vloadunpackhps zmm0 {k1}, [rax+64]
This instruction does not have broadcast support.
This instruction has special disp8*N and alignment rules. N is considered to be the size
of a single vector element before up-conversion.
Note that this instruction will always access 64 bytes of memory. The memory region
accessed will always lie between the boundaries linear_address & (~0x3F) and
(linear_address & (~0x3F)) + 63.
Note that the address reported by a page fault is the beginning of the 64-byte cache line
containing the memory operand.
This instruction is write-masked, so only those elements with the corresponding bit set in
vector mask register k1 are modified in zmm1. Elements in zmm1 with the corresponding
bit clear in vector mask register k1 retain their previous values. However, see above for
unusual aspects of the write-mask's operation with this instruction.

Operation
loadOffset = 0
upSize = UpConvLoadSizeOf_f32(SSS[2:0])
for (n = 0; n < 16; n++) {
    i = 32*n
    if (k1[n] != 0) {
        zmm1[i+31:i] = UpConvLoad_f32(mt + upSize*loadOffset)
        loadOffset++
        if (((mt + upSize*loadOffset) % 64) == 0) {
            break
        }
    }
}

SIMD Floating-Point Exceptions
Invalid.

Memory Up-conversion: Uf32
S2 S1 S0   Function:             Usage                      disp8*N
000        no conversion         [rax] {16to16} or [rax]    4
001        reserved              N/A                        N/A
010        reserved              N/A                        N/A
011        float16 to float32    [rax] {float16}            2
100        uint8 to float32      [rax] {uint8}              1
101        sint8 to float32      [rax] {sint8}              1
110        uint16 to float32     [rax] {uint16}             2
111        sint16 to float32     [rax] {sint16}             2

Intel® C/C++ Compiler Intrinsic Equivalent
__m512 _mm512_extloadunpacklo_ps (__m512, void const*, _MM_UPCONV_PS_ENUM, int);
__m512 _mm512_mask_extloadunpacklo_ps (__m512, __mmask16, void const*,
                                       _MM_UPCONV_PS_ENUM, int);
__m512 _mm512_loadunpacklo_ps (__m512, void const*);
__m512 _mm512_mask_loadunpacklo_ps (__m512, __mmask16, void const*);
Reference Number: 327364-001

CHAPTER 6. INSTRUCTION DESCRIPTIONS

Exceptions
Real-Address Mode and Virtual-8086
#UD               Instruction not available in these modes

Protected and Compatibility Mode
#UD               Instruction not available in these modes

64 bit Mode
#SS(0)            If a memory address referencing the SS segment is in a non-canonical form.
#GP(0)            If the memory address is in a non-canonical form.
                  If a memory operand linear address is not aligned to element-wise
                  data granularity dictated by the UpConv.
#PF(fault-code)   For a page fault.
#NM               If CR0.TS[bit 3]=1.
#UD               If processor model does not implement the specific instruction.
                  If preceded by any REX, F0, F2, F3, or 66 prefixes.
                  If the second operand is not a memory location.

VLOADUNPACKLQ - Load Unaligned Low And Unpack To Int64 Vector

Opcode                   Instruction                         Description
MVEX.512.0F38.W1 D0 /r   vloadunpacklq zmm1 {k1}, Ui64(mt)   Load low 64-byte-aligned portion of unaligned
                                                             int64 stream Ui64(mt), unpack mask-enabled
                                                             elements that fall in that portion, and store
                                                             those elements in int64 vector zmm1, under
                                                             write-mask.

Description
The low-64-byte portion of the quadword stream starting at the element-aligned address
mt is loaded, converted and expanded into the write-mask-enabled elements of quadword
vector zmm1. The number of set bits in the write-mask determines the length of the
converted quadword stream, as each converted quadword is mapped to exactly one of the
quadword elements in zmm1, skipping over write-masked elements of zmm1.
This instruction only transfers those converted quadwords (if any) in the stream that
occur before the first 64-byte-aligned address following mt (that is, in the low cache line of
the memory stream in the current implementation). Elements in zmm1 that don't map to
those converted stream quadwords are left unchanged. The vloadunpackhq instruction is
used to load the part of the stream at or after the first 64-byte-aligned address following
mt.
In conjunction with vloadunpackhq, this instruction is useful for re-expanding data that
was packed into a queue. Also in conjunction with vloadunpackhq, it allows unaligned
vector loads (that is, vector loads that are only element-wise, not vector-wise, aligned);
use a mask of 0xFF or no write-mask for this purpose. The typical instruction sequence
to perform an unaligned vector load would be:
// assume memory location is pointed by register rax
vloadunpacklq v0 {k1}, [rax]
vloadunpackhq v0 {k1}, [rax+64]
This instruction does not have broadcast support.
This instruction has special disp8*N and alignment rules. N is considered to be the size
of a single vector element before up-conversion.
Note that this instruction always accesses 64 bytes of memory. The memory region
accessed always lies between linear_address & (~0x3F) and (linear_address & (~0x3F))
+ 63, inclusive.
Note that the address reported by a page fault is the beginning of the 64-byte cache line
containing the memory operand.
This instruction is write-masked, so only those elements with the corresponding bit set in
vector mask register k1 are modified in zmm1. Elements in zmm1 with the corresponding
bit clear in vector mask register k1 retain their previous values. However, see above for
unusual aspects of the write-mask's operation with this instruction.

Operation
loadOffset = 0
upSize = UpConvLoadSizeOf_i64(SSS[2:0])
for (n = 0; n < 8; n++) {
    i = 64*n
    if (k1[n] != 0) {
        zmm1[i+63:i] = UpConvLoad_i64(mt + upSize*loadOffset)
        loadOffset++
        if (((mt + upSize*loadOffset) % 64) == 0) {
            break
        }
    }
}
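
The Operation pseudocode above can be exercised with a small scalar model. The C sketch below is illustrative only and is not Intel's implementation. It assumes no up-conversion (8-byte elements) and treats a byte offset into a buffer as the linear address mt, so the start of the buffer plays the role of a 64-byte-aligned address. All function names are hypothetical. The routine for vloadunpackhq, which is specified in its own section rather than here, is an assumption written to complement the low half.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: write `count` quadwords 1000, 1001, ... into mem. */
static void fill_stream(uint8_t *mem, size_t count)
{
    for (size_t j = 0; j < count; j++) {
        uint64_t v = 1000u + j;
        memcpy(mem + 8 * j, &v, sizeof v);
    }
}

/* Scalar model of vloadunpacklq with no up-conversion (upSize = 8).
 * mt is a byte offset into mem, standing in for the linear address. */
static void model_loadunpacklq(uint64_t zmm1[8], unsigned k1,
                               const uint8_t *mem, size_t mt)
{
    size_t loadOffset = 0;
    for (int n = 0; n < 8; n++) {
        if (k1 & (1u << n)) {
            memcpy(&zmm1[n], mem + mt + 8 * loadOffset, 8);
            loadOffset++;
            if ((mt + 8 * loadOffset) % 64 == 0)
                break;              /* reached the end of the low cache line */
        }
    }
}

/* Assumed model of vloadunpackhq: skips the stream elements that lie in the
 * low cache line and loads the rest. mt_hi is typically mt + 64 (>= 64). */
static void model_loadunpackhq(uint64_t zmm1[8], unsigned k1,
                               const uint8_t *mem, size_t mt_hi)
{
    size_t pointer = mt_hi - 64;
    size_t loadOffset = 0;
    int crossed = 0;
    for (int n = 0; n < 8; n++) {
        if (k1 & (1u << n)) {
            if (!crossed) {
                if ((pointer + 8 * (loadOffset + 1)) % 64 == 0)
                    crossed = 1;    /* this element was the last one in the low line */
            } else {
                memcpy(&zmm1[n], mem + pointer + 8 * loadOffset, 8);
            }
            loadOffset++;
        }
    }
}

/* Unaligned 8-quadword load: the low/high pair with a full mask (0xFF). */
static void model_unaligned_load_q(uint64_t zmm1[8], const uint8_t *mem, size_t mt)
{
    model_loadunpacklq(zmm1, 0xFF, mem, mt);
    model_loadunpackhq(zmm1, 0xFF, mem, mt + 64);
}
```

With mt = 24 the low half fills elements 0 to 4 from the first cache line and the high half fills elements 5 to 7 from the next one. With a sparse mask the same low-half routine acts as an expand: consecutive stream quadwords land only in the mask-enabled elements.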

Flags Affected
None.

Memory Up-conversion: Ui64
S2 S1 S0   Function:        Usage                   disp8*N
000        no conversion    [rax] {8to8} or [rax]   8
001        reserved         N/A                     N/A
010        reserved         N/A                     N/A
011        reserved         N/A                     N/A
100        reserved         N/A                     N/A
101        reserved         N/A                     N/A
110        reserved         N/A                     N/A
111        reserved         N/A                     N/A

Intel® C/C++ Compiler Intrinsic Equivalent
__m512i _mm512_extloadunpacklo_epi64 (__m512i, void const*, _MM_UPCONV_EPI64_ENUM, int);
__m512i _mm512_mask_extloadunpacklo_epi64 (__m512i, __mmask8, void const*, _MM_UPCONV_EPI64_ENUM, int);
__m512i _mm512_loadunpacklo_epi64 (__m512i, void const*);
__m512i _mm512_mask_loadunpacklo_epi64 (__m512i, __mmask8, void const*);

Exceptions
Real-Address Mode and Virtual-8086
#UD               Instruction not available in these modes

Protected and Compatibility Mode
#UD               Instruction not available in these modes

64 bit Mode
#SS(0)            If a memory address referencing the SS segment is in a non-canonical form.
#GP(0)            If the memory address is in a non-canonical form.
                  If a memory operand linear address is not aligned to element-wise
                  data granularity dictated by the UpConv.
#PF(fault-code)   For a page fault.
#NM               If CR0.TS[bit 3]=1.
#UD               If processor model does not implement the specific instruction.
                  If preceded by any REX, F0, F2, F3, or 66 prefixes.
                  If the second operand is not a memory location.

VLOG2PS - Vector Logarithm Base-2 of Float32 Vector

Opcode                      Instruction                  Description
MVEX.512.66.0F38.W0 C9 /r   vlog2ps zmm1 {k1}, zmm2/mt   Calculate logarithm from float32 vector
                                                         zmm2/mt and store the result in zmm1, under
                                                         write-mask.

Description
Computes the element-by-element logarithm base-2 of the float32 vector on memory or
float32 vector zmm2. The result is written into float32 vector zmm1, with the following
accuracy:
1. 4ulp of relative error when the source value is within the intervals (0, 0.5) or (2, ∞]
2. absolute error less than 2^-21 within the interval [0.5, 2]
For an input value of +/-0 the instruction returns −∞ and sets the Divide-By-Zero
flag (#Z). Negative numbers (including −∞) should return the canonical NaN and set the
Invalid flag (#I). Note however that this instruction treats input denormals as zeros of
the same sign, so for denormal negative inputs it returns −∞ and sets the Divide-By-Zero
status flag. If any source element is NaN, the quietized NaN source value is returned for
that element (and #I is raised for input sNaNs).
The current implementation of this instruction does not support any SwizzUpConv setting
other than "no broadcast and no conversion"; any other SwizzUpConv setting will result
in an Invalid Opcode exception.
The vlog2_DX() function follows Table 6.25 when dealing with floating-point special numbers.
Input    Result        Comments
NaN      input qNaN    Raise #I flag if sNaN
+∞       +∞
+0       −∞            Raise #Z flag
−0       −∞            Raise #Z flag
<0       NaN           Raise #I flag
−∞       NaN           Raise #I flag
2^n      n             Exact integral result

Table 6.25: vlog2_DX() special floating-point values behavior
This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.
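
The special-value rules and the write-mask behavior described above can be captured in a scalar reference model. The following C sketch is only an illustration under stated assumptions: it assumes `float` is IEEE-754 binary32, uses hypothetical names (`model_vlog2_dx`, `FLAG_I`, `FLAG_Z`), assumes the bit pattern 0xFFC00000 for the canonical NaN, and computes finite results with a shift-and-square binary logarithm chosen to keep the example free of library dependencies. It does not reproduce the hardware algorithm or its stated error bounds.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { FLAG_I = 1, FLAG_Z = 2 };   /* hypothetical stand-ins for #I and #Z */

static uint32_t f2u(float f) { uint32_t u; memcpy(&u, &f, sizeof u); return u; }
static float u2f(uint32_t u) { float f; memcpy(&f, &u, sizeof f); return f; }

/* Scalar model of the special-value table; flags are OR-ed into *flags. */
static float model_vlog2_dx(float x, unsigned *flags)
{
    uint32_t u = f2u(x);
    uint32_t exp = (u >> 23) & 0xFFu;
    uint32_t man = u & 0x7FFFFFu;
    uint32_t sign = u >> 31;

    if (exp == 0xFFu && man != 0) {            /* NaN input */
        if (!(man & 0x400000u))
            *flags |= FLAG_I;                  /* sNaN: quiet bit is clear */
        return u2f(u | 0x400000u);             /* quietized input NaN */
    }
    if (exp == 0) {                            /* +/-0, or a denormal treated as zero */
        *flags |= FLAG_Z;
        return u2f(0xFF800000u);               /* -infinity */
    }
    if (sign) {                                /* negative normal numbers and -infinity */
        *flags |= FLAG_I;
        return u2f(0xFFC00000u);               /* assumed canonical NaN pattern */
    }
    if (exp == 0xFFu)
        return x;                              /* +infinity */

    /* Positive normal input: log2(x) = (exp - 127) + log2(m), with m in [1, 2). */
    float m = u2f((127u << 23) | man);
    float frac = 0.0f, bit = 0.5f;
    for (int i = 0; i < 24; i++) {
        m *= m;
        if (m >= 2.0f) { frac += bit; m *= 0.5f; }
        bit *= 0.5f;
    }
    return (float)((int)exp - 127) + frac;
}

/* Write-masked vector form: only mask-enabled elements are computed and stored. */
static void model_vlog2ps(float zmm1[16], uint16_t k1, const float src[16],
                          unsigned *flags)
{
    for (int n = 0; n < 16; n++)
        if (k1 & (1u << n))
            zmm1[n] = model_vlog2_dx(src[n], flags);
}
```

For an exact power of two the mantissa term is 1.0, the loop contributes nothing, and the model returns the integer exponent, matching the last row of the table.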


Operation
tmpSrc2[511:0] = zmm2/mt
if (source is a register operand and MVEX.EH bit is 1) {
    if (SSS[2] == 1) Suppress_Exception_Flags()   // SAE
}
for (n = 0; n < 16; n++) {
    if (k1[n] != 0) {
        i = 32*n
        zmm1[i+31:i] = vlog2_DX(tmpSrc2[i+31:i])
    }
}

SIMD Floating-Point Exceptions
Invalid, Zero.

Denormal Handling
Treat Input Denormals As Zeros : YES
Flush Tiny Results To Zero : YES

Register Swizzle
MVEX.EH=0
S2 S1 S0   Function: 4 x 32 bits   Usage
000        no swizzle              zmm0 or zmm0 {dcba}
001        reserved                N/A
010        reserved                N/A
011        reserved                N/A
100        reserved                N/A
101        reserved                N/A
110        reserved                N/A
111        reserved                N/A

MVEX.EH=1
S2 S1 S0   Rounding Mode Override           Usage
1xx        SAE (Suppress-All-Exceptions)    , {sae}

Intel® C/C++ Compiler Intrinsic Equivalent
__m512 _mm512_log2_ps (__m512);
__m512 _mm512_mask_log2_ps (__m512, __mmask16, __m512);

Exceptions
Real-Address Mode and Virtual-8086
#UD               Instruction not available in these modes

Protected and Compatibility Mode
#UD               Instruction not available in these modes

64 bit Mode
#SS(0)            If a memory address referencing the SS segment is in a non-canonical form.
#GP(0)            If the memory address is in a non-canonical form.
#PF(fault-code)   For a page fault.
#NM               If CR0.TS[bit 3]=1.
                  If preceded by any REX, F0, F2, F3, or 66 prefixes.
                  This instruction does not support any SwizzUpConv different from the
                  default value (no broadcast, no conversion). If SwizzUpConv function is
                  set to any value different than "no action", then an Invalid Opcode
                  fault is raised. This includes register swizzles.

VMOVAPD - Move Aligned Float64 Vector

Opcode                    Instruction                     Description
MVEX.512.66.0F.W1 28 /r   vmovapd zmm1 {k1}, Uf64(mt)     Move float64 vector Uf64(mt) into vector
                                                          zmm1, under write-mask.
MVEX.512.66.0F.W1 28 /r   vmovapd zmm1 {k1}, Sf64(zmm2)   Move float64 vector Sf64(zmm2) into vector
                                                          zmm1, under write-mask.
MVEX.512.66.0F.W1 29 /r   vmovapd mt {k1}, Df64(zmm1)     Move float64 vector Df64(zmm1) into mt,
                                                          under write-mask.

Description
Moves the float64 vector result of the swizzle/broadcast/conversion process on memory or
float64 vector zmm2 into float64 vector zmm1, or down-converts and stores float64 vector
zmm1 into destination memory.
This instruction is write-masked, so only those elements with the corresponding bit(s) set
in the vector mask (k1) register are computed and stored into register/memory. Elements
in register/memory with the corresponding bit(s) clear in the vector mask register are
maintained with the previous value.

Operation
DESTINATION IS A VECTOR OPERAND
if (source is a register operand) {
    if (MVEX.EH == 1) {
        tmpSrc2[511:0] = zmm2[511:0]
    } else {
        tmpSrc2[511:0] = SwizzUpConvLoad_f64(zmm2)
    }
} else {
    tmpSrc2[511:0] = UpConvLoad_f64(mt)
}
for (n = 0; n < 8; n++) {
    if (k1[n] != 0) {
        i = 64*n
        zmm1[i+63:i] = tmpSrc2[i+63:i]
    }
}
DESTINATION IS A MEMORY OPERAND
downSize = DownConvStoreSizeOf_f64(SSS[2:0])
for (n = 0; n < 8; n++) {
    if (k1[n] != 0) {
        i = 64*n
        tmp = DownConvStore_f64(zmm1[i+63:i], SSS[2:0])
        if (downSize == 8) {
            MemStore(mt + 8*n) = tmp[63:0]
        }
    }
}
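
As a rough illustration of the write-mask semantics in the Operation above, the C sketch below models the register form with no swizzle and the store form with no down-conversion. It is a scalar model with hypothetical function names, not the instruction's implementation, and it leaves out the swizzle and alignment checks entirely.

```c
#include <assert.h>
#include <stdint.h>

/* Register form, no swizzle: copy only the mask-enabled float64 elements.
 * Elements whose mask bit is clear keep their previous value in zmm1. */
static void model_vmovapd_reg(double zmm1[8], uint8_t k1, const double src[8])
{
    for (int n = 0; n < 8; n++)
        if (k1 & (1u << n))
            zmm1[n] = src[n];
}

/* Store form, no down-conversion (downSize == 8): element n is written to
 * mt + 8*n; memory under a clear mask bit is left untouched. */
static void model_vmovapd_store(double *mt, uint8_t k1, const double zmm1[8])
{
    for (int n = 0; n < 8; n++)
        if (k1 & (1u << n))
            mt[n] = zmm1[n];
}
```

The merge behavior is the point of the sketch: the destination is an input as well as an output, which is why the masked intrinsic listed below takes the previous destination value as its first argument.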

SIMD Floating-Point Exceptions
DESTINATION IS A VECTOR OPERAND: None.
DESTINATION IS A MEMORY OPERAND: None.

Memory Up-conversion: Uf64
S2 S1 S0   Function:        Usage                   disp8*N
000        no conversion    [rax] {8to8} or [rax]   64
001        reserved         N/A                     N/A
010        reserved         N/A                     N/A
011        reserved         N/A                     N/A
100        reserved         N/A                     N/A
101        reserved         N/A                     N/A
110        reserved         N/A                     N/A
111        reserved         N/A                     N/A

Register Swizzle: Sf64
MVEX.EH=0
S2 S1 S0   Function: 4 x 64 bits    Usage
000        no swizzle               zmm0 or zmm0 {dcba}
001        swap (inner) pairs       zmm0 {cdab}
010        swap with two-away       zmm0 {badc}
011        cross-product swizzle    zmm0 {dacb}
100        broadcast a element      zmm0 {aaaa}
101        broadcast b element      zmm0 {bbbb}
110        broadcast c element      zmm0 {cccc}
111        broadcast d element      zmm0 {dddd}

Memory Down-conversion: Df64
S2 S1 S0   Function:        Usage   disp8*N
000        no conversion    zmm1    64
001        reserved         N/A     N/A
010        reserved         N/A     N/A
011        reserved         N/A     N/A
100        reserved         N/A     N/A
101        reserved         N/A     N/A
110        reserved         N/A     N/A
111        reserved         N/A     N/A

Intel® C/C++ Compiler Intrinsic Equivalent
__m512d _mm512_mask_mov_pd (__m512d, __mmask8, __m512d);

Exceptions
Real-Address Mode and Virtual-8086
#UD               Instruction not available in these modes

Protected and Compatibility Mode
#UD               Instruction not available in these modes

64 bit Mode
#SS(0)            If a memory address referencing the SS segment is in a non-canonical form.
#GP(0)            If the memory address is in a non-canonical form.
                  If a memory operand linear address is not aligned to the data size
                  granularity dictated by SwizzUpConv mode.
#PF(fault-code)   For a page fault.
#NM               If CR0.TS[bit 3]=1.
                  If preceded by any REX, F0, F2, F3, or 66 prefixes.

VMOVAPS - Move Aligned Float32 Vector

Opcode                 Instruction                     Description
MVEX.512.0F.W0 28 /r   vmovaps zmm1 {k1}, Uf32(mt)     Move float32 vector Uf32(mt) into vector
                                                       zmm1, under write-mask.
MVEX.512.0F.W0 28 /r   vmovaps zmm1 {k1}, Sf32(zmm2)   Move float32 vector Sf32(zmm2) into vector
                                                       zmm1, under write-mask.
MVEX.512.0F.W0 29 /r   vmovaps mt {k1}, Df32(zmm1)     Move float32 vector Df32(zmm1) into mt,
                                                       under write-mask.

Description
Moves the float32 vector result of the swizzle/broadcast/conversion process on memory or
float32 vector zmm2 into float32 vector zmm1, or down-converts and stores float32 vector
zmm1 into destination memory.
This instruction is write-masked, so only those elements with the corresponding bit(s) set
in the vector mask (k1) register are computed and stored into register/memory. Elements
in register/memory with the corresponding bit(s) clear in the vector mask register are
maintained with the previous value.

Operation
DESTINATION IS A VECTOR OPERAND
if (source is a register operand) {
    if (MVEX.EH == 1) {
        tmpSrc2[511:0] = zmm2[511:0]
    } else {
        tmpSrc2[511:0] = SwizzUpConvLoad_f32(zmm2)
    }
} else {
    tmpSrc2[511:0] = UpConvLoad_f32(mt)
}
for (n = 0; n < 16; n++) {
    if (k1[n] != 0) {
        i = 32*n
        zmm1[i+31:i] = tmpSrc2[i+31:i]
    }
}
DESTINATION IS A MEMORY OPERAND
downSize = DownConvStoreSizeOf_f32(SSS[2:0])
for (n = 0; n < 16; n++) {
    if (k1[n] != 0) {
        i = 32*n
        tmp = DownConvStore_f32(zmm1[i+31:i], SSS[2:0])
        if (downSize == 4) {
            MemStore(mt + 4*n) = tmp[31:0]
        } else if (downSize == 2) {
            MemStore(mt + 2*n) = tmp[15:0]
        } else if (downSize == 1) {
            MemStore(mt + n) = tmp[7:0]
        }
    }
}

SIMD Floating-Point Exceptions
DESTINATION IS A VECTOR OPERAND: Invalid.
DESTINATION IS A MEMORY OPERAND: Overflow, Underflow, Invalid, Precision, Denormal.

Memory Up-conversion: Uf32
S2 S1 S0   Function:             Usage                     disp8*N
000        no conversion         [rax] {16to16} or [rax]   64
001        reserved              N/A                       N/A
010        reserved              N/A                       N/A
011        float16 to float32    [rax] {float16}           32
100        uint8 to float32      [rax] {uint8}             16
101        sint8 to float32      [rax] {sint8}             16
110        uint16 to float32     [rax] {uint16}            32
111        sint16 to float32     [rax] {sint16}            32

Register Swizzle: Sf32
MVEX.EH=0
S2 S1 S0   Function: 4 x 32 bits    Usage
000        no swizzle               zmm0 or zmm0 {dcba}
001        swap (inner) pairs       zmm0 {cdab}
010        swap with two-away       zmm0 {badc}
011        cross-product swizzle    zmm0 {dacb}
100        broadcast a element      zmm0 {aaaa}
101        broadcast b element      zmm0 {bbbb}
110        broadcast c element      zmm0 {cccc}
111        broadcast d element      zmm0 {dddd}

Memory Down-conversion: Df32
S2 S1 S0   Function:             Usage            disp8*N
000        no conversion         zmm1             64
001        reserved              N/A              N/A
010        reserved              N/A              N/A
011        float32 to float16    zmm1 {float16}   32
100        float32 to uint8      zmm1 {uint8}     16
101        float32 to sint8      zmm1 {sint8}     16
110        float32 to uint16     zmm1 {uint16}    32
111        float32 to sint16     zmm1 {sint16}    32

Intel® C/C++ Compiler Intrinsic Equivalent
__m512 _mm512_mask_mov_ps (__m512, __mmask16, __m512);

Exceptions
Real-Address Mode and Virtual-8086
#UD               Instruction not available in these modes

Protected and Compatibility Mode
#UD               Instruction not available in these modes

64 bit Mode
#SS(0)            If a memory address referencing the SS segment is in a non-canonical form.
#GP(0)            If the memory address is in a non-canonical form.
                  If a memory operand linear address is not aligned to the data size
                  granularity dictated by SwizzUpConv mode.
#PF(fault-code)   For a page fault.
#NM               If CR0.TS[bit 3]=1.
                  If preceded by any REX, F0, F2, F3, or 66 prefixes.

VMOVDQA32 - Move Aligned Int32 Vector

Opcode                    Instruction                       Description
MVEX.512.66.0F.W0 6F /r   vmovdqa32 zmm1 {k1}, Ui32(mt)     Move int32 vector Ui32(mt) into vector zmm1,
                                                            under write-mask.
MVEX.512.66.0F.W0 6F /r   vmovdqa32 zmm1 {k1}, Si32(zmm2)   Move int32 vector Si32(zmm2) into vector
                                                            zmm1, under write-mask.
MVEX.512.66.0F.W0 7F /r   vmovdqa32 mt {k1}, Di32(zmm1)     Move int32 vector Di32(zmm1) into mt, under
                                                            write-mask.

Description
Moves the int32 vector result of the swizzle/broadcast/conversion process on memory or
int32 vector zmm2 into int32 vector zmm1, or down-converts and stores int32 vector
zmm1 into destination memory.
This instruction is write-masked, so only those elements with the corresponding bit(s) set
in the vector mask (k1) register are computed and stored into register/memory. Elements
in register/memory with the corresponding bit(s) clear in the vector mask register are
maintained with the previous value.

Operation
DESTINATION IS A VECTOR OPERAND
if (source is a register operand) {
    if (MVEX.EH == 1) {
        tmpSrc2[511:0] = zmm2[511:0]
    } else {
        tmpSrc2[511:0] = SwizzUpConvLoad_i32(zmm2)
    }
} else {
    tmpSrc2[511:0] = UpConvLoad_i32(mt)
}
for (n = 0; n < 16; n++) {
    if (k1[n] != 0) {
        i = 32*n
        zmm1[i+31:i] = tmpSrc2[i+31:i]
    }
}
DESTINATION IS A MEMORY OPERAND
downSize = DownConvStoreSizeOf_i32(SSS[2:0])
for (n = 0; n < 16; n++) {
    if (k1[n] != 0) {
        i = 32*n
        tmp = DownConvStore_i32(zmm1[i+31:i], SSS[2:0])
        if (downSize == 4) {
            MemStore(mt + 4*n) = tmp[31:0]
        } else if (downSize == 2) {
            MemStore(mt + 2*n) = tmp[15:0]
        } else if (downSize == 1) {
            MemStore(mt + n) = tmp[7:0]
        }
    }
}

Flags Affected
DESTINATION IS A VECTOR OPERAND: None.
DESTINATION IS A MEMORY OPERAND: None.

Memory Up-conversion: Ui32
S2 S1 S0   Function:           Usage                     disp8*N
000        no conversion       [rax] {16to16} or [rax]   64
001        reserved            N/A                       N/A
010        reserved            N/A                       N/A
011        reserved            N/A                       N/A
100        uint8 to uint32     [rax] {uint8}             16
101        sint8 to sint32     [rax] {sint8}             16
110        uint16 to uint32    [rax] {uint16}            32
111        sint16 to sint32    [rax] {sint16}            32
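
The disp8*N column gives the factor N by which an encoded 8-bit displacement is scaled. As a small worked example, the C sketch below tabulates N for the load form of vmovdqa32 from the table above and computes the resulting byte displacement. The function names are hypothetical, and returning 0 for a reserved encoding is a convention of this sketch only.

```c
#include <assert.h>
#include <stdint.h>

/* N for the vmovdqa32 load form, indexed by SSS[2:0].
 * 0 marks a reserved encoding (a convention of this sketch). */
static int vmovdqa32_load_disp8_n(unsigned sss)
{
    static const int n_by_sss[8] = { 64, 0, 0, 0, 16, 16, 32, 32 };
    return n_by_sss[sss & 7u];
}

/* Byte displacement produced by a signed disp8 under the given up-conversion. */
static int vmovdqa32_load_disp(int8_t disp8, unsigned sss)
{
    return (int)disp8 * vmovdqa32_load_disp8_n(sss);
}
```

For example, a disp8 of 1 with no conversion addresses the next 64-byte vector, while the same disp8 with {uint8} advances by 16 bytes. In each row N matches the number of bytes that the 16 source elements occupy in memory (16 x 4, 16 x 1, or 16 x 2).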

Register Swizzle: Si32
MVEX.EH=0
S2 S1 S0   Function: 4 x 32 bits    Usage
000        no swizzle               zmm0 or zmm0 {dcba}
001        swap (inner) pairs       zmm0 {cdab}
010        swap with two-away       zmm0 {badc}
011        cross-product swizzle    zmm0 {dacb}
100        broadcast a element      zmm0 {aaaa}
101        broadcast b element      zmm0 {bbbb}
110        broadcast c element      zmm0 {cccc}
111        broadcast d element      zmm0 {dddd}

Memory Down-conversion: Di32
S2 S1 S0   Function:           Usage           disp8*N
000        no conversion       zmm1            64
001        reserved            N/A             N/A
010        reserved            N/A             N/A
011        reserved            N/A             N/A
100        uint32 to uint8     zmm1 {uint8}    16
101        sint32 to sint8     zmm1 {sint8}    16
110        uint32 to uint16    zmm1 {uint16}   32
111        sint32 to sint16    zmm1 {sint16}   32

Intel® C/C++ Compiler Intrinsic Equivalent
__m512i _mm512_mask_mov_epi32 (__m512i, __mmask16, __m512i);

Exceptions
Real-Address Mode and Virtual-8086
#UD               Instruction not available in these modes

Protected and Compatibility Mode
#UD               Instruction not available in these modes

64 bit Mode
#SS(0)            If a memory address referencing the SS segment is in a non-canonical form.
#GP(0)            If the memory address is in a non-canonical form.
                  If a memory operand linear address is not aligned to the data size
                  granularity dictated by SwizzUpConv mode.
#PF(fault-code)   For a page fault.
#NM               If CR0.TS[bit 3]=1.
                  If preceded by any REX, F0, F2, F3, or 66 prefixes.

VMOVDQA64 - Move Aligned Int64 Vector

Opcode                    Instruction                       Description
MVEX.512.66.0F.W1 6F /r   vmovdqa64 zmm1 {k1}, Ui64(mt)     Move int64 vector Ui64(mt) into vector zmm1,
                                                            under write-mask.
MVEX.512.66.0F.W1 6F /r   vmovdqa64 zmm1 {k1}, Si64(zmm2)   Move int64 vector Si64(zmm2) into vector
                                                            zmm1, under write-mask.
MVEX.512.66.0F.W1 7F /r   vmovdqa64 mt {k1}, Di64(zmm1)     Move int64 vector Di64(zmm1) into mt, under
                                                            write-mask.

Description
Moves the int64 vector result of the swizzle/broadcast/conversion process on memory or
int64 vector zmm2 into int64 vector zmm1, or down-converts and stores int64 vector
zmm1 into destination memory.
This instruction is write-masked, so only those elements with the corresponding bit(s) set
in the vector mask (k1) register are computed and stored into register/memory. Elements
in register/memory with the corresponding bit(s) clear in the vector mask register are
maintained with the previous value.

Operation
DESTINATION IS A VECTOR OPERAND
if (source is a register operand) {
    if (MVEX.EH == 1) {
        tmpSrc2[511:0] = zmm2[511:0]
    } else {
        tmpSrc2[511:0] = SwizzUpConvLoad_i64(zmm2)
    }
} else {
    tmpSrc2[511:0] = UpConvLoad_i64(mt)
}
for (n = 0; n < 8; n++) {
    if (k1[n] != 0) {
        i = 64*n
        zmm1[i+63:i] = tmpSrc2[i+63:i]
    }
}
DESTINATION IS A MEMORY OPERAND
downSize = DownConvStoreSizeOf_i64(SSS[2:0])
for (n = 0; n < 8; n++) {
    if (k1[n] != 0) {
        i = 64*n
        tmp = DownConvStore_i64(zmm1[i+63:i], SSS[2:0])
        if (downSize == 8) {
            MemStore(mt + 8*n) = tmp[63:0]
        }
    }
}

Flags Affected
DESTINATION IS A VECTOR OPERAND: None.
DESTINATION IS A MEMORY OPERAND: None.

Memory Up-conversion: Ui64
S2 S1 S0   Function:        Usage                   disp8*N
000        no conversion    [rax] {8to8} or [rax]   64
001        reserved         N/A                     N/A
010        reserved         N/A                     N/A
011        reserved         N/A                     N/A
100        reserved         N/A                     N/A
101        reserved         N/A                     N/A
110        reserved         N/A                     N/A
111        reserved         N/A                     N/A

Register Swizzle: Si64
MVEX.EH=0
S2 S1 S0   Function: 4 x 64 bits    Usage
000        no swizzle               zmm0 or zmm0 {dcba}
001        swap (inner) pairs       zmm0 {cdab}
010        swap with two-away       zmm0 {badc}
011        cross-product swizzle    zmm0 {dacb}
100        broadcast a element      zmm0 {aaaa}
101        broadcast b element      zmm0 {bbbb}
110        broadcast c element      zmm0 {cccc}
111        broadcast d element      zmm0 {dddd}

Memory Down-conversion: Di64
S2 S1 S0   Function:        Usage   disp8*N
000        no conversion    zmm1    64
001        reserved         N/A     N/A
010        reserved         N/A     N/A
011        reserved         N/A     N/A
100        reserved         N/A     N/A
101        reserved         N/A     N/A
110        reserved         N/A     N/A
111        reserved         N/A     N/A

Intel® C/C++ Compiler Intrinsic Equivalent
__m512i _mm512_mask_mov_epi64 (__m512i, __mmask8, __m512i);

Exceptions
Real-Address Mode and Virtual-8086
#UD               Instruction not available in these modes

Protected and Compatibility Mode
#UD               Instruction not available in these modes

64 bit Mode
#SS(0)            If a memory address referencing the SS segment is in a non-canonical form.
#GP(0)            If the memory address is in a non-canonical form.
                  If a memory operand linear address is not aligned to the data size
                  granularity dictated by SwizzUpConv mode.
#PF(fault-code)   For a page fault.
#NM               If CR0.TS[bit 3]=1.
                  If preceded by any REX, F0, F2, F3, or 66 prefixes.

VMOVNRAPD - Store Aligned Float64 Vector With No-Read Hint

Opcode                        Instruction                    Description
MVEX.512.F3.0F.W1.EH0 29 /r   vmovnrapd m {k1}, Df64(zmm1)   Store with No-Read hint float64 vector
                                                             Df64(zmm1) into m, under write-mask.

Description
Stores float64 vector zmm1 (or a down-converted version of it) into destination memory
with a No-Read hint for the case the whole vector is going to be written into memory. This
instruction is intended to speed up the case of stores in streaming kernels where we want
to avoid wasting memory bandwidth by being forced to read the original content of entire
cache lines from memory when we overwrite their whole contents completely.
In the Intel® Xeon Phi™ coprocessor, this instruction is able to optimize memory bandwidth
in case of a cache miss and avoid reading the original contents of the memory destination
operand if the following conditions hold true:
• The instruction does not use a write-mask (MVEX.aaa=000).
• The instruction does not perform any kind of down-conversion (MVEX.SSS=000).
Note that this instruction is encoded by forcing MVEX.EH bit to 0. The Eviction Hint does
not have any effect on this instruction.
The No-Read directive is intended as a performance hint and could be ignored by a given
processor implementation.
This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are stored to memory. Elements in the destination memory
vector with the corresponding bit clear in k1 register retain their previous value.
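
The two conditions listed above can be written as a simple predicate over the MVEX fields. The following C sketch is only an illustration: the function name is hypothetical, and a true result means the encoding is eligible for the optimization, not that a given processor will apply it, since the text notes that the hint may be ignored.

```c
#include <assert.h>
#include <stdbool.h>

/* True when the encoding meets both listed conditions for skipping the read
 * of the destination cache line on a miss: no write-mask (MVEX.aaa == 000)
 * and no down-conversion (MVEX.SSS == 000). */
static bool no_read_hint_may_apply(unsigned mvex_aaa, unsigned mvex_sss)
{
    return (mvex_aaa & 7u) == 0 && (mvex_sss & 7u) == 0;
}
```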

Operation
DESTINATION IS A MEMORY OPERAND
downSize = DownConvStoreSizeOf_f64(SSS[2:0])
for (n = 0; n < 8; n++) {
    if (k1[n] != 0) {
        i = 64*n
        tmp = DownConvStore_f64(zmm1[i+63:i], SSS[2:0])
        if (downSize == 8) {
            MemStore(mt + 8*n) = tmp[63:0]
        }
    }
}

SIMD Floating-Point Exceptions
None.

Memory Down-conversion: Df64
S2 S1 S0   Function:        Usage   disp8*N
000        no conversion    zmm1    64
001        reserved         N/A     N/A
010        reserved         N/A     N/A
011        reserved         N/A     N/A
100        reserved         N/A     N/A
101        reserved         N/A     N/A
110        reserved         N/A     N/A
111        reserved         N/A     N/A

Intel® C/C++ Compiler Intrinsic Equivalent
void _mm512_storenr_pd(void*, __m512d);

Exceptions
Real-Address Mode and Virtual-8086
#UD               Instruction not available in these modes

Protected and Compatibility Mode
#UD               Instruction not available in these modes

64 bit Mode
#SS(0)            If a memory address referencing the SS segment is in a non-canonical form.
#GP(0)            If the memory address is in a non-canonical form.
                  If a memory operand linear address is not aligned to the data size
                  granularity dictated by SwizzUpConv mode.
#PF(fault-code)   For a page fault.
#NM               If CR0.TS[bit 3]=1.
                  If preceded by any REX, F0, F2, F3, or 66 prefixes.

VMOVNRAPS - Store Aligned Float32 Vector With No-Read Hint

Opcode                        Instruction                    Description
MVEX.512.F2.0F.W0.EH0 29 /r   vmovnraps m {k1}, Df32(zmm1)   Store with No-Read hint float32 vector
                                                             Df32(zmm1) into m, under write-mask.

Description
Stores float32 vector zmm1 (or a down-converted version of it) into destination memory
with a No-Read hint for the case the whole vector is going to be written into memory. This
instruction is intended to speed up the case of stores in streaming kernels where we want
to avoid wasting memory bandwidth by being forced to read the original content of entire
cache lines from memory when we overwrite their whole contents completely.
In the Intel® Xeon Phi™ coprocessor, this instruction is able to optimize memory bandwidth
in case of a cache miss and avoid reading the original contents of the memory destination
operand if the following conditions hold true:
• The instruction does not use a write-mask (MVEX.aaa=000).
• The instruction does not perform any kind of down-conversion (MVEX.SSS=000).
Note that this instruction is encoded by forcing MVEX.EH bit to 0. The Eviction Hint does
not have any effect on this instruction.
The No-Read directive is intended as a performance hint and could be ignored by a given
processor implementation.
This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are stored to memory. Elements in the destination memory
vector with the corresponding bit clear in k1 register retain their previous value.

Operation
DESTINATION IS A MEMORY OPERAND
downSize = DownConvStoreSizeOf_f32(SSS[2:0])
for (n = 0; n < 16; n++) {
    if (k1[n] != 0) {
        i = 32*n
        tmp = DownConvStore_f32(zmm1[i+31:i], SSS[2:0])
        if (downSize == 4) {
            MemStore(mt + 4*n) = tmp[31:0]
        } else if (downSize == 2) {
            MemStore(mt + 2*n) = tmp[15:0]
        } else if (downSize == 1) {
            MemStore(mt + n) = tmp[7:0]
        }
    }
}

SIMD Floating-Point Exceptions
Overflow, Underflow, Invalid, Precision, Denormal.

Memory Down-conversion: Df32
S2 S1 S0   Function:             Usage            disp8*N
000        no conversion         zmm1             64
001        reserved              N/A              N/A
010        reserved              N/A              N/A
011        float32 to float16    zmm1 {float16}   32
100        float32 to uint8      zmm1 {uint8}     16
101        float32 to sint8      zmm1 {sint8}     16
110        float32 to uint16     zmm1 {uint16}    32
111        float32 to sint16     zmm1 {sint16}    32

Intel® C/C++ Compiler Intrinsic Equivalent
void _mm512_storenr_ps(void*, __m512);

Exceptions
Real-Address Mode and Virtual-8086
#UD               Instruction not available in these modes

Protected and Compatibility Mode
#UD               Instruction not available in these modes

64 bit Mode
#SS(0)            If a memory address referencing the SS segment is in a non-canonical form.
#GP(0)            If the memory address is in a non-canonical form.
                  If a memory operand linear address is not aligned to the data size
                  granularity dictated by SwizzUpConv mode.
#PF(fault-code)   For a page fault.
#NM               If CR0.TS[bit 3]=1.
                  If preceded by any REX, F0, F2, F3, or 66 prefixes.

VMOVNRNGOAPD - Non-globally Ordered Store Aligned Float64 Vector With No-Read Hint

Opcode                        Instruction                       Description
MVEX.512.F3.0F.W1.EH1 29 /r   vmovnrngoapd m {k1}, Df64(zmm1)   Non-ordered Store with No-Read hint float64
                                                                vector Df64(zmm1) into m, under write-mask.

Description
Stores float64 vector zmm1 (or a down-converted version of it) into destination memory
with a No-Read hint for the case the whole vector is going to be written into memory,
using a weakly-ordered memory consistency model (i.e. stores performed with this
instruction are not globally ordered, and subsequent stores from the same thread can be
observed before them).
This instruction is intended to speed up the case of stores in streaming kernels where we
want to avoid wasting memory bandwidth by being forced to read the original content of
entire cache lines from memory when we overwrite their whole contents completely. This
instruction takes advantage of the weakly-ordered memory consistency model to increase
the throughput at which this type of write operation can be performed. For the same
reason, a fencing operation should be used in conjunction with this instruction if multiple
threads are reading/writing the memory operand location. Though CPUID can be used as
the fencing operation, better options are "LOCK ADD [RSP],0" (a dummy atomic add) or
XCHG (which combines a store and a fence).
In the Intel® Xeon Phi™ coprocessor, this instruction is able to optimize memory
bandwidth in case of a cache miss and avoid reading the original contents of the memory
destination operand if the following conditions hold true:
• The instruction does not use a write-mask (MVEX.aaa=000).
• The instruction does not perform any kind of down-conversion (MVEX.SSS=000).
Note that this instruction is encoded by forcing MVEX.EH bit to 1. The Eviction Hint does
not have any effect on this instruction.
The No-Read directive is intended as a performance hint and could be ignored by a given
processor implementation.
This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are stored to memory. Elements in the destination memory
vector with the corresponding bit clear in k1 register retain their previous value.
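
The fencing requirement described above follows the usual publish pattern: write the data, fence, then signal other threads. The sketch below shows that ordering in portable C11 as an analogy only. It does not use the coprocessor instruction; plain stores stand in for vmovnrngoapd, `atomic_thread_fence` stands in for the fencing operation (on the coprocessor this would be one of the instructions named in the text), and the names `buffer`, `ready`, `publish` and `try_consume` are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define N_ELEMS 8

static double buffer[N_ELEMS];   /* stands in for the memory operand */
static atomic_int ready;         /* hypothetical publication flag, initially 0 */

/* Producer: fill the buffer, fence, then publish the flag. */
static void publish(const double src[N_ELEMS])
{
    for (size_t i = 0; i < N_ELEMS; i++)
        buffer[i] = src[i];                       /* stand-in for the streaming stores */
    atomic_thread_fence(memory_order_seq_cst);    /* analog of the fencing operation */
    atomic_store_explicit(&ready, 1, memory_order_relaxed);
}

/* Consumer: returns 1 and copies the data once the flag is observed, else 0. */
static int try_consume(double dst[N_ELEMS])
{
    if (!atomic_load_explicit(&ready, memory_order_acquire))
        return 0;
    for (size_t i = 0; i < N_ELEMS; i++)
        dst[i] = buffer[i];
    return 1;
}
```

Without the fence between the data stores and the flag store, a consumer could observe the flag before the data, which is the hazard the text warns about for non-globally-ordered stores.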


Operation
DESTINATION IS A MEMORY OPERAND
downSize = DownConvStoreSizeOf_f64(SSS[2:0])
for (n = 0; n < 8; n++) {
    if (k1[n] != 0) {
        i = 64*n
        tmp = DownConvStore_f64(zmm1[i+63:i], SSS[2:0])
        if (downSize == 8) {
            MemStore(mt + 8*n) = tmp[63:0]
        }
    }
}

SIMD Floating-Point Exceptions
None.

Memory Down-conversion: Df64
S2 S1 S0   Function:        Usage   disp8*N
000        no conversion    zmm1    64
001        reserved         N/A     N/A
010        reserved         N/A     N/A
011        reserved         N/A     N/A
100        reserved         N/A     N/A
101        reserved         N/A     N/A
110        reserved         N/A     N/A
111        reserved         N/A     N/A

Intel® C/C++ Compiler Intrinsic Equivalent
void _mm512_storenrngo_pd(void*, __m512d);

Exceptions
Real-Address Mode and Virtual-8086
#UD               Instruction not available in these modes

Protected and Compatibility Mode
#UD               Instruction not available in these modes

64 bit Mode
#SS(0)            If a memory address referencing the SS segment is in a non-canonical form.
#GP(0)            If the memory address is in a non-canonical form.
                  If a memory operand linear address is not aligned to the data size granularity dictated by SwizzUpConv mode.
#PF(fault-code)   For a page fault.
#NM               If CR0.TS[bit 3]=1.
                  If preceded by any REX, F0, F2, F3, or 66 prefixes.

VMOVNRNGOAPS - Non-globally Ordered Store Aligned Float32 Vector With No-Read Hint

Opcode
MVEX.512.F2.0F.W0.EH1 29 /r

Instruction
vmovnrngoaps m {k1}, Df32(zmm1)

Description
Non-ordered Store with No-Read hint float32 vector Df32(zmm1) into m, under write-mask.

Description
Stores float32 vector zmm1 (or a down-converted version of it) into destination memory with a No-Read hint for the case where the whole vector is going to be written into memory, using a weakly-ordered memory consistency model (i.e., stores performed with this instruction are not globally ordered, and subsequent stores from the same thread can be observed before them).
This instruction is intended to speed up stores in streaming kernels, where reading the original contents of entire cache lines from memory wastes bandwidth because those contents are about to be overwritten completely. This instruction takes advantage of the weakly-ordered memory consistency model to increase the throughput at which this type of write operation can be performed. For the same reason, a fencing operation should be used in conjunction with this instruction if multiple threads are reading/writing the memory operand location. Though CPUID can be used as the fencing operation, better options are "LOCK ADD [RSP],0" (a dummy atomic add) or XCHG (which combines a store and a fence).
In the Intel® Xeon Phi™ coprocessor, this instruction is able to optimize memory bandwidth in case of a cache miss and avoid reading the original contents of the memory destination operand if the following conditions hold true:
• The instruction does not use a write-mask (MVEX.aaa=000).
• The instruction does not perform any kind of down-conversion (MVEX.SSS=000).
Note that this instruction is encoded by forcing MVEX.EH bit to 1. The Eviction Hint does
not have any effect on this instruction.
The No-Read directive is intended as a performance hint and could be ignored by a given
processor implementation.
This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are stored to memory. Elements in the destination memory
vector with the corresponding bit clear in k1 register retain their previous value.

Operation
DESTINATION IS A MEMORY OPERAND
downSize = DownConvStoreSizeOf_f32(SSS[2:0])
for (n = 0; n < 16; n++) {
    if (k1[n] != 0) {
        i = 32*n
        tmp = DownConvStore_f32(zmm1[i+31:i], SSS[2:0])
        if (downSize == 4) {
            MemStore(mt + 4*n) = tmp[31:0]
        } else if (downSize == 2) {
            MemStore(mt + 2*n) = tmp[15:0]
        } else if (downSize == 1) {
            MemStore(mt + n) = tmp[7:0]
        }
    }
}

SIMD Floating-Point Exceptions
Overflow, Underflow, Invalid, Precision, Denormal.

Memory Down-conversion: Df32
S2 S1 S0   Function:            Usage            disp8*N
000        no conversion        zmm1             64
001        reserved             N/A              N/A
010        reserved             N/A              N/A
011        float32 to float16   zmm1 {float16}   32
100        float32 to uint8     zmm1 {uint8}     16
101        float32 to sint8     zmm1 {sint8}     16
110        float32 to uint16    zmm1 {uint16}    32
111        float32 to sint16    zmm1 {sint16}    32
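The disp8*N column above can be cross-checked against the element sizes implied by the conversion names. The following is a minimal sketch, assuming DownConvStoreSizeOf for Df32 returns the per-element byte size named by each conversion (4 for float32, 2 for float16/uint16/sint16, 1 for uint8/sint8). The function names are hypothetical and are not part of the manual; reserved encodings are reported as 0.

```c
#include <assert.h>

/* Hypothetical helper: bytes per element after a Df32 down-conversion.
   Sizes are inferred from the conversion names in the table above;
   0 marks a reserved encoding. */
unsigned down_conv_store_size_f32(unsigned sss)
{
    static const unsigned size_in_bytes[8] = {4, 0, 0, 2, 1, 1, 2, 2};
    assert(sss < 8);
    return size_in_bytes[sss];
}

/* Bytes written by a full 16-element store; for the non-reserved
   encodings this reproduces the disp8*N column (64, 32, 16, 16, 32, 32). */
unsigned full_vector_store_bytes_f32(unsigned sss)
{
    return 16u * down_conv_store_size_f32(sss);
}
```

For example, full_vector_store_bytes_f32(3) evaluates to 32, which matches the disp8*N entry for the float32 to float16 row.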

Intel® C/C++ Compiler Intrinsic Equivalent
void _mm512_storenrngo_ps(void*, __m512);

Exceptions
Real-Address Mode and Virtual-8086
#UD               Instruction not available in these modes

Protected and Compatibility Mode
#UD               Instruction not available in these modes

64 bit Mode
#SS(0)            If a memory address referencing the SS segment is in a non-canonical form.
#GP(0)            If the memory address is in a non-canonical form.
                  If a memory operand linear address is not aligned to the data size granularity dictated by SwizzUpConv mode.
#PF(fault-code)   For a page fault.
#NM               If CR0.TS[bit 3]=1.
                  If preceded by any REX, F0, F2, F3, or 66 prefixes.

VMULPD - Multiply Float64 Vectors

Opcode
MVEX.NDS.512.66.0F.W1 59 /r

Instruction
vmulpd zmm1 {k1}, zmm2, Sf64(zmm3/mt)

Description
Multiply float64 vector zmm2 and float64 vector Sf64(zmm3/mt) and store the result in zmm1, under write-mask.

Description
Performs an element-by-element multiplication between float64 vector zmm2 and the float64 vector result of the swizzle/broadcast/conversion process on memory or float64 vector zmm3. The result is written into float64 vector zmm1.
This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.

Operation
if (source is a register operand and MVEX.EH bit is 1) {
    if (SSS[2]==1) Supress_Exception_Flags() // SAE
    // SSS are bits 6-4 from the MVEX prefix encoding. For more details, see Table 2.14
    RoundingMode = SSS[1:0]
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    RoundingMode = MXCSR.RC
    tmpSrc3[511:0] = SwizzUpConvLoad_f64(zmm3/mt)
}
for (n = 0; n < 8; n++) {
    if (k1[n] != 0) {
        i = 64*n
        // float64 operation
        zmm1[i+63:i] = zmm2[i+63:i] * tmpSrc3[i+63:i]
    }
}
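
The write-mask loop above can be modeled in scalar C. This is only an illustrative sketch, not the hardware behavior: the function name masked_mul_f64 is hypothetical, plain arrays stand in for zmm1, zmm2 and the already swizzled or up-converted third source, and rounding-mode override, SAE and denormal handling are not modeled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of a write-masked float64 multiply over 8 lanes.
   dst stands in for zmm1, a for zmm2, b for tmpSrc3; k1 is the
   8-bit write-mask (bit n controls lane n). */
void masked_mul_f64(double dst[8], const double a[8],
                    const double b[8], uint8_t k1)
{
    assert(dst != NULL && a != NULL && b != NULL);
    for (int n = 0; n < 8; n++) {
        if ((k1 >> n) & 1u) {
            dst[n] = a[n] * b[n];   /* lane enabled: compute and store */
        }
        /* lane disabled: dst[n] retains its previous value */
    }
}
```

With k1 = 0x05 only lanes 0 and 2 are recomputed; every other lane of dst is left untouched.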

SIMD Floating-Point Exceptions
Overflow, Underflow, Invalid, Precision, Denormal.

Denormal Handling
Treat Input Denormals As Zeros : (MXCSR.DAZ)? YES : NO
Flush Tiny Results To Zero : (MXCSR.FZ)? YES : NO

Memory Up-conversion: Sf64
S2 S1 S0   Function:                   Usage                   disp8*N
000        no conversion               [rax] {8to8} or [rax]   64
001        broadcast 1 element (x8)    [rax] {1to8}            8
010        broadcast 4 elements (x2)   [rax] {4to8}            32
011        reserved                    N/A                     N/A
100        reserved                    N/A                     N/A
101        reserved                    N/A                     N/A
110        reserved                    N/A                     N/A
111        reserved                    N/A                     N/A
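
The three non-reserved rows of this table can be illustrated with a small scalar model. This is a sketch under the assumption that {4to8} repeats the four loaded elements in order (lanes 0 to 3, then again in lanes 4 to 7); the function name broadcast_f64 is hypothetical. Note that the number of bytes read, 8 * n_src, matches the disp8*N column (64, 8, 32).

```c
#include <assert.h>
#include <stddef.h>

/* Scalar model of the Sf64 memory broadcasts.
   n_src is the number of float64 values read from memory:
   8 for {8to8}, 1 for {1to8}, 4 for {4to8}. */
void broadcast_f64(double dst[8], const double *mem, int n_src)
{
    assert(dst != NULL && mem != NULL);
    assert(n_src == 1 || n_src == 4 || n_src == 8);
    for (int n = 0; n < 8; n++) {
        dst[n] = mem[n % n_src];   /* wrap around the loaded elements */
    }
}
```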

Register Swizzle: Sf64
MVEX.EH=0
S2 S1 S0   Function: 4 x 64 bits    Usage
000        no swizzle               zmm0 or zmm0 {dcba}
001        swap (inner) pairs       zmm0 {cdab}
010        swap with two-away       zmm0 {badc}
011        cross-product swizzle    zmm0 {dacb}
100        broadcast a element      zmm0 {aaaa}
101        broadcast b element      zmm0 {bbbb}
110        broadcast c element      zmm0 {cccc}
111        broadcast d element      zmm0 {dddd}
MVEX.EH=1
S2 S1 S0   Rounding Mode Override             Usage
000        Round To Nearest (even)            , {rn}
001        Round Down (-INF)                  , {rd}
010        Round Up (+INF)                    , {ru}
011        Round Toward Zero                  , {rz}
100        Round To Nearest (even) with SAE   , {rn-sae}
101        Round Down (-INF) with SAE         , {rd-sae}
110        Round Up (+INF) with SAE           , {ru-sae}
111        Round Toward Zero with SAE         , {rz-sae}

Intel® C/C++ Compiler Intrinsic Equivalent
__m512d _mm512_mul_pd (__m512d, __m512d);
__m512d _mm512_mask_mul_pd (__m512d, __mmask8, __m512d, __m512d);

Exceptions
Real-Address Mode and Virtual-8086
#UD               Instruction not available in these modes

Protected and Compatibility Mode
#UD               Instruction not available in these modes

64 bit Mode
#SS(0)            If a memory address referencing the SS segment is in a non-canonical form.
#GP(0)            If the memory address is in a non-canonical form.
                  If a memory operand linear address is not aligned to the data size granularity dictated by SwizzUpConv mode.
#PF(fault-code)   For a page fault.
#NM               If CR0.TS[bit 3]=1.
                  If preceded by any REX, F0, F2, F3, or 66 prefixes.

VMULPS - Multiply Float32 Vectors

Opcode
MVEX.NDS.512.0F.W0 59 /r

Instruction
vmulps zmm1 {k1}, zmm2, Sf32(zmm3/mt)

Description
Multiply float32 vector zmm2 and float32 vector Sf32(zmm3/mt) and store the result in zmm1, under write-mask.

Description
Performs an element-by-element multiplication between float32 vector zmm2 and the float32 vector result of the swizzle/broadcast/conversion process on memory or float32 vector zmm3. The result is written into float32 vector zmm1.
This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.

Operation
if (source is a register operand and MVEX.EH bit is 1) {
    if (SSS[2]==1) Supress_Exception_Flags() // SAE
    // SSS are bits 6-4 from the MVEX prefix encoding. For more details, see Table 2.14
    RoundingMode = SSS[1:0]
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    RoundingMode = MXCSR.RC
    tmpSrc3[511:0] = SwizzUpConvLoad_f32(zmm3/mt)
}
for (n = 0; n < 16; n++) {
    if (k1[n] != 0) {
        i = 32*n
        // float32 operation
        zmm1[i+31:i] = zmm2[i+31:i] * tmpSrc3[i+31:i]
    }
}

SIMD Floating-Point Exceptions
Overflow, Underflow, Invalid, Precision, Denormal.

Denormal Handling
Treat Input Denormals As Zeros : (MXCSR.DAZ)? YES : NO
Flush Tiny Results To Zero : (MXCSR.FZ)? YES : NO

Memory Up-conversion: Sf32
S2 S1 S0   Function:                   Usage                     disp8*N
000        no conversion               [rax] {16to16} or [rax]   64
001        broadcast 1 element (x16)   [rax] {1to16}             4
010        broadcast 4 elements (x4)   [rax] {4to16}             16
011        float16 to float32          [rax] {float16}           32
100        uint8 to float32            [rax] {uint8}             16
110        uint16 to float32           [rax] {uint16}            32
111        sint16 to float32           [rax] {sint16}            32

Register Swizzle: Sf32
MVEX.EH=0
S2 S1 S0   Function: 4 x 32 bits    Usage
000        no swizzle               zmm0 or zmm0 {dcba}
001        swap (inner) pairs       zmm0 {cdab}
010        swap with two-away       zmm0 {badc}
011        cross-product swizzle    zmm0 {dacb}
100        broadcast a element      zmm0 {aaaa}
101        broadcast b element      zmm0 {bbbb}
110        broadcast c element      zmm0 {cccc}
111        broadcast d element      zmm0 {dddd}
MVEX.EH=1
S2 S1 S0   Rounding Mode Override             Usage
000        Round To Nearest (even)            , {rn}
001        Round Down (-INF)                  , {rd}
010        Round Up (+INF)                    , {ru}
011        Round Toward Zero                  , {rz}
100        Round To Nearest (even) with SAE   , {rn-sae}
101        Round Down (-INF) with SAE         , {rd-sae}
110        Round Up (+INF) with SAE           , {ru-sae}
111        Round Toward Zero with SAE         , {rz-sae}
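
The MVEX.EH=0 swizzle patterns can be read as per-group permutations. The sketch below assumes that the letters a to d name lanes 0 to 3 of each 4-element group and that a pattern is written from lane 3 down to lane 0, which is consistent with {dcba} being listed as "no swizzle". The function name swizzle_f32 is hypothetical, and dst must not overlap src.

```c
#include <assert.h>
#include <string.h>

/* Apply a 4-letter pattern such as "cdab" to every group of four
   32-bit lanes. Assumption: pattern[0] describes lane 3 and
   pattern[3] describes lane 0, so "dcba" is the identity. */
void swizzle_f32(float dst[16], const float src[16], const char *pattern)
{
    assert(strlen(pattern) == 4);
    for (int g = 0; g < 16; g += 4) {
        for (int lane = 0; lane < 4; lane++) {
            char letter = pattern[3 - lane];  /* rightmost letter feeds lane 0 */
            assert(letter >= 'a' && letter <= 'd');
            dst[g + lane] = src[g + (letter - 'a')];
        }
    }
}
```

Under this reading, "cdab" exchanges lanes 0 and 1 and lanes 2 and 3 of each group (swap inner pairs), and "aaaa" replicates lane 0 of each group into all four lanes.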

Intel® C/C++ Compiler Intrinsic Equivalent
__m512 _mm512_mul_ps (__m512, __m512);
__m512 _mm512_mask_mul_ps (__m512, __mmask16, __m512, __m512);

Exceptions
Real-Address Mode and Virtual-8086
#UD               Instruction not available in these modes

Protected and Compatibility Mode
#UD               Instruction not available in these modes

64 bit Mode
#SS(0)            If a memory address referencing the SS segment is in a non-canonical form.
#GP(0)            If the memory address is in a non-canonical form.
                  If a memory operand linear address is not aligned to the data size granularity dictated by SwizzUpConv mode.
#PF(fault-code)   For a page fault.
#NM               If CR0.TS[bit 3]=1.
                  If preceded by any REX, F0, F2, F3, or 66 prefixes.

VPACKSTOREHD - Pack And Store Unaligned High From Int32 Vector

Opcode
MVEX.512.66.0F38.W0 D4 /r

Instruction
vpackstorehd mt {k1}, Di32(zmm1)

Description
Pack mask-enabled elements of int32 vector zmm1 to form an unaligned int32 stream, down-convert it and logically map the stream starting at mt − 64, and store that portion of the stream that maps to the high 64-byte-aligned portion of the memory destination, under write-mask.

Description
Packs and down-converts the mask-enabled elements of int32 vector zmm1 into a byte/word/doubleword stream logically mapped starting at element-aligned address (mt − 64), and stores the high-64-byte elements of that stream (those elements of the stream that map at or after the first 64-byte-aligned address following (mt − 64), the high cache line in the current implementation). The length of the stream depends on the number of enabled masks, as elements disabled by the mask are not added to the stream.
The vpackstoreld instruction is used to store the part of the stream before the first 64-byte-aligned address preceding mt.
The mask is not used as a write-mask for this instruction. Instead, the mask is used as an element selector, choosing which elements are added to the stream. The one similarity to a write-mask as used in the rest of this document is that the no-write-mask option (encoding 0) is available to select a mask of 0xFFFF for this instruction. For that reason, the notation and encoding are the same as for a write-mask.
In conjunction with vpackstoreld, this instruction is useful for packing data into a queue. Also in conjunction with vpackstoreld, it allows unaligned vector stores (that is, vector stores that are only element-wise, not vector-wise, aligned); just use a mask of 0xFFFF or no write-mask for this purpose. The typical instruction sequence to perform an unaligned vector store would be:
// assume memory location is pointed to by register rax
vpackstoreld [rax]    {k1}, v0
vpackstorehd [rax+64] {k1}, v0
This instruction does not have subset support.
This instruction has special disp8*N and alignment rules. N is considered to be the size of a single vector element after down-conversion.
Note that the address reported by a page fault is the beginning of the 64-byte cache line boundary containing the memory operand. The instruction will not produce any #GP or #SS fault due to address canonicity nor #PF fault if the address is aligned to a 64-byte boundary. Additionally, A/D bits in the page table will not be updated.

Operation
storeOffset = 0
downSize = DownConvStoreSizeOf_i32(SSS[2:0])
foundNext64BytesBoundary = false
pointer = mt - 64
for (n = 0; n < 16; n++) {
    if (k1[n] != 0) {
        if (foundNext64BytesBoundary == false) {
            if (((pointer + (storeOffset+1)*downSize) % 64) == 0) {
                foundNext64BytesBoundary = true
            }
        } else {
            i = 32*n
            tmp = DownConvStore_i32(zmm1[i+31:i], SSS[2:0])
            if (downSize == 4) {
                MemStore(pointer + storeOffset*4) = tmp[31:0]
            } else if (downSize == 2) {
                MemStore(pointer + storeOffset*2) = tmp[15:0]
            } else if (downSize == 1) {
                MemStore(pointer + storeOffset) = tmp[7:0]
            }
        }
        storeOffset++
    }
}
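
To see how the low and high halves cooperate, the pseudocode can be transcribed into scalar C for the no-conversion case (downSize of 4). This is a sketch only: memory is modeled as a byte array whose index 0 is treated as a 64-byte-aligned address, faults are not modeled, and the names packstore_lo_i32 and packstore_hi_i32 are hypothetical. The low half follows the Operation section of vpackstoreld.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Low half: store stream elements starting at mt until the next
   64-byte boundary is reached. */
void packstore_lo_i32(uint8_t *mem, size_t mt, uint16_t k1, const int32_t v[16])
{
    size_t storeOffset = 0;
    assert(mt % 4 == 0);                     /* element-wise alignment */
    for (int n = 0; n < 16; n++) {
        if ((k1 >> n) & 1u) {
            memcpy(mem + mt + 4 * storeOffset, &v[n], 4);
            storeOffset++;
            if ((mt + 4 * storeOffset) % 64 == 0)
                break;                       /* rest belongs to the high half */
        }
    }
}

/* High half: the stream is mapped at mt - 64; only elements at or
   after the first 64-byte boundary are stored. */
void packstore_hi_i32(uint8_t *mem, size_t mt, uint16_t k1, const int32_t v[16])
{
    size_t pointer, storeOffset = 0;
    int foundNext64BytesBoundary = 0;
    assert(mt % 4 == 0 && mt >= 64);
    pointer = mt - 64;
    for (int n = 0; n < 16; n++) {
        if ((k1 >> n) & 1u) {
            if (!foundNext64BytesBoundary) {
                if ((pointer + (storeOffset + 1) * 4) % 64 == 0)
                    foundNext64BytesBoundary = 1;
            } else {
                memcpy(mem + pointer + 4 * storeOffset, &v[n], 4);
            }
            storeOffset++;
        }
    }
}
```

Calling packstore_lo_i32(mem, 40, 0xFFFF, v) followed by packstore_hi_i32(mem, 40 + 64, 0xFFFF, v) writes all 16 elements contiguously at offsets 40 to 103: the low call stores the first 6 elements (up to offset 63), and the high call stores the remaining 10.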

Flags Affected
None.

Memory Down-conversion: Di32
S2 S1 S0   Function:           Usage           disp8*N
000        no conversion       zmm1            4
001        reserved            N/A             N/A
010        reserved            N/A             N/A
011        reserved            N/A             N/A
100        uint32 to uint8     zmm1 {uint8}    1
101        sint32 to sint8     zmm1 {sint8}    1
110        uint32 to uint16    zmm1 {uint16}   2
111        sint32 to sint16    zmm1 {sint16}   2
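
The disp8*N values above determine when a displacement fits the compressed 8-bit form. The following sketch assumes that the encoded 8-bit displacement is multiplied by N to obtain the byte displacement, as the column name suggests; the function name encode_disp8n is hypothetical and not part of the manual.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 and writes the encoded byte to *out when disp can be
   expressed as disp8*N; returns 0 when a wider displacement would
   be needed. n is the N value from the table (1, 2 or 4 here). */
int encode_disp8n(int32_t disp, int32_t n, int8_t *out)
{
    int32_t q;
    assert(n > 0 && out != NULL);
    if (disp % n != 0)
        return 0;               /* not a multiple of N */
    q = disp / n;
    if (q < -128 || q > 127)
        return 0;               /* does not fit in a signed byte */
    *out = (int8_t)q;
    return 1;
}
```

Under this assumption, the [rax+64] operand of vpackstorehd with no conversion (N = 4) would be encoded with a disp8 of 16, while the same operand with a {uint8} conversion (N = 1) would use a disp8 of 64.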

Intel® C/C++ Compiler Intrinsic Equivalent
void _mm512_extpackstorehi_epi32 (void*, __m512i, _MM_DOWNCONV_EPI32_ENUM, int);
void _mm512_mask_extpackstorehi_epi32 (void*, __mmask16, __m512i, _MM_DOWNCONV_EPI32_ENUM, int);
void _mm512_packstorehi_epi32 (void*, __m512i);
void _mm512_mask_packstorehi_epi32 (void*, __mmask16, __m512i);

Exceptions
Real-Address Mode and Virtual-8086
#UD               Instruction not available in these modes

Protected and Compatibility Mode
#UD               Instruction not available in these modes

64 bit Mode
#SS(0)            If a memory address referencing the SS segment is in a non-canonical form.
#GP(0)            If the memory address is in a non-canonical form.
                  If a memory operand linear address is not aligned to element-wise data granularity dictated by the DownConv.
#PF(fault-code)   For a page fault.
#NM               If CR0.TS[bit 3]=1.
#UD               If processor model does not implement the specific instruction.
                  If preceded by any REX, F0, F2, F3, or 66 prefixes.
                  If the first operand is not a memory location.

VPACKSTOREHPD - Pack And Store Unaligned High From Float64 Vector

Opcode
MVEX.512.66.0F38.W1 D5 /r

Instruction
vpackstorehpd mt {k1}, Df64(zmm1)

Description
Pack mask-enabled elements of float64 vector zmm1 to form an unaligned float64 stream, down-convert it and logically map the stream starting at mt − 64, and store that portion of the stream that maps to the high 64-byte-aligned portion of the memory destination, under write-mask.

Description
Packs and down-converts the mask-enabled elements of float64 vector zmm1 into a float64 stream logically mapped starting at element-aligned address (mt − 64), and stores the high-64-byte elements of that stream (those elements of the stream that map at or after the first 64-byte-aligned address following (mt − 64), the high cache line in the current implementation). The length of the stream depends on the number of enabled masks, as elements disabled by the mask are not added to the stream.
The vpackstorelpd instruction is used to store the part of the stream before the first 64-byte-aligned address preceding mt.
The mask is not used as a write-mask for this instruction. Instead, the mask is used as an element selector, choosing which elements are added to the stream. The one similarity to a write-mask as used in the rest of this document is that the no-write-mask option (encoding 0) is available to select a mask of 0xFF for this instruction. For that reason, the notation and encoding are the same as for a write-mask.
In conjunction with vpackstorelpd, this instruction is useful for packing data into a queue. Also in conjunction with vpackstorelpd, it allows unaligned vector stores (that is, vector stores that are only element-wise, not vector-wise, aligned); just use a mask of 0xFF or no write-mask for this purpose. The typical instruction sequence to perform an unaligned vector store would be:
// assume memory location is pointed to by register rax
vpackstorelpd [rax]    {k1}, v0
vpackstorehpd [rax+64] {k1}, v0
This instruction does not have subset support.
This instruction has special disp8*N and alignment rules. N is considered to be the size of a single vector element after down-conversion.
Note that the address reported by a page fault is the beginning of the 64-byte cache line boundary containing the memory operand. The instruction will not produce any #GP or #SS fault due to address canonicity nor #PF fault if the address is aligned to a 64-byte boundary. Additionally, A/D bits in the page table will not be updated.

Operation
storeOffset = 0
downSize = DownConvStoreSizeOf_f64(SSS[2:0])
foundNext64BytesBoundary = false
pointer = mt - 64
for (n = 0; n < 8; n++) {
    if (k1[n] != 0) {
        if (foundNext64BytesBoundary == false) {
            if (((pointer + (storeOffset+1)*downSize) % 64) == 0) {
                foundNext64BytesBoundary = true
            }
        } else {
            i = 64*n
            tmp = DownConvStore_f64(zmm1[i+63:i], SSS[2:0])
            if (downSize == 8) {
                MemStore(pointer + storeOffset*8) = tmp[63:0]
            }
        }
        storeOffset++
    }
}

SIMD Floating-Point Exceptions
None.

Memory Down-conversion: Df64
S2 S1 S0   Function:       Usage   disp8*N
000        no conversion   zmm1    8
001        reserved        N/A     N/A
010        reserved        N/A     N/A
011        reserved        N/A     N/A
100        reserved        N/A     N/A
101        reserved        N/A     N/A
110        reserved        N/A     N/A
111        reserved        N/A     N/A

Intel® C/C++ Compiler Intrinsic Equivalent
void _mm512_extpackstorehi_pd (void*, __m512d, _MM_DOWNCONV_PD_ENUM, int);
void _mm512_mask_extpackstorehi_pd (void*, __mmask8, __m512d, _MM_DOWNCONV_PD_ENUM, int);
void _mm512_packstorehi_pd (void*, __m512d);
void _mm512_mask_packstorehi_pd (void*, __mmask8, __m512d);

Exceptions
Real-Address Mode and Virtual-8086
#UD               Instruction not available in these modes

Protected and Compatibility Mode
#UD               Instruction not available in these modes

64 bit Mode
#SS(0)            If a memory address referencing the SS segment is in a non-canonical form.
#GP(0)            If the memory address is in a non-canonical form.
                  If a memory operand linear address is not aligned to element-wise data granularity dictated by the DownConv.
#PF(fault-code)   For a page fault.
#NM               If CR0.TS[bit 3]=1.
#UD               If processor model does not implement the specific instruction.
                  If preceded by any REX, F0, F2, F3, or 66 prefixes.
                  If the first operand is not a memory location.

VPACKSTOREHPS - Pack And Store Unaligned High From Float32 Vector

Opcode
MVEX.512.66.0F38.W0 D5 /r

Instruction
vpackstorehps mt {k1}, Df32(zmm1)

Description
Pack mask-enabled elements of float32 vector zmm1 to form an unaligned float32 stream, down-convert it and logically map the stream starting at mt − 64, and store that portion of the stream that maps to the high 64-byte-aligned portion of the memory destination, under write-mask.

Description
Packs and down-converts the mask-enabled elements of float32 vector zmm1 into a byte/word/doubleword stream logically mapped starting at element-aligned address (mt − 64), and stores the high-64-byte elements of that stream (those elements of the stream that map at or after the first 64-byte-aligned address following (mt − 64), the high cache line in the current implementation). The length of the stream depends on the number of enabled masks, as elements disabled by the mask are not added to the stream.
The vpackstorelps instruction is used to store the part of the stream before the first 64-byte-aligned address preceding mt.
The mask is not used as a write-mask for this instruction. Instead, the mask is used as an element selector, choosing which elements are added to the stream. The one similarity to a write-mask as used in the rest of this document is that the no-write-mask option (encoding 0) is available to select a mask of 0xFFFF for this instruction. For that reason, the notation and encoding are the same as for a write-mask.
In conjunction with vpackstorelps, this instruction is useful for packing data into a queue. Also in conjunction with vpackstorelps, it allows unaligned vector stores (that is, vector stores that are only element-wise, not vector-wise, aligned); just use a mask of 0xFFFF or no write-mask for this purpose. The typical instruction sequence to perform an unaligned vector store would be:
// assume memory location is pointed to by register rax
vpackstorelps [rax]    {k1}, v0
vpackstorehps [rax+64] {k1}, v0
This instruction does not have subset support.
This instruction has special disp8*N and alignment rules. N is considered to be the size of a single vector element after down-conversion.
Note that the address reported by a page fault is the beginning of the 64-byte cache line boundary containing the memory operand. The instruction will not produce any #GP or #SS fault due to address canonicity nor #PF fault if the address is aligned to a 64-byte boundary. Additionally, A/D bits in the page table will not be updated.

Operation
storeOffset = 0
downSize = DownConvStoreSizeOf_f32(SSS[2:0])
foundNext64BytesBoundary = false
pointer = mt - 64
for (n = 0; n < 16; n++) {
    if (k1[n] != 0) {
        if (foundNext64BytesBoundary == false) {
            if (((pointer + (storeOffset+1)*downSize) % 64) == 0) {
                foundNext64BytesBoundary = true
            }
        } else {
            i = 32*n
            tmp = DownConvStore_f32(zmm1[i+31:i], SSS[2:0])
            if (downSize == 4) {
                MemStore(pointer + storeOffset*4) = tmp[31:0]
            } else if (downSize == 2) {
                MemStore(pointer + storeOffset*2) = tmp[15:0]
            } else if (downSize == 1) {
                MemStore(pointer + storeOffset) = tmp[7:0]
            }
        }
        storeOffset++
    }
}
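
The queue-packing use mentioned in the description can be pictured through the combined effect of the low and high instructions: with no down-conversion, the mask-enabled elements end up contiguous in memory in lane order. The scalar sketch below models only that combined effect, not the split at the 64-byte boundary; the function name queue_append_f32 and the tail bookkeeping are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Append the mask-enabled lanes of v to queue starting at index tail.
   Returns the new tail, which advances by the number of set mask bits. */
size_t queue_append_f32(float *queue, size_t tail,
                        const float v[16], uint16_t k1)
{
    assert(queue != NULL && v != NULL);
    for (int n = 0; n < 16; n++) {
        if ((k1 >> n) & 1u) {
            queue[tail++] = v[n];   /* disabled lanes are skipped entirely */
        }
    }
    return tail;
}
```

A typical use would compute k1 from a vector compare, append the selected elements, and then advance the queue pointer by the population count of k1.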

SIMD Floating-Point Exceptions
Overflow, Underflow, Invalid, Precision, Denormal.

Memory Down-conversion: Df32
S2 S1 S0   Function:            Usage            disp8*N
000        no conversion        zmm1             4
001        reserved             N/A              N/A
010        reserved             N/A              N/A
011        float32 to float16   zmm1 {float16}   2
100        float32 to uint8     zmm1 {uint8}     1
101        float32 to sint8     zmm1 {sint8}     1
110        float32 to uint16    zmm1 {uint16}    2
111        float32 to sint16    zmm1 {sint16}    2

Intel® C/C++ Compiler Intrinsic Equivalent
void _mm512_extpackstorehi_ps (void*, __m512, _MM_DOWNCONV_PS_ENUM, int);
void _mm512_mask_extpackstorehi_ps (void*, __mmask16, __m512, _MM_DOWNCONV_PS_ENUM, int);
void _mm512_packstorehi_ps (void*, __m512);
void _mm512_mask_packstorehi_ps (void*, __mmask16, __m512);

Exceptions
Real-Address Mode and Virtual-8086
#UD               Instruction not available in these modes

Protected and Compatibility Mode
#UD               Instruction not available in these modes

64 bit Mode
#SS(0)            If a memory address referencing the SS segment is in a non-canonical form.
#GP(0)            If the memory address is in a non-canonical form.
                  If a memory operand linear address is not aligned to element-wise data granularity dictated by the DownConv.
#PF(fault-code)   For a page fault.
#NM               If CR0.TS[bit 3]=1.
#UD               If processor model does not implement the specific instruction.
                  If preceded by any REX, F0, F2, F3, or 66 prefixes.
                  If the first operand is not a memory location.

VPACKSTOREHQ - Pack And Store Unaligned High From Int64 Vector

Opcode
MVEX.512.66.0F38.W1 D4 /r

Instruction
vpackstorehq mt {k1}, Di64(zmm1)

Description
Pack mask-enabled elements of int64 vector zmm1 to form an unaligned int64 stream, down-convert it and logically map the stream starting at mt − 64, and store that portion of the stream that maps to the high 64-byte-aligned portion of the memory destination, under write-mask.

Description
Packs and down-converts the mask-enabled elements of int64 vector zmm1 into an int64 stream logically mapped starting at element-aligned address (mt − 64), and stores the high-64-byte elements of that stream (those elements of the stream that map at or after the first 64-byte-aligned address following (mt − 64), the high cache line in the current implementation). The length of the stream depends on the number of enabled masks, as elements disabled by the mask are not added to the stream.
The vpackstorelq instruction is used to store the part of the stream before the first 64-byte-aligned address preceding mt.
The mask is not used as a write-mask for this instruction. Instead, the mask is used as an element selector, choosing which elements are added to the stream. The one similarity to a write-mask as used in the rest of this document is that the no-write-mask option (encoding 0) is available to select a mask of 0xFF for this instruction. For that reason, the notation and encoding are the same as for a write-mask.
In conjunction with vpackstorelq, this instruction is useful for packing data into a queue. Also in conjunction with vpackstorelq, it allows unaligned vector stores (that is, vector stores that are only element-wise, not vector-wise, aligned); just use a mask of 0xFF or no write-mask for this purpose. The typical instruction sequence to perform an unaligned vector store would be:
// assume memory location is pointed to by register rax
vpackstorelq [rax]    {k1}, v0
vpackstorehq [rax+64] {k1}, v0
This instruction does not have subset support.
This instruction has special disp8*N and alignment rules. N is considered to be the size of a single vector element after down-conversion.
Note that the address reported by a page fault is the beginning of the 64-byte cache line boundary containing the memory operand. The instruction will not produce any #GP or #SS fault due to address canonicity nor #PF fault if the address is aligned to a 64-byte boundary. Additionally, A/D bits in the page table will not be updated.

Operation
storeOffset = 0
downSize = DownConvStoreSizeOf_i64(SSS[2:0])
foundNext64BytesBoundary = false
pointer = mt - 64
for (n = 0; n < 8; n++) {
    if (k1[n] != 0) {
        if (foundNext64BytesBoundary == false) {
            if (((pointer + (storeOffset+1)*downSize) % 64) == 0) {
                foundNext64BytesBoundary = true
            }
        } else {
            i = 64*n
            tmp = DownConvStore_i64(zmm1[i+63:i], SSS[2:0])
            if (downSize == 8) {
                MemStore(pointer + storeOffset*8) = tmp[63:0]
            }
        }
        storeOffset++
    }
}

Flags Affected
None.

Memory Down-conversion: Di64
S2 S1 S0   Function:       Usage   disp8*N
000        no conversion   zmm1    8
001        reserved        N/A     N/A
010        reserved        N/A     N/A
011        reserved        N/A     N/A
100        reserved        N/A     N/A
101        reserved        N/A     N/A
110        reserved        N/A     N/A
111        reserved        N/A     N/A

Intel® C/C++ Compiler Intrinsic Equivalent
void _mm512_extpackstorehi_epi64 (void*, __m512i, _MM_DOWNCONV_EPI64_ENUM, int);
void _mm512_mask_extpackstorehi_epi64 (void*, __mmask8, __m512i, _MM_DOWNCONV_EPI64_ENUM, int);
void _mm512_packstorehi_epi64 (void*, __m512i);
void _mm512_mask_packstorehi_epi64 (void*, __mmask8, __m512i);

Exceptions
Real-Address Mode and Virtual-8086
#UD               Instruction not available in these modes

Protected and Compatibility Mode
#UD               Instruction not available in these modes

64 bit Mode
#SS(0)            If a memory address referencing the SS segment is in a non-canonical form.
#GP(0)            If the memory address is in a non-canonical form.
                  If a memory operand linear address is not aligned to element-wise data granularity dictated by the DownConv.
#PF(fault-code)   For a page fault.
#NM               If CR0.TS[bit 3]=1.
#UD               If processor model does not implement the specific instruction.
                  If preceded by any REX, F0, F2, F3, or 66 prefixes.
                  If the first operand is not a memory location.

VPACKSTORELD - Pack and Store Unaligned Low From Int32 Vector

Opcode
MVEX.512.66.0F38.W0 D0 /r

Instruction
vpackstoreld mt {k1}, Di32(zmm1)

Description
Pack mask-enabled elements of int32 vector zmm1 to form an unaligned int32 stream, down-convert it and logically map the stream starting at mt, and store that portion of the stream that maps to the low 64-byte-aligned portion of the memory destination, under write-mask.

Description
Packs and down-converts the mask-enabled elements of int32 vector zmm1 into a byte/word/doubleword stream logically mapped starting at element-aligned address mt, and stores the low-64-byte elements of that stream (those elements of the stream that map before the first 64-byte-aligned address following mt, the low cache line in the current implementation). The length of the stream depends on the number of enabled masks, as elements disabled by the mask are not added to the stream.
The vpackstorehd instruction is used to store the part of the stream at or after the first 64-byte-aligned address preceding mt.
The mask is not used as a write-mask for this instruction. Instead, the mask is used as an element selector, choosing which elements are added to the stream. The one similarity to a write-mask as used in the rest of this document is that the no-write-mask option (encoding 0) is available to select a mask of 0xFFFF for this instruction. For that reason, the notation and encoding are the same as for a write-mask.
In conjunction with vpackstorehd, this instruction is useful for packing data into a queue. Also in conjunction with vpackstorehd, it allows unaligned vector stores (that is, vector stores that are only element-wise, not vector-wise, aligned); just use a mask of 0xFFFF or no write-mask for this purpose. The typical instruction sequence to perform an unaligned vector store would be:
// assume memory location is pointed to by register rax
vpackstoreld [rax]    {k1}, v0
vpackstorehd [rax+64] {k1}, v0
This instruction does not have subset support.
This instruction has special disp8*N and alignment rules. N is considered to be the size of a single vector element after down-conversion.
Note that the address reported by a page fault is the beginning of the 64-byte cache line boundary containing the memory operand.

Operation
storeOffset = 0
downSize = DownConvStoreSizeOf_i32(SSS[2:0])
for (n = 0; n < 16; n++) {
    if (k1[n] != 0) {
        i = 32*n
        tmp = DownConvStore_i32(zmm1[i+31:i], SSS[2:0])
        if (downSize == 4) {
            MemStore(mt + 4*storeOffset) = tmp[31:0]
        } else if (downSize == 2) {
            MemStore(mt + 2*storeOffset) = tmp[15:0]
        } else if (downSize == 1) {
            MemStore(mt + storeOffset) = tmp[7:0]
        }
        storeOffset++
        if (((mt + downSize*storeOffset) % 64) == 0) {
            break
        }
    }
}
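
The break condition above implies a simple closed form for how many stream elements the low instruction writes, and therefore how many remain for the matching high instruction issued at mt + 64. The sketch below derives those counts from the pseudocode; it is an illustration only, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Count the set bits of a 16-bit mask. */
static unsigned popcount16(uint16_t k)
{
    unsigned c = 0;
    while (k) {
        c += k & 1u;
        k >>= 1;
    }
    return c;
}

/* Elements written by the low instruction for a stream starting at mt:
   the loop stops once mt + downSize*storeOffset reaches a multiple of 64. */
unsigned lo_element_count(size_t mt, unsigned downSize, uint16_t k1)
{
    unsigned room, enabled;
    assert(downSize == 1 || downSize == 2 || downSize == 4);
    assert(mt % downSize == 0);              /* element-wise alignment */
    room = (unsigned)((64 - mt % 64) / downSize);
    enabled = popcount16(k1);
    return enabled < room ? enabled : room;
}

/* Elements left for the high instruction issued with operand mt + 64. */
unsigned hi_element_count(size_t mt, unsigned downSize, uint16_t k1)
{
    return popcount16(k1) - lo_element_count(mt, downSize, k1);
}
```

For a 64-byte-aligned mt, a full mask and no conversion, the low instruction writes all 16 elements and the high instruction writes none, which is consistent with the aligned case described for the high variants.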

Flags Affected
None.

Memory Down-conversion: Di32
S2 S1 S0   Function:           Usage           disp8*N
000        no conversion       zmm1            4
001        reserved            N/A             N/A
010        reserved            N/A             N/A
011        reserved            N/A             N/A
100        uint32 to uint8     zmm1 {uint8}    1
101        sint32 to sint8     zmm1 {sint8}    1
110        uint32 to uint16    zmm1 {uint16}   2
111        sint32 to sint16    zmm1 {sint16}   2

Reference Number: 327364-001

CHAPTER 6. INSTRUCTION DESCRIPTIONS

Intel® C/C++ Compiler Intrinsic Equivalent
void _mm512_extpackstorelo_epi32(void*, __m512i, _MM_DOWNCONV_EPI32_ENUM, int);
void _mm512_mask_extpackstorelo_epi32(void*, __mmask16, __m512i, _MM_DOWNCONV_EPI32_ENUM, int);
void _mm512_packstorelo_epi32(void*, __m512i);
void _mm512_mask_packstorelo_epi32(void*, __mmask16, __m512i);

Exceptions
Real-Address Mode and Virtual-8086
#UD               Instruction not available in these modes

Protected and Compatibility Mode
#UD               Instruction not available in these modes

64 bit Mode
#SS(0)            If a memory address referencing the SS segment is
                  in a non-canonical form.
#GP(0)            If the memory address is in a non-canonical form.
                  If a memory operand linear address is not aligned
                  to element-wise data granularity dictated by the DownConv.
#PF(fault-code)   For a page fault.
#NM               If CR0.TS[bit 3]=1.
#UD               If processor model does not implement the specific instruction.
                  If preceded by any REX, F0, F2, F3, or 66 prefixes.
                  If the first operand is not a memory location.

VPACKSTORELPD - Pack and Store Unaligned Low From Float64 Vector

Opcode
MVEX.512.66.0F38.W1 D1 /r

Instruction
vpackstorelpd mt {k1}, Df64(zmm1)

Description
Pack mask-enabled elements of float64 vector zmm1 to form an unaligned float64 stream,
down-convert it and logically map the stream starting at mt, and store that portion of the
stream that maps to the low 64-byte-aligned portion of the memory destination, under
write-mask.

Description
Packs and down-converts the mask-enabled elements of float64 vector zmm1 into a
float64 stream logically mapped starting at element-aligned address mt, and stores the
low-64-byte elements of that stream (those elements of the stream that map before the
first 64-byte-aligned address following mt, the low cache line in the current
implementation). The length of the stream depends on the number of enabled masks, as
elements disabled by the mask are not added to the stream.
The vpackstorehpd instruction is used to store the part of the stream at or after the first
64-byte-aligned address preceding mt.
The mask is not used as a write-mask for this instruction. Instead, the mask is used as an
element selector, choosing which elements are added to the stream. The one similarity
to a write-mask as used in the rest of this document is that the no-write-mask option
(encoding 0) is available to select a mask of 0xFF for this instruction. For that reason, the
notation and encoding are the same as for a write-mask.
In conjunction with vpackstorehpd, this instruction is useful for packing data into a
queue. Also in conjunction with vpackstorehpd, it allows unaligned vector stores (that
is, vector stores that are only element-wise, not vector-wise, aligned); just use a mask of
0xFF or no write-mask for this purpose. The typical instruction sequence to perform an
unaligned vector store would be:
// assume memory location is pointed to by register rax
vpackstorelpd [rax]    {k1}, v0
vpackstorehpd [rax+64] {k1}, v0
This instruction does not have subset support.
This instruction has special disp8*N and alignment rules. N is considered to be the size
of a single vector element after down-conversion.
Note that the address reported by a page fault is the beginning of the 64-byte cache line
containing the memory operand.

Operation
storeOffset = 0
downSize = DownConvStoreSizeOff64(SSS[2:0])
for (n = 0; n < 8; n++) {
    if (k1[n] != 0) {
        i = 64*n
        tmp = DownConvStoref64(zmm1[i+63:i], SSS[2:0])
        if (downSize == 8) {
            MemStore(mt + 8*storeOffset) = tmp[63:0]
        }
        storeOffset++
        if (((mt + downSize*storeOffset) % 64) == 0) {
            break
        }
    }
}

SIMD Floating-Point Exceptions
None.

Memory Down-conversion: Df64
S2 S1 S0   Function:        Usage    disp8*N
000        no conversion    zmm1     8
001        reserved         N/A      N/A
010        reserved         N/A      N/A
011        reserved         N/A      N/A
100        reserved         N/A      N/A
101        reserved         N/A      N/A
110        reserved         N/A      N/A
111        reserved         N/A      N/A

Intel® C/C++ Compiler Intrinsic Equivalent
void _mm512_extpackstorelo_pd(void*, __m512d, _MM_DOWNCONV_PD_ENUM, int);
void _mm512_mask_extpackstorelo_pd(void*, __mmask8, __m512d, _MM_DOWNCONV_PD_ENUM, int);
void _mm512_packstorelo_pd(void*, __m512d);
void _mm512_mask_packstorelo_pd(void*, __mmask8, __m512d);

Exceptions
Real-Address Mode and Virtual-8086
#UD               Instruction not available in these modes

Protected and Compatibility Mode
#UD               Instruction not available in these modes

64 bit Mode
#SS(0)            If a memory address referencing the SS segment is
                  in a non-canonical form.
#GP(0)            If the memory address is in a non-canonical form.
                  If a memory operand linear address is not aligned
                  to element-wise data granularity dictated by the DownConv.
#PF(fault-code)   For a page fault.
#NM               If CR0.TS[bit 3]=1.
#UD               If processor model does not implement the specific instruction.
                  If preceded by any REX, F0, F2, F3, or 66 prefixes.
                  If the first operand is not a memory location.

VPACKSTORELPS - Pack and Store Unaligned Low From Float32 Vector

Opcode
MVEX.512.66.0F38.W0 D1 /r

Instruction
vpackstorelps mt {k1}, Df32(zmm1)

Description
Pack mask-enabled elements of float32 vector zmm1 to form an unaligned float32 stream,
down-convert it and logically map the stream starting at mt, and store that portion of the
stream that maps to the low 64-byte-aligned portion of the memory destination, under
write-mask.

Description
Packs and down-converts the mask-enabled elements of float32 vector zmm1 into a
byte/word/doubleword stream logically mapped starting at element-aligned address mt,
and stores the low-64-byte elements of that stream (those elements of the stream that
map before the first 64-byte-aligned address following mt, the low cache line in the
current implementation). The length of the stream depends on the number of enabled masks,
as elements disabled by the mask are not added to the stream.
The vpackstorehps instruction is used to store the part of the stream at or after the first
64-byte-aligned address preceding mt.
The mask is not used as a write-mask for this instruction. Instead, the mask is used as an
element selector, choosing which elements are added to the stream. The one similarity
to a write-mask as used in the rest of this document is that the no-write-mask option
(encoding 0) is available to select a mask of 0xFFFF for this instruction. For that reason,
the notation and encoding are the same as for a write-mask.
In conjunction with vpackstorehps, this instruction is useful for packing data into a
queue. Also in conjunction with vpackstorehps, it allows unaligned vector stores (that
is, vector stores that are only element-wise, not vector-wise, aligned); just use a mask of
0xFFFF or no write-mask for this purpose. The typical instruction sequence to perform
an unaligned vector store would be:
// assume memory location is pointed to by register rax
vpackstorelps [rax]    {k1}, v0
vpackstorehps [rax+64] {k1}, v0
This instruction does not have subset support.
This instruction has special disp8*N and alignment rules. N is considered to be the size
of a single vector element after down-conversion.
Note that the address reported by a page fault is the beginning of the 64-byte cache line
containing the memory operand.

Operation
storeOffset = 0
downSize = DownConvStoreSizeOff32(SSS[2:0])
for (n = 0; n < 16; n++) {
    if (k1[n] != 0) {
        i = 32*n
        tmp = DownConvStoref32(zmm1[i+31:i], SSS[2:0])
        if (downSize == 4) {
            MemStore(mt + 4*storeOffset) = tmp[31:0]
        } else if (downSize == 2) {
            MemStore(mt + 2*storeOffset) = tmp[15:0]
        } else if (downSize == 1) {
            MemStore(mt + storeOffset) = tmp[7:0]
        }
        storeOffset++
        if (((mt + downSize*storeOffset) % 64) == 0) {
            break
        }
    }
}

SIMD Floating-Point Exceptions
Overflow, Underflow, Invalid, Precision, Denormal.

Memory Down-conversion: Df32
S2 S1 S0   Function:             Usage             disp8*N
000        no conversion         zmm1              4
001        reserved              N/A               N/A
010        reserved              N/A               N/A
011        float32 to float16    zmm1 {float16}    2
100        float32 to uint8      zmm1 {uint8}      1
101        float32 to sint8      zmm1 {sint8}      1
110        float32 to uint16     zmm1 {uint16}     2
111        float32 to sint16     zmm1 {sint16}     2
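
The disp8*N column doubles as the stored element size (downSize in the Operation), which determines how many stream elements fit before the 64-byte boundary. The sketch below derives that count from the table; the helper names df32_store_size and lo_capacity are hypothetical, and the formula is only a reading of the Operation pseudocode, not a documented interface.

```c
#include <assert.h>
#include <stdint.h>

/* Stored element size in bytes for each Df32 SSS encoding, copied from the
 * disp8*N column of the table; 0 marks a reserved encoding. */
static int df32_store_size(unsigned sss)
{
    static const int size[8] = { 4, 0, 0, 2, 1, 1, 2, 2 };
    return size[sss & 7u];
}

/* Hypothetical helper: the largest number of stream elements the low part
 * can hold before the 64-byte boundary following addr, capped at the 16
 * elements of the source vector. */
static int lo_capacity(uint64_t addr, unsigned sss)
{
    int down_size = df32_store_size(sss);
    assert(down_size != 0);                      /* reserved encoding */
    assert(addr % (uint64_t)down_size == 0);     /* element-aligned */
    int room = (int)((64 - addr % 64) / (uint64_t)down_size);
    return room < 16 ? room : 16;
}
```

For example, at an address 8 bytes below a 64-byte boundary the low part holds two unconverted float32 elements but four float16 elements.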

Intel® C/C++ Compiler Intrinsic Equivalent
void _mm512_extpackstorelo_ps(void*, __m512, _MM_DOWNCONV_PS_ENUM, int);
void _mm512_mask_extpackstorelo_ps(void*, __mmask16, __m512, _MM_DOWNCONV_PS_ENUM, int);
void _mm512_packstorelo_ps(void*, __m512);
void _mm512_mask_packstorelo_ps(void*, __mmask16, __m512);

Exceptions
Real-Address Mode and Virtual-8086
#UD               Instruction not available in these modes

Protected and Compatibility Mode
#UD               Instruction not available in these modes

64 bit Mode
#SS(0)            If a memory address referencing the SS segment is
                  in a non-canonical form.
#GP(0)            If the memory address is in a non-canonical form.
                  If a memory operand linear address is not aligned
                  to element-wise data granularity dictated by the DownConv.
#PF(fault-code)   For a page fault.
#NM               If CR0.TS[bit 3]=1.
#UD               If processor model does not implement the specific instruction.
                  If preceded by any REX, F0, F2, F3, or 66 prefixes.
                  If the first operand is not a memory location.

VPACKSTORELQ - Pack and Store Unaligned Low From Int64 Vector

Opcode
MVEX.512.66.0F38.W1 D0 /r

Instruction
vpackstorelq mt {k1}, Di64(zmm1)

Description
Pack mask-enabled elements of int64 vector zmm1 to form an unaligned int64 stream,
down-convert it and logically map the stream starting at mt, and store that portion of the
stream that maps to the low 64-byte-aligned portion of the memory destination, under
write-mask.

Description
Packs and down-converts the mask-enabled elements of int64 vector zmm1 into an int64
stream logically mapped starting at element-aligned address mt, and stores the
low-64-byte elements of that stream (those elements of the stream that map before the
first 64-byte-aligned address following mt, the low cache line in the current
implementation). The length of the stream depends on the number of enabled masks, as
elements disabled by the mask are not added to the stream.
The vpackstorehq instruction is used to store the part of the stream at or after the first
64-byte-aligned address preceding mt.
The mask is not used as a write-mask for this instruction. Instead, the mask is used as an
element selector, choosing which elements are added to the stream. The one similarity
to a write-mask as used in the rest of this document is that the no-write-mask option
(encoding 0) is available to select a mask of 0xFF for this instruction. For that reason, the
notation and encoding are the same as for a write-mask.
In conjunction with vpackstorehq, this instruction is useful for packing data into a
queue. Also in conjunction with vpackstorehq, it allows unaligned vector stores (that is,
vector stores that are only element-wise, not vector-wise, aligned); just use a mask of
0xFF or no write-mask for this purpose. The typical instruction sequence to perform an
unaligned vector store would be:
// assume memory location is pointed to by register rax
vpackstorelq [rax]    {k1}, v0
vpackstorehq [rax+64] {k1}, v0
This instruction does not have subset support.
This instruction has special disp8*N and alignment rules. N is considered to be the size
of a single vector element after down-conversion.
Note that the address reported by a page fault is the beginning of the 64-byte cache line
containing the memory operand.

Operation
storeOffset = 0
downSize = DownConvStoreSizeOfi64(SSS[2:0])
for (n = 0; n < 8; n++) {
    if (k1[n] != 0) {
        i = 64*n
        tmp = DownConvStorei64(zmm1[i+63:i], SSS[2:0])
        if (downSize == 8) {
            MemStore(mt + 8*storeOffset) = tmp[63:0]
        }
        storeOffset++
        if (((mt + downSize*storeOffset) % 64) == 0) {
            break
        }
    }
}

Flags Affected
None.

Memory Down-conversion: Di64
S2 S1 S0   Function:        Usage    disp8*N
000        no conversion    zmm1     8
001        reserved         N/A      N/A
010        reserved         N/A      N/A
011        reserved         N/A      N/A
100        reserved         N/A      N/A
101        reserved         N/A      N/A
110        reserved         N/A      N/A
111        reserved         N/A      N/A

Intel® C/C++ Compiler Intrinsic Equivalent
void _mm512_extpackstorelo_epi64(void*, __m512i, _MM_DOWNCONV_EPI64_ENUM, int);
void _mm512_mask_extpackstorelo_epi64(void*, __mmask8, __m512i, _MM_DOWNCONV_EPI64_ENUM, int);
void _mm512_packstorelo_epi64(void*, __m512i);
void _mm512_mask_packstorelo_epi64(void*, __mmask8, __m512i);

Exceptions
Real-Address Mode and Virtual-8086
#UD               Instruction not available in these modes

Protected and Compatibility Mode
#UD               Instruction not available in these modes

64 bit Mode
#SS(0)            If a memory address referencing the SS segment is
                  in a non-canonical form.
#GP(0)            If the memory address is in a non-canonical form.
                  If a memory operand linear address is not aligned
                  to element-wise data granularity dictated by the DownConv.
#PF(fault-code)   For a page fault.
#NM               If CR0.TS[bit 3]=1.
#UD               If processor model does not implement the specific instruction.
                  If preceded by any REX, F0, F2, F3, or 66 prefixes.
                  If the first operand is not a memory location.

VPADCD - Add Int32 Vectors with Carry

Opcode
MVEX.NDS.512.66.0F38.W0 5C /r

Instruction
vpadcd zmm1 {k1}, k2, Si32(zmm3/mt)

Description
Add int32 vector Si32(zmm3/mt), vector mask register k2 and int32 vector zmm1 and store
the result in zmm1, and the carry of the sum in k2, under write-mask.

Description
Performs an element-by-element three-input addition between int32 vector zmm1, the
int32 vector result of the swizzle/broadcast/conversion process on memory or int32 vector zmm3, and the corresponding bit of k2. The result is written into int32 vector zmm1.
In addition, the carry from the sum for the n-th element is written into the n-th bit of
vector mask k2.
This instruction is write-masked, so only those elements with the corresponding bit set in
vector mask register k1 are computed and stored into zmm1 and k2. Elements in zmm1
and k2 with the corresponding bit clear in k1 retain their previous value.

Operation
if (source is a register operand and MVEX.EH bit is 1) {
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    tmpSrc3[511:0] = SwizzUpConvLoadi32(zmm3/mt)
}
for (n = 0; n < 16; n++) {
    if (k1[n] != 0) {
        i = 32*n
        // integer operation
        tmpCarry = Carry(zmm1[i+31:i] + k2[n] + tmpSrc3[i+31:i])
        zmm1[i+31:i] = zmm1[i+31:i] + k2[n] + tmpSrc3[i+31:i]
        k2[n] = tmpCarry
    }
}
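
The three-input addition can be modelled per element with a 64-bit intermediate so the carry-out is directly visible. This is a minimal sketch of the Operation for a register source with no swizzle; padcd_model and its array-based operands are hypothetical stand-ins for the zmm and mask registers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar model of vpadcd: for each element enabled in k1,
 * dst[n] = dst[n] + src[n] + k2[n], and k2[n] receives the carry-out.
 * Elements disabled in k1 leave both dst[n] and k2[n] unchanged. */
static void padcd_model(uint32_t dst[16], uint16_t k1, uint16_t *k2,
                        const uint32_t src[16])
{
    assert(k2 != NULL);
    for (int n = 0; n < 16; n++) {
        if ((k1 >> n) & 1) {
            uint16_t bit = (uint16_t)(1u << n);
            /* Widen to 64 bits so bit 32 holds the carry-out. */
            uint64_t wide = (uint64_t)dst[n] + src[n] + ((*k2 >> n) & 1u);
            dst[n] = (uint32_t)wide;
            if (wide >> 32)
                *k2 |= bit;
            else
                *k2 &= (uint16_t)~bit;
        }
    }
}
```

Because the carry-in and carry-out share k2, chaining such additions over successive 32-bit limbs gives one possible way to build wider integer adds, one limb per instruction.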

Flags Affected
None.

Memory Up-conversion: Si32
S2 S1 S0   Function:                    Usage                      disp8*N
000        no conversion                [rax] {16to16} or [rax]    64
001        broadcast 1 element (x16)    [rax] {1to16}              4
010        broadcast 4 elements (x4)    [rax] {4to16}              16
011        reserved                     N/A                        N/A
100        uint8 to uint32              [rax] {uint8}              16
101        sint8 to sint32              [rax] {sint8}              16
110        uint16 to uint32             [rax] {uint16}             32
111        sint16 to sint32             [rax] {sint16}             32

Register Swizzle: Si32
MVEX.EH=0
S2 S1 S0   Function: 4 x 32 bits    Usage
000        no swizzle               zmm0 or zmm0 {dcba}
001        swap (inner) pairs       zmm0 {cdab}
010        swap with two-away       zmm0 {badc}
011        cross-product swizzle    zmm0 {dacb}
100        broadcast a element      zmm0 {aaaa}
101        broadcast b element      zmm0 {bbbb}
110        broadcast c element      zmm0 {cccc}
111        broadcast d element      zmm0 {dddd}
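
The swizzle patterns in the Usage column can be read as a permutation applied independently to each group of four 32-bit elements. The sketch below assumes the pattern string is written from the most significant element of the group down to the least significant, with a naming element 0; that reading and the function name swizzle_i32 are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of a register swizzle on sixteen 32-bit elements.
 * pattern is one of the four-letter strings from the table, e.g. "cdab".
 * pattern[0] names the source of element 3 in each group of four and
 * pattern[3] names the source of element 0. */
static void swizzle_i32(uint32_t dst[16], const uint32_t src[16],
                        const char *pattern)
{
    for (int group = 0; group < 4; group++) {
        for (int e = 0; e < 4; e++) {
            char c = pattern[3 - e];
            assert(c >= 'a' && c <= 'd');
            dst[4 * group + e] = src[4 * group + (c - 'a')];
        }
    }
}
```

Under this reading, {dacb} moves elements b, c, a into positions 0, 1, 2 of each group, which is the rotation a three-component cross product needs.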

Intel® C/C++ Compiler Intrinsic Equivalent
__m512i _mm512_adc_epi32(__m512i, __mmask16, __m512i, __mmask16*);
__m512i _mm512_mask_adc_epi32(__m512i, __mmask16, __mmask16, __m512i, __mmask16*);

Exceptions
Real-Address Mode and Virtual-8086
#UD               Instruction not available in these modes

Protected and Compatibility Mode
#UD               Instruction not available in these modes

64 bit Mode
#SS(0)            If a memory address referencing the SS segment is
                  in a non-canonical form.
#GP(0)            If the memory address is in a non-canonical form.
                  If a memory operand linear address is not aligned
                  to the data size granularity dictated by SwizzUpConv
                  mode.
#PF(fault-code)   For a page fault.
#NM               If CR0.TS[bit 3]=1.
#UD               If processor model does not implement the specific instruction.
                  If preceded by any REX, F0, F2, F3, or 66 prefixes.

VPADDD - Add Int32 Vectors

Opcode
MVEX.NDS.512.66.0F.W0 FE /r

Instruction
vpaddd zmm1 {k1}, zmm2, Si32(zmm3/mt)

Description
Add int32 vector zmm2 and int32 vector Si32(zmm3/mt) and store the result in zmm1,
under write-mask.

Description
Performs an element-by-element addition between int32 vector zmm2 and the int32 vector result of the swizzle/broadcast/conversion process on memory or int32 vector zmm3.
The result is written into int32 vector zmm1.
This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.

Operation
if (source is a register operand and MVEX.EH bit is 1) {
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    tmpSrc3[511:0] = SwizzUpConvLoadi32(zmm3/mt)
}
for (n = 0; n < 16; n++) {
    if (k1[n] != 0) {
        i = 32*n
        // integer operation
        zmm1[i+31:i] = zmm2[i+31:i] + tmpSrc3[i+31:i]
    }
}
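
A scalar model makes the write-mask behavior concrete: disabled elements of the destination are left untouched rather than zeroed. This is a minimal sketch assuming a register source with no swizzle; paddd_model is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar model of vpaddd under write-mask k1. Unsigned
 * arithmetic is used so the 32-bit wraparound is well defined in C. */
static void paddd_model(uint32_t dst[16], uint16_t k1,
                        const uint32_t a[16], const uint32_t b[16])
{
    assert(dst != NULL);
    for (int n = 0; n < 16; n++) {
        if ((k1 >> n) & 1)
            dst[n] = a[n] + b[n];   /* wraps modulo 2^32 */
        /* else: dst[n] keeps its previous value */
    }
}
```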

Flags Affected
None.

Memory Up-conversion: Si32
S2 S1 S0   Function:                    Usage                      disp8*N
000        no conversion                [rax] {16to16} or [rax]    64
001        broadcast 1 element (x16)    [rax] {1to16}              4
010        broadcast 4 elements (x4)    [rax] {4to16}              16
011        reserved                     N/A                        N/A
100        uint8 to uint32              [rax] {uint8}              16
101        sint8 to sint32              [rax] {sint8}              16
110        uint16 to uint32             [rax] {uint16}             32
111        sint16 to sint32             [rax] {sint16}             32
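
The first three rows of the table differ only in how many int32 elements are read from memory and how they are replicated. The sketch below models those three modes and returns the number of bytes read, which matches the disp8*N column; the enum load_mode and the function load_i32 are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names for the first three Si32 memory modes. */
typedef enum { MODE_16TO16, MODE_1TO16, MODE_4TO16 } load_mode;

/* Fills the 16-element vector dst from mem according to mode and returns
 * the number of bytes consumed from memory. */
static size_t load_i32(uint32_t dst[16], const uint32_t *mem, load_mode mode)
{
    assert(mem != NULL);
    for (int n = 0; n < 16; n++) {
        switch (mode) {
        case MODE_16TO16: dst[n] = mem[n];     break;  /* full vector */
        case MODE_1TO16:  dst[n] = mem[0];     break;  /* one element x16 */
        case MODE_4TO16:  dst[n] = mem[n % 4]; break;  /* four elements x4 */
        }
    }
    return mode == MODE_16TO16 ? 64 : (mode == MODE_1TO16 ? 4 : 16);
}
```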

Register Swizzle: Si32
MVEX.EH=0
S2 S1 S0   Function: 4 x 32 bits    Usage
000        no swizzle               zmm0 or zmm0 {dcba}
001        swap (inner) pairs       zmm0 {cdab}
010        swap with two-away       zmm0 {badc}
011        cross-product swizzle    zmm0 {dacb}
100        broadcast a element      zmm0 {aaaa}
101        broadcast b element      zmm0 {bbbb}
110        broadcast c element      zmm0 {cccc}
111        broadcast d element      zmm0 {dddd}

Intel® C/C++ Compiler Intrinsic Equivalent
__m512i _mm512_add_epi32(__m512i, __m512i);
__m512i _mm512_mask_add_epi32(__m512i, __mmask16, __m512i, __m512i);

Exceptions
Real-Address Mode and Virtual-8086
#UD               Instruction not available in these modes

Protected and Compatibility Mode
#UD               Instruction not available in these modes

64 bit Mode
#SS(0)            If a memory address referencing the SS segment is
                  in a non-canonical form.
#GP(0)            If the memory address is in a non-canonical form.
                  If a memory operand linear address is not aligned
                  to the data size granularity dictated by SwizzUpConv
                  mode.
#PF(fault-code)   For a page fault.
#NM               If CR0.TS[bit 3]=1.
#UD               If preceded by any REX, F0, F2, F3, or 66 prefixes.

VPADDSETCD - Add Int32 Vectors and Set Mask to Carry

Opcode
MVEX.NDS.512.66.0F38.W0 5D /r

Instruction
vpaddsetcd zmm1 {k1}, k2, Si32(zmm3/mt)

Description
Add int32 vector zmm1 and int32 vector Si32(zmm3/mt) and store the sum in zmm1
and the carry from the sum in k2, under write-mask.

Description
Performs an element-by-element addition between int32 vector zmm1 and the int32 vector result of the swizzle/broadcast/conversion process on memory or int32 vector zmm3.
The result is written into int32 vector zmm1.
In addition, the carry from the sum for the n-th element is written into the n-th bit of
vector mask k2.
This instruction is write-masked, so only those elements with the corresponding bit set in
vector mask register k1 are computed and stored into zmm1 and k2. Elements in zmm1
and k2 with the corresponding bit clear in k1 retain their previous value.

Operation
if (source is a register operand and MVEX.EH bit is 1) {
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    tmpSrc3[511:0] = SwizzUpConvLoadi32(zmm3/mt)
}
for (n = 0; n < 16; n++) {
    if (k1[n] != 0) {
        i = 32*n
        // integer operation
        k2[n] = Carry(zmm1[i+31:i] + tmpSrc3[i+31:i])
        zmm1[i+31:i] = zmm1[i+31:i] + tmpSrc3[i+31:i]
    }
}
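
In scalar C the Carry() of a two-input 32-bit add can be detected without a wider type: an unsigned sum that is smaller than one of its addends must have wrapped. The sketch below uses that test to model the Operation for a register source with no swizzle; paddsetcd_model is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar model of vpaddsetcd: for each element enabled in k1,
 * dst[n] += src[n] and k2[n] is set to the carry-out of that addition.
 * Elements disabled in k1 leave dst[n] and k2[n] unchanged. */
static void paddsetcd_model(uint32_t dst[16], uint16_t k1, uint16_t *k2,
                            const uint32_t src[16])
{
    assert(k2 != NULL);
    for (int n = 0; n < 16; n++) {
        if ((k1 >> n) & 1) {
            uint16_t bit = (uint16_t)(1u << n);
            uint32_t sum = dst[n] + src[n];
            /* A wrapped unsigned sum is smaller than either addend. */
            if (sum < src[n])
                *k2 |= bit;
            else
                *k2 &= (uint16_t)~bit;
            dst[n] = sum;
        }
    }
}
```

Unlike vpadcd, the previous contents of k2 are not an input here; they are only overwritten for enabled elements.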

Flags Affected
None.

Memory Up-conversion: Si32
S2 S1 S0   Function:                    Usage                      disp8*N
000        no conversion                [rax] {16to16} or [rax]    64
001        broadcast 1 element (x16)    [rax] {1to16}              4
010        broadcast 4 elements (x4)    [rax] {4to16}              16
011        reserved                     N/A                        N/A
100        uint8 to uint32              [rax] {uint8}              16
101        sint8 to sint32              [rax] {sint8}              16
110        uint16 to uint32             [rax] {uint16}             32
111        sint16 to sint32             [rax] {sint16}             32

Register Swizzle: Si32
MVEX.EH=0
S2 S1 S0   Function: 4 x 32 bits    Usage
000        no swizzle               zmm0 or zmm0 {dcba}
001        swap (inner) pairs       zmm0 {cdab}
010        swap with two-away       zmm0 {badc}
011        cross-product swizzle    zmm0 {dacb}
100        broadcast a element      zmm0 {aaaa}
101        broadcast b element      zmm0 {bbbb}
110        broadcast c element      zmm0 {cccc}
111        broadcast d element      zmm0 {dddd}

Intel® C/C++ Compiler Intrinsic Equivalent
__m512i _mm512_addsetc_epi32(__m512i, __m512i, __mmask16*);
__m512i _mm512_mask_addsetc_epi32(__m512i, __mmask16, __mmask16, __m512i, __mmask16*);

Exceptions
Real-Address Mode and Virtual-8086
#UD               Instruction not available in these modes

Protected and Compatibility Mode
#UD               Instruction not available in these modes

64 bit Mode
#SS(0)            If a memory address referencing the SS segment is
                  in a non-canonical form.
#GP(0)            If the memory address is in a non-canonical form.
                  If a memory operand linear address is not aligned
                  to the data size granularity dictated by SwizzUpConv
                  mode.
#PF(fault-code)   For a page fault.
#NM               If CR0.TS[bit 3]=1.
#UD               If processor model does not implement the specific instruction.
                  If preceded by any REX, F0, F2, F3, or 66 prefixes.

VPADDSETSD - Add Int32 Vectors and Set Mask to Sign

Opcode
MVEX.NDS.512.66.0F38.W0 CD /r

Instruction
vpaddsetsd zmm1 {k1}, zmm2, Si32(zmm3/mt)

Description
Add int32 vector zmm2 and int32 vector Si32(zmm3/mt) and store the sum in zmm1 and the
sign from the sum in k1, under write-mask.

Description
Performs an element-by-element addition between int32 vector zmm2 and the int32 vector result of the swizzle/broadcast/conversion process on memory or int32 vector zmm3.
The result is written into int32 vector zmm1.
In addition, the sign of the result for the n-th element is written into the n-th bit of vector
mask k1.
This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.

Operation
if (source is a register operand and MVEX.EH bit is 1) {
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    tmpSrc3[511:0] = SwizzUpConvLoadi32(zmm3/mt)
}
for (n = 0; n < 16; n++) {
    if (k1[n] != 0) {
        i = 32*n
        // signed integer operation
        zmm1[i+31:i] = zmm2[i+31:i] + tmpSrc3[i+31:i]
        k1[n] = zmm1[i+31]
    }
}
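
Because k1 is both the write-mask and the destination for the sign bits, a bit that enters as 0 stays 0, and a bit that enters as 1 ends up equal to the sign of that element's sum. The sketch below models this for a register source with no swizzle; paddsetsd_model is a hypothetical name, and unsigned arithmetic is used only to keep the wraparound well defined in C.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar model of vpaddsetsd: k1 is read as the write-mask
 * and then updated, per enabled element, with bit 31 of the sum. */
static void paddsetsd_model(uint32_t dst[16], uint16_t *k1,
                            const uint32_t a[16], const uint32_t b[16])
{
    assert(k1 != NULL);
    uint16_t enabled = *k1;
    for (int n = 0; n < 16; n++) {
        if ((enabled >> n) & 1) {
            dst[n] = a[n] + b[n];
            /* The bit is already 1; clear it when the sum is non-negative. */
            if ((dst[n] >> 31) == 0)
                *k1 &= (uint16_t)~(1u << n);
        }
    }
}
```

After the call, k1 therefore identifies exactly the enabled elements whose sum is negative.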

Flags Affected
None.

Memory Up-conversion: Si32
S2 S1 S0   Function:                    Usage                      disp8*N
000        no conversion                [rax] {16to16} or [rax]    64
001        broadcast 1 element (x16)    [rax] {1to16}              4
010        broadcast 4 elements (x4)    [rax] {4to16}              16
011        reserved                     N/A                        N/A
100        uint8 to uint32              [rax] {uint8}              16
101        sint8 to sint32              [rax] {sint8}              16
110        uint16 to uint32             [rax] {uint16}             32
111        sint16 to sint32             [rax] {sint16}             32

Register Swizzle: Si32
MVEX.EH=0
S2 S1 S0   Function: 4 x 32 bits    Usage
000        no swizzle               zmm0 or zmm0 {dcba}
001        swap (inner) pairs       zmm0 {cdab}
010        swap with two-away       zmm0 {badc}
011        cross-product swizzle    zmm0 {dacb}
100        broadcast a element      zmm0 {aaaa}
101        broadcast b element      zmm0 {bbbb}
110        broadcast c element      zmm0 {cccc}
111        broadcast d element      zmm0 {dddd}

Intel® C/C++ Compiler Intrinsic Equivalent
__m512i _mm512_addsets_epi32(__m512i, __m512i, __mmask16*);
__m512i _mm512_mask_addsets_epi32(__m512i, __mmask16, __m512i, __m512i, __mmask16*);

Exceptions
Real-Address Mode and Virtual-8086
#UD               Instruction not available in these modes

Protected and Compatibility Mode
#UD               Instruction not available in these modes

64 bit Mode
#SS(0)            If a memory address referencing the SS segment is
                  in a non-canonical form.
#GP(0)            If the memory address is in a non-canonical form.
                  If any memory operand linear address is not aligned to 4-byte
                  data granularity.
#PF(fault-code)   For a page fault.
#NM               If CR0.TS[bit 3]=1.
#UD               If preceded by any REX, F0, F2, F3, or 66 prefixes.
                  If no write mask is provided or selected write-mask is k0.

VPANDD - Bitwise AND Int32 Vectors

Opcode
MVEX.NDS.512.66.0F.W0 DB /r

Instruction
vpandd zmm1 {k1}, zmm2, Si32(zmm3/mt)

Description
Perform a bitwise AND between int32 vector zmm2 and int32 vector Si32(zmm3/mt) and
store the result in zmm1, under write-mask.

Description
Performs an element-by-element bitwise AND between int32 vector zmm2 and the int32
vector result of the swizzle/broadcast/conversion process on memory or int32 vector
zmm3. The result is written into int32 vector zmm1.
This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.

Operation
if (source is a register operand and MVEX.EH bit is 1) {
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    tmpSrc3[511:0] = SwizzUpConvLoadi32(zmm3/mt)
}
for (n = 0; n < 16; n++) {
    if (k1[n] != 0) {
        i = 32*n
        zmm1[i+31:i] = zmm2[i+31:i] & tmpSrc3[i+31:i]
    }
}

Flags Affected
None.

Memory Up-conversion: Si32
S2 S1 S0   Function:                    Usage                      disp8*N
000        no conversion                [rax] {16to16} or [rax]    64
001        broadcast 1 element (x16)    [rax] {1to16}              4
010        broadcast 4 elements (x4)    [rax] {4to16}              16
011        reserved                     N/A                        N/A
100        uint8 to uint32              [rax] {uint8}              16
101        sint8 to sint32              [rax] {sint8}              16
110        uint16 to uint32             [rax] {uint16}             32
111        sint16 to sint32             [rax] {sint16}             32

Register Swizzle: Si32
MVEX.EH=0
S2 S1 S0   Function: 4 x 32 bits    Usage
000        no swizzle               zmm0 or zmm0 {dcba}
001        swap (inner) pairs       zmm0 {cdab}
010        swap with two-away       zmm0 {badc}
011        cross-product swizzle    zmm0 {dacb}
100        broadcast a element      zmm0 {aaaa}
101        broadcast b element      zmm0 {bbbb}
110        broadcast c element      zmm0 {cccc}
111        broadcast d element      zmm0 {dddd}

Intel® C/C++ Compiler Intrinsic Equivalent
__m512i _mm512_and_epi32(__m512i, __m512i);
__m512i _mm512_mask_and_epi32(__m512i, __mmask16, __m512i, __m512i);

Exceptions
Real-Address Mode and Virtual-8086
#UD               Instruction not available in these modes

Protected and Compatibility Mode
#UD               Instruction not available in these modes

64 bit Mode
#SS(0)            If a memory address referencing the SS segment is
                  in a non-canonical form.
#GP(0)            If the memory address is in a non-canonical form.
                  If a memory operand linear address is not aligned
                  to the data size granularity dictated by SwizzUpConv
                  mode.
#PF(fault-code)   For a page fault.
#NM               If CR0.TS[bit 3]=1.
#UD               If preceded by any REX, F0, F2, F3, or 66 prefixes.

VPANDND - Bitwise AND NOT Int32 Vectors

Opcode
MVEX.NDS.512.66.0F.W0 DF /r

Instruction
vpandnd zmm1 {k1}, zmm2, Si32(zmm3/mt)

Description
Perform a bitwise AND between NOT int32 vector zmm2 and int32 vector Si32(zmm3/mt) and
store the result in zmm1, under write-mask.

Description
Performs an element-by-element bitwise AND between NOT int32 vector zmm2 and the
int32 vector result of the swizzle/broadcast/conversion process on memory or int32 vector zmm3. The result is written into int32 vector zmm1.
This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.

Operation
if (source is a register operand and MVEX.EH bit is 1) {
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    tmpSrc3[511:0] = SwizzUpConvLoadi32(zmm3/mt)
}
for (n = 0; n < 16; n++) {
    if (k1[n] != 0) {
        i = 32*n
        zmm1[i+31:i] = (~(zmm2[i+31:i])) & tmpSrc3[i+31:i]
    }
}
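
The operand order matters for AND NOT: the complement is applied to the first source (zmm2), not to the swizzled or converted operand. A minimal scalar sketch, assuming a register source with no swizzle and using the hypothetical name pandnd_model:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar model of vpandnd under write-mask k1:
 * dst[n] = (~a[n]) & b[n] for enabled elements, where a plays the role of
 * zmm2 and b the role of the third operand. */
static void pandnd_model(uint32_t dst[16], uint16_t k1,
                         const uint32_t a[16], const uint32_t b[16])
{
    assert(dst != NULL);
    for (int n = 0; n < 16; n++) {
        if ((k1 >> n) & 1)
            dst[n] = ~a[n] & b[n];
    }
}
```

Read this way, a acts as a per-bit "clear" mask applied to b.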

Flags Affected
None.

Memory Up-conversion: Si32
S2 S1 S0   Function:                    Usage                      disp8*N
000        no conversion                [rax] {16to16} or [rax]    64
001        broadcast 1 element (x16)    [rax] {1to16}              4
010        broadcast 4 elements (x4)    [rax] {4to16}              16
011        reserved                     N/A                        N/A
100        uint8 to uint32              [rax] {uint8}              16
101        sint8 to sint32              [rax] {sint8}              16
110        uint16 to uint32             [rax] {uint16}             32
111        sint16 to sint32             [rax] {sint16}             32

Register Swizzle: Si32
MVEX.EH=0
S2 S1 S0   Function: 4 x 32 bits    Usage
000        no swizzle               zmm0 or zmm0 {dcba}
001        swap (inner) pairs       zmm0 {cdab}
010        swap with two-away       zmm0 {badc}
011        cross-product swizzle    zmm0 {dacb}
100        broadcast a element      zmm0 {aaaa}
101        broadcast b element      zmm0 {bbbb}
110        broadcast c element      zmm0 {cccc}
111        broadcast d element      zmm0 {dddd}

Intel® C/C++ Compiler Intrinsic Equivalent
__m512i _mm512_andnot_epi32(__m512i, __m512i);
__m512i _mm512_mask_andnot_epi32(__m512i, __mmask16, __m512i, __m512i);

Exceptions
Real-Address Mode and Virtual-8086
#UD               Instruction not available in these modes

Protected and Compatibility Mode
#UD               Instruction not available in these modes

64 bit Mode
#SS(0)            If a memory address referencing the SS segment is
                  in a non-canonical form.
#GP(0)            If the memory address is in a non-canonical form.
                  If a memory operand linear address is not aligned
                  to the data size granularity dictated by SwizzUpConv
                  mode.
#PF(fault-code)   For a page fault.
#NM               If CR0.TS[bit 3]=1.
#UD               If preceded by any REX, F0, F2, F3, or 66 prefixes.

VPANDNQ - Bitwise AND NOT Int64 Vectors

Opcode
MVEX.NDS.512.66.0F.W1 DF /r

Instruction
vpandnq zmm1 {k1}, zmm2, Si64(zmm3/mt)

Description
Perform a bitwise AND between NOT int64 vector zmm2 and int64 vector Si64(zmm3/mt) and
store the result in zmm1, under write-mask.

Description
Performs an element-by-element bitwise AND between NOT int64 vector zmm2 and the
int64 vector result of the swizzle/broadcast/conversion process on memory or int64 vector zmm3. The result is written into int64 vector zmm1.
This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.

Operation
if (source is a register operand and MVEX.EH bit is 1) {
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    tmpSrc3[511:0] = SwizzUpConvLoadi64(zmm3/mt)
}
for (n = 0; n < 8; n++) {
    if (k1[n] != 0) {
        i = 64*n
        zmm1[i+63:i] = (~(zmm2[i+63:i])) & tmpSrc3[i+63:i]
    }
}
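
For the 64-bit form the loop runs over eight elements, so only the low eight bits of the mask take part. The sketch below mirrors the Operation for a register source with no swizzle; pandnq_model is a hypothetical name and the mask is passed as an 8-bit value to make that explicit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar model of vpandnq: eight 64-bit elements, one mask
 * bit per element. a plays the role of zmm2 (the complemented operand). */
static void pandnq_model(uint64_t dst[8], uint8_t k1,
                         const uint64_t a[8], const uint64_t b[8])
{
    assert(dst != NULL);
    for (int n = 0; n < 8; n++) {
        if ((k1 >> n) & 1)
            dst[n] = ~a[n] & b[n];
    }
}
```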

Flags Affected
None.

Memory Up-conversion: Si64
S2 S1 S0  Function:                   Usage                    disp8*N
000       no conversion               [rax] {8to8} or [rax]    64
001       broadcast 1 element (x8)    [rax] {1to8}             8
010       broadcast 4 elements (x2)   [rax] {4to8}             32
011       reserved                    N/A                      N/A
100       reserved                    N/A                      N/A
101       reserved                    N/A                      N/A
110       reserved                    N/A                      N/A
111       reserved                    N/A                      N/A

Register Swizzle: Si64
MVEX.EH=0
S2 S1 S0  Function: 4 x 64 bits       Usage
000       no swizzle                  zmm0 or zmm0 {dcba}
001       swap (inner) pairs          zmm0 {cdab}
010       swap with two-away          zmm0 {badc}
011       cross-product swizzle       zmm0 {dacb}
100       broadcast a element         zmm0 {aaaa}
101       broadcast b element         zmm0 {bbbb}
110       broadcast c element         zmm0 {cccc}
111       broadcast d element         zmm0 {dddd}
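
As an illustration of how the swizzle Usage patterns map to element moves, the sketch below applies a four-letter pattern to each group of four 64-bit elements. It assumes that 'a' names the least-significant element of a group and that the pattern lists destination positions from most significant to least significant. swizzle64_model is a hypothetical helper, not an intrinsic.

```c
#include <stdint.h>
#include <string.h>

/* Apply a register swizzle pattern such as "cdab" to eight 64-bit
   elements. The rightmost letter selects the source for position 0
   of each group of four, the leftmost letter for position 3
   (assumed reading of the Usage column). */
void swizzle64_model(uint64_t dst[8], const uint64_t src[8],
                     const char pattern[4])
{
    uint64_t tmp[8];
    for (int lane = 0; lane < 2; lane++) {          /* two groups of 4 x 64 bits */
        for (int pos = 0; pos < 4; pos++) {
            char letter = pattern[3 - pos];         /* rightmost letter -> position 0 */
            tmp[4 * lane + pos] = src[4 * lane + (letter - 'a')];
        }
    }
    memcpy(dst, tmp, sizeof tmp);                   /* staging copy allows dst == src */
}
```

Under this reading, {cdab} exchanges neighbors within each pair and {aaaa} replicates the lowest element of every group.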

Intel® C/C++ Compiler Intrinsic Equivalent
__m512i _mm512_andnot_epi64(__m512i, __m512i);
__m512i _mm512_mask_andnot_epi64(__m512i, __mmask8, __m512i, __m512i);

Exceptions
Real-Address Mode and Virtual-8086
#UD              Instruction not available in these modes

Protected and Compatibility Mode
#UD              Instruction not available in these modes

64 bit Mode
#SS(0)           If a memory address referencing the SS segment is
                 in a non-canonical form.
#GP(0)           If the memory address is in a non-canonical form.
                 If a memory operand linear address is not aligned
                 to the data size granularity dictated by SwizzUpConv
                 mode.
#PF(fault-code)  For a page fault.
#NM              If CR0.TS[bit 3]=1.
                 If preceded by any REX, F0, F2, F3, or 66 prefixes.

VPANDQ - Bitwise AND Int64 Vectors

Opcode        MVEX.NDS.512.66.0F.W1 DB /r
Instruction   vpandq zmm1 {k1}, zmm2, Si64(zmm3/mt)
Description   Perform a bitwise AND between int64 vector zmm2 and
              int64 vector Si64(zmm3/mt) and store the result in zmm1,
              under write-mask.

Description
Performs an element-by-element bitwise AND between int64 vector zmm2 and the int64
vector result of the swizzle/broadcast/conversion process on memory or int64 vector
zmm3. The result is written into int64 vector zmm1.
This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.

Operation
if(source is a register operand and MVEX.EH bit is 1) {
tmpSrc3[511:0] = zmm3[511:0]
} else {
tmpSrc3[511:0] = SwizzUpConvLoadi64 (zmm3/mt )
}
for (n = 0; n < 8; n++) {
if(k1[n] != 0) {
i = 64*n
zmm1[i+63:i] = zmm2[i+63:i] & tmpSrc3[i+63:i]
}
}
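
The memory path of the operand load above can be pictured with a short scalar sketch. It covers only the three defined Si64 memory forms from the up-conversion table. The enum and function names are hypothetical, and the byte counts in the comments restate the disp8*N column on the assumption that N equals the number of bytes read.

```c
#include <stdint.h>

/* Hypothetical names for the three defined Si64 memory forms. */
typedef enum { LOAD_8TO8, LOAD_1TO8, LOAD_4TO8 } load64_t;

/* Build the eight-element operand that the AND then consumes. */
void load_si64_model(uint64_t out[8], const uint64_t *mem, load64_t form)
{
    for (int n = 0; n < 8; n++) {
        switch (form) {
        case LOAD_1TO8: out[n] = mem[0];     break;  /* reads 8 bytes,  {1to8} */
        case LOAD_4TO8: out[n] = mem[n % 4]; break;  /* reads 32 bytes, {4to8} */
        default:        out[n] = mem[n];     break;  /* reads 64 bytes, {8to8} */
        }
    }
}
```

With {4to8}, the four loaded elements are repeated once, so elements 4 to 7 mirror elements 0 to 3.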

Flags Affected
None.

Memory Up-conversion: Si64
S2 S1 S0  Function:                   Usage                    disp8*N
000       no conversion               [rax] {8to8} or [rax]    64
001       broadcast 1 element (x8)    [rax] {1to8}             8
010       broadcast 4 elements (x2)   [rax] {4to8}             32
011       reserved                    N/A                      N/A
100       reserved                    N/A                      N/A
101       reserved                    N/A                      N/A
110       reserved                    N/A                      N/A
111       reserved                    N/A                      N/A

Register Swizzle: Si64
MVEX.EH=0
S2 S1 S0  Function: 4 x 64 bits       Usage
000       no swizzle                  zmm0 or zmm0 {dcba}
001       swap (inner) pairs          zmm0 {cdab}
010       swap with two-away          zmm0 {badc}
011       cross-product swizzle       zmm0 {dacb}
100       broadcast a element         zmm0 {aaaa}
101       broadcast b element         zmm0 {bbbb}
110       broadcast c element         zmm0 {cccc}
111       broadcast d element         zmm0 {dddd}

Intel® C/C++ Compiler Intrinsic Equivalent
__m512i _mm512_and_epi64(__m512i, __m512i);
__m512i _mm512_mask_and_epi64(__m512i, __mmask8, __m512i, __m512i);

Exceptions
Real-Address Mode and Virtual-8086
#UD              Instruction not available in these modes

Protected and Compatibility Mode
#UD              Instruction not available in these modes

64 bit Mode
#SS(0)           If a memory address referencing the SS segment is
                 in a non-canonical form.
#GP(0)           If the memory address is in a non-canonical form.
                 If a memory operand linear address is not aligned
                 to the data size granularity dictated by SwizzUpConv
                 mode.
#PF(fault-code)  For a page fault.
#NM              If CR0.TS[bit 3]=1.
                 If preceded by any REX, F0, F2, F3, or 66 prefixes.

VPBLENDMD - Blend Int32 Vectors using the Instruction Mask

Opcode        MVEX.NDS.512.66.0F38.W0 64 /r
Instruction   vpblendmd zmm1 {k1}, zmm2, Si32(zmm3/mt)
Description   Blend int32 vector zmm2 and int32 vector Si32(zmm3/mt)
              and store the result in zmm1, under write-mask.

Description
Performs an element-by-element blending between int32 vector zmm2 and the int32 vector
result of the swizzle/broadcast/conversion process on memory or int32 vector zmm3,
using the instruction mask as selector. The result is written into int32 vector zmm1.
The mask is not used as a write-mask for this instruction. Instead, the mask is used as an
element selector: every element of the destination is conditionally selected between the first
source and the second source using the value of the related mask bit (0 for first source, 1 for
second source).

Operation
if(source is a register operand and MVEX.EH bit is 1) {
tmpSrc3[511:0] = zmm3[511:0]
} else {
tmpSrc3[511:0] = SwizzUpConvLoadi32(zmm3/mt)
}
for (n = 0; n < 16; n++) {
i = 32*n
if(k1[n]==1 or *no write-mask*) {
zmm1[i+31:i] = tmpSrc3[i+31:i]
} else {
zmm1[i+31:i] = zmm2[i+31:i]
}
}
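
The selector role of the mask can be contrasted with ordinary write-masking in a few lines of C. This is a scalar sketch under the assumption that the third operand has already been swizzled or up-converted; vpblendmd_model is a hypothetical name.

```c
#include <stdint.h>

/* Scalar model of a mask-selected blend on sixteen 32-bit elements:
   mask bit 0 picks the first source, mask bit 1 picks the second.
   Every destination element is written; no old value is preserved. */
void vpblendmd_model(uint32_t dst[16], uint16_t k1,
                     const uint32_t src2[16], const uint32_t src3[16])
{
    for (int n = 0; n < 16; n++) {
        dst[n] = ((k1 >> n) & 1) ? src3[n] : src2[n];
    }
}
```

Unlike the write-masked instructions in this chapter, the previous contents of the destination never survive a blend.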

Flags Affected
None.

Memory Up-conversion: Si32
S2 S1 S0  Function:                   Usage                      disp8*N
000       no conversion               [rax] {16to16} or [rax]    64
001       broadcast 1 element (x16)   [rax] {1to16}              4
010       broadcast 4 elements (x4)   [rax] {4to16}              16
011       reserved                    N/A                        N/A
100       uint8 to uint32             [rax] {uint8}              16
101       sint8 to sint32             [rax] {sint8}              16
110       uint16 to uint32            [rax] {uint16}             32
111       sint16 to sint32            [rax] {sint16}             32

Register Swizzle: Si32
MVEX.EH=0
S2 S1 S0  Function: 4 x 32 bits       Usage
000       no swizzle                  zmm0 or zmm0 {dcba}
001       swap (inner) pairs          zmm0 {cdab}
010       swap with two-away          zmm0 {badc}
011       cross-product swizzle       zmm0 {dacb}
100       broadcast a element         zmm0 {aaaa}
101       broadcast b element         zmm0 {bbbb}
110       broadcast c element         zmm0 {cccc}
111       broadcast d element         zmm0 {dddd}

Intel® C/C++ Compiler Intrinsic Equivalent
__m512i _mm512_mask_blend_epi32(__mmask16, __m512i, __m512i);

Exceptions
Real-Address Mode and Virtual-8086
#UD              Instruction not available in these modes

Protected and Compatibility Mode
#UD              Instruction not available in these modes

64 bit Mode
#SS(0)           If a memory address referencing the SS segment is
                 in a non-canonical form.
#GP(0)           If the memory address is in a non-canonical form.
                 If a memory operand linear address is not aligned
                 to the data size granularity dictated by SwizzUpConv
                 mode.
#PF(fault-code)  For a page fault.
#NM              If CR0.TS[bit 3]=1.
                 If preceded by any REX, F0, F2, F3, or 66 prefixes.

VPBLENDMQ - Blend Int64 Vectors using the Instruction Mask

Opcode        MVEX.NDS.512.66.0F38.W1 64 /r
Instruction   vpblendmq zmm1 {k1}, zmm2, Si64(zmm3/mt)
Description   Blend int64 vector zmm2 and int64 vector Si64(zmm3/mt)
              and store the result in zmm1, under write-mask.

Description
Performs an element-by-element blending between int64 vector zmm2 and the int64 vector
result of the swizzle/broadcast/conversion process on memory or int64 vector zmm3,
using the instruction mask as selector. The result is written into int64 vector zmm1.
The mask is not used as a write-mask for this instruction. Instead, the mask is used as an
element selector: every element of the destination is conditionally selected between the first
source and the second source using the value of the related mask bit (0 for first source, 1 for
second source).

Operation
if(source is a register operand and MVEX.EH bit is 1) {
tmpSrc3[511:0] = zmm3[511:0]
} else {
tmpSrc3[511:0] = SwizzUpConvLoadi64(zmm3/mt)
}
for (n = 0; n < 8; n++) {
i = 64*n
if(k1[n]==1 or *no write-mask*) {
zmm1[i+63:i] = tmpSrc3[i+63:i]
} else {
zmm1[i+63:i] = zmm2[i+63:i]
}
}

Flags Affected
None.

Memory Up-conversion: Si64
S2 S1 S0  Function:                   Usage                    disp8*N
000       no conversion               [rax] {8to8} or [rax]    64
001       broadcast 1 element (x8)    [rax] {1to8}             8
010       broadcast 4 elements (x2)   [rax] {4to8}             32
011       reserved                    N/A                      N/A
100       reserved                    N/A                      N/A
101       reserved                    N/A                      N/A
110       reserved                    N/A                      N/A
111       reserved                    N/A                      N/A

Register Swizzle: Si64
MVEX.EH=0
S2 S1 S0  Function: 4 x 64 bits       Usage
000       no swizzle                  zmm0 or zmm0 {dcba}
001       swap (inner) pairs          zmm0 {cdab}
010       swap with two-away          zmm0 {badc}
011       cross-product swizzle       zmm0 {dacb}
100       broadcast a element         zmm0 {aaaa}
101       broadcast b element         zmm0 {bbbb}
110       broadcast c element         zmm0 {cccc}
111       broadcast d element         zmm0 {dddd}

Intel® C/C++ Compiler Intrinsic Equivalent
__m512i _mm512_mask_blend_epi64(__mmask8, __m512i, __m512i);

Exceptions
Real-Address Mode and Virtual-8086
#UD              Instruction not available in these modes

Protected and Compatibility Mode
#UD              Instruction not available in these modes

64 bit Mode
#SS(0)           If a memory address referencing the SS segment is
                 in a non-canonical form.
#GP(0)           If the memory address is in a non-canonical form.
                 If a memory operand linear address is not aligned
                 to the data size granularity dictated by SwizzUpConv
                 mode.
#PF(fault-code)  For a page fault.
#NM              If CR0.TS[bit 3]=1.
                 If preceded by any REX, F0, F2, F3, or 66 prefixes.

VPBROADCASTD - Broadcast Int32 Vector

Opcode        MVEX.512.66.0F38.W0 58 /r
Instruction   vpbroadcastd zmm1 {k1}, Ui32(mt)
Description   Broadcast int32 vector Ui32(mt) into vector zmm1,
              under write-mask.

Description
The 1, 2, or 4 bytes (depending on the conversion in effect) at memory
address mt are converted, if required, to a 32-bit value and broadcast to all elements of
an int32 vector. The result is written into int32 vector zmm1.
This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.

Operation
// {1to16}
tmpSrc2[31:0] = UpConvLoadi32 (mt )
for (n = 0; n < 16; n++) {
if (k1[n] != 0) {
i = 32*n
zmm1[i+31:i] = tmpSrc2[31:0]
}
}
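
To make the interaction between up-conversion and broadcast concrete, here is a scalar sketch in C. The enum upconv32_t and both function names are hypothetical. The conversions follow the Ui32 table for this instruction, and multi-byte reads use the host byte order.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical names for the Ui32 up-conversion choices. */
typedef enum { UP_NONE, UP_UINT8, UP_SINT8, UP_UINT16, UP_SINT16 } upconv32_t;

/* Read one element at mem and widen it to 32 bits according to conv. */
static uint32_t upconv_load_i32(const void *mem, upconv32_t conv)
{
    uint8_t u8; int8_t s8; uint16_t u16; int16_t s16; uint32_t u32;
    switch (conv) {
    case UP_UINT8:  memcpy(&u8, mem, 1);  return u8;                      /* zero-extend */
    case UP_SINT8:  memcpy(&s8, mem, 1);  return (uint32_t)(int32_t)s8;   /* sign-extend */
    case UP_UINT16: memcpy(&u16, mem, 2); return u16;                     /* zero-extend */
    case UP_SINT16: memcpy(&s16, mem, 2); return (uint32_t)(int32_t)s16;  /* sign-extend */
    default:        memcpy(&u32, mem, 4); return u32;                     /* no conversion */
    }
}

/* Broadcast the converted value to every element enabled in k1. */
void vpbroadcastd_model(uint32_t dst[16], uint16_t k1,
                        const void *mem, upconv32_t conv)
{
    uint32_t value = upconv_load_i32(mem, conv);
    for (int n = 0; n < 16; n++) {
        if ((k1 >> n) & 1) dst[n] = value;
    }
}
```

Only 1, 2, or 4 bytes are read from memory, which matches the disp8*N values listed for this instruction.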

Flags Affected
None.

Memory Up-conversion: Ui32
S2 S1 S0  Function:           Usage             disp8*N
000       no conversion       [rax]             4
001       reserved            N/A               N/A
010       reserved            N/A               N/A
011       reserved            N/A               N/A
100       uint8 to uint32     [rax] {uint8}     1
101       sint8 to sint32     [rax] {sint8}     1
110       uint16 to uint32    [rax] {uint16}    2
111       sint16 to sint32    [rax] {sint16}    2

Intel® C/C++ Compiler Intrinsic Equivalent
__m512i _mm512_extload_epi32(void const*, _MM_UPCONV_EPI32_ENUM, _MM_BROADCAST32_ENUM, int);
__m512i _mm512_mask_extload_epi32(__m512i, __mmask16, void const*, _MM_UPCONV_EPI32_ENUM, _MM_BROADCAST32_ENUM, int);

Exceptions
Real-Address Mode and Virtual-8086
#UD              Instruction not available in these modes

Protected and Compatibility Mode
#UD              Instruction not available in these modes

64 bit Mode
#SS(0)           If a memory address referencing the SS segment is
                 in a non-canonical form.
#GP(0)           If the memory address is in a non-canonical form.
                 If a memory operand linear address is not aligned
                 to the data size granularity dictated by SwizzUpConv
                 mode.
#PF(fault-code)  For a page fault.
#NM              If CR0.TS[bit 3]=1.
                 If preceded by any REX, F0, F2, F3, or 66 prefixes.

VPBROADCASTQ - Broadcast Int64 Vector

Opcode        MVEX.512.66.0F38.W1 59 /r
Instruction   vpbroadcastq zmm1 {k1}, Ui64(mt)
Description   Broadcast int64 vector Ui64(mt) into vector zmm1,
              under write-mask.

Description
The 8 bytes at memory address mt are broadcast to an int64 vector. The result is written
into int64 vector zmm1.
This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.

Operation
// {1to8}
tmpSrc2[63:0] = UpConvLoadi64 (mt )
for (n = 0; n < 8; n++) {
if (k1[n] != 0) {
i = 64*n
zmm1[i+63:i] = tmpSrc2[63:0]
}
}

Flags Affected
None.

Memory Up-conversion: Ui64
S2 S1 S0  Function:           Usage     disp8*N
000       no conversion       [rax]     8
001       reserved            N/A       N/A
010       reserved            N/A       N/A
011       reserved            N/A       N/A
100       reserved            N/A       N/A
101       reserved            N/A       N/A
110       reserved            N/A       N/A
111       reserved            N/A       N/A

Intel® C/C++ Compiler Intrinsic Equivalent
__m512i _mm512_extload_epi64(void const*, _MM_UPCONV_EPI64_ENUM, _MM_BROADCAST64_ENUM, int);
__m512i _mm512_mask_extload_epi64(__m512i, __mmask8, void const*, _MM_UPCONV_EPI64_ENUM, _MM_BROADCAST64_ENUM, int);

Exceptions
Real-Address Mode and Virtual-8086
#UD              Instruction not available in these modes

Protected and Compatibility Mode
#UD              Instruction not available in these modes

64 bit Mode
#SS(0)           If a memory address referencing the SS segment is
                 in a non-canonical form.
#GP(0)           If the memory address is in a non-canonical form.
                 If a memory operand linear address is not aligned
                 to the data size granularity dictated by SwizzUpConv
                 mode.
#PF(fault-code)  For a page fault.
#NM              If CR0.TS[bit 3]=1.
                 If preceded by any REX, F0, F2, F3, or 66 prefixes.

VPCMPD - Compare Int32 Vectors and Set Vector Mask

Opcode        MVEX.NDS.512.66.0F3A.W0 1F /r ib
Instruction   vpcmpd k2 {k1}, zmm1, Si32(zmm2/mt), imm8
Description   Compare between int32 vector zmm1 and int32 vector
              Si32(zmm2/mt) and store the result in k2, under
              write-mask.

Description
Performs an element-by-element comparison between int32 vector zmm1 and the int32
vector result of the swizzle/broadcast/conversion from memory or int32 vector zmm2.
The result is written into vector mask k2.
The write-mask does not perform the normal write-masking function for this instruction.
While it does enable/disable comparisons, it does not block updating of the destination;
instead, if a write-mask bit is 0, the corresponding destination bit is set to 0. Nonetheless, the operation is similar enough so that it makes sense to use the usual write-mask
notation. This mode of operation is desirable because the result will be used directly as a
write-mask, rather than the normal case where the result is used with a separate write-mask that keeps the masked elements inactive.

Immediate Format
       Comparison Type           I2  I1  I0
eq     Equal                     0   0   0
lt     Less than                 0   0   1
le     Less than or Equal        0   1   0
neq    Not Equal                 1   0   0
nlt    Not Less than             1   0   1
nle    Not Less than or Equal    1   1   0

Operation
switch (IMM8[2:0]) {
case 0: OP ← EQ; break;
case 1: OP ← LT; break;
case 2: OP ← LE; break;
case 4: OP ← NEQ; break;
case 5: OP ← NLT; break;
case 6: OP ← NLE; break;
default: Reserved; break;
}
if(source is a register operand and MVEX.EH bit is 1) {
tmpSrc2[511:0] = zmm2[511:0]
} else {
tmpSrc2[511:0] = SwizzUpConvLoadi32 (zmm2/mt )
}
for (n = 0; n < 16; n++) {
k2[n] = 0
if(k1[n] != 0) {
i = 32*n
// signed integer operation
k2[n] = (zmm1[i+31:i] OP tmpSrc2[i+31:i]) ? 1 : 0
}
}
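
The predicate selection and the zeroing behavior of the mask can be checked with a scalar model. The sketch below is illustrative only: vpcmpd_model and the CMP_* constants are hypothetical names, the second operand is assumed to be already swizzled or up-converted, and the reserved encodings (3 and 7) are modeled as producing 0, which the manual does not specify.

```c
#include <stdint.h>

/* Predicate encodings from the Immediate Format table (imm8[2:0]). */
enum { CMP_EQ = 0, CMP_LT = 1, CMP_LE = 2, CMP_NEQ = 4, CMP_NLT = 5, CMP_NLE = 6 };

/* Scalar model of the signed compare. Returns the new k2 value:
   bits whose k1 bit is clear are forced to 0, not preserved. */
uint16_t vpcmpd_model(uint16_t k1, const int32_t a[16],
                      const int32_t b[16], unsigned imm8)
{
    uint16_t k2 = 0;
    for (int n = 0; n < 16; n++) {
        int r;
        switch (imm8 & 7) {
        case CMP_EQ:  r = a[n] == b[n]; break;
        case CMP_LT:  r = a[n] <  b[n]; break;
        case CMP_LE:  r = a[n] <= b[n]; break;
        case CMP_NEQ: r = a[n] != b[n]; break;
        case CMP_NLT: r = a[n] >= b[n]; break;
        case CMP_NLE: r = a[n] >  b[n]; break;
        default:      r = 0; break;   /* reserved encodings: assumption only */
        }
        if (r && ((k1 >> n) & 1)) k2 |= (uint16_t)(1u << n);
    }
    return k2;
}
```

Because there is no dedicated greater-than encoding, nle serves as greater-than and nlt as greater-than-or-equal for integer operands.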

Instruction Pseudo-ops
Compilers and assemblers may implement the following pseudo-ops in addition to the
standard instruction op:
Pseudo-Op                                 Implementation
vpcmpeqd k2 {k1}, zmm1, Si(zmm2/mt)       vpcmpd k2 {k1}, zmm1, Si(zmm2/mt), {eq}
vpcmpltd k2 {k1}, zmm1, Si(zmm2/mt)       vpcmpd k2 {k1}, zmm1, Si(zmm2/mt), {lt}
vpcmpled k2 {k1}, zmm1, Si(zmm2/mt)       vpcmpd k2 {k1}, zmm1, Si(zmm2/mt), {le}
vpcmpneqd k2 {k1}, zmm1, Si(zmm2/mt)      vpcmpd k2 {k1}, zmm1, Si(zmm2/mt), {neq}
vpcmpnltd k2 {k1}, zmm1, Si(zmm2/mt)      vpcmpd k2 {k1}, zmm1, Si(zmm2/mt), {nlt}
vpcmpnled k2 {k1}, zmm1, Si(zmm2/mt)      vpcmpd k2 {k1}, zmm1, Si(zmm2/mt), {nle}

Flags Affected
None.

Memory Up-conversion: Si32
S2 S1 S0  Function:                   Usage                      disp8*N
000       no conversion               [rax] {16to16} or [rax]    64
001       broadcast 1 element (x16)   [rax] {1to16}              4
010       broadcast 4 elements (x4)   [rax] {4to16}              16
011       reserved                    N/A                        N/A
100       uint8 to uint32             [rax] {uint8}              16
101       sint8 to sint32             [rax] {sint8}              16
110       uint16 to uint32            [rax] {uint16}             32
111       sint16 to sint32            [rax] {sint16}             32

Register Swizzle: Si32
MVEX.EH=0
S2 S1 S0  Function: 4 x 32 bits       Usage
000       no swizzle                  zmm0 or zmm0 {dcba}
001       swap (inner) pairs          zmm0 {cdab}
010       swap with two-away          zmm0 {badc}
011       cross-product swizzle       zmm0 {dacb}
100       broadcast a element         zmm0 {aaaa}
101       broadcast b element         zmm0 {bbbb}
110       broadcast c element         zmm0 {cccc}
111       broadcast d element         zmm0 {dddd}

Intel® C/C++ Compiler Intrinsic Equivalent
__mmask16 _mm512_cmp_epi32_mask(__m512i, __m512i, const _MM_CMPINT_ENUM);
__mmask16 _mm512_mask_cmp_epi32_mask(__mmask16, __m512i, __m512i, const _MM_CMPINT_ENUM);

Exceptions
Real-Address Mode and Virtual-8086
#UD              Instruction not available in these modes

Protected and Compatibility Mode
#UD              Instruction not available in these modes

64 bit Mode
#SS(0)           If a memory address referencing the SS segment is
                 in a non-canonical form.
#GP(0)           If the memory address is in a non-canonical form.
                 If a memory operand linear address is not aligned
                 to the data size granularity dictated by SwizzUpConv
                 mode.
#PF(fault-code)  For a page fault.
#NM              If CR0.TS[bit 3]=1.
                 If preceded by any REX, F0, F2, F3, or 66 prefixes.

VPCMPEQD - Compare Equal Int32 Vectors and Set Vector Mask

Opcode        MVEX.NDS.512.66.0F.W0 76 /r
Instruction   vpcmpeqd k2 {k1}, zmm1, Si32(zmm2/mt)
Description   Compare Equal between int32 vector zmm1 and int32 vector
              Si32(zmm2/mt), and set vector mask k2 to reflect the
              zero/non-zero status of each element of the result,
              under write-mask.

Description
Performs an element-by-element compare for equality between int32 vector zmm1 and
the int32 vector result of the swizzle/broadcast/conversion from memory or int32 vector
zmm2. The result is written into vector mask k2.
The write-mask does not perform the normal write-masking function for this instruction.
While it does enable/disable comparisons, it does not block updating of the destination;
instead, if a write-mask bit is 0, the corresponding destination bit is set to 0. Nonetheless, the operation is similar enough so that it makes sense to use the usual write-mask
notation. This mode of operation is desirable because the result will be used directly as a
write-mask, rather than the normal case where the result is used with a separate write-mask that keeps the masked elements inactive.

Operation

if(source is a register operand and MVEX.EH bit is 1) {
tmpSrc2[511:0] = zmm2[511:0]
} else {
tmpSrc2[511:0] = SwizzUpConvLoadi32 (zmm2/mt )
}
for (n = 0; n < 16; n++) {
k2[n] = 0
if(k1[n] != 0) {
i = 32*n
// signed integer operation
k2[n] = (zmm1[i+31:i] == tmpSrc2[i+31:i]) ? 1 : 0
}
}

Flags Affected
None.

Memory Up-conversion: Si32
S2 S1 S0  Function:                   Usage                      disp8*N
000       no conversion               [rax] {16to16} or [rax]    64
001       broadcast 1 element (x16)   [rax] {1to16}              4
010       broadcast 4 elements (x4)   [rax] {4to16}              16
011       reserved                    N/A                        N/A
100       uint8 to uint32             [rax] {uint8}              16
101       sint8 to sint32             [rax] {sint8}              16
110       uint16 to uint32            [rax] {uint16}             32
111       sint16 to sint32            [rax] {sint16}             32

Register Swizzle: Si32
MVEX.EH=0
S2 S1 S0  Function: 4 x 32 bits       Usage
000       no swizzle                  zmm0 or zmm0 {dcba}
001       swap (inner) pairs          zmm0 {cdab}
010       swap with two-away          zmm0 {badc}
011       cross-product swizzle       zmm0 {dacb}
100       broadcast a element         zmm0 {aaaa}
101       broadcast b element         zmm0 {bbbb}
110       broadcast c element         zmm0 {cccc}
111       broadcast d element         zmm0 {dddd}

Intel® C/C++ Compiler Intrinsic Equivalent
__mmask16 _mm512_cmpeq_epi32_mask(__m512i, __m512i);
__mmask16 _mm512_mask_cmpeq_epi32_mask(__mmask16, __m512i, __m512i);

Exceptions
Real-Address Mode and Virtual-8086
#UD              Instruction not available in these modes

Protected and Compatibility Mode
#UD              Instruction not available in these modes

64 bit Mode
#SS(0)           If a memory address referencing the SS segment is
                 in a non-canonical form.
#GP(0)           If the memory address is in a non-canonical form.
                 If a memory operand linear address is not aligned
                 to the data size granularity dictated by SwizzUpConv
                 mode.
#PF(fault-code)  For a page fault.
#NM              If CR0.TS[bit 3]=1.
                 If preceded by any REX, F0, F2, F3, or 66 prefixes.

VPCMPGTD - Compare Greater Than Int32 Vectors and Set Vector Mask

Opcode        MVEX.NDS.512.66.0F.W0 66 /r
Instruction   vpcmpgtd k2 {k1}, zmm1, Si32(zmm2/mt)
Description   Compare Greater between int32 vector zmm1 and int32
              vector Si32(zmm2/mt), and set vector mask k2 to reflect
              the zero/non-zero status of each element of the result,
              under write-mask.

Description
Performs an element-by-element compare for the greater value of int32 vector zmm1 and
the int32 vector result of the swizzle/broadcast/conversion from memory or int32 vector
zmm2. The result is written into vector mask k2.
The write-mask does not perform the normal write-masking function for this instruction.
While it does enable/disable comparisons, it does not block updating of the destination;
instead, if a write-mask bit is 0, the corresponding destination bit is set to 0. Nonetheless, the operation is similar enough so that it makes sense to use the usual write-mask
notation. This mode of operation is desirable because the result will be used directly as a
write-mask, rather than the normal case where the result is used with a separate write-mask that keeps the masked elements inactive.

Operation

if(source is a register operand and MVEX.EH bit is 1) {
tmpSrc2[511:0] = zmm2[511:0]
} else {
tmpSrc2[511:0] = SwizzUpConvLoadi32 (zmm2/mt )
}
for (n = 0; n < 16; n++) {
k2[n] = 0
if(k1[n] != 0) {
i = 32*n
// signed integer operation
k2[n] = (zmm1[i+31:i] > tmpSrc2[i+31:i]) ? 1 : 0
}
}
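
The remark that the compare result is meant to be used directly as a write-mask can be illustrated with an element-wise maximum. The following scalar sketch uses hypothetical helper names. It models a greater-than compare followed by a masked move and is one possible usage pattern, not a sequence prescribed by this manual.

```c
#include <stdint.h>

/* k2 = (a > b) per element, signed; bits outside k1 come back as 0. */
static uint16_t cmpgt_mask(uint16_t k1, const int32_t a[16], const int32_t b[16])
{
    uint16_t k2 = 0;
    for (int n = 0; n < 16; n++) {
        if (((k1 >> n) & 1) && a[n] > b[n]) k2 |= (uint16_t)(1u << n);
    }
    return k2;
}

/* Element-wise signed maximum: start from b, then overwrite with a
   under the compare result, which is used directly as the write-mask. */
void max_via_mask(int32_t dst[16], const int32_t a[16], const int32_t b[16])
{
    uint16_t k = cmpgt_mask(0xFFFF, a, b);
    for (int n = 0; n < 16; n++) dst[n] = b[n];
    for (int n = 0; n < 16; n++) {
        if ((k >> n) & 1) dst[n] = a[n];   /* masked move of a into dst */
    }
}
```

Since disabled elements yield 0 in the result mask, no separate AND with the original mask is needed before the masked move.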

Flags Affected
None.

Memory Up-conversion: Si32
S2 S1 S0  Function:                   Usage                      disp8*N
000       no conversion               [rax] {16to16} or [rax]    64
001       broadcast 1 element (x16)   [rax] {1to16}              4
010       broadcast 4 elements (x4)   [rax] {4to16}              16
011       reserved                    N/A                        N/A
100       uint8 to uint32             [rax] {uint8}              16
101       sint8 to sint32             [rax] {sint8}              16
110       uint16 to uint32            [rax] {uint16}             32
111       sint16 to sint32            [rax] {sint16}             32

Register Swizzle: Si32
MVEX.EH=0
S2 S1 S0  Function: 4 x 32 bits       Usage
000       no swizzle                  zmm0 or zmm0 {dcba}
001       swap (inner) pairs          zmm0 {cdab}
010       swap with two-away          zmm0 {badc}
011       cross-product swizzle       zmm0 {dacb}
100       broadcast a element         zmm0 {aaaa}
101       broadcast b element         zmm0 {bbbb}
110       broadcast c element         zmm0 {cccc}
111       broadcast d element         zmm0 {dddd}

Intel® C/C++ Compiler Intrinsic Equivalent
__mmask16 _mm512_cmpgt_epi32_mask(__m512i, __m512i);
__mmask16 _mm512_mask_cmpgt_epi32_mask(__mmask16, __m512i, __m512i);

Exceptions
Real-Address Mode and Virtual-8086
#UD              Instruction not available in these modes

Protected and Compatibility Mode
#UD              Instruction not available in these modes

64 bit Mode
#SS(0)           If a memory address referencing the SS segment is
                 in a non-canonical form.
#GP(0)           If the memory address is in a non-canonical form.
                 If a memory operand linear address is not aligned
                 to the data size granularity dictated by SwizzUpConv
                 mode.
#PF(fault-code)  For a page fault.
#NM              If CR0.TS[bit 3]=1.
                 If preceded by any REX, F0, F2, F3, or 66 prefixes.

VPCMPLTD - Compare Less Than Int32 Vectors and Set Vector Mask

Opcode        MVEX.NDS.512.66.0F38.W0 74 /r
Instruction   vpcmpltd k2 {k1}, zmm1, Si32(zmm2/mt)
Description   Compare Less between int32 vector zmm1 and int32 vector
              Si32(zmm2/mt), and set vector mask k2 to reflect the
              zero/non-zero status of each element of the result,
              under write-mask.

Description
Performs an element-by-element compare for the lesser value of int32 vector zmm1 and
the int32 vector result of the swizzle/broadcast/conversion from memory or int32 vector
zmm2. The result is written into vector mask k2.
The write-mask does not perform the normal write-masking function for this instruction.
While it does enable/disable comparisons, it does not block updating of the destination;
instead, if a write-mask bit is 0, the corresponding destination bit is set to 0. Nonetheless, the operation is similar enough so that it makes sense to use the usual write-mask
notation. This mode of operation is desirable because the result will be used directly as a
write-mask, rather than the normal case where the result is used with a separate write-mask that keeps the masked elements inactive.

Operation

if(source is a register operand and MVEX.EH bit is 1) {
tmpSrc2[511:0] = zmm2[511:0]
} else {
tmpSrc2[511:0] = SwizzUpConvLoadi32 (zmm2/mt )
}
for (n = 0; n < 16; n++) {
k2[n] = 0
if(k1[n] != 0) {
i = 32*n
// signed integer operation
k2[n] = (zmm1[i+31:i] < tmpSrc2[i+31:i]) ? 1 : 0
}
}

Flags Affected
None.

Memory Up-conversion: Si32
S2 S1 S0  Function:                   Usage                      disp8*N
000       no conversion               [rax] {16to16} or [rax]    64
001       broadcast 1 element (x16)   [rax] {1to16}              4
010       broadcast 4 elements (x4)   [rax] {4to16}              16
011       reserved                    N/A                        N/A
100       uint8 to uint32             [rax] {uint8}              16
101       sint8 to sint32             [rax] {sint8}              16
110       uint16 to uint32            [rax] {uint16}             32
111       sint16 to sint32            [rax] {sint16}             32

Register Swizzle: Si32
MVEX.EH=0
S2 S1 S0  Function: 4 x 32 bits       Usage
000       no swizzle                  zmm0 or zmm0 {dcba}
001       swap (inner) pairs          zmm0 {cdab}
010       swap with two-away          zmm0 {badc}
011       cross-product swizzle       zmm0 {dacb}
100       broadcast a element         zmm0 {aaaa}
101       broadcast b element         zmm0 {bbbb}
110       broadcast c element         zmm0 {cccc}
111       broadcast d element         zmm0 {dddd}

Intel® C/C++ Compiler Intrinsic Equivalent
__mmask16 _mm512_cmplt_epi32_mask(__m512i, __m512i);
__mmask16 _mm512_mask_cmplt_epi32_mask(__mmask16, __m512i, __m512i);

Exceptions
Real-Address Mode and Virtual-8086
#UD              Instruction not available in these modes

Protected and Compatibility Mode
#UD              Instruction not available in these modes

64 bit Mode
#SS(0)           If a memory address referencing the SS segment is
                 in a non-canonical form.
#GP(0)           If the memory address is in a non-canonical form.
                 If a memory operand linear address is not aligned
                 to the data size granularity dictated by SwizzUpConv
                 mode.
#PF(fault-code)  For a page fault.
#NM              If CR0.TS[bit 3]=1.
#UD              If processor model does not implement the specific instruction.
                 If preceded by any REX, F0, F2, F3, or 66 prefixes.

VPCMPUD - Compare Uint32 Vectors and Set Vector Mask

Opcode        MVEX.NDS.512.66.0F3A.W0 1E /r ib
Instruction   vpcmpud k2 {k1}, zmm1, Si32(zmm2/mt), imm8
Description   Compare between uint32 vector zmm1 and uint32 vector
              Si32(zmm2/mt) and store the result in k2, under
              write-mask.

Description
Performs an element-by-element comparison between uint32 vector zmm1 and the
uint32 vector result of the swizzle/broadcast/conversion from memory or uint32 vector
zmm2. The result is written into vector mask k2.
The write-mask does not perform the normal write-masking function for this instruction.
While it does enable/disable comparisons, it does not block updating of the destination;
instead, if a write-mask bit is 0, the corresponding destination bit is set to 0. Nonetheless, the operation is similar enough so that it makes sense to use the usual write-mask
notation. This mode of operation is desirable because the result will be used directly as a
write-mask, rather than the normal case where the result is used with a separate write-mask that keeps the masked elements inactive.

Immediate Format
       Comparison Type           I2  I1  I0
eq     Equal                     0   0   0
lt     Less than                 0   0   1
le     Less than or Equal        0   1   0
neq    Not Equal                 1   0   0
nlt    Not Less than             1   0   1
nle    Not Less than or Equal    1   1   0

Operation
switch (IMM8[2:0]) {
case 0: OP ← EQ; break;
case 1: OP ← LT; break;
case 2: OP ← LE; break;
case 4: OP ← NEQ; break;
case 5: OP ← NLT; break;
case 6: OP ← NLE; break;
default: Reserved; break;
}
if(source is a register operand and MVEX.EH bit is 1) {
tmpSrc2[511:0] = zmm2[511:0]
} else {
tmpSrc2[511:0] = SwizzUpConvLoadi32 (zmm2/mt )
}
for (n = 0; n < 16; n++) {
k2[n] = 0
if(k1[n] != 0) {
i = 32*n
// unsigned integer operation
k2[n] = (zmm1[i+31:i] OP tmpSrc2[i+31:i]) ? 1 : 0
}
}
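
The only difference from the signed compare is how each 32-bit pattern is interpreted. The scalar sketch below runs a less-than test on the same bit patterns both ways. The function names are hypothetical, and the cast to int32_t assumes the usual two's-complement conversion.

```c
#include <stdint.h>

/* Less-than with the operands treated as signed (vpcmpd-style). */
uint16_t lt_signed_mask(uint16_t k1, const uint32_t a[16], const uint32_t b[16])
{
    uint16_t k2 = 0;
    for (int n = 0; n < 16; n++) {
        if (((k1 >> n) & 1) && (int32_t)a[n] < (int32_t)b[n])
            k2 |= (uint16_t)(1u << n);
    }
    return k2;
}

/* Less-than with the operands treated as unsigned (vpcmpud-style). */
uint16_t lt_unsigned_mask(uint16_t k1, const uint32_t a[16], const uint32_t b[16])
{
    uint16_t k2 = 0;
    for (int n = 0; n < 16; n++) {
        if (((k1 >> n) & 1) && a[n] < b[n])
            k2 |= (uint16_t)(1u << n);
    }
    return k2;
}
```

For example, the pattern 0xFFFFFFFF is less than 1 when read as signed (-1) but greater than 1 when read as unsigned.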

Instruction Pseudo-ops
Compilers and assemblers may implement the following pseudo-ops in addition to the
standard instruction op:
Pseudo-Op                                  Implementation
vpcmpequd k2 {k1}, zmm1, Si(zmm2/mt)       vpcmpud k2 {k1}, zmm1, Si(zmm2/mt), {eq}
vpcmpltud k2 {k1}, zmm1, Si(zmm2/mt)       vpcmpud k2 {k1}, zmm1, Si(zmm2/mt), {lt}
vpcmpleud k2 {k1}, zmm1, Si(zmm2/mt)       vpcmpud k2 {k1}, zmm1, Si(zmm2/mt), {le}
vpcmpnequd k2 {k1}, zmm1, Si(zmm2/mt)      vpcmpud k2 {k1}, zmm1, Si(zmm2/mt), {neq}
vpcmpnltud k2 {k1}, zmm1, Si(zmm2/mt)      vpcmpud k2 {k1}, zmm1, Si(zmm2/mt), {nlt}
vpcmpnleud k2 {k1}, zmm1, Si(zmm2/mt)      vpcmpud k2 {k1}, zmm1, Si(zmm2/mt), {nle}

Flags Affected
None.

Memory Up-conversion: Si32
S2 S1 S0  Function:                   Usage                      disp8*N
000       no conversion               [rax] {16to16} or [rax]    64
001       broadcast 1 element (x16)   [rax] {1to16}              4
010       broadcast 4 elements (x4)   [rax] {4to16}              16
011       reserved                    N/A                        N/A
100       uint8 to uint32             [rax] {uint8}              16
101       sint8 to sint32             [rax] {sint8}              16
110       uint16 to uint32            [rax] {uint16}             32
111       sint16 to sint32            [rax] {sint16}             32

Register Swizzle: Si32
MVEX.EH=0
S2 S1 S0  Function: 4 x 32 bits       Usage
000       no swizzle                  zmm0 or zmm0 {dcba}
001       swap (inner) pairs          zmm0 {cdab}
010       swap with two-away          zmm0 {badc}
011       cross-product swizzle       zmm0 {dacb}
100       broadcast a element         zmm0 {aaaa}
101       broadcast b element         zmm0 {bbbb}
110       broadcast c element         zmm0 {cccc}
111       broadcast d element         zmm0 {dddd}

Intel® C/C++ Compiler Intrinsic Equivalent
__mmask16 _mm512_cmp_epu32_mask(__m512i, __m512i, const _MM_CMPINT_ENUM);
__mmask16 _mm512_mask_cmp_epu32_mask(__mmask16, __m512i, __m512i, const _MM_CMPINT_ENUM);

Exceptions

Real-Address Mode and Virtual-8086
#UD              Instruction not available in these modes

Protected and Compatibility Mode
#UD              Instruction not available in these modes

64 bit Mode
#SS(0)           If a memory address referencing the SS segment is
                 in a non-canonical form.
#GP(0)           If the memory address is in a non-canonical form.
                 If a memory operand linear address is not aligned
                 to the data size granularity dictated by SwizzUpConv
                 mode.
#PF(fault-code)  For a page fault.
#NM              If CR0.TS[bit 3]=1.
                 If preceded by any REX, F0, F2, F3, or 66 prefixes.

VPERMD - Permutes Int32 Vectors

Opcode        MVEX.NDS.512.66.0F38.W0 36 /r
Instruction   vpermd zmm1 {k1}, zmm2, zmm3/mt
Description   Element permute vector zmm3/mt using vector zmm2 as
              indices and store the final result in zmm1, under
              write-mask.

Description
Performs a permutation of the elements of the int32 vector read from memory or
vector zmm3, using the elements of int32 vector zmm2 as source indices. Only the low four
bits of each index element are used (see Operation). The result of the permutation is
written into int32 vector zmm1. Note that index values may be repeated so
that the same vector source element can be replicated across many vector destination
elements.
No swizzle, broadcast, or conversion is performed by this instruction.
This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.

Operation
src[511:0] = zmm3/mt
for (n = 0; n < 16; n++) {
    if (k1[n] != 0) {
        i = 32*n
        j = zmm2[i+3:i]*32
        zmm1[i+31:i] = src[j+31:j]
    }
}
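
The pseudocode above can be sketched as a scalar reference model in C. This is only an illustration, not Intel code: the function name vpermd_model and the array-based representation of the 512-bit registers are assumptions made for the example.

```c
#include <stdint.h>

/* Scalar model of the VPERMD pseudocode (hypothetical helper).
 * dst : zmm1, updated in place; masked-off elements keep their old value
 * idx : zmm2, only the low 4 bits of each element select a source element
 * src : zmm3/mt
 * k1  : 16-bit write-mask, bit n controls element n */
void vpermd_model(uint32_t dst[16], const uint32_t idx[16],
                  const uint32_t src[16], uint16_t k1)
{
    uint32_t tmp[16];

    /* Snapshot the source first so that dst may alias src. */
    for (int n = 0; n < 16; n++)
        tmp[n] = src[n];

    for (int n = 0; n < 16; n++) {
        if ((k1 >> n) & 1u)
            dst[n] = tmp[idx[n] & 0xF];
    }
}
```

Because indices may repeat, filling idx with a single value turns the permute into a broadcast of one source element to every enabled destination element.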

Flags Affected
None.

Intel® C/C++ Compiler Intrinsic Equivalent
__m512i _mm512_permutevar_epi32 (__m512i, __m512i);
__m512i _mm512_mask_permutevar_epi32 (__m512i, __mmask16, __m512i, __m512i);

Exceptions
Real-Address Mode and Virtual-8086
#UD              Instruction not available in these modes

Protected and Compatibility Mode
#UD              Instruction not available in these modes

64 bit Mode
#SS(0)           If a memory address referencing the SS segment is
                 in a non-canonical form.
#GP(0)           If the memory address is in a non-canonical form.
#PF(fault-code)  For a page fault.
#NM              If CR0.TS[bit 3]=1.
#UD              If preceded by any REX, F0, F2, F3, or 66 prefixes.
                 This instruction does not support any
                 SwizzUpConv different from the default value (no broadcast,
                 no conversion). If SwizzUpConv function is set to any value
                 different than "no action", then an Invalid Opcode fault is
                 raised. This includes register swizzles.


VPERMF32X4 - Shufe Vector Dqwords

Opcode
MVEX.512.66.0F3A.W0 07 /r ib

Instruction
vpermf32x4 zmm1 {k1}, zmm2/mt, imm8

Description
4xFloat32 shuffle element vector zmm2/mt and store the result in zmm1, using imm8, under
write-mask.

Description
Shuffles 128-bit blocks of the vector read from memory or vector zmm2, using index
bits in the immediate. The result of the shuffle is written into vector zmm1.
No swizzle, broadcast, or conversion is performed by this instruction.
This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.

Immediate Format
          I7 I6 I5 I4 I3 I2 I1 I0
imm8      128-bit level permutation vector {3210}

Operation
src[511:0] = zmm2/mt
// perm128 is the 8-bit immediate imm8 (see Immediate Format)
// Inter-lane shuffle
for (n = 0; n < 16/4; n++) {
    i = 128*n
    j = 128*((perm128 >> 2*n) & 0x3)
    tmp[i+127:i] = src[j+127:j]
}
// Writemasking
for (n = 0; n < 16; n++) {
    if (k1[n] != 0) {
        i = 32*n
        zmm1[i+31:i] = tmp[i+31:i]
    }
}
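
As a rough illustration of the two phases above (128-bit block shuffle, then 32-bit write-masking), here is a scalar C sketch. The name vpermf32x4_model and the uint32_t[16] register representation are assumptions of this example, and the per-element index i = 32*n in the write-masking loop is inferred from the element-wise mask semantics.

```c
#include <stdint.h>
#include <string.h>

/* Scalar model of VPERMF32X4 (hypothetical helper).
 * dst  : zmm1, updated in place under write-mask k1
 * src  : zmm2/mt
 * imm8 : four 2-bit selectors, one per 128-bit destination block */
void vpermf32x4_model(uint32_t dst[16], const uint32_t src[16],
                      uint8_t imm8, uint16_t k1)
{
    uint32_t tmp[16];

    /* Inter-lane shuffle: destination block n takes source block
     * (imm8 >> 2*n) & 3; each block holds four 32-bit elements. */
    for (int n = 0; n < 4; n++) {
        int sel = (imm8 >> (2 * n)) & 0x3;
        memcpy(&tmp[4 * n], &src[4 * sel], 4 * sizeof(uint32_t));
    }

    /* Write-masking is applied per 32-bit element, not per block. */
    for (int n = 0; n < 16; n++) {
        if ((k1 >> n) & 1u)
            dst[n] = tmp[n];
    }
}
```

With imm8 = 0x1B the four blocks are reversed, while imm8 = 0xE4 leaves the block order unchanged.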


Flags Affected
None.

Intel® C/C++ Compiler Intrinsic Equivalent
__m512i _mm512_permute4f128_epi32 (__m512i, _MM_PERM_ENUM);
__m512i _mm512_mask_permute4f128_epi32 (__m512i, __mmask16, __m512i, _MM_PERM_ENUM);
__m512 _mm512_permute4f128_ps (__m512, _MM_PERM_ENUM);
__m512 _mm512_mask_permute4f128_ps (__m512, __mmask16, __m512, _MM_PERM_ENUM);

Exceptions
Real-Address Mode and Virtual-8086
#UD              Instruction not available in these modes

Protected and Compatibility Mode
#UD              Instruction not available in these modes

64 bit Mode
#SS(0)           If a memory address referencing the SS segment is
                 in a non-canonical form.
#GP(0)           If the memory address is in a non-canonical form.
#PF(fault-code)  For a page fault.
#NM              If CR0.TS[bit 3]=1.
#UD              If preceded by any REX, F0, F2, F3, or 66 prefixes.
                 This instruction does not support any
                 SwizzUpConv different from the default value (no broadcast,
                 no conversion). If SwizzUpConv function is set to any value
                 different than "no action", then an Invalid Opcode fault is
                 raised. This includes register swizzles.


VPGATHERDD - Gather Int32 Vector With Signed Dword Indices

Opcode
MVEX.512.66.0F38.W0 90 /r /vsib

Instruction
vpgatherdd zmm1 {k1}, Ui32(mvt)

Description
Gather int32 vector Ui32(mvt) into int32 vector zmm1 using doubleword indices and k1 as
completion mask.

Description
A set of 16 memory locations pointed to by base address BASE_ADDR and doubleword
index vector VINDEX with scale SCALE are converted to an int32 vector. The result is
written into int32 vector zmm1.
Note the special mask behavior: only a subset of the active elements of write mask k1
are actually operated on (as denoted by function SELECT_SUBSET). There are only
two guarantees about the function: (a) the destination mask is a subset of the source mask
(identity is included), and (b) on a given invocation of the instruction, at least one element
(the least significant enabled mask bit) will be selected from the source mask.
Programmers should always re-execute a gather/scatter instruction (via a loop) until the
sequence has fully completed (i.e. all elements of the gather/scatter sequence have been
loaded/stored and hence the write-mask bits are all zero).
Note that each accessed element will always access 64 bytes of memory. The memory region
accessed by each element will always be between element_linear_address & (~0x3F) and
(element_linear_address & (~0x3F)) + 63 boundaries.
This instruction has special disp8*N and alignment rules. N is considered to be the size
of a single vector element before up-conversion.
Note also the special mask behavior: the corresponding bits in write mask k1 are reset
as each destination element is updated, according to the subset of write mask k1.
This allows the instruction to be conditionally re-triggered until all the elements from
a given write mask have been successfully loaded.
The instruction will #GP fault if the destination vector zmm1 is the same as index vector
VINDEX.

Operation
// instruction works over a subset of the write mask
ktemp = SELECT_SUBSET(k1)
// Use mvt as vector memory operand (VSIB)
for (n = 0; n < 16; n++) {
    if (ktemp[n] != 0) {
        i = 32*n
        // mvt[n] = BASE_ADDR + SignExtend(VINDEX[i+31:i] * SCALE)
        pointer[63:0] = mvt[n]
        zmm1[i+31:i] = UpConvLoadi32(pointer)
        k1[n] = 0
    }
}
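
The re-execution loop described above can be sketched in C as follows. This is a minimal model under stated assumptions: vpgatherdd_step and gather_all are hypothetical names, no up-conversion is applied, and each invocation services only the least significant enabled mask bit, which is the weakest behavior that SELECT_SUBSET is guaranteed to provide (real hardware may service more elements per invocation).

```c
#include <stdint.h>
#include <string.h>

/* One modeled invocation of the gather (hypothetical helper).
 * Loads the element for the lowest set bit of k1, clears that bit,
 * and returns the updated mask. */
uint16_t vpgatherdd_step(uint32_t dst[16], uint16_t k1, const void *base,
                         const int32_t vindex[16], int scale)
{
    if (k1 == 0)
        return 0;

    int n = 0;
    while (((k1 >> n) & 1u) == 0)
        n++;

    /* Address = BASE_ADDR + sign-extended (VINDEX[n] * SCALE). */
    const unsigned char *p =
        (const unsigned char *)base + (int64_t)vindex[n] * scale;
    memcpy(&dst[n], p, sizeof(uint32_t));

    return (uint16_t)(k1 & ~(1u << n));
}

/* Software loop: repeat until every write-mask bit has been cleared. */
void gather_all(uint32_t dst[16], uint16_t k1, const void *base,
                const int32_t vindex[16], int scale)
{
    while (k1 != 0)
        k1 = vpgatherdd_step(dst, k1, base, vindex, scale);
}
```

The loop terminates because every invocation clears at least one mask bit, which mirrors guarantee (b) in the description.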

Flags Affected
None.

Memory Up-conversion: Ui32
S2 S1 S0   Function:           Usage             disp8*N
000        no conversion       [rax]             4
001        reserved            N/A               N/A
010        reserved            N/A               N/A
011        reserved            N/A               N/A
100        uint8 to uint32     [rax] {uint8}     1
101        sint8 to sint32     [rax] {sint8}     1
110        uint16 to uint32    [rax] {uint16}    2
111        sint16 to sint32    [rax] {sint16}    2

Intel® C/C++ Compiler Intrinsic Equivalent
__m512i _mm512_i32gather_epi32 (__m512i, void const*, int);
__m512i _mm512_mask_i32gather_epi32 (__m512i, __mmask16, __m512i, void const*, int);
__m512i _mm512_i32extgather_epi32 (__m512i, void const*, _MM_UPCONV_EPI32_ENUM, int, int);
__m512i _mm512_mask_i32extgather_epi32 (__m512i, __mmask16, __m512i, void const*,
                                        _MM_UPCONV_EPI32_ENUM, int, int);

Exceptions
Real-Address Mode and Virtual-8086
#UD              Instruction not available in these modes

Protected and Compatibility Mode
#UD              Instruction not available in these modes

64 bit Mode
#SS(0)           If a memory address referencing the SS segment is
                 in a non-canonical form, and corresponding write-mask bit is not zero.
#GP(0)           If a memory address is in a non-canonical form,
                 and corresponding write-mask bit is not zero.
                 If a memory operand linear address is not aligned
                 to element-wise data granularity dictated by the UpConv
                 and corresponding write-mask bit is not zero.
                 If the destination vector is the same as the index vector [see
                 .
#PF(fault-code)  If a memory operand linear address produces a page fault
                 and corresponding write-mask bit is not zero.
#NM              If CR0.TS[bit 3]=1.
#UD              If preceded by any REX, F0, F2, F3, or 66 prefixes.
                 If using a 16 bit effective address.
                 If ModRM.rm is different than 100b.
                 If no write mask is provided or selected write-mask is k0.


VPGATHERDQ - Gather Int64 Vector With Signed Dword Indices

Opcode
MVEX.512.66.0F38.W1 90 /r /vsib

Instruction
vpgatherdq zmm1 {k1}, Ui64(mvt)

Description
Gather int64 vector Ui64(mvt) into int64 vector zmm1 using doubleword indices and k1 as
completion mask.

Description
A set of 8 memory locations pointed to by base address BASE_ADDR and doubleword
index vector VINDEX with scale SCALE are converted to an int64 vector. The result is
written into int64 vector zmm1.
Note the special mask behavior: only a subset of the active elements of write mask k1
are actually operated on (as denoted by function SELECT_SUBSET). There are only
two guarantees about the function: (a) the destination mask is a subset of the source mask
(identity is included), and (b) on a given invocation of the instruction, at least one element
(the least significant enabled mask bit) will be selected from the source mask.
Programmers should always re-execute a gather/scatter instruction (via a loop) until the
sequence has fully completed (i.e. all elements of the gather/scatter sequence have been
loaded/stored and hence the write-mask bits are all zero).
Note that each accessed element will always access 64 bytes of memory. The memory region
accessed by each element will always be between element_linear_address & (~0x3F) and
(element_linear_address & (~0x3F)) + 63 boundaries.
This instruction has special disp8*N and alignment rules. N is considered to be the size
of a single vector element before up-conversion.
Note also the special mask behavior: the corresponding bits in write mask k1 are reset
as each destination element is updated, according to the subset of write mask k1.
This allows the instruction to be conditionally re-triggered until all the elements from
a given write mask have been successfully loaded.
The instruction will #GP fault if the destination vector zmm1 is the same as index vector
VINDEX.

Operation
// instruction works over a subset of the write mask
ktemp = SELECT_SUBSET(k1)
// Use mvt as vector memory operand (VSIB)
for (n = 0; n < 8; n++) {
    if (ktemp[n] != 0) {
        i = 64*n
        j = 32*n
        // mvt[n] = BASE_ADDR + SignExtend(VINDEX[j+31:j] * SCALE)
        pointer[63:0] = mvt[n]
        zmm1[i+63:i] = UpConvLoadi64(pointer)
        k1[n] = 0
    }
}
k1[15:8] = 0
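
The differences from the dword gather (eight 64-bit elements, indices taken from the low eight dwords of VINDEX, and unconditional clearing of k1[15:8]) can be seen in the following C sketch. The name vpgatherdq_model and the subset parameter, which stands in for the implementation-chosen result of SELECT_SUBSET, are assumptions made for illustration.

```c
#include <stdint.h>
#include <string.h>

/* Scalar model of one VPGATHERDQ invocation (hypothetical helper).
 * subset models SELECT_SUBSET: only bits set in both k1 and subset
 * are serviced. Returns the mask value left in k1 afterwards. */
uint16_t vpgatherdq_model(uint64_t dst[8], uint16_t k1, uint16_t subset,
                          const void *base, const int32_t vindex[16],
                          int scale)
{
    uint16_t ktemp = (uint16_t)(k1 & subset);

    for (int n = 0; n < 8; n++) {
        if ((ktemp >> n) & 1u) {
            /* Element n uses dword index n (bits 32*n+31 : 32*n). */
            const unsigned char *p =
                (const unsigned char *)base + (int64_t)vindex[n] * scale;
            memcpy(&dst[n], p, sizeof(uint64_t));
            k1 = (uint16_t)(k1 & ~(1u << n));
        }
    }

    /* k1[15:8] = 0: the upper half of the mask is always cleared. */
    return (uint16_t)(k1 & 0x00FF);
}
```

Since the upper eight mask bits are cleared on every invocation, a completion loop only has to watch the low eight bits.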

Flags Affected
None.

Memory Up-conversion: Ui64
S2 S1 S0   Function:        Usage    disp8*N
000        no conversion    [rax]    8
001        reserved         N/A      N/A
010        reserved         N/A      N/A
011        reserved         N/A      N/A
100        reserved         N/A      N/A
101        reserved         N/A      N/A
110        reserved         N/A      N/A
111        reserved         N/A      N/A

Intel® C/C++ Compiler Intrinsic Equivalent
__m512i _mm512_i32logather_epi64 (__m512i, void const*, int);
__m512i _mm512_mask_i32logather_epi64 (__m512i, __mmask8, __m512i, void const*, int);
__m512i _mm512_i32loextgather_epi64 (__m512i, void const*, _MM_UPCONV_EPI64_ENUM, int, int);
__m512i _mm512_mask_i32loextgather_epi64 (__m512i, __mmask8, __m512i, void const*,
                                          _MM_UPCONV_EPI64_ENUM, int, int);

Exceptions
Real-Address Mode and Virtual-8086
#UD              Instruction not available in these modes

Protected and Compatibility Mode
#UD              Instruction not available in these modes

64 bit Mode
#SS(0)           If a memory address referencing the SS segment is
                 in a non-canonical form, and corresponding write-mask bit is not zero.
#GP(0)           If a memory address is in a non-canonical form,
                 and corresponding write-mask bit is not zero.
                 If a memory operand linear address is not aligned
                 to element-wise data granularity dictated by the UpConv
                 and corresponding write-mask bit is not zero.
                 If the destination vector is the same as the index vector [see
                 .
#PF(fault-code)  If a memory operand linear address produces a page fault
                 and corresponding write-mask bit is not zero.
#NM              If CR0.TS[bit 3]=1.
#UD              If preceded by any REX, F0, F2, F3, or 66 prefixes.
                 If using a 16 bit effective address.
                 If ModRM.rm is different than 100b.
                 If no write mask is provided or selected write-mask is k0.


VPMADD231D - Multiply First Source By Second Source and Add To Destination Int32 Vectors

Opcode
MVEX.NDS.512.66.0F38.W0 B5 /r

Instruction
vpmadd231d zmm1 {k1}, zmm2, Si32(zmm3/mt)

Description
Multiply int32 vector zmm2 and int32 vector Si32(zmm3/mt), add the result to int32 vector
zmm1, and store the final result in zmm1, under write-mask.

Description
Performs an element-by-element multiplication between int32 vector zmm2 and the
int32 vector result of the swizzle/broadcast/conversion process on memory or int32
vector zmm3, then adds the result to int32 vector zmm1. The final sum is written into
int32 vector zmm1.
This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.

Operation
if (source is a register operand and MVEX.EH bit is 1) {
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    tmpSrc3[511:0] = SwizzUpConvLoadi32(zmm3/mt)
}
for (n = 0; n < 16; n++) {
    if (k1[n] != 0) {
        i = 32*n
        // integer operation
        zmm1[i+31:i] = zmm2[i+31:i] * tmpSrc3[i+31:i] + zmm1[i+31:i]
    }
}
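
A scalar C sketch of the per-element multiply-add follows. The name vpmadd231d_model is hypothetical, src3 stands for the already swizzled/up-converted tmpSrc3, and the example assumes that only the low 32 bits of each product and sum are kept, as the 32-bit destination slice in the pseudocode suggests.

```c
#include <stdint.h>

/* Scalar model of VPMADD231D (hypothetical helper).
 * Unsigned arithmetic is used so that wrap-around modulo 2^32 is well
 * defined in C; the low 32 bits are the same for signed operands. */
void vpmadd231d_model(uint32_t zmm1[16], const uint32_t zmm2[16],
                      const uint32_t src3[16], uint16_t k1)
{
    for (int n = 0; n < 16; n++) {
        if ((k1 >> n) & 1u)
            zmm1[n] = zmm2[n] * src3[n] + zmm1[n];
    }
}
```

The "231" in the mnemonic reflects the operand order of the expression: the second operand times the third operand, plus the first (destination) operand.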


Flags Affected
None.

Memory Up-conversion: Si32
S2 S1 S0   Function:                    Usage                      disp8*N
000        no conversion                [rax] {16to16} or [rax]    64
001        broadcast 1 element (x16)    [rax] {1to16}              4
010        broadcast 4 elements (x4)    [rax] {4to16}              16
011        reserved                     N/A                        N/A
100        uint8 to uint32              [rax] {uint8}              16
101        sint8 to sint32              [rax] {sint8}              16
110        uint16 to uint32             [rax] {uint16}             32
111        sint16 to sint32             [rax] {sint16}             32

Register Swizzle: Si32
MVEX.EH=0
S2 S1 S0   Function: 4 x 32 bits    Usage
000        no swizzle               zmm0 or zmm0 {dcba}
001        swap (inner) pairs       zmm0 {cdab}
010        swap with two-away       zmm0 {badc}
011        cross-product swizzle    zmm0 {dacb}
100        broadcast a element      zmm0 {aaaa}
101        broadcast b element      zmm0 {bbbb}
110        broadcast c element      zmm0 {cccc}
111        broadcast d element      zmm0 {dddd}
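
The register swizzle patterns can be modeled with a small lookup table in C. This sketch assumes that within each 4-element lane the letters a, b, c, d name elements 0 to 3 and that a pattern such as {dacb} is written from the most significant element down to the least significant one; the names swizzle_sel and swizzle_si32 are hypothetical.

```c
#include <stdint.h>

/* swizzle_sel[s][e] is the source element (0 = a ... 3 = d) that feeds
 * destination element e of each 4-element lane for encoding s = S2 S1 S0. */
static const uint8_t swizzle_sel[8][4] = {
    {0, 1, 2, 3},  /* 000 {dcba}: no swizzle            */
    {1, 0, 3, 2},  /* 001 {cdab}: swap (inner) pairs    */
    {2, 3, 0, 1},  /* 010 {badc}: swap with two-away    */
    {1, 2, 0, 3},  /* 011 {dacb}: cross-product swizzle */
    {0, 0, 0, 0},  /* 100 {aaaa}: broadcast a element   */
    {1, 1, 1, 1},  /* 101 {bbbb}: broadcast b element   */
    {2, 2, 2, 2},  /* 110 {cccc}: broadcast c element   */
    {3, 3, 3, 3},  /* 111 {dddd}: broadcast d element   */
};

/* Applies swizzle s to all four lanes; out must not alias in. */
void swizzle_si32(uint32_t out[16], const uint32_t in[16], unsigned s)
{
    for (int n = 0; n < 16; n++) {
        int lane = n & ~3;  /* first element of the lane containing n */
        out[n] = in[lane + swizzle_sel[s & 7][n & 3]];
    }
}
```

The same permutation is applied independently to each of the four 128-bit lanes, so a swizzle never moves data across lanes.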

Intel® C/C++ Compiler Intrinsic Equivalent
__m512i _mm512_fmadd_epi32 (__m512i, __m512i, __m512i);
__m512i _mm512_mask_fmadd_epi32 (__m512i, __mmask16, __m512i, __m512i);
__m512i _mm512_mask3_fmadd_epi32 (__m512i, __m512i, __m512i, __mmask16);

Exceptions

Real-Address Mode and Virtual-8086
#UD              Instruction not available in these modes

Protected and Compatibility Mode
#UD              Instruction not available in these modes

64 bit Mode
#SS(0)           If a memory address referencing the SS segment is
                 in a non-canonical form.
#GP(0)           If the memory address is in a non-canonical form.
                 If a memory operand linear address is not aligned
                 to the data size granularity dictated by SwizzUpConv
                 mode.
#PF(fault-code)  For a page fault.
#NM              If CR0.TS[bit 3]=1.
#UD              If preceded by any REX, F0, F2, F3, or 66 prefixes.


VPMADD233D - Multiply First Source By Specially Swizzled Second Source
and Add To Second Source Int32 Vectors

Opcode
MVEX.NDS.512.66.0F38.W0 B4 /r

Instruction
vpmadd233d zmm1 {k1}, zmm2, Si32(zmm3/mt)

Description
Multiply int32 vector zmm2 by certain elements of int32 vector Si32(zmm3/mt), add the
result to certain elements of Si32(zmm3/mt), and store the final result in zmm1, under
write-mask.

Description
This instruction is built around the concept of 4-element sets, of which there are four:
elements 0-3, 4-7, 8-11, and 12-15. If we refer to the int32 vector result of the broadcast
(no conversion is supported) process on memory or the int32 vector zmm3 (no swizzle
is supported) as t3, then:
Each element 0-3 of int32 vector zmm2 is multiplied by element 1 of t3, the result is added
to element 0 of t3, and the final sum is written into the corresponding element 0-3 of int32
vector zmm1.
Each element 4-7 of int32 vector zmm2 is multiplied by element 5 of t3, the result is added
to element 4 of t3, and the final sum is written into the corresponding element 4-7 of int32
vector zmm1.
Each element 8-11 of int32 vector zmm2 is multiplied by element 9 of t3, the result is
added to element 8 of t3, and the final sum is written into the corresponding element
8-11 of int32 vector zmm1.
Each element 12-15 of int32 vector zmm2 is multiplied by element 13 of t3, the result is
added to element 12 of t3, and the final sum is written into the corresponding element
12-15 of int32 vector zmm1.
This instruction makes it possible to perform scale and bias in a single instruction without
needing to have either scale or bias already loaded in a register. This saves one vector load
for each interpolant, representing around ten percent of shader instructions.
For structure-of-arrays (SOA) operation, this instruction is intended to be used with the
{4to16} broadcast on src2, so that all 16 elements use the same scale and bias. For
array-of-structures (AOS) vec4 operations, no broadcast is used, allowing four different
scales and biases, one for each vec4.
No conversion or swizzling is supported for this instruction. However, all broadcasts
except {1to16} are supported (i.e. 16to16 and 4to16).
This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.

Operation
if (source is a register operand and MVEX.EH bit is 1) {
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    tmpSrc3[511:0] = SwizzUpConvLoadi32(zmm3/mt)
}
for (n = 0; n < 16; n++) {
    if (k1[n] != 0) {
        i = 32*n
        base = ( n & ~0x03 ) * 32
        scale[31:0] = tmpSrc3[base+63:base+32]
        bias[31:0] = tmpSrc3[base+31:base]
        // integer operation
        zmm1[i+31:i] = zmm2[i+31:i] * scale[31:0] + bias[31:0]
    }
}
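
The scale-and-bias selection can be illustrated with a scalar C sketch. The name vpmadd233d_model is hypothetical, t3 stands for the broadcast result tmpSrc3, and results are assumed to wrap modulo 2^32 as in the pseudocode's 32-bit destination slice.

```c
#include <stdint.h>

/* Scalar model of VPMADD233D (hypothetical helper).
 * For each 4-element set, element 1 of t3 is the scale and element 0
 * of t3 is the bias; elements 2 and 3 of each set are unused. */
void vpmadd233d_model(uint32_t zmm1[16], const uint32_t zmm2[16],
                      const uint32_t t3[16], uint16_t k1)
{
    uint32_t tmp[16];

    /* Snapshot t3 so that zmm1 may alias it. */
    for (int n = 0; n < 16; n++)
        tmp[n] = t3[n];

    for (int n = 0; n < 16; n++) {
        if ((k1 >> n) & 1u) {
            int base = n & ~3;              /* first element of the set */
            uint32_t scale = tmp[base + 1];
            uint32_t bias = tmp[base];
            zmm1[n] = zmm2[n] * scale + bias;
        }
    }
}
```

With a {4to16} broadcast every set sees the same scale and bias pair, while without broadcast each vec4 can carry its own pair.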

Flags Affected
None.

Memory Up-conversion: Si32
S2 S1 S0   Function:                    Usage                      disp8*N
000        no conversion                [rax] {16to16} or [rax]    64
001        reserved                     N/A                        N/A
010        broadcast 4 elements (x4)    [rax] {4to16}              16
011        reserved                     N/A                        N/A
100        reserved                     N/A                        N/A
101        reserved                     N/A                        N/A
110        reserved                     N/A                        N/A
111        reserved                     N/A                        N/A

Register Swizzle: Si32
MVEX.EH=0
S2 S1 S0   Function: 4 x 32 bits    Usage
000        no swizzle               zmm0 or zmm0 {dcba}
001        reserved                 N/A
010        reserved                 N/A
011        reserved                 N/A
100        reserved                 N/A
101        reserved                 N/A
110        reserved                 N/A
111        reserved                 N/A

Intel® C/C++ Compiler Intrinsic Equivalent
__m512i _mm512_fmadd233_epi32 (__m512i, __m512i);
__m512i _mm512_mask_fmadd233_epi32 (__m512i, __mmask16, __m512i, __m512i);

Exceptions

Real-Address Mode and Virtual-8086
#UD              Instruction not available in these modes

Protected and Compatibility Mode
#UD              Instruction not available in these modes

64 bit Mode
#SS(0)           If a memory address referencing the SS segment is
                 in a non-canonical form.
#GP(0)           If the memory address is in a non-canonical form.
                 If a memory operand linear address is not aligned
                 to 16 or 64 bytes (depending on the swizzle broadcast).
#PF(fault-code)  For a page fault.
#NM              If CR0.TS[bit 3]=1.
#UD              If preceded by any REX, F0, F2, F3, or 66 prefixes.
                 This instruction does not support any
                 SwizzUpConv involving data conversion, register swizzling or
                 {1to16} broadcast. If SwizzUpConv function is set to any
                 value different than "no action" or {4to16} then
                 an Invalid Opcode fault is raised.


VPMAXSD - Maximum of Int32 Vectors

Opcode
MVEX.NDS.512.66.0F38.W0 3D /r

Instruction
vpmaxsd zmm1 {k1}, zmm2, Si32(zmm3/mt)

Description
Determine the maximum of int32 vector zmm2 and int32 vector Si32(zmm3/mt) and store the
result in zmm1, under write-mask.

Description
Determines the maximum value of each pair of corresponding elements in int32 vector
zmm2 and the int32 vector result of the swizzle/broadcast/conversion process on memory or int32 vector zmm3. The result is written into int32 vector zmm1.
This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.

Operation
if (source is a register operand and MVEX.EH bit is 1) {
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    tmpSrc3[511:0] = SwizzUpConvLoadi32(zmm3/mt)
}
for (n = 0; n < 16; n++) {
    if (k1[n] != 0) {
        i = 32*n
        // signed integer operation
        zmm1[i+31:i] = IMax(zmm2[i+31:i], tmpSrc3[i+31:i])
    }
}

Flags Affected
None.


Memory Up-conversion: Si32
S2 S1 S0   Function:                    Usage                      disp8*N
000        no conversion                [rax] {16to16} or [rax]    64
001        broadcast 1 element (x16)    [rax] {1to16}              4
010        broadcast 4 elements (x4)    [rax] {4to16}              16
011        reserved                     N/A                        N/A
100        uint8 to uint32              [rax] {uint8}              16
101        sint8 to sint32              [rax] {sint8}              16
110        uint16 to uint32             [rax] {uint16}             32
111        sint16 to sint32             [rax] {sint16}             32
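
The memory side of the Si32 table can be sketched in C as a loader that expands the memory operand into 16 dword elements and reports how many bytes were read, which matches the disp8*N column. The name upconv_load_si32 is hypothetical, and the sketch assumes a little-endian host such as x86.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Model of the Si32 memory broadcast/up-conversion (hypothetical helper).
 * s is the S2 S1 S0 encoding. Returns the number of bytes read from mem,
 * or 0 for the reserved encoding. */
size_t upconv_load_si32(uint32_t out[16], const void *mem, unsigned s)
{
    const unsigned char *p = (const unsigned char *)mem;

    switch (s & 7) {
    case 0:  /* no conversion: 16 dwords */
        memcpy(out, p, 64);
        return 64;
    case 1: {  /* {1to16}: one dword replicated 16 times */
        uint32_t v;
        memcpy(&v, p, sizeof v);
        for (int n = 0; n < 16; n++)
            out[n] = v;
        return 4;
    }
    case 2: {  /* {4to16}: four dwords replicated into each lane */
        uint32_t v[4];
        memcpy(v, p, sizeof v);
        for (int n = 0; n < 16; n++)
            out[n] = v[n & 3];
        return 16;
    }
    case 4:  /* uint8 to uint32: zero-extend 16 bytes */
        for (int n = 0; n < 16; n++)
            out[n] = p[n];
        return 16;
    case 5:  /* sint8 to sint32: sign-extend 16 bytes */
        for (int n = 0; n < 16; n++) {
            int8_t v;
            memcpy(&v, p + n, sizeof v);
            out[n] = (uint32_t)(int32_t)v;
        }
        return 16;
    case 6:  /* uint16 to uint32: zero-extend 16 words */
        for (int n = 0; n < 16; n++) {
            uint16_t v;
            memcpy(&v, p + 2 * n, sizeof v);
            out[n] = v;
        }
        return 32;
    case 7:  /* sint16 to sint32: sign-extend 16 words */
        for (int n = 0; n < 16; n++) {
            int16_t v;
            memcpy(&v, p + 2 * n, sizeof v);
            out[n] = (uint32_t)(int32_t)v;
        }
        return 32;
    default:  /* 011 is reserved */
        return 0;
    }
}
```

The byte counts explain the disp8*N values: a compressed 8-bit displacement is scaled by the size of the memory footprint actually read for the selected mode.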

Register Swizzle: Si32
MVEX.EH=0
S2 S1 S0   Function: 4 x 32 bits    Usage
000        no swizzle               zmm0 or zmm0 {dcba}
001        swap (inner) pairs       zmm0 {cdab}
010        swap with two-away       zmm0 {badc}
011        cross-product swizzle    zmm0 {dacb}
100        broadcast a element      zmm0 {aaaa}
101        broadcast b element      zmm0 {bbbb}
110        broadcast c element      zmm0 {cccc}
111        broadcast d element      zmm0 {dddd}

Intel® C/C++ Compiler Intrinsic Equivalent
__m512i _mm512_max_epi32 (__m512i, __m512i);
__m512i _mm512_mask_max_epi32 (__m512i, __mmask16, __m512i, __m512i);

Exceptions
Real-Address Mode and Virtual-8086
#UD              Instruction not available in these modes

Protected and Compatibility Mode
#UD              Instruction not available in these modes

64 bit Mode
#SS(0)           If a memory address referencing the SS segment is
                 in a non-canonical form.
#GP(0)           If the memory address is in a non-canonical form.
                 If a memory operand linear address is not aligned
                 to the data size granularity dictated by SwizzUpConv
                 mode.
#PF(fault-code)  For a page fault.
#NM              If CR0.TS[bit 3]=1.
#UD              If preceded by any REX, F0, F2, F3, or 66 prefixes.


VPMAXUD - Maximum of Uint32 Vectors

Opcode
MVEX.NDS.512.66.0F38.W0 3F /r

Instruction
vpmaxud zmm1 {k1}, zmm2, Si32(zmm3/mt)

Description
Determine the maximum of uint32 vector zmm2 and uint32 vector Si32(zmm3/mt) and
store the result in zmm1, under write-mask.

Description
Determines the maximum value of each pair of corresponding elements in uint32 vector zmm2 and the uint32 vector result of the swizzle/broadcast/conversion process on
memory or uint32 vector zmm3. The result is written into uint32 vector zmm1.
This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.

Operation
if (source is a register operand and MVEX.EH bit is 1) {
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    tmpSrc3[511:0] = SwizzUpConvLoadi32(zmm3/mt)
}
for (n = 0; n < 16; n++) {
    if (k1[n] != 0) {
        i = 32*n
        // unsigned integer operation
        zmm1[i+31:i] = UMax(zmm2[i+31:i], tmpSrc3[i+31:i])
    }
}
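
The only difference between the signed and unsigned maximum is how the same 32-bit patterns are ordered. The following C sketch contrasts the two; the names imax32, umax32 and vpmax_model are hypothetical stand-ins for IMax, UMax and the masked loop, and src3 represents the already processed tmpSrc3.

```c
#include <stdint.h>

/* Unsigned maximum of two 32-bit patterns (stand-in for UMax). */
uint32_t umax32(uint32_t a, uint32_t b)
{
    return (a > b) ? a : b;
}

/* Signed maximum of two 32-bit patterns (stand-in for IMax).
 * Flipping the sign bit maps two's complement order onto unsigned
 * order, which avoids implementation-defined signed conversions. */
uint32_t imax32(uint32_t a, uint32_t b)
{
    return ((a ^ 0x80000000u) > (b ^ 0x80000000u)) ? a : b;
}

/* Masked element-wise maximum; is_unsigned selects UMax over IMax. */
void vpmax_model(uint32_t zmm1[16], const uint32_t zmm2[16],
                 const uint32_t src3[16], uint16_t k1, int is_unsigned)
{
    for (int n = 0; n < 16; n++) {
        if ((k1 >> n) & 1u)
            zmm1[n] = is_unsigned ? umax32(zmm2[n], src3[n])
                                  : imax32(zmm2[n], src3[n]);
    }
}
```

For example, the pattern 0xFFFFFFFF compared with 1 loses the signed comparison (it is -1) but wins the unsigned one (it is 4294967295).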

Flags Affected
None.


Memory Up-conversion: Si32
S2 S1 S0   Function:                    Usage                      disp8*N
000        no conversion                [rax] {16to16} or [rax]    64
001        broadcast 1 element (x16)    [rax] {1to16}              4
010        broadcast 4 elements (x4)    [rax] {4to16}              16
011        reserved                     N/A                        N/A
100        uint8 to uint32              [rax] {uint8}              16
101        sint8 to sint32              [rax] {sint8}              16
110        uint16 to uint32             [rax] {uint16}             32
111        sint16 to sint32             [rax] {sint16}             32

Register Swizzle: Si32
MVEX.EH=0
S2 S1 S0   Function: 4 x 32 bits    Usage
000        no swizzle               zmm0 or zmm0 {dcba}
001        swap (inner) pairs       zmm0 {cdab}
010        swap with two-away       zmm0 {badc}
011        cross-product swizzle    zmm0 {dacb}
100        broadcast a element      zmm0 {aaaa}
101        broadcast b element      zmm0 {bbbb}
110        broadcast c element      zmm0 {cccc}
111        broadcast d element      zmm0 {dddd}

Intel® C/C++ Compiler Intrinsic Equivalent
__m512i _mm512_max_epu32 (__m512i, __m512i);
__m512i _mm512_mask_max_epu32 (__m512i, __mmask16, __m512i, __m512i);

Exceptions
Real-Address Mode and Virtual-8086
#UD              Instruction not available in these modes

Protected and Compatibility Mode
#UD              Instruction not available in these modes

64 bit Mode
#SS(0)           If a memory address referencing the SS segment is
                 in a non-canonical form.
#GP(0)           If the memory address is in a non-canonical form.
                 If a memory operand linear address is not aligned
                 to the data size granularity dictated by SwizzUpConv
                 mode.
#PF(fault-code)  For a page fault.
#NM              If CR0.TS[bit 3]=1.
#UD              If preceded by any REX, F0, F2, F3, or 66 prefixes.


VPMINSD - Minimum of Int32 Vectors

Opcode
MVEX.NDS.512.66.0F38.W0 39 /r

Instruction
vpminsd zmm1 {k1}, zmm2, Si32(zmm3/mt)

Description
Determine the minimum of int32 vector zmm2 and int32 vector Si32(zmm3/mt) and store the
result in zmm1, under write-mask.

Description
Determines the minimum value of each pair of corresponding elements in int32 vector
zmm2 and the int32 vector result of the swizzle/broadcast/conversion process on memory or int32 vector zmm3. The result is written into int32 vector zmm1.
This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.

Operation
if (source is a register operand and MVEX.EH bit is 1) {
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    tmpSrc3[511:0] = SwizzUpConvLoadi32(zmm3/mt)
}
for (n = 0; n < 16; n++) {
    if (k1[n] != 0) {
        i = 32*n
        // signed integer operation
        zmm1[i+31:i] = (zmm2[i+31:i] < tmpSrc3[i+31:i]) ?
                       zmm2[i+31:i] : tmpSrc3[i+31:i]
    }
}

Flags Affected
None.


Memory Up-conversion: Si32
S2 S1 S0   Function:                    Usage                      disp8*N
000        no conversion                [rax] {16to16} or [rax]    64
001        broadcast 1 element (x16)    [rax] {1to16}              4
010        broadcast 4 elements (x4)    [rax] {4to16}              16
011        reserved                     N/A                        N/A
100        uint8 to uint32              [rax] {uint8}              16
101        sint8 to sint32              [rax] {sint8}              16
110        uint16 to uint32             [rax] {uint16}             32
111        sint16 to sint32             [rax] {sint16}             32

Register Swizzle: Si32
MVEX.EH=0
S2 S1 S0   Function: 4 x 32 bits    Usage
000        no swizzle               zmm0 or zmm0 {dcba}
001        swap (inner) pairs       zmm0 {cdab}
010        swap with two-away       zmm0 {badc}
011        cross-product swizzle    zmm0 {dacb}
100        broadcast a element      zmm0 {aaaa}
101        broadcast b element      zmm0 {bbbb}
110        broadcast c element      zmm0 {cccc}
111        broadcast d element      zmm0 {dddd}

Intel® C/C++ Compiler Intrinsic Equivalent
__m512i _mm512_min_epi32 (__m512i, __m512i);
__m512i _mm512_mask_min_epi32 (__m512i, __mmask16, __m512i, __m512i);

Exceptions
Real-Address Mode and Virtual-8086
#UD              Instruction not available in these modes

Protected and Compatibility Mode
#UD              Instruction not available in these modes

64 bit Mode
#SS(0)           If a memory address referencing the SS segment is
                 in a non-canonical form.
#GP(0)           If the memory address is in a non-canonical form.
                 If a memory operand linear address is not aligned
                 to the data size granularity dictated by SwizzUpConv
                 mode.
#PF(fault-code)  For a page fault.
#NM              If CR0.TS[bit 3]=1.
#UD              If preceded by any REX, F0, F2, F3, or 66 prefixes.


VPMINUD - Minimum of Uint32 Vectors

Opcode
MVEX.NDS.512.66.0F38.W0 3B /r

Instruction
vpminud zmm1 {k1}, zmm2, Si32(zmm3/mt)

Description
Determine the minimum of uint32 vector zmm2 and uint32 vector Si32(zmm3/mt) and
store the result in zmm1, under write-mask.

Description
Determines the minimum value of each pair of corresponding elements in uint32 vector
zmm2 and the uint32 vector result of the swizzle/broadcast/conversion process on memory or uint32 vector zmm3. The result is written into uint32 vector zmm1.
This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.

Operation
if (source is a register operand and MVEX.EH bit is 1) {
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    tmpSrc3[511:0] = SwizzUpConvLoadi32(zmm3/mt)
}
for (n = 0; n < 16; n++) {
    if (k1[n] != 0) {
        i = 32*n
        // unsigned integer operation
        zmm1[i+31:i] = UMin(zmm2[i+31:i], tmpSrc3[i+31:i])
    }
}

Flags Affected
None.


Memory Up-conversion: Si32
S2 S1 S0   Function:                    Usage                      disp8*N
000        no conversion                [rax] {16to16} or [rax]    64
001        broadcast 1 element (x16)    [rax] {1to16}              4
010        broadcast 4 elements (x4)    [rax] {4to16}              16
011        reserved                     N/A                        N/A
100        uint8 to uint32              [rax] {uint8}              16
101        sint8 to sint32              [rax] {sint8}              16
110        uint16 to uint32             [rax] {uint16}             32
111        sint16 to sint32             [rax] {sint16}             32

Register Swizzle: Si32
MVEX.EH=0
S2 S1 S0   Function: 4 x 32 bits    Usage
000        no swizzle               zmm0 or zmm0 {dcba}
001        swap (inner) pairs       zmm0 {cdab}
010        swap with two-away       zmm0 {badc}
011        cross-product swizzle    zmm0 {dacb}
100        broadcast a element      zmm0 {aaaa}
101        broadcast b element      zmm0 {bbbb}
110        broadcast c element      zmm0 {cccc}
111        broadcast d element      zmm0 {dddd}

Intel® C/C++ Compiler Intrinsic Equivalent
__m512i _mm512_min_epu32 (__m512i, __m512i);
__m512i _mm512_mask_min_epu32 (__m512i, __mmask16, __m512i, __m512i);

Exceptions
Real-Address Mode and Virtual-8086
#UD              Instruction not available in these modes

Protected and Compatibility Mode
#UD              Instruction not available in these modes

64 bit Mode
#SS(0)           If a memory address referencing the SS segment is
                 in a non-canonical form.
#GP(0)           If the memory address is in a non-canonical form.
                 If a memory operand linear address is not aligned
                 to the data size granularity dictated by SwizzUpConv
                 mode.
#PF(fault-code)  For a page fault.
#NM              If CR0.TS[bit 3]=1.
#UD              If preceded by any REX, F0, F2, F3, or 66 prefixes.


VPMULHD - Multiply Int32 Vectors And Store High Result

Opcode
MVEX.NDS.512.66.0F38.W0 87 /r

Instruction
vpmulhd zmm1 {k1}, zmm2, Si32(zmm3/mt)

Description
Multiply int32 vector zmm2 and int32 vector Si32(zmm3/mt) and store the high 32 bits of
the result in zmm1, under write-mask.

Description
Performs an element-by-element multiplication between int32 vector zmm2 and the
int32 vector result of the swizzle/broadcast/conversion process on memory or int32
vector zmm3. The high 32 bits of each 64-bit product are written into int32 vector zmm1.
This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.

Operation
if (source is a register operand and MVEX.EH bit is 1) {
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    tmpSrc3[511:0] = SwizzUpConvLoadi32(zmm3/mt)
}
for (n = 0; n < 16; n++) {
    if (k1[n] != 0) {
        i = 32*n
        // signed integer operation
        tmp[63:0] = zmm2[i+31:i] * tmpSrc3[i+31:i]
        zmm1[i+31:i] = tmp[63:32]
    }
}
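
A scalar C sketch of the high-half signed multiply is shown below. The names mulhi_i32 and vpmulhd_model are hypothetical; the result is returned as a raw 32-bit pattern so that the example does not depend on implementation-defined signed shifts or conversions.

```c
#include <stdint.h>

/* High 32 bits of the signed 64-bit product a * b, returned as a raw
 * bit pattern (corresponds to tmp[63:32] in the pseudocode). */
uint32_t mulhi_i32(int32_t a, int32_t b)
{
    int64_t p = (int64_t)a * (int64_t)b;  /* cannot overflow 64 bits */
    return (uint32_t)((uint64_t)p >> 32);
}

/* Masked element-wise model of VPMULHD; src3 stands for tmpSrc3. */
void vpmulhd_model(uint32_t zmm1[16], const int32_t zmm2[16],
                   const int32_t src3[16], uint16_t k1)
{
    for (int n = 0; n < 16; n++) {
        if ((k1 >> n) & 1u)
            zmm1[n] = mulhi_i32(zmm2[n], src3[n]);
    }
}
```

For small operands the high half is simply the sign extension of the product: 0 for non-negative products and 0xFFFFFFFF for negative ones.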

Flags Affected
None.


Memory Up-conversion: Si32
S2 S1 S0   Function:                    Usage                      disp8*N
000        no conversion                [rax] {16to16} or [rax]    64
001        broadcast 1 element (x16)    [rax] {1to16}              4
010        broadcast 4 elements (x4)    [rax] {4to16}              16
011        reserved                     N/A                        N/A
100        uint8 to uint32              [rax] {uint8}              16
101        sint8 to sint32              [rax] {sint8}              16
110        uint16 to uint32             [rax] {uint16}             32
111        sint16 to sint32             [rax] {sint16}             32

Register Swizzle: Si32
MVEX.EH=0
S2 S1 S0   Function: 4 x 32 bits    Usage
000        no swizzle               zmm0 or zmm0 {dcba}
001        swap (inner) pairs       zmm0 {cdab}
010        swap with two-away       zmm0 {badc}
011        cross-product swizzle    zmm0 {dacb}
100        broadcast a element      zmm0 {aaaa}
101        broadcast b element      zmm0 {bbbb}
110        broadcast c element      zmm0 {cccc}
111        broadcast d element      zmm0 {dddd}

Intel® C/C++ Compiler Intrinsic Equivalent
__m512i _mm512_mulhi_epi32 (__m512i, __m512i);
__m512i _mm512_mask_mulhi_epi32 (__m512i, __mmask16, __m512i, __m512i);

Exceptions
Real-Address Mode and Virtual-8086
#UD               Instruction not available in these modes

Protected and Compatibility Mode
#UD               Instruction not available in these modes

64 bit Mode
#SS(0)            If a memory address referencing the SS segment is
                  in a non-canonical form.
#GP(0)            If the memory address is in a non-canonical form.
                  If a memory operand linear address is not aligned
                  to the data size granularity dictated by SwizzUpConv
                  mode.
#PF(fault-code)   For a page fault.
#NM               If CR0.TS[bit 3]=1.
#UD               If preceded by any REX, F0, F2, F3, or 66 prefixes.

VPMULHUD - Multiply Uint32 Vectors And Store High Result

Opcode:      MVEX.NDS.512.66.0F38.W0 86 /r
Instruction: vpmulhud zmm1 {k1}, zmm2, Si32(zmm3/mt)
Description: Multiply uint32 vector zmm2 and uint32 vector Si32(zmm3/mt) and store
             the result in zmm1, under write-mask.

Description
Performs an element-by-element multiplication between uint32 vector zmm2 and the
uint32 vector result of the swizzle/broadcast/conversion process on memory or uint32
vector zmm3. The high 32 bits of each 64-bit product are written into uint32 vector zmm1.
This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.

Operation
if(source is a register operand and MVEX.EH bit is 1) {
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    tmpSrc3[511:0] = SwizzUpConvLoadi32(zmm3/mt)
}
for (n = 0; n < 16; n++) {
    if(k1[n] != 0) {
        i = 32*n
        // unsigned integer operation
        tmp[63:0] = zmm2[i+31:i] * tmpSrc3[i+31:i]
        zmm1[i+31:i] = tmp[63:32]
    }
}

Flags Affected
None.
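
A per-element view of the unsigned high multiply may help when checking results by hand. The sketch below is illustrative only: `mulhi_u32` and `mulhi_epu32_model` are hypothetical names, and the second source is assumed to be already swizzled or up-converted.

```c
#include <stdint.h>

/* High 32 bits of the 64-bit unsigned product of one element pair. */
uint32_t mulhi_u32(uint32_t a, uint32_t b)
{
    return (uint32_t)(((uint64_t)a * (uint64_t)b) >> 32);
}

/* Hypothetical scalar model of the write-masked 16-lane operation. */
void mulhi_epu32_model(uint32_t dst[16], uint16_t k1,
                       const uint32_t a[16], const uint32_t b[16])
{
    for (int n = 0; n < 16; n++) {
        if ((k1 >> n) & 1) {
            dst[n] = mulhi_u32(a[n], b[n]);
        }
    }
}
```

With unsigned interpretation, 0x80000000 times 2 equals 2^32, so the stored high half is 1.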


Memory Up-conversion: Si32
S2 S1 S0   Function                     Usage                      disp8*N
000        no conversion                [rax] {16to16} or [rax]    64
001        broadcast 1 element (x16)    [rax] {1to16}              4
010        broadcast 4 elements (x4)    [rax] {4to16}              16
011        reserved                     N/A                        N/A
100        uint8 to uint32              [rax] {uint8}              16
101        sint8 to sint32              [rax] {sint8}              16
110        uint16 to uint32             [rax] {uint16}             32
111        sint16 to sint32             [rax] {sint16}             32

Register Swizzle: Si32
MVEX.EH=0
S2 S1 S0   Function: 4 x 32 bits        Usage
000        no swizzle                   zmm0 or zmm0 {dcba}
001        swap (inner) pairs           zmm0 {cdab}
010        swap with two-away           zmm0 {badc}
011        cross-product swizzle        zmm0 {dacb}
100        broadcast a element          zmm0 {aaaa}
101        broadcast b element          zmm0 {bbbb}
110        broadcast c element          zmm0 {cccc}
111        broadcast d element          zmm0 {dddd}

Intel® C/C++ Compiler Intrinsic Equivalent
__m512i _mm512_mulhi_epu32 (__m512i, __m512i);
__m512i _mm512_mask_mulhi_epu32 (__m512i, __mmask16, __m512i, __m512i);

Exceptions
Real-Address Mode and Virtual-8086
#UD               Instruction not available in these modes

Protected and Compatibility Mode
#UD               Instruction not available in these modes

64 bit Mode
#SS(0)            If a memory address referencing the SS segment is
                  in a non-canonical form.
#GP(0)            If the memory address is in a non-canonical form.
                  If a memory operand linear address is not aligned
                  to the data size granularity dictated by SwizzUpConv
                  mode.
#PF(fault-code)   For a page fault.
#NM               If CR0.TS[bit 3]=1.
#UD               If preceded by any REX, F0, F2, F3, or 66 prefixes.

VPMULLD - Multiply Int32 Vectors And Store Low Result

Opcode:      MVEX.NDS.512.66.0F38.W0 40 /r
Instruction: vpmulld zmm1 {k1}, zmm2, Si32(zmm3/mt)
Description: Multiply int32 vector zmm2 and int32 vector Si32(zmm3/mt) and store
             the result in zmm1, under write-mask.

Description
Performs an element-by-element multiplication between int32 vector zmm2 and the
int32 vector result of the swizzle/broadcast/conversion process on memory or int32 vector zmm3, and the low 32 bits of the result are written into int32 vector zmm1.
This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.

Operation
if(source is a register operand and MVEX.EH bit is 1) {
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    tmpSrc3[511:0] = SwizzUpConvLoadi32(zmm3/mt)
}
for (n = 0; n < 16; n++) {
    if(k1[n] != 0) {
        i = 32*n
        // signed integer operation
        zmm1[i+31:i] = zmm2[i+31:i] * tmpSrc3[i+31:i]
    }
}

Flags Affected
None.
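
The low-result multiply simply truncates each product to 32 bits. A minimal scalar sketch follows, assuming lanes are stored as raw 32-bit patterns and the second source is already swizzled or up-converted; `mullo_epi32_model` is a hypothetical name, not part of any Intel header.

```c
#include <stdint.h>

/* Hypothetical scalar model: keep only bits 31:0 of each product. */
void mullo_epi32_model(uint32_t dst[16], uint16_t k1,
                       const uint32_t a[16], const uint32_t b[16])
{
    for (int n = 0; n < 16; n++) {
        if ((k1 >> n) & 1) {
            /* Widen first so the multiply itself cannot overflow,
             * then truncate to the low 32 bits. */
            dst[n] = (uint32_t)((uint64_t)a[n] * (uint64_t)b[n]);
        }
    }
}
```

Because the low 32 bits of a two's-complement product do not depend on signedness, the pattern 0xFFFFFFFF (that is, -1) times 2 yields 0xFFFFFFFE (that is, -2).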


Memory Up-conversion: Si32
S2 S1 S0   Function                     Usage                      disp8*N
000        no conversion                [rax] {16to16} or [rax]    64
001        broadcast 1 element (x16)    [rax] {1to16}              4
010        broadcast 4 elements (x4)    [rax] {4to16}              16
011        reserved                     N/A                        N/A
100        uint8 to uint32              [rax] {uint8}              16
101        sint8 to sint32              [rax] {sint8}              16
110        uint16 to uint32             [rax] {uint16}             32
111        sint16 to sint32             [rax] {sint16}             32

Register Swizzle: Si32
MVEX.EH=0
S2 S1 S0   Function: 4 x 32 bits        Usage
000        no swizzle                   zmm0 or zmm0 {dcba}
001        swap (inner) pairs           zmm0 {cdab}
010        swap with two-away           zmm0 {badc}
011        cross-product swizzle        zmm0 {dacb}
100        broadcast a element          zmm0 {aaaa}
101        broadcast b element          zmm0 {bbbb}
110        broadcast c element          zmm0 {cccc}
111        broadcast d element          zmm0 {dddd}

Intel® C/C++ Compiler Intrinsic Equivalent
__m512i _mm512_mullo_epi32 (__m512i, __m512i);
__m512i _mm512_mask_mullo_epi32 (__m512i, __mmask16, __m512i, __m512i);

Exceptions
Real-Address Mode and Virtual-8086
#UD               Instruction not available in these modes

Protected and Compatibility Mode
#UD               Instruction not available in these modes

64 bit Mode
#SS(0)            If a memory address referencing the SS segment is
                  in a non-canonical form.
#GP(0)            If the memory address is in a non-canonical form.
                  If a memory operand linear address is not aligned
                  to the data size granularity dictated by SwizzUpConv
                  mode.
#PF(fault-code)   For a page fault.
#NM               If CR0.TS[bit 3]=1.
#UD               If preceded by any REX, F0, F2, F3, or 66 prefixes.

VPORD - Bitwise OR Int32 Vectors

Opcode:      MVEX.NDS.512.66.0F.W0 EB /r
Instruction: vpord zmm1 {k1}, zmm2, Si32(zmm3/mt)
Description: Perform a bitwise OR between int32 vector zmm2 and int32 vector
             Si32(zmm3/mt) and store the result in zmm1, under write-mask.

Description
Performs an element-by-element bitwise OR between int32 vector zmm2 and the int32
vector result of the swizzle/broadcast/conversion process on memory or int32 vector
zmm3. The result is written into int32 vector zmm1.
This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.

Operation
if(source is a register operand and MVEX.EH bit is 1) {
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    tmpSrc3[511:0] = SwizzUpConvLoadi32(zmm3/mt)
}
for (n = 0; n < 16; n++) {
    if(k1[n] != 0) {
        i = 32*n
        zmm1[i+31:i] = zmm2[i+31:i] | tmpSrc3[i+31:i]
    }
}

Flags Affected
None.
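
To illustrate how a broadcast memory operand feeds the OR, the sketch below models only the {1to16} memory form, in which a single 32-bit element is replicated to all 16 lanes before the operation. The name `or_epi32_1to16_model` is hypothetical and the code is a scalar approximation, not the hardware behavior in every detail.

```c
#include <stdint.h>

/* Hypothetical scalar model of a masked OR with a {1to16} memory source. */
void or_epi32_1to16_model(uint32_t dst[16], uint16_t k1,
                          const uint32_t a[16], const uint32_t *mem)
{
    uint32_t bcast = mem[0];   /* one element read, replicated to all lanes */
    for (int n = 0; n < 16; n++) {
        if ((k1 >> n) & 1) {
            dst[n] = a[n] | bcast;
        }
    }
}
```

This form is a compact way to set the same flag bits in every selected lane while leaving unselected lanes untouched.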


Memory Up-conversion: Si32
S2 S1 S0   Function                     Usage                      disp8*N
000        no conversion                [rax] {16to16} or [rax]    64
001        broadcast 1 element (x16)    [rax] {1to16}              4
010        broadcast 4 elements (x4)    [rax] {4to16}              16
011        reserved                     N/A                        N/A
100        uint8 to uint32              [rax] {uint8}              16
101        sint8 to sint32              [rax] {sint8}              16
110        uint16 to uint32             [rax] {uint16}             32
111        sint16 to sint32             [rax] {sint16}             32

Register Swizzle: Si32
MVEX.EH=0
S2 S1 S0   Function: 4 x 32 bits        Usage
000        no swizzle                   zmm0 or zmm0 {dcba}
001        swap (inner) pairs           zmm0 {cdab}
010        swap with two-away           zmm0 {badc}
011        cross-product swizzle        zmm0 {dacb}
100        broadcast a element          zmm0 {aaaa}
101        broadcast b element          zmm0 {bbbb}
110        broadcast c element          zmm0 {cccc}
111        broadcast d element          zmm0 {dddd}

Intel® C/C++ Compiler Intrinsic Equivalent
__m512i _mm512_or_epi32 (__m512i, __m512i);
__m512i _mm512_mask_or_epi32 (__m512i, __mmask16, __m512i, __m512i);

Exceptions
Real-Address Mode and Virtual-8086
#UD               Instruction not available in these modes

Protected and Compatibility Mode
#UD               Instruction not available in these modes

64 bit Mode
#SS(0)            If a memory address referencing the SS segment is
                  in a non-canonical form.
#GP(0)            If the memory address is in a non-canonical form.
                  If a memory operand linear address is not aligned
                  to the data size granularity dictated by SwizzUpConv
                  mode.
#PF(fault-code)   For a page fault.
#NM               If CR0.TS[bit 3]=1.
#UD               If preceded by any REX, F0, F2, F3, or 66 prefixes.

VPORQ - Bitwise OR Int64 Vectors

Opcode:      MVEX.NDS.512.66.0F.W1 EB /r
Instruction: vporq zmm1 {k1}, zmm2, Si64(zmm3/mt)
Description: Perform a bitwise OR between int64 vector zmm2 and int64 vector
             Si64(zmm3/mt) and store the result in zmm1, under write-mask.

Description
Performs an element-by-element bitwise OR between int64 vector zmm2 and the int64
vector result of the swizzle/broadcast/conversion process on memory or int64 vector
zmm3. The result is written into int64 vector zmm1.
This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.

Operation
if(source is a register operand and MVEX.EH bit is 1) {
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    tmpSrc3[511:0] = SwizzUpConvLoadi64(zmm3/mt)
}
for (n = 0; n < 8; n++) {
    if(k1[n] != 0) {
        i = 64*n
        zmm1[i+63:i] = zmm2[i+63:i] | tmpSrc3[i+63:i]
    }
}

Flags Affected
None.
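
With 64-bit elements only eight lanes and eight mask bits are involved. The following sketch models the {4to8} memory form, where four 64-bit elements are read and repeated twice across the vector. It is a scalar approximation with a hypothetical name (`or_epi64_4to8_model`), offered as an illustration rather than a definitive implementation.

```c
#include <stdint.h>

/* Hypothetical scalar model of a masked 64-bit OR with a {4to8} source. */
void or_epi64_4to8_model(uint64_t dst[8], uint8_t k1,
                         const uint64_t a[8], const uint64_t mem[4])
{
    for (int n = 0; n < 8; n++) {
        if ((k1 >> n) & 1) {
            /* Lanes 0..3 and lanes 4..7 both see mem[0..3]. */
            dst[n] = a[n] | mem[n & 3];
        }
    }
}
```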


Memory Up-conversion: Si64
S2 S1 S0   Function                     Usage                      disp8*N
000        no conversion                [rax] {8to8} or [rax]      64
001        broadcast 1 element (x8)     [rax] {1to8}               8
010        broadcast 4 elements (x2)    [rax] {4to8}               32
011        reserved                     N/A                        N/A
100        reserved                     N/A                        N/A
101        reserved                     N/A                        N/A
110        reserved                     N/A                        N/A
111        reserved                     N/A                        N/A

Register Swizzle: Si64
MVEX.EH=0
S2 S1 S0   Function: 4 x 64 bits        Usage
000        no swizzle                   zmm0 or zmm0 {dcba}
001        swap (inner) pairs           zmm0 {cdab}
010        swap with two-away           zmm0 {badc}
011        cross-product swizzle        zmm0 {dacb}
100        broadcast a element          zmm0 {aaaa}
101        broadcast b element          zmm0 {bbbb}
110        broadcast c element          zmm0 {cccc}
111        broadcast d element          zmm0 {dddd}

Intel® C/C++ Compiler Intrinsic Equivalent
__m512i _mm512_or_epi64 (__m512i, __m512i);
__m512i _mm512_mask_or_epi64 (__m512i, __mmask8, __m512i, __m512i);

Exceptions
Real-Address Mode and Virtual-8086
#UD               Instruction not available in these modes

Protected and Compatibility Mode
#UD               Instruction not available in these modes

64 bit Mode
#SS(0)            If a memory address referencing the SS segment is
                  in a non-canonical form.
#GP(0)            If the memory address is in a non-canonical form.
                  If a memory operand linear address is not aligned
                  to the data size granularity dictated by SwizzUpConv
                  mode.
#PF(fault-code)   For a page fault.
#NM               If CR0.TS[bit 3]=1.
#UD               If preceded by any REX, F0, F2, F3, or 66 prefixes.

VPSBBD - Subtract Int32 Vectors with Borrow

Opcode:      MVEX.NDS.512.66.0F38.W0 5E /r
Instruction: vpsbbd zmm1 {k1}, k2, Si32(zmm3/mt)
Description: Subtract int32 vector Si32(zmm3/mt) and vector mask register k2 from
             int32 vector zmm1 and store the result in zmm1, and the borrow of the
             subtraction in k2, under write-mask.

Description
Performs an element-by-element three-input subtraction of the int32 vector result of the
swizzle/broadcast/conversion process on memory or int32 vector zmm3, as well as the
corresponding bit of k2, from int32 vector zmm1. The result is written into int32 vector
zmm1.
In addition, the borrow from the subtraction difference for the n-th element is written
into the n-th bit of vector mask k2.
This instruction is write-masked, so only those elements with the corresponding bit set in
vector mask register k1 are computed and stored into zmm1 and k2. Elements in zmm1
and k2 with the corresponding bit clear in k1 retain their previous value.

Operation
if(source is a register operand and MVEX.EH bit is 1) {
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    tmpSrc3[511:0] = SwizzUpConvLoadi32(zmm3/mt)
}
for (n = 0; n < 16; n++) {
    if(k1[n] != 0) {
        i = 32*n
        // integer operation
        tmpBorrow = Borrow(zmm1[i+31:i] - k2[n] - tmpSrc3[i+31:i])
        zmm1[i+31:i] = zmm1[i+31:i] - k2[n] - tmpSrc3[i+31:i]
        k2[n] = tmpBorrow
    }
}


Flags Affected
None.
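
The three-input subtraction and its borrow output can be sketched in scalar C as below. This assumes that Borrow() denotes the unsigned borrow out of bit 31, which the text does not spell out; the function name `sbb_epi32_model` is hypothetical, and the second source is assumed to be already swizzled or up-converted.

```c
#include <stdint.h>

/* Hypothetical scalar model: v1 = v1 - k2 - v3 per lane, with borrow
 * in and borrow out carried in k2. Masked-off lanes and their k2 bits
 * are left untouched. */
void sbb_epi32_model(uint32_t v1[16], uint16_t k1, uint16_t *k2,
                     const uint32_t v3[16])
{
    uint16_t borrow_out = *k2;
    for (int n = 0; n < 16; n++) {
        if ((k1 >> n) & 1) {
            uint64_t bin = (uint64_t)((*k2 >> n) & 1);
            /* Computed in 64 bits: a negative true result wraps and
             * leaves bit 32 set, which signals the borrow. */
            uint64_t diff = (uint64_t)v1[n] - bin - (uint64_t)v3[n];
            uint16_t bit = (uint16_t)(1u << n);
            if ((diff >> 32) & 1) {
                borrow_out |= bit;
            } else {
                borrow_out &= (uint16_t)~bit;
            }
            v1[n] = (uint32_t)diff;
        }
    }
    *k2 = borrow_out;
}
```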

Memory Up-conversion: Si32
S2 S1 S0   Function                     Usage                      disp8*N
000        no conversion                [rax] {16to16} or [rax]    64
001        broadcast 1 element (x16)    [rax] {1to16}              4
010        broadcast 4 elements (x4)    [rax] {4to16}              16
011        reserved                     N/A                        N/A
100        uint8 to uint32              [rax] {uint8}              16
101        sint8 to sint32              [rax] {sint8}              16
110        uint16 to uint32             [rax] {uint16}             32
111        sint16 to sint32             [rax] {sint16}             32

Register Swizzle: Si32
MVEX.EH=0
S2 S1 S0   Function: 4 x 32 bits        Usage
000        no swizzle                   zmm0 or zmm0 {dcba}
001        swap (inner) pairs           zmm0 {cdab}
010        swap with two-away           zmm0 {badc}
011        cross-product swizzle        zmm0 {dacb}
100        broadcast a element          zmm0 {aaaa}
101        broadcast b element          zmm0 {bbbb}
110        broadcast c element          zmm0 {cccc}
111        broadcast d element          zmm0 {dddd}

Intel® C/C++ Compiler Intrinsic Equivalent
__m512i _mm512_sbb_epi32 (__m512i, __mmask16, __m512i, __mmask16*);
__m512i _mm512_mask_sbb_epi32 (__m512i, __mmask16, __mmask16, __m512i, __mmask16*);

Exceptions
Real-Address Mode and Virtual-8086
#UD               Instruction not available in these modes

Protected and Compatibility Mode
#UD               Instruction not available in these modes

64 bit Mode
#SS(0)            If a memory address referencing the SS segment is
                  in a non-canonical form.
#GP(0)            If the memory address is in a non-canonical form.
                  If a memory operand linear address is not aligned
                  to the data size granularity dictated by SwizzUpConv
                  mode.
#PF(fault-code)   For a page fault.
#NM               If CR0.TS[bit 3]=1.
#UD               If processor model does not implement the specific instruction.
                  If preceded by any REX, F0, F2, F3, or 66 prefixes.

VPSBBRD - Reverse Subtract Int32 Vectors with Borrow

Opcode:      MVEX.NDS.512.66.0F38.W0 6E /r
Instruction: vpsbbrd zmm1 {k1}, k2, Si32(zmm3/mt)
Description: Subtract int32 vector zmm1 and vector mask register k2 from int32
             vector Si32(zmm3/mt), and store the result in zmm1, and the borrow
             of the subtraction in k2, under write-mask.

Description
Performs an element-by-element three-input subtraction of int32 vector zmm1, as well as
the corresponding bit of k2, from the int32 vector result of the swizzle/broadcast/conversion
process on memory or int32 vector zmm3. The result is written into int32 vector zmm1.
In addition, the borrow from the subtraction for the n-th element is written into the n-th
bit of vector mask k2.
This instruction is write-masked, so only those elements with the corresponding bit set in
vector mask register k1 are computed and stored into zmm1 and k2. Elements in zmm1
and k2 with the corresponding bit clear in k1 retain their previous value.

Operation
if(source is a register operand and MVEX.EH bit is 1) {
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    tmpSrc3[511:0] = SwizzUpConvLoadi32(zmm3/mt)
}
for (n = 0; n < 16; n++) {
    if(k1[n] != 0) {
        i = 32*n
        // integer operation
        tmpBorrow = Borrow(tmpSrc3[i+31:i] - k2[n] - zmm1[i+31:i])
        zmm1[i+31:i] = tmpSrc3[i+31:i] - k2[n] - zmm1[i+31:i]
        k2[n] = tmpBorrow
    }
}

Flags Affected
None.
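
For a single element, the reversed operand order can be checked with a small helper. The sketch assumes Borrow() means the unsigned borrow out of bit 31 (an interpretation, not stated in the text) and uses the hypothetical name `sbbr_u32`; `v1` plays the role of the destination element and `v3` the swizzled or up-converted source element.

```c
#include <stdint.h>

/* Returns v3 - borrow_in - v1 (mod 2^32) and reports the borrow out. */
uint32_t sbbr_u32(uint32_t v1, uint32_t v3, unsigned borrow_in,
                  unsigned *borrow_out)
{
    /* A borrow occurs exactly when the minuend v3 is smaller than the
     * total amount subtracted from it. */
    *borrow_out = ((uint64_t)v3 < (uint64_t)v1 + borrow_in) ? 1u : 0u;
    return v3 - v1 - (uint32_t)borrow_in;
}
```

Here the borrow is derived by comparison rather than from a widened difference; both formulations give the same bit.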

Memory Up-conversion: Si32
S2 S1 S0   Function                     Usage                      disp8*N
000        no conversion                [rax] {16to16} or [rax]    64
001        broadcast 1 element (x16)    [rax] {1to16}              4
010        broadcast 4 elements (x4)    [rax] {4to16}              16
011        reserved                     N/A                        N/A
100        uint8 to uint32              [rax] {uint8}              16
101        sint8 to sint32              [rax] {sint8}              16
110        uint16 to uint32             [rax] {uint16}             32
111        sint16 to sint32             [rax] {sint16}             32

Register Swizzle: Si32
MVEX.EH=0
S2 S1 S0   Function: 4 x 32 bits        Usage
000        no swizzle                   zmm0 or zmm0 {dcba}
001        swap (inner) pairs           zmm0 {cdab}
010        swap with two-away           zmm0 {badc}
011        cross-product swizzle        zmm0 {dacb}
100        broadcast a element          zmm0 {aaaa}
101        broadcast b element          zmm0 {bbbb}
110        broadcast c element          zmm0 {cccc}
111        broadcast d element          zmm0 {dddd}

Intel® C/C++ Compiler Intrinsic Equivalent
__m512i _mm512_sbbr_epi32 (__m512i, __mmask16, __m512i, __mmask16*);
__m512i _mm512_mask_sbbr_epi32 (__m512i, __mmask16, __mmask16, __m512i, __mmask16*);

Exceptions
Real-Address Mode and Virtual-8086
#UD               Instruction not available in these modes

Protected and Compatibility Mode
#UD               Instruction not available in these modes

64 bit Mode
#SS(0)            If a memory address referencing the SS segment is
                  in a non-canonical form.
#GP(0)            If the memory address is in a non-canonical form.
                  If a memory operand linear address is not aligned
                  to the data size granularity dictated by SwizzUpConv
                  mode.
#PF(fault-code)   For a page fault.
#NM               If CR0.TS[bit 3]=1.
#UD               If processor model does not implement the specific instruction.
                  If preceded by any REX, F0, F2, F3, or 66 prefixes.

VPSCATTERDD - Scatter Int32 Vector With Signed Dword Indices

Opcode:      MVEX.512.66.0F38.W0 A0 /r /vsib
Instruction: vpscatterdd mvt {k1}, Di32(zmm1)
Description: Scatter int32 vector Di32(zmm1) to vector memory locations mvt using
             doubleword indices and k1 as completion mask.

Description
Down-converts and stores all 16 elements in int32 vector zmm1 to the memory locations
pointed to by base address BASE_ADDR and doubleword index vector VINDEX, with
scale SCALE.
Note the special mask behavior: only a subset of the active elements of write mask k1
is actually operated on (as denoted by function SELECT_SUBSET). There are only
two guarantees about the function: (a) the destination mask is a subset of the source mask
(identity is included), and (b) on a given invocation of the instruction, at least one element
(the least significant enabled mask bit) will be selected from the source mask.
Programmers should always re-execute a gather/scatter instruction (via a loop) until
the sequence is complete, i.e. until all elements of the gather/scatter sequence have
been loaded/stored and hence all write-mask bits are zero.
Writes to overlapping destination memory locations are guaranteed to be ordered with
respect to each other (from LSB to MSB of the source registers). Only writes to overlapping
vector indices are guaranteed to be ordered with respect to each other (from LSB to MSB
of the source registers). Writes that are not overlapped may happen in any order. Memory
ordering with other instructions follows the Intel-64 memory ordering model. Note that
this does not account for non-overlapping indices that map into the same physical address
locations.
This instruction has special disp8*N and alignment rules. N is considered to be the size
of a single vector element after down-conversion.
Note also the special mask behavior as the corresponding bits in write mask k1 are reset
with each destination element being updated according to the subset of write mask k1.
This is useful to allow conditional re-trigger of the instruction until all the elements from
a given write mask have been successfully stored.

Operation
// instruction works over a subset of the write mask
ktemp = SELECT_SUBSET(k1)
// Use mvt as vector memory operand (VSIB)
for (n = 0; n < 16; n++) {
    if (ktemp[n] != 0) {
        i = 32*n
        // mvt[n] = BASE_ADDR + SignExtend(VINDEX[i+31:i] * SCALE)
        pointer[63:0] = mvt[n]
        tmp = DownConvStorei32(zmm1[i+31:i], SSS[2:0])
        if(DownConvStoreSizeOfi32(SSS[2:0]) == 4) {
            MemStore(pointer) = tmp[31:0]
        } else if(DownConvStoreSizeOfi32(SSS[2:0]) == 2) {
            MemStore(pointer) = tmp[15:0]
        } else if(DownConvStoreSizeOfi32(SSS[2:0]) == 1) {
            MemStore(pointer) = tmp[7:0]
        }
        k1[n] = 0
    }
}

Flags Affected
None.
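
The completion-mask protocol described for this instruction can be sketched in scalar C. Everything below is illustrative: `scatter_dd_step` and `scatter_dd_all` are hypothetical names, SELECT_SUBSET is modeled (as an assumption) by the least progress the text guarantees, namely the lowest set mask bit only, and no down-conversion is applied, so every store is 4 bytes.

```c
#include <stdint.h>
#include <string.h>

/* One invocation. Returns the updated completion mask k1. */
uint16_t scatter_dd_step(uint8_t *base, uint16_t k1,
                         const int32_t vindex[16], int scale,
                         const uint32_t src[16])
{
    unsigned k = k1;
    unsigned ktemp = k & (~k + 1u);          /* isolate lowest set bit */
    for (int n = 0; n < 16; n++) {
        if ((ktemp >> n) & 1u) {
            /* BASE_ADDR + sign-extended index * SCALE */
            uint8_t *p = base + (int64_t)vindex[n] * scale;
            memcpy(p, &src[n], sizeof src[n]);
            k &= ~(1u << n);                 /* k1[n] = 0 */
        }
    }
    return (uint16_t)k;
}

/* Re-issue until the mask is empty; returns the invocation count. */
int scatter_dd_all(uint8_t *base, uint16_t k1,
                   const int32_t vindex[16], int scale,
                   const uint32_t src[16])
{
    int invocations = 0;
    while (k1 != 0) {
        k1 = scatter_dd_step(base, k1, vindex, scale, src);
        invocations++;
    }
    return invocations;
}
```

When two lanes carry the same index, the higher-numbered lane is stored later in this model, which is consistent with the LSB-to-MSB ordering stated for overlapping writes.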

Memory Down-conversion: Di32
S2 S1 S0   Function             Usage             disp8*N
000        no conversion        zmm1              4
001        reserved             N/A               N/A
010        reserved             N/A               N/A
011        reserved             N/A               N/A
100        uint32 to uint8      zmm1 {uint8}      1
101        sint32 to sint8      zmm1 {sint8}      1
110        uint32 to uint16     zmm1 {uint16}     2
111        sint32 to sint16     zmm1 {sint16}     2

Intel® C/C++ Compiler Intrinsic Equivalent
void _mm512_i32scatter_epi32 (void*, __m512i, __m512i, int);
void _mm512_mask_i32scatter_epi32 (void*, __mmask16, __m512i, __m512i, int);

Exceptions
Real-Address Mode and Virtual-8086
#UD               Instruction not available in these modes

Protected and Compatibility Mode
#UD               Instruction not available in these modes

64 bit Mode
#SS(0)            If a memory address referencing the SS segment is
                  in a non-canonical form, and corresponding write-mask bit is not zero.
#GP(0)            If a memory address is in a non-canonical form,
                  and corresponding write-mask bit is not zero.
                  If a memory operand linear address is not aligned
                  to element-wise data granularity dictated by the DownConv
                  mode, and corresponding write-mask bit is not zero.
#PF(fault-code)   If a memory operand linear address produces a page fault
                  and corresponding write-mask bit is not zero.
#NM               If CR0.TS[bit 3]=1.
#UD               If preceded by any REX, F0, F2, F3, or 66 prefixes.
                  If using a 16 bit effective address.
                  If ModRM.rm is different than 100b.
                  If no write mask is provided or selected write-mask is k0.

VPSCATTERDQ - Scatter Int64 Vector With Signed Dword Indices

Opcode:      MVEX.512.66.0F38.W1 A0 /r /vsib
Instruction: vpscatterdq mvt {k1}, Di64(zmm1)
Description: Scatter int64 vector Di64(zmm1) to vector memory locations mvt using
             doubleword indices and k1 as completion mask.

Description
Down-converts and stores all 8 elements in int64 vector zmm1 to the memory locations
pointed to by base address BASE_ADDR and doubleword index vector VINDEX, with
scale SCALE.
Note the special mask behavior: only a subset of the active elements of write mask k1
is actually operated on (as denoted by function SELECT_SUBSET). There are only
two guarantees about the function: (a) the destination mask is a subset of the source mask
(identity is included), and (b) on a given invocation of the instruction, at least one element
(the least significant enabled mask bit) will be selected from the source mask.
Programmers should always re-execute a gather/scatter instruction (via a loop) until
the sequence is complete, i.e. until all elements of the gather/scatter sequence have
been loaded/stored and hence all write-mask bits are zero.
Writes to overlapping destination memory locations are guaranteed to be ordered with
respect to each other (from LSB to MSB of the source registers). Only writes to overlapping
vector indices are guaranteed to be ordered with respect to each other (from LSB to MSB
of the source registers). Writes that are not overlapped may happen in any order. Memory
ordering with other instructions follows the Intel-64 memory ordering model. Note that
this does not account for non-overlapping indices that map into the same physical address
locations.
This instruction has special disp8*N and alignment rules. N is considered to be the size
of a single vector element after down-conversion.
Note also the special mask behavior as the corresponding bits in write mask k1 are reset
with each destination element being updated according to the subset of write mask k1.
This is useful to allow conditional re-trigger of the instruction until all the elements from
a given write mask have been successfully stored.

Operation
// instruction works over a subset of the write mask
ktemp = SELECT_SUBSET(k1)
// Use mvt as vector memory operand (VSIB)
for (n = 0; n < 8; n++) {
    if (ktemp[n] != 0) {
        i = 64*n
        j = 32*n
        // mvt[n] = BASE_ADDR + SignExtend(VINDEX[j+31:j] * SCALE)
        pointer[63:0] = mvt[n]
        tmp = DownConvStorei64(zmm1[i+63:i], SSS[2:0])
        if(DownConvStoreSizeOfi64(SSS[2:0]) == 8) {
            MemStore(pointer) = tmp[63:0]
        }
        k1[n] = 0
    }
}
k1[15:8] = 0

Flags Affected
None.
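
The 64-bit variant differs from the 32-bit scatter in two visible ways: only the low eight doubleword indices are used, and the upper eight mask bits are cleared on every invocation. The sketch below shows both. It is a scalar approximation; `scatter_dq_step` is a hypothetical name, and the `pick` argument is a hypothetical stand-in for SELECT_SUBSET that lets the caller name which active lanes one invocation completes.

```c
#include <stdint.h>
#include <string.h>

/* One invocation. Returns the updated completion mask k1. */
uint16_t scatter_dq_step(uint8_t *base, uint16_t k1, uint8_t pick,
                         const int32_t vindex[16], int scale,
                         const uint64_t src[8])
{
    unsigned ktemp = (unsigned)k1 & pick;  /* subset of the low 8 bits */
    unsigned k = k1;
    for (int n = 0; n < 8; n++) {
        if ((ktemp >> n) & 1u) {
            /* Index n is the n-th doubleword of the index vector. */
            uint8_t *p = base + (int64_t)vindex[n] * scale;
            memcpy(p, &src[n], sizeof src[n]);   /* 8-byte store */
            k &= ~(1u << n);
        }
    }
    return (uint16_t)(k & 0x00FFu);        /* k1[15:8] = 0 */
}
```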

Memory Down-conversion: Di64
S2 S1 S0   Function             Usage             disp8*N
000        no conversion        zmm1              8
001        reserved             N/A               N/A
010        reserved             N/A               N/A
011        reserved             N/A               N/A
100        reserved             N/A               N/A
101        reserved             N/A               N/A
110        reserved             N/A               N/A
111        reserved             N/A               N/A

Intel® C/C++ Compiler Intrinsic Equivalent
void _mm512_i32loscatter_epi64 (void*, __m512i, __m512i, int);
void _mm512_mask_i32loscatter_epi64 (void*, __mmask8, __m512i, __m512i, int);

Exceptions
Real-Address Mode and Virtual-8086
#UD               Instruction not available in these modes

Protected and Compatibility Mode
#UD               Instruction not available in these modes

64 bit Mode
#SS(0)            If a memory address referencing the SS segment is
                  in a non-canonical form, and corresponding write-mask bit is not zero.
#GP(0)            If a memory address is in a non-canonical form,
                  and corresponding write-mask bit is not zero.
                  If a memory operand linear address is not aligned
                  to element-wise data granularity dictated by the DownConv
                  mode, and corresponding write-mask bit is not zero.
#PF(fault-code)   If a memory operand linear address produces a page fault
                  and corresponding write-mask bit is not zero.
#NM               If CR0.TS[bit 3]=1.
#UD               If preceded by any REX, F0, F2, F3, or 66 prefixes.
                  If using a 16 bit effective address.
                  If ModRM.rm is different than 100b.
                  If no write mask is provided or selected write-mask is k0.

VPSHUFD - Shufe Vector Doublewords

Opcode:      MVEX.512.66.0F.W0 70 /r ib
Instruction: vpshufd zmm1 {k1}, zmm2/mt, imm8
Description: Dword shuffle int32 vector zmm2/mt and store the result in zmm1,
             using imm8, under write-mask.

Description
Shuffles 32 bit blocks of the vector read from memory or vector zmm2/mem using index
bits in immediate. The result of the shuffle is written into vector zmm1.
No swizzle, broadcast, or conversion is performed by this instruction.
This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.

Immediate Format
imm8:   32 bit level permutation vector {dcba}
        I7 I6 I5 I4 I3 I2 I1 I0

Operation
src[511:0] = zmm2/mt

// Intra-lane shuffle
// (perm32 is taken here to denote the permutation vector carried in imm8)
for (n = 0; n < 16; n++) {
    if (k1[n] != 0) {
        i = 32*n
        // offset within 128-bit chunk
        j = 32*((perm32 >> 2*(n & 0x3)) & 0x3)
        // 128-bit level offset
        j = j + 128*(n >> 2)
        zmm1[i+31:i] = src[j+31:j]
    }
}


Flags Affected
None.
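
The index arithmetic of the intra-lane shuffle is easier to follow with concrete values. The scalar sketch below assumes that the permutation vector is the 8-bit immediate and that source and destination arrays do not alias; `shuffle_epi32_model` is a hypothetical name.

```c
#include <stdint.h>

/* Hypothetical scalar model of the write-masked dword shuffle.
 * dst and src must not overlap. */
void shuffle_epi32_model(uint32_t dst[16], uint16_t k1,
                         const uint32_t src[16], uint8_t imm8)
{
    for (int n = 0; n < 16; n++) {
        if ((k1 >> n) & 1) {
            int sel = (imm8 >> (2 * (n & 3))) & 3; /* 2-bit selector for this slot */
            int chunk = n >> 2;                    /* which 128-bit chunk */
            dst[n] = src[4 * chunk + sel];
        }
    }
}
```

With imm8 = 0x1B the selectors for slots 0 to 3 are 3, 2, 1, 0, so each 128-bit chunk is reversed independently and no element crosses a chunk boundary.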

Intel® C/C++ Compiler Intrinsic Equivalent
__m512i _mm512_shuffle_epi32 (__m512i, _MM_PERM_ENUM);
__m512i _mm512_mask_shuffle_epi32 (__m512i, __mmask16, __m512i, _MM_PERM_ENUM);

Exceptions
Real-Address Mode and Virtual-8086
#UD               Instruction not available in these modes

Protected and Compatibility Mode
#UD               Instruction not available in these modes

64 bit Mode
#SS(0)            If a memory address referencing the SS segment is
                  in a non-canonical form.
#GP(0)            If the memory address is in a non-canonical form.
#PF(fault-code)   For a page fault.
#NM               If CR0.TS[bit 3]=1.
#UD               If preceded by any REX, F0, F2, F3, or 66 prefixes.
                  This instruction does not support any
                  SwizzUpConv different from the default value (no broadcast,
                  no conversion). If SwizzUpConv function is set to any value
                  different than "no action", then an Invalid Opcode fault is
                  raised. This includes register swizzles.

VPSLLD - Shift Int32 Vector Immediate Left Logical

Opcode:      MVEX.NDD.512.66.0F.W0 72 /6 ib
Instruction: vpslld zmm1 {k1}, Si32(zmm2/mt), imm8
Description: Shift left int32 vector Si32(zmm2/mt) and store the result in zmm1,
             using imm8, under write-mask.

Description
Performs an element-by-element logical left shift of the result of the
swizzle/broadcast/conversion process on memory or vector int32 zmm2, shifting by the
number of bits specified in the immediate field. The result is stored in int32 vector zmm1.
If the value specified by the shift operand is greater than 31 then the destination operand
is set to all 0s.
This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.

Operation
if(source is a register operand and MVEX.EH bit is 1) {
    tmpSrc2[511:0] = zmm2[511:0]
} else {
    tmpSrc2[511:0] = SwizzUpConvLoadi32(zmm2/mt)
}
for (n = 0; n < 16; n++) {
    if(k1[n] != 0) {
        i = 32*n
        // integer operation
        zmm1[i+31:i] = tmpSrc2[i+31:i] << IMM8[7:0]
    }
}

Flags Affected
None.
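
The rule that a count above 31 produces zero needs an explicit branch when modeled in C, because C leaves a shift of a 32-bit value by 32 or more undefined. A minimal sketch under that reading follows; `slli_epi32_model` is a hypothetical name and the source is assumed to be already swizzled or up-converted.

```c
#include <stdint.h>

/* Hypothetical scalar model of the masked immediate left shift. */
void slli_epi32_model(uint32_t dst[16], uint16_t k1,
                      const uint32_t src[16], uint8_t imm8)
{
    for (int n = 0; n < 16; n++) {
        if ((k1 >> n) & 1) {
            /* Counts of 32..255 clear the element instead of shifting. */
            dst[n] = (imm8 > 31) ? 0u : (src[n] << imm8);
        }
    }
}
```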


Memory Up-conversion: Si32
S2 S1 S0   Function                     Usage                      disp8*N
000        no conversion                [rax] {16to16} or [rax]    64
001        broadcast 1 element (x16)    [rax] {1to16}              4
010        broadcast 4 elements (x4)    [rax] {4to16}              16
011        reserved                     N/A                        N/A
100        uint8 to uint32              [rax] {uint8}              16
101        sint8 to sint32              [rax] {sint8}              16
110        uint16 to uint32             [rax] {uint16}             32
111        sint16 to sint32             [rax] {sint16}             32

Register Swizzle: Si32
MVEX.EH=0
S2 S1 S0   Function: 4 x 32 bits        Usage
000        no swizzle                   zmm0 or zmm0 {dcba}
001        swap (inner) pairs           zmm0 {cdab}
010        swap with two-away           zmm0 {badc}
011        cross-product swizzle        zmm0 {dacb}
100        broadcast a element          zmm0 {aaaa}
101        broadcast b element          zmm0 {bbbb}
110        broadcast c element          zmm0 {cccc}
111        broadcast d element          zmm0 {dddd}

Intel® C/C++ Compiler Intrinsic Equivalent
__m512i _mm512_slli_epi32 (__m512i, unsigned int);
__m512i _mm512_mask_slli_epi32 (__m512i, __mmask16, __m512i, unsigned int);

Exceptions
Real-Address Mode and Virtual-8086
#UD               Instruction not available in these modes

Protected and Compatibility Mode
#UD               Instruction not available in these modes

64 bit Mode
#SS(0)            If a memory address referencing the SS segment is
                  in a non-canonical form.
#GP(0)            If the memory address is in a non-canonical form.
                  If a memory operand linear address is not aligned
                  to the data size granularity dictated by SwizzUpConv
                  mode.
#PF(fault-code)   For a page fault.
#NM               If CR0.TS[bit 3]=1.
#UD               If preceded by any REX, F0, F2, F3, or 66 prefixes.

VPSLLVD - Shift Int32 Vector Left Logical

Opcode:      MVEX.NDS.512.66.0F38.W0 47 /r
Instruction: vpsllvd zmm1 {k1}, zmm2, Si32(zmm3/mt)
Description: Shift left int32 vector zmm2 by the per-element counts in int32 vector
             Si32(zmm3/mt) and store the result in zmm1, under write-mask.

Description
Performs an element-by-element left shift of int32 vector zmm2, shifting by the number
of bits specified by the int32 vector result of the swizzle/broadcast/conversion process
on memory or vector int32 zmm3. The result is stored in int32 vector zmm1.
If the value specified by the shift operand for an element is greater than 31 then the
corresponding destination element is set to all 0s.
This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.

Operation
if (source is a register operand and MVEX.EH bit is 1) {
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    tmpSrc3[511:0] = SwizzUpConvLoadi32(zmm3/mt)
}
for (n = 0; n < 16; n++) {
    if (k1[n] != 0) {
        i = 32*n
        // logical shift; counts greater than 31 yield 0 (see Description)
        zmm1[i+31:i] = zmm2[i+31:i] << tmpSrc3[i+31:i]
    }
}
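
The operation above can be sketched as a scalar C reference model. This is only an illustration under stated assumptions: sllv_epi32_model is a hypothetical helper, not an Intel intrinsic, and the swizzle/up-conversion step is assumed to have already produced the count vector.

```c
#include <stdint.h>

/* Masked per-element logical left shift (hypothetical reference model). */
void sllv_epi32_model(uint32_t dst[16], uint16_t k1,
                      const uint32_t src[16], const uint32_t count[16])
{
    for (int n = 0; n < 16; n++) {
        if (!((k1 >> n) & 1))
            continue;  /* mask bit clear: dst[n] keeps its previous value */
        /* C leaves shifts by 32 or more undefined, so the "greater than 31"
         * case from the description is handled explicitly. */
        dst[n] = (count[n] > 31) ? 0u : (src[n] << count[n]);
    }
}
```

The explicit range check matters in C: writing src[n] << count[n] alone would be undefined behavior for counts of 32 or more, whereas the description defines the result as 0.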

Flags Affected
None.

Memory Up-conversion: Si32
S2 S1 S0   Function:                    Usage                     disp8*N
000        no conversion                [rax] {16to16} or [rax]   64
001        broadcast 1 element (x16)    [rax] {1to16}             4
010        broadcast 4 elements (x4)    [rax] {4to16}             16
011        reserved                     N/A                       N/A
100        uint8 to uint32              [rax] {uint8}             16
101        sint8 to sint32              [rax] {sint8}             16
110        uint16 to uint32             [rax] {uint16}            32
111        sint16 to sint32             [rax] {sint16}            32
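
To make the Si32 memory up-conversion modes more concrete, here is a minimal C sketch of what each mode loads into a 16-element int32 vector. All names (upconv_load_i32, upconv_bytes, the UC_* enumerators) are hypothetical. The sketch assumes a little-endian host, and it relies on the observation that the disp8*N value for each mode coincides with the number of bytes that mode reads (16 x 4 = 64, 1 x 4 = 4, 4 x 4 = 16, 16 x 1 = 16, 16 x 2 = 32).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Enumerator values follow the S2 S1 S0 encodings; 3 is reserved. */
enum upconv { UC_NONE = 0, UC_1TO16 = 1, UC_4TO16 = 2,
              UC_UINT8 = 4, UC_SINT8 = 5, UC_UINT16 = 6, UC_SINT16 = 7 };

/* Bytes read from memory per mode (matches the disp8*N column); 0 if reserved. */
size_t upconv_bytes(enum upconv m)
{
    switch (m) {
    case UC_NONE:   return 64;
    case UC_1TO16:  return 4;
    case UC_4TO16:  return 16;
    case UC_UINT8:
    case UC_SINT8:  return 16;
    case UC_UINT16:
    case UC_SINT16: return 32;
    default:        return 0;
    }
}

void upconv_load_i32(uint32_t dst[16], const void *mem, enum upconv m)
{
    const uint8_t *p = mem;
    for (int n = 0; n < 16; n++) {
        uint16_t h;
        uint32_t w;
        switch (m) {
        case UC_NONE:   memcpy(&w, p + 4 * n, 4);       dst[n] = w; break;
        case UC_1TO16:  memcpy(&w, p, 4);               dst[n] = w; break;
        case UC_4TO16:  memcpy(&w, p + 4 * (n % 4), 4); dst[n] = w; break;
        case UC_UINT8:  dst[n] = p[n]; break;                      /* zero-extend */
        case UC_SINT8:  dst[n] = (uint32_t)(int32_t)(int8_t)p[n]; break; /* sign-extend */
        case UC_UINT16: memcpy(&h, p + 2 * n, 2); dst[n] = h; break;
        case UC_SINT16: memcpy(&h, p + 2 * n, 2);
                        dst[n] = (uint32_t)(int32_t)(int16_t)h; break;
        default:        break;  /* reserved encoding: dst left unchanged in this sketch */
        }
    }
}
```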

Register Swizzle: Si32
MVEX.EH=0
S2 S1 S0   Function: 4 x 32 bits    Usage
000        no swizzle               zmm0 or zmm0 {dcba}
001        swap (inner) pairs       zmm0 {cdab}
010        swap with two-away       zmm0 {badc}
011        cross-product swizzle    zmm0 {dacb}
100        broadcast a element      zmm0 {aaaa}
101        broadcast b element      zmm0 {bbbb}
110        broadcast c element      zmm0 {cccc}
111        broadcast d element      zmm0 {dddd}

Intel® C/C++ Compiler Intrinsic Equivalent
__m512i _mm512_sllv_epi32 (__m512i, __m512i);
__m512i _mm512_mask_sllv_epi32 (__m512i, __mmask16, __m512i, __m512i);

Exceptions
Real-Address Mode and Virtual-8086
#UD               Instruction not available in these modes

Protected and Compatibility Mode
#UD               Instruction not available in these modes

64 bit Mode
#SS(0)            If a memory address referencing the SS segment is in a non-canonical form.
#GP(0)            If the memory address is in a non-canonical form.
                  If a memory operand linear address is not aligned to the data size
                  granularity dictated by SwizzUpConv mode.
#PF(fault-code)   For a page fault.
#NM               If CR0.TS[bit 3]=1.
                  If preceded by any REX, F0, F2, F3, or 66 prefixes.

VPSRAD - Shift Int32 Vector Immediate Right Arithmetic

Opcode        MVEX.NDD.512.66.0F.W0 72 /4 ib
Instruction   vpsrad zmm1 {k1}, Si32(zmm2/mt), imm8
Description   Shift right arithmetic int32 vector Si32(zmm2/mt) and store the result
              in zmm1, using imm8, under write-mask.

Description
Performs an element-by-element arithmetic right shift of the result of the
swizzle/broadcast/conversion process on memory or vector int32 zmm2, shifting by the
number of bits specified in the immediate field. The result is stored in int32 vector zmm1.
An arithmetic right shift leaves the sign bit unchanged after each shift count, so the final
result has the i+1 msbs set to the original sign bit, where i is the number of bits by which
to shift right.
If the value specified by the shift operand is greater than 31, each destination data element
is filled with the initial value of the sign bit of the element.
This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.

Operation
if (source is a register operand and MVEX.EH bit is 1) {
    tmpSrc2[511:0] = zmm2[511:0]
} else {
    tmpSrc2[511:0] = SwizzUpConvLoadi32(zmm2/mt)
}
for (n = 0; n < 16; n++) {
    if (k1[n] != 0) {
        i = 32*n
        // signed integer operation
        zmm1[i+31:i] = tmpSrc2[i+31:i] >> IMM8[7:0]
    }
}
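
A scalar C sketch of the behavior described above, including the "greater than 31" case, might look as follows. The helper names sra32 and srai_epi32_model are hypothetical, and the source vector is assumed to be the already swizzled or up-converted operand.

```c
#include <stdint.h>

/* Arithmetic right shift of one element, written so that it does not rely on
 * the implementation-defined result of >> applied to a negative value. */
static int32_t sra32(int32_t x, unsigned count)
{
    if (count > 31)
        count = 31;  /* every result bit becomes a copy of the sign bit */
    return (x < 0) ? ~(~x >> count) : (x >> count);
}

/* Masked arithmetic right shift by an immediate (hypothetical reference model). */
void srai_epi32_model(int32_t dst[16], uint16_t k1,
                      const int32_t src[16], uint8_t imm8)
{
    for (int n = 0; n < 16; n++)
        if ((k1 >> n) & 1)  /* elements with a clear mask bit are left untouched */
            dst[n] = sra32(src[n], imm8);
}
```

Clamping the count to 31 gives the same result as filling the element with its sign bit, which is why no separate branch is needed for large counts.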

Flags Affected
None.

Memory Up-conversion: Si32
S2 S1 S0   Function:                    Usage                     disp8*N
000        no conversion                [rax] {16to16} or [rax]   64
001        broadcast 1 element (x16)    [rax] {1to16}             4
010        broadcast 4 elements (x4)    [rax] {4to16}             16
011        reserved                     N/A                       N/A
100        uint8 to uint32              [rax] {uint8}             16
101        sint8 to sint32              [rax] {sint8}             16
110        uint16 to uint32             [rax] {uint16}            32
111        sint16 to sint32             [rax] {sint16}            32

Register Swizzle: Si32
MVEX.EH=0
S2 S1 S0   Function: 4 x 32 bits    Usage
000        no swizzle               zmm0 or zmm0 {dcba}
001        swap (inner) pairs       zmm0 {cdab}
010        swap with two-away       zmm0 {badc}
011        cross-product swizzle    zmm0 {dacb}
100        broadcast a element      zmm0 {aaaa}
101        broadcast b element      zmm0 {bbbb}
110        broadcast c element      zmm0 {cccc}
111        broadcast d element      zmm0 {dddd}

Intel® C/C++ Compiler Intrinsic Equivalent
__m512i _mm512_srai_epi32 (__m512i, unsigned int);
__m512i _mm512_mask_srai_epi32 (__m512i, __mmask16, __m512i, unsigned int);

Exceptions
Real-Address Mode and Virtual-8086
#UD               Instruction not available in these modes

Protected and Compatibility Mode
#UD               Instruction not available in these modes

64 bit Mode
#SS(0)            If a memory address referencing the SS segment is in a non-canonical form.
#GP(0)            If the memory address is in a non-canonical form.
                  If a memory operand linear address is not aligned to the data size
                  granularity dictated by SwizzUpConv mode.
#PF(fault-code)   For a page fault.
#NM               If CR0.TS[bit 3]=1.
                  If preceded by any REX, F0, F2, F3, or 66 prefixes.

VPSRAVD - Shift Int32 Vector Right Arithmetic

Opcode        MVEX.NDS.512.66.0F38.W0 46 /r
Instruction   vpsravd zmm1 {k1}, zmm2, Si32(zmm3/mt)
Description   Shift right arithmetic int32 vector zmm2 by the per-element counts in
              int32 vector Si32(zmm3/mt) and store the result in zmm1, under write-mask.

Description
Performs an element-by-element arithmetic right shift of int32 vector zmm2, shifting by
the number of bits specified by the int32 vector result of the swizzle/broadcast/conversion
process on memory or vector int32 zmm3. The result is stored in int32 vector zmm1.
An arithmetic right shift leaves the sign bit unchanged after each shift count, so the final
result has the i+1 msbs set to the original sign bit, where i is the number of bits by which
to shift right.
If the value specified by the shift operand is greater than 31 each destination data element
is filled with the initial value of the sign bit of the element.
This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.

Operation
if (source is a register operand and MVEX.EH bit is 1) {
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    tmpSrc3[511:0] = SwizzUpConvLoadi32(zmm3/mt)
}
for (n = 0; n < 16; n++) {
    if (k1[n] != 0) {
        i = 32*n
        // signed integer operation
        zmm1[i+31:i] = zmm2[i+31:i] >> tmpSrc3[i+31:i]
    }
}

Flags Affected
None.

Memory Up-conversion: Si32
S2 S1 S0   Function:                    Usage                     disp8*N
000        no conversion                [rax] {16to16} or [rax]   64
001        broadcast 1 element (x16)    [rax] {1to16}             4
010        broadcast 4 elements (x4)    [rax] {4to16}             16
011        reserved                     N/A                       N/A
100        uint8 to uint32              [rax] {uint8}             16
101        sint8 to sint32              [rax] {sint8}             16
110        uint16 to uint32             [rax] {uint16}            32
111        sint16 to sint32             [rax] {sint16}            32

Register Swizzle: Si32
MVEX.EH=0
S2 S1 S0   Function: 4 x 32 bits    Usage
000        no swizzle               zmm0 or zmm0 {dcba}
001        swap (inner) pairs       zmm0 {cdab}
010        swap with two-away       zmm0 {badc}
011        cross-product swizzle    zmm0 {dacb}
100        broadcast a element      zmm0 {aaaa}
101        broadcast b element      zmm0 {bbbb}
110        broadcast c element      zmm0 {cccc}
111        broadcast d element      zmm0 {dddd}

Intel® C/C++ Compiler Intrinsic Equivalent
__m512i _mm512_srav_epi32 (__m512i, __m512i);
__m512i _mm512_mask_srav_epi32 (__m512i, __mmask16, __m512i, __m512i);

Exceptions
Real-Address Mode and Virtual-8086
#UD               Instruction not available in these modes

Protected and Compatibility Mode
#UD               Instruction not available in these modes

64 bit Mode
#SS(0)            If a memory address referencing the SS segment is in a non-canonical form.
#GP(0)            If the memory address is in a non-canonical form.
                  If a memory operand linear address is not aligned to the data size
                  granularity dictated by SwizzUpConv mode.
#PF(fault-code)   For a page fault.
#NM               If CR0.TS[bit 3]=1.
                  If preceded by any REX, F0, F2, F3, or 66 prefixes.

VPSRLD - Shift Int32 Vector Immediate Right Logical

Opcode        MVEX.NDD.512.66.0F.W0 72 /2 ib
Instruction   vpsrld zmm1 {k1}, Si32(zmm2/mt), imm8
Description   Shift right logical int32 vector Si32(zmm2/mt) and store the result
              in zmm1, using imm8, under write-mask.

Description
Performs an element-by-element logical right shift of the result of the
swizzle/broadcast/conversion process on memory or vector int32 zmm2, shifting by the
number of bits specified in the immediate field. The result is stored in int32 vector zmm1.
A logical right shift shifts a 0-bit into the msb for each shift count, so the final result has
the i msbs set to 0, where i is the number of bits by which to shift right.
If the value specified by the shift operand is greater than 31 then the destination operand
is set to all 0s.
This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.

Operation
if (source is a register operand and MVEX.EH bit is 1) {
    tmpSrc2[511:0] = zmm2[511:0]
} else {
    tmpSrc2[511:0] = SwizzUpConvLoadi32(zmm2/mt)
}
for (n = 0; n < 16; n++) {
    if (k1[n] != 0) {
        i = 32*n
        // unsigned (logical) shift: 0-bits are shifted into the msb
        zmm1[i+31:i] = tmpSrc2[i+31:i] >> IMM8[7:0]
    }
}
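
For comparison with the arithmetic form, the logical right shift can be modeled in scalar C as below. srli_epi32_model is a hypothetical name, and the source is assumed to be the already swizzled or up-converted vector.

```c
#include <stdint.h>

/* Masked logical right shift by an immediate (hypothetical reference model). */
void srli_epi32_model(uint32_t dst[16], uint16_t k1,
                      const uint32_t src[16], uint8_t imm8)
{
    for (int n = 0; n < 16; n++) {
        if (!((k1 >> n) & 1))
            continue;  /* mask bit clear: dst[n] keeps its previous value */
        /* Unsigned types make >> a logical shift in C; counts above 31
         * are handled explicitly because C leaves them undefined. */
        dst[n] = (imm8 > 31) ? 0u : (src[n] >> imm8);
    }
}
```

Using uint32_t rather than int32_t is the design choice that distinguishes this from the arithmetic variant: the vacated high bits are always 0, regardless of the original sign bit.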

Flags Affected
None.

Memory Up-conversion: Si32
S2 S1 S0   Function:                    Usage                     disp8*N
000        no conversion                [rax] {16to16} or [rax]   64
001        broadcast 1 element (x16)    [rax] {1to16}             4
010        broadcast 4 elements (x4)    [rax] {4to16}             16
011        reserved                     N/A                       N/A
100        uint8 to uint32              [rax] {uint8}             16
101        sint8 to sint32              [rax] {sint8}             16
110        uint16 to uint32             [rax] {uint16}            32
111        sint16 to sint32             [rax] {sint16}            32

Register Swizzle: Si32
MVEX.EH=0
S2 S1 S0   Function: 4 x 32 bits    Usage
000        no swizzle               zmm0 or zmm0 {dcba}
001        swap (inner) pairs       zmm0 {cdab}
010        swap with two-away       zmm0 {badc}
011        cross-product swizzle    zmm0 {dacb}
100        broadcast a element      zmm0 {aaaa}
101        broadcast b element      zmm0 {bbbb}
110        broadcast c element      zmm0 {cccc}
111        broadcast d element      zmm0 {dddd}

Intel® C/C++ Compiler Intrinsic Equivalent
__m512i _mm512_srli_epi32 (__m512i, unsigned int);
__m512i _mm512_mask_srli_epi32 (__m512i, __mmask16, __m512i, unsigned int);

Exceptions
Real-Address Mode and Virtual-8086
#UD               Instruction not available in these modes

Protected and Compatibility Mode
#UD               Instruction not available in these modes

64 bit Mode
#SS(0)            If a memory address referencing the SS segment is in a non-canonical form.
#GP(0)            If the memory address is in a non-canonical form.
                  If a memory operand linear address is not aligned to the data size
                  granularity dictated by SwizzUpConv mode.
#PF(fault-code)   For a page fault.
#NM               If CR0.TS[bit 3]=1.
                  If preceded by any REX, F0, F2, F3, or 66 prefixes.

VPSRLVD - Shift Int32 Vector Right Logical

Opcode        MVEX.NDS.512.66.0F38.W0 45 /r
Instruction   vpsrlvd zmm1 {k1}, zmm2, Si32(zmm3/mt)
Description   Shift right logical int32 vector zmm2 by the per-element counts in
              int32 vector Si32(zmm3/mt) and store the result in zmm1, under write-mask.

Description
Performs an element-by-element logical right shift of int32 vector zmm2, shifting by the
number of bits specified by the int32 vector result of the swizzle/broadcast/conversion
process on memory or vector int32 zmm3. The result is stored in int32 vector zmm1.
A logical right shift shifts a 0-bit into the msb for each shift count, so the final result has
the i msbs set to 0, where i is the number of bits by which to shift right.
If the shift count for an element is greater than 31, the corresponding destination
element is set to all 0s.
This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.

Operation
if (source is a register operand and MVEX.EH bit is 1) {
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    tmpSrc3[511:0] = SwizzUpConvLoadi32(zmm3/mt)
}
for (n = 0; n < 16; n++) {
    if (k1[n] != 0) {
        i = 32*n
        // unsigned (logical) shift: 0-bits are shifted into the msb
        zmm1[i+31:i] = zmm2[i+31:i] >> tmpSrc3[i+31:i]
    }
}

Flags Affected
None.

Memory Up-conversion: Si32
S2 S1 S0   Function:                    Usage                     disp8*N
000        no conversion                [rax] {16to16} or [rax]   64
001        broadcast 1 element (x16)    [rax] {1to16}             4
010        broadcast 4 elements (x4)    [rax] {4to16}             16
011        reserved                     N/A                       N/A
100        uint8 to uint32              [rax] {uint8}             16
101        sint8 to sint32              [rax] {sint8}             16
110        uint16 to uint32             [rax] {uint16}            32
111        sint16 to sint32             [rax] {sint16}            32

Register Swizzle: Si32
MVEX.EH=0
S2 S1 S0   Function: 4 x 32 bits    Usage
000        no swizzle               zmm0 or zmm0 {dcba}
001        swap (inner) pairs       zmm0 {cdab}
010        swap with two-away       zmm0 {badc}
011        cross-product swizzle    zmm0 {dacb}
100        broadcast a element      zmm0 {aaaa}
101        broadcast b element      zmm0 {bbbb}
110        broadcast c element      zmm0 {cccc}
111        broadcast d element      zmm0 {dddd}

Intel® C/C++ Compiler Intrinsic Equivalent
__m512i _mm512_srlv_epi32 (__m512i, __m512i);
__m512i _mm512_mask_srlv_epi32 (__m512i, __mmask16, __m512i, __m512i);

Exceptions
Real-Address Mode and Virtual-8086
#UD               Instruction not available in these modes

Protected and Compatibility Mode
#UD               Instruction not available in these modes

64 bit Mode
#SS(0)            If a memory address referencing the SS segment is in a non-canonical form.
#GP(0)            If the memory address is in a non-canonical form.
                  If a memory operand linear address is not aligned to the data size
                  granularity dictated by SwizzUpConv mode.
#PF(fault-code)   For a page fault.
#NM               If CR0.TS[bit 3]=1.
                  If preceded by any REX, F0, F2, F3, or 66 prefixes.

VPSUBD - Subtract Int32 Vectors

Opcode        MVEX.NDS.512.66.0F.W0 FA /r
Instruction   vpsubd zmm1 {k1}, zmm2, Si32(zmm3/mt)
Description   Subtract int32 vector Si32(zmm3/mt) from int32 vector zmm2 and store
              the result in zmm1, under write-mask.

Description
Performs an element-by-element subtraction from int32 vector zmm2 of the int32 vector
result of the swizzle/broadcast/conversion process on memory or int32 vector zmm3.
The result is written into int32 vector zmm1.
This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.

Operation
if (source is a register operand and MVEX.EH bit is 1) {
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    tmpSrc3[511:0] = SwizzUpConvLoadi32(zmm3/mt)
}
for (n = 0; n < 16; n++) {
    if (k1[n] != 0) {
        i = 32*n
        // integer operation
        zmm1[i+31:i] = zmm2[i+31:i] - tmpSrc3[i+31:i]
    }
}
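
The masked subtraction above can be sketched in scalar C as follows. sub_epi32_model is a hypothetical name. The sketch assumes that results wrap modulo 2^32, which this section does not state explicitly, and it uses unsigned arithmetic so that the wraparound is well defined in C.

```c
#include <stdint.h>

/* Masked element-wise subtraction dst = a - b (hypothetical reference model). */
void sub_epi32_model(uint32_t dst[16], uint16_t k1,
                     const uint32_t a[16], const uint32_t b[16])
{
    for (int n = 0; n < 16; n++) {
        if ((k1 >> n) & 1)
            dst[n] = a[n] - b[n];  /* assumed to wrap modulo 2^32 */
        /* otherwise dst[n] retains its previous value */
    }
}
```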

Flags Affected
None.

Memory Up-conversion: Si32
S2 S1 S0   Function:                    Usage                     disp8*N
000        no conversion                [rax] {16to16} or [rax]   64
001        broadcast 1 element (x16)    [rax] {1to16}             4
010        broadcast 4 elements (x4)    [rax] {4to16}             16
011        reserved                     N/A                       N/A
100        uint8 to uint32              [rax] {uint8}             16
101        sint8 to sint32              [rax] {sint8}             16
110        uint16 to uint32             [rax] {uint16}            32
111        sint16 to sint32             [rax] {sint16}            32

Register Swizzle: Si32
MVEX.EH=0
S2 S1 S0   Function: 4 x 32 bits    Usage
000        no swizzle               zmm0 or zmm0 {dcba}
001        swap (inner) pairs       zmm0 {cdab}
010        swap with two-away       zmm0 {badc}
011        cross-product swizzle    zmm0 {dacb}
100        broadcast a element      zmm0 {aaaa}
101        broadcast b element      zmm0 {bbbb}
110        broadcast c element      zmm0 {cccc}
111        broadcast d element      zmm0 {dddd}

Intel® C/C++ Compiler Intrinsic Equivalent
__m512i _mm512_sub_epi32 (__m512i, __m512i);
__m512i _mm512_mask_sub_epi32 (__m512i, __mmask16, __m512i, __m512i);

Exceptions
Real-Address Mode and Virtual-8086
#UD               Instruction not available in these modes

Protected and Compatibility Mode
#UD               Instruction not available in these modes

64 bit Mode
#SS(0)            If a memory address referencing the SS segment is in a non-canonical form.
#GP(0)            If the memory address is in a non-canonical form.
                  If a memory operand linear address is not aligned to the data size
                  granularity dictated by SwizzUpConv mode.
#PF(fault-code)   For a page fault.
#NM               If CR0.TS[bit 3]=1.
                  If preceded by any REX, F0, F2, F3, or 66 prefixes.

VPSUBRD - Reverse Subtract Int32 Vectors

Opcode        MVEX.NDS.512.66.0F38.W0 6C /r
Instruction   vpsubrd zmm1 {k1}, zmm2, Si32(zmm3/mt)
Description   Subtract int32 vector zmm2 from int32 vector Si32(zmm3/mt) and store
              the result in zmm1, under write-mask.

Description
Performs an element-by-element subtraction of int32 vector zmm2 from the int32 vector
result of the swizzle/broadcast/conversion process on memory or int32 vector zmm3.
The result is written into int32 vector zmm1.
This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.

Operation
if (source is a register operand and MVEX.EH bit is 1) {
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    tmpSrc3[511:0] = SwizzUpConvLoadi32(zmm3/mt)
}
for (n = 0; n < 16; n++) {
    if (k1[n] != 0) {
        i = 32*n
        // integer operation
        zmm1[i+31:i] = -zmm2[i+31:i] + tmpSrc3[i+31:i]
    }
}
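
A scalar C sketch of the reverse subtraction is shown below. subr_epi32_model is a hypothetical name, v2 and v3 stand for zmm2 and the swizzled or up-converted third operand, and wraparound modulo 2^32 is an assumption rather than something stated in this section.

```c
#include <stdint.h>

/* Masked reverse subtraction dst = v3 - v2 (hypothetical reference model). */
void subr_epi32_model(uint32_t dst[16], uint16_t k1,
                      const uint32_t v2[16], const uint32_t v3[16])
{
    for (int n = 0; n < 16; n++) {
        if ((k1 >> n) & 1)
            dst[n] = v3[n] - v2[n];  /* operands reversed relative to a plain subtract */
    }
}
```

Because only the last source operand accepts a memory operand or swizzle in the instruction syntax, the reversed form is presumably convenient when the minuend is the value that comes from memory, for example a constant loaded with {1to16}.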

Flags Affected
None.

Memory Up-conversion: Si32
S2 S1 S0   Function:                    Usage                     disp8*N
000        no conversion                [rax] {16to16} or [rax]   64
001        broadcast 1 element (x16)    [rax] {1to16}             4
010        broadcast 4 elements (x4)    [rax] {4to16}             16
011        reserved                     N/A                       N/A
100        uint8 to uint32              [rax] {uint8}             16
101        sint8 to sint32              [rax] {sint8}             16
110        uint16 to uint32             [rax] {uint16}            32
111        sint16 to sint32             [rax] {sint16}            32

Register Swizzle: Si32
MVEX.EH=0
S2 S1 S0   Function: 4 x 32 bits    Usage
000        no swizzle               zmm0 or zmm0 {dcba}
001        swap (inner) pairs       zmm0 {cdab}
010        swap with two-away       zmm0 {badc}
011        cross-product swizzle    zmm0 {dacb}
100        broadcast a element      zmm0 {aaaa}
101        broadcast b element      zmm0 {bbbb}
110        broadcast c element      zmm0 {cccc}
111        broadcast d element      zmm0 {dddd}

Intel® C/C++ Compiler Intrinsic Equivalent
__m512i _mm512_subr_epi32 (__m512i, __m512i);
__m512i _mm512_mask_subr_epi32 (__m512i, __mmask16, __m512i, __m512i);

Exceptions
Real-Address Mode and Virtual-8086
#UD               Instruction not available in these modes

Protected and Compatibility Mode
#UD               Instruction not available in these modes

64 bit Mode
#SS(0)            If a memory address referencing the SS segment is in a non-canonical form.
#GP(0)            If the memory address is in a non-canonical form.
                  If a memory operand linear address is not aligned to the data size
                  granularity dictated by SwizzUpConv mode.
#PF(fault-code)   For a page fault.
#NM               If CR0.TS[bit 3]=1.
                  If preceded by any REX, F0, F2, F3, or 66 prefixes.

VPSUBRSETBD - Reverse Subtract Int32 Vectors and Set Borrow

Opcode        MVEX.NDS.512.66.0F38.W0 6F /r
Instruction   vpsubrsetbd zmm1 {k1}, k2, Si32(zmm3/mt)
Description   Subtract int32 vector zmm1 from int32 vector Si32(zmm3/mt) and store
              the subtraction in zmm1 and the borrow from the subtraction in k2,
              under write-mask.

Description
Performs an element-by-element subtraction of int32 vector zmm1 from the int32 vector
result of the swizzle/broadcast/conversion process on memory or int32 vector zmm3.
The result is written into int32 vector zmm1.
In addition, the borrow from the subtraction for the n-th element is written into the n-th
bit of vector mask k2.
This instruction is write-masked, so only those elements with the corresponding bit set in
vector mask register k1 are computed and stored into zmm1 and k2. Elements in zmm1
and k2 with the corresponding bit clear in k1 retain their previous value.

Operation
if (source is a register operand and MVEX.EH bit is 1) {
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    tmpSrc3[511:0] = SwizzUpConvLoadi32(zmm3/mt)
}
for (n = 0; n < 16; n++) {
    if (k1[n] != 0) {
        i = 32*n
        // integer operation
        k2[n] = Borrow(tmpSrc3[i+31:i] - zmm1[i+31:i])
        zmm1[i+31:i] = tmpSrc3[i+31:i] - zmm1[i+31:i]
    }
}
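
As a hedged illustration of the operation above, the following scalar C model updates both the vector and the borrow mask. subrsetb_epi32_model is a hypothetical name. The sketch assumes that Borrow() denotes the unsigned borrow-out of the 32-bit subtraction, that is, 1 exactly when the minuend is smaller than the subtrahend as unsigned values.

```c
#include <stdint.h>

/* Masked reverse subtraction v1 = v3 - v1 with borrow-out written to *k2
 * (hypothetical reference model). */
void subrsetb_epi32_model(uint32_t v1[16], uint16_t k1, uint16_t *k2,
                          const uint32_t v3[16])
{
    for (int n = 0; n < 16; n++) {
        if (!((k1 >> n) & 1))
            continue;  /* v1[n] and bit n of *k2 keep their previous values */
        unsigned borrow = v3[n] < v1[n];  /* assumed meaning of Borrow() */
        v1[n] = v3[n] - v1[n];
        *k2 = (uint16_t)((*k2 & ~(1u << n)) | (borrow << n));
    }
}
```

The borrow is computed before v1[n] is overwritten, mirroring the order of the two assignments in the pseudocode.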

Flags Affected
None.

Memory Up-conversion: Si32
S2 S1 S0   Function:                    Usage                     disp8*N
000        no conversion                [rax] {16to16} or [rax]   64
001        broadcast 1 element (x16)    [rax] {1to16}             4
010        broadcast 4 elements (x4)    [rax] {4to16}             16
011        reserved                     N/A                       N/A
100        uint8 to uint32              [rax] {uint8}             16
101        sint8 to sint32              [rax] {sint8}             16
110        uint16 to uint32             [rax] {uint16}            32
111        sint16 to sint32             [rax] {sint16}            32

Register Swizzle: Si32
MVEX.EH=0
S2 S1 S0   Function: 4 x 32 bits    Usage
000        no swizzle               zmm0 or zmm0 {dcba}
001        swap (inner) pairs       zmm0 {cdab}
010        swap with two-away       zmm0 {badc}
011        cross-product swizzle    zmm0 {dacb}
100        broadcast a element      zmm0 {aaaa}
101        broadcast b element      zmm0 {bbbb}
110        broadcast c element      zmm0 {cccc}
111        broadcast d element      zmm0 {dddd}

Intel® C/C++ Compiler Intrinsic Equivalent
__m512i _mm512_subrsetb_epi32 (__m512i, __m512i, __mmask16*);
__m512i _mm512_mask_subrsetb_epi32 (__m512i, __mmask16, __mmask16, __m512i, __mmask16*);

Exceptions
Real-Address Mode and Virtual-8086
#UD               Instruction not available in these modes

Protected and Compatibility Mode
#UD               Instruction not available in these modes

64 bit Mode
#SS(0)            If a memory address referencing the SS segment is in a non-canonical form.
#GP(0)            If the memory address is in a non-canonical form.
                  If a memory operand linear address is not aligned to the data size
                  granularity dictated by SwizzUpConv mode.
#PF(fault-code)   For a page fault.
#NM               If CR0.TS[bit 3]=1.
#UD               If processor model does not implement the specific instruction.
                  If preceded by any REX, F0, F2, F3, or 66 prefixes.

VPSUBSETBD - Subtract Int32 Vectors and Set Borrow

Opcode        MVEX.NDS.512.66.0F38.W0 5F /r
Instruction   vpsubsetbd zmm1 {k1}, k2, Si32(zmm3/mt)
Description   Subtract int32 vector Si32(zmm3/mt) from int32 vector zmm1 and store
              the subtraction in zmm1 and the borrow from the subtraction in k2,
              under write-mask.

Description
Performs an element-by-element subtraction of the int32 vector result of the swizzle/broadcast/conversion process on memory or int32 vector zmm3 from int32 vector
zmm1. The result is written into int32 vector zmm1.
In addition, the borrow from the subtraction for the n-th element is written into the n-th
bit of vector mask k2.
This instruction is write-masked, so only those elements with the corresponding bit set in
vector mask register k1 are computed and stored into zmm1 and k2. Elements in zmm1
and k2 with the corresponding bit clear in k1 retain their previous value.

Operation
if (source is a register operand and MVEX.EH bit is 1) {
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    tmpSrc3[511:0] = SwizzUpConvLoadi32(zmm3/mt)
}
for (n = 0; n < 16; n++) {
    if (k1[n] != 0) {
        i = 32*n
        // integer operation
        k2[n] = Borrow(zmm1[i+31:i] - tmpSrc3[i+31:i])
        zmm1[i+31:i] = zmm1[i+31:i] - tmpSrc3[i+31:i]
    }
}
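
One plausible use of the borrow mask is multi-word arithmetic. The sketch below models the operation in scalar C and then uses it for 16 independent 64-bit subtractions whose operands are split into 32-bit halves. Both function names are hypothetical, Borrow() is assumed to be the unsigned borrow-out, and the high-half step is written as plain C rather than as any particular borrow-consuming instruction.

```c
#include <stdint.h>

/* Masked subtraction v1 = v1 - v3; returns the updated borrow mask k2
 * (hypothetical reference model). */
uint16_t subsetb_epi32_model(uint32_t v1[16], uint16_t k1, uint16_t k2,
                             const uint32_t v3[16])
{
    for (int n = 0; n < 16; n++) {
        if (!((k1 >> n) & 1))
            continue;  /* v1[n] and bit n of k2 keep their previous values */
        unsigned borrow = v1[n] < v3[n];  /* assumed meaning of Borrow() */
        v1[n] -= v3[n];
        k2 = (uint16_t)((k2 & ~(1u << n)) | (borrow << n));
    }
    return k2;
}

/* 16 independent 64-bit subtractions a -= b, each value split into halves. */
void sub64_split(uint32_t a_lo[16], uint32_t a_hi[16],
                 const uint32_t b_lo[16], const uint32_t b_hi[16])
{
    /* Low halves first; bit n of the mask records whether element n borrowed. */
    uint16_t borrow = subsetb_epi32_model(a_lo, 0xFFFF, 0, b_lo);
    for (int n = 0; n < 16; n++)
        a_hi[n] = a_hi[n] - b_hi[n] - ((borrow >> n) & 1u);
}
```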

Flags Affected
None.

Memory Up-conversion: Si32
S2 S1 S0   Function:                    Usage                     disp8*N
000        no conversion                [rax] {16to16} or [rax]   64
001        broadcast 1 element (x16)    [rax] {1to16}             4
010        broadcast 4 elements (x4)    [rax] {4to16}             16
011        reserved                     N/A                       N/A
100        uint8 to uint32              [rax] {uint8}             16
101        sint8 to sint32              [rax] {sint8}             16
110        uint16 to uint32             [rax] {uint16}            32
111        sint16 to sint32             [rax] {sint16}            32

Register Swizzle: Si32
MVEX.EH=0
S2 S1 S0   Function: 4 x 32 bits    Usage
000        no swizzle               zmm0 or zmm0 {dcba}
001        swap (inner) pairs       zmm0 {cdab}
010        swap with two-away       zmm0 {badc}
011        cross-product swizzle    zmm0 {dacb}
100        broadcast a element      zmm0 {aaaa}
101        broadcast b element      zmm0 {bbbb}
110        broadcast c element      zmm0 {cccc}
111        broadcast d element      zmm0 {dddd}

Intel® C/C++ Compiler Intrinsic Equivalent
__m512i _mm512_subsetb_epi32 (__m512i, __m512i, __mmask16*);
__m512i _mm512_mask_subsetb_epi32 (__m512i, __mmask16, __mmask16, __m512i, __mmask16*);

Exceptions
Real-Address Mode and Virtual-8086
#UD               Instruction not available in these modes

Protected and Compatibility Mode
#UD               Instruction not available in these modes

64 bit Mode
#SS(0)            If a memory address referencing the SS segment is in a non-canonical form.
#GP(0)            If the memory address is in a non-canonical form.
                  If a memory operand linear address is not aligned to the data size
                  granularity dictated by SwizzUpConv mode.
#PF(fault-code)   For a page fault.
#NM               If CR0.TS[bit 3]=1.
#UD               If processor model does not implement the specific instruction.
                  If preceded by any REX, F0, F2, F3, or 66 prefixes.

VPTESTMD - Logical AND Int32 Vectors and Set Vector Mask

Opcode        MVEX.NDS.512.66.0F38.W0 27 /r
Instruction   vptestmd k2 {k1}, zmm1, Si32(zmm2/mt)
Description   Perform a bitwise AND between int32 vector zmm1 and int32 vector
              Si32(zmm2/mt), and set vector mask k2 to reflect the zero/non-zero
              status of each element of the result, under write-mask.

Description
Performs an element-by-element bitwise AND between int32 vector zmm1 and the int32
vector result of the swizzle/broadcast/conversion process on memory or int32 vector
zmm2, and uses the result to construct a 16 bit vector mask, with a 0-bit for each element
for which the result of the AND was 0, and a 1-bit where the result of the AND was not
zero. The final result is written into vector mask k2.
The write-mask does not perform the normal write-masking function for this instruction.
While it does enable/disable comparisons, it does not block updating of the destination;
instead, if a write-mask bit is 0, the corresponding destination bit is set to 0.
Nonetheless, the operation is similar enough that it makes sense to use the usual write-mask
notation. This mode of operation is desirable because the result will be used directly as a
write-mask, rather than the normal case where the result is used with a separate
write-mask that keeps the masked elements inactive.

Operation
if (source is a register operand and MVEX.EH bit is 1) {
    tmpSrc2[511:0] = zmm2[511:0]
} else {
    tmpSrc2[511:0] = SwizzUpConvLoadi32(zmm2/mt)
}
for (n = 0; n < 16; n++) {
    k2[n] = 0
    if (k1[n] != 0) {
        i = 32*n
        // bitwise operation; k2[n] stays 0 when k1[n] is 0
        if ((zmm1[i+31:i] & tmpSrc2[i+31:i]) != 0) {
            k2[n] = 1
        }
    }
}
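
The zeroing behavior of the write-mask described above is easy to miss, so here is a small scalar C sketch of it. test_epi32_mask_model is a hypothetical name, and b stands for the already swizzled or up-converted second source.

```c
#include <stdint.h>

/* Returns a 16-bit mask with bit n set when (a[n] & b[n]) != 0 and bit n of
 * k1 is set; bits for masked-off elements are 0, not preserved
 * (hypothetical reference model). */
uint16_t test_epi32_mask_model(uint16_t k1,
                               const uint32_t a[16], const uint32_t b[16])
{
    uint16_t k2 = 0;  /* every destination bit starts at 0 */
    for (int n = 0; n < 16; n++) {
        if (((k1 >> n) & 1) && (a[n] & b[n]) != 0)
            k2 |= (uint16_t)(1u << n);
    }
    return k2;
}
```

Returning a fresh mask instead of updating one in place reflects the point made in the description: there is no previous destination value to retain.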

Flags Affected
None.

Memory Up-conversion: Si32
S2 S1 S0   Function:                    Usage                     disp8*N
000        no conversion                [rax] {16to16} or [rax]   64
001        broadcast 1 element (x16)    [rax] {1to16}             4
010        broadcast 4 elements (x4)    [rax] {4to16}             16
011        reserved                     N/A                       N/A
100        uint8 to uint32              [rax] {uint8}             16
101        sint8 to sint32              [rax] {sint8}             16
110        uint16 to uint32             [rax] {uint16}            32
111        sint16 to sint32             [rax] {sint16}            32

Register Swizzle: Si32
MVEX.EH=0
S2 S1 S0   Function: 4 x 32 bits    Usage
000        no swizzle               zmm0 or zmm0 {dcba}
001        swap (inner) pairs       zmm0 {cdab}
010        swap with two-away       zmm0 {badc}
011        cross-product swizzle    zmm0 {dacb}
100        broadcast a element      zmm0 {aaaa}
101        broadcast b element      zmm0 {bbbb}
110        broadcast c element      zmm0 {cccc}
111        broadcast d element      zmm0 {dddd}

Intel® C/C++ Compiler Intrinsic Equivalent
__mmask16 _mm512_test_epi32_mask (__m512i, __m512i);
__mmask16 _mm512_mask_test_epi32_mask (__mmask16, __m512i, __m512i);

Exceptions
Real-Address Mode and Virtual-8086
#UD               Instruction not available in these modes

Protected and Compatibility Mode
#UD               Instruction not available in these modes

64 bit Mode
#SS(0)            If a memory address referencing the SS segment is in a non-canonical form.
#GP(0)            If the memory address is in a non-canonical form.
                  If a memory operand linear address is not aligned to the data size
                  granularity dictated by SwizzUpConv mode.
#PF(fault-code)   For a page fault.
#NM               If CR0.TS[bit 3]=1.
                  If preceded by any REX, F0, F2, F3, or 66 prefixes.

VPXORD - Bitwise XOR Int32 Vectors

Opcode        MVEX.NDS.512.66.0F.W0 EF /r
Instruction   vpxord zmm1 {k1}, zmm2, Si32(zmm3/mt)
Description   Perform a bitwise XOR between int32 vector zmm2 and int32 vector
              Si32(zmm3/mt) and store the result in zmm1, under write-mask.

Description
Performs an element-by-element bitwise XOR between int32 vector zmm2 and the int32
vector result of the swizzle/broadcast/conversion process on memory or int32 vector
zmm3. The result is written into int32 vector zmm1.
This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.

Operation
if (source is a register operand and MVEX.EH bit is 1) {
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    tmpSrc3[511:0] = SwizzUpConvLoadi32(zmm3/mt)
}
for (n = 0; n < 16; n++) {
    if (k1[n] != 0) {
        i = 32*n
        // bitwise operation (signedness is irrelevant)
        zmm1[i+31:i] = zmm2[i+31:i] ^ tmpSrc3[i+31:i]
    }
}
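
A minimal scalar C sketch of the masked XOR follows. xor_epi32_model is a hypothetical name, and b stands for the already swizzled or up-converted operand.

```c
#include <stdint.h>

/* Masked element-wise XOR dst = a ^ b (hypothetical reference model). */
void xor_epi32_model(uint32_t dst[16], uint16_t k1,
                     const uint32_t a[16], const uint32_t b[16])
{
    for (int n = 0; n < 16; n++) {
        if ((k1 >> n) & 1)
            dst[n] = a[n] ^ b[n];
        /* otherwise dst[n] retains its previous value */
    }
}
```

Passing the same vector for a and b clears exactly the elements selected by k1, since x ^ x is 0, while the remaining elements of dst are left as they were.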

Flags Affected
None.

Memory Up-conversion: Si32
S2 S1 S0   Function:                    Usage                     disp8*N
000        no conversion                [rax] {16to16} or [rax]   64
001        broadcast 1 element (x16)    [rax] {1to16}             4
010        broadcast 4 elements (x4)    [rax] {4to16}             16
011        reserved                     N/A                       N/A
100        uint8 to uint32              [rax] {uint8}             16
101        sint8 to sint32              [rax] {sint8}             16
110        uint16 to uint32             [rax] {uint16}            32
111        sint16 to sint32             [rax] {sint16}            32

Register Swizzle: Si32
MVEX.EH=0
S2 S1 S0   Function: 4 x 32 bits    Usage
000        no swizzle               zmm0 or zmm0 {dcba}
001        swap (inner) pairs       zmm0 {cdab}
010        swap with two-away       zmm0 {badc}
011        cross-product swizzle    zmm0 {dacb}
100        broadcast a element      zmm0 {aaaa}
101        broadcast b element      zmm0 {bbbb}
110        broadcast c element      zmm0 {cccc}
111        broadcast d element      zmm0 {dddd}

Intel® C/C++ Compiler Intrinsic Equivalent
__m512i _mm512_xor_epi32 (__m512i, __m512i);
__m512i _mm512_mask_xor_epi32 (__m512i, __mmask16, __m512i, __m512i);

Exceptions
Real-Address Mode and Virtual-8086
#UD               Instruction not available in these modes

Protected and Compatibility Mode
#UD               Instruction not available in these modes

64 bit Mode
#SS(0)            If a memory address referencing the SS segment is in a non-canonical form.
#GP(0)            If the memory address is in a non-canonical form.
                  If a memory operand linear address is not aligned to the data size
                  granularity dictated by SwizzUpConv mode.
#PF(fault-code)   For a page fault.
#NM               If CR0.TS[bit 3]=1.
                  If preceded by any REX, F0, F2, F3, or 66 prefixes.

VPXORQ - Bitwise XOR Int64 Vectors

Opcode:       MVEX.NDS.512.66.0F.W1 EF /r
Instruction:  vpxorq zmm1 {k1}, zmm2, Si64(zmm3/mt)
Description:  Perform a bitwise XOR between int64 vector zmm2 and int64
              vector Si64(zmm3/mt) and store the result in zmm1, under
              write-mask.

Description
Performs an element-by-element bitwise XOR between int64 vector zmm2 and the int64
vector result of the swizzle/broadcast/conversion process on memory or int64 vector
zmm3. The result is written into int64 vector zmm1.
This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.

Operation
if(source is a register operand and MVEX.EH bit is 1) {
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    tmpSrc3[511:0] = SwizzUpConvLoadi64(zmm3/mt)
}
for (n = 0; n < 8; n++) {
    if(k1[n] != 0) {
        i = 64*n
        zmm1[i+63:i] = zmm2[i+63:i] ^ tmpSrc3[i+63:i]
    }
}
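
The write-masked loop in the Operation pseudocode can be modeled in portable C. The sketch below is an illustration only: the function name vpxorq_model and the use of plain arrays in place of zmm registers are assumptions, and the swizzle/up-conversion step is taken as already applied to src3.

```c
#include <stdint.h>

/* Hypothetical scalar model of the write-masked 64-bit XOR.
 * dst, src2 and src3 each stand in for one 512-bit register (8 x int64). */
void vpxorq_model(uint64_t dst[8], uint8_t k1,
                  const uint64_t src2[8], const uint64_t src3[8])
{
    for (int n = 0; n < 8; n++) {
        if ((k1 >> n) & 1u) {
            dst[n] = src2[n] ^ src3[n];   /* element enabled by mask bit n */
        }
        /* elements whose mask bit is clear keep their previous value */
    }
}
```

Calling the model with k1 = 0x05 updates only elements 0 and 2 of dst and leaves the other six untouched.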

Flags Affected
None.

Memory Up-conversion: Si64
S2 S1 S0   Function:                    Usage                    disp8*N
000        no conversion                [rax] {8to8} or [rax]    64
001        broadcast 1 element (x8)     [rax] {1to8}             8
010        broadcast 4 elements (x2)    [rax] {4to8}             32
011        reserved                     N/A                      N/A
100        reserved                     N/A                      N/A
101        reserved                     N/A                      N/A
110        reserved                     N/A                      N/A
111        reserved                     N/A                      N/A

Register Swizzle: Si64
MVEX.EH=0
S2 S1 S0   Function: 4 x 64 bits     Usage
000        no swizzle                zmm0 or zmm0 {dcba}
001        swap (inner) pairs        zmm0 {cdab}
010        swap with two-away        zmm0 {badc}
011        cross-product swizzle     zmm0 {dacb}
100        broadcast a element       zmm0 {aaaa}
101        broadcast b element       zmm0 {bbbb}
110        broadcast c element       zmm0 {cccc}
111        broadcast d element       zmm0 {dddd}

Intel® C/C++ Compiler Intrinsic Equivalent
__m512i _mm512_xor_epi64 (__m512i, __m512i);
__m512i _mm512_mask_xor_epi64 (__m512i, __mmask8, __m512i, __m512i);

Exceptions
Real-Address Mode and Virtual-8086
#UD               Instruction not available in these modes

Protected and Compatibility Mode
#UD               Instruction not available in these modes

64 bit Mode
#SS(0)            If a memory address referencing the SS segment is
                  in a non-canonical form.
#GP(0)            If the memory address is in a non-canonical form.
                  If a memory operand linear address is not aligned
                  to the data size granularity dictated by SwizzUpConv
                  mode.
#PF(fault-code)   For a page fault.
#NM               If CR0.TS[bit 3]=1.
#UD               If preceded by any REX, F0, F2, F3, or 66 prefixes.

VRCP23PS - Reciprocal of Float32 Vector

Opcode:       MVEX.512.66.0F38.W0 CA /r
Instruction:  vrcp23ps zmm1 {k1}, zmm2/mt
Description:  Compute the approximate reciprocals of float32 vector zmm2/mt
              and store the result in zmm1, under write-mask.

Description
Computes the element-by-element reciprocal approximation of the float32 vector on
memory or float32 vector zmm2 with 0.912ULP (relative error). The result is written
into float32 vector zmm1.
If any source element is NaN, the quietized NaN source value is returned for that element.
If any source element is ±∞, 0.0 is returned for that element. Also, if any source element
is ±0.0, ±∞ is returned for that element.
Current implementation of this instruction does not support any SwizzUpConv setting
other than "no broadcast and no conversion"; any other SwizzUpConv setting will result
in an Invalid Opcode exception.
The recip_1ulp() function follows Table 6.26 when dealing with floating-point special numbers.

Input   Result       Comment
NaN     input qNaN   raise #I flag if sNaN
+∞      +0
+0      +∞           raise #Z flag
−0      −∞           raise #Z flag
−∞      −0
2^n     2^−n         exact result

Table 6.26: recip_1ulp() special floating-point values behavior
This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.
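
The special-value rows of Table 6.26, together with this instruction's denormal handling (input denormals treated as zeros, tiny results flushed to zero), can be sketched in C as follows. This is a hypothetical reference, not the hardware algorithm: the function name is invented, ordinary inputs use an exact division as a stand-in for the approximation, and neither NaN quieting nor exception flags are modeled.

```c
#include <math.h>

/* Hypothetical reference for the special-value behavior of the reciprocal.
 * Ordinary inputs use 1.0f / x, so low-order bits may differ from the
 * approximation the instruction returns. */
float rcp23_special_model(float x)
{
    if (isnan(x))
        return x;                              /* NaN in, NaN out */
    if (fpclassify(x) == FP_SUBNORMAL)
        x = signbit(x) ? -0.0f : 0.0f;         /* input denormals are zeros */
    if (x == 0.0f)
        return signbit(x) ? -INFINITY : INFINITY;
    if (isinf(x))
        return signbit(x) ? -0.0f : 0.0f;
    float r = 1.0f / x;
    if (fpclassify(r) == FP_SUBNORMAL)
        r = signbit(r) ? -0.0f : 0.0f;         /* tiny results are flushed */
    return r;
}
```

The model returns an infinity of matching sign for ±0 and for denormal inputs, and a zero of matching sign for ±∞.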

Operation
tmpSrc2[511:0] = zmm2/mt
if(source is a register operand and MVEX.EH bit is 1) {
    if(SSS[2]==1) Supress_Exception_Flags() // SAE
}
for (n = 0; n < 16; n++) {
    if (k1[n] != 0) {
        i = 32*n
        zmm1[i+31:i] = recip_1ulp(tmpSrc2[i+31:i])
    }
}

SIMD Floating-Point Exceptions
Invalid, Zero.

Denormal Handling
Treat Input Denormals As Zeros : YES
Flush Tiny Results To Zero : YES

Register Swizzle
MVEX.EH=0
S2 S1 S0   Function: 4 x 32 bits     Usage
000        no swizzle                zmm0 or zmm0 {dcba}
001        reserved                  N/A
010        reserved                  N/A
011        reserved                  N/A
100        reserved                  N/A
101        reserved                  N/A
110        reserved                  N/A
111        reserved                  N/A
MVEX.EH=1
S2 S1 S0   Rounding Mode Override           Usage
1xx        SAE (Suppress-All-Exceptions)    , {sae}

Intel® C/C++ Compiler Intrinsic Equivalent
__m512 _mm512_rcp23_ps (__m512);
__m512 _mm512_mask_rcp23_ps (__m512, __mmask16, __m512);

Exceptions
Real-Address Mode and Virtual-8086
#UD               Instruction not available in these modes

Protected and Compatibility Mode
#UD               Instruction not available in these modes

64 bit Mode
#SS(0)            If a memory address referencing the SS segment is
                  in a non-canonical form.
#GP(0)            If the memory address is in a non-canonical form.
#PF(fault-code)   For a page fault.
#NM               If CR0.TS[bit 3]=1.
#UD               If preceded by any REX, F0, F2, F3, or 66 prefixes.
                  This instruction does not support any SwizzUpConv
                  different from the default value (no broadcast, no
                  conversion). If the SwizzUpConv function is set to any
                  value other than "no action", an Invalid Opcode fault
                  is raised. This includes register swizzles.

VRNDFXPNTPD - Round Float64 Vector

Opcode:       MVEX.512.66.0F3A.W1 52 /r ib
Instruction:  vrndfxpntpd zmm1 {k1}, Sf64(zmm2/mt), imm8
Description:  Round float64 vector Sf64(zmm2/mt) and store the result in
              zmm1, using imm8, under write-mask.

Description
Performs an element-by-element rounding of the result of the swizzle/broadcast/conversion
from memory or float64 vector zmm2. The rounding result for each element is a float64
containing an integer or fixed-point value, depending on the value of expadj; the direction
of rounding depends on the value of RC. The result is written into float64 vector zmm1.
This instruction doesn't actually convert the result to an int64; the results are float64s,
just like the input, but are float64s containing the integer or fixed-point values that result
from the specified rounding and scaling.
The RoundToInt() function follows Table 6.27 when dealing with floating-point special numbers.

Input   Result
NaN     quietized input NaN
+∞      +∞
+0      +0
−0      −0
−∞      −∞

Table 6.27: RoundToInt() special floating-point values behavior
This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.

Immediate Format

      Rounding Mode                                 I1   I0
rn    Round to Nearest (even)                       0    0
rd    Round Down (Round toward Negative Infinity)   0    1
ru    Round Up (Round toward Positive Infinity)     1    0
rz    Round toward Zero                             1    1

Exponent Adjustment   value                                   I7   I6   I5   I4
0                     2^0 (64.0 - no exponent adjustment)     0    0    0    0
4                     2^4 (60.4)                              0    0    0    1
5                     2^5 (59.5)                              0    0    1    0
8                     2^8 (56.8)                              0    0    1    1
16                    2^16 (48.16)                            0    1    0    0
24                    2^24 (40.24)                            0    1    0    1
31                    2^31 (33.31)                            0    1    1    0
32                    2^32 (32.32)                            0    1    1    1
reserved              *must UD*                               1    x    x    x

Operation
RoundingMode = IMM8[1:0]
expadj = IMM8[6:4]
if(source is a register operand and MVEX.EH bit is 1) {
    if(SSS[2]==1) Supress_Exception_Flags() // SAE
    tmpSrc2[511:0] = zmm2[511:0]
} else {
    tmpSrc2[511:0] = SwizzUpConvLoadf64(zmm2/mt)
}
for (n = 0; n < 8; n++) {
    if(k1[n] != 0) {
        i = 64*n
        // float64 operation
        zmm1[i+63:i] =
            RoundToInt(tmpSrc2[i+63:i] * EXPADJ_TABLE[expadj], RoundingMode) /
            EXPADJ_TABLE[expadj]
    }
}
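
The scale, round, unscale sequence in the Operation pseudocode can be tried out with a small C model. The sketch below is a simplification under stated assumptions: the names round_to_int and rndfxpnt_model are hypothetical, the scaled value is assumed to have a magnitude below 2^52, and NaN, infinities and the sign of zero are not modeled. The bit positions of the rounding mode and exponent adjustment follow the Immediate Format tables.

```c
#include <stdint.h>

enum round_mode { RN = 0, RD = 1, RU = 2, RZ = 3 };   /* imm8[1:0] */

/* Round v to an integral value using integer casts only (|v| < 2^52 assumed). */
static double round_to_int(double v, enum round_mode rc)
{
    double t = (double)(int64_t)v;           /* truncation toward zero */
    double lo = (t > v) ? t - 1.0 : t;       /* floor(v) */
    double hi = (t < v) ? t + 1.0 : t;       /* ceil(v) */
    switch (rc) {
    case RZ: return t;
    case RD: return lo;
    case RU: return hi;
    default: {                               /* RN: ties go to the even value */
        double frac = v - lo;                /* exact, in [0, 1) */
        if (frac > 0.5) return hi;
        if (frac < 0.5) return lo;
        return ((int64_t)lo % 2 == 0) ? lo : hi;
    }
    }
}

/* imm8[6:4] selects the number of fractional bits kept in the result. */
double rndfxpnt_model(double x, unsigned imm8)
{
    static const int expadj_bits[8] = { 0, 4, 5, 8, 16, 24, 31, 32 };
    enum round_mode rc = (enum round_mode)(imm8 & 3u);
    double scale = (double)(1ULL << expadj_bits[(imm8 >> 4) & 7u]);
    return round_to_int(x * scale, rc) / scale;
}
```

For example, rndfxpnt_model(1.07, 0x13) keeps four fractional bits and truncates, giving 1.0625, while rndfxpnt_model(2.5, 0x00) rounds to the even integer 2.0.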

SIMD Floating-Point Exceptions
Invalid, Precision.

Denormal Handling
Treat Input Denormals As Zeros : (MXCSR.DAZ)? YES : NO
Flush Tiny Results To Zero : NO

Memory Up-conversion: Sf64
S2 S1 S0   Function:                    Usage                    disp8*N
000        no conversion                [rax] {8to8} or [rax]    64
001        broadcast 1 element (x8)     [rax] {1to8}             8
010        broadcast 4 elements (x2)    [rax] {4to8}             32
011        reserved                     N/A                      N/A
100        reserved                     N/A                      N/A
101        reserved                     N/A                      N/A
110        reserved                     N/A                      N/A
111        reserved                     N/A                      N/A

Register Swizzle: Sf64
MVEX.EH=0
S2 S1 S0   Function: 4 x 64 bits     Usage
000        no swizzle                zmm0 or zmm0 {dcba}
001        swap (inner) pairs        zmm0 {cdab}
010        swap with two-away        zmm0 {badc}
011        cross-product swizzle     zmm0 {dacb}
100        broadcast a element       zmm0 {aaaa}
101        broadcast b element       zmm0 {bbbb}
110        broadcast c element       zmm0 {cccc}
111        broadcast d element       zmm0 {dddd}
MVEX.EH=1
S2 S1 S0   Rounding Mode Override           Usage
1xx        SAE (Suppress-All-Exceptions)    , {sae}

Intel® C/C++ Compiler Intrinsic Equivalent
__m512d _mm512_roundfxpnt_adjust_pd (__m512d, int, _MM_EXP_ADJ_ENUM);
__m512d _mm512_mask_roundfxpnt_adjust_pd (__m512d, __mmask8, __m512d, int,
                                          _MM_EXP_ADJ_ENUM);

Exceptions
Real-Address Mode and Virtual-8086
#UD               Instruction not available in these modes

Protected and Compatibility Mode
#UD               Instruction not available in these modes

64 bit Mode
#SS(0)            If a memory address referencing the SS segment is
                  in a non-canonical form.
#GP(0)            If the memory address is in a non-canonical form.
                  If a memory operand linear address is not aligned
                  to the data size granularity dictated by SwizzUpConv
                  mode.
#PF(fault-code)   For a page fault.
#NM               If CR0.TS[bit 3]=1.
#UD               If processor model does not implement the specific instruction.
                  If preceded by any REX, F0, F2, F3, or 66 prefixes.

VRNDFXPNTPS - Round Float32 Vector

Opcode:       MVEX.512.66.0F3A.W0 52 /r ib
Instruction:  vrndfxpntps zmm1 {k1}, Sf32(zmm2/mt), imm8
Description:  Round float32 vector Sf32(zmm2/mt) and store the result in
              zmm1, using imm8, under write-mask.

Description
Performs an element-by-element rounding of the result of the swizzle/broadcast/conversion
from memory or float32 vector zmm2. The rounding result for each element is a float32
containing an integer or fixed-point value, depending on the value of expadj; the direction
of rounding depends on the value of RC. The result is written into float32 vector zmm1.
This instruction doesn't actually convert the result to an int32; the results are float32s,
just like the input, but are float32s containing the integer or fixed-point values that result
from the specified rounding and scaling.
The RoundToInt() function follows Table 6.28 when dealing with floating-point special numbers.
This instruction treats input denormals as zeros according to the DAZ control bit, but does
not flush tiny results to zero.

Input   Result
NaN     quietized input NaN
+∞      +∞
+0      +0
−0      −0
−∞      −∞

Table 6.28: RoundToInt() special floating-point values behavior
This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.

Immediate Format

      Rounding Mode                                 I1   I0
rn    Round to Nearest (even)                       0    0
rd    Round Down (Round toward Negative Infinity)   0    1
ru    Round Up (Round toward Positive Infinity)     1    0
rz    Round toward Zero                             1    1

Exponent Adjustment   value                                   I7   I6   I5   I4
0                     2^0 (32.0 - no exponent adjustment)     0    0    0    0
4                     2^4 (28.4)                              0    0    0    1
5                     2^5 (27.5)                              0    0    1    0
8                     2^8 (24.8)                              0    0    1    1
16                    2^16 (16.16)                            0    1    0    0
24                    2^24 (8.24)                             0    1    0    1
31                    2^31 (1.31)                             0    1    1    0
32                    2^32 (0.32)                             0    1    1    1
reserved              *must UD*                               1    x    x    x

Operation
RoundingMode = IMM8[1:0]
expadj = IMM8[6:4]
if(source is a register operand and MVEX.EH bit is 1) {
    if(SSS[2]==1) Supress_Exception_Flags() // SAE
    tmpSrc2[511:0] = zmm2[511:0]
} else {
    tmpSrc2[511:0] = SwizzUpConvLoadf32(zmm2/mt)
}
for (n = 0; n < 16; n++) {
    if(k1[n] != 0) {
        i = 32*n
        // float32 operation
        zmm1[i+31:i] =
            RoundToInt(tmpSrc2[i+31:i] * EXPADJ_TABLE[expadj], RoundingMode) /
            EXPADJ_TABLE[expadj]
    }
}

SIMD Floating-Point Exceptions
Invalid, Precision.

Denormal Handling
Treat Input Denormals As Zeros : (MXCSR.DAZ)? YES : NO
Flush Tiny Results To Zero : NO

Memory Up-conversion: Sf32
S2 S1 S0   Function:                    Usage                      disp8*N
000        no conversion                [rax] {16to16} or [rax]    64
001        broadcast 1 element (x16)    [rax] {1to16}              4
010        broadcast 4 elements (x4)    [rax] {4to16}              16
011        float16 to float32           [rax] {float16}            32
100        uint8 to float32             [rax] {uint8}              16
110        uint16 to float32            [rax] {uint16}             32
111        sint16 to float32            [rax] {sint16}             32

Register Swizzle: Sf32
MVEX.EH=0
S2 S1 S0   Function: 4 x 32 bits     Usage
000        no swizzle                zmm0 or zmm0 {dcba}
001        swap (inner) pairs        zmm0 {cdab}
010        swap with two-away        zmm0 {badc}
011        cross-product swizzle     zmm0 {dacb}
100        broadcast a element       zmm0 {aaaa}
101        broadcast b element       zmm0 {bbbb}
110        broadcast c element       zmm0 {cccc}
111        broadcast d element       zmm0 {dddd}
MVEX.EH=1
S2 S1 S0   Rounding Mode Override           Usage
1xx        SAE (Suppress-All-Exceptions)    , {sae}

Intel® C/C++ Compiler Intrinsic Equivalent
__m512 _mm512_roundfxpnt_adjust_ps (__m512, int, _MM_EXP_ADJ_ENUM);
__m512 _mm512_mask_roundfxpnt_adjust_ps (__m512, __mmask16, __m512, int,
                                         _MM_EXP_ADJ_ENUM);

Exceptions
Real-Address Mode and Virtual-8086
#UD               Instruction not available in these modes

Protected and Compatibility Mode
#UD               Instruction not available in these modes

64 bit Mode
#SS(0)            If a memory address referencing the SS segment is
                  in a non-canonical form.
#GP(0)            If the memory address is in a non-canonical form.
                  If a memory operand linear address is not aligned
                  to the data size granularity dictated by SwizzUpConv
                  mode.
#PF(fault-code)   For a page fault.
#NM               If CR0.TS[bit 3]=1.
#UD               If processor model does not implement the specific instruction.
                  If preceded by any REX, F0, F2, F3, or 66 prefixes.

VRSQRT23PS - Vector Reciprocal Square Root of Float32 Vector

Opcode:       MVEX.512.66.0F38.W0 CB /r
Instruction:  vrsqrt23ps zmm1 {k1}, zmm2/mt
Description:  Compute the reciprocal square root of float32 vector zmm2/mt
              and store the result in zmm1, under write-mask.

Description
Computes the element-by-element reciprocal square root of the float32 vector on memory
or float32 vector zmm2 with a precision of 0.775ULP (relative error). The result is written
into float32 vector zmm1.
If any source element is NaN, the quietized NaN source value is returned for that element.
Negative source numbers, as well as −∞, return the canonical NaN and set the Invalid
Flag (#I).
Current implementation of this instruction does not support any SwizzUpConv setting
other than "no broadcast and no conversion"; any other SwizzUpConv setting will result
in an Invalid Opcode exception.
The rsqrt_1ulp() function follows Table 6.29 when dealing with floating-point special numbers.
For an input value of ±0 the instruction returns ∞ with the sign of the input (see Table 6.29)
and sets the Divide-By-Zero flag (#Z). Negative numbers should return NaN and set the
Invalid flag (#I). Note however that this instruction treats input denormals as zeros of the
same sign, so for denormal negative inputs it returns −∞ and sets the Divide-By-Zero
status flag.

Input   Result       Comments
NaN     input qNaN   Raise #I flag if sNaN
+∞      +0
+0      +∞           Raise #Z flag
−0      −∞           Raise #Z flag
<0      NaN          Raise #I flag
−∞      NaN          Raise #I flag
2^2n    2^−n         exact result

Table 6.29: rsqrt_1ulp() special floating-point values behavior
This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.
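
The flag column of Table 6.29, combined with the rule that input denormals are treated as zeros of the same sign, can be expressed as a small classifier over the raw float32 bit pattern. The sketch below is hypothetical: the function and enumerator names are invented, and the signaling/quiet NaN distinction assumes the usual x86 convention that the top fraction bit is set for a quiet NaN.

```c
#include <stdint.h>

enum { FLAG_NONE = 0, FLAG_INVALID = 1, FLAG_DIV_BY_ZERO = 2 };

/* Hypothetical model: which status flag one input element would raise. */
int rsqrt23_flags_model(uint32_t bits)
{
    uint32_t sign = bits >> 31;
    uint32_t exp  = (bits >> 23) & 0xFFu;
    uint32_t frac = bits & 0x7FFFFFu;

    if (exp == 0xFFu && frac != 0)            /* NaN: only sNaN raises #I */
        return (frac & 0x400000u) ? FLAG_NONE : FLAG_INVALID;
    if (exp == 0)                             /* zero, or denormal treated as zero */
        return FLAG_DIV_BY_ZERO;
    if (sign)                                 /* negative normal numbers and -inf */
        return FLAG_INVALID;
    return FLAG_NONE;                         /* +inf and positive normal numbers */
}
```

Note how a negative denormal such as 0x80000001 lands in the Divide-By-Zero case rather than the Invalid case, matching the remark in the description.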

Operation
tmpSrc2[511:0] = zmm2/mt
if(source is a register operand and MVEX.EH bit is 1) {
    if(SSS[2]==1) Supress_Exception_Flags() // SAE
}
for (n = 0; n < 16; n++) {
    if (k1[n] != 0) {
        i = 32*n
        zmm1[i+31:i] = rsqrt_1ulp(tmpSrc2[i+31:i])
    }
}

SIMD Floating-Point Exceptions
Invalid, Zero.

Denormal Handling
Treat Input Denormals As Zeros : YES
Flush Tiny Results To Zero : YES

Register Swizzle
MVEX.EH=0
S2 S1 S0   Function: 4 x 32 bits     Usage
000        no swizzle                zmm0 or zmm0 {dcba}
001        reserved                  N/A
010        reserved                  N/A
011        reserved                  N/A
100        reserved                  N/A
101        reserved                  N/A
110        reserved                  N/A
111        reserved                  N/A
MVEX.EH=1
S2 S1 S0   Rounding Mode Override           Usage
1xx        SAE (Suppress-All-Exceptions)    , {sae}

Intel® C/C++ Compiler Intrinsic Equivalent
__m512 _mm512_rsqrt23_ps (__m512);
__m512 _mm512_mask_rsqrt23_ps (__m512, __mmask16, __m512);

Exceptions
Real-Address Mode and Virtual-8086
#UD               Instruction not available in these modes

Protected and Compatibility Mode
#UD               Instruction not available in these modes

64 bit Mode
#SS(0)            If a memory address referencing the SS segment is
                  in a non-canonical form.
#GP(0)            If the memory address is in a non-canonical form.
#PF(fault-code)   For a page fault.
#NM               If CR0.TS[bit 3]=1.
#UD               If preceded by any REX, F0, F2, F3, or 66 prefixes.
                  This instruction does not support any SwizzUpConv
                  different from the default value (no broadcast, no
                  conversion). If the SwizzUpConv function is set to any
                  value other than "no action", an Invalid Opcode fault
                  is raised. This includes register swizzles.

VSCALEPS - Scale Float32 Vectors

Opcode:       MVEX.NDS.512.66.0F38.W0 84 /r
Instruction:  vscaleps zmm1 {k1}, zmm2, Si32(zmm3/mt)
Description:  Multiply float32 vector zmm2 by 2 raised to the int32 vector
              Si32(zmm3/mt) and store the result in zmm1, under write-mask.

Description
Performs an element-by-element scale of float32 vector zmm2 by multiplying it by 2^exp,
where exp is the vector int32 result of the swizzle/broadcast/conversion process on
memory or vector int32 zmm3. The result is written into vector float32 zmm1.
This instruction is needed for scaling u and v coordinates according to the mipmap size,
which is 2^mipmap_level, and for the evaluation of Exp2.
Cases where the exponent would go out of range are handled as if multiplication (via
vmulps) of zmm2 by 2^zmm3 had been performed.
If the result cannot be represented with a float32, then the properly signed ∞ (for positive
scaling operand) or 0 (for negative scaling operand) will be returned.
This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.
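
A per-element view of the scaling can be sketched with the standard C function ldexpf(), which multiplies a float by an integral power of two. The model below is an approximation under stated assumptions: the function names are hypothetical, and neither the MVEX rounding-mode override nor the MXCSR.DAZ/FZ controls are modeled.

```c
#include <math.h>
#include <stdint.h>

/* Hypothetical per-element model: x * 2^e computed with ldexpf(). */
float scale_element_model(float x, int32_t e)
{
    return ldexpf(x, (int)e);
}

/* Hypothetical write-masked wrapper over 16 float32 elements. */
void vscaleps_model(float dst[16], uint16_t k1,
                    const float src2[16], const int32_t src3[16])
{
    for (int n = 0; n < 16; n++) {
        if ((k1 >> n) & 1u)
            dst[n] = scale_element_model(src2[n], src3[n]);
    }
}
```

Large positive exponents drive the result to a signed infinity and large negative exponents drive it to zero, which mirrors the out-of-range behavior described above.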

Operation
if(source is a register operand and MVEX.EH bit is 1) {
    if(SSS[2]==1) Supress_Exception_Flags() // SAE
    RoundingMode = SSS[1:0]
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    RoundingMode = MXCSR.RC
    tmpSrc3[511:0] = SwizzUpConvLoadI(zmm3/mt)
}
for (n = 0; n < 16; n++) {
    if (k1[n] != 0) {
        i = 32*n
        exp[31:0] = tmpSrc3[i+31:i]
        // signed int scale operation. float32 multiplication
        zmm1[i+31:i] = zmm2[i+31:i] * 2^exp[31:0]
    }
}

SIMD Floating-Point Exceptions
Overflow, Underflow, Invalid, Precision, Denormal.

Denormal Handling
Treat Input Denormals As Zeros : (MXCSR.DAZ)? YES : NO
Flush Tiny Results To Zero : (MXCSR.FZ)? YES : NO

Memory Up-conversion: Si32
S2 S1 S0   Function:                    Usage                      disp8*N
000        no conversion                [rax] {16to16} or [rax]    64
001        broadcast 1 element (x16)    [rax] {1to16}              4
010        broadcast 4 elements (x4)    [rax] {4to16}              16
011        reserved                     N/A                        N/A
100        uint8 to uint32              [rax] {uint8}              16
101        sint8 to sint32              [rax] {sint8}              16
110        uint16 to uint32             [rax] {uint16}             32
111        sint16 to sint32             [rax] {sint16}             32

Register Swizzle: Si32
MVEX.EH=0
S2 S1 S0   Function: 4 x 32 bits     Usage
000        no swizzle                zmm0 or zmm0 {dcba}
001        swap (inner) pairs        zmm0 {cdab}
010        swap with two-away        zmm0 {badc}
011        cross-product swizzle     zmm0 {dacb}
100        broadcast a element       zmm0 {aaaa}
101        broadcast b element       zmm0 {bbbb}
110        broadcast c element       zmm0 {cccc}
111        broadcast d element       zmm0 {dddd}
MVEX.EH=1
S2 S1 S0   Rounding Mode Override              Usage
000        Round To Nearest (even)             , {rn}
001        Round Down (-INF)                   , {rd}
010        Round Up (+INF)                     , {ru}
011        Round Toward Zero                   , {rz}
100        Round To Nearest (even) with SAE    , {rn-sae}
101        Round Down (-INF) with SAE          , {rd-sae}
110        Round Up (+INF) with SAE            , {ru-sae}
111        Round Toward Zero with SAE          , {rz-sae}
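
The MVEX.EH=1 encoding packs two pieces of information into S2 S1 S0: the low two bits select the rounding mode and the top bit requests SAE. A minimal C sketch of that decoding follows; the helper names are hypothetical and exist only for illustration.

```c
/* Hypothetical decoders for the S2 S1 S0 field when MVEX.EH=1
 * and the source is a register operand. */

/* Returns the assembler suffix that corresponds to the 3-bit field. */
const char *sss_rounding_suffix(unsigned sss)
{
    static const char *const names[8] = {
        "{rn}", "{rd}", "{ru}", "{rz}",
        "{rn-sae}", "{rd-sae}", "{ru-sae}", "{rz-sae}"
    };
    return names[sss & 7u];
}

/* SSS[1:0]: 0 = nearest even, 1 = down, 2 = up, 3 = toward zero. */
unsigned sss_rounding_mode(unsigned sss) { return sss & 3u; }

/* SSS[2]: nonzero when exception flags are suppressed. */
unsigned sss_is_sae(unsigned sss) { return (sss >> 2) & 1u; }
```

For instance, the field value 101 decodes to rounding mode 1 (round down) with SAE set.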

Intel® C/C++ Compiler Intrinsic Equivalent
__m512 _mm512_scale_ps (__m512, __m512i);
__m512 _mm512_mask_scale_ps (__m512, __mmask16, __m512, __m512i);

Exceptions
Real-Address Mode and Virtual-8086
#UD               Instruction not available in these modes

Protected and Compatibility Mode
#UD               Instruction not available in these modes

64 bit Mode
#SS(0)            If a memory address referencing the SS segment is
                  in a non-canonical form.
#GP(0)            If the memory address is in a non-canonical form.
                  If a memory operand linear address is not aligned
                  to the data size granularity dictated by SwizzUpConv
                  mode.
#PF(fault-code)   For a page fault.
#NM               If CR0.TS[bit 3]=1.
#UD               If preceded by any REX, F0, F2, F3, or 66 prefixes.

VSCATTERDPD - Scatter Float64 Vector With Signed Dword Indices

Opcode:       MVEX.512.66.0F38.W1 A2 /r /vsib
Instruction:  vscatterdpd mvt {k1}, Df64(zmm1)
Description:  Scatter float64 vector Df64(zmm1) to vector memory locations
              mvt using doubleword indices and k1 as completion mask.

Description
Down-converts and stores all 8 elements in float64 vector zmm1 to the memory locations
pointed to by base address BASE_ADDR and doubleword index vector VINDEX, with
scale SCALE.
Note the special mask behavior: only a subset of the active elements of write mask k1
is actually operated on (as denoted by function SELECT_SUBSET). There are only
two guarantees about the function: (a) the destination mask is a subset of the source mask
(identity is included), and (b) on a given invocation of the instruction, at least one element
(the least significant enabled mask bit) will be selected from the source mask.
Programmers should always re-execute a gather/scatter instruction (via a loop) until the
sequence has fully completed (i.e. all elements of the gather/scatter sequence have been
loaded/stored and hence the write-mask bits are all zero).
Writes to overlapping destination memory locations are guaranteed to be ordered with
respect to each other (from LSB to MSB of the source registers). Only writes to overlapping
vector indices are guaranteed to be ordered with respect to each other (from LSB to MSB
of the source registers). Writes that are not overlapped may happen in any order. Memory
ordering with other instructions follows the Intel-64 memory ordering model. Note that
this does not account for non-overlapping indices that map into the same physical address
locations.
This instruction has special disp8*N and alignment rules. N is considered to be the size
of a single vector element after down-conversion.
Note also that the corresponding bits in write mask k1 are reset as each destination
element is updated, according to the subset of write mask k1. This allows conditional
re-triggering of the instruction until all the elements from a given write mask have been
successfully stored.
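
The completion-mask protocol can be illustrated with a scalar C model. The sketch below is hypothetical: it models SELECT_SUBSET in its most conservative permitted form (only the least significant enabled element is handled per invocation), uses plain arrays for the registers, and omits down-conversion and faults. The function names are invented for the example.

```c
#include <stdint.h>
#include <string.h>

/* One hypothetical invocation: store the lowest enabled element,
 * clear its mask bit, and return the updated completion mask. */
uint8_t scatter_pd_step(void *base, uint8_t k1, const int32_t vindex[8],
                        const double src[8], int scale)
{
    for (int n = 0; n < 8; n++) {
        if ((k1 >> n) & 1u) {
            char *p = (char *)base + (int64_t)vindex[n] * scale;
            memcpy(p, &src[n], sizeof src[n]);
            return (uint8_t)(k1 & ~(1u << n));   /* element done */
        }
    }
    return k1;                                   /* mask was already empty */
}

/* The retry loop: re-issue until every mask bit has been cleared. */
void scatter_pd_all(void *base, uint8_t k1, const int32_t vindex[8],
                    const double src[8], int scale)
{
    while (k1 != 0)
        k1 = scatter_pd_step(base, k1, vindex, src, scale);
}
```

Because at least one element completes per step, the loop in scatter_pd_all always terminates, which is the property the retry recommendation relies on.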

Operation
// instruction works over a subset of the write mask
ktemp = SELECT_SUBSET(k1)
// Use mvt as vector memory operand (VSIB)
for (n = 0; n < 8; n++) {
    if (ktemp[n] != 0) {
        i = 64*n
        j = 32*n
        // mvt[n] = BASE_ADDR + SignExtend(VINDEX[j+31:j] * SCALE)
        pointer[63:0] = mvt[n]
        tmp = DownConvStoref64(zmm1[i+63:i], SSS[2:0])
        if(DownConvStoreSizeOff64(SSS[2:0]) == 8) {
            MemStore(pointer) = tmp[63:0]
        }
        k1[n] = 0
    }
}
k1[15:8] = 0

SIMD Floating-Point Exceptions
None.

Memory Down-conversion: Df64
S2 S1 S0   Function:        Usage    disp8*N
000        no conversion    zmm1     8
001        reserved         N/A      N/A
010        reserved         N/A      N/A
011        reserved         N/A      N/A
100        reserved         N/A      N/A
101        reserved         N/A      N/A
110        reserved         N/A      N/A
111        reserved         N/A      N/A

Intel® C/C++ Compiler Intrinsic Equivalent
void _mm512_i32loextscatter_pd (void*, __m512i, __m512d,
                                _MM_DOWNCONV_PD_ENUM, int, int);
void _mm512_mask_i32loextscatter_pd (void*, __mmask8, __m512i, __m512d,
                                     _MM_DOWNCONV_PD_ENUM, int, int);
void _mm512_i32loscatter_pd (void*, __m512i, __m512d, int);
void _mm512_mask_i32loscatter_pd (void*, __mmask8, __m512i, __m512d, int);

Exceptions
Real-Address Mode and Virtual-8086
#UD               Instruction not available in these modes

Protected and Compatibility Mode
#UD               Instruction not available in these modes

64 bit Mode
#SS(0)            If a memory address referencing the SS segment is in a
                  non-canonical form, and corresponding write-mask bit is
                  not zero.
#GP(0)            If a memory address is in a non-canonical form, and
                  corresponding write-mask bit is not zero.
                  If a memory operand linear address is not aligned to
                  element-wise data granularity dictated by the DownConv
                  mode, and corresponding write-mask bit is not zero.
#PF(fault-code)   If a memory operand linear address produces a page fault
                  and corresponding write-mask bit is not zero.
#NM               If CR0.TS[bit 3]=1.
#UD               If preceded by any REX, F0, F2, F3, or 66 prefixes.
                  If using a 16 bit effective address.
                  If ModRM.rm is different than 100b.
                  If no write mask is provided or selected write-mask is k0.

VSCATTERDPS - Scatter Float32 Vector With Signed Dword Indices

Opcode:       MVEX.512.66.0F38.W0 A2 /r /vsib
Instruction:  vscatterdps mvt {k1}, Df32(zmm1)
Description:  Scatter float32 vector Df32(zmm1) to vector memory locations
              mvt using doubleword indices and k1 as completion mask.

Description
Down-converts and stores all 16 elements in float32 vector zmm1 to the memory locations
pointed to by base address BASE_ADDR and doubleword index vector VINDEX,
with scale SCALE.
Note the special mask behavior: only a subset of the active elements of write mask k1
is actually operated on (as denoted by function SELECT_SUBSET). There are only
two guarantees about the function: (a) the destination mask is a subset of the source mask
(identity is included), and (b) on a given invocation of the instruction, at least one element
(the least significant enabled mask bit) will be selected from the source mask.
Programmers should always re-execute a gather/scatter instruction (via a loop) until the
sequence has fully completed (i.e. all elements of the gather/scatter sequence have been
loaded/stored and hence the write-mask bits are all zero).
Writes to overlapping destination memory locations are guaranteed to be ordered with
respect to each other (from LSB to MSB of the source registers). Only writes to overlapping
vector indices are guaranteed to be ordered with respect to each other (from LSB to MSB
of the source registers). Writes that are not overlapped may happen in any order. Memory
ordering with other instructions follows the Intel-64 memory ordering model. Note that
this does not account for non-overlapping indices that map into the same physical address
locations.
This instruction has special disp8*N and alignment rules. N is considered to be the size
of a single vector element after down-conversion.
Note also that the corresponding bits in write mask k1 are reset as each destination
element is updated, according to the subset of write mask k1. This allows conditional
re-triggering of the instruction until all the elements from a given write mask have been
successfully stored.

Operation
// instruction works over a subset of the write mask
ktemp = SELECT_SUBSET(k1)
// Use mvt as vector memory operand (VSIB)
for (n = 0; n < 16; n++) {
    if (ktemp[n] != 0) {
        i = 32*n
        // mvt[n] = BASE_ADDR + SignExtend(VINDEX[i+31:i] * SCALE)
        pointer[63:0] = mvt[n]
        tmp = DownConvStoref32(zmm1[i+31:i], SSS[2:0])
        if(DownConvStoreSizeOff32(SSS[2:0]) == 4) {
            MemStore(pointer) = tmp[31:0]
        } else if(DownConvStoreSizeOff32(SSS[2:0]) == 2) {
            MemStore(pointer) = tmp[15:0]
        } else if(DownConvStoreSizeOff32(SSS[2:0]) == 1) {
            MemStore(pointer) = tmp[7:0]
        }
        k1[n] = 0
    }
}
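
The store size used in the pseudocode is also the factor N applied to a compressed 8-bit displacement. The C sketch below encodes the disp8*N column of the Df32 down-conversion table for this instruction and shows the displacement scaling; the helper names are hypothetical, and a size of 0 is used here only to mark the reserved encodings.

```c
#include <stdint.h>

/* Hypothetical helper: bytes stored per element for each S2 S1 S0 value
 * of the Df32 down-conversion (0 marks a reserved encoding). */
int df32_store_size(unsigned sss)
{
    static const int size[8] = { 4, 0, 0, 2, 1, 1, 2, 2 };
    return size[sss & 7u];
}

/* Hypothetical helper: effective byte displacement encoded by disp8. */
int64_t df32_scaled_disp8(int8_t disp8, unsigned sss)
{
    return (int64_t)disp8 * df32_store_size(sss);
}
```

With no conversion, a disp8 of 16 therefore addresses byte offset 64, whereas with a uint16 down-conversion the same disp8 addresses byte offset 32.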

SIMD Floating-Point Exceptions
Overflow, Underflow, Invalid, Precision, Denormal.

Memory Down-conversion: Df32
S2 S1 S0   Function:             Usage             disp8*N
000        no conversion         zmm1              4
001        reserved              N/A               N/A
010        reserved              N/A               N/A
011        float32 to float16    zmm1 {float16}    2
100        float32 to uint8      zmm1 {uint8}      1
101        float32 to sint8      zmm1 {sint8}      1
110        float32 to uint16     zmm1 {uint16}     2
111        float32 to sint16     zmm1 {sint16}     2

Intel® C/C++ Compiler Intrinsic Equivalent
void _mm512_i32extscatter_ps (void*, __m512i, __m512,
                              _MM_DOWNCONV_PS_ENUM, int, int);
void _mm512_mask_i32extscatter_ps (void*, __mmask16, __m512i, __m512,
                                   _MM_DOWNCONV_PS_ENUM, int, int);
void _mm512_i32scatter_ps (void*, __m512i, __m512, int);
void _mm512_mask_i32scatter_ps (void*, __mmask16, __m512i, __m512, int);

Exceptions
Real-Address Mode and Virtual-8086
#UD               Instruction not available in these modes

Protected and Compatibility Mode
#UD               Instruction not available in these modes

64 bit Mode
#SS(0)            If a memory address referencing the SS segment is in a
                  non-canonical form, and corresponding write-mask bit is
                  not zero.
#GP(0)            If a memory address is in a non-canonical form, and
                  corresponding write-mask bit is not zero.
                  If a memory operand linear address is not aligned to
                  element-wise data granularity dictated by the DownConv
                  mode, and corresponding write-mask bit is not zero.
#PF(fault-code)   If a memory operand linear address produces a page fault
                  and corresponding write-mask bit is not zero.
#NM               If CR0.TS[bit 3]=1.
#UD               If preceded by any REX, F0, F2, F3, or 66 prefixes.
                  If using a 16 bit effective address.
                  If ModRM.rm is different than 100b.
                  If no write mask is provided or selected write-mask is k0.

VSCATTERPF0DPS - Scatter Prefetch Float32 Vector With Signed Dword Indices Into L1

Opcode:       MVEX.512.66.0F38.W0 C6 /5 /vsib
Instruction:  vscatterpf0dps Uf32(mvt) {k1}
Description:  Scatter Prefetch float32 vector Uf32(mvt), using doubleword
              indices with T0 hint, under write-mask.

Description
Prefetches into the L1 level of cache the memory locations pointed to by base address
BASE_ADDR and doubleword index vector VINDEX, with scale SCALE, with request
for ownership (exclusive). The up-conversion operand specifies the granularity used
by compilers to better encode the instruction if a displacement, using the disp8*N feature, is
provided when specifying the address. If any memory access causes any type of memory
exception, the memory access will be considered as completed (destination mask updated)
and the exception ignored. The up-conversion parameter is optional, and it is used to
correctly encode disp8*N.
Note the special mask behavior: only a subset of the active elements of write mask k1
is actually operated on (as denoted by function SELECT_SUBSET). There are only
two guarantees about the function: (a) the destination mask is a subset of the source mask
(identity is included), and (b) on a given invocation of the instruction, at least one element
(the least significant enabled mask bit) will be selected from the source mask.
Programmers should always re-execute a gather/scatter instruction (via a loop) until the
sequence has fully completed (i.e. all elements of the prefetch sequence have been
prefetched and hence the write-mask bits are all zero).
This instruction has special disp8*N and alignment rules. N is considered to be the size
of a single vector element after up-conversion.
Note also that the corresponding bits in write mask k1 are reset as each destination
element is updated, according to the subset of write mask k1. This allows conditional
re-triggering of the instruction until all the elements from a given write mask have been
successfully prefetched.
Note that both gather and scatter prefetches set the access bit (A) in the related TLB page
entry. Scatter prefetches (which prefetch data with RFO) do not set the dirty bit (D).

Operation
// instruction works over a subset of the write mask
ktemp = SELECT_SUBSET(k1)
exclusive = 1
evicthintpre = MVEX.EH

// Use mvt as vector memory operand (VSIB)
for (n = 0; n < 16; n++) {
    if (ktemp[n] != 0) {
        i = 32*n
        // mvt[n] = BASE_ADDR + SignExtend(VINDEX[i+31:i] * SCALE)
        pointer[63:0] = mvt[n]
        FetchL1cacheLine(pointer, exclusive, evicthintpre)
        k1[n] = 0
    }
}
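
The re-execution loop that the description asks for can be modelled in plain C. The sketch below is only an illustration: select_subset_model is a hypothetical stand-in for SELECT_SUBSET that picks just the least significant enabled bit (the minimum the manual guarantees), and no real prefetch is issued. All function names are assumptions made for this example.

```c
#include <stdint.h>

/* Hypothetical stand-in for SELECT_SUBSET: selects only the least
   significant enabled bit, which is the minimum guaranteed behavior. */
static uint16_t select_subset_model(uint16_t k1)
{
    return (uint16_t)(k1 & (uint16_t)(~k1 + 1u));
}

/* One modelled invocation: clears the k1 bits of the selected subset. */
static uint16_t scatter_prefetch_once(uint16_t k1)
{
    uint16_t ktemp = select_subset_model(k1);
    for (int n = 0; n < 16; n++) {
        if (ktemp & (1u << n)) {
            /* FetchL1cacheLine(...) would be requested here on hardware. */
            k1 = (uint16_t)(k1 & ~(1u << n));
        }
    }
    return k1;
}

/* Re-issue the modelled instruction until the write mask is empty.
   Returns the number of invocations that were needed. */
int scatter_prefetch_all(uint16_t k1)
{
    int invocations = 0;
    while (k1 != 0) {
        k1 = scatter_prefetch_once(k1);
        invocations++;
    }
    return invocations;
}
```

With this worst-case subset model the loop runs once per enabled mask bit; hardware that selects larger subsets would finish in fewer iterations.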

SIMD Floating-Point Exceptions
None.

Memory Up-conversion: Uf32
S2 S1 S0   Function:             Usage              disp8*N
000        no conversion         [rax]              4
001        reserved              N/A                N/A
010        reserved              N/A                N/A
011        float16 to float32    [rax] {float16}    2
100        uint8 to float32      [rax] {uint8}      1
101        sint8 to float32      [rax] {sint8}      1
110        uint16 to float32     [rax] {uint16}     2
111        sint16 to float32     [rax] {sint16}     2

Intel® C/C++ Compiler Intrinsic Equivalent
void _mm512_prefetch_i32extscatter_ps (void*, __m512i, _MM_UPCONV_PS_ENUM, int, int);
void _mm512_mask_prefetch_i32extscatter_ps (void*, __mmask16, __m512i, _MM_UPCONV_PS_ENUM, int, int);
void _mm512_prefetch_i32scatter_ps (void*, __m512i, int, int);
void _mm512_mask_prefetch_i32scatter_ps (void*, __mmask16, __m512i, int, int);

Exceptions
Real-Address Mode and Virtual-8086
#UD    Instruction not available in these modes

Protected and Compatibility Mode
#UD    Instruction not available in these modes

64 bit Mode
#NM    If CR0.TS[bit 3]=1.
       If preceded by any REX, F0, F2, F3, or 66 prefixes.
       If using a 16 bit effective address.
       If ModRM.rm is different than 100b.
       If no write mask is provided or selected write-mask is k0.

VSCATTERPF0HINTDPD - Scatter Prefetch Float64 Vector Hint With Signed Dword Indices

Opcode                            Instruction                          Description
MVEX.512.66.0F38.W1 C6 /4 /vsib   vscatterpf0hintdpd Uf64(mvt) {k1}    Scatter Prefetch float64 vector Uf64(mvt), using
                                                                       doubleword indices with T0 hint, under write-mask.

Description
The instruction specifies a set of 8 float64 memory locations pointed to by base address
BASE_ADDR and doubleword index vector VINDEX with scale SCALE as a performance
hint that a real scatter instruction with the same set of sources will be invoked. A
programmer may execute this instruction before a real scatter instruction to improve its
performance.
This instruction is a hint and may be speculative, and may be dropped or specify invalid
addresses without causing problems or memory related faults. This instruction does not
modify any kind of architectural state (including the write-mask).
This instruction has special disp8*N and alignment rules. N is considered to be the size
of a single vector element before up-conversion.

Operation
// Use mvt as vector memory operand (VSIB)
for (n = 0; n < 8; n++) {
    if (k1[n] != 0) {
        i = 64*n
        j = 32*n
        // mvt[n] = BASE_ADDR + SignExtend(VINDEX[j+31:j] * SCALE)
        pointer[63:0] = mvt[n]
        HintPointer(pointer)
    }
}
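
The per-element address computation in the loop above can be sketched in C as follows. This is a model only: hint_addresses is a hypothetical name, and the sketch assumes that each doubleword index is sign-extended to 64 bits before it is multiplied by SCALE, which is one reading of the pseudocode comment.

```c
#include <stddef.h>
#include <stdint.h>

/* Collects the addresses of the elements enabled in k1 into out[] and
   returns how many were collected. base models BASE_ADDR, vindex models
   the low eight doubleword indices of VINDEX. */
size_t hint_addresses(uint64_t base, const int32_t vindex[8], int scale,
                      uint8_t k1, uint64_t out[8])
{
    size_t count = 0;
    for (int n = 0; n < 8; n++) {
        if (k1 & (1u << n)) {
            /* Assumption: sign-extend the index first, then scale it. */
            int64_t offset = (int64_t)vindex[n] * scale;
            out[count++] = base + (uint64_t)offset;
        }
    }
    return count;
}
```

Elements whose mask bit is clear contribute no address, which mirrors the `if (k1[n] != 0)` test in the Operation section.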

SIMD Floating-Point Exceptions
None.


Memory Up-conversion: Uf64
S2 S1 S0   Function:        Usage    disp8*N
000        no conversion    [rax]    8
001        reserved         N/A      N/A
010        reserved         N/A      N/A
011        reserved         N/A      N/A
100        reserved         N/A      N/A
101        reserved         N/A      N/A
110        reserved         N/A      N/A
111        reserved         N/A      N/A

Intel® C/C++ Compiler Intrinsic Equivalent
None

Exceptions
Real-Address Mode and Virtual-8086
#UD    Instruction not available in these modes

Protected and Compatibility Mode
#UD    Instruction not available in these modes

64 bit Mode
#NM    If CR0.TS[bit 3]=1.
#UD    If processor model does not implement the specific instruction.
       If preceded by any REX, F0, F2, F3, or 66 prefixes.
       If using a 16 bit effective address.
       If ModRM.rm is different than 100b.
       If no write mask is provided or selected write-mask is k0.

VSCATTERPF0HINTDPS - Scatter Prefetch Float32 Vector Hint With Signed Dword Indices

Opcode                            Instruction                          Description
MVEX.512.66.0F38.W0 C6 /4 /vsib   vscatterpf0hintdps Uf32(mvt) {k1}    Scatter Prefetch float32 vector Uf32(mvt), using
                                                                       doubleword indices with T0 hint, under write-mask.

Description
The instruction specifies a set of 16 float32 memory locations pointed to by base address
BASE_ADDR and doubleword index vector VINDEX with scale SCALE as a performance
hint that a real scatter instruction with the same set of sources will be invoked. A
programmer may execute this instruction before a real scatter instruction to improve its
performance.
This instruction is a hint and may be speculative, and may be dropped or specify invalid
addresses without causing problems or memory related faults. This instruction does not
modify any kind of architectural state (including the write-mask).
This instruction has special disp8*N and alignment rules. N is considered to be the size
of a single vector element before up-conversion.

Operation
// Use mvt as vector memory operand (VSIB)
for (n = 0; n < 16; n++) {
    if (k1[n] != 0) {
        i = 32*n
        // mvt[n] = BASE_ADDR + SignExtend(VINDEX[i+31:i] * SCALE)
        pointer[63:0] = mvt[n]
        HintPointer(pointer)
    }
}

SIMD Floating-Point Exceptions
None.


Memory Up-conversion: Uf32
S2 S1 S0   Function:             Usage              disp8*N
000        no conversion         [rax]              4
001        reserved              N/A                N/A
010        reserved              N/A                N/A
011        float16 to float32    [rax] {float16}    2
100        uint8 to float32      [rax] {uint8}      1
101        sint8 to float32      [rax] {sint8}      1
110        uint16 to float32     [rax] {uint16}     2
111        sint16 to float32     [rax] {sint16}     2

Intel® C/C++ Compiler Intrinsic Equivalent
None

Exceptions
Real-Address Mode and Virtual-8086
#UD    Instruction not available in these modes

Protected and Compatibility Mode
#UD    Instruction not available in these modes

64 bit Mode
#NM    If CR0.TS[bit 3]=1.
#UD    If processor model does not implement the specific instruction.
       If preceded by any REX, F0, F2, F3, or 66 prefixes.
       If using a 16 bit effective address.
       If ModRM.rm is different than 100b.
       If no write mask is provided or selected write-mask is k0.

VSCATTERPF1DPS - Scatter Prefetch Float32 Vector With Signed Dword Indices Into L2

Opcode                            Instruction                      Description
MVEX.512.66.0F38.W0 C6 /6 /vsib   vscatterpf1dps Uf32(mvt) {k1}    Scatter Prefetch float32 vector Uf32(mvt), using
                                                                   doubleword indices with T1 hint, under write-mask.

Description
Prefetches into the L2 level of cache the memory locations pointed to by base address
BASE_ADDR and doubleword index vector VINDEX, with scale SCALE, with request for
ownership (exclusive). The down-conversion operand is optional; it specifies the
granularity that compilers use to encode the instruction when a displacement, using the
disp8*N feature, is provided in the address. If any memory access causes any type of
memory exception, the memory access is considered completed (destination mask
updated) and the exception is ignored.
Note the special mask behavior: only a subset of the active elements of write mask k1
is actually operated on (as denoted by function SELECT_SUBSET). There are only
two guarantees about the function: (a) the destination mask is a subset of the source mask
(identity is included), and (b) on a given invocation of the instruction, at least one element
(the least significant enabled mask bit) will be selected from the source mask.
Programmers should always re-execute a gather/scatter instruction (via a loop) until the
sequence is fully complete (i.e. all elements of the prefetch sequence have been
prefetched and hence the write-mask bits are all zero).
This instruction has special disp8*N and alignment rules. N is considered to be the size
of a single vector element after down-conversion.
Note also that the bits in write mask k1 that correspond to the selected subset are reset
as each element is processed. This allows conditional re-trigger of the instruction until
all the elements from a given write mask have been successfully prefetched.
Note that both gather and scatter prefetches set the access bit (A) in the related TLB page
entry. Scatter prefetches (which prefetch data with RFO) do not set the dirty bit (D).

Operation
// instruction works over a subset of the write mask
ktemp = SELECT_SUBSET(k1)
exclusive = 1
evicthintpre = MVEX.EH

// Use mvt as vector memory operand (VSIB)
for (n = 0; n < 16; n++) {
    if (ktemp[n] != 0) {
        i = 32*n
        // mvt[n] = BASE_ADDR + SignExtend(VINDEX[i+31:i] * SCALE)
        pointer[63:0] = mvt[n]
        FetchL2cacheLine(pointer, exclusive, evicthintpre)
        k1[n] = 0
    }
}

SIMD Floating-Point Exceptions
None.

Memory Down-conversion: Df32
S2 S1 S0   Function:             Usage             disp8*N
000        no conversion         zmm1              4
001        reserved              N/A               N/A
010        reserved              N/A               N/A
011        float32 to float16    zmm1 {float16}    2
100        float32 to uint8      zmm1 {uint8}      1
101        float32 to sint8      zmm1 {sint8}      1
110        float32 to uint16     zmm1 {uint16}     2
111        float32 to sint16     zmm1 {sint16}     2

Intel® C/C++ Compiler Intrinsic Equivalent
void _mm512_prefetch_i32extscatter_ps (void*, __m512i, _MM_UPCONV_PS_ENUM, int, int);
void _mm512_mask_prefetch_i32extscatter_ps (void*, __mmask16, __m512i, _MM_UPCONV_PS_ENUM, int, int);
void _mm512_prefetch_i32scatter_ps (void*, __m512i, int, int);
void _mm512_mask_prefetch_i32scatter_ps (void*, __mmask16, __m512i, int, int);

Exceptions
Real-Address Mode and Virtual-8086
#UD    Instruction not available in these modes

Protected and Compatibility Mode
#UD    Instruction not available in these modes

64 bit Mode
#NM    If CR0.TS[bit 3]=1.
       If preceded by any REX, F0, F2, F3, or 66 prefixes.
       If using a 16 bit effective address.
       If ModRM.rm is different than 100b.
       If no write mask is provided or selected write-mask is k0.

VSUBPD - Subtract Float64 Vectors

Opcode                        Instruction                               Description
MVEX.NDS.512.66.0F.W1 5C /r   vsubpd zmm1 {k1}, zmm2, Sf64(zmm3/mt)     Subtract float64 vector Sf64(zmm3/mt) from
                                                                        float64 vector zmm2 and store the result in
                                                                        zmm1, under write-mask.

Description
Subtracts, element by element, the float64 vector produced by the
swizzle/broadcast/conversion process on memory or float64 vector zmm3 from float64
vector zmm2. The result is written into float64 vector zmm1.
This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.

Operation
if(source is a register operand and MVEX.EH bit is 1) {
    if(SSS[2]==1) Supress_Exception_Flags() // SAE
    // SSS are bits 6-4 from the MVEX prefix encoding. For more details, see Table 2.14
    RoundingMode = SSS[1:0]
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    RoundingMode = MXCSR.RC
    tmpSrc3[511:0] = SwizzUpConvLoadf64(zmm3/mt)
}
for (n = 0; n < 8; n++) {
    if(k1[n] != 0) {
        i = 64*n
        // float64 operation
        zmm1[i+63:i] = zmm2[i+63:i] - tmpSrc3[i+63:i]
    }
}
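
The write-mask semantics of the loop above can be illustrated with a scalar C model. This is a sketch rather than the hardware behavior: masked_sub_pd_model is a hypothetical name, the third source is assumed to be already swizzled or converted (tmpSrc3), and rounding overrides and exception flags are not modelled.

```c
#include <stdint.h>

/* dst models zmm1, a models zmm2, b models tmpSrc3, k1 is the write mask. */
void masked_sub_pd_model(double dst[8], uint8_t k1,
                         const double a[8], const double b[8])
{
    for (int n = 0; n < 8; n++) {
        if ((k1 >> n) & 1u) {
            dst[n] = a[n] - b[n];   /* enabled element: compute and store */
        }
        /* disabled element: dst[n] retains its previous value */
    }
}
```

Calling the model with k1 = 0x05 updates only elements 0 and 2 and leaves the other six destination elements untouched.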

SIMD Floating-Point Exceptions
Overflow, Underflow, Invalid, Precision, Denormal.

Denormal Handling
Treat Input Denormals As Zeros : (MXCSR.DAZ)? YES : NO
Flush Tiny Results To Zero :     (MXCSR.FZ)? YES : NO

Memory Up-conversion: Sf64
S2 S1 S0   Function:                    Usage                    disp8*N
000        no conversion                [rax] {8to8} or [rax]    64
001        broadcast 1 element (x8)     [rax] {1to8}             8
010        broadcast 4 elements (x2)    [rax] {4to8}             32
011        reserved                     N/A                      N/A
100        reserved                     N/A                      N/A
101        reserved                     N/A                      N/A
110        reserved                     N/A                      N/A
111        reserved                     N/A                      N/A

Register Swizzle: Sf64
MVEX.EH=0
S2 S1 S0   Function: 4 x 64 bits    Usage
000        no swizzle               zmm0 or zmm0 {dcba}
001        swap (inner) pairs       zmm0 {cdab}
010        swap with two-away       zmm0 {badc}
011        cross-product swizzle    zmm0 {dacb}
100        broadcast a element      zmm0 {aaaa}
101        broadcast b element      zmm0 {bbbb}
110        broadcast c element      zmm0 {cccc}
111        broadcast d element      zmm0 {dddd}

MVEX.EH=1
S2 S1 S0   Rounding Mode Override              Usage
000        Round To Nearest (even)             , {rn}
001        Round Down (-INF)                   , {rd}
010        Round Up (+INF)                     , {ru}
011        Round Toward Zero                   , {rz}
100        Round To Nearest (even) with SAE    , {rn-sae}
101        Round Down (-INF) with SAE          , {rd-sae}
110        Round Up (+INF) with SAE            , {ru-sae}
111        Round Toward Zero with SAE          , {rz-sae}

Intel® C/C++ Compiler Intrinsic Equivalent
__m512d _mm512_sub_pd (__m512d, __m512d);
__m512d _mm512_mask_sub_pd (__m512d, __mmask8, __m512d, __m512d);

Exceptions
Real-Address Mode and Virtual-8086
#UD    Instruction not available in these modes

Protected and Compatibility Mode
#UD    Instruction not available in these modes

64 bit Mode
#SS(0)             If a memory address referencing the SS segment is in a non-canonical form.
#GP(0)             If the memory address is in a non-canonical form.
                   If a memory operand linear address is not aligned to the data size
                   granularity dictated by SwizzUpConv mode.
#PF(fault-code)    For a page fault.
#NM                If CR0.TS[bit 3]=1.
                   If preceded by any REX, F0, F2, F3, or 66 prefixes.

VSUBPS - Subtract Float32 Vectors

Opcode                     Instruction                               Description
MVEX.NDS.512.0F.W0 5C /r   vsubps zmm1 {k1}, zmm2, Sf32(zmm3/mt)     Subtract float32 vector Sf32(zmm3/mt) from
                                                                     float32 vector zmm2 and store the result in
                                                                     zmm1, under write-mask.

Description
Subtracts, element by element, the float32 vector produced by the
swizzle/broadcast/conversion process on memory or float32 vector zmm3 from float32
vector zmm2. The result is written into float32 vector zmm1.
This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.

Operation
if(source is a register operand and MVEX.EH bit is 1) {
    if(SSS[2]==1) Supress_Exception_Flags() // SAE
    // SSS are bits 6-4 from the MVEX prefix encoding. For more details, see Table 2.14
    RoundingMode = SSS[1:0]
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    RoundingMode = MXCSR.RC
    tmpSrc3[511:0] = SwizzUpConvLoadf32(zmm3/mt)
}
for (n = 0; n < 16; n++) {
    if(k1[n] != 0) {
        i = 32*n
        // float32 operation
        zmm1[i+31:i] = zmm2[i+31:i] - tmpSrc3[i+31:i]
    }
}
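
The first branch of the Operation section splits the three SSS bits into a rounding mode (SSS[1:0]) and a suppress-all-exceptions flag (SSS[2]). A minimal C sketch of that decoding follows; the RoundOverride type and both function names are hypothetical and exist only for this example, while the usage strings are taken from the MVEX.EH=1 table of this instruction.

```c
/* Hypothetical helper type for this sketch. */
typedef struct {
    unsigned rounding_mode; /* 0 = nearest (even), 1 = down, 2 = up, 3 = toward zero */
    unsigned sae;           /* 1 = exception flags are suppressed */
} RoundOverride;

/* Splits a 3-bit SSS value the way the pseudocode does. */
RoundOverride decode_sss(unsigned sss)
{
    RoundOverride r;
    r.rounding_mode = sss & 0x3u;   /* SSS[1:0] */
    r.sae = (sss >> 2) & 0x1u;      /* SSS[2] */
    return r;
}

/* Maps a 3-bit SSS value to the assembler usage suffix from the table. */
const char *override_usage(unsigned sss)
{
    static const char *const names[8] = {
        "{rn}", "{rd}", "{ru}", "{rz}",
        "{rn-sae}", "{rd-sae}", "{ru-sae}", "{rz-sae}"
    };
    return names[sss & 0x7u];
}
```

For example, SSS = 110b decodes to round up with SAE, matching the {ru-sae} row.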

SIMD Floating-Point Exceptions
Overflow, Underflow, Invalid, Precision, Denormal.

Denormal Handling
Treat Input Denormals As Zeros : (MXCSR.DAZ)? YES : NO
Flush Tiny Results To Zero :     (MXCSR.FZ)? YES : NO

Memory Up-conversion: Sf32
S2 S1 S0   Function:                    Usage                      disp8*N
000        no conversion                [rax] {16to16} or [rax]    64
001        broadcast 1 element (x16)    [rax] {1to16}              4
010        broadcast 4 elements (x4)    [rax] {4to16}              16
011        float16 to float32           [rax] {float16}            32
100        uint8 to float32             [rax] {uint8}              16
110        uint16 to float32            [rax] {uint16}             32
111        sint16 to float32            [rax] {sint16}             32

Register Swizzle: Sf32
MVEX.EH=0
S2 S1 S0   Function: 4 x 32 bits    Usage
000        no swizzle               zmm0 or zmm0 {dcba}
001        swap (inner) pairs       zmm0 {cdab}
010        swap with two-away       zmm0 {badc}
011        cross-product swizzle    zmm0 {dacb}
100        broadcast a element      zmm0 {aaaa}
101        broadcast b element      zmm0 {bbbb}
110        broadcast c element      zmm0 {cccc}
111        broadcast d element      zmm0 {dddd}

MVEX.EH=1
S2 S1 S0   Rounding Mode Override              Usage
000        Round To Nearest (even)             , {rn}
001        Round Down (-INF)                   , {rd}
010        Round Up (+INF)                     , {ru}
011        Round Toward Zero                   , {rz}
100        Round To Nearest (even) with SAE    , {rn-sae}
101        Round Down (-INF) with SAE          , {rd-sae}
110        Round Up (+INF) with SAE            , {ru-sae}
111        Round Toward Zero with SAE          , {rz-sae}
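
The register swizzle patterns in the MVEX.EH=0 table can be modelled in C. The sketch below rests on an assumption about the notation that this section does not spell out: each four-letter pattern is read as the source elements for positions 3, 2, 1, 0 of every 4-element lane, with 'a' naming the least significant element of the lane. The function name swizzle_ps_model is hypothetical.

```c
#include <string.h>

/* Applies a 4-letter pattern such as "cdab" to each of the four
   4-element lanes of a 16-element float32 vector. */
void swizzle_ps_model(float dst[16], const float src[16], const char *pattern)
{
    float tmp[16];
    for (int lane = 0; lane < 4; lane++) {
        for (int pos = 0; pos < 4; pos++) {
            /* The rightmost letter describes position 0 of the lane. */
            int sel = pattern[3 - pos] - 'a';
            tmp[4 * lane + pos] = src[4 * lane + sel];
        }
    }
    memcpy(dst, tmp, sizeof tmp);   /* temporary allows dst == src */
}
```

Under that reading, "cdab" swaps adjacent pairs, "badc" swaps elements that are two apart, and "aaaa" broadcasts the lowest element of each lane.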


Intel® C/C++ Compiler Intrinsic Equivalent
__m512 _mm512_sub_ps (__m512, __m512);
__m512 _mm512_mask_sub_ps (__m512, __mmask16, __m512, __m512);

Exceptions
Real-Address Mode and Virtual-8086
#UD    Instruction not available in these modes

Protected and Compatibility Mode
#UD    Instruction not available in these modes

64 bit Mode
#SS(0)             If a memory address referencing the SS segment is in a non-canonical form.
#GP(0)             If the memory address is in a non-canonical form.
                   If a memory operand linear address is not aligned to the data size
                   granularity dictated by SwizzUpConv mode.
#PF(fault-code)    For a page fault.
#NM                If CR0.TS[bit 3]=1.
                   If preceded by any REX, F0, F2, F3, or 66 prefixes.

VSUBRPD - Reverse Subtract Float64 Vectors

Opcode                          Instruction                                Description
MVEX.NDS.512.66.0F38.W1 6D /r   vsubrpd zmm1 {k1}, zmm2, Sf64(zmm3/mt)     Subtract float64 vector zmm2 from float64 vector
                                                                           Sf64(zmm3/mt) and store the result in zmm1,
                                                                           under write-mask.

Description
Subtracts, element by element, float64 vector zmm2 from the float64 vector produced by
the swizzle/broadcast/conversion process on memory or float64 vector zmm3. The result
is written into float64 vector zmm1.
This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.

Operation
if(source is a register operand and MVEX.EH bit is 1) {
    if(SSS[2]==1) Supress_Exception_Flags() // SAE
    // SSS are bits 6-4 from the MVEX prefix encoding. For more details, see Table 2.14
    RoundingMode = SSS[1:0]
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    RoundingMode = MXCSR.RC
    tmpSrc3[511:0] = SwizzUpConvLoadf64(zmm3/mt)
}
for (n = 0; n < 8; n++) {
    if(k1[n] != 0) {
        i = 64*n
        // float64 operation
        zmm1[i+63:i] = -zmm2[i+63:i] + tmpSrc3[i+63:i]
    }
}
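
The reversed operand order is the only difference from a plain vector subtract, and a scalar C model makes it explicit. As before this is a sketch under assumptions: masked_subr_pd_model is a hypothetical name, b stands for the already swizzled or converted third source (tmpSrc3), and rounding and exception behavior are not modelled.

```c
#include <stdint.h>

/* dst models zmm1, a models zmm2, b models tmpSrc3, k1 is the write mask. */
void masked_subr_pd_model(double dst[8], uint8_t k1,
                          const double a[8], const double b[8])
{
    for (int n = 0; n < 8; n++) {
        if ((k1 >> n) & 1u) {
            dst[n] = -a[n] + b[n];  /* third source minus second source */
        }
        /* disabled element: dst[n] retains its previous value */
    }
}
```

With a = zmm2 and b = tmpSrc3, an enabled element holds b[n] - a[n], which is the negation of what the non-reversed subtract would produce for the same inputs.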

SIMD Floating-Point Exceptions
Overflow, Underflow, Invalid, Precision, Denormal.

Denormal Handling
Treat Input Denormals As Zeros : (MXCSR.DAZ)? YES : NO
Flush Tiny Results To Zero :     (MXCSR.FZ)? YES : NO

Memory Up-conversion: Sf64
S2 S1 S0   Function:                    Usage                    disp8*N
000        no conversion                [rax] {8to8} or [rax]    64
001        broadcast 1 element (x8)     [rax] {1to8}             8
010        broadcast 4 elements (x2)    [rax] {4to8}             32
011        reserved                     N/A                      N/A
100        reserved                     N/A                      N/A
101        reserved                     N/A                      N/A
110        reserved                     N/A                      N/A
111        reserved                     N/A                      N/A

Register Swizzle: Sf64
MVEX.EH=0
S2 S1 S0   Function: 4 x 64 bits    Usage
000        no swizzle               zmm0 or zmm0 {dcba}
001        swap (inner) pairs       zmm0 {cdab}
010        swap with two-away       zmm0 {badc}
011        cross-product swizzle    zmm0 {dacb}
100        broadcast a element      zmm0 {aaaa}
101        broadcast b element      zmm0 {bbbb}
110        broadcast c element      zmm0 {cccc}
111        broadcast d element      zmm0 {dddd}

MVEX.EH=1
S2 S1 S0   Rounding Mode Override              Usage
000        Round To Nearest (even)             , {rn}
001        Round Down (-INF)                   , {rd}
010        Round Up (+INF)                     , {ru}
011        Round Toward Zero                   , {rz}
100        Round To Nearest (even) with SAE    , {rn-sae}
101        Round Down (-INF) with SAE          , {rd-sae}
110        Round Up (+INF) with SAE            , {ru-sae}
111        Round Toward Zero with SAE          , {rz-sae}

Intel® C/C++ Compiler Intrinsic Equivalent
__m512d _mm512_subr_pd (__m512d, __m512d);
__m512d _mm512_mask_subr_pd (__m512d, __mmask8, __m512d, __m512d);

Exceptions
Real-Address Mode and Virtual-8086
#UD    Instruction not available in these modes

Protected and Compatibility Mode
#UD    Instruction not available in these modes

64 bit Mode
#SS(0)             If a memory address referencing the SS segment is in a non-canonical form.
#GP(0)             If the memory address is in a non-canonical form.
                   If a memory operand linear address is not aligned to the data size
                   granularity dictated by SwizzUpConv mode.
#PF(fault-code)    For a page fault.
#NM                If CR0.TS[bit 3]=1.
                   If preceded by any REX, F0, F2, F3, or 66 prefixes.

VSUBRPS - Reverse Subtract Float32 Vectors

Opcode                          Instruction                                Description
MVEX.NDS.512.66.0F38.W0 6D /r   vsubrps zmm1 {k1}, zmm2, Sf32(zmm3/mt)     Subtract float32 vector zmm2 from float32 vector
                                                                           Sf32(zmm3/mt) and store the result in zmm1,
                                                                           under write-mask.

Description
Subtracts, element by element, float32 vector zmm2 from the float32 vector produced by
the swizzle/broadcast/conversion process on memory or float32 vector zmm3. The result
is written into float32 vector zmm1.
This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.

Operation
if(source is a register operand and MVEX.EH bit is 1) {
    if(SSS[2]==1) Supress_Exception_Flags() // SAE
    // SSS are bits 6-4 from the MVEX prefix encoding. For more details, see Table 2.14
    RoundingMode = SSS[1:0]
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    RoundingMode = MXCSR.RC
    tmpSrc3[511:0] = SwizzUpConvLoadf32(zmm3/mt)
}
for (n = 0; n < 16; n++) {
    if(k1[n] != 0) {
        i = 32*n
        // float32 operation
        zmm1[i+31:i] = -zmm2[i+31:i] + tmpSrc3[i+31:i]
    }
}

SIMD Floating-Point Exceptions
Overflow, Underflow, Invalid, Precision, Denormal.

Denormal Handling
Treat Input Denormals As Zeros : (MXCSR.DAZ)? YES : NO
Flush Tiny Results To Zero :     (MXCSR.FZ)? YES : NO

Memory Up-conversion: Sf32
S2 S1 S0   Function:                    Usage                      disp8*N
000        no conversion                [rax] {16to16} or [rax]    64
001        broadcast 1 element (x16)    [rax] {1to16}              4
010        broadcast 4 elements (x4)    [rax] {4to16}              16
011        float16 to float32           [rax] {float16}            32
100        uint8 to float32             [rax] {uint8}              16
110        uint16 to float32            [rax] {uint16}             32
111        sint16 to float32            [rax] {sint16}             32

Register Swizzle: Sf32
MVEX.EH=0
S2 S1 S0   Function: 4 x 32 bits    Usage
000        no swizzle               zmm0 or zmm0 {dcba}
001        swap (inner) pairs       zmm0 {cdab}
010        swap with two-away       zmm0 {badc}
011        cross-product swizzle    zmm0 {dacb}
100        broadcast a element      zmm0 {aaaa}
101        broadcast b element      zmm0 {bbbb}
110        broadcast c element      zmm0 {cccc}
111        broadcast d element      zmm0 {dddd}

MVEX.EH=1
S2 S1 S0   Rounding Mode Override              Usage
000        Round To Nearest (even)             , {rn}
001        Round Down (-INF)                   , {rd}
010        Round Up (+INF)                     , {ru}
011        Round Toward Zero                   , {rz}
100        Round To Nearest (even) with SAE    , {rn-sae}
101        Round Down (-INF) with SAE          , {rd-sae}
110        Round Up (+INF) with SAE            , {ru-sae}
111        Round Toward Zero with SAE          , {rz-sae}

Intel® C/C++ Compiler Intrinsic Equivalent
__m512 _mm512_subr_ps (__m512, __m512);
__m512 _mm512_mask_subr_ps (__m512, __mmask16, __m512, __m512);

Exceptions
Real-Address Mode and Virtual-8086
#UD    Instruction not available in these modes

Protected and Compatibility Mode
#UD    Instruction not available in these modes

64 bit Mode
#SS(0)             If a memory address referencing the SS segment is in a non-canonical form.
#GP(0)             If the memory address is in a non-canonical form.
                   If a memory operand linear address is not aligned to the data size
                   granularity dictated by SwizzUpConv mode.
#PF(fault-code)    For a page fault.
#NM                If CR0.TS[bit 3]=1.
                   If preceded by any REX, F0, F2, F3, or 66 prefixes.

APPENDIX A. SCALAR INSTRUCTION DESCRIPTIONS

Appendix A
Scalar Instruction Descriptions
In this Chapter all the special scalar instructions introduced with the rbni are described.


CLEVICT0 - Evict L1 line

Opcode                 Instruction    Description
VEX.128.F2.0F AE /7    clevict0 m8    Evict memory line from L1 in m8 using T0 hint.
MVEX.512.F2.0F AE /7   clevict0 m8    Evict memory line from L1 in m8 using T0 hint.

Description
Invalidates from the first-level cache the cache line containing the specified linear address
(updating the cache hierarchy accordingly if the line is dirty). Note that, unlike CLFLUSH,
the invalidation is not broadcast throughout the cache coherence domain.
The MVEX form of this instruction uses disp8*64 addressing. Displacements that would
normally be 8 bits according to the ModR/M byte are still 8 bits but scaled by 64 so that
they have cache-line granularity. VEX forms of this instruction use regular disp8 addressing.
This instruction is a hint intended for performance and may be speculative; it may therefore
be dropped or specify invalid addresses without causing problems. The instruction does not
produce any type of memory-related fault.
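
The disp8*64 scaling described above amounts to multiplying the sign-extended 8-bit displacement by the scale factor. A minimal C sketch follows; the function name scaled_disp8 is hypothetical, and the assumption that the 8-bit displacement is treated as signed follows the usual x86 disp8 convention rather than anything stated in this section.

```c
#include <stdint.h>

/* Returns the byte displacement encoded by an 8-bit displacement field
   when the instruction form scales it by n (64 for MVEX clevict0). */
int64_t scaled_disp8(int8_t disp8, int n)
{
    return (int64_t)disp8 * n;
}
```

With n = 64 the reachable range becomes -8192 to +8128 bytes in cache-line steps, compared with -128 to +127 bytes for an unscaled disp8.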

Operation

FlushL1CacheLine(linear_address)

Flags Affected
None.

Intel® C/C++ Compiler Intrinsic Equivalent
void _mm_clevict (const void*, int);

Exceptions
Real-Address Mode and Virtual-8086
#UD    Instruction not available in these modes

Protected and Compatibility Mode
#UD    Instruction not available in these modes

64 bit Mode
If preceded by any REX, F0, F2, F3, or 66 prefixes.
If operand is not a memory location.

CLEVICT1 - Evict L2 line

Opcode                 Instruction    Description
VEX.128.F3.0F AE /7    clevict1 m8    Evict memory line from L2 in m8 using T1 hint.
MVEX.512.F3.0F AE /7   clevict1 m8    Evict memory line from L2 in m8 using T1 hint.

Description
Invalidates from the second-level cache the cache line containing the specified linear
address (updating the cache hierarchy accordingly if the line is dirty). Note that, unlike
CLFLUSH, the invalidation is not broadcast throughout the cache coherence domain.
The MVEX form of this instruction uses disp8*64 addressing. Displacements that would
normally be 8 bits according to the ModR/M byte are still 8 bits but scaled by 64 so that
they have cache-line granularity. VEX forms of this instruction use regular disp8 addressing.
This instruction is a hint intended for performance and may be speculative; it may therefore
be dropped or specify invalid addresses without causing problems. The instruction does not
produce any type of memory-related fault.

Operation

FlushL2CacheLine(linear_address)

Flags Affected
None.

Intel® C/C++ Compiler Intrinsic Equivalent
void _mm_clevict (const void*, int);

Exceptions
Real-Address Mode and Virtual-8086
#UD    Instruction not available in these modes

Protected and Compatibility Mode
#UD    Instruction not available in these modes

64 bit Mode
If preceded by any REX, F0, F2, F3, or 66 prefixes.
If operand is not a memory location.

DELAY - Stall Thread

Opcode                   Instruction    Description
VEX.128.F3.0F.W0 AE /6   delay r32      Stall Thread using r32.
VEX.128.F3.0F.W1 AE /6   delay r64      Stall Thread using r64.

Description
Hints that the processor should not fetch/issue instructions for the current thread for the
number of clock cycles specified in the source register. The maximum number of clock
cycles is limited to 2^32 - 1 (32 bit counter). The instruction is speculative and could be
executed as a NOP by a given processor implementation.
Any of the following events will cause the processor to start fetching instructions for the
delayed thread again: the counter counting down to zero, an NMI or SMI, a debug
exception, a machine check exception, the BINIT# signal, the INIT# signal, or the RESET#
signal. The instruction may exit prematurely due to any interrupt (e.g. an interrupt on
another thread on the same core).
This instruction must properly handle the case where the current clock count turns over.
This can be accomplished by performing the subtraction shown in the Operation section
and treating the result as an unsigned number.
This instruction should prevent the issuing of additional instructions on the issuing thread
as soon as possible, to avoid the otherwise likely case where another instruction on the
same thread that was issued 3 or 4 clocks later has to be killed, creating a pipeline bubble.
If, on any given clock, all threads are non-runnable, then any that are non-runnable due
to the execution of DELAY may or may not be treated as runnable threads.
Notes about Intel® Xeon Phi™ coprocessor implementation:
• In Intel® Xeon Phi™ coprocessor, the processor won't execute from a "delayed" thread
before the delay counter has expired, even if there are non-runnable threads at any
given point in time.

Operation

START_CLOCK = CURRENT_CLOCK_COUNT
DELAY_SLOTS = SRC
if(DELAY_SLOTS > 0xFFFFFFFF) DELAY_SLOTS = 0xFFFFFFFF
while ( (CURRENT_CLOCK_COUNT - START_CLOCK) < DELAY_SLOTS )
{
    *avoid fetching/issuing from the current thread*
}
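
The unsigned subtraction in the loop condition is what keeps the comparison correct when the clock count turns over. The C sketch below models a single evaluation of that condition; delay_still_waiting is a hypothetical name, and the 64-bit width chosen for the clock count is an assumption, since this section does not state the counter width.

```c
#include <stdint.h>

/* Returns 1 while the modelled thread should stay stalled and 0 once the
   requested number of clocks has elapsed. */
int delay_still_waiting(uint64_t start_clock, uint64_t current_clock,
                        uint64_t src)
{
    uint64_t delay_slots = src;
    if (delay_slots > 0xFFFFFFFFu) {
        delay_slots = 0xFFFFFFFFu;      /* clamp to the 32 bit counter limit */
    }
    /* Unsigned arithmetic wraps modulo 2^64, so the difference is the
       elapsed clock count even if current_clock has wrapped past zero. */
    uint64_t elapsed = current_clock - start_clock;
    return elapsed < delay_slots;
}
```

For instance, a start value ten counts below the wrap point and a current value of 5 give an elapsed count of 15, not a huge or negative number.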


Flags Affected
None.

Intel® C/C++ Compiler Intrinsic Equivalent
void _mm_delay_32 (unsigned int);
void _mm_delay_64 (unsigned __int64);

Exceptions
Real-Address Mode and Virtual-8086
#UD    Instruction not available in these modes

Protected and Compatibility Mode
#UD    Instruction not available in these modes

64 bit Mode
If preceded by any REX, F0, F2, F3, or 66 prefixes.
If operand is a memory location.

LZCNT - Leading Zero Count

Opcode                   Instruction       Description
VEX.128.F3.0F.W0 BD /r   lzcnt r32, r32    Count the number of leading bits set to 0 in r32 (src), leaving the
                                           result in r32 (dst).
VEX.128.F3.0F.W1 BD /r   lzcnt r64, r64    Count the number of leading bits set to 0 in r64 (src), leaving the
                                           result in r64 (dst).

Description
Counts the number of leading (most significant) zero bits in the source operand (second
operand) and returns the result in the destination (first operand).
LZCNT is an extension of the BSR instruction. The key difference between LZCNT and BSR
is that LZCNT returns the operand size when the source operand is zero, whereas for
BSR the content of the destination operand is undefined when the source operand is zero.
The ZF flag is set when the most significant set bit is bit OSIZE-1. CF is set when the source
has no set bit.

Operation
temp = OPERAND_SIZE - 1
DEST = 0
while( (temp >= 0) AND (SRC[temp] == 0) )
{
    temp = temp - 1
    DEST = DEST + 1
}
if(DEST == OPERAND_SIZE) {
    CF = 1
} else {
    CF = 0
}
if(DEST == 0) {
    ZF = 1
} else {
    ZF = 0
}
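
The pseudocode above translates almost line for line into C. The following sketch models the 32 bit form only; the LzcntResult type and the function name lzcnt32_model are hypothetical, and the flags are returned as plain integers instead of being written to a flags register.

```c
#include <stdint.h>

/* Hypothetical result record: destination value plus the two flags
   that the Operation section computes. */
typedef struct {
    uint32_t dest;
    int cf;
    int zf;
} LzcntResult;

LzcntResult lzcnt32_model(uint32_t src)
{
    LzcntResult r = {0, 0, 0};
    int temp = 31;                              /* OPERAND_SIZE - 1 */
    while (temp >= 0 && ((src >> temp) & 1u) == 0) {
        temp--;
        r.dest++;
    }
    r.cf = (r.dest == 32);                      /* source had no set bit */
    r.zf = (r.dest == 0);                       /* top bit of source was set */
    return r;
}
```

A zero source yields dest = 32 with CF set, which is the defined-result case that distinguishes LZCNT from BSR.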


Flags Affected
• ZF flag is set to 1 in case of zero output (most significant bit of the source is set), and
to 0 otherwise.
• CF flag is set to 1 if input was zero and cleared otherwise.
• The PF, OF, AF and SF flags are set to 0.

Intel® C/C++ Compiler Intrinsic Equivalent
unsigned int _lzcnt_u32 (unsigned int);
__int64      _lzcnt_u64 (unsigned __int64);

Exceptions
Real-Address Mode and Virtual-8086
#UD

Instruction not available in these modes

Protected and Compatibility Mode
#UD

Instruction not available in these modes

64 bit Mode
If preceded by any REX, F0, F2, F3, or 66 prefixes.
If second operand is a memory location.

POPCNT - Return the Count of Number of Bits Set to 1

Opcode                    Instruction        Description
VEX.128.F3.0F.W0 B8 /r    popcnt r32, r32    Count the number of bits set to 1 in r32 (src), leaving the result in r32 (dst).
VEX.128.F3.0F.W1 B8 /r    popcnt r64, r64    Count the number of bits set to 1 in r64 (src), leaving the result in r64 (dst).
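
The count described in the table above can be expressed as a simple bit loop. This is a minimal sketch in portable C; the function name popcnt_model is hypothetical, and flag behaviour is deliberately left out because it is not stated in the description above.

```c
#include <assert.h>
#include <stdint.h>

/* Software model of POPCNT: number of 1 bits in the low operand_size bits. */
static unsigned popcnt_model(uint64_t src, int operand_size)
{
    assert(operand_size == 32 || operand_size == 64);

    unsigned count = 0;
    for (int i = 0; i < operand_size; i++) {
        count += (unsigned)((src >> i) & 1u); /* add bit i to the tally */
    }
    return count;
}
```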

Operation

tmp = 0
for (i=0; i OPERAND_SIZE-1 ) || ( SRC[OPERAND_SIZE-1:index] == 0 ) )
{
DEST = OPERAND_SIZE
CF=1
}
else
{
while(SRC[index] == 0)
{
index = index+1
}
DEST = index
CF=0
}

Flags Affected
• The ZF is set according to the result.
• The CF is set if SRC is zero between index and MSB, or index is greater than the operand
size.
• The PF, OF, AF and SF flags are set to 0.
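
The scan in the operation fragment above (search upward from a starting bit index for the first set bit) can be sketched in portable C as follows. How the starting index is derived is not shown in the surviving text, so this sketch simply takes it as a parameter; the names tzcnti_result and tzcnti_scan are hypothetical, and ZF is not modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    unsigned dest;
    bool cf;
} tzcnti_result;

/* Find the lowest set bit at position >= index; operand_size is 32 or 64. */
static tzcnti_result tzcnti_scan(uint64_t src, unsigned index, unsigned operand_size)
{
    assert(operand_size == 32 || operand_size == 64);
    if (operand_size == 32) {
        src &= 0xFFFFFFFFu; /* only the low 32 bits take part */
    }

    tzcnti_result r;
    /* The index check comes first, so the shift count below is always valid. */
    if (index > operand_size - 1 || (src >> index) == 0) {
        r.dest = operand_size; /* no set bit between index and the MSB */
        r.cf = true;
    } else {
        while (((src >> index) & 1u) == 0) {
            index++;
        }
        r.dest = index;
        r.cf = false;
    }
    return r;
}
```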

Intel® C/C++ Compiler Intrinsic Equivalent
int _mm_tzcnti_32 (int, unsigned int);
__int64 _mm_tzcnti_64 (__int64, unsigned __int64);

Exceptions
Real-Address Mode and Virtual-8086
#UD

Instruction not available in these modes

Protected and Compatibility Mode
#UD

Instruction not available in these modes

64 bit Mode
If preceded by any REX, F0, F2, F3, or 66 prefixes.
If second operand is a memory location.

VPREFETCH0 - Prefetch memory line using T0 hint

Opcode               Instruction      Description
VEX.128.0F 18 /1     vprefetch0 m8    Prefetch memory line in m8 using T0 hint.
MVEX.512.0F 18 /1    vprefetch0 m8    Prefetch memory line in m8 using T0 hint.

Description
This is very similar to the existing IA-32 prefetch instruction, PREFETCH0, as described
in IA-32 Intel® Architecture Software Developer's Manual: Volume 2. If the line selected is
already present in the cache hierarchy at a level closer to the processor, no data movement
occurs. Prefetches from uncacheable or WC memory are ignored.
In contrast with the existing prefetch instruction, the MVEX form of this instruction uses
disp8*64 addressing. Displacements that would normally be 8 bits according to the
ModR/M byte are still 8 bits but scaled by 64 so that they have cache-line granularity.
VEX forms of this instruction use regular disp8 addressing.
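
The difference between the two displacement interpretations can be illustrated with a short C sketch. It assumes the 8-bit displacement is sign-extended before scaling, as with other IA-32 disp8 encodings; the function names are hypothetical.

```c
#include <stdint.h>

/* MVEX form: the encoded 8-bit displacement counts 64-byte cache lines. */
static int64_t mvex_disp8_offset(int8_t disp8)
{
    return (int64_t)disp8 * 64;
}

/* VEX form: the encoded 8-bit displacement is a plain byte offset. */
static int64_t vex_disp8_offset(int8_t disp8)
{
    return (int64_t)disp8;
}
```

Under this assumption the MVEX form reaches byte offsets from -8192 to +8128 in steps of one cache line, while the VEX form reaches -128 to +127.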
This instruction is a hint and may be speculative, and may be dropped or specify invalid
addresses without causing problems or memory related faults.
This instruction contains a set of hint attributes that modify the prefetching behavior:
exclusive: make line Exclusive in the L1 cache (unless it's already Exclusive or Modified
in the L1 cache).
nthintpre (NTH): load data into the L1 nontemporal cache rather than the L1 temporal
cache. Data will be loaded in the #TIDth way and made MRU. Data should still be
cached normally in the L2 and higher caches.
Note that in the Intel® Xeon Phi™ coprocessor, the hardware drops VPREFETCH if it hits L1
(so it becomes transparent to L2). Consequently, this instruction is not a good solution
to avoid hot L1/cold L2 performance problems. Prefetches set the access bit (A) in the
related TLB page entry, but prefetches with exclusive access (RFO) do not set the dirty bit
(D).
PREFETCH Hint equivalence for the Intel® Xeon Phi™ coprocessor
Instruction      Cache Level   Non-temporal   Bring as exclusive
VPREFETCH0       L1            NO             NO
VPREFETCHNTA     L1            YES            NO
VPREFETCH1       L2            NO             NO
VPREFETCH2       L2            YES            NO
VPREFETCHE0      L1            NO             YES
VPREFETCHENTA    L1            YES            YES
VPREFETCHE1      L2            NO             YES
VPREFETCHE2      L2            YES            YES

Operation
exclusive = 0
nthintpre = 0
FetchL1CacheLine(effective_address, exclusive, nthintpre)

Flags Affected
None.

Intel® C/C++ Compiler Intrinsic Equivalent
void _mm_prefetch (char const*, int);

Exceptions
Real-Address Mode and Virtual-8086
#UD

Instruction not available in these modes

Protected and Compatibility Mode
#UD

Instruction not available in these modes

64 bit Mode
If preceded by any REX, F0, F2, F3, or 66 prefixes.
If operand is not a memory location.

VPREFETCH1 - Prefetch memory line using T1 hint

Opcode               Instruction      Description
VEX.128.0F 18 /2     vprefetch1 m8    Prefetch memory line in m8 using T1 hint.
MVEX.512.0F 18 /2    vprefetch1 m8    Prefetch memory line in m8 using T1 hint.

Description
This is very similar to the existing IA-32 prefetch instruction, PREFETCH0, as described
in IA-32 Intel® Architecture Software Developer's Manual: Volume 2. If the line selected is
already present in the cache hierarchy at a level closer to the processor, no data movement
occurs. Prefetches from uncacheable or WC memory are ignored.
In contrast with the existing prefetch instruction, the MVEX form of this instruction uses
disp8*64 addressing. Displacements that would normally be 8 bits according to the
ModR/M byte are still 8 bits but scaled by 64 so that they have cache-line granularity.
VEX forms of this instruction use regular disp8 addressing.
This instruction is a hint and may be speculative, and may be dropped or specify invalid
addresses without causing problems or memory related faults.
This instruction contains a set of hint attributes that modify the prefetching behavior:
exclusive: make line Exclusive in the L2 cache (unless it's already Exclusive or Modified
in the L2 cache).
nthintpre (NTH): load data into the L2 nontemporal cache rather than the L2 temporal
cache. Data will be loaded in the #TIDth way and made MRU. Data should still be
cached normally in the L2 and higher caches.
Note that in the Intel® Xeon Phi™ coprocessor, the hardware drops VPREFETCH if it hits
L1 (so it becomes transparent to L2). Consequently, this instruction is not a good solution
to avoid hot L1/cold L2 performance problems. Prefetches set the access bit (A) in the
related TLB page entry, but prefetches with exclusive access (RFO) do not set the dirty bit
(D).
PREFETCH Hint equivalence for the Intel® Xeon Phi™ coprocessor
Instruction      Cache Level   Non-temporal   Bring as exclusive
VPREFETCH0       L1            NO             NO
VPREFETCHNTA     L1            YES            NO
VPREFETCH1       L2            NO             NO
VPREFETCH2       L2            YES            NO
VPREFETCHE0      L1            NO             YES
VPREFETCHENTA    L1            YES            YES
VPREFETCHE1      L2            NO             YES
VPREFETCHE2      L2            YES            YES

Operation
exclusive = 0
nthintpre = 0
FetchL2CacheLine(effective_address, exclusive, nthintpre)

Flags Affected
None.

Intel® C/C++ Compiler Intrinsic Equivalent
void _mm_prefetch (char const*, int);

Exceptions
Real-Address Mode and Virtual-8086
#UD

Instruction not available in these modes

Protected and Compatibility Mode
#UD

Instruction not available in these modes

64 bit Mode
If preceded by any REX, F0, F2, F3, or 66 prefixes.
If operand is not a memory location.

VPREFETCH2 - Prefetch memory line using T2 hint

Opcode               Instruction      Description
VEX.128.0F 18 /3     vprefetch2 m8    Prefetch memory line in m8 using T2 hint.
MVEX.512.0F 18 /3    vprefetch2 m8    Prefetch memory line in m8 using T2 hint.

Description
This is very similar to the existing IA-32 prefetch instruction, PREFETCH0, as described
in IA-32 Intel® Architecture Software Developer's Manual: Volume 2. If the line selected is
already present in the cache hierarchy at a level closer to the processor, no data movement
occurs. Prefetches from uncacheable or WC memory are ignored.
In contrast with the existing prefetch instruction, the MVEX form of this instruction uses
disp8*64 addressing. Displacements that would normally be 8 bits according to the
ModR/M byte are still 8 bits but scaled by 64 so that they have cache-line granularity.
VEX forms of this instruction use regular disp8 addressing.
This instruction is a hint and may be speculative, and may be dropped or specify invalid
addresses without causing problems or memory related faults.
This instruction contains a set of hint attributes that modify the prefetching behavior:
exclusive: make line Exclusive in the L2 cache (unless it's already Exclusive or Modified
in the L2 cache).
nthintpre (NTH): load data into the L2 nontemporal cache rather than the L2 temporal
cache. Data will be loaded in the #TIDth way and made MRU. Data should still be
cached normally in the L2 and higher caches.
Note that in the Intel® Xeon Phi™ coprocessor, the hardware drops VPREFETCH if it hits L1
(so it becomes transparent to L2). Consequently, this instruction is not a good solution
to avoid hot L1/cold L2 performance problems. Prefetches set the access bit (A) in the
related TLB page entry, but prefetches with exclusive access (RFO) do not set the dirty bit
(D).
PREFETCH Hint equivalence for the Intel® Xeon Phi™ coprocessor
Instruction      Cache Level   Non-temporal   Bring as exclusive
VPREFETCH0       L1            NO             NO
VPREFETCHNTA     L1            YES            NO
VPREFETCH1       L2            NO             NO
VPREFETCH2       L2            YES            NO
VPREFETCHE0      L1            NO             YES
VPREFETCHENTA    L1            YES            YES
VPREFETCHE1      L2            NO             YES
VPREFETCHE2      L2            YES            YES

Operation
exclusive = 0
nthintpre = 1
FetchL2CacheLine(effective_address, exclusive, nthintpre)

Flags Affected
None.

Intel® C/C++ Compiler Intrinsic Equivalent
void _mm_prefetch (char const*, int);

Exceptions
Real-Address Mode and Virtual-8086
#UD

Instruction not available in these modes

Protected and Compatibility Mode
#UD

Instruction not available in these modes

64 bit Mode
If preceded by any REX, F0, F2, F3, or 66 prefixes.
If operand is not a memory location.

VPREFETCHE0 - Prefetch memory line using T0 hint, with intent to write

Opcode               Instruction       Description
VEX.128.0F 18 /5     vprefetche0 m8    Prefetch memory line in m8 using T0 hint with intent to write.
MVEX.512.0F 18 /5    vprefetche0 m8    Prefetch memory line in m8 using T0 hint with intent to write.

Description
This is very similar to the existing IA-32 prefetch instruction, PREFETCH0, as described
in IA-32 Intel® Architecture Software Developer's Manual: Volume 2. If the line selected is
already present in the cache hierarchy at a level closer to the processor, no data movement
occurs. Prefetches from uncacheable or WC memory are ignored.
In contrast with the existing prefetch instruction, the MVEX form of this instruction uses
disp8*64 addressing. Displacements that would normally be 8 bits according to the
ModR/M byte are still 8 bits but scaled by 64 so that they have cache-line granularity.
VEX forms of this instruction use regular disp8 addressing.
This instruction is a hint and may be speculative, and may be dropped or specify invalid
addresses without causing problems or memory related faults.
This instruction contains a set of hint attributes that modify the prefetching behavior:
exclusive: make line Exclusive in the L1 cache (unless it's already Exclusive or Modified
in the L1 cache).
nthintpre (NTH): load data into the L1 nontemporal cache rather than the L1 temporal
cache. Data will be loaded in the #TIDth way and made MRU. Data should still be
cached normally in the L2 and higher caches.
In the Intel® Xeon Phi™ coprocessor, the hardware drops VPREFETCH if it hits L1 (so it
becomes transparent to L2). Consequently, this instruction is not a good solution to avoid
hot L1/cold L2 performance problems. Prefetches set the access bit (A) in the related
TLB page entry, but prefetches with exclusive access (RFO) do not set the dirty bit (D).
PREFETCH Hint equivalence for the Intel® Xeon Phi™ coprocessor
Instruction      Cache Level   Non-temporal   Bring as exclusive
VPREFETCH0       L1            NO             NO
VPREFETCHNTA     L1            YES            NO
VPREFETCH1       L2            NO             NO
VPREFETCH2       L2            YES            NO
VPREFETCHE0      L1            NO             YES
VPREFETCHENTA    L1            YES            YES
VPREFETCHE1      L2            NO             YES
VPREFETCHE2      L2            YES            YES

Operation
exclusive = 1
nthintpre = 0
FetchL1CacheLine(effective_address, exclusive, nthintpre)

Flags Affected
None.

Intel® C/C++ Compiler Intrinsic Equivalent
void _mm_prefetch (char const*, int);

Exceptions
Real-Address Mode and Virtual-8086
#UD

Instruction not available in these modes

Protected and Compatibility Mode
#UD

Instruction not available in these modes

64 bit Mode
If preceded by any REX, F0, F2, F3, or 66 prefixes.
If operand is not a memory location.

VPREFETCHE1 - Prefetch memory line using T1 hint, with intent to write

Opcode               Instruction       Description
VEX.128.0F 18 /6     vprefetche1 m8    Prefetch memory line in m8 using T1 hint with intent to write.
MVEX.512.0F 18 /6    vprefetche1 m8    Prefetch memory line in m8 using T1 hint with intent to write.

Description
This is very similar to the existing IA-32 prefetch instruction, PREFETCH0, as described
in IA-32 Intel® Architecture Software Developer's Manual: Volume 2. If the line selected is
already present in the cache hierarchy at a level closer to the processor, no data movement
occurs. Prefetches from uncacheable or WC memory are ignored.
In contrast with the existing prefetch instruction, the MVEX form of this instruction uses
disp8*64 addressing. Displacements that would normally be 8 bits according to the
ModR/M byte are still 8 bits but scaled by 64 so that they have cache-line granularity.
VEX forms of this instruction use regular disp8 addressing.
This instruction is a hint and may be speculative, and may be dropped or specify invalid
addresses without causing problems or memory related faults.
This instruction contains a set of hint attributes that modify the prefetching behavior:
exclusive: make line Exclusive in the L2 cache (unless it's already Exclusive or Modified
in the L2 cache).
nthintpre (NTH): load data into the L2 nontemporal cache rather than the L2 temporal
cache. Data will be loaded in the #TIDth way and made MRU. Data
should still be cached normally in the L2 and higher caches.
The hardware drops VPREFETCH if it hits L1 (so it becomes transparent to L2).
Consequently, this instruction is not a good solution to avoid hot L1/cold L2 performance
problems. Prefetches set the access bit (A) in the related TLB page entry, but prefetches with
exclusive access (RFO) do not set the dirty bit (D).
PREFETCH Hint equivalence for the Intel® Xeon Phi™ coprocessor
Instruction      Cache Level   Non-temporal   Bring as exclusive
VPREFETCH0       L1            NO             NO
VPREFETCHNTA     L1            YES            NO
VPREFETCH1       L2            NO             NO
VPREFETCH2       L2            YES            NO
VPREFETCHE0      L1            NO             YES
VPREFETCHENTA    L1            YES            YES
VPREFETCHE1      L2            NO             YES
VPREFETCHE2      L2            YES            YES

Operation
exclusive = 1
nthintpre = 0
FetchL2CacheLine(effective_address, exclusive, nthintpre)

Flags Affected
None.

Intel® C/C++ Compiler Intrinsic Equivalent
void _mm_prefetch (char const*, int);

Exceptions
Real-Address Mode and Virtual-8086
#UD

Instruction not available in these modes

Protected and Compatibility Mode
#UD

Instruction not available in these modes

64 bit Mode
If preceded by any REX, F0, F2, F3, or 66 prefixes.
If operand is not a memory location.

VPREFETCHE2 - Prefetch memory line using T2 hint, with intent to write

Opcode               Instruction       Description
VEX.128.0F 18 /7     vprefetche2 m8    Prefetch memory line in m8 using T2 hint with intent to write.
MVEX.512.0F 18 /7    vprefetche2 m8    Prefetch memory line in m8 using T2 hint with intent to write.

Description
This is very similar to the existing IA-32 prefetch instruction, PREFETCH0, as described
in IA-32 Intel® Architecture Software Developer's Manual: Volume 2. If the line selected is
already present in the cache hierarchy at a level closer to the processor, no data movement
occurs. Prefetches from uncacheable or WC memory are ignored.
In contrast with the existing prefetch instruction, the MVEX form of this instruction uses
disp8*64 addressing. Displacements that would normally be 8 bits according to the
ModR/M byte are still 8 bits but scaled by 64 so that they have cache-line granularity.
VEX forms of this instruction use regular disp8 addressing.
This instruction is a hint and may be speculative, and may be dropped or specify invalid
addresses without causing problems or memory related faults.
This instruction contains a set of hint attributes that modify the prefetching behavior:
exclusive: make line Exclusive in the L2 cache (unless it's already Exclusive or Modified
in the L2 cache).
nthintpre (NTH): load data into the L2 nontemporal cache rather than the L2 temporal
cache. Data will be loaded in the #TIDth way and made MRU. Data should still be
cached normally in the L2 and higher caches.
Note that in the Intel® Xeon Phi™ coprocessor, the hardware drops VPREFETCH if it hits L1
(so it becomes transparent to L2). Consequently, this instruction is not a good solution
to avoid hot L1/cold L2 performance problems. Prefetches set the access bit (A) in the
related TLB page entry, but prefetches with exclusive access (RFO) do not set the dirty bit
(D).
PREFETCH Hint equivalence for the Intel® Xeon Phi™ coprocessor
Instruction      Cache Level   Non-temporal   Bring as exclusive
VPREFETCH0       L1            NO             NO
VPREFETCHNTA     L1            YES            NO
VPREFETCH1       L2            NO             NO
VPREFETCH2       L2            YES            NO
VPREFETCHE0      L1            NO             YES
VPREFETCHENTA    L1            YES            YES
VPREFETCHE1      L2            NO             YES
VPREFETCHE2      L2            YES            YES

Operation
exclusive = 1
nthintpre = 1
FetchL2CacheLine(effective_address, exclusive, nthintpre)

Flags Affected
None.

Intel® C/C++ Compiler Intrinsic Equivalent
void _mm_prefetch (char const*, int);

Exceptions
Real-Address Mode and Virtual-8086
#UD

Instruction not available in these modes

Protected and Compatibility Mode
#UD

Instruction not available in these modes

64 bit Mode
If preceded by any REX, F0, F2, F3, or 66 prefixes.
If operand is not a memory location.

VPREFETCHENTA - Prefetch memory line using NTA hint, with intent to write

Opcode               Instruction         Description
VEX.128.0F 18 /4     vprefetchenta m8    Prefetch memory line in m8 using NTA hint with intent to write.
MVEX.512.0F 18 /4    vprefetchenta m8    Prefetch memory line in m8 using NTA hint with intent to write.

Description
This is very similar to the existing IA-32 prefetch instruction, PREFETCH0, as described
in IA-32 Intel® Architecture Software Developer's Manual: Volume 2. If the line selected is
already present in the cache hierarchy at a level closer to the processor, no data movement
occurs. Prefetches from uncacheable or WC memory are ignored.
In contrast with the existing prefetch instruction, this instruction uses disp8*64 addressing. Displacements that would normally be 8 bits according to the ModR/M byte are still
8 bits but scaled by 64 so that they have cache-line granularity.
This instruction is a hint and may be speculative, and may be dropped or specify invalid
addresses without causing problems or memory related faults.
This instruction contains a set of hint attributes that modify the prefetching behavior:
exclusive: make line Exclusive in the L1 cache (unless it's already Exclusive or Modified
in the L1 cache).
nthintpre (NTH): load data into the L1 nontemporal cache rather than the L1 temporal
cache. Data will be loaded in the #TIDth way and made MRU. Data
should still be cached normally in the L2 and higher caches.
The hardware drops VPREFETCH if it hits L1 (so it becomes transparent to L2).
Consequently, this instruction is not a good solution to avoid hot L1/cold L2 performance
problems. Prefetches set the access bit (A) in the related TLB page entry, but prefetches with
exclusive access (RFO) do not set the dirty bit (D).
PREFETCH Hint equivalence for the Intel® Xeon Phi™ coprocessor
Instruction      Cache Level   Non-temporal   Bring as exclusive
VPREFETCH0       L1            NO             NO
VPREFETCHNTA     L1            YES            NO
VPREFETCH1       L2            NO             NO
VPREFETCH2       L2            YES            NO
VPREFETCHE0      L1            NO             YES
VPREFETCHENTA    L1            YES            YES
VPREFETCHE1      L2            NO             YES
VPREFETCHE2      L2            YES            YES

Operation
exclusive = 1
nthintpre = 1
FetchL1CacheLine(effective_address, exclusive, nthintpre)

Flags Affected
None.

Intel® C/C++ Compiler Intrinsic Equivalent
void _mm_prefetch (char const*, int);

Exceptions
Real-Address Mode and Virtual-8086
#UD

Instruction not available in these modes

Protected and Compatibility Mode
#UD

Instruction not available in these modes

64 bit Mode
If preceded by any REX, F0, F2, F3, or 66 prefixes.
If operand is not a memory location.

VPREFETCHNTA - Prefetch memory line using NTA hint

Opcode               Instruction        Description
VEX.128.0F 18 /0     vprefetchnta m8    Prefetch memory line in m8 using NTA hint.
MVEX.512.0F 18 /0    vprefetchnta m8    Prefetch memory line in m8 using NTA hint.

Description
This is very similar to the existing IA-32 prefetch instruction, PREFETCH0, as described
in IA-32 Intel® Architecture Software Developer's Manual: Volume 2. If the line selected is
already present in the cache hierarchy at a level closer to the processor, no data movement
occurs. Prefetches from uncacheable or WC memory are ignored.
In contrast with the existing prefetch instruction, the MVEX form of this instruction uses
disp8*64 addressing. Displacements that would normally be 8 bits according to the
ModR/M byte are still 8 bits but scaled by 64 so that they have cache-line granularity.
VEX forms of this instruction use regular disp8 addressing.
This instruction is a hint and may be speculative, and may be dropped or specify invalid
addresses without causing problems or memory related faults.
This instruction contains a set of hint attributes that modify the prefetching behavior:
exclusive: make line Exclusive in the L1 cache (unless it's already Exclusive or Modified
in the L1 cache).
nthintpre (NTH): load data into the L1 nontemporal cache rather than the L1 temporal
cache. Data will be loaded in the #TIDth way and made MRU. Data should still be
cached normally in the L2 and higher caches.
In the Intel® Xeon Phi™ coprocessor, the hardware drops VPREFETCH if it hits L1 (so it
becomes transparent to L2). Consequently, this instruction is not a good solution to avoid
hot L1/cold L2 performance problems. Prefetches set the access bit (A) in the related
TLB page entry, but prefetches with exclusive access (RFO) do not set the dirty bit (D).
PREFETCH Hint equivalence for the Intel® Xeon Phi™ coprocessor
Instruction      Cache Level   Non-temporal   Bring as exclusive
VPREFETCH0       L1            NO             NO
VPREFETCHNTA     L1            YES            NO
VPREFETCH1       L2            NO             NO
VPREFETCH2       L2            YES            NO
VPREFETCHE0      L1            NO             YES
VPREFETCHENTA    L1            YES            YES
VPREFETCHE1      L2            NO             YES
VPREFETCHE2      L2            YES            YES

Operation
exclusive = 0
nthintpre = 1
FetchL1CacheLine(effective_address, exclusive, nthintpre)
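
The eight prefetch variants in this appendix share opcode 0F 18 and differ only in the ModR/M reg field (/0 through /7) and in the exclusive and nthintpre attributes given in each Operation section. The following C sketch collects them into one lookup table; the type and function names are illustrative only and are not part of any Intel header.

```c
#include <stdbool.h>

typedef struct {
    const char *mnemonic;
    int cache_level;  /* 1 = FetchL1CacheLine, 2 = FetchL2CacheLine */
    bool nthintpre;   /* non-temporal hint */
    bool exclusive;   /* bring line as exclusive (intent to write) */
} prefetch_hint;

/* Indexed by the ModR/M reg field of opcode 0F 18, per the opcode tables. */
static const prefetch_hint PREFETCH_HINTS[8] = {
    {"VPREFETCHNTA",  1, true,  false}, /* /0 */
    {"VPREFETCH0",    1, false, false}, /* /1 */
    {"VPREFETCH1",    2, false, false}, /* /2 */
    {"VPREFETCH2",    2, true,  false}, /* /3 */
    {"VPREFETCHENTA", 1, true,  true},  /* /4 */
    {"VPREFETCHE0",   1, false, true},  /* /5 */
    {"VPREFETCHE1",   2, false, true},  /* /6 */
    {"VPREFETCHE2",   2, true,  true},  /* /7 */
};

/* Extract bits 5:3 (reg) of a ModR/M byte and return the matching entry. */
static const prefetch_hint *decode_prefetch_hint(unsigned modrm)
{
    return &PREFETCH_HINTS[(modrm >> 3) & 7u];
}
```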

Flags Affected
None.

Intel® C/C++ Compiler Intrinsic Equivalent
void _mm_prefetch (char const*, int);

Exceptions
Real-Address Mode and Virtual-8086
#UD

Instruction not available in these modes

Protected and Compatibility Mode
#UD

Instruction not available in these modes

64 bit Mode
If preceded by any REX, F0, F2, F3, or 66 prefixes.
If operand is not a memory location.

APPENDIX B. INTEL® XEON PHI™ COPROCESSOR 64 BIT MODE SCALAR INSTRUCTION SUPPORT

Appendix B
Intel® Xeon Phi™ coprocessor 64 bit Mode
Scalar Instruction Support
In 64 bit mode, the Intel® Xeon Phi™ coprocessor supports a subset of the Intel 64 Architecture instructions. The
64 bit mode instructions supported by the Intel® Xeon Phi™ coprocessor are listed in this chapter.

B.1 64 bit Mode General-Purpose and X87 Instructions

Intel® Xeon Phi™ coprocessor supports most of the general-purpose register (GPR) and X87 instructions in 64
bit mode. They are listed in Table B.2.
64 bit Mode GPR and X87 Instructions in the Intel® Xeon Phi™ coprocessor:
ADC          ADD          AND          BSF          BSR
BSWAP        BT           BTC          BTR          BTS
CALL         CBW          CDQ          CDQE         CLC
CLD          CLI          CLTS         CMC          CMP
CMPS         CMPSB        CMPSD        CMPSQ        CMPSW
CMPXCHG      CMPXCHG8B    CPUID        CQO          CWD
CWDE         DEC          DIV          ENTER        F2XM1
FABS         FADD         FADDP        FBLD         FBSTP
FCHS         FCLEX        FCOM         FCOMP        FCOMPP
FCOS         FDECSTP      FDIV         FDIVP        FDIVR
FDIVRP       FFREE        FIADD        FICOM        FICOMP
FIDIV        FIDIVR       FILD         FIMUL        FINCSTP
FINIT        FIST         FISTP        FISUB        FISUBR
FLD          FLD1         FLDCW        FLDENV       FLDL2E
FLDL2T       FLDLG2       FLDLN2       FLDPI        FLDZ
FMUL         FMULP        FNCLEX       FNINIT       FNOP
FNSAVE       FNSTCW       FNSTENV      FNSTSW       FPATAN
FPREM        FPREM1       FPTAN        FRNDINT      FRSTOR
FSAVE        FSCALE       FSIN         FSINCOS      FSQRT
FST          FSTCW        FSTENV       FSTP         FSTSW
FSUB         FSUBP        FSUBR        FSUBRP       FTST
FUCOM        FUCOMP       FUCOMPP      FWAIT        FXAM
FXCH         FXRSTOR      FXSAVE       FXTRACT      FYL2X
FYL2XP1      HLT          IDIV         IMUL         INC
INT          INT3         INTO         INVD         INVLPG
IRET         IRETD        JA           JAE          JB
JBE          JC           JCXZ         JE           JECXZ
JG           JGE          JL           JLE          JMP
JNA          JNAE         JNB          JNBE         JNC
JNE          JNG          JNGE         JNL          JNLE
JNO          JNP          JNS          JNZ          JO
JP           JPE          JPO          JS           JZ
LAHF         LAR          LEA          LEAVE        LFS
LGDT         LGS          LIDT         LLDT         LMSW
LOCK         LODS         LODSB        LODSD        LODSQ
LODSW        LOOP         LOOPE        LOOPNE       LOOPNZ
LOOPZ        LSL          LSS          LTR          MOV
MOV CR       MOV DR       MOVS         MOVSB        MOVSD
MOVSQ        MOVSW        MOVSX        MOVSXD       MOVZX
MUL          NEG          NOP          NOT          OR
POP          POPF         POPFQ        PUSH         PUSHF
PUSHFQ       RCL          RCR          RDMSR        RDPMC
RDTSC        REP          REPE         REPNE        REPNZ
RET          ROL          ROR          RSM          SAHF
SAL          SAR          SBB          SCAS         SCASB
SCASD        SCASQ        SCASW        SETA         SETAE
SETB         SETBE        SETC         SETE         SETG
SETGE        SETL         SETLE        SETNA        SETNAE
SETNB        SETNBE       SETNC        SETNE        SETNG
SETNGE       SETNL        SETNLE       SETNO        SETNP
SETNS        SETNZ        SETO         SETP         SETPE
SETPO        SETS         SETZ         SGDT         SHL
SHLD         SHR          SHRD         SIDT         SLDT
SMSW         STC          STD          STI          STOSB
STOSD        STOSQ        STOSW        STR          SUB
SWAPGS       SYSCALL      SYSRET       TEST         VERR
VERW         WAIT         WBINVD       WRMSR        XADD
XCHG         XLAT         XLATB        XOR          UD2

B.2 Intel® Xeon Phi™ coprocessor 64 bit Mode Limitations

In 64 bit mode, the Intel® Xeon Phi™ coprocessor supports a subset of the Intel 64 Architecture instructions.
The following summarizes Intel 64 Architecture instructions that are not supported in the Intel® Xeon Phi™
coprocessor:
• Instructions that operate on MMX registers
• Instructions that operate on XMM registers
• Instructions that operate on YMM registers
GPR and X87 Instructions Not Supported in the Intel® Xeon Phi™ coprocessor
CMOV         CMPXCHG16B   FCMOVcc      FCOMI
FCOMIP       FUCOMI       FUCOMIP      IN
INS          INSB         INSD         INSW
MONITOR      MWAIT        OUT          OUTS
OUTSB        OUTSD        OUTSW        PAUSE
SYSENTER     SYSEXIT

B.3 LDMXCSR - Load MXCSR Register

Opcode      Instruction    Description
0F AE /2    ldmxcsr m32    Load MXCSR register from m32

Description
Loads the source operand into the MXCSR control/status register. The source operand is
a 32 bit memory location. See MXCSR Control and Status Register in Chapter 10, of the
IA-32 Intel Architecture Software Developers Manual, Volume 1, for a description of the
MXCSR register and its contents. See chapter 3 of this document for a description of the
new Intel® Xeon Phi™ coprocessor's MXCSR feature bits.
The LDMXCSR instruction is typically used in conjunction with the STMXCSR instruction,
which stores the contents of the MXCSR register in memory.
The default MXCSR value at reset is 0020_0000H (DUE=1, FZ=0, RC=00, PM=0, UM=0,
OM=0, ZM=0, DM=0, IM=0, DAZ=0, PE=0, UE=0, OE=0, ZE=0, DE=0, IE=0).
Any attempt to set reserved bits in control register MXCSR to 1 will produce a #GP fault:
Bit             Default   Comment
MXCSR[7-12]     0         Note that this corresponds to Intel® SSE's IM/DM/ZM/OM/UM/PM
MXCSR[16-20]    0         Reserved
MXCSR[22-31]    0         Reserved

Additionally, any attempt to set MXCSR.DUE (bit 21) to 0 will produce a #GP fault:
Bit             Default   Comment
MXCSR[21]       1         DUE (Disable Unmasked Exceptions) always enforced in Intel® Xeon Phi™ coprocessor

This instruction's operation is the same in non-64 bit modes and 64 bit mode.
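
The two #GP conditions above (a reserved bit set to 1, or DUE cleared to 0) can be pre-checked in software before a value is loaded. The following is a minimal sketch; the macro and function names are hypothetical, and the masks are derived only from the bit ranges listed above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bits 7-12, 16-20 and 22-31 must be written as 0. */
#define MXCSR_MUST_BE_ZERO 0xFFDF1F80u
/* Bit 21 (DUE) must be written as 1. */
#define MXCSR_DUE          0x00200000u

/* Returns true if loading this value would raise #GP under the rules above. */
static bool mxcsr_load_would_fault(uint32_t value)
{
    if (value & MXCSR_MUST_BE_ZERO) {
        return true; /* a reserved bit is set */
    }
    if (!(value & MXCSR_DUE)) {
        return true; /* DUE is cleared */
    }
    return false;
}
```

Under these rules the reset value 0020_0000H is accepted, whereas 1F80H, the usual Intel® SSE default with all exception mask bits set, would be rejected because bits 7-12 are reserved here and DUE is clear.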

Operation

MXCSR = MemLoad(m32)

Flags Affected
None.

Intel® C/C++ Compiler Intrinsic Equivalent
void _mm_setcsr (unsigned int)

Exceptions
#SS(0)             If a memory address referencing the SS segment is in a non-canonical form.
#GP(0)             If the memory address is in a non-canonical form.
                   For an attempt to set reserved bits in MXCSR.
#PF(fault-code)    For a page fault.
#NM                If CR0.TS[bit 3] = 1.
#UD                If CR0.EM[bit 2] = 1.
                   If CS.L=0 or IA32_EFER.LMA=0.
                   If the lock prefix is used.
#AC(0)             If alignment checking is enabled and an unaligned memory reference is made
                   while the current privilege level is 3.

B.4 FXRSTOR - Restore x87 FPU and MXCSR State

Opcode            Instruction           Description
0F AE /1          fxrstor m512byte      Restore the x87 FPU and MXCSR register state from m512byte
REX.W+0F AE /1    fxrstor64 m512byte    Restore the x87 FPU with 64-bit FPU-DP and MXCSR register state from m512byte

Description
See Intel64® Intel® Architecture Software Developer's Manual for the description of the
original x86 instruction.
Reloads the x87 FPU and the MXCSR state from the 512-byte memory image specified
in the source operand. This data should have been written to memory previously using
the FXSAVE instruction of the Intel® Xeon Phi™ coprocessor, and in the same format as
required by the operating modes. The first byte of the data should be located on a 16-byte
boundary. There are three distinct layouts of the FXSAVE state map: one for legacy and
compatibility mode, a second format for 64 bit mode with promoted operand size, and a
third format for 64 bit mode with default operand size.
Intel® Xeon Phi™ coprocessor follows the same layouts as described in Intel64® Intel®
Architecture Software Developer's Manual.
The state image referenced with an FXRSTOR instruction must have been saved using an
FXSAVE instruction or be in the same format as required by the reference pages of the FXSAVE
/ FXRSTOR instructions of the Intel® Xeon Phi™ coprocessor. Referencing a state image
saved with an FSAVE or FNSAVE instruction, or one with an incompatible field layout, will result in an
incorrect state restoration.
The FXRSTOR instruction does not flush pending x87 FPU exceptions. To check and raise
exceptions when loading x87 FPU state information with the FXRSTOR instruction, use an
FWAIT instruction after the FXRSTOR instruction.
The coprocessor enforces that the XMM state save area is zero; otherwise a #GP will
be raised. The coprocessor does not use the content of the MXCSR_MASK field. The coprocessor will clear bits 0:127 of the ZMM state.
Any attempt to set reserved bits in control register MXCSR to 1 will produce a #GP fault:
Bit             Default   Comment
MXCSR[7-12]     0         Note that this corresponds to Intel® SSE's IM/DM/ZM/OM/UM/PM
MXCSR[16-19]    0         Reserved
MXCSR[20]       0         Reserved
MXCSR[22-31]    0         Reserved

Additionally, any attempt to set MXCSR.DUE (bit 21) to 0 will produce a #GP fault:
Bit             Default   Comment
MXCSR[21]       1         DUE (Disable Unmasked Exceptions) always enforced in Intel® Xeon Phi™ coprocessor
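
The #GP conditions described above for a state image (misaligned operand, non-zero XMM state save area, illegal MXCSR value) can be checked in software before the image is restored. The following C sketch assumes the standard FXSAVE layout from the Intel 64 manual, with MXCSR at byte offset 24 and the XMM register area at bytes 160 through 415; these offsets, the host byte order used to read MXCSR, and all names are assumptions of this example rather than statements of this manual.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum {
    FXSAVE_MXCSR_OFFSET = 24,  /* assumed: standard FXSAVE layout */
    FXSAVE_XMM_OFFSET   = 160, /* assumed: start of the XMM save area */
    FXSAVE_XMM_SIZE     = 256  /* assumed: 16 registers x 16 bytes */
};

/* Returns true if restoring this 512-byte image would raise #GP
   under the rules described above (hypothetical helper). */
static bool fxrstor_image_would_fault(const unsigned char *area)
{
    if (((uintptr_t)area & 15u) != 0) {
        return true; /* not aligned on a 16-byte boundary */
    }
    for (size_t i = 0; i < FXSAVE_XMM_SIZE; i++) {
        if (area[FXSAVE_XMM_OFFSET + i] != 0) {
            return true; /* XMM state save area must be zero */
        }
    }
    uint32_t mxcsr;
    memcpy(&mxcsr, area + FXSAVE_MXCSR_OFFSET, sizeof mxcsr);
    if (mxcsr & 0xFFDF1F80u) {
        return true; /* reserved MXCSR bits 7-12, 16-20, 22-31 set */
    }
    if (!(mxcsr & 0x00200000u)) {
        return true; /* MXCSR.DUE (bit 21) cleared */
    }
    return false;
}
```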

Operation

// Clear bits [0:127] of ZMM states, enforce XMM state save area must be zero, ignore MXCSR_MAS
(x87 FPU, MXCSR) = MemLoad(SRC);

Flags Affected
None.

Intel® C/C++ Compiler Intrinsic Equivalent
void _fxrstor64 (void*);

Exceptions
#SS(0)             If a memory address referencing the SS segment is in a non-canonical form.
#GP(0)             If the memory address is in a non-canonical form.
                   If memory operand is not aligned on a 16-byte boundary, regardless of segment.
                   If trying to set illegal MXCSR values.
#MF                If there is a pending x87 FPU exception.
#PF(fault-code)    For a page fault.
#UD                If CPUID.01H:EDX.FXSR[bit 24] = 0.
                   If instruction is preceded by a LOCK prefix.
#NM                If CR0.TS[bit 3] = 1.
                   If CR0.EM[bit 2] = 1.
#AC                If this exception is disabled a general protection exception
                   (#GP) is signaled if the memory operand is not aligned on a
                   16-byte boundary, as described above. If the alignment check
                   exception (#AC) is enabled (and the CPL is 3), signaling of
                   #AC is not guaranteed and may vary with implementation, as
                   follows. In all implementations where #AC is not signaled, a
                   general protection exception is signaled in its place. In
                   addition, the width of the alignment check may also vary with
                   implementation. For instance, for a given implementation,
                   an alignment check exception might be signaled for a 2-byte
                   misalignment, whereas a general protection exception might
                   be signaled for all other misalignments (4-, 8-, or 16-byte
                   misalignments).


B.5

FXSAVE - Save x87 FPU and MXCSR State

Opcode      Instruction        Description
0F AE /0    fxsave m512byte    Save the x87 FPU and MXCSR register state to m512byte

Description
See Intel64® Intel® Architecture Software Developer's Manual for the description of the
original x86 instruction.
Saves the current state of the x87 FPU and the relevant state in the MXCSR register to a
512-byte memory location specified in the destination operand. The content layout of the
512-byte region depends on whether the processor is operating in non-64 bit operating
modes or 64 bit sub-mode of IA-32e mode.
Bytes 464:511 are available for software use. The processor does not write to bytes
464:511 of an FXSAVE area.
Intel® Xeon Phi™ coprocessor follows a layout similar to that described in Intel64® Intel® Architecture Software Developer's Manual.
The processor will write 0s to the MXCSR_MASK field and the XMM state save area. The
processor does not save any portion of the ZMM register states into the FXSAVE state save
area.
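
The sketch below illustrates the size and alignment constraints on the FXSAVE operand: a 512-byte region, aligned on a 16-byte boundary (see the #GP condition under Exceptions), whose bytes 464:511 are left to software. The type and function names are hypothetical, and the code only models the memory region; it does not execute FXSAVE itself.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define FXSAVE_AREA_SIZE     512u  /* size of the m512byte operand          */
#define FXSAVE_ALIGNMENT     16u   /* required operand alignment in bytes   */
#define FXSAVE_SW_FIRST_BYTE 464u  /* bytes 464:511 are not written by HW   */

/* Hypothetical container type: a 512-byte buffer with 16-byte alignment. */
typedef struct {
    _Alignas(16) uint8_t bytes[FXSAVE_AREA_SIZE];
} fxsave_area_t;

/* True if `p` could be used as an FXSAVE/FXRSTOR operand without
   triggering the misalignment #GP described under Exceptions. */
static bool fxsave_operand_is_aligned(const void *p)
{
    return ((uintptr_t)p % FXSAVE_ALIGNMENT) == 0;
}

/* Number of bytes at the end of the area that software may use freely. */
static size_t fxsave_software_bytes(void)
{
    return FXSAVE_AREA_SIZE - FXSAVE_SW_FIRST_BYTE;
}
```

Declaring the save area through a type like this lets the compiler guarantee the alignment, rather than relying on the allocator.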

Operation

if(64 bit Mode)
{
    if(REX.W == 1)
    {
        // clear MXCSR_MASK field and XMM save area
        MemStore(m512byte) = Save64BitPromotedFxsave(x87 FPU, MXCSR);
    }
    else {
        // clear MXCSR_MASK field and XMM save area
        MemStore(m512byte) = Save64BitDefaultFxsave(x87 FPU, MXCSR);
    }
}
else {
    // clear MXCSR_MASK field and XMM save area
    MemStore(m512byte) = SaveLegacyFxsave(x87 FPU, MXCSR);
}


Flags Affected
None.

Intel® C/C++ Compiler Intrinsic Equivalent
void _fxsave64 (void*);

Exceptions
#SS(0)            If a memory address referencing the SS segment is
                  in a non-canonical form.
#GP(0)            If the memory address is in a non-canonical form.
                  If memory operand is not aligned on a 16-byte boundary,
                  regardless of segment.
#MF               If there is a pending x87 FPU exception.
#PF(fault-code)   For a page fault.
#UD               If CPUID.01H:EDX.FXSR[bit 24] = 0.
                  If instruction is preceded by a LOCK prefix.
#NM               If CR0.TS[bit 3] = 1.
                  If CR0.EM[bit 2] = 1.
#AC               If this exception is disabled a general protection exception
                  (#GP) is signaled if the memory operand is not aligned on a
                  16-byte boundary, as described above. If the alignment check
                  exception (#AC) is enabled (and the CPL is 3), signaling of
                  #AC is not guaranteed and may vary with implementation, as
                  follows. In all implementations where #AC is not signaled, a
                  general protection exception is signaled in its place. In
                  addition, the width of the alignment check may also vary with
                  implementation. For instance, for a given implementation,
                  an alignment check exception might be signaled for a 2-byte
                  misalignment, whereas a general protection exception might
                  be signaled for all other misalignments (4-, 8-, or 16-byte
                  misalignments).


B.6

RDPMC - Read Performance-Monitoring Counters

Opcode      Instruction    Description
0F 33       rdpmc          Read performance-monitoring counter specified by ECX into EDX:EAX.

Description
Loads the 40-bit performance-monitoring counter specified in the ECX register into registers EDX:EAX. The EDX register is loaded with the high-order 8 bits of the counter and
the EAX register is loaded with the low-order 32 bits. The counter to be read is specified
with an unsigned integer placed in the ECX register.
Intel® Xeon Phi™ coprocessor has 2 performance monitoring counters per thread, specified with 0000H through 0001H, respectively, in the ECX register.
When in protected or virtual 8086 mode, the performance-monitoring counters enabled
(PCE) flag in register CR4 restricts the use of the RDPMC instruction as follows. When the
PCE flag is set, the RDPMC instruction can be executed at any privilege level; when the flag
is clear, the instruction can only be executed at privilege level 0. (When in real-address
mode, the RDPMC instruction is always enabled.)
The performance-monitoring counters can also be read with the RDMSR instruction,
when executing at privilege level 0.
The performance-monitoring counters are event counters that can be programmed to
count events such as the number of instructions decoded, number of interrupts received,
or number of cache loads. Appendix A, Performance-Monitoring Events, in the IA-32
Intel® Architecture Software Developer's Manual, Volume 3, lists the events that can be
counted for the Intel® Pentium® 4, Intel® Xeon®, and earlier IA-32 processors.
The RDPMC instruction is not a serializing instruction; that is, it does not imply that all the
events caused by the preceding instructions have been completed or that events caused
by subsequent instructions have not begun. If an exact event count is desired, software
must insert a serializing instruction (such as the CPUID instruction) before and/or after
the RDPMC instruction.
The RDPMC instruction can execute in 16 bit addressing mode or virtual-8086 mode;
however, the full contents of the ECX register are used to select the counter, and the event
count is stored in the full EAX and EDX registers.

The RDPMC instruction was introduced into the IA-32 Architecture in the Intel® Pentium®
Pro processor and the Intel® Pentium® processor with Intel® MMX™ technology. The earlier Intel® Pentium® processors have performance-monitoring counters, but they must be
read with the RDMSR instruction.
In 64 bit mode, RDPMC behavior is unchanged from 32 bit mode. The upper 32 bits of
RAX and RDX are cleared.

Operation

if ( ((ECX[31:0] >= 0) && (ECX[31:0] < 2))
     && ((CR4.PCE == 1) || (CPL == 0) || (CR0.PE == 0)) )
{
    if(64 bit Mode)
    {
        RAX[31:0] = PMC(ECX[31:0])[31:0]; (* 40-bit read *)
        RAX[63:32] = 0;
        RDX[31:0] = PMC(ECX[31:0])[39:32];
        RDX[63:32] = 0;
    }
    else
    {
        EAX = PMC(ECX[31:0])[31:0]; (* 40-bit read *)
        EDX = PMC(ECX[31:0])[39:32];
    }
}
else
{
    #GP(0)
}
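
Software that reads the counter through EDX:EAX has to reassemble the 40-bit value itself. The following is a minimal sketch of that step and of the counter-index check from the Operation section; the helper names pmc_index_is_valid and pmc_combine are hypothetical, and the register values are passed in as plain integers rather than read with RDPMC.

```c
#include <stdbool.h>
#include <stdint.h>

#define PMC_COUNT 2u  /* counters 0000H and 0001H per thread */

/* True if `ecx` selects one of the implemented counters; any other
   index takes the #GP(0) path in the Operation section. */
static bool pmc_index_is_valid(uint32_t ecx)
{
    return ecx < PMC_COUNT;
}

/* Rebuilds the 40-bit counter: EDX supplies bits 39:32 (its low 8 bits),
   EAX supplies bits 31:0. */
static uint64_t pmc_combine(uint32_t eax, uint32_t edx)
{
    return ((uint64_t)(edx & 0xFFu) << 32) | eax;
}
```

Masking EDX to 8 bits is a defensive choice in this sketch; according to the text, the upper bits of EDX are already zero after the instruction completes.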

Flags Affected
None.

Intel® C/C++ Compiler Intrinsic Equivalent
__int64 _rdpmc (int);


Exceptions
TBD


B.7

STMXCSR - Store MXCSR Register

Opcode      Instruction    Description
0F AE /3    stmxcsr m32    Store contents of MXCSR register to m32

Description
Stores the contents of the MXCSR control and status register to the destination operand.
The destination operand is a 32 bit memory location.
This instruction's operation is the same in non-64 bit modes and 64 bit mode.

Operation
MemStore(m32) = MXCSR

Flags Affected
None.

Intel® C/C++ Compiler Intrinsic Equivalent
unsigned int _mm_getcsr (void)

Exceptions
#SS(0)            If a memory address referencing the SS segment is
                  in a non-canonical form.
#GP(0)            If the memory address is in a non-canonical form.
#PF(fault-code)   For a page fault.
#NM               If CR0.TS[bit 3] = 1.
#UD               If CR0.EM[bit 2] = 1.
                  If CS.L=0 or IA32_EFER.LMA=0.
                  If the lock prefix is used.
#AC(0)            If alignment checking is enabled and an unaligned
                  memory reference is made while the current privilege
                  level is 3.


B.8

CPUID - CPUID Identification

Opcode      Instruction    Description
0F A2       cpuid          Returns processor identification and feature information to the EAX, EBX, ECX, and
                           EDX registers, as determined by the input value entered in EAX.

Description
The ID flag (bit 21) in the EFLAGS register indicates support for the CPUID instruction. If
a software procedure can set and clear this flag, the processor executing the procedure
supports the CPUID instruction. This instruction operates the same in non-64 bit modes
and 64 bit mode.
CPUID returns processor identification and feature information in the EAX, EBX, ECX, and
EDX registers. The instruction's output is dependent on the contents of the EAX register
upon execution. For example, the following pseudo-code loads EAX with 00H and causes
CPUID to return a Maximum Return Value and the Vendor Identification String in the appropriate registers:
MOV EAX, 00H
CPUID
Tables B.4 through B.7 show the information returned, depending on the initial value loaded
into the EAX register. Table B.3 shows the maximum CPUID input value recognized for
each family of IA-32 processors on which CPUID is implemented. Beginning with the Intel® Pentium®
4 family of processors, two types of information are returned: basic and extended function
information. Prior to that, only the basic function information was returned. The first is
accessed with EAX=0000000xh while the second is accessed with EAX=8000000xh. If a
value is entered for CPUID.EAX that is invalid for a particular processor, the data for the
highest basic information leaf is returned.
CPUID can be executed at any privilege level to serialize instruction execution. Serializing
instruction execution guarantees that any modifications to flags, registers, and memory
for previous instructions are completed before the next instruction is fetched and executed.
INPUT EAX = 0: Returns CPUID's Highest Value for Basic Processor Information and
the Vendor Identification String
When CPUID executes with EAX set to 0, the processor returns the highest value the CPUID
recognizes for returning basic processor information. The value is returned in the EAX
register (see Table B.4) and is processor specific. A vendor identification string is also
returned in EBX, EDX, and ECX. For Intel® processors, the string is "GenuineIntel" and is
expressed:
EBX = 756e6547h (* "Genu", with G in the low byte of EBX (BL) *)
EDX = 49656e69h (* "ineI", with i in the low byte of EDX (DL) *)
ECX = 6c65746eh (* "ntel", with n in the low byte of ECX (CL) *)

                                                                                  Highest Value in EAX
IA-32 Processors                                                                  Basic Information      Extended Function Information
Earlier Intel486 Processors                                                       CPUID Not Implemented  CPUID Not Implemented
Later Intel486 Processors and Intel® Pentium® Processors                          01H                    Not Implemented
Intel® Pentium® Pro and Intel® Pentium® II Processors, Intel® Celeron Processors  02H                    Not Implemented
Intel® Pentium® III Processors                                                    03H                    Not Implemented
Intel® Pentium® 4 Processors                                                      02H                    80000004H
Intel® Xeon® Processors                                                           02H                    80000004H
Intel® Pentium® M Processor                                                       02H                    80000004H
Intel® Pentium® 4 Processor supporting Intel® Hyper-Threading Technology          05H                    80000008H
Intel® Pentium® D Processor (8xx)                                                 05H                    80000008H
Intel® Pentium® D Processor (9xx)                                                 06H                    80000008H
Intel® Core™ Duo Processor                                                        0AH                    80000008H
Intel® Core™ 2 Duo Processor                                                      0AH                    80000008H
Intel® Xeon® Processor 3000, 3200, 5100, 5300 Series                              0AH                    80000008H
Intel® Xeon Phi™ coprocessor                                                      04H                    80000008H

Table B.3: Highest CPUID Source Operand for IA-32 Processors
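
The vendor identification string is spread over EBX, EDX, and ECX in that order, four ASCII characters per register with the first character in the least significant byte. As an illustrative sketch (the function name cpuid_vendor_string is hypothetical, and the register values are supplied by the caller rather than obtained by executing CPUID), the string can be reassembled as follows:

```c
#include <stdint.h>

/* Writes the 12-character vendor string plus a terminating NUL into `out`.
   Register order is EBX, EDX, ECX; within each register the lowest byte
   holds the first character. */
static void cpuid_vendor_string(uint32_t ebx, uint32_t edx, uint32_t ecx,
                                char out[13])
{
    const uint32_t regs[3] = { ebx, edx, ecx };

    for (int r = 0; r < 3; r++) {
        for (int b = 0; b < 4; b++) {
            out[r * 4 + b] = (char)((regs[r] >> (8 * b)) & 0xFFu);
        }
    }
    out[12] = '\0';
}
```

Passing the three values quoted for Intel® processors (756e6547h, 49656e69h, 6c65746eh) yields "GenuineIntel".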
INPUT EAX = 1: Returns Model, Family, Stepping Information
When CPUID executes with EAX set to 1, version information is returned in EAX. The extended
family, extended model, model, family, and processor type for the Intel® Xeon Phi™ coprocessor are as follows:
• Extended Model: 0000B
• Extended Family: 0000_0000B
• Model: *see table*
• Family: 1011B
• Processor Type: 00B
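
The version fields can be extracted from EAX with shifts and masks. The sketch below uses the bit positions given for the EAX = 1H entry in Table B.4; the struct and function names are hypothetical, and the EAX value is passed in by the caller.

```c
#include <stdint.h>

/* Hypothetical container for the fields of CPUID.1:EAX. */
typedef struct {
    unsigned stepping;    /* bits 3-0   */
    unsigned model;       /* bits 7-4   */
    unsigned family;      /* bits 11-8  */
    unsigned type;        /* bits 13-12 */
    unsigned ext_model;   /* bits 19-16 */
    unsigned ext_family;  /* bits 27-20 */
} cpuid_version_t;

static cpuid_version_t cpuid_decode_version(uint32_t eax)
{
    cpuid_version_t v;

    v.stepping   =  eax        & 0xFu;
    v.model      = (eax >> 4)  & 0xFu;
    v.family     = (eax >> 8)  & 0xFu;
    v.type       = (eax >> 12) & 0x3u;
    v.ext_model  = (eax >> 16) & 0xFu;
    v.ext_family = (eax >> 20) & 0xFFu;
    return v;
}
```

For example, an EAX value of 00000B10H decodes to family 1011B, model 0001B, and stepping 0, with all other fields zero.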

INPUT EAX = 1: Returns Additional Information in EBX
When CPUID executes with EAX set to 1, additional information is returned to the EBX
register:
• Brand index (low byte of EBX) - this number provides an entry into a brand string
table that contains brand strings for IA-32 processors. More information about this
field is provided later in this section.
• CLFLUSH/CLEVICTn instruction cache line size (second byte of EBX) - this number
indicates the size of the cache line flushed with the CLEVICT1 instruction in 8-byte increments. This field was introduced in the Intel® Pentium® 4 processor.
• Local APIC ID (high byte of EBX) - this number is the 8-bit ID that is assigned to the
local APIC on the processor during power up. This field was introduced in the Intel®
Pentium® 4 processor.

Initial EAX  Register  Return value  Information Provided about the Processor
0H                                   Basic CPUID Information
             EAX       1             Maximum Input Value for Basic CPUID Information
             EBX       "Genu"        "Genu"
             ECX       "ntel"        "ntel"
             EDX       "ineI"        "ineI"
1H                                   Basic and Extended Feature Information
             EAX                     Version Information: Type, Family, Model, and Stepping ID
                       xxxx          Bits 3-0: Stepping Id
                       0001B         Bits 7-4: Model
                       1011B         Bits 11-8: Family ID
                       00B           Bits 13-12: Type
                       00B           Bits 19-16: Extended Model Id
                       00000000B     Bits 27-20: Extended Family Id
             EBX       0             Bits 7-0: Brand Index
                       8             Bits 15-8: CLFLUSH/CLEVICTn line size (Value x 8 = cache line size in bytes)
                       248           Bits 23-16: Maximum number of logical processors in this physical package.
                       xxx           Bits 31-24: Initial APIC ID
             ECX       00000000H     Extended Feature Information (see Table B.10)
             EDX       110193FFH     Feature Information (see Tables B.8 and B.9)
2H                                   Cache and TLB Information
             EAX       0             Reserved
             EBX       0             Reserved
             ECX       0             Reserved
             EDX       0             Reserved
3H                                   Serial Number Information
             EAX       0             Reserved
             EBX       0             Reserved
             ECX       0             Reserved
             EDX       0             Reserved

Table B.4: Information Returned by CPUID Instruction
INPUT EAX = 1: Returns Feature Information in ECX and EDX
When CPUID executes with EAX set to 1, feature information is returned in ECX and EDX.

CPUID leaves > 3 < 80000000 are visible only when IA32_MISC_ENABLES.BOOT_NT4[bit 22] = 0 (default).
Note: 04H output also depends on the initial value in ECX.

Initial EAX  Register  Return value (ECX=0/1/2)  Information Provided about the Processor
4H                                               Deterministic Cache Parameters Leaf
             EAX       2/1/1                     Bits 4-0: Cache Type (0 = Null - No more caches; 1 = Data Cache; 2 = Instruction Cache; 3 = Unified Cache)
                       1/1/2                     Bits 7-5: Cache Level (starts at 1)
                       1/1/1                     Bit 8: Self Initializing cache level (does not need SW initialization)
                       0/0/0                     Bit 9: Fully Associative cache
                       0/1/1                     Bit 10: Write-Back Invalidate
                       0/1/1                     Bit 11: Inclusive (of lower cache levels)
                       0                         Bits 13-12: Reserved
                       */*/*                     Bits 25-14: Maximum number of threads sharing this cache in a physical package (minus one)
                       */*/*                     Bits 31-26: Maximum number of processor cores in this physical package (minus one)
             EBX       63/63/63                  Bits 11-00: L = System Coherency Line Size (minus 1)
                       0/0/0                     Bits 21-12: P = Physical Line partitions (minus 1)
                       7/7/7                     Bits 31-22: W = Ways of associativity (minus 1)
             ECX       63/63/1023                S = Number of Sets (minus 1)
             EDX       0                         Reserved = 0

Table B.5: Information Returned by CPUID Instruction (Contd.)
• Tables B.8 and B.9 show encodings for EDX.
• Table B.10 shows encodings for ECX.
For all feature flags, a 1 indicates that the feature is supported. Use Intel® to properly
interpret feature flags.
INPUT EAX = 2: Cache and TLB Information Returned in EAX, EBX, ECX, EDX
Intel® Xeon Phi™ coprocessor considers leaf 2 to be reserved, so no cache and TLB information is returned when CPUID executes with EAX set to 2.
INPUT EAX = 3: Serial Number Information
Intel® Xeon Phi™ coprocessor does not implement Processor Serial Number support, as
signalled by feature bit CPUID.EAX[01h].EDX.PSN. Therefore, all the returned fields are
considered reserved.
INPUT EAX = 4: Returns Deterministic Cache Parameters for Each Level
When CPUID executes with EAX set to 4 and ECX contains an index value, the processor
returns encoded data that describe a set of deterministic cache parameters (for the cache
level associated with the input in ECX).

Initial EAX  Register  Return value  Information Provided about the Processor
                                     Extended Function CPUID Information
80000000H    EAX       80000008H     Maximum Input Value for Extended CPUID Information
             EBX       0             Reserved
             ECX       0             Reserved
             EDX       0             Reserved
                                     Feature Information
80000001H    EAX       0             Reserved
             EBX       0             Reserved
             ECX       1             Bit 0: LAHF/SAHF available in 64 bit mode
                       0             Bits 31-1: Reserved
             EDX       0             Bits 10-0: Reserved
                       1             Bit 11: SYSCALL/SYSRET available (in 64 bit mode)
                       0             Bits 19-12: Reserved
                       0             Bit 20: Execute Disable Bit available
                       0             Bits 28-21: Reserved
                       1             Bit 29: Intel® 64 Technology available
                       0             Bits 31-30: Reserved
                                     Processor Brand String
80000002H    EAX       0             Processor Brand String
             EBX       0             Processor Brand String Continued
             ECX       0             Processor Brand String Continued
             EDX       0             Processor Brand String Continued
80000003H    EAX       0             Processor Brand String Continued
             EBX       0             Processor Brand String Continued
             ECX       0             Processor Brand String Continued
             EDX       0             Processor Brand String Continued
80000004H    EAX       0             Processor Brand String Continued
             EBX       0             Processor Brand String Continued
             ECX       0             Processor Brand String Continued
             EDX       0             Processor Brand String Continued
                                     Reserved
80000005H    EAX       0             Reserved
             EBX       0             Reserved
             ECX       0             Reserved
             EDX       0             Reserved

Table B.6: Information Returned by CPUID Instruction. 8000000xH leafs.
Software can enumerate the deterministic cache parameters for each level of the cache hierarchy starting with an index value of 0, until the parameters report that the value associated
with the cache type field is 0. The architecturally defined fields reported by deterministic
cache parameters are documented in Table B.5. The associated cache structures described
by the different ECX descriptors are:
• ECX=0: Instruction Cache (I1)
• ECX=1: L1 Data Cache (L1)
• ECX=2: L2 Data Cache (L2)
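
Because L, P, W, and S are all reported minus one, the size of each cache can be derived from the EBX and ECX values of leaf 4. The following sketch assumes the conventional product (W + 1) × (P + 1) × (L + 1) × (S + 1), which is not spelled out in this appendix; the function names are hypothetical, and the register values are supplied by the caller.

```c
#include <stdint.h>

/* Cache size in bytes from CPUID leaf 4 EBX/ECX, using the field layout
   of Table B.5. Assumes size = ways * partitions * line size * sets. */
static uint64_t cpuid4_cache_size_bytes(uint32_t ebx, uint32_t ecx)
{
    uint64_t line_size  = (ebx & 0xFFFu) + 1u;          /* L, bits 11-0  */
    uint64_t partitions = ((ebx >> 12) & 0x3FFu) + 1u;  /* P, bits 21-12 */
    uint64_t ways       = ((ebx >> 22) & 0x3FFu) + 1u;  /* W, bits 31-22 */
    uint64_t sets       = (uint64_t)ecx + 1u;           /* S             */

    return ways * partitions * line_size * sets;
}

/* Cache type field (EAX bits 4-0); 0 means there are no more caches,
   which ends the enumeration over ECX index values. */
static unsigned cpuid4_cache_type(uint32_t eax)
{
    return eax & 0x1Fu;
}
```

With the values from Table B.5 (L = 63, P = 0, W = 7), S = 63 gives 32768 bytes for the ECX=0 and ECX=1 caches, and S = 1023 gives 524288 bytes for the ECX=2 cache.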


Initial EAX  Register  Return value  Information Provided about the Processor
80000006H    EAX       0             Reserved
             EBX       0             Reserved
             ECX       64            Bits 7-0: L2 cache Line size in bytes
                       06H           Bits 15-12: L2 associativity field
                       512           Bits 31-16: L2 cache size in 1K units
             EDX       0             Reserved
                                     Reserved
80000007H    EAX       0             Reserved
             EBX       0             Reserved
             ECX       0             Reserved
             EDX       0             Reserved
                                     Virtual/Physical Address size
80000008H    EAX       40            Bits 7-0: #Physical Address Bits
                       48            Bits 15-8: #Virtual Address Bits
             EBX       0             Reserved
             ECX       0             Reserved
             EDX       0             Reserved

Table B.7: Information Returned by CPUID Instruction. 8000000xH leafs. (Contd.)
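
The packed fields of leaves 80000006H and 80000008H can be unpacked with shifts and masks. The sketch below follows the bit positions in Table B.7; all function names are hypothetical, and the register values are passed in by the caller.

```c
#include <stdint.h>

/* Leaf 80000008H, EAX: address widths. */
static unsigned cpuid_phys_addr_bits(uint32_t eax) { return eax & 0xFFu; }
static unsigned cpuid_virt_addr_bits(uint32_t eax) { return (eax >> 8) & 0xFFu; }

/* Leaf 80000006H, ECX: L2 cache description. */
static unsigned cpuid_l2_line_size_bytes(uint32_t ecx) { return ecx & 0xFFu; }
static unsigned cpuid_l2_assoc_field(uint32_t ecx)     { return (ecx >> 12) & 0xFu; }
static unsigned cpuid_l2_size_kbytes(uint32_t ecx)     { return (ecx >> 16) & 0xFFFFu; }
```

An EAX value of 00003028H, for instance, decodes to 40 physical and 48 virtual address bits, the values listed in Table B.7.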

Operation

IA32_BIOS_SIGN_ID MSR = Update with installed microcode revision number;
case (EAX)
{
    EAX == 0:
        EAX = 01H;                 // Highest basic function CPUID input value
        EBX = "Genu";
        ECX = "ntel";
        EDX = "ineI";
        break;
    EAX == 2H:                     // Cache and TLB information
        EAX = 0;
        EBX = 0;
        ECX = 0;
        EDX = 0;
        break;
    EAX == 3H:                     // PSN features
        EAX = 0;
        EBX = 0;
        ECX = 0;
        EDX = 0;
        break;
    EAX == 4H:                     // Deterministic Cache Parameters Leaf
        EAX = *see table*
        EBX = *see table*
        ECX = *see table*
        EDX = *see table*
        break;
    EAX == 20000000H:
        EAX = 01H;                 // Reserved
        EBX = 0;                   // Reserved
        ECX = 0;                   // Reserved
        EDX = 0;                   // Reserved
        break;
    EAX == 20000001H:
        EAX = 0;                   // Reserved
        EBX = 0;                   // Reserved
        ECX = 0;                   // Reserved
        EDX = 00000010H;           // Reserved
        break;
    EAX == 80000000H:              // Extended leaf
        EAX = 08H;                 // Highest extended function CPUID input value
        EBX = 0;                   // Reserved
        ECX = 0;                   // Reserved
        EDX = 0;                   // Reserved
        break;
    EAX == 80000001H:
        EAX = 0;                   // Reserved
        EBX = 0;                   // Reserved
        ECX[0]     = 1;            // LAHF/SAHF support in 64 bit mode
        ECX[31:1]  = 0;            // Reserved
        EDX[10:0]  = 0;            // Reserved
        EDX[11]    = 1;            // SYSCALL/SYSRET available in 64 bit mode
        EDX[19:12] = 0;            // Reserved
        EDX[20]    = 0;            // Execute Disable Bit available
        EDX[28:21] = 0;            // Reserved
        EDX[29]    = 1;            // Intel(R) 64 Technology available
        EDX[31:30] = 0;            // Reserved
        break;
    EAX == 80000002H:
        EAX = 0;                   // Processor Brand String
        EBX = 0;                   // Processor Brand String Continued
        ECX = 0;                   // Processor Brand String Continued
        EDX = 0;                   // Processor Brand String Continued
        break;
    EAX == 80000003H:
        EAX = 0;                   // Processor Brand String Continued
        EBX = 0;                   // Processor Brand String Continued
        ECX = 0;                   // Processor Brand String Continued
        EDX = 0;                   // Processor Brand String Continued
        break;
    EAX == 80000004H:
        EAX = 0;                   // Processor Brand String Continued
        EBX = 0;                   // Processor Brand String Continued
        ECX = 0;                   // Processor Brand String Continued
        EDX = 0;                   // Processor Brand String Continued
        break;
    EAX == 80000005H:
        EAX = 0;                   // Reserved
        EBX = 0;                   // Reserved
        ECX = 0;                   // Reserved
        EDX = 0;                   // Reserved
        break;
    EAX == 80000006H:
        EAX = 0;                   // Reserved
        EBX = 0;                   // Reserved
        ECX[7:0]   = 64;           // L2 cache Line size in bytes
        ECX[15:12] = 6;            // L2 associativity field (8-way)
        ECX[31:16] = 256;          // L2 cache size in 1K units
        EDX = 0;                   // Reserved
        break;
    EAX == 80000007H:
        EAX = 0;                   // Reserved
        EBX = 0;                   // Reserved
        ECX = 0;                   // Reserved
        EDX = 0;                   // Reserved
        break;
    EAX == 80000008H:
        EAX[7:0]   = 40;           // Physical Address bits
        EAX[15:8]  = 48;           // Virtual Address bits
        EAX[31:16] = 0;            // Reserved
        EBX = 0;                   // Reserved
        ECX = 0;                   // Reserved
        EDX = 0;                   // Reserved
        break;
    default, EAX == 1H:
        EAX[3:0]   = Stepping ID;
        EAX[7:4]   = *see table*   // Model
        EAX[11:8]  = 1011B;        // Family
        EAX[13:12] = 00B;          // Processor type
        EAX[15:14] = 00B;          // Reserved
        EAX[19:16] = 0000B;        // Extended Model
        EAX[27:20] = 00000000B;    // Extended Family
        EAX[31:28] = 0H;           // Reserved
        EBX[7:0]   = 00H;          // Brand Index (* Reserved if the value is zero *)
        EBX[15:8]  = 8;            // CLEVICT1/CLFLUSH Line Size (x8)
        EBX[23:16] = 248;          // Maximum number of logical processors
        EBX[31:24] = Initial Apic ID;
        ECX = 00000000H;           // Feature flags
        EDX = 110193FFH;           // Feature flags
        break;
}


Flags Affected
None.

Intel® C/C++ Compiler Intrinsic Equivalent
None

Exceptions
None.


Bit #  Mnemonic  Return Value  Description
0      FPU       1             Floating-point Unit On-Chip. The processor contains an x87 FPU.
1      VME       1             Virtual 8086 Mode Enhancements. Virtual 8086 mode enhancements, including CR4.VME for controlling the feature, CR4.PVI for protected mode virtual interrupts, software interrupt indirection, expansion of the TSS with the software indirection bitmap, and EFLAGS.VIF and EFLAGS.VIP flags.
2      DE        1             Debugging Extensions. Support for I/O breakpoints, including CR4.DE for controlling the feature, and optional trapping of accesses to DR4 and DR5.
3      PSE       1             Page Size Extension. Large pages of size 4 MByte are supported, including CR4.PSE for controlling the feature, the defined dirty bit in PDE (Page Directory Entries), optional reserved bit trapping in CR3, PDEs, and PTEs.
4      TSC       1             Time Stamp Counter. The RDTSC instruction is supported, including CR4.TSD for controlling privilege.
5      MSR       1             Model Specific Registers RDMSR and WRMSR Instructions. The RDMSR and WRMSR instructions are supported. Some of the MSRs are implementation dependent.
6      PAE       1             Physical Address Extension. Physical addresses greater than 32 bits are supported: extended page table entry formats, an extra level in the page translation tables is defined, 2-MByte pages are supported instead of 4 Mbyte pages if PAE bit is 1. The actual number of address bits beyond 32 is not defined, and is implementation specific.
7      MCE       1             Machine Check Exception. Exception 18 is defined for Machine Checks, including CR4.MCE for controlling the feature. This feature does not define the model-specific implementations of machine-check error logging, reporting, and processor shutdowns. Machine Check exception handlers may have to depend on processor version to do model specific processing of the exception, or test for the presence of the Machine Check feature.
8      CX8       1             CMPXCHG8B Instruction. The compare-and-exchange 8 bytes (64 bits) instruction is supported (implicitly locked and atomic).
9      APIC      ?             APIC On-Chip. The processor contains an Advanced Programmable Interrupt Controller (APIC), responding to memory mapped commands in the physical address range FFFE0000H to FFFE0FFFH (by default - some processors permit the APIC to be relocated).
10     Reserved  0             Reserved
11     SEP       0             SYSENTER and SYSEXIT Instructions. The SYSENTER and SYSEXIT and associated MSRs are supported.
12     MTRR      1             Memory Type Range Registers. MTRRs are supported. The MTRRcap MSR contains feature bits that describe what memory types are supported, how many variable MTRRs are supported, and whether fixed MTRRs are supported.
13     PGE       0             PTE Global Bit. The global bit in page directory entries (PDEs) and page table entries (PTEs) is supported, indicating TLB entries that are common to different processes and need not be flushed. The CR4.PGE bit controls this feature.
14     MCA       0             Machine Check Architecture. The Machine Check Architecture, which provides a compatible mechanism for error reporting in P6 family, Pentium® 4, Intel® Xeon® processors, and future processors, is supported. The MCG_CAP MSR contains feature bits describing how many banks of error reporting MSRs are supported.

Table B.8: Feature Information Returned in the EDX Register (CPUID.EAX[01h].EDX)


Bit #  Mnemonic     Return Value  Description
15     CMOV         0             Conditional Move Instructions. The conditional move instruction CMOV is supported. In addition, if x87 FPU is present as indicated by the CPUID.FPU feature bit, then the FCOMI and FCMOV instructions are supported.
16     PAT          1             Page Attribute Table. Page Attribute Table is supported. This feature augments the Memory Type Range Registers (MTRRs), allowing an operating system to specify attributes of memory on a 4K granularity through a linear address.
17     PSE-36       0             36-Bit Page Size Extension. Extended 4-MByte pages that are capable of addressing physical memory beyond 4 GBytes are supported. This feature indicates that the upper four bits of the physical address of the 4-MByte page is encoded by bits 13-16 of the page directory entry.
18     PSN          0             Processor Serial Number. The processor supports the 96-bit processor identification number feature and the feature is enabled.
19     CLFSH        0             CLFLUSH Instruction. CLFLUSH Instruction is supported.
20     Reserved     0             Reserved
21     DS           0             Debug Store. The processor supports the ability to write debug information into a memory resident buffer. This feature is used by the branch trace store (BTS) and precise event-based sampling (PEBS) facilities (see Chapter 15, Debugging and Performance Monitoring, in the IA-32 Intel® Architecture Software Developer's Manual, Volume 3).
22     ACPI         0             Thermal Monitor and Software Controlled Clock Facilities. The processor implements internal MSRs that allow processor temperature to be monitored and processor performance to be modulated in predefined duty cycles under software control.
23     Intel® MMX™  0             Intel® MMX™ Technology. The processor supports the Intel® MMX™ technology.
24     FXSR         1             FXSAVE and FXRSTOR Instructions. The FXSAVE and FXRSTOR instructions are supported for fast save and restore of the floating-point context. Presence of this bit also indicates that CR4.OSFXSR is available for an operating system to indicate that it supports the FXSAVE and FXRSTOR instructions.
25     Intel® SSE   0             Intel® SSE. The processor supports the Intel® SSE extensions.
26     Intel® SSE2  0             Intel® SSE2. The processor supports the Intel® SSE2 extensions.
27     SS           0             Self Snoop. The processor supports the management of conflicting memory types by performing a snoop of its own cache structure for transactions issued to the bus.
28     HTT          1             Multi-Threading. The physical processor package is capable of supporting more than one logical processor.
29     TM           0             Thermal Monitor. The processor implements the thermal monitor automatic thermal control circuitry (TCC).
30     Reserved     0             Reserved
31     PBE          0             Pending Break Enable. The processor supports the use of the FERR#/PBE# pin when the processor is in the stop-clock state (STPCLK# is asserted) to signal the processor that an interrupt is pending and that the processor should return to normal operation to handle the interrupt. Bit 10 (PBE enable) in the IA32_MISC_ENABLE MSR enables this capability.

Table B.9: Feature Information Returned in the EDX Register (CPUID.EAX[01h].EDX) (Contd.)


Bit #    Mnemonic             Return Value  Description
0        Intel® SSE3          0             Streaming SIMD Extensions 3 (SSE3). A value of 1 indicates the processor supports this technology.
1-2      Reserved             0             Reserved
3        MONITOR              0             MONITOR/MWAIT. A value of 1 indicates the processor supports this feature.
4        DS-CPL               0             CPL Qualified Debug Store. A value of 1 indicates the processor supports the extensions to the Debug Store feature to allow for branch message storage qualified by CPL.
5        VMX                  0             Virtual Machine Extensions. A value of 1 indicates that the processor supports this technology.
6        Reserved             0             Reserved
7        EST                  0             Enhanced Intel® SpeedStep® technology. A value of 1 indicates that the processor supports this technology.
8        TM2                  0             Thermal Monitor 2. A value of 1 indicates that the processor supports this technology.
9        SSSE3                0             Supplemental Streaming SIMD Extensions 3 (SSSE3). A value of 1 indicates the processor supports this technology.
10       CNXT-ID              0             L1 Context ID. A value of 1 indicates the L1 data cache mode can be set to either adaptive mode or shared mode. A value of 0 indicates this feature is not supported. See definition of the IA32_MISC_ENABLE MSR Bit 24 (L1 Data Cache Context Mode) for details.
11-12    Reserved             0             Reserved
13       CMPXCHG16B           0             CMPXCHG16B Available. A value of 1 indicates that the feature is available. See the CMPXCHG8B/CMPXCHG16B - Compare and Exchange Bytes section in Volume 2A.
14       xTPR Update Control  0             xTPR Update Control. A value of 1 indicates that the processor supports changing IA32_MISC_ENABLES[bit 23].
15       PDCM                 0             Perf/Debug Capability MSR. A value of 1 indicates that the processor supports the performance and debug feature indication MSR.
18 - 16  Reserved             0             Reserved
19       Intel® SSE4.1        0             Intel® Streaming SIMD Extensions 4.1 (Intel® SSE4.1). A value of 1 indicates the processor supports this technology.
20       Intel® SSE4.2        0             Intel® Streaming SIMD Extensions 4.2 (Intel® SSE4.2). A value of 1 indicates the processor supports this technology.
22 - 21  Reserved             0             Reserved
23       POPCNT               0 (a)         POPCNT. A value of 1 indicates the processor supports the POPCNT instruction.
31 - 24  Reserved             0             Reserved

Table B.10: Feature Information Returned in the ECX Register (CPUID.EAX[01h].ECX)
(a) CPUID bit 23 erroneously indicates that POPCNT is not supported. Intel® Xeon Phi™ coprocessor does support the POPCNT instruction.
See Appendix A for more information.
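
Feature flags from Tables B.8 through B.10 are tested one bit at a time. The following is a minimal sketch of such a test, including a special case for the POPCNT erratum noted above; the function names and the is_xeon_phi parameter are hypothetical, and the register values are supplied by the caller rather than read with CPUID.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if bit `bit` (0-31) of a CPUID feature register is set. */
static bool cpuid_feature_bit(uint32_t reg, unsigned bit)
{
    return bit < 32u && ((reg >> bit) & 1u) != 0;
}

/* POPCNT availability. On this coprocessor ECX bit 23 reads 0 even though
   the instruction is supported, so the caller states separately (through
   the hypothetical is_xeon_phi flag) whether it is running on this part. */
static bool cpuid_has_popcnt(uint32_t leaf1_ecx, bool is_xeon_phi)
{
    return is_xeon_phi || cpuid_feature_bit(leaf1_ecx, 23u);
}
```

With EDX = 110193FFH, for example, the FXSR flag (bit 24) tests as set and the Intel® SSE flag (bit 25) tests as clear.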


APPENDIX C. FLOATING-POINT EXCEPTION SUMMARY

Appendix C
Floating-Point Exception Summary
C.1

Instruction oating-point exception summary

Table C.3 shows all those instruction that can generate a loating-point exception. Each type of exception is
shown per instruction. For each table entry you will ind one of the following symbols:
• Nothing : Exception of that type cannot be produced by that instruction.
• Yboth : The instruction can produce that exception. The exception may be produced by either the operation
or the data-type conversion applied to memory operand.
• Yconv : The instruction can produce that exception. That exception can only be produced by the data-type
conversion applied to memory operand.
• Yoper : The instruction can produce that exception. The exception can only be produced by the operation.
The data-type conversion applied to the memory operand cannot produce any exception.
Instruction
vaddpd
vaddps
vaddnpd
vaddnps
vaddsetsps
vblendmps
vbroadcastf32x4
vbroadcastss
vcmppd
vcmpps
vcvtpd2ps
vcvtps2pd
vcvtfxpntdq2ps
vcvtfxpntpd2dq
vcvtfxpntpd2udq


#I
Yboth
Yboth
Yboth
Yboth
Yboth
Yconv
Yconv
Yconv
Yboth
Yboth
Yboth
Yboth
Yboth
Yboth

#D
Yoper
Yoper
Yoper
Yoper
Yoper

Yoper
Yoper
Yoper
Yoper

#Z

#O
Yoper
Yoper
Yoper
Yoper
Yoper

#U
Yoper
Yoper
Yoper
Yoper
Yoper

#P
Yoper
Yoper
Yoper
Yoper
Yoper

Yoper

Yoper

Yoper
Yoper
Yoper
Yoper


Instruction
vcvtfxpntps2dq
vcvtfxpntps2udq
vcvtfxpntudq2ps
vexp223ps
vfixupnanpd
vfixupnanps
vfmadd132pd
vfmadd132ps
vfmadd213pd
vfmadd213ps
vfmadd231pd
vfmadd231ps
vfmadd233ps
vfmsub132pd
vfmsub132ps
vfmsub213pd
vfmsub213ps
vfmsub231pd
vfmsub231ps
vfnmadd132pd
vfnmadd132ps
vfnmadd213pd
vfnmadd213ps
vfnmadd231pd
vfnmadd231ps
vfnmsub132pd
vfnmsub132ps
vfnmsub213pd
vfnmsub213ps
vfnmsub231pd
vfnmsub231ps
vgatherdps
vgetexppd
vgetexpps
vgetmantpd
vgetmantps
vgmaxpd
vgmaxps
vgmaxabsps
vgminpd
vgminps
vloadunpackhps
vloadunpacklps
vlog2ps
vmovaps (load)
vmovaps (store)
vmulpd
vmulps


#I
Yboth
Yboth

#D

#Z

#O

#U

#P
Yoper
Yoper
Yoper

Yoper
Yoper
Yoper
Yoper
Yoper
Yoper
Yoper
Yoper
Yoper
Yoper
Yoper
Yoper
Yoper
Yoper
Yoper
Yoper
Yoper
Yoper
Yoper
Yoper
Yoper
Yoper
Yoper
Yoper
Yoper

Yoper
Yoper
Yoper
Yoper
Yoper
Yoper
Yoper
Yoper
Yoper
Yoper
Yoper
Yoper
Yoper
Yoper
Yoper
Yoper
Yoper
Yoper
Yoper
Yoper
Yoper
Yoper
Yoper
Yoper
Yoper

Yoper
Yoper
Yoper
Yoper
Yoper
Yoper
Yoper
Yoper
Yoper
Yoper
Yoper
Yoper
Yoper
Yoper
Yoper
Yoper
Yoper
Yoper
Yoper
Yoper
Yoper
Yoper
Yoper
Yoper
Yoper

Yconv
Yoper
Yoper

Yconv
Yoper
Yoper

Yconv
Yoper
Yoper

Yoper
Yboth
Yboth
Yboth
Yboth
Yboth
Yboth
Yboth
Yboth
Yboth
Yboth
Yboth
Yboth
Yboth
Yboth
Yboth
Yboth
Yboth
Yboth
Yboth
Yboth
Yboth
Yboth
Yboth
Yboth
Yboth
Yboth
Yboth
Yconv
Yboth
Yboth
Yboth
Yboth
Yboth
Yboth
Yboth
Yboth
Yboth
Yconv
Yconv
Yboth
Yconv
Yconv
Yboth
Yboth

Yoper
Yoper
Yoper
Yoper
Yoper
Yoper
Yoper
Yoper
Yoper
Yoper
Yoper
Yoper
Yoper
Yoper
Yoper
Yoper
Yoper
Yoper
Yoper
Yoper
Yoper
Yoper
Yoper
Yoper
Yoper
Yoper
Yoper
Yoper
Yoper
Yoper
Yoper
Yoper
Yoper
Yoper

Yoper
Yconv
Yoper
Yoper


Instruction
vpackstorehps
vpackstorelps
vrcp23ps
vrndfxpntpd
vrndfxpntps
vrsqrt23ps
vscaleps
vscatterdps
vsubpd
vsubps
vsubrpd
vsubrps

C.2

#I
Yconv
Yconv
Yboth
Yboth
Yboth
Yboth
Yoper
Yconv
Yboth
Yboth
Yboth
Yboth

#D
Yconv
Yconv

#Z

#O
Yconv
Yconv

#U
Yconv
Yconv

#P
Yconv
Yconv

Yoper
Yoper
Yoper
Yoper
Yoper
Yconv
Yoper
Yoper
Yoper
Yoper

Yoper
Yconv
Yoper
Yoper
Yoper
Yoper

Yoper
Yconv
Yoper
Yoper
Yoper
Yoper

Yoper
Yconv
Yoper
Yoper
Yoper
Yoper

Conversion oating-point exception summary
Float-to- loat
Float16 to loat32
Float32 to loat64
Float32 to loat16

SwizzUpConv/UpConv
VCVTPS2PD
DownConv

Float64 to loat32

VCVTPD2PS

Integer-to- loat
Uint8/16 to loat32
Sint8/16 to loat32
Uint32 to loat32
Sint32 to loat32
Uint32 to loat64
Sint32 to loat64
Float-to-integer
Float32 to uint8/16
Float32 to sint8/16

UpConv
UpConv
VCVTFXPNTUDQ2PS
VCVTFXPNTDQ2PS
VCVTUDQ2PD
VCVTDQ2PD
DownConv
DownConv

Float32 to uint32

VCVTFXPNTPS2UDQ

Float32 to sint32

VCVTFXPNTPS2DQ

Float64 to uint32

VCVTFXPNTPD2UDQ

Float64 to sint32

VCVTFXPNTPD2DQ

Reference Number: 327364-001

Invalid (on SNaN)
Invalid (on SNaN), Denormal
Invalid (on SNaN), Over low, Under low,
Precision, Denormal
Invalid (on SNaN), Over low, Under low,
Precision, Denormal
None
None
Precision
Precision
None
None
Invalid (on NaN, out-of-range), Precision
(if in-range but input not integer)
Invalid (on NaN, out-of-range), Precision
(if in-range but input not integer)
Invalid (on NaN, out-of-range), Precision
(if in-range but input not integer)
Invalid (on NaN, out-of-range), Precision
(if in-range but input not integer)
Invalid (on NaN, out-of-range), Precision
(if in-range but input not integer)
Invalid (on NaN, out-of-range), Precision
(if in-range but input not integer)


Out-of-range values depend on the operation definition and the rounding mode. Table C.3 and Table C.4 describe
the maximum and minimum allowed values for float-to-integer and float-to-float conversion, respectively. Note
that the presented ranges apply after "Denormals Are Zero (DAZ)" has been applied.
Entries in Table C.4 labelled with an asterisk (*) are not required for the Intel® Xeon Phi™ coprocessor.

C.3 Denormal behavior

Instruction           Treat Input Denormals As Zeros    Flush Tiny Results To Zero
vaddpd                MXCSR.DAZ                         MXCSR.FZ
vaddps                MXCSR.DAZ                         MXCSR.FZ
vaddnpd               MXCSR.DAZ                         MXCSR.FZ
vaddnps               MXCSR.DAZ                         MXCSR.FZ
vaddsetsps            MXCSR.DAZ                         MXCSR.FZ
vblendmpd             NO                                NO
vblendmps             NO                                NO
vcmppd                MXCSR.DAZ                         Not Applicable
vcmpps                MXCSR.DAZ                         Not Applicable
vcvtdq2pd             Not Applicable                    Not Applicable
vcvtpd2ps             MXCSR.DAZ                         MXCSR.FZ
vcvtps2pd             MXCSR.DAZ                         Not Applicable
vcvtudq2pd            Not Applicable                    Not Applicable
vcvtfxpntdq2ps        Not Applicable                    Not Applicable
vcvtfxpntpd2dq        MXCSR.DAZ                         Not Applicable
vcvtfxpntpd2udq       MXCSR.DAZ                         Not Applicable
vcvtfxpntps2dq        MXCSR.DAZ                         Not Applicable
vcvtfxpntps2udq       MXCSR.DAZ                         Not Applicable
vcvtfxpntudq2ps       Not Applicable                    Not Applicable
vexp223ps             Not Applicable                    YES
vfixupnanpd           MXCSR.DAZ                         NO
vfixupnanps           MXCSR.DAZ                         NO
vfmadd132pd           MXCSR.DAZ                         MXCSR.FZ
vfmadd132ps           MXCSR.DAZ                         MXCSR.FZ
vfmadd213pd           MXCSR.DAZ                         MXCSR.FZ
vfmadd213ps           MXCSR.DAZ                         MXCSR.FZ
vfmadd231pd           MXCSR.DAZ                         MXCSR.FZ
vfmadd231ps           MXCSR.DAZ                         MXCSR.FZ
vfmadd233ps           MXCSR.DAZ                         MXCSR.FZ
vfmsub132pd           MXCSR.DAZ                         MXCSR.FZ
vfmsub132ps           MXCSR.DAZ                         MXCSR.FZ
vfmsub213pd           MXCSR.DAZ                         MXCSR.FZ
vfmsub213ps           MXCSR.DAZ                         MXCSR.FZ
vfmsub231pd           MXCSR.DAZ                         MXCSR.FZ
vfmsub231ps           MXCSR.DAZ                         MXCSR.FZ
vfnmadd132pd          MXCSR.DAZ                         MXCSR.FZ


Instruction              Treat Input Denormals As Zeros    Flush Tiny Results To Zero
vfnmadd132ps             MXCSR.DAZ                         MXCSR.FZ
vfnmadd213pd             MXCSR.DAZ                         MXCSR.FZ
vfnmadd213ps             MXCSR.DAZ                         MXCSR.FZ
vfnmadd231pd             MXCSR.DAZ                         MXCSR.FZ
vfnmadd231ps             MXCSR.DAZ                         MXCSR.FZ
vfnmsub132pd             MXCSR.DAZ                         MXCSR.FZ
vfnmsub132ps             MXCSR.DAZ                         MXCSR.FZ
vfnmsub213pd             MXCSR.DAZ                         MXCSR.FZ
vfnmsub213ps             MXCSR.DAZ                         MXCSR.FZ
vfnmsub231pd             MXCSR.DAZ                         MXCSR.FZ
vfnmsub231ps             MXCSR.DAZ                         MXCSR.FZ
vgatherdpd               NO                                NO
vgatherdps               NO                                NO
vgatherpf0dps            NO                                NO
vgatherpf0hintdpd        NO                                NO
vgatherpf0hintdps        NO                                NO
vgatherpf1dps            NO                                NO
vgetexppd                MXCSR.DAZ                         Not Applicable
vgetexpps                MXCSR.DAZ                         Not Applicable
vgetmantpd               MXCSR.DAZ                         Not Applicable
vgetmantps               MXCSR.DAZ                         Not Applicable
vgmaxpd                  MXCSR.DAZ                         NO
vgmaxps                  MXCSR.DAZ                         NO
vgmaxabsps               MXCSR.DAZ                         NO
vgminpd                  MXCSR.DAZ                         NO
vgminps                  MXCSR.DAZ                         NO
vloadunpackhpd           NO                                NO
vloadunpackhps           NO                                NO
vloadunpacklpd           NO                                NO
vloadunpacklps           NO                                NO
vlog2ps                  YES                               YES
vmovapd (load)           NO                                NO
vmovapd (store)          NO (DAZ*)                         NO
vmovaps (load)           NO                                NO
vmovaps (store)          NO (DAZ*)                         NO
vmovnrapd (load)         NO                                NO
vmovnrapd (store)        NO (DAZ*)                         NO
vmovnraps (load)         NO                                NO
vmovnraps (store)        NO (DAZ*)                         NO
vmovnrngoapd (load)      NO                                NO
vmovnrngoapd (store)     NO (DAZ*)                         NO
vmovnrngoaps (load)      NO                                NO
vmovnrngoaps (store)     NO (DAZ*)                         NO
vmulpd                   MXCSR.DAZ                         MXCSR.FZ
vmulps                   MXCSR.DAZ                         MXCSR.FZ
vpackstorehpd            NO (DAZ*)                         NO
vpackstorehps            NO (DAZ*)                         NO
vpackstorelpd            NO (DAZ*)                         NO


Instruction              Treat Input Denormals As Zeros    Flush Tiny Results To Zero
vpackstorelps            NO (DAZ*)                         NO
vrcp23ps                 YES                               YES
vrndfxpntpd              MXCSR.DAZ                         NO
vrndfxpntps              MXCSR.DAZ                         NO
vrsqrt23ps               YES                               YES
vscaleps                 MXCSR.DAZ                         MXCSR.FZ
vscatterdpd              NO (DAZ*)                         NO
vscatterdps              NO (DAZ*)                         NO
vscatterpf0dps           NO                                NO
vscatterpf0hintdpd       NO                                NO
vscatterpf0hintdps       NO                                NO
vscatterpf1dps           NO                                NO
vsubpd                   MXCSR.DAZ                         MXCSR.FZ
vsubps                   MXCSR.DAZ                         MXCSR.FZ
vsubrpd                  MXCSR.DAZ                         MXCSR.FZ
vsubrps                  MXCSR.DAZ                         MXCSR.FZ
(*) FP32 down-conversion obeys MXCSR.DAZ
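
To make the two columns of the denormal-behavior table concrete, the following is a minimal sketch of one float32 lane of an add-like operation whose entries read MXCSR.DAZ and MXCSR.FZ. It is an illustration under stated assumptions, not the hardware algorithm: `add_lane`, `daz` and `fz` are hypothetical names, tininess is checked on the rounded result, and results too large for float32 are not handled.

```python
import math
import struct

MIN_NORMAL_F32 = 2.0 ** -126  # smallest normal float32 magnitude


def to_f32(x):
    """Round a Python float to the nearest float32 value."""
    return struct.unpack('<f', struct.pack('<f', x))[0]


def is_denormal_f32(x):
    """True for non-zero values below the float32 normal range."""
    return x != 0.0 and abs(x) < MIN_NORMAL_F32


def signed_zero(x):
    """Zero carrying the sign of x."""
    return math.copysign(0.0, x)


def add_lane(a, b, daz=False, fz=False):
    """One float32 lane of an add, with optional DAZ and FZ handling."""
    a, b = to_f32(a), to_f32(b)
    if daz:
        # Treat input denormals as zeros before operating.
        a = signed_zero(a) if is_denormal_f32(a) else a
        b = signed_zero(b) if is_denormal_f32(b) else b
    result = to_f32(a + b)
    if fz and is_denormal_f32(result):
        # Flush a tiny result to zero, keeping its sign.
        result = signed_zero(result)
    return result
```

For example, with `daz=True` two denormal inputs of 2^-140 sum to zero, while with both flags clear the sum is the denormal 2^-139.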


Conversion           Context            Rounding   Min                                          Max
Float32 to uint8     DownConv           RN         0xbf000000 (-0.5)                            0x437f7fff (255.5 - 1ulp)
Float32 to sint8     DownConv           RN         0xc3008000 (-128.5)                          0x42feffff (127.5 - 1ulp)
Float32 to uint16    DownConv           RN         0xbf000000 (-0.5)                            0x477fff7f (65535.5 - 1ulp)
Float32 to sint16    DownConv           RN         0xc7000080 (-32768.5)                        0x46fffeff (32767.5 - 1ulp)
Float32 to uint32    VCVTFXPNTPS2UDQ    RN         0xbf000000 (-0.5)                            0x4f7fffff (2^32 - 1ulp)
Float32 to uint32    VCVTFXPNTPS2UDQ    RD         0x80000000 (-0.0)                            0x4f7fffff (2^32 - 1ulp)
Float32 to uint32    VCVTFXPNTPS2UDQ    RU         0xbf7fffff (-1.0 + 1ulp)                     0x4f7fffff (2^32 - 1ulp)
Float32 to uint32    VCVTFXPNTPS2UDQ    RZ         0xbf7fffff (-1.0 + 1ulp)                     0x4f7fffff (2^32 - 1ulp)
Float32 to sint32    VCVTFXPNTPS2DQ     RN         0xcf000000 (-2^31)                           0x4effffff (2^31 - 1ulp)
Float32 to sint32    VCVTFXPNTPS2DQ     RD         0xcf000000 (-2^31)                           0x4effffff (2^31 - 1ulp)
Float32 to sint32    VCVTFXPNTPS2DQ     RU         0xcf000000 (-2^31)                           0x4effffff (2^31 - 1ulp)
Float32 to sint32    VCVTFXPNTPS2DQ     RZ         0xcf000000 (-2^31)                           0x4effffff (2^31 - 1ulp)
Float64 to uint32    VCVTFXPNTPD2UDQ    RN         0xbfe0000000000000 (-0.5)                    0x41efffffffefffff (2^32 - 0.5 - 1ulp)
Float64 to uint32    VCVTFXPNTPD2UDQ    RD         0x8000000000000000 (-0.0)                    0x41efffffffffffff (2^32 - 1ulp)
Float64 to uint32    VCVTFXPNTPD2UDQ    RU         0xbfefffffffffffff (-1.0 + 1ulp)             0x41efffffffe00000 (2^32 - 1.0)
Float64 to uint32    VCVTFXPNTPD2UDQ    RZ         0xbfefffffffffffff (-1.0 + 1ulp)             0x41efffffffffffff (2^32 - 1ulp)
Float64 to sint32    VCVTFXPNTPD2DQ     RN         0xc1e0000000100000 (-2^31 - 0.5)             0x41dfffffffdfffff (2^31 - 0.5 - 1ulp)
Float64 to sint32    VCVTFXPNTPD2DQ     RD         0xc1e0000000000000 (-2^31)                   0x41dfffffffffffff (2^31 - 1ulp)
Float64 to sint32    VCVTFXPNTPD2DQ     RU         0xc1e00000001fffff (-2^31 - 1.0 + 1ulp)      0x41dfffffffc00000 (2^31 - 1.0)
Float64 to sint32    VCVTFXPNTPD2DQ     RZ         0xc1e00000001fffff (-2^31 - 1.0 + 1ulp)      0x41dfffffffffffff (2^31 - 1ulp)

Table C.3: Float-to-integer Max/Min Valid Range
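
The hexadecimal bounds in Table C.3 are raw IEEE 754 bit patterns. A minimal sketch of turning them back into values, and of an in-range test for the Float32-to-uint8 (RN) row, is given below. The helper names are hypothetical, and the check mirrors only the table bounds, not any other aspect of the conversion.

```python
import struct


def f32_from_bits(bits):
    """Interpret a 32-bit pattern as an IEEE 754 single-precision value."""
    return struct.unpack('<f', struct.pack('<I', bits))[0]


def f64_from_bits(bits):
    """Interpret a 64-bit pattern as an IEEE 754 double-precision value."""
    return struct.unpack('<d', struct.pack('<Q', bits))[0]


# Float32 to uint8, DownConv, RN row of Table C.3.
UINT8_RN_MIN = f32_from_bits(0xBF000000)  # -0.5
UINT8_RN_MAX = f32_from_bits(0x437F7FFF)  # 255.5 - 1ulp


def in_range_f32_to_uint8_rn(x):
    """True when x lies inside the valid range; NaN compares false."""
    return UINT8_RN_MIN <= x <= UINT8_RN_MAX
```

Values outside the range (and NaN, for which both comparisons are false) correspond to the Invalid case in the conversion summary.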


Case                  Rounding   Max pos arg w/o overflow                         Min pos arg w/ overflow
Float32 to float16    RN         0x477fefff (65520.0 - 1ulp)                      0x477ff000 (65520.0)
Float32 to float16    RD*        0x477fffff (65536.0 - 1ulp)                      0x47800000 (65536.0)
Float32 to float16    RU*        0x477fe000 (65504.0)                             0x477fe001 (65504.0 + 1ulp)
Float32 to float16    RZ         0x477fffff (65536.0 - 1ulp)                      0x47800000 (65536.0)
Float64 to float32    RN         0x47efffffefffffff (2^128 - 2^103 - 1ulp)        0x47effffff0000000 (2^128 - 2^103)
Float64 to float32    RD         0x47efffffffffffff (2^128 - 1ulp)                0x47f0000000000000 (2^128)
Float64 to float32    RU         0x47efffffe0000000 (2^128 - 2^104)               0x47efffffe0000001 (2^128 - 2^104 + 1ulp)
Float64 to float32    RZ         0x47efffffffffffff (2^128 - 1ulp)                0x47f0000000000000 (2^128)

Case                  Rounding   Max neg arg w/o overflow                         Min neg arg w/ overflow
Float32 to float16    RN         0xc77fefff (-65520.0 + 1ulp)                     0xc77ff000 (-65520.0)
Float32 to float16    RD*        0xc77fe000 (-65504.0)                            0xc77fe001 (-65504.0 - 1ulp)
Float32 to float16    RU*        0xc77fffff (-65536.0 + 1ulp)                     0xc7800000 (-65536.0)
Float32 to float16    RZ         0xc77fffff (-65536.0 + 1ulp)                     0xc7800000 (-65536.0)
Float64 to float32    RN         0xc7efffffefffffff (-2^128 + 2^103 + 1ulp)       0xc7effffff0000000 (-2^128 + 2^103)
Float64 to float32    RD         0xc7efffffe0000000 (-2^128 + 2^104)              0xc7efffffe0000001 (-2^128 + 2^104 - 1ulp)
Float64 to float32    RU         0xc7efffffffffffff (-2^128 + 1ulp)               0xc7f0000000000000 (-2^128)
Float64 to float32    RZ         0xc7efffffffffffff (-2^128 + 1ulp)               0xc7f0000000000000 (-2^128)

Table C.4: Float-to-float Max/Min Valid Range
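
The RN row of Table C.4 for Float32 to float16 can be cross-checked in software. The sketch below relies on an assumption about the host rather than the coprocessor: Python's `struct` half-precision format `'e'` rounds to nearest-even and raises an exception when the rounded value does not fit. `overflows_f16_rn` is a hypothetical helper name. The other rounding modes in the table are not covered.

```python
import struct


def f32_from_bits(bits):
    """Interpret a 32-bit pattern as an IEEE 754 single-precision value."""
    return struct.unpack('<f', struct.pack('<I', bits))[0]


def overflows_f16_rn(x):
    """True if packing x as float16 (round-to-nearest-even) overflows."""
    try:
        struct.pack('<e', x)
    except (OverflowError, struct.error):
        return True
    return False


MAX_POS_NO_OVERFLOW = f32_from_bits(0x477FEFFF)  # 65520.0 - 1ulp
MIN_POS_OVERFLOW = f32_from_bits(0x477FF000)     # 65520.0
```

Under that assumption, `MAX_POS_NO_OVERFLOW` rounds down to the largest float16 value (65504.0), while `MIN_POS_OVERFLOW` is a tie that rounds up past the float16 range.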


APPENDIX D. INSTRUCTION ATTRIBUTES AND CATEGORIES

Appendix D
Instruction Attributes and Categories
This appendix enumerates instruction attributes and categories.


D.1 Conversion Instruction Families

D.1.1 Df32 Family of Instructions

VMOVAPS            VMOVNRAPS          VMOVNRNGOAPS       VPACKSTOREHPS
VPACKSTORELPS      VSCATTERDPS        VSCATTERPF1DPS

D.1.2 Df64 Family of Instructions

VMOVAPD            VMOVNRAPD          VMOVNRNGOAPD       VPACKSTOREHPD
VPACKSTORELPD      VSCATTERDPD

D.1.3 Di32 Family of Instructions

VMOVDQA32          VPACKSTOREHD       VPACKSTORELD       VPSCATTERDD

D.1.4 Di64 Family of Instructions

VMOVDQA64          VPACKSTOREHQ       VPACKSTORELQ       VPSCATTERDQ

D.1.5 Sf32 Family of Instructions

VADDNPS            VADDPS             VADDSETSPS         VBLENDMPS
VCMPPS             VCVTFXPNTPS2DQ     VCVTFXPNTPS2UDQ    VCVTPS2PD
VFMADD132PS        VFMADD213PS        VFMADD231PS        VFMADD233PS
VFMSUB132PS        VFMSUB213PS        VFMSUB231PS        VFNMADD132PS
VFNMADD213PS       VFNMADD231PS       VFNMSUB132PS       VFNMSUB213PS
VFNMSUB231PS       VGETEXPPS          VGETMANTPS         VGMAXABSPS
VGMAXPS            VGMINPS            VMULPS             VRNDFXPNTPS
VSUBPS             VSUBRPS

D.1.6 Sf64 Family of Instructions

VADDNPD            VADDPD             VBLENDMPD          VCMPPD
VCVTFXPNTPD2DQ     VCVTFXPNTPD2UDQ    VCVTPD2PS          VFMADD132PD
VFMADD213PD        VFMADD231PD        VFMSUB132PD        VFMSUB213PD
VFMSUB231PD        VFNMADD132PD       VFNMADD213PD       VFNMADD231PD
VFNMSUB132PD       VFNMSUB213PD       VFNMSUB231PD       VGETEXPPD
VGETMANTPD         VGMAXPD            VGMINPD            VMULPD
VRNDFXPNTPD        VSUBPD             VSUBRPD


D.1.7 Si32 Family of Instructions

VCVTDQ2PD          VCVTFXPNTDQ2PS     VCVTFXPNTUDQ2PS    VCVTUDQ2PD
VFIXUPNANPS        VPADCD             VPADDD             VPADDSETCD
VPADDSETSD         VPANDD             VPANDND            VPBLENDMD
VPCMPD             VPCMPEQD           VPCMPGTD           VPCMPLTD
VPCMPUD            VPMADD231D         VPMADD233D         VPMAXSD
VPMAXUD            VPMINSD            VPMINUD            VPMULHD
VPMULHUD           VPMULLD            VPORD              VPSBBD
VPSBBRD            VPSLLD             VPSLLVD            VPSRAD
VPSRAVD            VPSRLD             VPSRLVD            VPSUBD
VPSUBRD            VPSUBRSETBD        VPSUBSETBD         VPTESTMD
VPXORD             VSCALEPS

D.1.8 Si64 Family of Instructions

VFIXUPNANPD        VPANDNQ            VPANDQ             VPBLENDMQ
VPORQ              VPXORQ

D.1.9 Uf32 Family of Instructions

VBROADCASTF32X4    VBROADCASTSS       VGATHERDPS         VGATHERPF0DPS
VGATHERPF0HINTDPS  VGATHERPF1DPS      VLOADUNPACKHPS     VLOADUNPACKLPS
VMOVAPS            VMOVNRAPS          VMOVNRNGOAPS       VSCATTERPF0DPS
VSCATTERPF0HINTDPS

D.1.10 Uf64 Family of Instructions

VBROADCASTF64X4    VBROADCASTSD       VGATHERDPD         VGATHERPF0HINTDPD
VLOADUNPACKHPD     VLOADUNPACKLPD     VMOVAPD            VMOVNRAPD
VMOVNRNGOAPD       VSCATTERPF0HINTDPD

D.1.11 Ui32 Family of Instructions

VBROADCASTI32X4    VLOADUNPACKHD      VLOADUNPACKLD      VMOVDQA32
VPBROADCASTD       VPGATHERDD

D.1.12 Ui64 Family of Instructions

VBROADCASTI64X4    VLOADUNPACKHQ      VLOADUNPACKLQ      VMOVDQA64
VPBROADCASTQ       VPGATHERDQ


APPENDIX E. NON-FAULTING UNDEFINED OPCODES

Appendix E
Non-faulting Undened Opcodes
The following opcodes are non-faulting and have unde ined behavior:
• MVEX.512.0F38.W0 D2 /r
• MVEX.512.0F38.W0 D3 /r
• MVEX.512.0F38.W0 D6 /r
• MVEX.512.0F38.W0 D7 /r
• MVEX.512.66.0F38.W0 48 /r
• MVEX.512.66.0F38.W0 49 /r
• MVEX.512.66.0F38.W0 4A /r
• MVEX.512.66.0F38.W0 4B /r
• MVEX.512.66.0F38.W0 68 /r
• MVEX.512.66.0F38.W0 69 /r
• MVEX.512.66.0F38.W0 6A /r
• MVEX.512.66.0F38.W0 6B /r
• MVEX.512.66.0F38.W0 B0 /r /vsib
• MVEX.512.66.0F38.W0 B2 /r /vsib
• MVEX.512.66.0F38.W0 C0 /r /vsib
• MVEX.512.66.0F38.W0 D2 /r
• MVEX.512.66.0F38.W0 D6 /r
• MVEX.512.66.0F3A.W0 D0 /r ib
• MVEX.512.66.0F3A.W0 D1 /r ib
• MVEX.NDS.512.66.0F38.W0 54 /r
• MVEX.NDS.512.66.0F38.W0 56 /r
• MVEX.NDS.512.66.0F38.W0 57 /r
• MVEX.NDS.512.66.0F38.W0 67 /r
• MVEX.NDS.512.66.0F38.W0 70 /r
• MVEX.NDS.512.66.0F38.W0 71 /r
• MVEX.NDS.512.66.0F38.W0 72 /r
• MVEX.NDS.512.66.0F38.W0 73 /r
• MVEX.NDS.512.66.0F38.W0 94 /r
• MVEX.NDS.512.66.0F38.W0 CE /r
• MVEX.NDS.512.66.0F38.W0 CF /r
• MVEX.NDS.512.66.0F38.W1 94 /r
• MVEX.NDS.512.66.0F38.W1 CE /r
• VEX.128.F2.0F38.W0 F0 /r
• VEX.128.F2.0F38.W0 F1 /r
• VEX.128.F2.0F38.W1 F0 /r
• VEX.128.F2.0F38.W1 F1 /r
• VEX.128.F3.0F38.W0 F0 /r
• VEX.128.F3.0F38.W1 F0 /r
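
The entries in the opcode list above share a regular notation: prefix, optional NDS, vector length, optional mandatory prefix, opcode map, W bit, opcode byte, and trailing operand markers. A minimal sketch of splitting one entry into those fields follows, for example to build a disassembler blacklist. The field names and the `Encoding` type are hypothetical, chosen here for illustration; they do not come from the manual.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Encoding:
    prefix: str                # "MVEX" or "VEX"
    nds: bool                  # True when the NDS field is present
    length: int                # vector length, e.g. 512 or 128
    pp: Optional[str]          # "66", "F2", "F3", or None
    opmap: str                 # opcode map, e.g. "0F38"
    w: int                     # W bit (0 or 1)
    opcode: int                # opcode byte
    suffixes: Tuple[str, ...]  # e.g. ("/r", "/vsib") or ("/r", "ib")


def parse_encoding(text):
    """Parse one entry such as 'MVEX.NDS.512.66.0F38.W0 54 /r'."""
    head, *rest = text.split()
    fields = head.split('.')
    prefix = fields.pop(0)
    nds = fields[0] == 'NDS'
    if nds:
        fields.pop(0)
    length = int(fields.pop(0))
    pp = fields.pop(0) if fields[0] in ('66', 'F2', 'F3') else None
    opmap = fields.pop(0)
    w = int(fields.pop(0)[1:])  # "W0" -> 0, "W1" -> 1
    opcode = int(rest[0], 16)
    return Encoding(prefix, nds, length, pp, opmap, w, opcode, tuple(rest[1:]))
```

The leading bullet character must be stripped before calling `parse_encoding`.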


APPENDIX F. GENERAL TEMPLATES

Appendix F
General Templates
This appendix describes all the general templates. Each instruction has at least one valid format, and each
format matches one of these templates.

F.1 Mask Operation Templates

Mask m0 - Template

Opcode       Instruction    Description
VEX.128      KOP k1, k2     Operate [mask k1 and] mask k2 [and store the result in k1]

Description
Operand is a register

Bit          7      6      5      4      3      2      1      0
ESCAPE(C5)   1      1      0      0      0      1      0      1
VEX2         1      1      1      1      1      0      p1     p0
OPCODE       OPCODE (bits 7-0)
ModR/M       11 (bits 7-6), reg (K1) (bits 5-3), r (K2) (bits 2-0)

Mask m1 - Template

Opcode       Instruction              Description
VEX.128      KOP r32/r64, k1, imm8    Move mask k1 into r32/r64 using imm8

Description
Operand is a register

Bit          7      6      5      4      3      2      1      0
ESCAPE(C4)   1      1      0      0      0      1      0      0
VEX1         !reg3  1      1      m4     m3     m2     m1     m0
VEX2         W      1      1      1      1      L=0    p1     p0
OPCODE       OPCODE (bits 7-0)
ModR/M       11 (bits 7-6), reg (reg) (bits 5-3), r (K1) (bits 2-0)
{IMM8}       I7     I6     I5     I4     I3     I2     I1     I0

Mask m2 - Template

Opcode       Instruction    Description

Description
Operand is a register

Bit          7      6      5      4      3      2      1      0
ESCAPE(C4)   1      1      0      0      0      1      0      0
VEX1         !reg3  1      1      m4     m3     m2     m1     m0
VEX2         W      1      !K12   !K11   !K10   L=0    p1     p0
OPCODE       OPCODE (bits 7-0)
ModR/M       11 (bits 7-6), reg (reg) (bits 5-3), r (K2) (bits 2-0)

Mask m3 - Template

Opcode       Instruction        Description
VEX.128      KOP r32/r64, k1    Move mask k1 into r32/r64

Description
Operand is a register

Bit          7      6      5      4      3      2      1      0
ESCAPE(C5)   1      1      0      0      0      1      0      1
VEX1         !reg3  1      1      1      1      0      p1     p0
OPCODE       OPCODE (bits 7-0)
ModR/M       11 (bits 7-6), reg (reg) (bits 5-3), r (K1) (bits 2-0)

Mask m4 - Template

Opcode       Instruction        Description
VEX.128      KOP k1, r32/r64    Move r32/r64 into mask k1

Description
Operand is a register

C4 Version

Bit          7      6      5      4      3      2      1      0
ESCAPE(C4)   1      1      0      0      0      1      0      0
VEX1         1      1      !reg3  m4     m3     m2     m1     m0
VEX2         W      1      1      1      1      0      p1     p0
OPCODE       OPCODE (bits 7-0)
ModR/M       11 (bits 7-6), reg (K1) (bits 5-3), r (reg) (bits 2-0)

C5 Version

Bit          7      6      5      4      3      2      1      0
ESCAPE(C5)   1      1      0      0      0      1      0      1
VEX1         1      1      1      1      1      0      p1     p0
OPCODE       OPCODE (bits 7-0)
ModR/M       11 (bits 7-6), reg (K1) (bits 5-3), r (reg) (bits 2-0)

Mask m5 - Template

Opcode       Instruction              Description
VEX.128      KOP k1, r32/r64, imm8    Move r32/r64 field into mask k1 using imm8

Description
Operand is a register

Bit          7      6      5      4      3      2      1      0
ESCAPE(C4)   1      1      0      0      0      1      0      0
VEX1         1      1      !reg3  m4     m3     m2     m1     m0
VEX2         W      1      1      1      1      L=0    p1     p0
OPCODE       OPCODE (bits 7-0)
ModR/M       11 (bits 7-6), reg (K1) (bits 5-3), r (reg) (bits 2-0)
{IMM8}       I7     I6     I5     I4     I3     I2     I1     I0

F.2 Vector Operation Templates

Vector v0 - Template

Opcode       Instruction                               Description
MVEX.512     VOP zmm1 {k1}, zmm2, S(zmm3/mt)           Operate vector zmm2 and vector S(zmm3/mt) [and vector zmm1] and store the result in zmm1, under write-mask k1
MVEX.512     VOP zmm1 {k1}, zmm2, S(zmm3/mt), imm8     Operate vector zmm2 and vector S(zmm3/mt) [and vector zmm1] and store the result in zmm1 using imm8, under write-mask k1

Description
Operand is a register

Bit          7      6      5      4      3      2      1      0
ESCAPE(62)   0      1      1      0      0      0      1      0
MVEX1        !Z13   !Z34   !Z33   !Z14   m3     m2     m1     m0
MVEX2        W      !Z23   !Z22   !Z21   !Z20   L=0    p1     p0
MVEX3        EH     S2     S1     S0     !Z24   K12    K11    K10
OPCODE       OPCODE (bits 7-0)
ModR/M       11 (bits 7-6), reg (ZMM1) (bits 5-3), r (ZMM3) (bits 2-0)
{IMM8}       I7     I6     I5     I4     I3     I2     I1     I0

Operand is a memory location

Bit          7      6      5      4      3      2      1      0
ESCAPE(62)   0      1      1      0      0      0      1      0
MVEX1        !Z13   !X     !B     !Z14   m3     m2     m1     m0
MVEX2        W      !Z23   !Z22   !Z21   !Z20   L=0    p1     p0
MVEX3        EH     S2     S1     S0     !Z24   K12    K11    K10
OPCODE       OPCODE (bits 7-0)
ModR/M       mod (bits 7-6), reg (ZMM1) (bits 5-3), m (mt) (bits 2-0)
{SIB}        SIB byte (bits 7-0)
{DISPL}      Displacement (8*N/32)
{IMM8}       I7     I6     I5     I4     I3     I2     I1     I0
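
As an illustration of how the register form of the Vector v0 template can be read, the sketch below unpacks the three MVEX bytes and the ModR/M byte into operand numbers. It rests on an assumed reading of the field names: `!Z13` is taken to mean "inverted bit 3 of the zmm1 register number", `K12..K10` the three bits of k1, and so on. `decode_v0_register` is a hypothetical helper, and the byte sequences in the usage note are constructed from the template for illustration, not taken from a real instruction.

```python
def decode_v0_register(code):
    """Decode ESCAPE, MVEX1..3, OPCODE, ModR/M of the v0 register form."""
    if code[0] != 0x62:
        raise ValueError("first byte is not ESCAPE(62)")
    mvex1, mvex2, mvex3, opcode, modrm = code[1:6]
    if modrm >> 6 != 0b11:
        raise ValueError("ModR/M bits 7-6 are not 11: not the register form")

    def inv(byte, pos):
        """Return the inverted value of bit `pos` of `byte`."""
        return ((byte >> pos) & 1) ^ 1

    # Register numbers: inverted high bits from the prefix, low bits from ModR/M.
    zmm1 = (inv(mvex1, 4) << 4) | (inv(mvex1, 7) << 3) | ((modrm >> 3) & 7)
    zmm3 = (inv(mvex1, 6) << 4) | (inv(mvex1, 5) << 3) | (modrm & 7)
    zmm2 = (inv(mvex3, 3) << 4) | (~(mvex2 >> 3) & 0xF)
    return {
        "zmm1": zmm1,
        "zmm2": zmm2,
        "zmm3": zmm3,
        "map": mvex1 & 0xF,       # m3..m0
        "W": mvex2 >> 7,
        "pp": mvex2 & 3,          # p1 p0
        "EH": mvex3 >> 7,
        "S": (mvex3 >> 4) & 7,    # S2 S1 S0
        "k1": mvex3 & 7,          # K12 K11 K10
        "opcode": opcode,
    }
```

For instance, under these assumptions the bytes `62 F1 68 09 58 CB` decode to zmm1 = 1, zmm2 = 2, zmm3 = 3 with write-mask k1 = 1.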

Vector v1 - Template

Opcode       Instruction              Description
MVEX.512     VOP zmm1 {k1}, S(mt)     Load/broadcast vector S(mt) into zmm1, under write-mask k1

Description
Operand is a memory location

Bit          7      6      5      4      3      2      1      0
ESCAPE(62)   0      1      1      0      0      0      1      0
MVEX1        !Z13   !X     !B     !Z14   m3     m2     m1     m0
MVEX2        W      1      1      1      1      L=0    p1     p0
MVEX3        EH     S2     S1     S0     1      K12    K11    K10
OPCODE       OPCODE (bits 7-0)
ModR/M       mod (bits 7-6), reg (ZMM1) (bits 5-3), m (mt) (bits 2-0)
{SIB}        SIB byte (bits 7-0)
{DISPL}      Displacement (8*N/32)

Vector v10 - Template

Opcode       Instruction                         Description
MVEX.512     VOP zmm1 {k1}, S(zmm2/mt)           Operate vector S(zmm2/mt) and store the result in zmm1, under write-mask k1
MVEX.512     VOP zmm1 {k1}, S(zmm2/mt), imm8     Operate vector S(zmm2/mt) and store the result in zmm1 using imm8, under write-mask k1
MVEX.512     VOP zmm1 {k1}, S(zmm2/mt)           Move vector S(zmm2/mt) into zmm1, under write-mask k1

Description
Operand is a register

Bit          7      6      5      4      3      2      1      0
ESCAPE(62)   0      1      1      0      0      0      1      0
MVEX1        1      !Z24   !Z23   1      m3     m2     m1     m0
MVEX2        W      !Z13   !Z12   !Z11   !Z10   L=0    p1     p0
MVEX3        EH     S2     S1     S0     !Z14   K12    K11    K10
OPCODE       OPCODE (bits 7-0)
ModR/M       11 (bits 7-6), Op. Ext. (bits 5-3), r (ZMM2) (bits 2-0)
{IMM8}       I7     I6     I5     I4     I3     I2     I1     I0

Operand is a memory location

Bit          7      6      5      4      3      2      1      0
ESCAPE(62)   0      1      1      0      0      0      1      0
MVEX1        1      !X     !B     1      m3     m2     m1     m0
MVEX2        W      !Z13   !Z12   !Z11   !Z10   L=0    p1     p0
MVEX3        EH     S2     S1     S0     !Z14   K12    K11    K10
OPCODE       OPCODE (bits 7-0)
ModR/M       mod (bits 7-6), Op. Ext. (bits 5-3), m (mt) (bits 2-0)
{SIB}        SIB byte (bits 7-0)
{DISPL}      Displacement (8*N/32)
{IMM8}       I7     I6     I5     I4     I3     I2     I1     I0

Vector v11 - Template

Opcode       Instruction                    Description
MVEX.512     VOP zmm1 {k1}, zmm2, S(mt)     Load/broadcast and OP vector S(mt) with zmm2 and write result into zmm1, under write-mask k1

Description
Operand is a memory location

Bit          7      6      5      4      3      2      1      0
ESCAPE(62)   0      1      1      0      0      0      1      0
MVEX1        !Z13   !X     !B     !Z14   m3     m2     m1     m0
MVEX2        W      !Z23   !Z22   !Z21   !Z20   L=0    p1     p0
MVEX3        EH     S2     S1     S0     !Z24   K12    K11    K10
OPCODE       OPCODE (bits 7-0)
ModR/M       mod (bits 7-6), reg (ZMM1) (bits 5-3), m (mt) (bits 2-0)
{SIB}        SIB byte (bits 7-0)
{DISPL}      Displacement (8*N/32)

Vector v2 - Template

Opcode       Instruction                             Description
MVEX.512     VOP k2 {k1}, zmm2, S(zmm3/mt)           Operate vector zmm2 and vector S(zmm3/mt) and store the result in k2, under write-mask k1
MVEX.512     VOP k2 {k1}, zmm2, S(zmm3/mt), imm8     Operate vector zmm2 and vector S(zmm3/mt) and store the result in k2 using imm8, under write-mask k1

Description
Operand is a register

Bit          7      6      5      4      3      2      1      0
ESCAPE(62)   0      1      1      0      0      0      1      0
MVEX1        1      !Z24   !Z23   1      m3     m2     m1     m0
MVEX2        W      !Z13   !Z12   !Z11   !Z10   L=0    p1     p0
MVEX3        EH     S2     S1     S0     !Z14   K12    K11    K10
OPCODE       OPCODE (bits 7-0)
ModR/M       11 (bits 7-6), reg (K2) (bits 5-3), r (ZMM2) (bits 2-0)
{IMM8}       I7     I6     I5     I4     I3     I2     I1     I0

Operand is a memory location

Bit          7      6      5      4      3      2      1      0
ESCAPE(62)   0      1      1      0      0      0      1      0
MVEX1        1      !X     !B     1      m3     m2     m1     m0
MVEX2        W      !Z13   !Z12   !Z11   !Z10   L=0    p1     p0
MVEX3        EH     S2     S1     S0     !Z14   K12    K11    K10
OPCODE       OPCODE (bits 7-0)
ModR/M       mod (bits 7-6), reg (K2) (bits 5-3), m (mt) (bits 2-0)
{SIB}        SIB byte (bits 7-0)
{DISPL}      Displacement (8*N/32)
{IMM8}       I7     I6     I5     I4     I3     I2     I1     I0

Vector v3 - Template

Opcode       Instruction             Description
MVEX.512     VOP mt {k1}, D(zmm1)    Store vector D(zmm1) into mt, under write-mask k1

Description
Operand is a memory location

Bit          7      6      5      4      3      2      1      0
ESCAPE(62)   0      1      1      0      0      0      1      0
MVEX1        !Z13   !X     !B     !Z14   m3     m2     m1     m0
MVEX2        W      1      1      1      1      L=0    p1     p0
MVEX3        EH     S2     S1     S0     1      K12    K11    K10
OPCODE       OPCODE (bits 7-0)
ModR/M       mod (bits 7-6), reg (ZMM1) (bits 5-3), m (mt) (bits 2-0)
{SIB}        SIB byte (bits 7-0)
{DISPL}      Displacement (8*N/32)

Vector v4 - Template

Opcode       Instruction                      Description
MVEX.512     VOP zmm1 {k1}, zmm2/mt           Operate vector zmm2/mt and store the result in zmm1, under write-mask k1
MVEX.512     VOP zmm1 {k1}, zmm2/mt, imm8     Operate vector zmm2/mt and store the result in zmm1 using imm8, under write-mask k1

Description
Operand is a register

Bit          7      6      5      4      3      2      1      0
ESCAPE(62)   0      1      1      0      0      0      1      0
MVEX1        !Z13   !Z24   !Z23   !Z14   m3     m2     m1     m0
MVEX2        W      1      1      1      1      L=0    p1     p0
MVEX3        EH     0      0      0      1      K12    K11    K10
OPCODE       OPCODE (bits 7-0)
ModR/M       11 (bits 7-6), reg (ZMM1) (bits 5-3), r (ZMM2) (bits 2-0)
{IMM8}       I7     I6     I5     I4     I3     I2     I1     I0

Operand is a memory location

Bit          7      6      5      4      3      2      1      0
ESCAPE(62)   0      1      1      0      0      0      1      0
MVEX1        !Z13   !X     !B     !Z14   m3     m2     m1     m0
MVEX2        W      1      1      1      1      L=0    p1     p0
MVEX3        EH     0      0      0      1      K12    K11    K10
OPCODE       OPCODE (bits 7-0)
ModR/M       mod (bits 7-6), reg (ZMM1) (bits 5-3), m (mt) (bits 2-0)
{SIB}        SIB byte (bits 7-0)
{DISPL}      Displacement (8*N/32)
{IMM8}       I7     I6     I5     I4     I3     I2     I1     I0

Vector v5 - Template

Opcode       Instruction                         Description
MVEX.512     VOP zmm1 {k1}, S(zmm2/mt)           Operate vector S(zmm2/mt) and store the result in zmm1, under write-mask k1
MVEX.512     VOP zmm1 {k1}, S(zmm2/mt), imm8     Operate vector S(zmm2/mt) and store the result in zmm1 using imm8, under write-mask k1
MVEX.512     VOP zmm1 {k1}, S(zmm2/mt)           Move vector S(zmm2/mt) into zmm1, under write-mask k1

Description
Operand is a register

Bit          7      6      5      4      3      2      1      0
ESCAPE(62)   0      1      1      0      0      0      1      0
MVEX1        !Z13   !Z24   !Z23   !Z14   m3     m2     m1     m0
MVEX2        W      1      1      1      1      L=0    p1     p0
MVEX3        EH     S2     S1     S0     1      K12    K11    K10
OPCODE       OPCODE (bits 7-0)
ModR/M       11 (bits 7-6), reg (ZMM1) (bits 5-3), r (ZMM2) (bits 2-0)
{IMM8}       I7     I6     I5     I4     I3     I2     I1     I0

Operand is a memory location

Bit          7      6      5      4      3      2      1      0
ESCAPE(62)   0      1      1      0      0      0      1      0
MVEX1        !Z13   !X     !B     !Z14   m3     m2     m1     m0
MVEX2        W      1      1      1      1      L=0    p1     p0
MVEX3        EH     S2     S1     S0     1      K12    K11    K10
OPCODE       OPCODE (bits 7-0)
ModR/M       mod (bits 7-6), reg (ZMM1) (bits 5-3), m (mt) (bits 2-0)
{SIB}        SIB byte (bits 7-0)
{DISPL}      Displacement (8*N/32)
{IMM8}       I7     I6     I5     I4     I3     I2     I1     I0

Vector v6 - Template

Opcode       Instruction               Description
MVEX.512     VOP zmm1 {k1}, S(mvt)     Gather sparse vector S(mvt) into zmm1, using completion mask k1
MVEX.512     VOP mvt {k1}, D(zmm1)     Scatter vector D(zmm1) into sparse vector mvt, using completion mask k1

Description
Operand is a memory location

Bit          7      6      5      4      3      2      1      0
ESCAPE(62)   0      1      1      0      0      0      1      0
MVEX1        !Z13   !X3    !B3    !Z14   m3     m2     m1     m0
MVEX2        W      1      1      1      1      L=0    p1     p0
MVEX3        EH     S2     S1     S0     !X4    K12    K11    K10
OPCODE       OPCODE (bits 7-0)
ModR/M       mod (bits 7-6), reg (ZMM1) (bits 5-3), m = 100 (bits 2-0)
VSIB         SS1 SS0 (bits 7-6), Index(X) (bits 5-3), Base(B) (bits 2-0)
{DISPL}      Displacement (8*N/32)
{IMM8}       I7     I6     I5     I4     I3     I2     I1     I0

Vector v7 - Template

Opcode       Instruction                       Description
MVEX.512     VOP zmm1 {k1}, k2, S(zmm3/mt)     Operate mask k2 and vector S(zmm3/mt) [and vector zmm1], and store the result in zmm1, under write-mask k1

Description
Operand is a register

Bit          7      6      5      4      3      2      1      0
ESCAPE(62)   0      1      1      0      0      0      1      0
MVEX1        !Z13   !Z34   !Z33   !Z14   m3     m2     m1     m0
MVEX2        W      1      !K22   !K21   !K20   L=0    p1     p0
MVEX3        EH     S2     S1     S0     1      K12    K11    K10
OPCODE       OPCODE (bits 7-0)
ModR/M       11 (bits 7-6), reg (ZMM1) (bits 5-3), r (ZMM3) (bits 2-0)

Operand is a memory location

Bit          7      6      5      4      3      2      1      0
ESCAPE(62)   0      1      1      0      0      0      1      0
MVEX1        !Z13   !X     !B     !Z14   m3     m2     m1     m0
MVEX2        W      1      !K22   !K21   !K20   L=0    p1     p0
MVEX3        EH     S2     S1     S0     1      K12    K11    K10
OPCODE       OPCODE (bits 7-0)
ModR/M       mod (bits 7-6), reg (ZMM1) (bits 5-3), m (mt) (bits 2-0)
{SIB}        SIB byte (bits 7-0)
{DISPL}      Displacement (8*N/32)

Vector v8 - Template

Opcode      Instruction                           Description
MVEX.512    VOP zmm1 {k1}, zmm2, zmm3/mt          Operate vector zmm2 and vector zmm3/mt [and vector zmm1] and store the result in zmm1, under write-mask k1
MVEX.512    VOP zmm1 {k1}, zmm2, zmm3/mt, imm8    Operate vector zmm2 and vector zmm3/mt [and vector zmm1] and store the result in zmm1 using imm8, under write-mask k1

Description

Operand is a register

Bit           7      6      5      4      3      2      1      0
ESCAPE(62)    0      1      1      0      0      0      1      0
MVEX1         !Z13   !Z34   !Z33   !Z14   m3     m2     m1     m0
MVEX2         W      !Z23   !Z22   !Z21   !Z20   L=0    p1     p0
MVEX3         EH     0      0      0      !Z24   K12    K11    K10
OPCODE        OPCODE (bits 7-0)
ModR/M        11 (bits 7-6), reg (ZMM1) (bits 5-3), r (ZMM3) (bits 2-0)
{IMM8}        I7     I6     I5     I4     I3     I2     I1     I0
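As an illustration of the Vector v8 register-operand layout, the following sketch packs the ESCAPE, MVEX1, MVEX2, MVEX3, OPCODE and ModR/M bytes from register numbers. It rests on one reading of the field names: "!Z13" is taken to mean the inverted bit 3 of the zmm1 register number, "!Z24" the inverted bit 4 of zmm2, and "K12".."K10" the three bits of the k1 number. The function name encode_v8_reg is hypothetical, and the opcode, m3..m0 and p1..p0 values used in the example are placeholders rather than a specific instruction from this manual.

```python
def encode_v8_reg(opcode, zmm1, zmm2, zmm3, k1, w=0, eh=0, mmmm=1, pp=0):
    """Pack the six bytes of the Vector v8 register form (no imm8)."""
    for reg in (zmm1, zmm2, zmm3):
        if not 0 <= reg <= 31:
            raise ValueError("zmm register number must be 0..31")
    if not 0 <= k1 <= 7:
        raise ValueError("mask register number must be 0..7")

    def inv_bit(value, bit):
        # Complement of a single bit of a register number.
        return ((value >> bit) & 1) ^ 1

    # MVEX1: !Z1[3] !Z3[4] !Z3[3] !Z1[4] m3 m2 m1 m0
    mvex1 = ((inv_bit(zmm1, 3) << 7) | (inv_bit(zmm3, 4) << 6)
             | (inv_bit(zmm3, 3) << 5) | (inv_bit(zmm1, 4) << 4)
             | (mmmm & 0xF))
    # MVEX2: W !Z2[3] !Z2[2] !Z2[1] !Z2[0] L=0 p1 p0
    mvex2 = ((w & 1) << 7) | ((~zmm2 & 0xF) << 3) | (0 << 2) | (pp & 3)
    # MVEX3: EH 0 0 0 !Z2[4] K1[2] K1[1] K1[0]
    mvex3 = ((eh & 1) << 7) | (0b000 << 4) | (inv_bit(zmm2, 4) << 3) | k1
    # ModR/M: mod = 11, reg = low 3 bits of zmm1, r = low 3 bits of zmm3
    modrm = (0b11 << 6) | ((zmm1 & 7) << 3) | (zmm3 & 7)
    return bytes([0x62, mvex1, mvex2, mvex3, opcode & 0xFF, modrm])


if __name__ == "__main__":
    # Placeholder opcode 0x58 with zmm1=1, zmm2=2, zmm3=3, k1=1.
    print(encode_v8_reg(0x58, 1, 2, 3, 1, pp=1).hex())  # 62f1690958cb
```

Because the register bits are stored inverted, low register numbers produce 1 bits in the prefix; a form with an immediate would append the {IMM8} byte after ModR/M.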

Operand is a memory location

Bit           7      6      5      4      3      2      1      0
ESCAPE(62)    0      1      1      0      0      0      1      0
MVEX1         !Z13   !X     !B     !Z14   m3     m2     m1     m0
MVEX2         W      !Z23   !Z22   !Z21   !Z20   L=0    p1     p0
MVEX3         EH     0      0      0      !Z24   K12    K11    K10
OPCODE        OPCODE (bits 7-0)
ModR/M        mod (bits 7-6), reg (ZMM1) (bits 5-3), m (mt) (bits 2-0)
{SIB}         SIB byte
{DISP L}      Displacement (8*N/32)
{IMM8}        I7     I6     I5     I4     I3     I2     I1     I0

Vector v9 - Template

Opcode      Instruction         Description
MVEX.512    VOP S(mvt) {k1}     Prefetch sparse vector S(mvt), under writemask k1

Description

Operand is a memory location

Bit           7      6      5      4      3      2      1      0
ESCAPE(62)    0      1      1      0      0      0      1      0
MVEX1         1      !X3    !B3    1      m3     m2     m1     m0
MVEX2         W      1      1      1      1      L=0    p1     p0
MVEX3         EH     S2     S1     S0     !X4    K12    K11    K10
OPCODE        OPCODE (bits 7-0)
ModR/M        mod (bits 7-6), Op. Ext. (bits 5-3), m = 100 (bits 2-0)
VSIB          SS1 SS0 (bits 7-6), Index(X) (bits 5-3), Base(B) (bits 2-0)
{DISP L}      Displacement (8*N/32)
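The addressing bytes of the Vector v9 template can be sketched as follows. This is an illustration under stated assumptions, not a definitive encoder: "!X3" and "!X4" are read as the inverted bits 3 and 4 of the index vector register number, "!B3" as the inverted bit 3 of the base register number, and SS1 SS0 as a 2-bit scale field. The function name encode_v9_addressing and the dictionary keys are hypothetical.

```python
def encode_v9_addressing(mod, op_ext, scale_bits, index, base):
    """Return (modrm, vsib, prefix_bits) for the Vector v9 memory form."""
    if not 0 <= mod <= 3 or not 0 <= op_ext <= 7 or not 0 <= scale_bits <= 3:
        raise ValueError("mod/scale must be 0..3 and op_ext 0..7")
    if not 0 <= index <= 31 or not 0 <= base <= 15:
        raise ValueError("index must be 0..31 and base 0..15")

    def inv_bit(value, bit):
        # Complement of a single bit of a register number.
        return ((value >> bit) & 1) ^ 1

    # ModR/M: mod, opcode extension, and m fixed to 100 (VSIB byte follows).
    modrm = (mod << 6) | (op_ext << 3) | 0b100
    # VSIB: SS1 SS0, low 3 bits of the index, low 3 bits of the base.
    vsib = (scale_bits << 6) | ((index & 7) << 3) | (base & 7)
    # High register bits travel, inverted, in MVEX1 (!X3, !B3) and MVEX3 (!X4).
    prefix_bits = {
        "!X3": inv_bit(index, 3),
        "!X4": inv_bit(index, 4),
        "!B3": inv_bit(base, 3),
    }
    return modrm, vsib, prefix_bits


if __name__ == "__main__":
    modrm, vsib, bits = encode_v9_addressing(0b01, 2, 0b10, 17, 9)
    print(hex(modrm), hex(vsib), bits)
```

The split mirrors the template: only the low three bits of each register number fit in the VSIB byte, so the remaining bits are carried by the prefix.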


F.3 Scalar Operation Templates

Scalar s0 - Template

Opcode                 Instruction        Description
0F/0F38/0F3A           OP r16, r16/m16    Operate [r16 and] r16/m16, leaving the result in r16
0F/0F38/0F3A           OP r32, r32/m32    Operate [r32 and] r32/m32, leaving the result in r32
REX.W 0F/0F38/0F3A     OP r64, r64/m64    Operate [r64 and] r64/m64, leaving the result in r64

Description

Operand is a register

C4 Version

Bit           7       6      5       4      3      2      1      0
ESCAPE(C4)    1       1      0       0      0      1      0      0
VEX1          !dst3   1      !src3   m4     m3     m2     m1     m0
VEX2          W       1      1       1      1      L=0    p1     p0
OPCODE        OPCODE (bits 7-0)
ModR/M        11 (bits 7-6), reg (dst) (bits 5-3), r (src) (bits 2-0)

C5 Version

Bit           7       6      5       4      3      2      1      0
ESCAPE(C5)    1       1      0       0      0      1      0      1
VEX2          !dst3   1      1       1      1      L=0    p1     p0
OPCODE        OPCODE (bits 7-0)
ModR/M        11 (bits 7-6), reg (dst) (bits 5-3), r (src) (bits 2-0)
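The two register-operand layouts of the Scalar s0 template can be compared with a small sketch. It assumes that "!dst3" and "!src3" are the inverted bit 3 of the destination and source register numbers, and that the usual VEX rule applies: the C5 version has no W, !src3 or m4..m0 fields, so it is usable only when W = 0, the source register is below 8 and the opcode map is 00001. The function name encode_s0_reg and the prefer_c5 flag are hypothetical, and the opcode value in the example is a placeholder.

```python
def encode_s0_reg(opcode, dst, src, w=0, mmmmm=1, pp=0, prefer_c5=True):
    """Pack the Scalar s0 register form as C5 (4 bytes) or C4 (5 bytes)."""
    if not 0 <= dst <= 15 or not 0 <= src <= 15:
        raise ValueError("register numbers must be 0..15")

    def inv_bit(value, bit):
        # Complement of a single bit of a register number.
        return ((value >> bit) & 1) ^ 1

    # ModR/M: mod = 11, reg = low 3 bits of dst, r = low 3 bits of src.
    modrm = (0b11 << 6) | ((dst & 7) << 3) | (src & 7)

    if prefer_c5 and src < 8 and w == 0 and mmmmm == 1:
        # C5 version, VEX2: !dst3 1 1 1 1 L=0 p1 p0
        vex2 = (inv_bit(dst, 3) << 7) | (0b1111 << 3) | (0 << 2) | (pp & 3)
        return bytes([0xC5, vex2, opcode & 0xFF, modrm])

    # C4 version, VEX1: !dst3 1 !src3 m4 m3 m2 m1 m0
    vex1 = ((inv_bit(dst, 3) << 7) | (1 << 6) | (inv_bit(src, 3) << 5)
            | (mmmmm & 0x1F))
    # C4 version, VEX2: W 1 1 1 1 L=0 p1 p0
    vex2 = ((w & 1) << 7) | (0b1111 << 3) | (0 << 2) | (pp & 3)
    return bytes([0xC4, vex1, vex2, opcode & 0xFF, modrm])


if __name__ == "__main__":
    print(encode_s0_reg(0xB8, 1, 2).hex())                   # c5f8b8ca
    print(encode_s0_reg(0xB8, 1, 2, prefer_c5=False).hex())  # c4e178b8ca
```

Both calls describe the same operands; the C5 version saves one byte when none of the fields that exist only in the C4 version are needed.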

Scalar s1 - Template

Opcode      Instruction    Description
VEX.128     OP mt          Prefetch/Evict mt memory location

Description

Operand is a memory location

C4 Version

Bit           7      6      5      4      3      2      1      0
ESCAPE(C4)    1      1      0      0      0      1      0      0
VEX1          1      !X     !B     m4     m3     m2     m1     m0
VEX2          W      1      1      1      1      L=0    p1     p0
OPCODE        OPCODE (bits 7-0)
ModR/M        mod (bits 7-6), Op. Ext. (bits 5-3), m (mt) (bits 2-0)
{SIB}         SIB byte
{DISP L}      Displacement (8/32)

C5 Version

Bit           7      6      5      4      3      2      1      0
ESCAPE(C5)    1      1      0      0      0      1      0      1
VEX2          1      1      1      1      1      L=0    p1     p0
OPCODE        OPCODE (bits 7-0)
ModR/M        mod (bits 7-6), Op. Ext. (bits 5-3), m (mt) (bits 2-0)
{SIB}         SIB byte
{DISP L}      Displacement (8/32)